OkCupid: 60,000 online dating profiles
Kim, Escobedo-Land (2015). Journal of Statistics Education, Vol 23, Number 2.
Paper
Profiles (csv file, 150 MB)
Codebook/Explanation of Variables
R code to get started
Continuous Variables and their Distributions:
Code
Relationships and Joint Distributions for Two Continuous Variables:
Code
mammogram: information about 816 patients with mammographic masses (tumors in breast tissue). The goal of the original analysis was to develop classification models for whether or not a mass is malignant without invasive biopsies.
M. Elter, R. Schulz-Wendtland, and T. Wittenberg (2007).
The prediction of breast cancer biopsy outcomes using two CAD approaches that both emphasize an intelligible decision process, Medical Physics 34(11), p. 4164-4172.
Data
Hipparcos Stars: information collected by the Hipparcos satellite on 2719 stars; variables include the visual band magnitude (Vmag), the parallactic angle (Plx), and the color of the star.
Note that log luminosity (relative to the sun) is calculated as logL = (15 - Vmag - 5 log Plx) / 2.5
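As a quick check on the formula above, here is a minimal Python sketch. It assumes the logarithm is base 10 (the usual convention for magnitudes; the page does not say so explicitly) and that Plx is positive. The function name and the example values are hypothetical and are not taken from the data set.

```python
import math


def log_luminosity(vmag, plx):
    """Return logL = (15 - Vmag - 5 * log10(Plx)) / 2.5.

    Assumption: "log" in the formula means log base 10.
    """
    if plx <= 0:
        # log10 is undefined for non-positive values
        raise ValueError("Plx must be positive")
    return (15.0 - vmag - 5.0 * math.log10(plx)) / 2.5


# Hypothetical star (values made up for illustration): Vmag = 5.0, Plx = 10.0
example = log_luminosity(5.0, 10.0)
print(example)
```

Rows with zero or negative Plx values, if the file has any, would need to be dropped or handled before applying the formula.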
Source: Penn State Center for Astrostatistics
Freeman, Richards, Schafer, Lee (2008).
Astrostatistics: The Final Frontier, Chance, Vol 21, No. 3, p.31-35.
Paper
Data
Variable List
Hertzsprung-Russell Diagram
Info
Code used in class
2010 World Cup: statistics for all players and all countries for the 2010 World Cup provided by Opta (world-leading live sports data company) and the Guardian Data Blog
Data  
Variable List
CMU Class data: demonstration data generated by a group of Carnegie Mellon undergraduates in a regression course. The goal was to predict the average amount of time spent watching TV/movies per day. This is a convenience sample from a very imprecise, unvalidated survey; it should only be used for demos, not for research.
Data  
Original Survey  
Example Code
Banknote Authentication: famous classifier dataset. Images were taken of 1372 banknotes, some counterfeit and some genuine. Wavelet transformation tools were used to extract the following descriptive features of the images:
Variance,
Skewness,
Kurtosis,
Entropy.
We also have the true label for whether or not a banknote is genuine (Yes = 1, No = 0).
Data
Generating Curved Data: code to generate randomly scattered data (with an appropriate, selected error) around splines that the user creates by pointing and clicking
curvelib.R      
generatesplinedata.R
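To illustrate the idea of scattering data around a curve, here is a minimal Python sketch. It is not the code in curvelib.R or generatesplinedata.R. It swaps in a uniform Catmull-Rom spline, hard-codes the control points instead of collecting them by point-and-click, and assumes Gaussian error with a chosen standard deviation. All function names, parameters, and values are hypothetical.

```python
import random


def catmull_rom(p0, p1, p2, p3, t):
    """Evaluate a uniform Catmull-Rom segment between p1 and p2 at t in [0, 1]."""
    return tuple(
        0.5 * (2 * b
               + (c - a) * t
               + (2 * a - 5 * b + 4 * c - d) * t ** 2
               + (-a + 3 * b - 3 * c + d) * t ** 3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def noisy_spline_points(control, n_per_segment=20, sd=0.1, seed=0):
    """Sample points along the spline through `control` and add Gaussian noise."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    # Repeat the end points so the curve passes through every control point.
    padded = [control[0]] + list(control) + [control[-1]]
    points = []
    for i in range(len(control) - 1):
        p0, p1, p2, p3 = padded[i:i + 4]
        for _ in range(n_per_segment):
            x, y = catmull_rom(p0, p1, p2, p3, rng.random())
            points.append((x + rng.gauss(0, sd), y + rng.gauss(0, sd)))
    return points


# Hypothetical control points standing in for the user's clicks.
control_points = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.5), (3.0, 3.0)]
data = noisy_spline_points(control_points)
print(len(data), data[:3])
```

Raising `sd` spreads the points further from the curve; setting it to 0 returns points that lie exactly on the spline.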
Olive Oil Data: a famous olive oil data set used in clustering and classification. We have eight different chemical composition measurements on 572 Italian olive oils. Additionally, there are two sets of labels, the Region and the Area. There are three different Regions (Southern Italy, Sardinia, and Northern Italy). Each Region comprises a group of Areas: Southern Italy = North Apulia, Calabria, South Apulia, and Sicily; Sardinia = Inland Sardinia, Coastal Sardinia; Northern Italy = Umbria, East Liguria, and West Liguria.
Data
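The Region/Area hierarchy of the olive oil data can be written down as a small lookup table. This is a sketch only: the variable names are hypothetical, and the spellings or numeric codes actually used in the data file may differ from the names in the description.

```python
# Region -> Areas, as listed in the description of the olive oil data.
REGION_AREAS = {
    "Southern Italy": ["North Apulia", "Calabria", "South Apulia", "Sicily"],
    "Sardinia": ["Inland Sardinia", "Coastal Sardinia"],
    "Northern Italy": ["Umbria", "East Liguria", "West Liguria"],
}

# Inverted lookup: Area -> Region.
AREA_TO_REGION = {
    area: region
    for region, areas in REGION_AREAS.items()
    for area in areas
}

print(AREA_TO_REGION["Calabria"])
```

A table like this is handy for checking that every Area label falls under exactly one Region before comparing clusterings at the two levels.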
"Aggregation": example clustering data set from the University of Eastern Finland Computer Science Department (Gionis et al., 2007)
788 2-D observations, seven groups (labels in the third column)
Data
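A file laid out as described (two coordinates, then a group label in the third column) could be read along these lines. This is a sketch that assumes whitespace-separated columns; the page does not state the delimiter. The function name is hypothetical, and the sample rows are made up for illustration rather than taken from the data set.

```python
from collections import defaultdict


def parse_labeled_points(text):
    """Group (x, y) points by the label found in the third column."""
    groups = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()  # assumption: whitespace-delimited columns
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        x, y = float(fields[0]), float(fields[1])
        label = int(float(fields[2]))
        groups[label].append((x, y))
    return dict(groups)


# Made-up rows in the assumed layout: x, y, label.
sample = """1.0 2.0 1
1.5 2.5 1
8.0 9.0 2
"""
groups = parse_labeled_points(sample)
print({label: len(points) for label, points in groups.items()})
```

Applied to the full file, the counts per label should sum to 788 across seven groups if the layout assumption holds.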