36-350 Data Mining (Fall 2008)
Introduction to the Course
(as prepared and roughly as given)
What Is "Data Mining"?
Extracting useful predictive patterns from large collections of data
a.k.a. "Knowledge discovery in databases"
Examples:
- Information retrieval: search engines most prominently
- Recommendation systems: Firefly (of happy memory), Amazon, LibraryThing
- Credit: FICO, automated mortgage underwriting; fraud detection
- Finance: statistical arbitrage, LTCM
- Marketing: identifying demographic sub-groups, targeted advertising and promotions; rewards programs
- Biology: gene identification, disease identification
- Insurance/HMOs: how much to charge whom, how much to pay
Why now?
Precursors/impulses go back a long time
- "We have always been an information society": control revolution of the 19th century
- Industrial revolution: all this stuff, and people, to keep track of
- Technologies of keeping-track: forms, standards, job descriptions/requirements, schedules, exams, inspections, categories, reports, files, "your permanent record"
- Machine-readable and -processable data: Hollerith machines (from automatic looms), leading to IBM and the rest of the pre-computer information-processing industry
- statistics: knowing/finding resources, finding patterns, making plans
Limited by cost: collecting, storing, examining data all expensive
- especially when it must be done by hand
- people are slow
- people are expensive (time, training)
- people don't scale (can't just copy programs)
- and when data have to be specially made rather than a by-product of normal activity
Computers drastically lower the cost of collecting, storing, accessing and examining data
- think of drawing plots if nothing else!
- plus you record transactions on the computer anyway
Data mining is about automating parts of the analysis process
- look for patterns (what kind of pattern? look how?)
- preferably interesting ones (interesting to whom? how do you tell?)
- and check that they're not just flukes (for example...)
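One concrete version of the fluke check is a permutation test: destroy the supposed structure by shuffling, and see how often an equally strong "pattern" shows up anyway. A minimal sketch on simulated data (the data and function names here are illustrative, not from the notes):

```python
import random

def perm_pvalue(a, b, n_perm=2000, seed=0):
    """How often does shuffling group labels produce a mean difference
    at least as large as the observed one?  Small p => probably not a fluke."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # add-one so p is never exactly zero

rng = random.Random(42)
null_a = [rng.gauss(0, 1) for _ in range(50)]    # two samples, same source
null_b = [rng.gauss(0, 1) for _ in range(50)]
real_b = [rng.gauss(1.5, 1) for _ in range(50)]  # genuinely shifted sample

p_fluke = perm_pvalue(null_a, null_b)  # apparent pattern that IS a fluke
p_real = perm_pvalue(null_a, real_b)   # pattern that is not
print(f"same-source p = {p_fluke:.3f}, shifted p = {p_real:.4f}")
```

The same-source comparison yields a large p-value (the "pattern" recurs under shuffling), while the shifted sample yields a tiny one.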
Clinical vs. actuarial judgment as proof-of-concept
- psychiatrists are worse at predicting patient outcomes than simple decision rules
- ... but it turns out no profession is better than simple rules (though some are as good)
- what to do when there are no good professionals
Sources and Methods
Exploratory data analysis, descriptive statistics, visualization
Inferential statistics, especially non-parametric methods
- Expensive analyses meant it was worth thinking very hard about your models first
- but also encouraged totally unrealistic simplifying assumptions, especially linear dependence and Gaussian distributions
- we don't have to make those assumptions (so much) any more
Machine learning
Optimization
Databases
We are going to skimp on the last two
- Extremely important
- Huge issues arise with really big data
- with 2 million customers there are about 2 trillion customer pairs; at 1 microsecond per pair, an exhaustive search for the closest match takes 23 days
- but we can't cover everything and this is a statistics class, not computer science
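The back-of-the-envelope arithmetic behind that "23 days" figure can be checked directly (the 1-microsecond-per-pair cost is the assumption stated above):

```python
# Brute-force nearest-neighbor search over all customer pairs,
# at an assumed cost of 1 microsecond per pair comparison.
n = 2_000_000
pairs = n * (n - 1) // 2        # unordered pairs: about 2 trillion
seconds = pairs * 1e-6          # 1 microsecond per pair
days = seconds / 86_400
print(f"{pairs:,} pairs, about {days:.1f} days")
```

This quadratic blow-up is exactly why sub-quadratic algorithms and database techniques matter at scale, even though we skimp on them here.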
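One way to see the earlier point about unrealistic linearity assumptions: Pearson's correlation (a linear measure) can badly understate a perfect but nonlinear dependence that a rank-based, non-parametric measure captures. A minimal sketch with simulated data (all names and numbers here are illustrative):

```python
import math
import random

def pearson(xs, ys):
    """Ordinary linear (Pearson) correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(v):
    """Rank of each value (no tie handling; fine for continuous data)."""
    order = sorted(range(len(v)), key=v.__getitem__)
    r = [0] * len(v)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Rank (Spearman) correlation: Pearson applied to ranks."""
    return pearson(ranks(xs), ranks(ys))

rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(1000)]
ys = [math.exp(3 * x) for x in xs]  # perfectly monotone in x, wildly nonlinear

r_lin = pearson(xs, ys)
r_rank = spearman(xs, ys)
print(f"Pearson: {r_lin:.3f}  Spearman: {r_rank:.3f}")
```

The rank correlation is 1 (the relationship is exactly monotone), while the linear correlation is far smaller, dominated by a few extreme values.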
Some Themes
Choice of representation/abstraction is important
Choices within method are important
Results depend sensitively on such choices
Choices have to be justified as helping you meet specific goals
The importance of not fooling yourself and/or programming the machine to fool you
Waste, Fraud, and Abuse
Any new technology produces con-artists, quacks, and excess ambition
Will try to point out some ways data mining can go wrong
- situations where it won't work
- situations where people make impossible claims for it
- things it shouldn't be used for, period
Institutional context in which you mine data
- Serious data collection happens within big organizations, and data rarely leaves them
- logistics
- privacy
- competitive advantage
- Keeping track of what the organization is trying to do (e.g., "make arrests" vs. "reduce crime")
- Deciding whether you want any part of what is being attempted (e.g., many businesses would like to identify gullible customers)