36-350 Data Mining (Fall 2009)

Introduction to the Course

(as prepared and roughly as given)

What Is "Data Mining"?

Extracting useful predictive patterns from large collections of data

a.k.a. "Knowledge discovery in databases"

Examples:

Why now?

Precursors/impulses go back a long time

Limited by cost: collecting, storing, examining data all expensive

Computers drastically lower the cost of collecting, storing, accessing and examining data

Data mining is about automating parts of the analysis process

Clinical vs. actuarial judgment as proof-of-concept

Sources and Methods

Exploratory data analysis, descriptive statistics, visualization

Inferential statistics, especially non-parametric methods

Machine learning: blurs into inferential statistics

Optimization

Databases

We are going to skimp on the last two

Some Themes

Choice of representation/abstraction is important

Choices within method are important

Methods and representations are interdependent

Choices have to be justified as helping you meet specific goals; beware of optimality criteria!

The importance of not fooling yourself and/or programming the machine to fool you: using predictions and perturbations

Technical theme: bias/variance or accuracy/precision trade-off

Technical theme: adaptability is a partial substitute for knowledge

Technical theme: successive approximation/iterative algorithms
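
A minimal sketch of the successive-approximation theme (an illustration added here, not an example from the lecture): Newton's iteration for a square root, where each pass refines the previous guess until it stops changing. The function name `newton_sqrt`, the relative tolerance and the iteration cap are all assumptions made for the example.

```python
def newton_sqrt(a, tol=1e-12, max_iter=100):
    """Approximate sqrt(a) for a >= 0 by repeatedly refining a guess.

    Returns (estimate, number_of_updates_performed).
    """
    if a < 0:
        raise ValueError("a must be non-negative")
    if a == 0:
        return 0.0, 0
    # Any positive starting guess converges; this one is just convenient.
    x = a if a >= 1 else 1.0
    for steps in range(1, max_iter + 1):
        # Newton update for f(x) = x**2 - a: average x with a / x.
        x_new = 0.5 * (x + a / x)
        # Stop once the relative change between guesses is negligible.
        if abs(x_new - x) <= tol * x_new:
            return x_new, steps
        x = x_new
    # Iteration cap reached: return the best guess so far.
    return x, max_iter


root, steps = newton_sqrt(2.0)
print(root, steps)
```

The same shape (start with a rough answer, apply an update rule, stop when the change is small) recurs in many iterative fitting procedures; only the update rule and the stopping test change.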

Waste, Fraud and Abuse

Any new technology produces con-artists, quacks, and excess ambition

Will try to point out some ways data mining can go wrong

Institutional context in which you mine data