Linear Classifiers and Logistic Regression

36-462/36-662, Spring 2020

4 February 2020

\[ \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \]

Context

We’ve seen a bunch of different ways of doing classification
- kNN: Vote among the nearest neighbors
- Trees: Find combinations of features associated with (relatively) pure distributions
- Bagging and forests: Vote among trees
Today:
- Start thinking about decision boundaries
- Prototype method
- Other linear classifiers
- Revisit logistic regression as a classifier

The prototypical case for the prototype method

The prototype method

Find the prototype feature vector for each class
- Prototypes are usually the means: for each class \(c\), with \(n_c\) data points in the training data, \[ \vec{m}_c = \frac{1}{n_c}\sum_{i: y_i = c}{\vec{x}_i} \]
- Sometimes use medians, trimmed means, etc.
Classify a new point, \(\vec{x}_0\), by seeing which class has the closest prototype to \(\vec{x}_0\): \[ \hat{y}(\vec{x}_0) = \operatorname*{argmin}_{c}{\| \vec{x}_0 - \vec{m}_c\|} \]
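A minimal sketch of the prototype method in R, assuming the prototypes are class means and distance is Euclidean. The function names `prototype.fit` and `prototype.predict`, and the toy data, are made up for illustration and are not from the lecture.

```r
# Hypothetical helpers (not from the lecture).
# x: numeric matrix, one row per data point; y: vector of class labels
prototype.fit = function(x, y) {
  classes = sort(unique(y))
  # One row per class, holding that class's mean feature vector
  centers = do.call(rbind, lapply(classes, function(cl) {
    colMeans(x[y == cl, , drop = FALSE])
  }))
  list(classes = classes, centers = centers)
}

prototype.predict = function(fit, x.new) {
  # Squared Euclidean distance from every new point to every class center
  d2 = sapply(seq_along(fit$classes), function(k) {
    rowSums(sweep(x.new, 2, fit$centers[k, ])^2)
  })
  d2 = matrix(d2, nrow = nrow(x.new))
  # Pick the class whose center is closest
  fit$classes[apply(d2, 1, which.min)]
}

# Toy data: two well-separated clumps in two dimensions
set.seed(462)
x = rbind(matrix(rnorm(100, mean = -2), ncol = 2),
          matrix(rnorm(100, mean = +2), ncol = 2))
y = rep(c(0, 1), each = 50)
fit = prototype.fit(x, y)
y.hat = prototype.predict(fit, x)
mean(y.hat == y)  # in-sample accuracy
```

Squared distances are used because they give the same argmin as distances and skip the square root.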

The prototype method in action

The prototype method…

Is like 1-nearest neighbors, but where we replace the actual training data points with the class centers
In a case like this, the performance of the prototype method will be very similar to that of kNN, but instead of having to remember \(n\) training data points, we just need to remember the 2 class centers
The prototype method will tend to work well when each class forms a clump that’s widely separated from the other classes’ clumps

The boundary

Where is the boundary where our classification switches over?
- Clearly, it’s the place where the distance to the two class centers is equal.
We remember how to find that from high school geometry:
- Draw the line segment connecting the two centers
- Then draw the perpendicular bisector of that line segment

The boundary, with algebra

\[\begin{eqnarray*} \| \vec{x}_0 - \vec{m}_1 \| & = & \|\vec{x}_0 - \vec{m}_0\| \\ \| \vec{x}_0 - \vec{m}_1 \|^2 & = & \|\vec{x}_0 - \vec{m}_0\|^2 \\ \|\vec{x}_0\|^2 - 2\vec{x}_0\cdot\vec{m}_1 + \|\vec{m}_1\|^2 & = & \|\vec{x}_0\|^2 - 2\vec{x}_0\cdot\vec{m}_0 + \|\vec{m}_0\|^2 \\ \|\vec{m}_1\|^2 - \|\vec{m}_0\|^2 & = & \vec{x}_0 \cdot 2(\vec{m}_1 - \vec{m}_0)\\ \end{eqnarray*}\]

So the prototype method is equivalent to the rule \[ \hat{y}(\vec{x}_0) = \left\{\begin{array}{cc} 1 & \mathrm{if} ~ \left(\|\vec{m}_0\|^2 - \|\vec{m}_1\|^2\right) + \vec{x}_0 \cdot 2(\vec{m}_1 - \vec{m}_0) \geq 0\\ 0 & \mathrm{otherwise} \end{array}\right. \]
Notice we’ve got one equation for the boundary, with \(p\) unknown coordinates in \(\vec{x}_0\), so we get a \((p-1)\)-dimensional set of solutions (a line if \(p=2\), a plane if \(p=3\), etc.)
- The boundary is going to be perpendicular to the difference vector \(\vec{m}_1 - \vec{m}_0\) (why?)
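A quick numerical check of the algebra, as a sketch: the linear rule built from the two centers should agree with "pick the nearer center" at every point. The centers `m0`, `m1` and the test points are arbitrary values chosen for illustration.

```r
# Hypothetical class centers (illustrative values only)
m0 = c(-2, -2)
m1 = c(2, 1)

# Offset and coefficient vector read off from the derivation
offset = sum(m0^2) - sum(m1^2)   # ||m0||^2 - ||m1||^2
coefficients = 2 * (m1 - m0)     # 2 (m1 - m0)

# A few test points, one per row
x0 = rbind(c(0, 0), c(-3, 1), c(1, -1), c(-1, -1))

# Rule 1: the linear form
linear.rule = as.numeric(offset + x0 %*% coefficients >= 0)

# Rule 2: which center is nearer?
d0 = sqrt(rowSums(sweep(x0, 2, m0)^2))
d1 = sqrt(rowSums(sweep(x0, 2, m1)^2))
nearest.rule = as.numeric(d1 <= d0)

identical(linear.rule, nearest.rule)

# The midpoint between the centers sits exactly on the boundary
midpoint = (m0 + m1) / 2
offset + sum(midpoint * coefficients)
```

The last line comes out to zero because the boundary is the perpendicular bisector of the segment joining the centers.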

Linear classifiers

A linear classifier takes the form \[ \hat{y}(\vec{x}_0) = \Indicator{\beta_0 + \vec{\beta} \cdot \vec{x}_0 \geq 0} \]
- \(\vec{\beta}\) is perpendicular to the decision boundary (the normal vector of the decision boundary), and the offset says how far the decision boundary is from going through the origin
- (Some people instead write \(\Indicator{b + \vec{w} \cdot \vec{x}_0 \geq 0}\), etc.)
Every prototype classifier is a linear classifier, but not vice versa
- We just saw how to extract the offset and the coefficient vector from the location of the centers
The prototype method doesn’t work well when the two classes inter-penetrate or overlap
- Over-lap pulls the two class centers together
- We can still try to find a good linear classifier
Multiple parameter settings can give the same linear classifier
- \((\beta_0, \vec{\beta})\) works just the same as \((a\beta_0, a\vec{\beta})\) for any \(a > 0\)
- Often (but not always) we standardize so \(\|\vec{\beta}\| = 1\)
Example code:

linear.classifier = function(x, coefficients, offset) {
  # x is a matrix with one row per point
  # The following is actually (a multiple of) the directed distance;
  # it is exactly the directed distance when the coefficients have norm 1
  directed.distance.from.plane = function(z) { offset + sum(z * coefficients) }
  directed.distances = apply(x, 1, directed.distance.from.plane)
  return(ifelse(directed.distances >= 0, 1, 0))
}

Margin

Once we have a classifier with a decision boundary, the margin of point \(\vec{x}_i\) is the distance of \(\vec{x}_i\) from the boundary
- We count the distance positively if \(\vec{x}_i\) is correctly classified, and negatively if \(\vec{x}_i\) is mis-classified
In symbols, \[ \gamma_i(\beta_0, \vec{\beta}) = (2 y_i - 1)\left(\frac{\beta_0}{\|\vec{\beta}\|} + \vec{x}_i \cdot \frac{\vec{\beta}}{\|\vec{\beta}\|}\right) \]
- Stuff inside the parentheses \(=\) the directed distance of \(\vec{x}_i\) from the boundary (positive if \(\vec{x}_i\) is on the positive side)
- \((2 y_i - 1)\) is \(+1\) if \(y_i=1\) and \(-1\) if \(y_i=0\)
- So \(\gamma_i\) is, as promised, positive for correctly-classified points and negative for mis-classified points
The margin of the classifier is the smallest margin of any point: \[ \gamma(\beta_0, \vec{\beta}) = \min_{i \in 1:n}{\gamma_i(\beta_0, \vec{\beta})} \]
\(\gamma > 0\) if and only if all the points are correctly classified
Notice that \(\gamma_i\), and so \(\gamma\), is continuous in the parameters
- The number of mis-classifications is dis-continuous…
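The margin formulas translate almost directly into R. This is a sketch; the function name `margins` and the four data points are invented for illustration, and labels are assumed to be coded 0/1 as above.

```r
# Margin of each point for a linear classifier (hypothetical helper).
# x: matrix with one row per point; y: 0/1 labels
margins = function(x, y, coefficients, offset) {
  # Directed distance from the boundary: divide through by ||coefficients||
  directed = (offset + as.vector(x %*% coefficients)) / sqrt(sum(coefficients^2))
  # Flip the sign for class-0 points, so correct classifications count positively
  (2 * y - 1) * directed
}

x = rbind(c(1, 1), c(2, 0), c(-1, -1), c(0.5, 0))
y = c(1, 1, 0, 0)
gamma.i = margins(x, y, coefficients = c(1, 1), offset = 0)
gamma.i
gamma = min(gamma.i)   # margin of the classifier
gamma
```

Here the last point is on the positive side of the boundary but has label 0, so its margin is negative, and the classifier's overall margin is negative too.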

Estimating a linear classifier

Prototypes
Maximize the classification accuracy
- Combinatorial optimization is hard
- Usually many linear boundaries with equal accuracy
Maximize the margin
- Continuous optimization is much nicer
- Prefers saner-looking boundaries
- Can control out-of-sample error rates in terms of in-sample margin
“Perceptron” (1956): go over the data points one at a time
- If the current boundary classifies the current point correctly, change nothing and go on to the next point
- Otherwise, move the boundary in the direction of accommodating the current point, and go on to the next
- Repeat until nothing changes
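A sketch of the perceptron loop in R. The slide does not spell out how the boundary gets moved, so the update used here (add the signed feature vector to the coefficients, and the sign to the offset) is the classic textbook rule and should be read as an assumption. The function name `perceptron` and the cap `max.passes` are likewise illustrative.

```r
# x: matrix with one row per point; y: 0/1 labels
perceptron = function(x, y, max.passes = 100) {
  s = 2 * y - 1                      # recode 0/1 labels as -1/+1
  beta = rep(0, ncol(x))
  beta0 = 0
  for (pass in 1:max.passes) {
    changed = FALSE
    for (i in 1:nrow(x)) {
      # Treat points exactly on the boundary as mistakes, so the rule gets started
      if (s[i] * (beta0 + sum(beta * x[i, ])) <= 0) {
        beta = beta + s[i] * x[i, ]  # tilt the boundary towards accommodating point i
        beta0 = beta0 + s[i]
        changed = TRUE
      }
    }
    if (!changed) break              # a full pass with no updates: stop
  }
  list(offset = beta0, coefficients = beta, converged = !changed)
}

# Small linearly separable example
x = rbind(c(2, 2), c(3, 1), c(1, 3), c(-2, -2), c(-3, -1), c(-1, -3))
y = c(1, 1, 1, 0, 0, 0)
fit = perceptron(x, y)
fit
y.hat = as.numeric(fit$offset + x %*% fit$coefficients >= 0)
```

The cap on the number of passes matters: if the classes are not linearly separable, the loop would otherwise never stop changing.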

Working probabilities back in

If a point is very close to the boundary, we shouldn’t have much confidence in the classification
If a point is far away from the boundary, we should have more confidence
- Unless maybe it’s also far away from the training data?
Let’s try to connect probabilities to classifications

First try: linear probability models

Since \(Y=0\) or \(Y=1\), \(\Expect{Y|\vec{X}=\vec{x}} = \Prob{Y=1|\vec{X}=\vec{x}}\), so just use linear regression; the decision boundary will be where the predicted response is \(0.5\)
Pro: requires no knowledge beyond how to type lm
Con: cheerfully predicts probabilities \(> 1\) or \(< 0\)
- It is embarrassing when you say “people in such-and-such county had a 200% probability of voting for the Republican Party” (or it should be)
Don’t do this
- Unless you are very, very sure that your features are always going to be in the range where the predicted probability is sensible
- And you are very, very sure that your later users are never going to extrapolate outside that range
- And you are comfortable using a linear regression here (because why, exactly?)
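To see the problem concretely, here is a small simulated sketch (the data-generating process and the seed are assumptions made up for this illustration): fit `lm` to binary responses, then predict outside the range of the training features.

```r
set.seed(36462)
x = runif(200, min = -3, max = 3)
# True probabilities follow a logistic curve, so they always stay inside (0, 1)
y = rbinom(200, size = 1, prob = plogis(2 * x))

lpm = lm(y ~ x)   # the "linear probability model"
pred = predict(lpm, newdata = data.frame(x = c(-10, 0, 10)))
pred              # "probabilities" at x = -10, 0, 10
```

The predictions at the two extreme values of `x` land below 0 and above 1.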

Second try: find a transformation of the probability that’s linear

We care about \(p(\vec{x}) \equiv \Prob{Y=1|\vec{X}=\vec{x}}\)
Find a transformation of \(p\) that’s linear in \(\vec{x}\)
- \(\log{p}\) won’t work (why not?)
There are many functions which map \((0,1)\) to \((-\infty, +\infty)\) and are continuous and invertible
- Inverse of the standard Gaussian CDF (“probit”)
- Inverse of any CDF for a distribution with unbounded range

Think about the likelihood

We observe data \((\vec{x}_1, y_1), (\vec{x}_2, y_2), \ldots (\vec{x}_n, y_n)\)
Assuming independent responses (given features), the likelihood will be \[ \prod_{i=1}^{n}{p(\vec{x}_i)^{y_i} (1-p(\vec{x}_i))^{1-y_i}} \]
- This is like a Bernoulli or binomial, but now each trial gets its own success probability that’s a function of the features
Re-arrange terms: \[ \prod_{i=1}^{n}{(1-p(\vec{x}_i)) {\left( \frac{p(\vec{x}_i)}{1-p(\vec{x}_i)}\right)} ^{y_i}} \]
- \(\frac{p}{1-p}\) is the odds ratio; the likelihood only depends on the outcomes (\(y_i\)) through the odds ratios
Take the log: \[ L = \sum_{i=1}^{n}{\log{(1-p(\vec{x}_i))} + y_i \log{\left( \frac{p(\vec{x}_i)}{1-p(\vec{x}_i)}\right)}} \]
- The log-likelihood only depends on the outcomes \(y_i\) through the log odds ratio \(\log{\frac{p}{1-p}}\)
- The log odds ratio maps \((0,1)\) to \((-\infty,\infty)\) continuously and invertibly
- Any model we use is going to fundamentally involve the log odds ratio, so why don’t we make that linear in the features?
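A small check that the re-arrangement changes nothing, as a sketch: compute the log-likelihood in the original Bernoulli form and in the log-odds form, for some made-up probabilities and outcomes. The function names are hypothetical.

```r
# Bernoulli form: sum of y log p + (1 - y) log(1 - p)
loglik.direct = function(p, y) {
  sum(y * log(p) + (1 - y) * log(1 - p))
}

# Re-arranged form: sum of log(1 - p) + y * (log odds)
loglik.logodds = function(p, y) {
  sum(log(1 - p) + y * log(p / (1 - p)))
}

p = c(0.9, 0.2, 0.6, 0.5)   # illustrative success probabilities
y = c(1, 0, 1, 0)           # illustrative outcomes
loglik.direct(p, y)
loglik.logodds(p, y)
all.equal(loglik.direct(p, y), loglik.logodds(p, y))
```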

Logistic regression

Assume that \[ \log{\left( \frac{p(\vec{x})}{1-p(\vec{x})}\right)} = \beta_0 + \vec{\beta}\cdot\vec{x} \]
- Interpretation: unit change in feature \(x_j\) adds \(\beta_j\) to the log odds of \(Y=1\)
Equivalently, assume that \[ p(\vec{x}) = \frac{e^{\beta_0 + \vec{\beta}\cdot\vec{x}}}{1+e^{\beta_0 + \vec{\beta}\cdot\vec{x}}} = \frac{1}{1+e^{-(\beta_0 + \vec{\beta}\cdot\vec{x})}} \]
- No easy interpretation of how changing \(x_j\) changes the probability that \(Y=1\)
Jargon: people call \(\log{p/(1-p)}\) the logit transform of \(p\) (probability \(\mapsto\) log-odds), and \(e^q/(1+e^q)\) is the inverse logit transform (log-odds \(\mapsto\) probability)
- The faraway library implements these as the logit() and ilogit() functions
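If you would rather not load an extra package, the two transforms are one-liners. The sketch below uses the made-up names `my.logit` and `my.ilogit`, and compares them with base R's `qlogis()` and `plogis()` (the quantile and distribution functions of the standard logistic distribution), which compute the same things.

```r
my.logit = function(p) { log(p / (1 - p)) }      # probability -> log-odds
my.ilogit = function(q) { 1 / (1 + exp(-q)) }    # log-odds -> probability

p = c(0.1, 0.5, 0.9)
q = c(-2, 0, 2)

all.equal(my.ilogit(my.logit(p)), p)   # the transforms invert each other
all.equal(my.logit(p), qlogis(p))      # agrees with base R
all.equal(my.ilogit(q), plogis(q))
```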

The logistic curve

Very large negative values of \(\beta_0 + \vec{\beta}\cdot \vec{x}\): probabilities driven to 0
Exponential take-off as \(\beta_0 + \vec{\beta}\cdot \vec{x}\) increases
Probability \(1/2\) as \(\beta_0 + \vec{\beta}\cdot \vec{x}\) crosses 0
Very large positive values of \(\beta_0 + \vec{\beta}\cdot \vec{x}\) : probabilities driven to 1
“Diminishing returns” or “saturation”
- If we start from log-odds of 0 (probability 1/2), adding or subtracting 1 to the log-odds changes the probability to \(0.7310586\) or \(0.2689414\)…
- … but if we start from a log-odds of 10 (probability \(0.9999546\)), adding or subtracting one from log-odds barely matters (probabilities \(0.9999833\) or \(0.9998766\))
- Similarly if the initial log-odds were \(-10\)
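The numbers above can be reproduced with base R's `plogis()`, which is the inverse logit:

```r
# Start from log-odds 0, move one unit either way
near.zero = plogis(c(-1, 0, 1))
round(near.zero, 7)

# Start from log-odds 10, move one unit either way
near.ten = plogis(c(9, 10, 11))
round(near.ten, 7)

# Size of the change in probability from a one-unit increase in log-odds
diff(near.zero)
diff(near.ten)
```

The same one-unit step in log-odds moves the probability by over 0.2 near the middle of the curve, and by about 0.0001 or less out at log-odds of 10.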

Thinking through logistic regression

Suppose we’ve estimated a logistic regression here and gotten \(\hat{\beta}_0 = 0, \hat{\beta}_1 = 1, \hat{\beta}_2 = 1\)
The classification boundary will be the place where the log-odds \(=0\), so the probability \(=1/2\), or the line \(x_2 = -x_1\) (dashed, above)
Movement perpendicular to the boundary changes the log odds
Movement parallel to the boundary does not change the log odds
- The points \(A\) and \(B\) will have equal log odds for \(Y=1\) (and those log odds will be \(> 0\))
- The point \(C\) will have slightly lower log odds than either \(A\) or \(B\)
- The point \(D\) will have much lower log odds than either \(A\) or \(B\)
- … even though \(D\) is geometrically closer to \(A\) and \(B\) than \(C\) is
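To make this concrete, here is a sketch using the estimated coefficients above. The coordinates for \(A\), \(B\), \(C\) and \(D\) are hypothetical: they are chosen only to match the verbal description, not read off the original figure.

```r
beta0 = 0
beta = c(1, 1)

# Hypothetical coordinates, chosen to match the description in the text
pts = rbind(A = c(1, 2), B = c(2, 1), C = c(5, -2.5), D = c(0.5, -0.5))

log.odds = beta0 + as.vector(pts %*% beta)
names(log.odds) = rownames(pts)
log.odds             # A and B tie; C is a bit lower; D is on the boundary
plogis(log.odds)     # the corresponding probabilities of Y = 1

# Euclidean distance of every point from A
dist.to.A = sqrt(rowSums(sweep(pts, 2, pts["A", ])^2))
dist.to.A            # D is much nearer to A than C is
```

Only the component of a move along \(\vec{\beta}\) matters for the log odds; sliding along the boundary direction is free.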

How do we estimate logistic regression?

Maximize the log-likelihood!
Take derivatives w.r.t. parameters, set equal to zero…
There is no closed-form solution
- This is the usual story with maximum likelihood; linear regression with Gaussian noise is kind of weird in that you can write out the maximum explicitly
- Fortunately, we can optimize numerically
- See backup on how we do that
- WARNING: Maximum likelihood estimation of a logistic regression will diverge when you give it linearly-separable data
  - Can you explain why?
  - This is what we have in our running example
  - You should be so lucky as to have this problem in real life
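One way to see the divergence, sketched on a made-up one-dimensional data set that is perfectly separated at zero: keep the direction of the coefficient fixed and scale it up. The log-likelihood keeps increasing towards 0 without ever reaching it, so there is no finite maximizer.

```r
# Tiny separable data set (illustrative): negatives left of 0, positives right
x = c(-2, -1, 1, 2)
y = c(0, 0, 1, 1)

# Log-likelihood of a no-intercept logistic regression with slope beta.
# log p_i if y_i = 1, log(1 - p_i) if y_i = 0; both equal
# log plogis(+/- beta x_i), computed stably with log.p = TRUE
loglik = function(beta) {
  sum(plogis((2 * y - 1) * beta * x, log.p = TRUE))
}

slopes = c(1, 5, 10, 20)
values = sapply(slopes, loglik)
values   # increasing towards 0 as the slope grows
```

Bigger coefficients push every fitted probability closer to the correct 0 or 1, so a numerical optimizer just keeps going until it hits its iteration limit or the probabilities round off.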

How do we estimate logistic regression?

In R:

glm(y ~ x1 + x2, data = df, family = "binomial")

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## 
## Call:  glm(formula = y ~ x1 + x2, family = "binomial", data = df)
## 
## Coefficients:
## (Intercept)           x1           x2  
##     0.07124      1.66586      1.44060  
## 
## Degrees of Freedom: 199 Total (i.e. Null);  197 Residual
## Null Deviance:       277.3 
## Residual Deviance: 1.849e-09     AIC: 6

The family="binomial" option tells glm() that we’re trying to estimate the probability that y \(==1\), and the default “link function” is the logistic
- if you really want another transformation of the probability (like the “probit”), that’s an additional option
Notice the warning about not converging due to perfect separation of the classes
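A sketch of the same call on data where the classes overlap, so the fit converges without warnings. The simulated data frame and the seed are assumptions made up for this example. It also shows the probit option and how to turn fitted probabilities into classifications.

```r
set.seed(2020)
df = data.frame(x1 = rnorm(200), x2 = rnorm(200))
# Overlapping classes: the true log-odds are 0.5 + x1 - x2
df$y = rbinom(200, size = 1, prob = plogis(0.5 + df$x1 - df$x2))

logit.fit = glm(y ~ x1 + x2, data = df, family = "binomial")
coef(logit.fit)

# Same model with the probit transformation instead of the logit
probit.fit = glm(y ~ x1 + x2, data = df, family = binomial(link = "probit"))

# Fitted probabilities, then classify at a threshold of 1/2
p.hat = predict(logit.fit, type = "response")
y.hat = as.numeric(p.hat >= 0.5)
mean(y.hat == df$y)   # in-sample accuracy

# Thresholding the probability at 1/2 = thresholding the log-odds at 0
log.odds.hat = predict(logit.fit, type = "link")
all((log.odds.hat >= 0) == (p.hat >= 0.5))
```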

Why logistic regression?

Tradition / recycling of methods and algorithms from linear regression
- Tables of coefficients, \(p\)-values, confidence intervals…
- We can do very familiar statistical inference when the model’s right
- Less of a concern for data mining
  - “in the predictive setting, all parameters are nuisance parameters” (Butler 1986, 1)
It often works pretty well, especially if you give it good features
- As the proverb says: “When you’re fund-raising, it’s ‘artificial intelligence’; when you’re hiring, it’s ‘machine learning’; when you’re implementing, it’s ‘logistic regression’”
- … but lots of things work pretty well, if you give them good features

Summing up

The prototype method works well for classification if each class comes from a well-separated clump
The prototype method is a special case of linear classification, where we try to find a linear boundary between the classes
We can often get good performance by looking for classifiers with large margins
Logistic regression extends linear classifiers to an actual probability model
- We can apply any probability threshold we like
- We can check the model
- … all of which may be superfluous if we just want to classify

Going beyond linear classification

Can’t separate the “x” points from the “o” points with a linear boundary
In fact, no linear classifier will do much better than chance here
So should we just give up on linear methods, or is there some way to adapt these ideas? A hint:

Backup: Optimizing the log-likelihood

We have some function \(L(\theta)\) which we want to maximize
- For logistic regression, this is the log-likelihood of the responses \(y_i\) conditional on the features \(\vec{x}_i\), and \(\theta =\) the whole vector of parameters
Gradient ascent: start with a guess \(\theta_0\) for the whole vector of parameters, find the gradient \(\nabla L(\theta_0)\), try \(\theta_1 = \theta_0 + \eta \nabla L(\theta_0)\), keep going until the gradient is \(\approx 0\)
- “Find the direction in which the log-likelihood is increasing most rapidly; take a small step in that direction”
- Need to control the step size \(\eta\)
- Sometimes make \(\eta\) shrink as we take more steps
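A minimal sketch of gradient ascent for logistic regression, assuming a fixed step size and using the gradient of the average log-likelihood, \(\frac{1}{n}\sum_i{(y_i - p(\vec{x}_i)) \vec{x}_i}\). The function name, the default `eta`, the tolerance and the small data set are all illustrative choices.

```r
logistic.gradient.ascent = function(x, y, eta = 0.5, tol = 1e-6, max.steps = 1e4) {
  X = cbind(1, x)                  # add a column of 1s for the intercept
  theta = rep(0, ncol(X))          # initial guess: all parameters zero
  for (step in 1:max.steps) {
    p = plogis(as.vector(X %*% theta))
    # Gradient of the average log-likelihood
    gradient = as.vector(t(X) %*% (y - p)) / nrow(X)
    if (sqrt(sum(gradient^2)) < tol) break   # gradient is approximately 0
    theta = theta + eta * gradient           # small step uphill
  }
  theta
}

# Small data set with overlapping classes, so the maximum is finite
x = c(-2, -1, -1, 0, 0, 1, 1, 2)
y = c(0, 0, 1, 0, 1, 0, 1, 1)
theta.ga = logistic.gradient.ascent(x, y)
theta.glm = coef(glm(y ~ x, family = "binomial"))
rbind(theta.ga, theta.glm)
```

Dividing by \(n\) does not move the maximum; it just keeps a reasonable step size from depending on the sample size.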
Newton’s method: start with guess \(\theta_0\), find \(\nabla L(\theta_0)\) and the Hessian \(\nabla \nabla L(\theta_0)\), then try \(\theta_1 = \theta_0 - (\nabla\nabla L(\theta_0))^{-1} \nabla L(\theta_0)\), keep going until the gradient is \(\approx 0\)
- Uses the second derivatives (Hessians) to say what the step size should be
- Needs fewer optimization steps than gradient ascent, but each step is more expensive (work out 2nd derivatives and invert a matrix)
- There are tricks for approximating the Hessian matrix to speed it up (like “Fisher scoring”)
- For logistic regression, each Newton’s method step ends up looking like a weighted linear least-squares problem
- The under-the-hood default method for the glm() function in R
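A sketch of plain Newton's method for logistic regression, using the gradient \(\sum_i{(y_i - p_i)\vec{x}_i}\) and Hessian \(-\sum_i{p_i(1-p_i)\vec{x}_i \vec{x}_i^T}\). This is not the actual code inside `glm()`; the function name, tolerance and data set are illustrative.

```r
logistic.newton = function(x, y, tol = 1e-8, max.steps = 25) {
  X = cbind(1, x)                  # add a column of 1s for the intercept
  theta = rep(0, ncol(X))
  for (step in 1:max.steps) {
    p = plogis(as.vector(X %*% theta))
    gradient = as.vector(t(X) %*% (y - p))
    if (sqrt(sum(gradient^2)) < tol) break
    # Hessian: minus X' W X, with weights p (1 - p) on the diagonal of W
    hessian = -t(X) %*% (X * (p * (1 - p)))
    # Newton update: theta - H^{-1} gradient (solve a linear system, don't invert)
    theta = theta - as.vector(solve(hessian, gradient))
  }
  theta
}

# Small data set with overlapping classes, so the maximum is finite
x = c(-2, -1, -1, 0, 0, 1, 1, 2)
y = c(0, 0, 1, 0, 1, 0, 1, 1)
theta.newton = logistic.newton(x, y)
theta.glm = coef(glm(y ~ x, family = "binomial"))
rbind(theta.newton, theta.glm)
```

The `X * (p * (1 - p))` term is where the weighted least-squares structure shows up: each row of the design matrix gets weight \(p_i(1-p_i)\).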
Stochastic gradient descent: At each step, randomly sample a small number (10, or even 1) of the data points, and calculate the gradient using just those data points
- Unbiased estimate of the log-likelihood and so of the gradient
- Adds some noise due to sampling but much faster when \(n\) is really big
- Stochastic Newton’s method: obvious modifications are obvious
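A sketch of the stochastic version, on simulated data. The batch size of 10, the step-size schedule \(0.5/\sqrt{t}\), the number of steps and the data-generating process are all assumptions picked for illustration rather than tuned recommendations.

```r
set.seed(4)
n = 500
x = rnorm(n)
y = rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * x))
X = cbind(1, x)

# Full-data log-likelihood, used here only to check progress
loglik = function(theta) {
  sum(plogis((2 * y - 1) * as.vector(X %*% theta), log.p = TRUE))
}

theta = c(0, 0)
for (step in 1:2000) {
  batch = sample(n, size = 10)                 # small random subset of the data
  X.b = X[batch, , drop = FALSE]
  p.b = plogis(as.vector(X.b %*% theta))
  # Noisy estimate of the average gradient, from the batch alone
  gradient = as.vector(t(X.b) %*% (y[batch] - p.b)) / length(batch)
  theta = theta + (0.5 / sqrt(step)) * gradient  # shrinking step size
}
theta
c(start = loglik(c(0, 0)), end = loglik(theta))
```

Each step touches 10 data points rather than all 500; the shrinking step size is what damps the sampling noise as the iterations go on.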
Simulated annealing: start with a guess \(\theta_0\), then add a small random disturbance to \(\theta_0\) to get \(\theta^{\prime}\); if \(L(\theta^{\prime}) > L(\theta_0)\), set \(\theta_1=\theta^{\prime}\) and go on; if \(L(\theta^{\prime}) \leq L(\theta_0)\), make the switch with probability \(e^{(L(\theta^{\prime}) - L(\theta_0))/T}\); decrease \(T\) as you go on
- “If the new position is better, go to it; otherwise, switch anyway if it’s only a little worse, to explore your options; get more picky as you go”
- (Inspired by an analogy to how cooling materials [“annealing”] in physics drives the substance to its minimum-energy state)
  - Energy works like negative log-likelihood
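A sketch of simulated annealing applied to a logistic log-likelihood. The cooling schedule (temperature \(T_0/t\) at step \(t\)), the Gaussian proposal with standard deviation 0.1, the number of steps and the data are assumptions chosen for illustration; the function name `anneal` is made up.

```r
# L: function to maximize; theta0: starting guess
anneal = function(L, theta0, steps = 5000, T0 = 1, scale = 0.1) {
  theta = theta0
  current = L(theta)
  for (t in 1:steps) {
    temperature = T0 / t    # cool down as we go (one possible schedule)
    proposal = theta + rnorm(length(theta), sd = scale)  # small random disturbance
    candidate = L(proposal)
    # Always accept improvements; accept worse moves with
    # probability exp((L' - L) / temperature)
    if (candidate > current ||
        runif(1) < exp((candidate - current) / temperature)) {
      theta = proposal
      current = candidate
    }
  }
  theta
}

# Small data set with overlapping classes
x = c(-2, -1, -1, 0, 0, 1, 1, 2)
y = c(0, 0, 1, 0, 1, 0, 1, 1)
loglik = function(theta) {
  sum(plogis((2 * y - 1) * (theta[1] + theta[2] * x), log.p = TRUE))
}

set.seed(662)
theta.sa = anneal(loglik, theta0 = c(0, 0))
theta.sa
c(start = loglik(c(0, 0)), end = loglik(theta.sa))
```

For a smooth, concave problem like this one, annealing is overkill next to Newton's method; it earns its keep when the objective has many local maxima or no usable derivatives.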

Why “logistic”?

The function \(e^{t}/(1+e^{t})\) is called the logistic function or logistic curve
It first showed up in models of population growth against a fixed resource base: starts small, exponential growth, then saturates as the population approaches what resources can support
- And “logistics” is the art of supplying an army (from the French word loger, “to lodge”)
So in one sense this is just re-cycling bits of math because the ancestors could…
… but the log-odds ratio does legitimately show up when we try to do maximum likelihood for any model with binary responses
- Making the log-odds linear in \(\vec{x}\) is less obvious

References

Butler, Ronald W. 1986. “Predictive Likelihood Inference with Applications.” Journal of the Royal Statistical Society B 48:1–38. http://www.jstor.org/stable/2345635.