Random Feature Machines, a.k.a. Random Kitchen Sinks

36-465/665, Spring 2021

22 April 2021 (Lecture 22)

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{\Risk}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \newcommand{\Penalty}{\Omega} \newcommand{\Lagrangian}{\mathcal{L}} \newcommand{\HoldoutRisk}{\tilde{\Risk}} \DeclareMathOperator{\sgn}{sgn} \newcommand{\Margin}{M} \]

Reminders

Reminders (2)

Kernel machines are great

Kernel machines are not so great

Random feature machines, a.k.a. kitchen sinks

What does this have to do with kernels?

Bochner’s theorem: If \(K(x,z)=K(x-z)\) is a continuous, translation-invariant kernel over \(\mathbb{R}^d\), then there is a probability distribution \(\rho\) over \(\mathbb{R}^d\) and a constant \(C>0\) such that \[ K(x,z) = C \int_{\mathbb{R}^d}{\rho(w)\cos{(w\cdot(x-z))} dw} \]

Note: \(K(x,z)/C\) is another kernel with the same feature space, so we can set \(C=1\) “without loss of generality”
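A quick numerical sanity check of the formula, as a sketch: for the Gaussian kernel \(K(x,z) = \myexp{-\|x-z\|^2/2}\), the matching \(\rho\) is the standard Gaussian \(N(0, I_d)\) (with \(C=1\)), so a Monte Carlo average of \(\cos{(w\cdot(x-z))}\) over draws \(w \sim \rho\) should come out close to the kernel value:

# Monte Carlo check of Bochner's theorem for the Gaussian kernel with bandwidth 1:
# average cos(w.(x-z)) over draws w ~ N(0, I_d) and compare to exp(-|x-z|^2/2)
set.seed(465)
d <- 2
x <- c(1, 0.5)
z <- c(-0.3, 0.2)
n.draws <- 1e5
W <- matrix(rnorm(n.draws * d), nrow = n.draws)   # each row is one draw of w
c(monte.carlo = mean(cos(W %*% (x - z))),
  exact = exp(-sum((x - z)^2) / 2))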

What does this have to do with kernels? (2)

Random feature approximation to kernel machines
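One way to turn Bochner’s theorem into an explicit, finite feature map (the construction in Rahimi and Recht 2008): draw frequencies \(w_j \sim \rho\) and phases \(b_j\) uniform on \([-\pi,\pi]\), and use the identity \(\cos{A}\cos{B} = \frac{1}{2}\left(\cos{(A-B)} + \cos{(A+B)}\right)\), so that \[ \Expect{2\cos{(w\cdot x + b)}\cos{(w\cdot z + b)}} = \Expect{\cos{(w\cdot(x-z))}} + \Expect{\cos{(w\cdot(x+z)+2b)}} = K(x,z) \] because averaging over the uniform phase \(b\) makes the second expectation vanish. The sample average of \(2\cos{(w_j\cdot x + b_j)}\cos{(w_j\cdot z + b_j)}\) over \(j=1,\ldots,q\) is therefore an unbiased estimate of \(K(x,z)\), i.e., the random features \(\sqrt{2/q}\,\cos{(w_j\cdot x + b_j)}\) have inner products which approximate the kernel, and the approximation sharpens as \(q\) grows.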

We don’t really need a kernel

Kitchen sinks in practice

A demo in R

library(expandFunctions)
# Make q = 30 random features for p = 2 inputs: random frequencies W (sd = 1)
# and random phases b, uniform on [-pi, pi]
my.rFfs <- raptMake(p = 2, q = 30, WdistOpt = list(sd = 1),
    bDistOpt = list(min = -pi, max = pi))
# rapt() applies the resulting random affine transformation to the data matrix,
# giving one column per random feature
dim(rapt(as.matrix(df[, c("x1", "x2")]), my.rFfs))
## [1] 200  30
# Take cosines of the projections and attach the response
df.augmented <- data.frame(y = df$y, cos(rapt(as.matrix(df[, c("x1", "x2")]), my.rFfs)))

Here are the first few \(W_j\)

##            [,1]        [,2]
## [1,]  1.0917135 -1.83151364
## [2,]  0.4475455  1.16703196
## [3,]  0.7942617 -1.64220647
## [4,] -0.6990022 -0.43673732
## [5,] -1.3940621 -0.48644337
## [6,]  0.9985419 -0.09824253

and the corresponding \(b_j\)

##            [,1]
## [1,]  0.4192803
## [2,] -1.4453380
## [3,]  0.4489669
## [4,] -2.5730839
## [5,] -2.2053458
## [6,]  0.4401365
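For concreteness, here is a sketch of the same feature construction written out directly, under the assumption that rapt() forms the affine map \(x \mapsto Wx + b\) row by row, with Gaussian frequencies of s.d. 1 and phases uniform on \([-\pi,\pi]\), matching the options passed to raptMake() above:

# Build the random Fourier features by hand, assuming rapt() computes x -> Wx + b
# for each row x of the data (W is q x p, b is length q)
X <- as.matrix(df[, c("x1", "x2")])
q <- 30
W <- matrix(rnorm(q * ncol(X), sd = 1), nrow = q)   # fresh random frequencies
b <- runif(q, min = -pi, max = pi)                  # fresh random phases
Phi <- cos(X %*% t(W) + matrix(b, nrow(X), q, byrow = TRUE))
dim(Phi)   # 200 x 30, just like the rapt() version above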

# Plain logistic regression, using the 30 random features as the predictors
glm.rks <- glm(y ~ ., data = df.augmented, family = "binomial")
# In-sample confusion matrix, thresholding the predicted probabilities at 0.5
table(predict(glm.rks, type = "response") >= 0.5, df.augmented$y)
##        
##           0   1
##   FALSE 100   0
##   TRUE    0 100
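One practical point worth flagging: to predict at new data points, the new inputs have to go through the same random features (the same \(W_j\) and \(b_j\) stored in my.rFfs), not freshly drawn ones. A sketch, for a hypothetical new data frame df.new with columns x1 and x2:

# Predict at new points by re-using the SAME random features in my.rFfs
# (df.new is a hypothetical data frame with columns x1 and x2)
new.features <- data.frame(cos(rapt(as.matrix(df.new[, c("x1", "x2")]), my.rFfs)))
predict(glm.rks, newdata = new.features, type = "response")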

Summing up

References

Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” E-print, arxiv:2012.00152. https://arxiv.org/abs/2012.00152.

Rahimi, Ali, and Benjamin Recht. 2008. “Random Features for Large-Scale Kernel Machines.” In Advances in Neural Information Processing Systems 20 (NIPS 2007), edited by John C. Platt, Daphne Koller, Yoram Singer, and Samuel T. Roweis, 1177–84. Red Hook, New York: Curran Associates. http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.

———. 2009. “Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning.” In Advances in Neural Information Processing Systems 21 (NIPS 2008), edited by Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, 1313–20. Red Hook, New York: Curran Associates. https://papers.nips.cc/paper/2008/hash/0efe32849d230d7f53049ddc4a4b0c60-Abstract.html.