36-465/665, Spring 2021
22 April 2021 (Lecture 22)
\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[1]{\mathrm{Cov}\left[ #1 \right]} \newcommand{\Risk}{r} \newcommand{\EmpRisk}{\hat{\Risk}} \newcommand{\Loss}{\ell} \newcommand{\OptimalStrategy}{\sigma} \DeclareMathOperator*{\argmin}{argmin} \newcommand{\ModelClass}{S} \newcommand{\OptimalModel}{s^*} \DeclareMathOperator{\tr}{tr} \newcommand{\Indicator}[1]{\mathbb{1}\left\{ #1 \right\}} \newcommand{\myexp}[1]{\exp{\left( #1 \right)}} \newcommand{\eqdist}{\stackrel{d}{=}} \newcommand{\Rademacher}{\mathcal{R}} \newcommand{\EmpRademacher}{\hat{\Rademacher}} \newcommand{\Growth}{\Pi} \newcommand{\VCD}{\mathrm{VCdim}} \newcommand{\OptDomain}{\Theta} \newcommand{\OptDim}{p} \newcommand{\optimand}{\theta} \newcommand{\altoptimand}{\optimand^{\prime}} \newcommand{\ObjFunc}{M} \newcommand{\outputoptimand}{\optimand_{\mathrm{out}}} \newcommand{\Hessian}{\mathbf{h}} \newcommand{\Penalty}{\Omega} \newcommand{\Lagrangian}{\mathcal{L}} \newcommand{\HoldoutRisk}{\tilde{\Risk}} \DeclareMathOperator{\sgn}{sgn} \newcommand{\Margin}{M} \]
Bochner’s theorem: If \(K(x,z)=K(x-z)\) is a kernel over \(\mathbb{R}^d\), then there is a probability distribution \(\rho\) over \(\mathbb{R}^d\) and a constant \(C>0\) such that \[ K(x,z) = C \int_{\mathbb{R}^d}{\rho(w)\cos{(w\cdot(x-z))} dw} \]

Note: \(K(x,z)/C\) is another kernel with the same feature space, so we can set \(C=1\) “without loss of generality”.
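As a small sanity check on Bochner’s theorem (not part of the running example), take the Gaussian kernel \(K(x,z) = \myexp{-\|x-z\|^2/2}\), whose spectral distribution \(\rho\) is the standard \(d\)-dimensional Gaussian: averaging \(\cos{(w \cdot (x-z))}\) over draws \(w \sim \rho\) should come out close to the exact kernel value. This is the approximation behind the random Fourier features of Rahimi and Recht (2008) used below; the test points x and z are arbitrary.

# Monte Carlo check of Bochner's theorem for the Gaussian kernel
# (C = 1, rho = standard bivariate Gaussian); x and z are arbitrary points
x <- c(1, 2)
z <- c(0.5, -1)
exact <- exp(-sum((x - z)^2)/2)          # Gaussian kernel, bandwidth 1
W <- matrix(rnorm(1e5 * 2), ncol = 2)    # 1e5 draws of w ~ rho, one per row
approx <- mean(cos(W %*% (x - z)))       # average of cos(w.(x-z)) over the draws
c(exact = exact, approx = approx)        # the two should agree up to Monte Carlo noise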
library(expandFunctions)
# Draw the random feature map once: p = 2 input dimensions, q = 30 random
# features, frequencies W drawn with sd = 1 and offsets b uniform on [-pi, pi]
my.rFfs <- raptMake(p = 2, q = 30, WdistOpt = list(sd = 1),
    bDistOpt = list(min = -pi, max = pi))
# rapt() applies the random affine projections to the data, giving one
# column per random feature
dim(rapt(as.matrix(df[, c("x1", "x2")]), my.rFfs))
## [1] 200 30
# Taking cosines gives the random Fourier features; we will regress y on these
df.augmented <- data.frame(y = df$y, cos(rapt(as.matrix(df[, c("x1", "x2")]), my.rFfs)))
Here are the first few \(W_j\)
## [,1] [,2]
## [1,] 1.0917135 -1.83151364
## [2,] 0.4475455 1.16703196
## [3,] 0.7942617 -1.64220647
## [4,] -0.6990022 -0.43673732
## [5,] -1.3940621 -0.48644337
## [6,] 0.9985419 -0.09824253
and the corresponding \(b_j\)
## [,1]
## [1,] 0.4192803
## [2,] -1.4453380
## [3,] 0.4489669
## [4,] -2.5730839
## [5,] -2.2053458
## [6,] 0.4401365
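To demystify what raptMake() and rapt() are doing with these \(W_j\) and \(b_j\), here is a base-R sketch of the same construction, on the assumption (consistent with the options above and the printouts just shown) that rapt() computes the affine projections \(x_i \cdot W_j + b_j\). The draws of W and b here are fresh ones, and df.by.hand is just an illustrative name, not used later.

# Base-R sketch of the same random-feature construction: draw W and b once,
# then form the 200 x 30 matrix of features cos(x_i . W_j + b_j)
X <- as.matrix(df[, c("x1", "x2")])
W <- matrix(rnorm(2 * 30, sd = 1), nrow = 2)   # random frequencies, one column per feature
b <- runif(30, min = -pi, max = pi)            # random phase offsets
Phi <- cos(sweep(X %*% W, 2, b, "+"))          # add b_j to column j, then take cosines
df.by.hand <- data.frame(y = df$y, Phi)        # same shape as df.augmented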
# Logistic regression of the class labels on the 30 random cosine features
glm.rks <- glm(y ~ ., data = df.augmented, family = "binomial")
# In-sample confusion matrix, thresholding the predicted probabilities at 0.5
table(predict(glm.rks, type = "response") >= 0.5, df.augmented$y)
##        
##           0   1
##   FALSE 100   0
##   TRUE    0 100
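True to the “random kitchen sinks” recipe (Rahimi and Recht 2009), the randomly drawn features stay fixed after fitting: to predict at a new point, we push it through the same \(W_j\) and \(b_j\) and then apply the logistic regression. A sketch, assuming the cosine columns pick up the same default names (X1, X2, ...) in both data frames; x.new is a made-up point.

# Predicting at a new point: re-use the same random feature map my.rFfs,
# never re-draw it
x.new <- matrix(c(0.3, -1.2), nrow = 1, dimnames = list(NULL, c("x1", "x2")))
features.new <- data.frame(cos(rapt(x.new, my.rFfs)))  # same W_j and b_j as before
predict(glm.rks, newdata = features.new, type = "response")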
Domingos, Pedro. 2020. “Every Model Learned by Gradient Descent Is Approximately a Kernel Machine.” E-print, arXiv:2012.00152. https://arxiv.org/abs/2012.00152.
Rahimi, Ali, and Benjamin Recht. 2008. “Random Features for Large-Scale Kernel Machines.” In Advances in Neural Information Processing Systems 20 (NIPS 2007), edited by John C. Platt, Daphne Koller, Yoram Singer, and Samuel T. Roweis, 1177–84. Red Hook, New York: Curran Associates. http://papers.nips.cc/paper/3182-random-features-for-large-scale-kernel-machines.
———. 2009. “Weighted Sums of Random Kitchen Sinks: Replacing Minimization with Randomization in Learning.” In Advances in Neural Information Processing Systems 21 (NIPS 2008), edited by Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, 1313–20. Red Hook, New York: Curran Associates. https://papers.nips.cc/paper/2008/hash/0efe32849d230d7f53049ddc4a4b0c60-Abstract.html.