0. Introduction

On this page: formulations for optimization problems and the reason to focus on convex online gradient descent. {Part of the Convex Optimization (know your Gradient Descents!) series}

Jimmy Yeh
2 min read · Nov 28, 2020

Machine learning optimization, the formulations

x^* \in \underset{x\in X}{\arg\min} \sum_{n=1}^N f_n(x) + \lambda g(x)
The optimization problem for machine learning.

Here N is the sample size and D is the dimension of the variable x; both are quite large in many deep learning applications (e.g. D = 1024 × 1024 for pixel-level inputs, about a million dimensions). Before turning to algorithms, let us look at a few concrete instances of this formulation.
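As a concrete anchor for what follows, here is a minimal sketch of this generic objective in numpy; the helper names (objective, make_squared_loss, lam) and the toy data are my own illustration, not part of the formulation:

import numpy as np

def objective(x, per_example_losses, g, lam):
    """Generic ML objective: a sum of per-example losses plus a regularizer."""
    return sum(f(x) for f in per_example_losses) + lam * g(x)

def make_squared_loss(x_n, y_n):
    """Per-example squared loss f_n(w) = (y_n - <x_n, w>)^2."""
    return lambda w: (y_n - x_n @ w) ** 2

# Toy usage: squared losses on random data with an l2 regularizer g(w) = ||w||^2.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 3)), rng.normal(size=5)
losses = [make_squared_loss(x_n, y_n) for x_n, y_n in zip(X, y)]
print(objective(np.zeros(3), losses, lambda w: np.sum(w ** 2), lam=0.1))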

For linear regression, least-squares regression suggests:

\hat{w}_\text{LS} \in \underset{w\in \mathbb{R}^D}{\arg\min} \sum_{n=1}^N (y_n - \langle x_n, w\rangle)^2
Least-squares regression problem.
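For example, with a design matrix X whose rows are the x_n, this problem can be solved directly with a least-squares solver; a minimal numpy sketch (the synthetic data and variable names are mine):

import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 5
X = rng.normal(size=(N, D))                 # rows are the examples x_n
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)   # noisy targets y_n

def ls_objective(w):
    # sum_n (y_n - <x_n, w>)^2
    return np.sum((y - X @ w) ** 2)

w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # closed-form least-squares fit
print(ls_objective(w_ls), ls_objective(np.zeros(D)))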

The lasso (least absolute shrinkage and selection operator) suggests, for some c > 0:

\hat{w}_\text{lasso} \in \arg\min_{w\in c\mathcal{B}_1\subset\mathbb{R}^D} \sum_{n=1}^N (y_n - \langle x_n, w\rangle)^2,
Lasso regression problem.
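Here c·B₁ = {w : ||w||₁ ≤ c} is the scaled ℓ₁-ball. One way to handle the constraint is projected gradient descent; below is a hedged sketch using the standard sorting-based ℓ₁-ball projection (the step-size rule, iteration count, and function names are my own choices):

import numpy as np

def project_l1_ball(w, c):
    """Euclidean projection of w onto the l1-ball of radius c (sorting-based method)."""
    if np.sum(np.abs(w)) <= c:
        return w
    u = np.sort(np.abs(w))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(w) + 1) > (css - c))[0][-1]
    theta = (css[rho] - c) / (rho + 1.0)
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

def lasso_projected_gradient(X, y, c, iters=500):
    """Projected gradient descent for the constrained lasso problem."""
    step = 1.0 / (2.0 * np.linalg.norm(X, ord=2) ** 2)   # 1/L for the squared loss
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -2.0 * X.T @ (y - X @ w)                  # gradient of sum_n (y_n - <x_n, w>)^2
        w = project_l1_ball(w - step * grad, c)
    return w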

The 𝓁₁-penalized least-squares estimator suggests (λ > 0):

\hat{w}_\text{PLS}\in\underset{w\in\mathbb{R}^D}{\arg\min}\sum_{n=1}^N (y_n - \langle x_n, w\rangle)^2+\lambda\lVert w\rVert_1
𝓁₁-penalized least squares estimator.
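Since ||w||₁ is not differentiable, the penalized form is commonly tackled with proximal (ISTA-style) iterations built on soft-thresholding; a minimal sketch, with the step size set from the spectral norm of X (names and defaults are mine):

import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def l1_penalized_ls(X, y, lam, iters=1000):
    """ISTA for sum_n (y_n - <x_n, w>)^2 + lam * ||w||_1."""
    step = 1.0 / (2.0 * np.linalg.norm(X, ord=2) ** 2)   # 1/L, L = Lipschitz const. of the smooth part
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -2.0 * X.T @ (y - X @ w)
        w = soft_threshold(w - step * grad, step * lam)
    return w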

The above examples are ℝ^D → ℝ regression problems; for binary classification, ℝ^D → {-1, +1}, logistic regression suggests:

\hat{w}_\text{log}\in\underset{w\in\mathbb{R}^D}{\arg\min}\sum_{n=1}^N \log(1+e^{-y_n\langle x_n, w\rangle}).
Logistic regression minimizer.
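A numerically stable way to evaluate this loss is via log(1 + e^u) = logaddexp(0, u); a small sketch of the loss and its gradient for labels in {-1, +1} (function names are mine):

import numpy as np

def logistic_loss(w, X, y):
    """sum_n log(1 + exp(-y_n <x_n, w>)) for labels y_n in {-1, +1}."""
    margins = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -margins))

def logistic_grad(w, X, y):
    margins = y * (X @ w)
    # d/dw log(1 + exp(-m_n)) = -y_n x_n * sigmoid(-m_n)
    return -X.T @ (y / (1.0 + np.exp(margins)))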

The support vector machine, in contrast, suggests:

\hat{w}_\text{SVM}\in\underset{w\in\mathbb{R}^D}{\arg\min}\sum_{n=1}^N h(y_n\langle x_n,w\rangle)+\lambda\lVert w\rVert_2^2,
SVM formulation; note that the hinge loss h(u) = max(0, 1 - u) is not differentiable.
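Because h has a kink at u = 1, one typically optimizes this objective with subgradients; a hedged sketch taking h(u) = max(0, 1 - u) and using 0 as the subgradient at the kink (λ, the step size, and the names are illustrative):

import numpy as np

def hinge(u):
    """Hinge loss h(u) = max(0, 1 - u); not differentiable at u = 1."""
    return np.maximum(0.0, 1.0 - u)

def svm_objective(w, X, y, lam):
    return np.sum(hinge(y * (X @ w))) + lam * np.dot(w, w)

def svm_subgradient_step(w, X, y, lam, step=1e-3):
    """One subgradient step on the SVM objective."""
    margins = y * (X @ w)
    active = margins < 1.0                           # examples where the hinge is active
    sub = -X[active].T @ y[active] + 2.0 * lam * w   # a subgradient of the objective
    return w - step * sub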

Why focus on convex, online gradient descent?

Convexity

Convexity is very useful: it gives theoretical tractability, it is prevalent in applications, and the theory for non-convex problems is scarce, ad hoc, and largely built on top of the convex theory.

The scalability concern: online+gradient

The above optimization formulations are solved approximately via iterative algorithms, and we want the time complexity per iteration to have low dependence on N and D.

Compare online gradient descent, gradient descent, and Newton’s method.

x_{t+1} = x_t - \eta\,\nabla f_{n_t}(x_t)

x_{t+1} = x_t - \eta\sum_{n=1}^N \nabla f_n(x_t)

x_{t+1} = x_t - \Big(\sum_{n=1}^N \nabla^2 f_n(x_t)\Big)^{-1}\sum_{n=1}^N \nabla f_n(x_t)

From top to bottom: online gradient descent, gradient descent, Newton's method (η > 0 is the step size; n_t is the example used at step t).

In terms of scalability we find online gradient descent > gradient descent >> Newton's method: one update touches a single example, all N examples, or all N examples plus a D × D linear solve, respectively.
(Nevertheless, this is a superficial, per-iteration comparison; it says nothing about how many iterations each method needs.)
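To make the comparison concrete, here is a minimal sketch of the three updates for the least-squares objective above; the step sizes and the cycling rule for choosing n_t are my own illustrative choices:

import numpy as np

def online_gd_step(w, x_n, y_n, step):
    """Online (stochastic) gradient step on a single example: O(D) work."""
    return w - step * (-2.0 * (y_n - x_n @ w) * x_n)

def gd_step(w, X, y, step):
    """Full gradient step over all N examples: O(N * D) work."""
    return w - step * (-2.0 * X.T @ (y - X @ w))

def newton_step(w, X, y):
    """Newton step: builds and solves a D x D system, O(N * D^2 + D^3) work."""
    grad = -2.0 * X.T @ (y - X @ w)
    hessian = 2.0 * X.T @ X
    return w - np.linalg.solve(hessian, grad)

# Toy run: cycle through the examples for the online updates.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)
w = np.zeros(10)
for t in range(200):
    w = online_gd_step(w, X[t % 200], y[t % 200], step=1e-2)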
