0. Introduction

On this page: formulations for optimization problems and the reason to focus on convex online gradient descent. {Part of the Convex Optimization (know your Gradient Descents!) series}

Jimmy Yeh
2 min read · Nov 28, 2020

Machine learning optimization, the formulations

x^* \in \underset{x\in X}{\arg\min} \sum_{n=1}^N f_n(x) + \lambda g(x)
The optimization problem for machine learning.

Here N is the sample size and D is the dimension of the variable x; both are quite large in many deep learning applications (e.g. D = 1024 × 1024 for pixel-level inputs, about a million dimensions). Before turning to algorithms, let us look at a few concrete instances of this formulation.
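As a concrete anchor for what follows, here is a minimal sketch of this generic objective in numpy; the helper names (objective, make_squared_loss, lam) and the toy data are my own illustration, not part of the formulation:

import numpy as np

def objective(x, per_example_losses, g, lam):
    """Generic ML objective: a sum of per-example losses plus a regularizer."""
    return sum(f(x) for f in per_example_losses) + lam * g(x)

def make_squared_loss(x_n, y_n):
    """Per-example squared loss f_n(w) = (y_n - <x_n, w>)^2."""
    return lambda w: (y_n - x_n @ w) ** 2

# Toy usage: squared losses on random data with an l2 regularizer g(w) = ||w||^2.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(5, 3)), rng.normal(size=5)
losses = [make_squared_loss(x_n, y_n) for x_n, y_n in zip(X, y)]
print(objective(np.zeros(3), losses, lambda w: np.sum(w ** 2), lam=0.1))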

For linear regression, least-squares regression suggests:

\hat{w}_\text{LS} \in \underset{w\in \mathbb{R}^D}{\arg\min} \sum_{n=1}^N (y_n - \langle x_n, w\rangle)^2
Least-squares regression problem.
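For example, with a design matrix X whose rows are the x_n, this problem can be solved directly with a least-squares solver; a minimal numpy sketch (the synthetic data and variable names are mine):

import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 5
X = rng.normal(size=(N, D))                 # rows are the examples x_n
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)   # noisy targets y_n

def ls_objective(w):
    # sum_n (y_n - <x_n, w>)^2
    return np.sum((y - X @ w) ** 2)

w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)   # closed-form least-squares fit
print(ls_objective(w_ls), ls_objective(np.zeros(D)))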

The lasso (least absolute shrinkage and selection operator) suggests, for some c > 0:

\hat{w}_\text{lasso} \in \arg\min_{w\in c\mathcal{B}_1\subset\mathbb{R}^D} \sum_{n=1}^N (y_n - \langle x_n, w\rangle)^2,
Lasso regression problem.
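Here c·B₁ = {w : ||w||₁ ≤ c} is the scaled ℓ₁-ball. One way to handle the constraint is projected gradient descent; below is a hedged sketch using the standard sorting-based ℓ₁-ball projection (the step-size rule, iteration count, and function names are my own choices):

import numpy as np

def project_l1_ball(w, c):
    """Euclidean projection of w onto the l1-ball of radius c (sorting-based method)."""
    if np.sum(np.abs(w)) <= c:
        return w
    u = np.sort(np.abs(w))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(w) + 1) > (css - c))[0][-1]
    theta = (css[rho] - c) / (rho + 1.0)
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

def lasso_projected_gradient(X, y, c, iters=500):
    """Projected gradient descent for the constrained lasso problem."""
    step = 1.0 / (2.0 * np.linalg.norm(X, ord=2) ** 2)   # 1/L for the squared loss
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -2.0 * X.T @ (y - X @ w)                  # gradient of sum_n (y_n - <x_n, w>)^2
        w = project_l1_ball(w - step * grad, c)
    return w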

The 𝓁₁-penalized least-squares estimator suggests (λ > 0):

\hat{w}_\text{PLS}\in\underset{w\in\mathbb{R}^D}{\arg\min}\sum_{n=1}^N (y_n - \langle x_n, w\rangle)^2+\lambda\lVert w\rVert_1
𝓁₁-penalized least squares estimator.
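Since ||w||₁ is not differentiable, the penalized form is commonly tackled with proximal (ISTA-style) iterations built on soft-thresholding; a minimal sketch, with the step size set from the spectral norm of X (names and defaults are mine):

import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def l1_penalized_ls(X, y, lam, iters=1000):
    """ISTA for sum_n (y_n - <x_n, w>)^2 + lam * ||w||_1."""
    step = 1.0 / (2.0 * np.linalg.norm(X, ord=2) ** 2)   # 1/L, L = Lipschitz const. of the smooth part
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -2.0 * X.T @ (y - X @ w)
        w = soft_threshold(w - step * grad, step * lam)
    return w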

The above examples are ℝ^D → ℝ regression problems; for binary classification, ℝ^D → {-1, +1}, logistic regression suggests:

\hat{w}_\text{log}\in\underset{w\in\mathbb{R}^D}{\arg\min}\sum_{n=1}^N \log(1+e^{-y_n\langle x_n, w\rangle}).
Logistic regression minimizer.
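A numerically stable way to evaluate this loss is via log(1 + e^u) = logaddexp(0, u); a small sketch of the loss and its gradient for labels in {-1, +1} (function names are mine):

import numpy as np

def logistic_loss(w, X, y):
    """sum_n log(1 + exp(-y_n <x_n, w>)) for labels y_n in {-1, +1}."""
    margins = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -margins))

def logistic_grad(w, X, y):
    margins = y * (X @ w)
    # d/dw log(1 + exp(-m_n)) = -y_n x_n * sigmoid(-m_n)
    return -X.T @ (y / (1.0 + np.exp(margins)))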

The support vector machine, in contrast, suggests:

\hat{w}_\text{SVM}\in\underset{w\in\mathbb{R}^D}{\arg\min}\sum_{n=1}^N h(y_n\langle x_n,w\rangle)+\lambda\lVert w\rVert_2^2,
SVM formulation; note that the hinge loss h(u) = max(0, 1 - u) is not differentiable.
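Because h has a kink at u = 1, one typically optimizes this objective with subgradients; a hedged sketch taking h(u) = max(0, 1 - u) and using 0 as the subgradient at the kink (λ, the step size, and the names are illustrative):

import numpy as np

def hinge(u):
    """Hinge loss h(u) = max(0, 1 - u); not differentiable at u = 1."""
    return np.maximum(0.0, 1.0 - u)

def svm_objective(w, X, y, lam):
    return np.sum(hinge(y * (X @ w))) + lam * np.dot(w, w)

def svm_subgradient_step(w, X, y, lam, step=1e-3):
    """One subgradient step on the SVM objective."""
    margins = y * (X @ w)
    active = margins < 1.0                           # examples where the hinge is active
    sub = -X[active].T @ y[active] + 2.0 * lam * w   # a subgradient of the objective
    return w - step * sub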

Why focus on convex, online gradient descent?

Convexity

Convexity is very useful: it gives theoretical tractability, it is prevalent in applications, and the theory for non-convex problems is scarce, ad hoc, and largely built on top of the convex theory.

The scalability concern: online+gradient

The above optimization formulations are solved approximately via iterative algorithms, and we want the time complexity per iteration to have low dependence on N and D.

Compare online gradient descent, gradient descent, and Newton’s method.

x_{t+1} = x_t - \eta\,\nabla f_{n_t}(x_t)

x_{t+1} = x_t - \eta\sum_{n=1}^N \nabla f_n(x_t)

x_{t+1} = x_t - \Big(\sum_{n=1}^N \nabla^2 f_n(x_t)\Big)^{-1}\sum_{n=1}^N \nabla f_n(x_t)

From top to bottom: online gradient descent, gradient descent, Newton's method (η > 0 is the step size; n_t is the example used at step t).

In terms of scalability we find online gradient descent > gradient descent >> Newton's method: one update touches a single example, all N examples, or all N examples plus a D × D linear solve, respectively.
(Nevertheless, this is a superficial, per-iteration comparison; it says nothing about how many iterations each method needs.)
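To make the comparison concrete, here is a minimal sketch of the three updates for the least-squares objective above; the step sizes and the cycling rule for choosing n_t are my own illustrative choices:

import numpy as np

def online_gd_step(w, x_n, y_n, step):
    """Online (stochastic) gradient step on a single example: O(D) work."""
    return w - step * (-2.0 * (y_n - x_n @ w) * x_n)

def gd_step(w, X, y, step):
    """Full gradient step over all N examples: O(N * D) work."""
    return w - step * (-2.0 * X.T @ (y - X @ w))

def newton_step(w, X, y):
    """Newton step: builds and solves a D x D system, O(N * D^2 + D^3) work."""
    grad = -2.0 * X.T @ (y - X @ w)
    hessian = 2.0 * X.T @ X
    return w - np.linalg.solve(hessian, grad)

# Toy run: cycle through the examples for the online updates.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)
w = np.zeros(10)
for t in range(200):
    w = online_gd_step(w, X[t % 200], y[t % 200], step=1e-2)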
