EN.553.662: Optimization for Data Science
Previous offerings: Fall 2024 · Fall 2023
Spring 2026
Instructor: Nicolas Loizou
Lecture times: Monday, Wednesday 12:00 pm – 01:15 pm (01/20 – 04/27)
Location: Homewood Campus, Bloomberg 274
Syllabus (PDF): Download
Teaching Assistants & Office Hours
- Ben Weinberg (bweinbe5@jhu.edu) — Tue 3:00 pm – 5:00 pm
- Yuxin Ma (yma93@jh.edu) — Tue 10:30 am – 12:30 pm
- Qi Li (qli112@jh.edu) — Thu 3:00 pm – 5:00 pm
Optimization formulations and algorithms play a central role in data analysis and machine learning. In the era of big data, the need to solve large-scale optimization problems is ubiquitous across industry and science. This course is a mathematically rigorous introduction to large-scale optimization for data science and machine learning, emphasizing algorithms, convergence/complexity guarantees, and practical implementation. Applications include text analysis, page ranking, speech recognition, image classification, finance, and decision sciences.
Prerequisites / Corequisites
- Linear algebra (vector spaces, quadratic forms, inner products, norms, etc.)
- Multivariate calculus (gradient, Hessian, Taylor approximation, chain rule, etc.)
- Probability theory (expectation, law of large numbers, tower property, etc.)
- Numerical programming (MATLAB / Python / Julia or equivalent)
Some degree of mathematical maturity (ability to read/write proofs) is required. Background in optimization theory is highly recommended.
Course Materials
- Detailed slides (posted before each lecture)
- Primary resources:
  - Optimization for Data Analysis (Ch. 1–5), Stephen J. Wright & Benjamin Recht. Publisher page (not open access): Cambridge Core. Free alternative (author-posted notes): Wright, Optimization Algorithms for Data Analysis (PDF).
  - First-order and Stochastic Optimization Methods for Machine Learning (Ch. 1–6), Guanghui Lan. Open-access draft (author): Lectures on Optimization Methods for Machine Learning (PDF). Publisher page: SpringerLink.
- Further useful references:
  - Convex Optimization: Algorithms and Complexity, Sébastien Bubeck. Open access: PDF (author site).
  - First-Order Methods in Optimization, Amir Beck. Publisher page (not open access): SIAM.
  - Understanding Machine Learning: From Theory to Algorithms, Shai Shalev-Shwartz & Shai Ben-David. Open access (authors): PDF.
  - Advances and Open Problems in Federated Learning (Ch. 1–3), Peter Kairouz et al. Open access: arXiv PDF.
Learning Objectives
- Analyze scalable nonlinear optimization algorithms used in ML and data science.
- Execute high-performance computing (HPC) implementations of optimization algorithms in selected applications.
- Select algorithms based on problem structure (convex/nonconvex, smooth/nonsmooth, data regime, memory/distributed constraints).
Grading
- Assignments: 50%
- Exams: 50%
Stochastic Optimization
- Foundations of Large-Scale Optimization: core setup and tools; deterministic gradient descent as a baseline (see the sketch after this list).
  - Stochastic optimization for machine learning
  - Basic tools from convex analysis, optimization, and probability
  - Deterministic gradient descent
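For orientation, here is a minimal sketch of deterministic gradient descent on a least-squares objective. The objective, the data (`A`, `b`), and the constant step size `1/L` (with `L` the smoothness constant) are illustrative choices, not part of the course materials.

```python
import numpy as np

def gradient_descent(A, b, num_iters=100):
    """Minimize f(x) = 0.5 * ||A x - b||^2 with the constant step size 1/L."""
    L = np.linalg.norm(A, 2) ** 2            # smoothness constant: largest eigenvalue of A^T A
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)             # exact (full) gradient of the least-squares objective
        x -= grad / L                        # step size 1/L is the classical safe choice for smooth convex f
    return x

# Small illustrative run: the gradient norm should shrink toward zero.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
print(np.linalg.norm(A.T @ (A @ gradient_descent(A, b) - b)))
```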
- Stochastic Gradient Descent & Variants: step-size policies and guarantees across regimes (see the sketch after this list).
  - Assumptions
  - Constant step size: strongly convex, convex, and nonconvex guarantees
  - Decreasing step size: strongly convex, convex, and nonconvex guarantees
  - Mini-batches; sampling strategies (uniform vs. importance sampling)
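A minimal SGD sketch matching the items above, assuming a finite-sum least-squares objective; the constant vs. decreasing step-size switch, the uniform mini-batch sampling, and the specific parameter values are illustrative.

```python
import numpy as np

def sgd(A, b, num_epochs=50, batch_size=1, step="decreasing", gamma0=0.1, rng=None):
    """SGD on f(x) = (1/2n) ||A x - b||^2 using uniformly sampled mini-batches of rows."""
    rng = rng or np.random.default_rng(0)
    n, d = A.shape
    x, t = np.zeros(d), 0
    for _ in range(num_epochs):
        for _ in range(n // batch_size):
            idx = rng.integers(0, n, size=batch_size)              # uniform sampling (with replacement)
            g = A[idx].T @ (A[idx] @ x - b[idx]) / batch_size      # unbiased stochastic gradient estimate
            gamma = gamma0 if step == "constant" else gamma0 / np.sqrt(t + 1)  # O(1/sqrt(t)) decay
            x -= gamma * g
            t += 1
    return x
```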
- Variance Reduction Methods: faster convergence via control variates (see the sketch after this list).
  - SVRG, L-SVRG, SAGA
  - Unified analysis of stochastic gradient methods
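A sketch of SVRG, one of the variance-reduction methods named above, on the same finite-sum least-squares setup; the inner-loop length, step size, and last-iterate snapshot rule are one common set of illustrative choices.

```python
import numpy as np

def svrg(A, b, gamma=0.01, num_outer=20, rng=None):
    """SVRG for f(x) = (1/n) sum_i f_i(x) with f_i(x) = 0.5 * (a_i^T x - b_i)^2."""
    rng = rng or np.random.default_rng(0)
    n, d = A.shape
    w = np.zeros(d)                                    # snapshot (reference) point
    for _ in range(num_outer):
        full_grad = A.T @ (A @ w - b) / n              # full gradient at the snapshot
        x = w.copy()
        for _ in range(n):                             # inner loop of length n
            i = rng.integers(n)
            gi_x = A[i] * (A[i] @ x - b[i])            # component gradient at the current iterate
            gi_w = A[i] * (A[i] @ w - b[i])            # same component at the snapshot
            x -= gamma * (gi_x - gi_w + full_grad)     # control-variate (variance-reduced) step
        w = x                                          # last-iterate snapshot update
    return w
```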
- Linear System Solvers: randomized solvers and links to SGD (see the sketch after this list).
  - Sketch-and-project methods
  - Randomized Kaczmarz; randomized coordinate descent (Gauss–Seidel)
  - Stochastic reformulations and connections to SGD
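A sketch of randomized Kaczmarz, the simplest instance of the sketch-and-project family above: each step projects the iterate onto the hyperplane defined by one sampled row. The squared-row-norm sampling is one standard choice, not the only one covered.

```python
import numpy as np

def randomized_kaczmarz(A, b, num_iters=1000, rng=None):
    """Solve a consistent system A x = b by projecting onto one row's hyperplane per iteration."""
    rng = rng or np.random.default_rng(0)
    row_norms_sq = np.einsum("ij,ij->i", A, A)
    probs = row_norms_sq / row_norms_sq.sum()            # sample rows proportional to ||a_i||^2
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        i = rng.choice(A.shape[0], p=probs)
        x += (b[i] - A[i] @ x) / row_norms_sq[i] * A[i]  # projection onto {z : a_i^T z = b_i}
    return x
```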
- Acceleration with Momentum: heavy-ball and Nesterov methods; stochastic momentum variants (see the sketch after this list).
  - Heavy-ball (Polyak) momentum
  - Nesterov acceleration
  - Momentum variants in stochastic settings
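Minimal sketches of the two momentum templates above, written for a generic gradient oracle `grad`; the constant momentum parameter `beta` (rather than a scheduled sequence for Nesterov) is an illustrative simplification.

```python
import numpy as np

def heavy_ball(grad, x0, gamma, beta, num_iters=200):
    """Polyak heavy ball: x_{k+1} = x_k - gamma * grad(x_k) + beta * (x_k - x_{k-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(num_iters):
        x_prev, x = x, x - gamma * grad(x) + beta * (x - x_prev)
    return x

def nesterov(grad, x0, gamma, beta, num_iters=200):
    """Nesterov acceleration: the gradient is evaluated at the extrapolated point y_k."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(num_iters):
        y = x + beta * (x - x_prev)            # look-ahead (extrapolation) step
        x_prev, x = x, y - gamma * grad(y)
    return x
```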
- Adaptive / Parameter-Free Methods: adaptive scaling and parameter-free step sizes (see the sketch after this list).
  - AdaGrad, RMSProp, Adam
  - Stochastic Polyak step-size
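Sketches of one adaptive method (AdaGrad) and of the stochastic Polyak step size listed above; the value of `fi_star` (often taken as 0 for interpolating models) and the constant `c` are illustrative assumptions.

```python
import numpy as np

def adagrad(grad, x0, gamma=0.1, eps=1e-8, num_iters=500):
    """AdaGrad: per-coordinate step sizes scaled by the accumulated squared gradients."""
    x, G = x0.copy(), np.zeros_like(x0, dtype=float)
    for _ in range(num_iters):
        g = grad(x)
        G += g ** 2                           # running sum of squared gradients
        x -= gamma * g / (np.sqrt(G) + eps)   # coordinates with large past gradients take smaller steps
    return x

def stochastic_polyak_step(fi_val, gi, fi_star=0.0, c=1.0):
    """Stochastic Polyak step size: gamma_k = (f_i(x_k) - f_i^*) / (c * ||grad f_i(x_k)||^2)."""
    return (fi_val - fi_star) / (c * np.dot(gi, gi) + 1e-12)
```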
Variational Inequalities & Min-Max Optimization
- Background: formulations, key assumptions, and connections to ML (see the formulation sketched after this list).
  - Formulation and connections to optimization
  - Variational inequalities, min-max optimization, smooth games, saddle points
  - Applications in ML; key assumptions
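For concreteness, one standard way to link the min-max and variational-inequality viewpoints in the smooth, unconstrained case (a sketch, not the course's official notation):

```latex
% Min-max problem and its associated VI operator (smooth, unconstrained case)
\min_{x \in \mathbb{R}^{d_1}} \ \max_{y \in \mathbb{R}^{d_2}} \ f(x, y),
\qquad
F(z) = \begin{pmatrix} \nabla_x f(x, y) \\ -\nabla_y f(x, y) \end{pmatrix},
\quad z = (x, y).
% A point z^* with F(z^*) = 0 is a stationary point of the game;
% when f is convex in x and concave in y, it is a saddle point of the min-max problem.
```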
- Popular Algorithms: gradient descent–ascent, extragradient, and game-optimization methods (see the sketch after this list).
  - Gradient descent–ascent (and stochastic variants)
  - Extragradient methods (and stochastic variants)
  - Hamiltonian gradient methods and consensus optimization
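A sketch contrasting the first two algorithms above on the classical bilinear game f(x, y) = xᵀA y; the matrix, step size, and iteration count are illustrative. Plain simultaneous gradient descent–ascent spirals outward on this problem, while extragradient contracts toward the solution.

```python
import numpy as np

def gda_step(x, y, A, gamma):
    """Simultaneous gradient descent-ascent on f(x, y) = x^T A y."""
    return x - gamma * (A @ y), y + gamma * (A.T @ x)

def extragradient_step(x, y, A, gamma):
    """Extragradient: exploratory half step, then an update using the gradient at that point."""
    x_half = x - gamma * (A @ y)
    y_half = y + gamma * (A.T @ x)
    return x - gamma * (A @ y_half), y + gamma * (A.T @ x_half)

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
x, y = np.ones(5), np.ones(5)
xg, yg = np.ones(5), np.ones(5)
for _ in range(200):
    x, y = extragradient_step(x, y, A, 0.1)
    xg, yg = gda_step(xg, yg, A, 0.1)
print(np.linalg.norm(np.concatenate([x, y])),    # extragradient: distance to (0, 0) shrinks
      np.linalg.norm(np.concatenate([xg, yg])))  # GDA: distance grows on this bilinear game
```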
Distributed / Decentralized / Federated Optimization
- Main Algorithms: why distribution matters; core methods and settings.
  - Motivation & applications
  - Distributed & decentralized methods
- Rules for Improved Communication Complexity: communication efficiency via compression and locality (see the sketch after this list).
  - Compressed operators
  - Local updates
  - Network topology
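A FedAvg-style sketch combining two of the rules above (local updates and a compressed communication step via top-k sparsification). The client objective, the aggregation rule, and all parameter values are illustrative assumptions, not the specific methods analyzed in class.

```python
import numpy as np

def top_k(v, k):
    """Top-k sparsification: keep only the k largest-magnitude entries (a simple compression operator)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def local_sgd_compressed(client_data, num_rounds=50, local_steps=10, gamma=0.01, k=5, rng=None):
    """Each client runs local SGD on its own least-squares data (A_m, b_m), then sends a
    compressed model update; the server averages the received updates."""
    rng = rng or np.random.default_rng(0)
    d = client_data[0][0].shape[1]
    x = np.zeros(d)                                      # global (server) model
    for _ in range(num_rounds):
        updates = []
        for A, b in client_data:                         # one (A, b) pair per client
            x_local = x.copy()
            for _ in range(local_steps):
                i = rng.integers(A.shape[0])
                x_local -= gamma * A[i] * (A[i] @ x_local - b[i])   # one local SGD step
            updates.append(top_k(x_local - x, k))        # communicate only a compressed update
        x += np.mean(updates, axis=0)                    # server aggregation: simple averaging
    return x
```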
Schedule
The links in the table below will be updated as the semester progresses.
| Lectures | Topic | Materials |
|---|---|---|
| L1–L3 | Foundations of Large-Scale Optimization (Part I): stochastic optimization for ML; convex/probability tools; deterministic gradient descent. | Slides · Notes |
| L4–L7 | Stochastic Gradient Descent & Variants (Part I): assumptions; step-size policies; convex/strongly convex/nonconvex guarantees; mini-batching and sampling. | Slides · Notes |
| L8–L9 | Variance Reduction Methods (Part I): SVRG, L-SVRG, SAGA; unified analysis of stochastic gradient methods. | Slides · Notes |
| L10 | Linear System Solvers (Part I): sketch-and-project methods; stochastic reformulations and connections to SGD. | Slides · Notes |
| L11–L12 | Randomized Kaczmarz & Coordinate Descent (Part I): randomized Kaczmarz; randomized coordinate descent (Gauss–Seidel) for linear systems. | Slides · Notes |
| L13–L14 | Acceleration with Momentum (Part I): heavy-ball (Polyak) momentum; Nesterov acceleration; stochastic momentum variants. | Slides · Notes |
| L15–L16 | Adaptive / Parameter-Free Methods (Part I): AdaGrad, RMSProp, Adam; stochastic Polyak step-size. | Slides · Notes |
| L17–L19 | Variational Inequalities & Min-Max Optimization: Background (Part II): formulation and connections to optimization; smooth games and saddle points; ML applications and assumptions. | Slides · Notes |
| L20–L21 | Min-Max Algorithms (Part II): gradient descent–ascent; extragradient methods; Hamiltonian gradient and consensus optimization (and stochastic variants). | Slides · Notes |
| L22–L23 | Distributed / Decentralized / Federated Optimization (Part III): core algorithms; communication-efficient rules (compression, local updates, topology). | Slides · Notes |