A collection of optimizer-related papers and code.
In the last column, GD stands for gradient descent (first-order) methods, S for second-order (quasi-Newton) methods, E for evolutionary methods, GF for gradient-free methods, VR for variance-reduced methods, and llm for LLM-based optimization. A minimal usage sketch follows this legend, and an illustrative Adam-style implementation sketch follows the table.
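Most GD-type entries ship as drop-in PyTorch optimizers. As a minimal sketch (assuming PyTorch is installed; the model and data below are placeholders, and `AdamW` stands in for any optimizer linked in the Code column), this is how such an optimizer plugs into a standard training loop:

```python
# Minimal sketch: plugging a "GD"-type optimizer into a PyTorch training loop.
# Assumptions: PyTorch is installed; model, data, and hyperparameters are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
loss_fn = nn.MSELoss()

x = torch.randn(32, 10)                   # placeholder batch
y = torch.randn(32, 1)

for step in range(100):
    optimizer.zero_grad()                 # clear gradients from the previous step
    loss = loss_fn(model(x), y)
    loss.backward()                       # backprop: compute gradients
    optimizer.step()                      # apply the optimizer's update rule
```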
| Title | Year | Optimizer | Published | Code | Type |
|---|---|---|---|---|---|
| The AdEMAMix Optimizer: Better, Faster, Older | 2024 | AdEMAMix | arxiv | pytorch | GD |
| FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information | 2024 | FAdam | arxiv | pytorch | GD |
| GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection | 2024 | GaLore | arxiv | pytorch | GD |
| CoRe Optimizer: An All-in-One Solution for Machine Learning | 2023 | CoRe | arxiv | pytorch | GD |
| AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix | 2023 | AGD | arxiv | pytorch | GD,S |
| AdaLomo: Low-memory Optimization with Adaptive Learning Rate | 2023 | AdaLOMO | arxiv | pytorch | GD |
| Large Language Models as Optimizers | 2023 | OPRO | arxiv | python | llm |
| Promoting Exploration in Memory-Augmented Adam using Critical Momenta | 2023 | Adam+CM | arxiv | pytorch | GD |
| CAME: Confidence-guided Adaptive Memory Efficient Optimization | 2023 | CAME | acl'23 | pytorch | GD |
| Full Parameter Fine-tuning for Large Language Models with Limited Resources | 2023 | LOMO | arxiv | pytorch | GD |
| Prodigy: An Expeditiously Adaptive Parameter-Free Learner | 2023 | Prodigy | arxiv | pytorch | GD |
| DoWG Unleashed: An Efficient Universal Parameter-Free Gradient Descent Method | 2023 | DoWG | neurips'23 | | GD |
| Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training | 2023 | Sophia | arxiv | pytorch | GD |
| UAdam: Unified Adam-Type Algorithmic Framework for Non-Convex Stochastic Optimization | 2023 | UAdam | arxiv | | GD |
| Sharpness-Aware Minimization Revisited: Weighted Sharpness as a Regularization Term | 2023 | WSAM | kdd'23 | pytorch | GD |
| DP-Adam: Correcting DP Bias in Adam's Second Moment Estimation | 2023 | DP-Adam | iclr-W'23 | | GD |
| An Adam-enhanced Particle Swarm Optimizer for Latent Factor Analysis | 2023 | ADHPL | arxiv | | E |
| DoG is SGD's Best Friend: A Parameter-Free Dynamic Step Size Schedule | 2023 | DoG | icml'23 | pytorch | GD |
| FOSI: Hybrid First and Second Order Optimization | 2023 | FOSI | HPI'23 | jax | GD,S |
| Symbolic Discovery of Optimization Algorithms | 2023 | Lion | neurips'23 | jax, tf, pytorch | GD |
| Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale | 2022 | Amos | arxiv | jax | GD |
| VeLO: Training Versatile Learned Optimizers by Scaling Up | 2022 | VeLO | arxiv | jax | GD |
| Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method | 2022 | GradaGrad | arxiv | | GD |
| CowClip: Reducing CTR Prediction Model Training Time from 12 hours to 10 minutes on 1 GPU | 2022 | CowClip | aaai'23 | tf | GD |
| Smooth momentum: improving lipschitzness in gradient descent | 2022 | Smooth Momentum | APIN | | GD |
| Towards Better Generalization of Adaptive Gradient Methods | 2020 | SAGD | neurips'20 | | GD |
| An Improved Adaptive Optimization Technique for Image Classification | 2020 | Mean-ADAM | ICIEV | | GD |
| SCW-SGD: Stochastically Confidence-Weighted SGD | 2020 | SCWSGD | ICIP | | GD |
| Slime mould algorithm: A new method for stochastic optimization | 2020 | SMA | FGCS | code | E |
| Ranger-Deep-Learning-Optimizer | 2020 | Ranger | github | pytorch | GD |
| pbSGD: Powered Stochastic Gradient Descent Methods for Accelerated Non-Convex Optimization | 2020 | pbSGD | ijcai'20 | pytorch | GD |
| A Variant of Gradient Descent Algorithm Based on Gradient Averaging | 2020 | Grad-Avg | arxiv | | GD |
| Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style Adaptive Momentum | 2020 | FRSGD | arxiv | | GD |
| CADA: Communication-Adaptive Distributed Adam | 2020 | CADA | arxiv | pytorch, matlab | GD |
| Eigenvalue-corrected Natural Gradient Based on a New Approximation | 2020 | TEKFAC | arxiv | | GD |
| SMG: A Shuffling Gradient-Based Method with Momentum | 2020 | SMG | icml'21 | | GD |
| SALR: Sharpness-aware Learning Rate Scheduler for Improved Generalization | 2020 | SALR | TNNLS | | GD |
| Self-Tuning Stochastic Optimization with Curvature-Aware Gradient Filtering | 2020 | MEKA | neurips-W'21 | | GD |
| Mixing ADAM and SGD: a Combined Optimization Method | 2020 | MAS | arxiv | pytorch | GD |
| EAdam Optimizer: How ε Impact Adam | 2020 | EAdam | arxiv | pytorch | GD |
| Adam+: A Stochastic Method with Adaptive Variance Reduction | 2020 | Adam+ | arxiv | | GD |
| Sharpness-aware Minimization for Efficiently Improving Generalization | 2020 | SAM | iclr'21 | jax | GD |
| Expectigrad: Fast Stochastic Optimization with Robust Convergence Properties | 2020 | Expectigrad | arxiv | tf | GD |
| AEGD: Adaptive Gradient Descent with Energy | 2020 | AEGD | AIMS | pytorch | GD |
| Adam with Bandit Sampling for Deep Learning | 2020 | Adambs | arxiv | | GD |
| AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients | 2020 | AdaBelief | neurips'20 | pytorch | GD |
| Apollo: An Adaptive Parameter-wise Diagonal Quasi-Newton Method for Nonconvex Stochastic Optimization | 2020 | Apollo[W] | arxiv | pytorch | GD,S |
| S-SGD: Symmetrical Stochastic Gradient Descent with Weight Noise Injection for Reaching Flat Minima | 2020 | S-SGD | arxiv | | GD |
| Gravilon: Applications of a New Gradient Descent Method to Machine Learning | 2020 | Gravilon | arxiv | | GD |
| PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization | 2020 | PAGE | icml'21 | | GD |
| Adaptive Gradient Methods for Constrained Convex Optimization and Variational Inequalities | 2020 | Ada{ACSA,AGD+} | aaai'21 | | GD |
| Stochastic Normalized Gradient Descent with Momentum for Large Batch Training | 2020 | SNGM | arxiv | | GD |
| AdaScale SGD: A User-Friendly Algorithm for Distributed Training | 2020 | AdaScale | icml'21 | | GD |
| Momentum-based variance-reduced proximal stochastic gradient method for composite nonconvex stochastic optimization | 2020 | PSTorm | JOTA | | GD |
| MTAdam: Automatic Balancing of Multiple Training Loss Terms | 2020 | MTAdam | acl'21 | pytorch | GD |
| AdaSGD: Bridging the gap between SGD and Adam | 2020 | AdaSGD | arxiv | | GD |
| AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights | 2020 | AdamP | iclr'21 | pytorch | GD |
| Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes | 2020 | LANS | arxiv | pytorch | GD |
| AdaSwarm: Augmenting Gradient-Based optimizers in Deep Learning with Swarm Intelligence | 2020 | AdaSwarm | TETC | pytorch | E |
| Enhance Curvature Information by Structured Stochastic Quasi-Newton Methods | 2020 | SKQN,S4QN | cvpr'21 | | GD |
| Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite Epochs | 2020 | SHAdaGrad | arxiv | | GD |
| A New Accelerated Stochastic Gradient Method with Momentum | 2020 | SGDM | arxiv | | GD |
| Practical Quasi-Newton Methods for Training Deep Neural Networks | 2020 | K-BFGS[(L)] | neurips'20 | pytorch | GD |
| AdaS: Adaptive Scheduling of Stochastic Gradients | 2020 | AdaS | cvpr'22 | pytorch | GD |
| Adai: Separating the Effects of Adaptive Learning Rate and Momentum Inertia | 2020 | Adai | icml'22 | pytorch | GD |
| ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning | 2020 | ADAHESSIAN | aaai'21 | pytorch | GD |
| Momentum with Variance Reduction for Nonconvex Composition Optimization | 2020 | MVRC-[1,2] | arxiv | | GD |
| CoolMomentum: A Method for Stochastic Optimization by Langevin Dynamics with Simulated Annealing | 2020 | CoolMomentum | arxiv | tf, pytorch | GD |
| Gradient Centralization: A New Optimization Technique for Deep Neural Networks | 2020 | GC | eccv'20 | pytorch, tf | GD |
| AdaX: Adaptive Gradient Descent with Exponential Long Term Memory | 2020 | AdaX[-W] | arxiv | pytorch | GD |
| Weak and Strong Gradient Directions: Explaining Memorization, Generalization, and Hardness of Examples at Scale | 2020 | RM3 | arxiv | tf | GD |
| TAdam: A Robust Stochastic Gradient Optimizer | 2020 | TAdam | arxiv | pytorch | GD |
| Iterative Averaging in the Quest for Best Test Error | 2020 | Gadam | arxiv | | GD |
| On the distance between two neural networks and the stability of learning | 2020 | Fromage | neurips'20 | pytorch | GD |
| Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent | 2020 | SRSGD | arxiv | pytorch | GD |
| Stochastic Runge-Kutta methods and adaptive SGD-G2 stochastic gradient descent | 2020 | SGD-G2 | arxiv | | GD |
| LaProp: Separating Momentum and Adaptivity in Adam | 2020 | LaProp | arxiv | pytorch | GD |
| Compositional ADAM: An Adaptive Compositional Solver | 2020 | C-ADAM | arxiv | | GD |
| Biased Stochastic Gradient Descent for Conditional Stochastic Optimization | 2020 | BSGD | arxiv | | GD |
| On the Trend-corrected Variant of Adaptive Stochastic Optimization Methods | 2020 | AdamT | ijcnn'20 | pytorch | GD |
| Efficient Learning Rate Adaptation for Convolutional Neural Network Training | 2019 | e-AdLR | ijcnn'19 | | GD |
| ProxSGD: Training Structured Neural Networks under Regularization and Constraints | 2019 | ProxSGD | iclr'20 | tf | GD |
| An Adaptive Optimization Algorithm Based on Hybrid Power and Multidimensional Update Strategy | 2019 | AdaHMG | ieee | | GD |
| signSGD via Zeroth-Order Oracle | 2019 | ZO-signSGD | iclr'19 | | GF |
| Fast DENSER: Efficient Deep NeuroEvolution | 2019 | F-DENSER | arxiv | tf | E |
| Adathm: Adaptive Gradient Method Based on Estimates of Third-Order Moments | 2019 | Adathm | DSC | | GD |
| A new perspective in understanding of Adam-Type algorithms and beyond | 2019 | AdamAL | arxiv | pytorch | GD |
| CProp: Adaptive Learning Rate Scaling from Past Gradient Conformity | 2019 | CProp | arxiv | pytorch | GD |
| Domain-independent Dominance of Adaptive Methods | 2019 | AvaGrad, Delayed Adam | cvpr'21 | pytorch | GD |
| Second-order Information in First-order Optimization Methods | 2019 | AdaSqrt | arxiv | tf | GD |
| Does Adam optimizer keep close to the optimal point? | 2019 | AdaFix | arxiv | | GD |
| Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates | 2019 | AdaAlter | arxiv | mxnet | GD |
| UniXGrad: A Universal, Adaptive Algorithm with Optimal Guarantees for Constrained Optimization | 2019 | UniXGrad | neurips'19 | | GD |
| Demon: Improved Neural Network Training with Momentum Decay | 2019 | Demon {SGDM,Adam} | icassp'22 | tf | GD |
| ZO-AdaMM: Zeroth-Order Adaptive Momentum Method for Black-Box Optimization | 2019 | ZO-AdaMM | neurips'19 | tf | GF |
| On Empirical Comparisons of Optimizers for Deep Learning | 2019 | RMSterov | arxiv | | GD |
| An Adaptive and Momental Bound Method for Stochastic Learning | 2019 | AdaMod | arxiv | pytorch | GD |
| On Higher-order Moments in Adam | 2019 | HAdam | arxiv | | GD |
| diffGrad: An Optimization Method for Convolutional Neural Networks | 2019 | diffGrad | TNNLS | pytorch | GD |
| Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM | 2019 | SAMSGrad | arxiv | pytorch | GD |
| On the Variance of the Adaptive Learning Rate and Beyond | 2019 | RAdam | iclr'20 | pytorch, tf | GD |
| BGADAM: Boosting based Genetic-Evolutionary ADAM for Neural Network Optimization | 2019 | BGADAM | arxiv | | GD |
| Adaloss: Adaptive Loss Function for Landmark Localization | 2019 | Adaloss | arxiv | | GD |
| signADAM: Learning Confidences for Deep Neural Networks | 2019 | signADAM[++] | icdmw'19 | pytorch | GD |
| The Role of Memory in Stochastic Optimization | 2019 | PolyAdam | UAI'20 | | GD |
| Lookahead Optimizer: k steps forward, 1 step back | 2019 | Lookahead | neurips'19 | tf, pytorch | GD |
| Momentum-Based Variance Reduction in Non-Convex SGD | 2019 | STORM | neurips'19 | pytorch | GD |
| SAdam: A Variant of Adam for Strongly Convex Functions | 2019 | SAdam | iclr'20 | code | GD |
| Matrix-Free Preconditioning in Online Learning | 2019 | RecursiveOptimizer | icml'19 | tf | GD |
| PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization | 2019 | PowerSGD[M] | neurips'19 | pytorch | GD |
| Fast-DENSER++: Evolving Fully-Trained Deep Artificial Neural Networks | 2019 | F-DENSER++ | arxiv | tf | E |
| Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks | 2019 | Novograd | neurips'19 | pytorch | GD |
| An Adaptive Remote Stochastic Gradient Method for Training Neural Networks | 2019 | NAMS{G,B},ARSG | arxiv | pytorch,mxnet | GD |
| Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates | 2019 | ArmijoLS | neurips'19 | pytorch | GD |
| Large Batch Optimization for Deep Learning: Training BERT in 76 minutes | 2019 | LAMB | iclr'20 | tf,pytorch | GD |
| On the Convergence Proof of AMSGrad and a New Version | 2019 | AdamX | arxiv | | GD |
| An Optimistic Acceleration of AMSGrad for Nonconvex Optimization | 2019 | OPT-AMSGrad | acml'21 | | GD |
| Parabolic Approximation Line Search for DNNs | 2019 | PAL | neurips'20 | pytorch | GD |
| Gradient-only line searches: An Alternative to Probabilistic Line Searches | 2019 | GOLS-I | arxiv | | GD |
| Adaptive Gradient Methods with Dynamic Bound of Learning Rate | 2019 | AdaBound | iclr'19 | pytorch | GD |
| Memory-Efficient Adaptive Optimization | 2019 | SM3 | neurips'19 | tf | GD |
| DADAM: A Consensus-based Distributed Adaptive Gradient Method for Online Optimization | 2019 | DADAM | arxiv | matlab | GD |
| On the Convergence of AdaGrad with Momentum for Training Deep Neural Networks | 2018 | Ada{NAG,HB} | arxiv | | GD |
| SADAGRAD: Strongly Adaptive Stochastic Gradient Methods | 2018 | SADAGRAD | icml'18 | | GD |
| PSA-CMA-ES: CMA-ES with population size adaptation | 2018 | PSA-CMA-ES | gecco'18 | | E |
| Adaptive Methods for Nonconvex Optimization | 2018 | Yogi | neurips'18 | tf | GD |
| Deep Frank-Wolfe For Neural Network Optimization | 2018 | DFW | iclr'19 | pytorch | GD |
| HyperAdam: A Learnable Task-Adaptive Adam for Network Training | 2018 | HyperAdam | aaai'19 | tf, pytorch | GD |
| Practical Bayesian Learning of Neural Networks via Adaptive Optimisation Methods | 2018 | BADAM | icml'20 | tf | GD |
| Kalman Gradient Descent: Adaptive Variance Reduction in Stochastic Optimization | 2018 | KGD | arxiv | tf | GD |
| Quasi-hyperbolic momentum and Adam for deep learning | 2018 | QHM,QHAdam | iclr'19 | pytorch, tf | GD |
| AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods | 2018 | AdaShift | iclr'19 | pytorch | GD |
| Optimal Adaptive and Accelerated Stochastic Gradient Descent | 2018 | A2Grad{Exp,Inc,Uni} | arxiv | pytorch | GD |
| Accelerating SGD with momentum for over-parameterized learning | 2018 | MaSS | arxiv | tf | GD |
| Online Adaptive Methods, Universality and Acceleration | 2018 | AcceleGrad | neurips'18 | | GD |
| On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization | 2018 | AdaFom | iclr'19 | | GD |
| AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes | 2018 | AdaGrad-Norm | icml'19 | pytorch | GD |
| Fast and Scalable Bayesian Deep Learning by Weight-Perturbation in Adam | 2018 | VAdam | icml'18 | pytorch, tf | GD |
| Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks | 2018 | Padam | ijcai'20 | pytorch | GD |
| Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis | 2018 | EKFAC | neurips'18 | pytorch | GD |
| Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods | 2018 | AdaBayes[FP] | neurips'18 | pytorch | GD |
| Nostalgic Adam: Weighting more of the past gradients when designing the adaptive learning rate | 2018 | NosAdam | ijcai'19 | pytorch | GD |
| Small steps and giant leaps: Minimal Newton solvers for Deep Learning | 2018 | Curveball | iccv'19 | matlab | GD |
| GADAM: Genetic-Evolutionary ADAM for Deep Neural Network Optimization | 2018 | GADAM | arxiv | | GD |
| Adafactor: Adaptive Learning Rates with Sublinear Memory Cost | 2018 | Adafactor | icml'18 | pytorch | GD |
| Aggregated Momentum: Stability Through Passive Damping | 2018 | AggMo | iclr'19 | pytorch, tf | GD |
| Katyusha X: Practical Momentum Method for Stochastic Sum-of-Nonconvex Optimization | 2018 | Katyusha X | icml'18 | | VR |
| WNGrad: Learn the Learning Rate in Gradient Descent | 2018 | WNGrad | arxiv | C++ | GD |
| VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning | 2018 | VR-SGD | TKDE | C++ | GD |
| signSGD: Compressed Optimisation for Non-Convex Problems | 2018 | signSGD | icml'18 | mxnet | GD |
| Shampoo: Preconditioned Stochastic Tensor Optimization | 2018 | Shampoo | icml'18 | tf | GD |
| L4: Practical loss-based stepsize adaptation for deep learning | 2018 | L4{Adam,Momentum} | neurips'18 | pytorch, tf | GD |
| On the Convergence of Adam and Beyond | 2018 | AMSGrad, AdamNC | iclr'18 | pytorch | GD |
| SW-SGD: The Sliding Window Stochastic Gradient Descent Algorithm | 2017 | SW-SGD | PCS | | GD |
| Improving Generalization Performance by Switching from Adam to SGD | 2017 | SWATS | iclr'18 | pytorch | GD |
| Noisy Natural Gradient as Variational Inference | 2017 | Noisy {Adam,K-FAC} | icml'18 | tf | GD |
| AdaComp : Adaptive Residual Gradient Compression for Data-Parallel Distributed Training | 2017 | AdaComp | aaai'18 | | GD |
| AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks | 2017 | AdaBatch | iclr-W'18 | pytorch | GD |
| First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time | 2017 | NEON | neurips'18 | | GD |
| BPGrad: Towards Global Optimality in Deep Learning via Branch and Pruning | 2017 | BPGrad | cvpr'18 | matlab | GD |
| Decoupled Weight Decay Regularization | 2017 | AdamW,SGDW | iclr'19 | lua | GD |
| Evolving Deep Convolutional Neural Networks for Image Classification | 2017 | EvoCNN | ITEC | python | E |
| Normalized Direction-preserving Adam | 2017 | ND-Adam | arxiv | pytorch, tf | GD |
| Regularizing and Optimizing LSTM Language Models | 2017 | NT-ASGD | iclr'18 | pytorch | GD |
| Natasha 2: Faster Non-Convex Optimization Than SGD | 2017 | Natasha{1.5,2} | neurips'18 | | GD |
| Large Batch Training of Convolutional Networks | 2017 | LARS | arxiv | pytorch | GD |
| Practical Gauss-Newton Optimisation for Deep Learning | 2017 | KFRA, KFLR | icml'17 | | GD |
| YellowFin and the Art of Momentum Tuning | 2017 | YellowFin | arxiv | tf | GD |
| Variants of RMSProp and Adagrad with Logarithmic Regret Bounds | 2017 | SC-{Adagrad,RMSProp} | icml'17 | pytorch | GD |
| Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients | 2017 | M-SVAG | icml'18 | tf | GD |
| Training Deep Networks without Learning Rates Through Coin Betting | 2017 | COCOB | neurips'17 | tf | GD |
| Sub-sampled Cubic Regularization for Non-convex Optimization | 2017 | SCR | icml'17 | numpy | S |
| Online Convex Optimization with Unconstrained Domains and Losses | 2017 | RescaledExp | neurips'16 | | GD |
| Evolving Deep Neural Networks | 2017 | CoDeepNEAT | arxiv | tf | E |
| SARAH: A Novel Method for Machine Learning Problems Using Stochastic Recursive Gradient | 2017 | SARAH | icml'17 | | VR |
| IQN: An Incremental Quasi-Newton Method with Local Superlinear Convergence Rate | 2017 | IQN | icassp'17 | C++ | GD,S |
| NMODE --- Neuro-MODule Evolution | 2017 | NMODE | arxiv | C++ | E |
| The Whale Optimization Algorithm | 2016 | WOA | AES | numpy | E |
| Incorporating Nesterov Momentum into Adam | 2016 | Nadam | arxiv | pytorch | GD |
| Eve: A Gradient Based Optimization Method with Locally and Globally Adaptive Learning Rates | 2016 | Eve | arxiv | pytorch | GD |
| Direct Feedback Alignment Provides Learning in Deep Neural Networks | 2016 | DFA | neurips'16 | numpy | GD |
| SGDR: Stochastic Gradient Descent with Warm Restarts | 2016 | SGDR | iclr'17 | theano | GD |
| Stochastic Quasi-Newton Methods for Nonconvex Stochastic Optimization | 2016 | Damp-oBFGS-Inf | SIAM | pytorch | GD,S |
| A Comprehensive Linear Speedup Analysis for Asynchronous Stochastic Parallel Optimization from Zeroth-Order to First-Order | 2016 | ZO-SCD | neurips'16 | | GF |
| Barzilai-Borwein Step Size for Stochastic Gradient Descent | 2016 | {SGD,SVRG}-BB | neurips'16 | numpy | GD |
| Adaptive Learning Rate via Covariance Matrix Based Preconditioning for Deep Neural Networks | 2016 | SDProp | ijcai'17 | | GD |
| Katyusha: The First Direct Acceleration of Stochastic Gradient Methods | 2016 | Katyusha | stoc'17 | | VR |
| Accelerating SVRG via second-order information | 2015 | SVRG+{I,II} | arxiv | | GD,S |
| adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs | 2015 | adaQN | ecml'16 | numpy | GD,S |
| A Linearly-Convergent Stochastic L-BFGS Algorithm | 2015 | SVRG-SQN | aistats | julia | GD,S |
| Optimizing Neural Networks with Kronecker-factored Approximate Curvature | 2015 | K-FAC | icml'15 | tf | GD |
| Probabilistic Line Searches for Stochastic Optimization | 2015 | ProbLS | JMLR | | GD |
| Scale-Free Algorithms for Online Linear Optimization | 2015 | AdaFTRL | alt'15 | | GD |
| Adam: A Method for Stochastic Optimization | 2014 | Adam, AdaMax | iclr'15 | pytorch | GD |
| Random feedback weights support learning in deep neural networks | 2014 | FA | arxiv | pytorch | GD |
| A Computationally Efficient Limited Memory CMA-ES for Large Scale Optimization | 2014 | LM-CMA-ES | gecco'14 | | E |
| A Proximal Stochastic Gradient Method with Progressive Variance Reduction | 2014 | Prox-SVRG | SIAM | tf, numpy | VR |
| RES: Regularized Stochastic BFGS Algorithm | 2014 | Reg-oBFGS-Inf | arxiv | | GD,S |
| A Stochastic Quasi-Newton Method for Large-Scale Optimization | 2014 | SQN | SIAM | matlab | GD,S |
| SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives | 2014 | SAGA | neurips'14 | numpy | VR |
| Accelerating stochastic gradient descent using predictive variance reduction | 2013 | SVRG | neurips'13 | pytorch | VR |
| Ad Click Prediction: a View from the Trenches | 2013 | FTRL | kdd'13 | pytorch | GD |
| Semi-Stochastic Gradient Descent Methods | 2013 | S2GD | arxiv | | VR |
| Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming | 2013 | ZO-SGD | SIAM | | GF |
| Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization | 2013 | ZO-{ProxSGD,PSGD} | arxiv | | GF |
| Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients | 2013 | vSGD-fd | arxiv | | GD |
| Neural Networks for Machine Learning | 2012 | RMSProp | coursera | tf | GD |
| An Enhanced Hypercube-Based Encoding for Evolving the Placement, Density, and Connectivity of Neurons | 2012 | ES-HyperNEAT | AL | go | E |
| CMA-TWEANN: efficient optimization of neural networks via self-adaptation and seamless augmentation | 2012 | CMA-TWEANN | gecco'12 | | E |
| ADADELTA: An Adaptive Learning Rate Method | 2012 | ADADELTA | arxiv | pytorch | GD |
| No More Pesky Learning Rates | 2012 | vSGD-{b,g,l} | icml'13 | lua | VR |
| A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets | 2012 | SAG | neurips'12 | | VR |
| CMA-ES: evolution strategies and covariance matrix adaptation | 2011 | CMA-ES | gecco'12 | tf | E |
| Adaptive Subgradient Methods for Online Learning and Stochastic Optimization | 2011 | AdaGrad | JMLR | pytorch,C++ | GD |
| AdaDiff: Adaptive Gradient Descent with the Differential of Gradient | 2010 | AdaDiff | iopscience | | GD |
| A Hypercube-Based Encoding for Evolving Large-Scale Neural Networks | 2009 | HyperNEAT | AL | | E |
| Scalable training of L1-regularized log-linear models | 2007 | OWL-QN | icml'07 | javascript | GD,S |
| A Stochastic Quasi-Newton Method for Online Convex Optimization | 2007 | O-LBFGS | icml'07 | | GD,S |
| Online convex programming and generalized infinitesimal gradient ascent | 2003 | OGD | icml'03 | | GD |
| A Limited Memory Algorithm for Bound Constrained Optimization | 2003 | L-BFGS-B | SIAM | fortran, matlab | GD,S |
| Evolving Neural Networks through Augmenting Topologies | 2002 | NEAT | EC | numpy | E |
| Trust region methods | 2000 | Sub-sampled TR | SIAM | | S |
| Particle swarm optimization | 1995 | PSO | icnn'95 | | E |
| A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm | 1993 | RPROP | icnn'93 | pytorch | GD |
| Acceleration of Stochastic Approximation by Averaging | 1992 | ASGD | SIAM | pytorch | GD |
| On the limited memory BFGS method for large scale optimization | 1989 | L-BFGS | MP | | GD,S |
| Large-scale linearly constrained optimization | 1978 | MINOS | MP | pytorch | GD,S |
| Some methods of speeding up the convergence of iteration methods | 1964 | Polyak (momentum) | paper | | GD |
| A Stochastic Approximation Method | 1951 | SGD | paper | pytorch | GD |
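For readers new to the GD column, the sketch below re-implements the core Adam update (Kingma & Ba, 2014; see the Adam row above) as a `torch.optim.Optimizer` subclass. `SimpleAdam` is a name chosen here for illustration; it omits weight decay, AMSGrad, and other refinements, so treat it as a reading aid rather than a replacement for `torch.optim.Adam` or the implementations linked in the table.

```python
# Illustrative sketch of the Adam update rule as a custom torch.optim.Optimizer.
# Assumption: this simplified version only mirrors the update from the 2014 paper
# (first/second moment estimates with bias correction); it is not a drop-in
# replacement for torch.optim.Adam.
import torch
from torch.optim import Optimizer


class SimpleAdam(Optimizer):
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            beta1, beta2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:                        # lazy state initialization
                    state["step"] = 0
                    state["m"] = torch.zeros_like(p)  # first-moment estimate
                    state["v"] = torch.zeros_like(p)  # second-moment estimate
                state["step"] += 1
                m, v, t = state["m"], state["v"], state["step"]
                g = p.grad
                m.mul_(beta1).add_(g, alpha=1 - beta1)         # m_t = b1*m + (1-b1)*g
                v.mul_(beta2).addcmul_(g, g, value=1 - beta2)  # v_t = b2*v + (1-b2)*g^2
                m_hat = m / (1 - beta1 ** t)                   # bias-corrected moments
                v_hat = v / (1 - beta2 ** t)
                # theta <- theta - lr * m_hat / (sqrt(v_hat) + eps)
                p.addcdiv_(m_hat, v_hat.sqrt().add_(group["eps"]), value=-group["lr"])


# usage: optimizer = SimpleAdam(model.parameters(), lr=1e-3)
```

Roughly speaking, many of the Adam variants in the table tweak exactly these few lines: how the second moment `v` is formed, how the resulting step size is bounded, or how bias correction is applied.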