Optimization problems arising in intelligent systems are similar to those studied in other fields (such as operations research, control, and computational physics), but they have some prominent features that set them apart, and which are not addressed by classic optimization methods. Numerical optimization is a domain where probabilistic numerical methods offer a particularly interesting theoretical contribution.

One key issue is computational noise. Big-data problems often have the property that computational precision can be traded off against computational cost. One of the most widely occurring problem structures is that one has to find a (local) optimum of a function $L$ that is the sum of many similar terms, each arising from an individual data point $y_i$:

$$L(x) = \frac{1}{N}\sum_{i = 1} ^N \ell(y_i,x) $$

Examples of this problem include the training of neural networks, logistic regressors, and many other linear and nonlinear regression/classification models. If the dataset is very large or even infinite, evaluating the entire sum is impossible, or at least inefficient. Instead, one draws $M\ll N$ (hopefully representative) *samples* $y_j$ from some distribution and computes the approximation

$$\hat{L}(x) = \frac{1}{M} \sum_{j=1} ^M \ell(y_j,x) \approx L(x)$$

If the samples $y_j$ are drawn independently from the same distribution, then, by the central limit theorem, this approximation deviates from the true $L(x)$ by an approximately Gaussian disturbance.
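This can be checked empirically. The sketch below (a toy setting with a synthetic squared loss and made-up constants, purely for illustration) draws many mini-batches and confirms that the mini-batch estimator $\hat{L}(x)$ is unbiased, with fluctuations of size $\mathrm{std}[\ell]/\sqrt{M}$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (assumption): per-datum loss ell(y_i, x) = (y_i - x)**2
# on a large synthetic dataset.
N, M = 100_000, 64           # full dataset size, mini-batch size
x = 0.5                      # current parameter value
y = rng.normal(1.0, 2.0, N)  # synthetic data

L_full = np.mean((y - x) ** 2)  # the exact full-data loss L(x)

# Repeatedly draw i.i.d. mini-batches and compute the estimator L_hat(x).
L_hats = np.array([
    np.mean((rng.choice(y, M) - x) ** 2)
    for _ in range(5_000)
])

# The estimator is unbiased, and its deviations are approximately
# Gaussian with standard deviation std[ell] / sqrt(M) (central limit theorem).
print("bias:     ", np.mean(L_hats) - L_full)
print("empirical:", L_hats.std())
print("predicted:", np.std((y - x) ** 2) / np.sqrt(M))
```

Increasing $M$ shrinks the noise only as $1/\sqrt{M}$, which is why simply taking larger batches is an expensive way to suppress the disturbance.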

Classic optimizers like quasi-Newton methods are unstable under these disturbances; hence the popularity of first-order methods, like stochastic gradient descent (SGD), in deep learning. But even such simple methods become harder to use in the stochastic domain. In particular, SGD and its variants expose free parameters (e.g. step sizes / learning rates) in the stochastic setting, even though such parameters can easily be tuned automatically in the noise-free case. Thus, even at the world's leading AI companies, highly trained engineers spend their work time hand-tuning parameters by repeatedly running the same training routine on high-performance hardware: a very wasteful use of both human and hardware resources. A NeurIPS workshop we organized in 2016 highlighted the urgency of this issue.
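The sensitivity to this free parameter is easy to see in a minimal sketch (the quadratic objective and all constants below are illustrative assumptions, not taken from the text): the same SGD routine, on the same problem, behaves completely differently depending on a hand-chosen learning rate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy objective (assumption): L(x) = E[(y - x)^2], minimized at x = E[y] = 1.
y = rng.normal(1.0, 2.0, 100_000)

def sgd(lr, steps=200, M=32):
    """Plain SGD with mini-batch gradients of the toy objective."""
    x = 0.0
    for _ in range(steps):
        batch = rng.choice(y, M)
        g = -2.0 * np.mean(batch - x)  # stochastic gradient of L_hat at x
        x -= lr * g
    return x

# The same routine: too small a rate barely moves, too large a rate is
# dominated by gradient noise; only a well-tuned rate lands near x = 1.
for lr in (0.001, 0.1, 0.9):
    print(lr, sgd(lr))
```

In the noise-free case a line search would pick the step size automatically; in the stochastic setting it is precisely this tuning that breaks down.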

The probabilistic perspective offers a clean way to capture this issue: it simply amounts to changing the *likelihood* term of the computation from a point measure on $L(x)$ to a *Gaussian* distribution $p(\hat{L}\mid L) = \mathcal{N}(\hat{L};L,\Sigma)$. This seemingly straightforward reformulation immediately offers an analytical avenue to understand why existing optimizers fundamentally require hand-tuning: while a point measure has only a single parameter (the location), a Gaussian has *two* parameters, mean and (co-)variance. The latter does not feature in classic analysis and is simply unknown to the algorithm. It is possible to show [ ] that this lack of information can make certain parameters (such as step sizes) fundamentally unidentified. Identifying them requires not just new algorithms, but also concretely *computing* a new object: in addition to batch gradients, also batch *square* gradients, to empirically estimate the variance. Doing so is not free, but it has low and bounded computational cost [ ], because it can re-use the back-prop pass, the most expensive part of a deep learning training step.
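The extra statistic is cheap to sketch. In the toy snippet below (synthetic per-example gradients stand in for per-example back-prop gradients of a real network), the batch *square* gradients are accumulated alongside the batch gradient, yielding an empirical estimate of the variance of the batch-gradient estimator:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumption: g holds per-example gradients g_j for a batch; in a deep
# net these would come from the (per-example) back-prop pass.
M, D = 64, 3                        # batch size, parameter dimension
g = rng.normal(0.3, 1.5, (M, D))    # synthetic per-example gradients

g_mean = g.mean(axis=0)             # batch gradient (the usual quantity)
g_sq_mean = (g ** 2).mean(axis=0)   # batch *square* gradient (the new one)

# Empirical variance of the batch-gradient estimator:
# Var[g_mean] ~ (E[g^2] - E[g]^2) / M, element-wise.
var_hat = (g_sq_mean - g_mean ** 2) / M

print("batch gradient:     ", g_mean)
print("estimated variance: ", var_hat)
```

Both moments are plain batch means, so they can be accumulated in the same pass over the batch, which is what keeps the overhead low and bounded.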

Over the years we have built a series of tools that use this additional quantity to tune various learning hyperparameters, such as the learning rate [ ] [ ] [ ], batch size [ ] and stopping criterion [ ]. We have also contributed theoretical analysis of some of the most popular deep learning optimizers [ ] and are now working towards a new code platform for the automation of deep learning in the inner loop, to free practitioners' hands to build models rather than tune algorithms.


**Probabilistic Approaches to Stochastic Optimization**
Eberhard Karls Universität Tübingen, Germany, 2018 (PhD thesis)

**Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients**
In *Proceedings of the 35th International Conference on Machine Learning (ICML)*, 2018 (inproceedings)

**Probabilistic Line Searches for Stochastic Optimization**
*Journal of Machine Learning Research*, 18(119):1-59, November 2017 (article)

**Early Stopping Without a Validation Set**
*arXiv preprint arXiv:1703.09580*, 2017 (article)

**Coupling Adaptive Batch Sizes with Learning Rates**
In *Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI)*, pages: 410-419, (Editors: Gal Elidan and Kristian Kersting), Association for Uncertainty in Artificial Intelligence (AUAI), August 2017 (inproceedings)

**Probabilistic Line Searches for Stochastic Optimization**
In *Advances in Neural Information Processing Systems 28*, pages: 181-189, (Editors: C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama and R. Garnett), Curran Associates, Inc., 29th Annual Conference on Neural Information Processing Systems (NIPS), 2015 (inproceedings)

**Probabilistic Interpretation of Linear Solvers**
*SIAM Journal on Optimization*, 25(1):234-260, 2015 (article)

**Quasi-Newton Methods: A New Direction**
*Journal of Machine Learning Research*, 14(1):843-865, March 2013 (article)

**Fast Probabilistic Optimization from Noisy Gradients**
In *Proceedings of The 30th International Conference on Machine Learning, JMLR W&CP 28(1)*, pages: 62–70, (Editors: S Dasgupta and D McAllester), ICML, 2013 (inproceedings)

**Quasi-Newton Methods: A New Direction**
In *Proceedings of the 29th International Conference on Machine Learning*, pages: 25-32, ICML ’12, (Editors: John Langford and Joelle Pineau), Omnipress, New York, NY, USA, ICML, July 2012 (inproceedings)