A *learning algorithm* is the backbone of machine learning: it distinguishes machine learning from traditional computer programming by allowing models to be built from data. In recent years, we have developed learning algorithms using a number of tools and for diverse application domains, as outlined below.

Kernel methods offer a mathematically elegant toolbox for tackling several problems that arise in machine learning, ranging from probabilistic inference to deep learning. Recently, a subfield of kernel methods known as *Hilbert space embedding of distributions* has gained increasing popularity [ ], thanks to foundational work done in our department over the last 10+ years. For a probability distribution $\mathbb{P}$ over a measurable space $\mathcal{X}$, the kernel mean embedding of $\mathbb{P}$ is defined as the mapping $\mu: \mathbb{P} \mapsto \int k(x,\cdot) \,\mathrm{d}\mathbb{P}(x)$, where $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ is a positive definite kernel function. Its applications include, but are not limited to, comparing real-world distributions on the basis of samples, differentially private learning, and assessing the goodness of fit of a model.
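As a concrete illustration (not taken from the cited works), the embedding can be estimated from samples as $\hat{\mu} = \frac{1}{n}\sum_i k(x_i,\cdot)$, and the squared RKHS distance between two such embeddings (the maximum mean discrepancy, MMD) compares two distributions using samples alone. A minimal NumPy sketch with a Gaussian RBF kernel and an illustrative bandwidth:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = (X[:, None, :] - Y[None, :, :]) ** 2
    return np.exp(-gamma * sq.sum(-1))

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of the squared MMD between the samples' distributions,
    i.e. the RKHS distance ||mu_P - mu_Q||^2 between their mean embeddings."""
    return (rbf_kernel(X, X, gamma).mean()
            - 2 * rbf_kernel(X, Y, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(size=(200, 1)), rng.normal(size=(200, 1)))
diff = mmd2(rng.normal(size=(200, 1)), rng.normal(3.0, 1.0, size=(200, 1)))
print(same < diff)  # samples from different distributions are farther apart
```

The choice of kernel and bandwidth here is purely illustrative; in practice both are tuned to the data.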

Our department has an ongoing history of contributions to the state of the art in this area. In [ ], we develop privacy-preserving algorithms based on the kernel mean embedding that allow one to release a database while guaranteeing the privacy of each individual record.

In applications such as probabilistic programming, transforming a base random variable $X$ with a function $f$ is a basic building block for manipulating a probabilistic model. It then becomes necessary to characterize the distribution of $f(X)$. In [ ], we show that for any continuous function $f$, consistent estimators of the mean embedding of a random variable $X$ lead to consistent estimators of the mean embedding of $f(X)$. For Matérn kernels and sufficiently smooth functions, we also provide rates of convergence.
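The plug-in principle behind this result can be illustrated with a toy sketch (this is an illustration, not the paper's estimator or kernel choice): pushing samples of $X$ through $f$ yields an estimator of the mean embedding of $f(X)$ that agrees with one built from fresh draws of $f(X)$ itself.

```python
import numpy as np

def embed_eval(samples, t, gamma=1.0):
    """Evaluate the empirical mean embedding (1/n) sum_i k(x_i, .) at point t,
    using a Gaussian RBF kernel."""
    return np.exp(-gamma * (samples - t) ** 2).mean()

rng = np.random.default_rng(1)
x = rng.normal(size=5000)          # samples from X
f = lambda z: np.sin(z) + 0.5 * z  # an arbitrary continuous transformation

# Mean embedding of f(X), estimated by pushing samples of X through f ...
plug_in = embed_eval(f(x), t=0.3)
# ... agrees with an estimate built from fresh draws of f(X) itself.
direct = embed_eval(f(rng.normal(size=5000)), t=0.3)
print(abs(plug_in - direct))  # small: both estimators are consistent
```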

In [ ], we address the problem of measuring the relative goodness of fit of two models using kernel mean embeddings. Given two candidate models and a set of target observations, the goal is to produce a set of interpretable examples (so-called informative features) which indicate the regions in the data domain where one model fits better than the other. The task is formulated as a statistical test whose runtime complexity is linear in the sample size.
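A toy sketch of the underlying idea, with illustrative Gaussian models and hand-picked test locations rather than the paper's optimized features: evaluating mean embeddings only at a few locations yields a discrepancy that costs time linear in the sample size and indicates where each model fits.

```python
import numpy as np

def ume2(model_samples, data, locs, gamma=1.0):
    """Squared discrepancy between model and data mean embeddings, evaluated
    only at a few test locations -- linear in the sample size."""
    def emb(S):  # empirical mean embedding evaluated at each location
        return np.exp(-gamma * (S[:, None] - locs[None, :]) ** 2).mean(0)
    return ((emb(model_samples) - emb(data)) ** 2).mean()

rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, 2000)      # observations
model_p = rng.normal(0.0, 1.0, 2000)   # candidate model P (well specified)
model_q = rng.normal(1.5, 1.0, 2000)   # candidate model Q (wrong mean)
locs = np.array([-1.0, 0.0, 1.0])      # interpretable test locations

p_fit = ume2(model_p, data, locs)
q_fit = ume2(model_q, data, locs)
print(p_fit < q_fit)  # P fits the observations better at these locations
```

The full method additionally calibrates this comparison as a statistical test and optimizes the locations to be maximally informative.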

**Figure:** The Rosenbrock function, a non-convex function which serves as a test-bed for optimization algorithms (image credit: Wikipedia)

Optimization lies at the heart of most machine learning algorithms. Characteristics of modern large-scale machine learning problems include high-dimensional, noisy, and uncertain data; huge volumes of batch or streaming data; and intractable models, low accuracy, and reliance on distributed computation or stochastic approximations. Optimizing in these settings with approaches such as coordinate descent and the Frank-Wolfe algorithm has shown promising results in recent years. The high-level goal of our department's research in optimization is to understand the convergence properties of coordinate descent and Frank-Wolfe algorithms under different sampling schemes and constraints.

It is well known that greedy coordinate descent (CD) converges faster in practice than its randomized counterpart; however, the properties of greedy CD were until recently less well understood. In [ ], we provide a theoretical analysis of greedy coordinate descent for smooth functions. We also propose an approximate greedy CD approach which is computationally cheap and provably never worse than the randomized version. Similarly, in [ ] we propose an adaptive recursive sampling scheme, based on the min-max optimal solution of the variance-reduction problem, that achieves faster convergence for CD. The proposed approach can also be applied to stochastic gradient descent.
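The practical gap between the two selection rules is easy to see on a small quadratic. The sketch below (an illustration of the classical rules, not the approximate scheme from the paper) compares the greedy Gauss-Southwell rule, which picks the coordinate with the largest gradient magnitude, against uniform random selection:

```python
import numpy as np

def coordinate_descent(A, b, steps=60, rule="greedy", seed=0):
    """Minimize f(x) = 0.5 x^T A x - b^T x by exact coordinate minimization.
    'greedy' picks the coordinate with the largest gradient magnitude
    (Gauss-Southwell); 'random' samples a coordinate uniformly."""
    rng = np.random.default_rng(seed)
    x = np.zeros(len(b))
    for _ in range(steps):
        grad = A @ x - b
        i = np.argmax(np.abs(grad)) if rule == "greedy" else rng.integers(len(b))
        x[i] -= grad[i] / A[i, i]   # exact minimization along coordinate i
    return 0.5 * x @ A @ x - b @ x  # final objective value

rng = np.random.default_rng(3)
M = rng.normal(size=(30, 30))
A = M @ M.T + 30 * np.eye(30)       # well-conditioned positive definite matrix
b = rng.normal(size=30)

g = coordinate_descent(A, b, rule="greedy")
r = coordinate_descent(A, b, rule="random")
print(g <= r)  # greedy makes at least as much progress in the same budget
```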

Matching pursuit (MP), Frank-Wolfe (FW), and coordinate descent address optimization problems with a similar structure. A connection between MP and coordinate descent is explored in [ ]. We also prove a convergence rate for accelerated matching pursuit, which was not previously known. An MP algorithm for optimization over conic hulls is proposed in [ ]. In [ ], an easy-to-implement conditional gradient method is proposed for composite minimization problems, converging at a rate of $O(1/\sqrt{k})$. In a different line of work, we propose a Frank-Wolfe-based approach to boosting variational inference [ ], and analyze the convergence of the proposed framework under suitable assumptions.
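The shared structure is that each iteration solves a cheap linear subproblem over a set of atoms instead of projecting. A minimal sketch of the classical Frank-Wolfe method (not the composite variant from the paper), minimizing a quadratic over the probability simplex:

```python
import numpy as np

def frank_wolfe(grad_f, vertices, steps=200):
    """Minimize a convex f over the convex hull of `vertices` (rows).
    Each step solves a linear problem over the atoms -- no projection."""
    x = vertices[0].copy()
    for k in range(steps):
        g = grad_f(x)
        s = vertices[np.argmin(vertices @ g)]  # linear minimization oracle
        x += 2.0 / (k + 2) * (s - x)           # standard step size 2/(k+2)
    return x

# Minimize 0.5 * ||x - target||^2 over the probability simplex in R^3;
# the target lies in the simplex, so the iterates converge to it.
target = np.array([0.5, 0.5, 0.0])
x = frank_wolfe(lambda x: x - target, np.eye(3))
print(np.round(x, 2))
```

The iterate is always a sparse convex combination of the atoms visited so far, which is the property MP-type and CD-type analyses both exploit.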

Extreme multi-label classification refers to supervised multi-label learning involving hundreds of thousands or even millions of labels. It has been shown that machine learning problems arising in tasks such as recommendation, ranking, and web advertising can be reduced to the framework of extreme classification. It had long been conjectured that a binary-relevance-based one-vs-rest scheme is not statistically and computationally tenable in such scenarios. Surprisingly, we have been able to show in our recent work [ ] that a Hamming-loss-minimizing one-vs-rest paradigm is key to good prediction performance, as well as to efficient training (by enabling parallel training). DiSMEC [ ], when published in 2016, surpassed the contemporary state-of-the-art methods by up to 10 percentage points on various datasets consisting of up to a million labels, and has remained a top-performing benchmark method in this domain for over two years. The paradigm of concurrent training coupled with model pruning in DiSMEC has motivated algorithms by Microsoft Research which are used in the Bing search engine for dynamic search advertising and related searches.
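The two ingredients, embarrassingly parallel one-vs-rest training and pruning of near-zero weights to keep millions of linear models storable, can be sketched in a few lines. This is a toy illustration of the recipe (plain gradient descent on logistic loss, tiny dimensions), not the DiSMEC implementation:

```python
import numpy as np

def train_one_vs_rest(X, Y, epochs=200, lr=0.5, prune_tol=0.01):
    """One independent linear binary classifier per label (logistic loss),
    followed by pruning of near-zero weights. Labels are independent, so in
    a real system each loop iteration would run on its own core/machine."""
    n, d = X.shape
    W = np.zeros((Y.shape[1], d))
    for j in range(Y.shape[1]):
        y = 2.0 * Y[:, j] - 1.0           # {0,1} labels -> {-1,+1}
        for _ in range(epochs):
            margin = y * (X @ W[j])
            W[j] += lr * (X * (y / (1 + np.exp(margin)))[:, None]).mean(0)
        W[j][np.abs(W[j]) < prune_tol] = 0.0   # model pruning for sparsity
    return W

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 20))
true_W = np.zeros((5, 20)); true_W[:, :3] = rng.normal(size=(5, 3))
Y = (X @ true_W.T > 0).astype(float)    # 5 labels, each depends on 3 features
W = train_one_vs_rest(X, Y)
acc = (np.sign(X @ W.T) == 2 * Y - 1).mean()
print(acc)  # high training accuracy despite fully independent per-label models
```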

Research interest in *deep neural networks*, especially in the generative adversarial network (GAN) approach [ ], has increased substantially in recent years. In [ ], we propose a simple module that improves a GAN by preprocessing samples with a network which initially makes the discriminator's task harder (akin to data smoothing), thus simplifying the generator's task. This leads to a tempered learning process for both generator and discriminator. In a number of experiments, the proposed method improves quality, stability, and/or convergence speed across a range of GAN architectures (DCGAN, LSGAN, WGAN-GP). In [ ], we propose AdaGAN, a boosting-style meta-algorithm which can be combined with various modern generative models (including GANs and VAEs) to improve their quality. We provide an optimal closed-form solution for performing greedy updates that approximate an unknown distribution with a sequentially built mixture under any given f-divergence. The paper establishes a fruitful connection between learning theory and neural network research and has already attracted a large amount of follow-up work.

The work [ ] develops a deep neural network that learns to write programs from a corpus of program-induction problems. The approach leads to an order-of-magnitude speedup over strong baselines, including one based on a recurrent neural network (RNN).

Ideas from causality are beginning to influence our work on machine learning, and the notion of *independent causal mechanism* has been adopted in several areas including semi-supervised learning, domain adaptation, and transfer learning. From a deep learning perspective, we investigated whether a set of independent mechanisms can be recovered using deep neural networks [ ]. We proposed an algorithm that enables a set of experts (i.e., deep neural networks) to recover independent (inverse) mechanisms from a data set that has undergone unlabelled transformations. Using a competitive training procedure, the experts specialize to different mechanisms. Not only can the mechanisms be learned successfully, but the system also generalizes to transformed data in other domains.
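The competitive training procedure can be illustrated in one dimension (a toy sketch under strong simplifying assumptions, not the paper's neural-network experts): the canonical distribution is $N(0,1)$, the unknown transformations are two constant shifts, each expert is a learnable inverse shift, and for each sample only the expert whose output looks most canonical is updated.

```python
import numpy as np

def train_experts(transformed, n_experts=2, epochs=300, lr=0.1, seed=5):
    """Winner-take-all training of scalar 'experts': for each sample, the
    expert whose output has the highest N(0,1) log-likelihood wins, and only
    the winner is updated (pushing its outputs toward zero). The experts end
    up specializing, one per unknown transformation."""
    rng = np.random.default_rng(seed)
    shifts = rng.normal(scale=0.1, size=n_experts)   # experts' parameters
    for _ in range(epochs):
        x = rng.choice(transformed, size=32)
        outs = x[None, :] + shifts[:, None]          # each expert's attempt
        winner = np.argmax(-outs ** 2, axis=0)       # best N(0,1) log-score
        for e in range(n_experts):
            sel = outs[e, winner == e]
            if sel.size:                             # update the winner only,
                shifts[e] -= lr * sel.mean()         # centering its outputs
    return np.sort(shifts)

rng = np.random.default_rng(6)
base = rng.normal(size=2000)
data = np.concatenate([base[:1000] - 4.0, base[1000:] + 4.0])  # two unknown shifts
shifts = train_experts(data)
print(shifts)  # the experts recover the inverse shifts, near -4 and +4
```

The winner-take-all signal is what breaks the symmetry: once an expert is slightly better on one transformation, it receives all of that transformation's updates and specializes.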


**Sobolev GAN**
*6th International Conference on Learning Representations (ICLR)*, May 2018, *equal contribution (conference)

**Design and Analysis of the NIPS 2016 Review Process**
*Journal of Machine Learning Research*, 19(49):1-34, 2018, *equal contribution (article)

**Informative Features for Model Comparison**
*Advances in Neural Information Processing Systems 31*, pages: 816-827, (Editors: S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett), Curran Associates, Inc., 32nd Annual Conference on Neural Information Processing Systems, December 2018 (conference)

**Robustifying Independent Component Analysis by Adjusting for Group-Wise Stationary Noise**
2018, *equal contribution (article) Submitted

**Fidelity-Weighted Learning**
*6th International Conference on Learning Representations (ICLR)*, May 2018 (conference)

**Learning Independent Causal Mechanisms**
*Proceedings of the 35th International Conference on Machine Learning (ICML)*, 80, pages: 4033-4041, Proceedings of Machine Learning Research, (Editors: Dy, Jennifer and Krause, Andreas), PMLR, July 2018 (conference)

**A Conditional Gradient Framework for Composite Convex Minimization with Applications to Semidefinite Programming**
*Proceedings of the 35th International Conference on Machine Learning (ICML)*, 80, pages: 5713-5722, Proceedings of Machine Learning Research, (Editors: Dy, Jennifer and Krause, Andreas), PMLR, July 2018 (conference)

**On Matching Pursuit and Coordinate Descent**
*Proceedings of the 35th International Conference on Machine Learning (ICML)*, 80, pages: 3204-3213, Proceedings of Machine Learning Research, (Editors: Dy, Jennifer and Krause, Andreas), PMLR, July 2018 (conference)

**Prediction of Glucose Tolerance without an Oral Glucose Tolerance Test**
*Frontiers in Endocrinology*, 9, pages: 82, 2018 (article)

**Tempered Adversarial Networks**
*Proceedings of the 35th International Conference on Machine Learning (ICML)*, 80, pages: 4448-4456, Proceedings of Machine Learning Research, (Editors: Dy, Jennifer and Krause, Andreas), PMLR, July 2018 (conference)

**Boosting Variational Inference: an Optimization Perspective**
*Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS)*, 84, pages: 464-472, Proceedings of Machine Learning Research, (Editors: Amos Storkey and Fernando Perez-Cruz), PMLR, April 2018 (conference)

**Leveraging the Crowd to Detect and Reduce the Spread of Fake News and Misinformation**
*Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM)*, pages: 324-332, (Editors: Yi Chang, Chengxiang Zhai, Yan Liu, and Yoelle Maarek), ACM, February 2018 (conference)

**Differentially Private Database Release via Kernel Mean Embeddings**
*Proceedings of the 35th International Conference on Machine Learning (ICML)*, 80, pages: 423-431, Proceedings of Machine Learning Research, (Editors: Dy, Jennifer and Krause, Andreas), PMLR, July 2018 (conference)

**Greedy Algorithms for Cone Constrained Optimization with Convergence Guarantees**
*Advances in Neural Information Processing Systems 30*, pages: 773-784, (Editors: Guyon I. and Luxburg U.v. and Bengio S. and Wallach H. and Fergus R. and Vishwanathan S. and Garnett R.), Curran Associates, Inc., 31st Annual Conference on Neural Information Processing Systems, December 2017 (conference)

**Safe Adaptive Importance Sampling**
*Advances in Neural Information Processing Systems 30*, pages: 4384-4394, (Editors: Guyon I. and Luxburg U.v. and Bengio S. and Wallach H. and Fergus R. and Vishwanathan S. and Garnett R.), Curran Associates, Inc., 31st Annual Conference on Neural Information Processing Systems, December 2017 (conference)

**Kernel Mean Embedding of Distributions: A Review and Beyond**
*Foundations and Trends in Machine Learning*, 10(1-2):1-141, 2017 (article)

**Approximate Steepest Coordinate Descent**
*Proceedings of the 34th International Conference on Machine Learning*, 70, pages: 3251-3259, Proceedings of Machine Learning Research, (Editors: Doina Precup, Yee Whye Teh), PMLR, International Conference on Machine Learning (ICML), August 2017 (conference)

**Local Group Invariant Representations via Orbit Embeddings**
*Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS)*, 54, pages: 1225-1235, Proceedings of Machine Learning Research, (Editors: Aarti Singh and Jerry Zhu), April 2017 (conference)

**DeepCoder: Learning to Write Programs**
*Proceedings International Conference on Learning Representations 2017*, OpenReviews.net, International Conference on Learning Representations, April 2017 (conference)

**Distilling Information Reliability and Source Trustworthiness from Digital Traces**
*Proceedings of the 26th International Conference on World Wide Web (WWW)*, pages: 847-855, (Editors: Barrett, R., Cummings, R., Agichtein, E. and Gabrilovich, E. ), ACM, April 2017 (conference)

**DiSMEC – Distributed Sparse Machines for Extreme Multi-label Classification**
*Proceedings of the Tenth ACM International Conference on Web Search and Data Mining (WSDM)*, pages: 721-729, February 2017 (conference)

**Influence Estimation and Maximization in Continuous-Time Diffusion Networks**
*ACM Transactions on Information Systems*, 34(2):9:1-9:33, 2016 (article)

**Consistent Kernel Mean Estimation for Functions of Random Variables**
*Advances in Neural Information Processing Systems 29*, pages: 1732-1740, (Editors: D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett), Curran Associates, Inc., 30th Annual Conference on Neural Information Processing Systems, December 2016, *joint first authors (conference)

**Estimating Diffusion Networks: Recovery Conditions, Sample Complexity and Soft-thresholding Algorithm**
*Journal of Machine Learning Research*, 17(90):1-29, 2016 (article)

**Unifying distillation and privileged information**
*International Conference on Learning Representations (ICLR)*, November 2016 (conference)

**TerseSVM : A Scalable Approach for Learning Compact Models in Large-scale Classification**
*Proceedings of the 2016 SIAM International Conference on Data Mining (SDM)*, pages: 234-242, (Editors: Sanjay Chawla Venkatasubramanian and Wagner Meira), May 2016 (conference)

**Learning Taxonomy Adaptation in Large-scale Classification**
*Journal of Machine Learning Research*, 17(98):1-37, 2016 (article)