UPenn Optimization Seminar

Date	Speaker	Affiliation	Title
Jan 25	René Vidal	UPenn ESE	Learning Dynamics of Overparametrized Networks
Feb 1	Michael Kearns	UPenn CIS	Poison and Cure: Non-Convex Optimization Techniques for Private Synthetic Data and Reconstruction Attacks [slides]
Feb 8	Weijie Su	UPenn Stats	What Should Good Deep Learning Models Look Like? An Optimization Perspective [slides]
Feb 15	Jelena Diakonikolas	UW-Madison CS	Cyclic Block Coordinate Methods on a Finer Scale: Tighter Bounds and New Methods [slides]
Feb 22	Dimitris Bertsimas	MIT Sloan	The R.O.A.D. to precision medicine
Feb 29	Erik Waingarten	UPenn CIS	A Quasi-Monte Carlo Algorithm for Smooth Kernel Evaluation [slides]
Mar 7 [Fall Break]	------------	------------
Mar 14	Surbhi Goel	UPenn CIS	Understanding training dynamics in deep learning using sparse parities and markov chains [slides]
Mar 21	Fatma Kılınç-Karzan	CMU OR	Mistake Bounds, Truthfulness, and Margin Guarantees in Online Strategic Classification [slides]
Mar 28	Philippe Rigollet	MIT Math	The emergence of clusters in self-attention dynamics
April 4	Stephen Wright	UW-Madison CS	Optimization in Theory and Practice [slides]
Apr 11	Matus Telgarsky	NYU Courant	Feature learning and margin maximization via mirror descent
Apr 18	Katya Scheinberg	Cornell ORIE	Stochastic First Order Oracles and Where to Find Them
Apr 25	Hamed Hassani	UPenn ESE	Adversarial Machine Learning: Fundamental Limits, Algorithms, and New Applications in Generative AI [slides]

Deep networks have led to significant improvements in the performance of AI systems. However, the mathematical reasons for this success remain elusive. For example, contrary to the common belief that overparameterization may hurt generalization and optimization, recent work suggests that overparameterization may bias the optimization algorithm towards solutions that generalize well — a phenomenon known as implicit regularization or implicit bias — and may also accelerate convergence — a phenomenon known as implicit acceleration. This lecture will study both phenomena through the lens of dynamical systems. In particular, we will provide a detailed analysis of the dynamics of gradient flow and gradient descent for overparametrized linear models showing that convergence to equilibrium depends on the imbalance between input and output weights (which is fixed at initialization) and the margin of the initial solution. The talk will also provide an analysis of the implicit bias, showing that large hidden layer width, together with (properly scaled) random initialization, constrains the network parameters to converge to a solution which is close to the min-norm solution.

Michael Kearns: Poison and Cure: Non-Convex Optimization Techniques for Private Synthetic Data and Reconstruction Attacks

I will survey results describing the use of modern non-convex optimization methods to the problems of reconstruction attacks on private datasets (the “poison”), and the algorithmic generation of synthetic versions of private datasets that provably provide strong privacy guarantees (the “cure”).

Weijie Su: What Should Good Deep Learning Models Look Like? An Optimization Perspective

In this talk, we will investigate the emergence of geometric patterns in well-trained deep learning models by making use of a layer-peeled model and the law of equi-separation. The former is a nonconvex optimization program that models the last-layer features and weights. We use the model to shed light on the neural collapse phenomenon of Papyan, Han, and Donoho, and to predict a hitherto-unknown phenomenon that we term minority collapse in imbalanced training.

The law of equi-separation is a pervasive empirical phenomenon that describes how data are separated according to their class membership from the bottom to the top layer in a well-trained neural network. We will show that, through extensive computational experiments, neural networks improve data separation through layers in a simple exponential manner. This law leads to roughly equal ratios of separation that a single layer is able to improve, thereby showing that all layers are created equal. We will conclude the talk by discussing the implications of this law on the interpretation, robustness, and generalization of deep learning, as well as on the inadequacy of some existing approaches toward demystifying deep learning.

Jelena Diakonikolas: Cyclic Block Coordinate Methods on a Finer Scale: Tighter Bounds and New Methods

Cyclic block coordinate methods represent one of the simplest and most intuitive approaches in continuous optimization. They operate by partitioning coordinates into subsets and cyclically updating these subsets. These methods have received significant attention in the literature, with the alternating minimization method from the 1970s being one of the earliest examples (though an even earlier method of Kaczmarz from 1937 can arguably also be viewed as a cyclic coordinate method). Today, cyclic (block) coordinate methods are widely employed in practical applications and serve as the default choice in software packages designed for large-scale statistical learning, including SparseNet and GLMNet.

Despite their broad adoption in practice, the empirical success of cyclic methods is not matched by the existing theory. In particular, even though these methods are often preferred over their randomized or full-vector-update counterparts, their theoretical convergence bounds are usually substantially worse. For example, the theoretical convergence bound for cyclic coordinate gradient descent is worse than the bound of (full vector) gradient descent by a factor scaling linearly with the dimension, and this is tight in the worst case.

In this talk, I will present a novel perspective on cyclic methods that leads to a more fine-grained characterization of their non-asymptotic convergence. I will then present novel cyclic methods with provably better scaling as a function of the number of blocks (dimension in the case of single-coordinate blocks). These results break nearly a decade-old computational barrier of cyclic methods by a factor scaling with the square-root of the dimension. I will further provide examples of problems and cyclic methods for which the convergence of cyclic methods is no worse than the convergence of randomized or full-vector-update methods, even in the worst case, and discuss some ongoing and future directions.

Dimitris Bertsimas: The R.O.A.D. to precision medicine

We propose a prognostic stratum matching framework that uses optimization in its core that addresses the deficiencies of Randomized trial data subgroup analysis and transforms Observational Data to be used as if they were randomized, thus paving the road for precision medicine. Our approach counters the effects of unobserved confounding in observational data by correcting the estimated probabilities of the outcome under a treatment through a novel two-step process. We applied our framework to observational data of patients with gastrointestinal stromal tumors (GIST) and validated our approach in an external cohort using the sensitivity and specificity metrics. We show that these recommendations outperformed those of experts in GIST. We further applied the same framework to randomized clinical trial (RCT) data of patients with extremity sarcomas. Remarkably, despite the initial trial results suggesting that all patients should receive treatment, our framework, after addressing imbalances in patient distribution due to the trial’s small sample size, identified a subset of patients with unique characteristics who may not require treatment. Again, we successfully validated our recommendations in an external cohort. Joint work with Dr. Antonis Margonis and Angelos Koulouras

Erik Waingarten: A Quasi-Monte Carlo Algorithm for Smooth Kernel Evaluation

In the kernel density estimation (KDE) problem, one is given a kernel and a dataset of points in a high dimensional Euclidean space, and must prepare a small space data structure that can quickly answer density queries: given a point, output an approximation to the sum of pairwise kernel evaluations. The classical approach to KDE (and the more general problem of matrix vector multiplication for kernel matrices) is the celebrated fast multipole method of Greengard and Rokhlin [1983]. The fast multipole method combines a basic space partitioning approach with a multidimensional Taylor expansion, which yields an exponential-in-dimension query time. A recent line of work initiated by Charikar and Siminelakis [2017] achieved polynomial dependence on the dimension via a combination of random sampling and randomized space partitioning, with Backurs et al. [2018] improving on the data structure further for smooth kernels. Our main result is insight is a new way to combine discrepancy theory with randomized space partitioning inspired by, but significantly more efficient than, that of the fast multipole methods.

Surbhi Goel: Understanding training dynamics in deep learning using sparse parities and markov chains

The dynamics of gradient-based methods on deep learning objectives display a plethora of interesting phenomena: implicit regularization, phase transitions/emergence, lottery tickets, scaling laws, edge of stability, grokking, and so on. Most theoretical work has focused on using linear models or kernel methods as a proxy to understand the underlying mechanism of these phenomena. While these models allow for understanding several phenomena, they do not involve any feature learning. In this talk, we will argue the need for better proxy models, and propose two toy models, sparse parities and Markov chains, that allow for rigorous analysis as well as useful empirical insights. We will conclude with some perspectives on toy models: future opportunities, limitations, and their importance as diagnostic tools for deep learning pipelines. Based on joint works with Boaz Barak, Ben Edelman, Ezra Edelman, Sham Kakade, Eran Malach, Nikos Tsilivis, and Cyril Zhang.

Fatma Kılınç-Karzan: Mistake Bounds, Truthfulness, and Margin Guarantees in Online Strategic Classification

We consider an online strategic classification problem where each arriving agent can manipulate their true feature vector to obtain a positive predicted label, while incurring a cost for their manipulation that depends on distance between the true feature vector and the manipulated one. The learner seeks to predict the agent’s true label given access to only the manipulated features. After the learner releases their prediction, the agent’s true label is revealed. While previous algorithms such as the strategic perceptron enjoy finite mistake guarantees, we show that they may fail to encourage truthfulness, that is, agents must manipulate their features to get the correct label or provide margin guarantees. We extend the notion of margin from the non-strategic setting to the strategic setting, propose two Strategic Max-Margin (SMM) algorithms for computing classifiers with margin guarantees, which we show to also result in truthfulness as well as finite mistake guarantees under minor assumptions. Our numerical study on real and synthetic data also demonstrates the efficacy of our algorithms. When agent manipulations are noiseless, we show that the new algorithms outperform the strategic perceptron in terms of number of mistakes, number of manipulations and margin. When agent responses are noisy, even though the performances of all algorithms degrade as the noise increases, strategic perceptron and one of our SMM algorithms seem to be robust for low noise.

This is joint work with Lingqing Shen, Nam Ho-Nguyen and Khanh-Hung Giang-Tran.

Philippe Rigollet: The emergence of clusters in self-attention dynamics

Since their introduction in 2017, Transformers have revolutionized large language models and the broader field of deep learning. Central to this success is the groundbreaking self-attention mechanism. In this presentation, I’ll introduce a mathematical framework that casts this mechanism as a mean-field interacting particle system, revealing a desirable long-time clustering behavior. This perspective leads to a trove of fascinating questions with unexpected connections to Kuramoto oscillators, sphere packing, and Wasserstein gradient flows. Primarily based on https://arxiv.org/abs/2312.10794 as well as more recent results from our group.

Steve Wright: Optimization in Theory and Practice

Complexity analysis in optimization seeks upper bounds on the amount of work required to find approximate solutions of optimization problems in a given class with a given algorithm. There is also interest in lower bounds, usually in the form of a worst-case example from a given problem class for a given algorithm class. The relationship between theoretical complexity bounds and practical performance of algorithms on “typical” problems varies widely across problem and algorithm classes, and relative interest among researchers between the theoretical and practical aspects of algorithm design and analysis has waxed and waned over the years. This talk surveys complexity analysis and its relationship to practice in optimization, with an emphasis on linear programming and convex and nonconvex nonlinear optimization, providing historical (and cultural) perspectives on research in these areas.

Matus Telgarsky: Feature learning and margin maximization via mirror descent

This chalk talk will motivate and describe the margin maximization problem in neural networks, and show how it can be solved via a simple refinement of the standard mirror descent analysis. In detail, the first part of the talk will explain how margin maximization is an approach to establishing feature learning; as an example, it can achieve sample complexities better than any kernel (e.g., the NTK). The talk will then show how standard gradient descent can be analyzed via an alternative implicit mirror descent, and leads to margin maximization. Time permitting, the talk will also show other consequences of this refined mirror descent, for instance into the study of arbitrarily large steps sizes with logistic regression.

The main content is joint work with Danny Son, and will appear on arXiv in May. The "time permitting" part is joint work with Jingfeng Wu, Peter Bartlett, and Bin Yu, and can be found at https://arxiv.org/abs/2402.15926 .

Katya Scheinberg: Stochastic First Order Oracles and Where to Find Them

Continuous optimization is a mature field, which has recently undergone major expansion and change. One of the key new directions is the development of methods that do not require exact information about the objective function. Nevertheless, the majority of these methods, from stochastic gradient descent to "zero-th order" methods use some kind of approximate first order information. We will overview different methods of obtaining this information, including simple stochastic gradient via sampling, robust gradient estimation in adversarial settings, traditional and randomized finite difference methods and more. We will discuss what key properties of these inexact, stochastic first order oracles are useful for convergence analysis of optimization methods that use them.

Hamed Hassani: Adversarial Machine Learning: Fundamental Limits, Algorithms, and New Applications in Generative AI

I will present some of our recent works in the area of adversarial ML. In doing so, I will touch upon some of the fundamental ideas that have been developed in the field in the last ten years (by many groups), ranging from fundamental tradeoffs between robustness and generalization, algorithmic approaches to training robust models, and new ideas to break LLMs. The talk will be self-contained and no background on adversarial ML is required. More related to the optimization seminar series, there is a clear optimization problem that we will study throughout the talk.