UPenn Optimization Seminar (new!)
Quick facts
- When: Thursdays, 12-1pm
- Where: Room 1201, Steinberg-Dietrich Hall
- Mailing list: subscribe to "opt-seminar" here
- Organizers: Jason Altschuler and Hamed Hassani
- Administrative coordinator: Sonia Rodriguez (email: soniacr@seas.upenn.edu)
About
- What: This seminar series features leading experts in optimization and adjacent fields. Topics range broadly from the design and analysis of optimization algorithms, to the complexity of fundamental optimization tasks, to the modeling and formulation of optimization applications arising in machine learning, engineering, operations, economics, applied mathematics, etc.
- Why: This seminar serves as a university-wide hub to bring together the many optimization communities across UPenn --- in the Departments of Statistics and Data Science, Electrical Engineering, Computer Science, Applied Mathematics, Economics, Wharton OID, etc. To help promote this internal interaction, several speakers will be from UPenn.
- Funding: We gratefully acknowledge funding from the Wharton Department of Statistics and Data Science, the IDEAS Institute for Data Science, and the NSF-Tripods-I program.
- Archived schedules: Fall 2023, Spring 2024
Schedule for Fall 2024
Abstracts
Aaron Sidford: Theoretical Advances in Efficiently Solving Markov Decision Processes
Markov Decision Processes (MDPs) are a fundamental mathematical model for reasoning about uncertainty and are foundational to reinforcement learning theory. Over the past decade, there have been substantial advances in the design and analysis of algorithms for provably computing approximately optimal policies for MDPs in a variety of settings. In this talk, I will survey these advances, touching on optimization tools of potential broader utility. In particular, this talk will highlight recent joint work with Ishani Karmarkar, Jiayi Wang, and Yujia Jin on this topic (arXiv:2405.12952).
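For readers newer to the model, the object being approximated throughout this line of work is the optimal value function of a discounted MDP, characterized by the classical Bellman optimality equation (standard background, not specific to the surveyed results):

```latex
% Bellman optimality equation for a discounted MDP (S, A, P, r, gamma):
v^*(s) = \max_{a \in A} \Big[\, r(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, v^*(s') \,\Big],
\qquad \gamma \in (0,1).
```

An ε-optimal policy is then one whose value function is within ε of v* at every state.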
Madeleine Udell: AI and the Future of Optimization Modeling
Optimization problems are pervasive in sectors from manufacturing and distribution to healthcare. However, most such problems are still solved heuristically by hand rather than optimally by state-of-the-art solvers, as the expertise required to formulate and solve these problems limits the widespread adoption of optimization tools and techniques. As a glimpse of the future, this talk will introduce OptiMUS, a Large Language Model (LLM)-based agent designed to formulate and solve MILP problems from natural language descriptions. OptiMUS can develop mathematical models, write and debug solver code, develop tests, and check the validity of generated solutions. Experimentally, OptiMUS correctly solves more than 80% of benchmark problems, more than twice as many as a basic LLM prompting strategy. More broadly, we discuss the potential for LLMs in domains where accuracy and fidelity to real-world data are critical, and strategies to augment and safeguard their performance.
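To make the task concrete, below is the kind of artifact OptiMUS aims to produce automatically from a natural-language description: a small hand-written MILP in PuLP (our illustrative example; the scenario and numbers are invented, and the code is not OptiMUS output):

```python
# "A plant makes desks and chairs. A desk takes 3 hours of carpentry and
#  1 hour of finishing; a chair takes 2 and 2. There are 240 carpentry
#  hours and 100 finishing hours. Profit is $40/desk and $25/chair.
#  How many of each should be made?"  -> hand-translated to a MILP:
from pulp import LpProblem, LpMaximize, LpVariable, LpStatus, value

prob = LpProblem("product_mix", LpMaximize)
desks = LpVariable("desks", lowBound=0, cat="Integer")
chairs = LpVariable("chairs", lowBound=0, cat="Integer")

prob += 40 * desks + 25 * chairs          # objective: total profit
prob += 3 * desks + 2 * chairs <= 240     # carpentry hours
prob += 1 * desks + 2 * chairs <= 100     # finishing hours

prob.solve()
print(LpStatus[prob.status], value(desks), value(chairs), value(prob.objective))
```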
Alejandro Ribeiro: Constrained Reinforcement Learning
Constrained reinforcement learning (CRL) involves multiple rewards that must individually accumulate to given thresholds. CRL arises naturally in cyberphysical systems, which are most often specified by a set of requirements. We explain in this talk that CRL problems have null duality gap even though they are not convex. This implies that they can be solved in the dual domain, but standard dual gradient descent algorithms may fail to find optimal policies. We circumvent this limitation by introducing a state-augmented algorithm in which Lagrange multipliers are incorporated in the state space. We show that state-augmented algorithms sample from stochastic policies that achieve target rewards. We further introduce resilient CRL as a mechanism to relax constraints when requirements are overspecified. We illustrate results and implications with safety and wireless communication applications.
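To fix notation, the dual approach described above rests on the standard Lagrangian formulation (our paraphrase of the setup, with generic value functions V_i and thresholds c_i):

```latex
\max_{\pi} \; V_0(\pi) \quad \text{s.t.} \quad V_i(\pi) \ge c_i, \quad i = 1, \dots, m.
% Lagrangian and projected dual descent on the multipliers:
L(\pi, \lambda) = V_0(\pi) + \sum_{i=1}^{m} \lambda_i \big( V_i(\pi) - c_i \big),
\qquad
\lambda_i \leftarrow \big[\, \lambda_i - \eta \big( V_i(\pi_\lambda) - c_i \big) \,\big]_+ .
```

The state-augmented idea is to let the policy condition on the pair (state, λ), so that running the multiplier dynamics online induces trajectories that meet the targets.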
Jacob Gardner: Bayesian Optimization for Scientific Discovery
Many scientific experimentation and discovery problems can be cast as black-box optimization problems, where the goal is to use a data-driven approach to jointly design a material, protein, or molecule that maximizes some desired property while knowing relatively little about the underlying objective. For example, a laboratory developing novel antibody therapeutics might wish to maximize binding affinity to some target antigen but will typically also care about yield, thermostability, humanization, and other constraints. In this talk, I will discuss some of our recent work advancing both the practical application and theory of Bayesian optimization, a widely studied surrogate-model-based approach that has emerged as one of the most successful frameworks for black-box optimization. Practically, our recent work has enabled optimization of objectives with constraints defined over complex, structured objects like molecules and peptides, leading to the discovery of diverse, novel antibiotic molecules that are as potent as, or more potent than, clinically available ones against high-priority drug-resistant pathogens in mice. Theoretically, I will argue that, surprisingly, local approaches to global optimization may partially circumvent the pessimistic lower bounds for optimizing high-dimensional black-box objectives.
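As background for readers new to the framework, a minimal Bayesian optimization loop with a Gaussian-process surrogate and an expected-improvement acquisition looks as follows (a textbook sketch with a toy 1-D objective, not code from the talk):

```python
import numpy as np
from scipy.stats import norm

def rbf_kernel(A, B, ls=0.2):
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

def gp_posterior(Xtr, ytr, Xte, noise=1e-6):
    # standard GP regression equations via a Cholesky factorization
    K = rbf_kernel(Xtr, Xtr) + noise * np.eye(len(Xtr))
    Ks = rbf_kernel(Xtr, Xte)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, ytr))
    v = np.linalg.solve(L, Ks)
    var = np.clip(1.0 - np.sum(v**2, axis=0), 1e-12, None)  # k(x,x)=1 for RBF
    return Ks.T @ alpha, np.sqrt(var)

def expected_improvement(mu, sd, best):
    z = (mu - best) / sd
    return (mu - best) * norm.cdf(z) + sd * norm.pdf(z)

f = lambda x: -np.sin(3 * x) - x**2 + 0.7 * x   # toy stand-in for a lab assay

rng = np.random.default_rng(0)
X = rng.uniform(-1, 2, (3, 1))                   # small initial design
y = f(X).ravel()
grid = np.linspace(-1, 2, 200)[:, None]
for _ in range(15):                              # 15 simulated experiments
    mu, sd = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sd, y.max()))]
    X, y = np.vstack([X, x_next]), np.append(y, f(x_next))
print("best x:", X[np.argmax(y)].item(), "best f:", y.max())
```

The work described in the talk replaces the toy box-constrained domain with structured spaces of molecules and peptides, which is where most of the practical difficulty lies.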
Jason Lee: Learning Representations and Associations with Gradient Descent
Machine Learning has undergone a paradigm shift with the success of pretrained models. Pretraining models via gradient descent learns transferable representations that adapt to a wide swath of downstream tasks. However, significant prior theoretical work has demonstrated that in many regimes, overparametrized neural networks trained by gradient descent behave like kernel methods, and do not learn transferable representations. In this talk, we close this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a neural network by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime.
Finally, I will demonstrate how pretraining learns associations for in-context learning with transformers. This leads to a systematic and mechanistic understanding of learning causal structures including the celebrated induction head identified by Anthropic.
Robert Freund: The Role of Level-Set Geometry on the Performance of PDHG for Conic Linear Optimization
In joint work with Zikai Xiong (MIT OR Center), we consider solving huge-scale instances of (convex) conic linear optimization problems, at the scale where matrix-factorization-free methods are attractive or necessary. The restarted primal-dual hybrid gradient method (rPDHG) -- with heuristic enhancements and GPU implementation -- has been very successful in solving huge-scale linear programming (LP) problems. Here we extend the study of rPDHG to encompass Conic Linear Optimization as well. We present a new theoretical analysis of rPDHG for general (convex) conic linear optimization and for LP as a special case thereof. We show a relationship between geometric measures of the primal-dual (sub-)level sets and the convergence rate of rPDHG. We also show how central-path-based linear transformations -- including conic rescaling -- can markedly enhance the convergence rate of rPDHG. Last of all, we present computational results that demonstrate how rescalings can accelerate convergence to high-accuracy solutions, and lead to more efficient methods for huge-scale LP problems in particular. Many of our ideas are aligned with and connected to work by Yinyu Ye, and these connections will be discussed in the talk.
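For orientation, the unrestarted PDHG iteration for a standard-form LP, applied to its saddle-point formulation, is the classical Chambolle-Pock update; the restarts, rescalings, and conic generalization are the subject of the talk:

```latex
% Saddle-point form of  min { c^T x : Ax = b, x >= 0 }:
\min_{x \ge 0} \max_{y} \; c^\top x + y^\top (b - Ax).
% One PDHG iteration, with step sizes satisfying tau * sigma * ||A||^2 < 1:
x^{k+1} = \Pi_{x \ge 0}\big( x^k - \tau \,(c - A^\top y^k) \big), \qquad
y^{k+1} = y^k + \sigma \big( b - A\,(2x^{k+1} - x^k) \big).
```

Being matrix-factorization-free, each iteration costs only two matrix-vector products with A, which is what makes the method attractive at huge scale.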
John Wright: Fitting Deep Networks to Structured Data: Theory and Architectures
Data in science and engineering often exhibit nonlinear, low-dimensional structure, due to the physical laws that govern data generation. In this talk, we study how deep neural networks interact with structured data:
+ When can we guarantee to fit and generalize?
+ How do the resources (depth, width, data) required depend on the complexity of the data?
+ How can we leverage physical prior knowledge to reduce these resource requirements?
Our main mathematical result is a guarantee of generalization for a model classification problem involving data on two low-dimensional manifolds: we prove that for networks of polynomial width, with polynomially many samples, randomly initialized gradient descent rapidly converges to a solution which correctly labels *every* point on the two manifolds. To our knowledge, this is the first such result for deep networks on data which are not linearly separable. We highlight intuitions about the roles of depth, width, sample complexity, and the geometry of feature representations, which may be useful in analyzing other problems involving low-dimensional structure.
We illustrate these ideas through applied problems in astrophysics and computer vision. In these settings, we further suggest how incorporating physical prior knowledge can reduce the resources (architecture, data) required for learning, leading to more efficient and interpretable learning architectures.
Joint work with Sam Buchanan, Dar Gilboa, Tingran (Tim) Wang, Jingkai Yan, and Shiyu Wang.
Marco Cuturi: On Parameterizing Optimal Transport with Elastic Costs
I will present in this talk an overview of the computation of optimal transport, focusing in particular on the challenge of computing OT maps using two samples from high-dimensional probability measures. After reviewing a few of the popular methods that have been explored for this task recently, including those leveraging neural architectures, I will introduce our recent work on parameterizing OT problems with elastic costs, i.e., ground costs that mix the classic squared Euclidean distance with a regularizer (e.g., the L1 norm). After highlighting the properties of OT maps that arise from such costs, I will present a method to compute ground-truth OT maps for elastic costs, as well as a method to adaptively learn the parameters of such a regularizer.
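Concretely, the elastic costs in question take the form (our rendering; γ ≥ 0 weights the regularizer):

```latex
c(x, y) = \tfrac{1}{2}\,\| x - y \|_2^2 + \gamma\,\tau(x - y),
\qquad \text{e.g. } \tau = \|\cdot\|_1 .
```

As we understand the construction, the appeal is that optimal displacements x − T(x) are then filtered through the proximal operator of γτ, so an L1 regularizer yields sparse displacements.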
Xinyi Chen: A Non-stochastic Control Approach to Optimization
Selecting the best hyperparameters for a particular optimization instance, such as the learning rate and momentum, is an important but non-convex problem. As a result, local methods such as hypergradient descent lack global optimality guarantees in general. We propose an online control methodology for mathematical optimization. We first describe meta-optimization, an online learning formulation of learning the best optimization algorithm from a class of methods. Meta-optimization over gradient-based methods can be framed as a feedback control problem over the choice of hyperparameters, including the learning rate, momentum, and the preconditioner. We show how recent methods from online non-stochastic control can be applied to develop a convex relaxation and obtain regret guarantees against the best offline solution. This guarantees that in meta-optimization, given a sequence of optimization problems, we can learn a method with performance comparable to that of the best method in hindsight from a class of methods. We end with experiments on regression and deep neural network training that demonstrate the effectiveness of meta-optimization.
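For contrast with the control viewpoint, here is hypergradient descent, the local method mentioned above, in its simplest form: the learning rate is itself updated by a gradient step, using the identity d f(x_t)/d η = −∇f(x_t)·∇f(x_{t−1}) (a minimal sketch on a toy quadratic; all constants are ours):

```python
import numpy as np

# toy quadratic f(x) = 0.5 x^T A x with condition number 10
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
eta, beta = 0.01, 1e-4            # initial learning rate, hyper-learning rate
g_prev = np.zeros_like(x)
for _ in range(200):
    g = grad(x)
    # gradient step on eta itself, using d f(x_t)/d eta = -g_t . g_{t-1}
    eta += beta * float(g @ g_prev)
    x = x - eta * g
    g_prev = g
print(f"learned eta = {eta:.4f}, f(x) = {0.5 * x @ A @ x:.2e}")
```

Because this is a local update on a non-convex hyperparameter landscape, it carries no global guarantee, which is precisely the gap the meta-optimization framework addresses.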
Aryan Mokhtari: Online Learning Guided Quasi-Newton Methods: Improved Global Non-asymptotic Guarantees
Quasi-Newton (QN) methods are popular iterative algorithms known for their superior practical performance compared to Gradient Descent (GD)-type methods. However, the existing theoretical results for this class of algorithms do not sufficiently justify their advantage over GD-type methods. In this talk, we discuss our recent efforts to address this issue. Specifically, in the strongly convex setting, we propose the first “globally” convergent QN method that achieves an explicit “non-asymptotic superlinear” rate. We show that the rate presented for our method is provably faster than GD after at most $O(d)$ iterations, where $d$ is the problem dimension. Additionally, in the convex setting, we present an accelerated variant of our proposed method that provably outperforms the accelerated gradient method and converges at a rate of $O(\min\{1/k^2, \sqrt{d \log k }/ k^{2.5}\})$, where $k$ is the number of iterations. To attain these results, we diverge from conventional approaches and construct our QN methods based on the Hybrid Proximal Extragradient (HPE) framework and its accelerated variants. Furthermore, a pivotal algorithmic concept underpinning our methodologies is an online learning framework for updating the Hessian approximation matrices. Specifically, we relate our method's convergence rate to the regret of a specific online convex optimization problem in the matrix space and choose the sequence of Hessian approximation matrices to minimize its overall regret.
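One way to picture the online-learning component, schematically (our illustration; the paper's actual loss functions may differ): each step generates a secant pair, and the Hessian approximations are chosen by an online algorithm over matrices:

```latex
s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k), \qquad
\ell_k(B) = \| y_k - B\,s_k \|^2,
```

with the cumulative regret \sum_k \ell_k(B_k) - \min_B \sum_k \ell_k(B) entering the final convergence rate.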
Ashia Wilson: Two Approaches Towards Adaptive Optimization
This talk will address two recent projects I am excited about. The first concerns efficient methodologies for hyperparameter estimation in optimization algorithms; I will describe two approaches for adaptively estimating these parameters that often lead to significant improvements in convergence. The second describes a new method, the Metropolis-Adjusted Preconditioned Langevin Algorithm, for sampling from a convex body. Taking an optimization perspective, I focus on the mixing-time guarantees of these algorithms, an essential theoretical property for MCMC methods, under natural conditions on the target distribution and the geometry of the domain.
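For readers unfamiliar with the second project's template, a Metropolis-adjusted preconditioned Langevin step has the generic form below (the standard MALA-with-preconditioner template, not necessarily the exact algorithm of the talk):

```latex
% Proposal (pi proportional to exp(-f) on the domain, P a preconditioner):
y = x - \eta\, P(x)\,\nabla f(x) + \sqrt{2\eta}\; P(x)^{1/2}\, \xi, \qquad \xi \sim N(0, I),
% accepted with probability  min\{ 1,\; \pi(y)\, q(x \mid y) / (\pi(x)\, q(y \mid x)) \},
```

where q is the proposal density; the Metropolis filter removes the discretization bias of the bare Langevin step.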
Sewoong Oh: Private Fine-Tuning of Language Models without Backpropagation
Recent advances in training large language models (LLMs) with privacy reveal interesting phenomena that are specific to private optimization. One prominent observation is that the fine-tuning landscape of LLMs exhibits low-dimensional structure. We exploit this to design a novel memory-efficient private optimization method, which we call DPZero, that bypasses memory-heavy backpropagation and uses only forward passes. On fine-tuning a RoBERTa model with 355M parameters on text classification tasks, we obtain an 8x reduction in memory usage with little loss in performance. Our theoretical analysis explains how DPZero benefits from low-dimensional structure to achieve a dimension-independent guarantee. This is based on joint work (https://arxiv.org/abs/2310.09639) with Liang Zhang, Kiran Koshy Thekumparampil, and Niao He.
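A minimal sketch of the forward-pass-only mechanism as we read it from the abstract: perturb the parameters along a random direction, estimate a directional derivative from two forward passes, then clip and noise that single scalar per example (function names, defaults, the noise calibration, and the toy objective are all ours; see the linked paper for the actual DPZero algorithm and its privacy accounting):

```python
import numpy as np

def dpzero_step(theta, per_example_loss, lr=0.1, mu=1e-3, clip=1.0,
                sigma=1.0, rng=None):
    """One zeroth-order, differentially private update (schematic).

    per_example_loss(theta) -> vector of per-example losses, computed by
    forward passes only. The gradient estimate is (scalar) * u, so only
    the 1-D scalar statistic is clipped and noised: the privacy noise
    does not scale with the number of parameters.
    """
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(theta.shape)
    u /= np.linalg.norm(u)
    # two forward passes, no backpropagation
    d = (per_example_loss(theta + mu * u)
         - per_example_loss(theta - mu * u)) / (2 * mu)
    d = np.clip(d, -clip, clip)          # bound each example's contribution
    noise = (sigma * clip / d.size) * rng.standard_normal()
    return theta - lr * (d.mean() + noise) * u

# toy usage: a least-squares objective standing in for a fine-tuning loss
rng = np.random.default_rng(1)
X = rng.standard_normal((64, 5))
w_true = np.arange(5.0)
y = X @ w_true
per_example_loss = lambda w: (X @ w - y) ** 2
w = np.zeros(5)
for _ in range(2000):
    w = dpzero_step(w, per_example_loss, rng=rng)
print(np.round(w, 2))
```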
Sophie Yu: Constant Regret Primal-Dual Policy for Multi-Way Dynamic Matching
We study a discrete-time dynamic multi-way matching model. There are finitely many agent types that arrive stochastically and wait to be matched. State-of-the-art dynamic matching policies in the literature require knowledge of all system parameters to determine an optimal basis of the fluid relaxation, and focus on controlling the number of waiting agents using only matches within the optimal basis (Kerimov et al., 2024, 2023; Gupta, 2024). In this paper, we propose a primal-dual policy that schedules matches for future arrivals based on an estimator of the dual solution. Our policy does not require knowledge of the arrival rates and operates with greater flexibility, as it does not restrict matches to the match types within an optimal basis. We show that our policy is the first to achieve constant regret at all times under unknown arrival rates, and that when the arrival rates are known, it achieves the optimal scaling given by the lower bound described in Kerimov et al. (2024, 2023). Furthermore, when the arrival rates are known, the primal-dual policy significantly outperforms alternative dynamic matching policies in several numerical simulations.
Coauthors: Yehua Wei and Jiaming Xu.
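For context, the fluid relaxation that such policies build on has the generic form below (our schematic, not the paper's exact formulation): x_m is the long-run rate of performing match type m, r_m its reward, and λ_i the arrival rate of agent type i:

```latex
\max_{x \ge 0} \; \sum_{m} r_m\, x_m
\qquad \text{s.t.} \qquad \sum_{m \,\ni\, i} x_m \le \lambda_i \quad \text{for every agent type } i.
```

The dual variables of these capacity constraints are what the proposed policy estimates online, in place of the optimal basis that prior policies compute from known arrival rates.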