FOLDS seminar (new!)
Foundations of Optimization, Learning, and Data Science



Schedule for Fall 2025

Date Speaker Affiliation Title
Sep 4 Joel Tropp Caltech Positive random walks and positive-semidefinite random matrices
Sep 11 Rina Barber UChicago Algorithmic stability for regression and classification
Sep 18 Jong-Shi Pang USC Heaviside Composite Optimization: A new paradigm of optimization
Sep 25 Joan Bruna NYU Propagation-of-Chaos in Shallow Neural Networks beyond Logarithmic Time
Oct 2 Yury Polyanskiy MIT Theory and practice of LLM quantization
Oct 9 [Fall Break] ------------ ------------
Oct 16 (room AGH 105A) Adam Klivans UT Austin A New Paradigm for Learning with Distribution Shift
Oct 23 Pratik Chaudhari UPenn An Information Geometric Understanding of Deep Learning
Oct 30 Nati Srebro TTIC Weak to Strong Generalization in Random Feature Models
Nov 6 Zhimei Ren UPenn ACS: An interactive framework for machine-assisted selection with model-free guarantees
Nov 13 Eva Tardos Cornell Learning in Strategic Queuing
Nov 20 Rob Nowak UW Madison





Abstracts

Joel Tropp: Positive random walks and positive-semidefinite random matrices

On the real line, a random walk that can only move in the positive direction is very unlikely to remain close to its starting point. After a fixed number of steps, the left tail has a Gaussian profile, under minimal assumptions. Remarkably, the same phenomenon occurs when we consider a positive random walk on the cone of positive-semidefinite matrices. After a fixed number of steps, the minimum eigenvalue is also described by a Gaussian model.

This talk introduces a new way to make this intuition rigorous. The methodology provides the solution to an open problem in computational mathematics about sparse random embeddings. The presentation is targeted at a general mathematical audience. Preprint: https://arxiv.org/abs/2501.16578
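As a rough illustration of the matrix phenomenon (a simulation sketch only, not the methodology of the talk), the snippet below builds a rank-one PSD "random walk" S = (1/n) sum of g g^T with Gaussian steps and records its minimum eigenvalue across trials; the choices of dimension, step count, and step distribution are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, trials = 10, 200, 2000

# Sum of n i.i.d. rank-one PSD steps (1/n) * g g^T; record the minimum eigenvalue.
min_eigs = []
for _ in range(trials):
    G = rng.standard_normal((n, d))
    S = G.T @ G / n                        # PSD "random walk" after n steps
    min_eigs.append(np.linalg.eigvalsh(S)[0])  # eigvalsh returns ascending order

min_eigs = np.array(min_eigs)
print(min_eigs.mean(), min_eigs.std())
```

In this toy setup the minimum eigenvalue stays bounded away from zero, mirroring the intuition that a "positive" walk is unlikely to remain near its starting point.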


Rina Barber: Algorithmic stability for regression and classification

In a supervised learning setting, a model fitting algorithm is unstable if small perturbations to the input (the training data) can often lead to large perturbations in the output (say, predictions returned by the fitted model). Algorithmic stability is a desirable property with many important implications such as generalization and robustness, but testing the stability property empirically is known to be impossible in the setting of complex black-box models. In this work, we establish that bagging any black-box regression algorithm automatically ensures that stability holds, with no assumptions on the algorithm or the data. Furthermore, we construct a new framework for defining stability in the context of classification, and show that using bagging to estimate our uncertainty about the output label will again allow stability guarantees for any black-box model. This work is joint with Jake Soloff and Rebecca Willett.
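To make the bagging construction concrete, here is a minimal sketch (not the paper's algorithm or its guarantees): a deliberately unstable, hypothetical black box, 1-nearest-neighbor regression, is averaged over bootstrap resamples of the training data, so its prediction at a query point is far less sensitive to any single training sample:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical unstable black-box regressor: 1-nearest-neighbor.
def fit_predict(X, y, x0):
    return y[np.argmin(np.abs(X - x0))]

X = rng.uniform(0, 1, 50)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(50)
x0 = 0.5

# Bagging: average the black box over B bootstrap resamples of the data.
B = 200
preds = []
for _ in range(B):
    idx = rng.integers(0, len(X), len(X))
    preds.append(fit_predict(X[idx], y[idx], x0))
bagged = np.mean(preds)
print(bagged)
```

The same wrapper works for any `fit_predict`, which is the point: nothing about the black box is assumed.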


Jong-Shi Pang: Heaviside Composite Optimization: A new paradigm of optimization

A Heaviside function is the indicator function of a semi-infinite interval. A Heaviside composite function is a Heaviside function composed with a multivariate function that may be nonsmooth. This talk presents an introduction to this novel class of discontinuous optimization problems, which borders on continuous and discrete optimization. Among the rich applications of such problems, we mention one pertaining to classification and treatment with constraints, solvable by an elementary yet novel progressive integer programming (PIP) method that applies broadly to many other problems, including indefinite quadratic programs and linear programs with linear complementarity constraints. Computational results demonstrate the excellent performance of the PIP idea and its superiority over a full integer programming approach.


Joan Bruna: Propagation-of-Chaos in Shallow Neural Networks beyond Logarithmic Time

The analysis of gradient-based learning of neural networks remains an outstanding challenge, even for the simplest shallow architectures. A powerful mathematical framework that has emerged over recent years lifts the optimization problem to the space of probability measures, and captures important empirical phenomena such as the 'blessing of overparametrization'. However, the resulting learning guarantees remain mostly qualitative. In this talk, we study the fluctuations between idealized mean-field dynamics and polynomially-sized networks over suitable time horizons, the so-called Propagation of Chaos (PoC). We provide a novel analysis that goes beyond traditional Gronwall-based PoC arguments by exploiting certain geometric properties of the optimization landscape, and apply these results to representative models such as single-index models, establishing polynomial learning guarantees. Joint work with Margalit Glasgow and Denny Wu.


Yury Polyanskiy: Theory and practice of LLM quantization

Modern LLMs process information by repeatedly applying a basic primitive: matrix multiplication. Estimates suggest that about 60-84% of the energy consumed by LLMs goes into memory load/store operations. How can we reduce this power consumption? Tokens start as roughly 16-bit integers but get mapped to vectors of floats with lengths in the thousands, suggesting very low information density per dimension. Thus, unsurprisingly, there has been much success in reducing the precision of both weights and activations without much loss in LLM performance. In this talk we will present an information-theoretic analysis of quantized representations and show how it led us to create NestQuant, a new SOTA algorithm for quantizing weights, KV-cache, and activations (ICML'2025).
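For context on what "reducing precision" means here, the sketch below implements a standard per-row symmetric uniform quantization baseline (a common reference point, not the NestQuant algorithm) and measures the reconstruction error of a random weight matrix at 4 bits; the matrix size and bit width are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)

def quantize_uniform(w, bits=4):
    # Per-row symmetric uniform quantization: scale each row so its
    # largest entry maps to the largest representable integer level.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / levels
    q = np.round(w / scale).clip(-levels, levels)
    return q.astype(np.int8), scale

q, scale = quantize_uniform(W, bits=4)
W_hat = q * scale  # dequantize
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative reconstruction error at 4 bits: {rel_err:.3f}")
```

Storing `q` as int8 plus one float scale per row is what cuts the memory traffic; the information-theoretic question in the talk is how much better one can do than this baseline at a given bit budget.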


Adam Klivans: A New Paradigm for Learning with Distribution Shift

We revisit the fundamental problem of learning with distribution shift, where a learner is given labeled samples from a training distribution D, unlabeled samples from a test distribution D′, and is asked to output a classifier with low test error. The standard approach in this setting is to prove a generalization bound in terms of some notion of distance between D and D′. These distances, however, are difficult to compute, and this has been the main stumbling block for efficient algorithm design over the last two decades.

We sidestep this issue and define a new model called TDS learning, where a learner runs a test on the training set and is allowed to reject if this test detects distribution shift relative to a fixed output classifier. This approach leads to the first set of efficient algorithms for learning with distribution shift that make no assumptions on the test distribution. Finally, we discuss how our techniques have recently been used to solve longstanding problems in supervised learning with contamination.


Pratik Chaudhari: An Information Geometric Understanding of Deep Learning

I will argue that properties of natural data are what predominantly make deep networks so effective. To that end, I will show that deep networks work well because of a characteristic structure in the space of learnable tasks. The input correlation matrix for typical tasks has a “sloppy” eigenspectrum where eigenvalues decay linearly on a logarithmic scale. As a consequence, the Hessian and the Fisher Information Matrix of a trained network also have a sloppy eigenspectrum. Using this idea, I will demonstrate an analytical, non-vacuous PAC-Bayes bound on the generalization error for general deep networks.

I will show that the training process in deep learning explores a remarkably low dimensional manifold, as low as three. Networks with a wide variety of architectures, sizes, optimization and regularization methods lie on the same manifold. Networks being trained on different tasks (e.g., different subsets of ImageNet) using different methods (e.g., supervised, transfer, meta, semi and self-supervised learning) also lie on the same low-dimensional manifold.

I will show that typical tasks are highly redundant functions of their inputs. Many perception tasks, from visual recognition, semantic segmentation, optical flow, and depth estimation to vocalization discrimination, can be predicted extremely well regardless of whether the data is projected onto the principal subspace where it varies the most, an intermediate subspace with moderate variability, or the bottom subspace where it varies the least.


Nati Srebro: Weak to Strong Generalization in Random Feature Models

Weak-to-Strong Generalization (Burns et al., 2023) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon requires neither a strong and complex learner like GPT-4 nor pre-training. We consider students and teachers that are random feature models, described by two-layer networks with a random, fixed bottom layer and a trained top layer. A 'weak' teacher, with a small number of units (i.e. random features), is trained on the population, and a 'strong' student, with a much larger number of units, is trained only on labels generated by the weak teacher. We demonstrate, prove, and explain how the student can outperform the teacher, even though it is trained only on data labeled by the teacher, with no pre-training or other knowledge or data advantage over the teacher. We explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model. Joint work with Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora and Zhiyuan Li.
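A minimal sketch of the random feature setup described above, with hypothetical choices throughout (ReLU features, a toy target function, and ridge regularization as a crude stand-in for early stopping); it wires up a weak teacher trained on true labels and a strong student trained only on teacher labels, without claiming to reproduce the quantitative results of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

def random_features(X, W):
    # Fixed random bottom layer with ReLU; only the top layer is trained.
    return np.maximum(X @ W, 0.0)

def fit_top(F, y, reg=1e-2):
    # Ridge-regularized least squares for the top layer
    # (a crude proxy for early stopping, which the talk identifies as key).
    return np.linalg.solve(F.T @ F + reg * np.eye(F.shape[1]), F.T @ y)

X_train = rng.standard_normal((500, d))
X_test = rng.standard_normal((2000, d))
target = lambda X: np.sign(X[:, 0] * X[:, 1])  # toy nonlinear rule

# Weak teacher: 20 random features, trained on true labels.
W_t = rng.standard_normal((d, 20))
a_t = fit_top(random_features(X_train, W_t), target(X_train))
teacher_labels = np.sign(random_features(X_train, W_t) @ a_t)

# Strong student: 500 random features, trained ONLY on teacher labels.
W_s = rng.standard_normal((d, 500))
a_s = fit_top(random_features(X_train, W_s), teacher_labels)

acc = lambda W, a: np.mean(np.sign(random_features(X_test, W) @ a) == target(X_test))
print(f"teacher test accuracy: {acc(W_t, a_t):.3f}")
print(f"student test accuracy: {acc(W_s, a_s):.3f}")
```

Whether the student beats the teacher in this toy run depends on the regularization and feature counts; the talk characterizes exactly when and by how much.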


Zhimei Ren: ACS: An interactive framework for machine-assisted selection with model-free guarantees

In this talk, I will introduce adaptive conformal selection (ACS), an interactive framework for model-free selection with guaranteed error control. Building on conformal selection (Jin and Candès, 2023b), ACS generalizes the approach to support human-in-the-loop adaptive data analysis. Under the ACS framework, we can partially reuse the data to boost the selection power, make decisions on the fly while exploring the data, and incorporate new information or preferences as they arise. The key to ACS is a carefully designed principle that controls the information available for decision making, allowing the data analyst to explore the data adaptively while maintaining rigorous control of the false discovery rate (FDR). Based on the ACS framework, we provide concrete selection algorithms for various goals, including model update/selection, diversified selection, and incorporating newly available labeled data. The effectiveness of ACS is demonstrated through extensive numerical simulations and real-data applications in large language model (LLM) deployment and drug discovery.


Eva Tardos: Learning in Strategic Queuing

Over the last two decades we have developed a good understanding of how to quantify the impact of strategic user behavior on outcomes in many games (including traffic routing and online auctions), and have shown that the resulting bounds extend to repeated games, assuming players use a form of learning (no-regret learning) to adapt to the environment. However, these results assume that there are no carry-over effects between rounds: outcomes in one round have no effect on the game in the future. Many repeated games have an evolving state with direct carry-over effects, such as repeated auctions with budgets, as well as queuing systems. In this talk we study this phenomenon in the context of a game modeling queuing systems: routers compete for servers, and packets that do not get served must be resent, so the number of packets at each round depends on the routers' success in previous rounds. We study the excess server capacity required to guarantee that all packets get served in two different queuing systems (with or without buffers), despite the selfish (myopic) behavior of the participants.
