We revisit the fundamental problem of learning with distribution shift, where a learner is given labeled samples from a training distribution D and unlabeled samples from a test distribution D′, and is asked to output a classifier with low test error. The standard approach in this setting is to prove a generalization bound in terms of some notion of distance between D and D′. These distances, however, are difficult to compute, and this has been the main stumbling block for efficient algorithm design over the last two decades.
We sidestep this issue and define a new model called TDS learning, in which a learner runs a test on the training set and is allowed to reject if this test detects distribution shift relative to a fixed output classifier. This approach leads to the first set of efficient algorithms for learning with distribution shift that make no assumptions on the test distribution. Finally, we discuss how our techniques have recently been used to solve longstanding problems in supervised learning with contamination.
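To convey the flavor of a tester of this kind (an illustrative sketch, not an algorithm from the work; the moment-matching criterion, tolerance, and function names are all assumptions), one can compare low-degree empirical moments of the training inputs against the unlabeled test inputs and reject when they disagree:

```python
import numpy as np

def moment_test(train_x, test_x, degree=2, tol=0.2):
    """Reject if low-degree empirical moments of the training inputs
    and the (unlabeled) test inputs differ by more than tol in any
    coordinate; otherwise accept."""
    for d in range(1, degree + 1):
        m_train = np.mean(train_x ** d, axis=0)
        m_test = np.mean(test_x ** d, axis=0)
        if np.max(np.abs(m_train - m_test)) > tol:
            return "REJECT"
    return "ACCEPT"

rng = np.random.default_rng(0)
# No shift: both samples come from the same standard Gaussian.
same = moment_test(rng.normal(size=(5000, 3)), rng.normal(size=(5000, 3)))
# Mean shift: the first and second moments move, so the test rejects.
shifted = moment_test(rng.normal(size=(5000, 3)),
                      rng.normal(loc=1.0, size=(5000, 3)))
```

A tester that accepts certifies only that the test distribution matches the training distribution on the statistics being checked, which is the sense in which rejection is tied to a fixed output classifier.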
I will argue that properties of natural data are what predominantly make deep networks so effective. To that end, I will show that deep networks work well because of a characteristic structure in the space of learnable tasks. The input correlation matrix for typical tasks has a “sloppy” eigenspectrum where eigenvalues decay linearly on a logarithmic scale. As a consequence, the Hessian and the Fisher Information Matrix of a trained network also have a sloppy eigenspectrum. Using this idea, I will demonstrate an analytical, non-vacuous PAC-Bayes bound on the generalization error for general deep networks.
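As a toy illustration of what a “sloppy” eigenspectrum means (synthetic data; the decay rate 0.8 is an arbitrary assumption, not a measured value): eigenvalues that decay linearly on a logarithmic scale are geometric in the index, so a linear fit of log-eigenvalue against index recovers the decay rate.

```python
import numpy as np

# Synthetic "sloppy" spectrum: lambda_k = lambda_0 * c**k with c < 1,
# so log(lambda_k) is a straight line in the index k.
k = np.arange(50)
lam = 1.0 * 0.8 ** k

# A linear fit of log(lambda) vs. index recovers the decay rate.
slope, intercept = np.polyfit(k, np.log(lam), 1)
# slope is log(0.8) and intercept is 0, up to floating point.
```

On real input correlation matrices one would compute `lam` from the data's eigendecomposition and check how well this linear fit holds.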
I will show that the training process in deep learning explores a remarkably low-dimensional manifold, with a dimension as low as three. Networks with a wide variety of architectures, sizes, and optimization and regularization methods lie on the same manifold. Networks trained on different tasks (e.g., different subsets of ImageNet) using different methods (e.g., supervised, transfer, meta-, semi-, and self-supervised learning) also lie on the same low-dimensional manifold.
I will show that typical tasks are highly redundant functions of their inputs. Many perception tasks, from visual recognition, semantic segmentation, optical flow, and depth estimation to vocalization discrimination, can be predicted extremely well regardless of whether the data are projected onto the principal subspace where they vary the most, an intermediate subspace with moderate variability, or the bottom subspace where they vary the least.
Nati Srebro: Weak to Strong Generalization in Random Feature Models
Weak-to-Strong Generalization (Burns et al., 2023) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon requires neither a strong and complex learner like GPT-4 nor pre-training. We consider students and teachers that are random feature models, described by two-layer networks with a random, fixed bottom layer and a trained top layer. A ‘weak’ teacher, with a small number of units (i.e., random features), is trained on the population, and a ‘strong’ student, with a much larger number of units, is trained only on labels generated by the weak teacher. We demonstrate, prove, and analyze how the student can outperform the teacher, even though it is trained only on data labeled by the teacher, with no pre-training or other knowledge or data advantage. We explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model.
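The setup lends itself to a small numpy sketch (a toy version, not the paper's exact construction; the target function, sample sizes, feature counts, and step counts below are all illustrative assumptions): a small random-feature teacher is fit on a large "population" sample, and a much larger random-feature student is fit by early-stopped gradient descent on teacher labels alone.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 10, 2000, 2000

def features(x, W):
    return np.maximum(x @ W.T, 0.0)  # fixed random ReLU features

# Hypothetical ground-truth target: a wide random-feature function.
W_true = rng.normal(size=(200, d))
a_true = rng.normal(size=200) / np.sqrt(200)
f = lambda x: features(x, W_true) @ a_true

x_pop = rng.normal(size=(20000, d))   # large proxy for the population
x_tr = rng.normal(size=(n_train, d))  # student's training inputs
x_te = rng.normal(size=(n_test, d))

# Weak teacher: few random features, least squares on the population proxy.
W_t = rng.normal(size=(20, d))
a_t = np.linalg.lstsq(features(x_pop, W_t), f(x_pop), rcond=None)[0]
teacher = lambda x: features(x, W_t) @ a_t

# Strong student: many random features, gradient descent fit to *teacher*
# labels only, stopped after a few steps (early stopping).
W_s = rng.normal(size=(2000, d))
Phi_s = features(x_tr, W_s)
a_s = np.zeros(2000)
lr = 1.0 / np.linalg.norm(Phi_s, 2) ** 2  # stable step size
for _ in range(50):
    a_s += lr * Phi_s.T @ (teacher(x_tr) - Phi_s @ a_s)

# Test error of teacher and student against the true target.
err = lambda g: np.mean((g(x_te) - f(x_te)) ** 2)
teacher_err = err(teacher)
student_err = err(lambda x: features(x, W_s) @ a_s)
```

In the paper's regime, the interplay between the student's capacity and the early-stopping time is what lets the student's test error against the true target fall below the teacher's; this sketch only sets up the comparison.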
Joint work with Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora and Zhiyuan Li.
Zhimei Ren: ACS: An interactive framework for machine-assisted selection with model-free guarantees
In this talk, I will introduce adaptive conformal selection (ACS), an interactive framework for model-free selection with guaranteed error control. Building on conformal selection (Jin and Candès, 2023b), ACS generalizes the approach to support human-in-the-loop adaptive data analysis. Under the ACS framework, we can partially reuse the data to boost the selection power, make decisions on the fly while exploring the data, and incorporate new information or preferences as they arise. The key to ACS is a carefully designed principle that controls the information available for decision making, allowing the data analyst to explore the data adaptively while maintaining rigorous control of the false discovery rate (FDR). Based on the ACS framework, we provide concrete selection algorithms for various goals, including model update/selection, diversified selection, and incorporating newly available labeled data. The effectiveness of ACS is demonstrated through extensive numerical simulations and real-data applications in large language model (LLM) deployment and drug discovery.
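For intuition about the base procedure that ACS builds on (a minimal sketch in the style of conformal selection; the scores, sample sizes, and FDR level below are illustrative assumptions), conformal p-values are computed against a labeled calibration set and then fed to the Benjamini-Hochberg procedure:

```python
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    """p_j = (1 + #{calibration scores >= test score j}) / (n + 1);
    a small p_j suggests test unit j has an unusually large score."""
    n = len(cal_scores)
    return np.array([(1 + np.sum(cal_scores >= s)) / (n + 1)
                     for s in test_scores])

def benjamini_hochberg(p, q=0.1):
    """Indices selected by the BH step-up procedure at level q."""
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.where(below)[0])
    return order[: k + 1]

rng = np.random.default_rng(1)
cal = rng.normal(size=1000)  # scores of labeled calibration units
# 90 null test units plus 10 units with clearly elevated scores.
test = np.concatenate([rng.normal(size=90), rng.normal(loc=4.0, size=10)])
p = conformal_pvalues(cal, test)
selected = benjamini_hochberg(p, q=0.1)
```

ACS's contribution is to let the analyst revisit and adapt this pipeline interactively, with a designed restriction on the information revealed at each step so that FDR control survives the adaptivity.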
Lénaïc Chizat: The Hidden Width of Deep ResNets
We present a mathematical framework to analyze the training dynamics of deep ResNets that rigorously captures practical architectures (including Transformers) trained from standard random initializations. Our approach combines stochastic approximation of ODEs with propagation-of-chaos arguments to obtain tight convergence rates to the “infinite size” limit of the dynamics. It yields the following insights:
1/ Depth begets width: infinite-depth ResNets of any hidden width behave throughout training as if they were infinitely wide;
2/ Phase diagram: we derive the phase diagram of the training dynamics, which singles out an “ideal” scaling of hyper-parameters (initialization scale and learning rates), extending “CompleteP” to more general architectures;
3/ Optimal shape scaling: our analysis suggests how to scale depth, hidden width and embedding dimension of a ResNet when scaling up parameter count. With the optimal shape and a parameter budget P, we argue that the model converges to its limiting dynamics at rate P^{-1/6}.
Eva Tardos: Learning in Strategic Queuing
Over the last two decades we have developed a good understanding of how to quantify the impact of strategic user behavior on outcomes in many games (including traffic routing and online auctions), and have shown that the resulting bounds extend to repeated games assuming players use a form of learning (no-regret learning) to adapt to the environment. However, these results assume that there are no carry-over effects between rounds: outcomes in one round have no effect on the game in the future. Many repeated games have an evolving state that creates a direct carry-over effect, such as repeated auctions with budgets, as well as queuing systems. In this talk we will study this phenomenon in the context of a game modeling queuing systems: routers compete for servers, and packets that do not get served need to be resent, resulting in a system where the number of packets at each round depends on the success of the routers in previous rounds. We study the excess server capacity required to guarantee that all packets get served in two different queuing systems (with or without buffers), despite the selfish (myopic) behavior of the participants.
Rob Nowak: Function Space Perspectives on Neural Networks
This talk reviews a theory of the functions learned by neural networks with Rectified Linear Unit (ReLU) activations. At its core is the observation that deep ReLU networks can be characterized as solutions to data-fitting problems in certain infinite-dimensional function spaces. The solutions are compositions of functions from Banach spaces of second-order bounded variation, defined in the Radon transform domain. Functions in these spaces exhibit strong smoothness in most directions, making them a natural setting for adapting to intrinsic low-dimensional structure in data. Moreover, the norms in these spaces are closely tied to the size of neural network weights, providing a direct connection between function complexity and network parametrization. In particular, the total variation norm provides an analytic tool for identifying functions that cannot be realized by shallow networks, thereby yielding a precise characterization of depth separation. Representer theorems reveal that the solutions are sparse in the number of active neurons per layer. Sparsity provides a principled path to network compression, yet some sparse solutions can suffer from poor generalization. The theory suggests new training strategies to promote solutions that generalize more robustly.
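To illustrate neuron-level sparsity and its link to compression (an illustrative proxy, not the precise norm from the theory; all shapes and names below are assumptions), inactive neurons of a shallow ReLU network contribute nothing to either the function or a weight-based cost, so pruning them is exact:

```python
import numpy as np

rng = np.random.default_rng(0)
# Shallow ReLU network f(x) = sum_i v_i * relu(w_i . x + b_i).
W = rng.normal(size=(50, 3))
b = rng.normal(size=50)
v = rng.normal(size=50)
v[10:] = 0.0  # a sparse solution: only the first 10 neurons are active

f = lambda x: np.maximum(x @ W.T + b, 0.0) @ v

# Weight-based proxy for representational cost: sum of |v_i| * ||w_i||.
# Inactive neurons contribute zero, so pruning them leaves both the
# cost and the computed function unchanged.
cost = np.sum(np.abs(v) * np.linalg.norm(W, axis=1))

x = rng.normal(size=(5, 3))
pruned = np.maximum(x @ W[:10].T + b[:10], 0.0) @ v[:10]
```

The representer theorems in the talk guarantee that solutions with this kind of neuron sparsity exist, which is why pruning can serve as a principled compression step.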