FOLDS seminar (new!)
Foundations of Optimization, Learning, and Data Science


Schedule for Spring 2026

Date Speaker Affiliation Title
Feb 5 Elad Hazan Princeton / Google Provably Efficient Learning in Nonlinear Dynamical Systems via Spectral Transformers
Feb 12 Jiaoyang Huang Penn Fast Convergence of High-Order ODE Solvers for Diffusion Models
Feb 19 Yuxin Chen Penn Transformers Meet In-Context Learning: A Universal Approximation Theory
Feb 26 Daniel Hsu Columbia Multi-step reasoning via curriculum learning
Mar 5 Paris Perdikaris Penn Optimization Challenges in Physics-Informed Neural Networks
Mar 12 [Spring Break] ------------ ------------
Mar 19 Mehryar Mohri NYU / Google Coherence Mechanisms for Provable Self-Improvement
Mar 23 (Monday, in Singh Center) Rachel Cummings Columbia Differentially Private Space-Efficient Algorithms for Counting Distinct Elements in the Turnstile Model
Apr 2 Maryam Fazel UW Global Convergence of Gradient EM for Over-Parameterized Gaussian Mixtures
Apr 9 Weijie Su Penn Surrogate-Model Approaches to Optimizers for LLM Training
Apr 15 (joint with ASSET seminar) Misha Belkin UCSD From kernel machines to the linear representation hypothesis for monitoring and steering LLMs
Apr 23 Shivani Agarwal Penn





Abstracts

Elad Hazan: Provably Efficient Learning in Nonlinear Dynamical Systems via Spectral Transformers

Learning in dynamical systems is a fundamental challenge underlying modern sequence modeling. Despite extensive study, efficient algorithms with formal guarantees for general nonlinear systems have remained elusive. This talk presents a provably efficient framework for learning in any bounded and Lipschitz nonlinear dynamical system, establishing the first sublinear regret guarantees in a dimension-free setting. Our approach combines Koopman lifting, Luenberger observers, and, crucially, spectral filtering to show that nonlinear dynamics are learnable. These insights motivate a new neural architecture, the Spectral Transform Unit (STU), which achieves state-of-the-art performance on language modeling, dynamical systems, and differential equation benchmarks.
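As background for the spectral-filtering component, the classical construction (from earlier work on spectral filtering for linear dynamical systems) takes the top eigenvectors of a fixed Hankel matrix with entries 2 / ((i+j)^3 - (i+j)) and uses them as fixed convolutional filters. A minimal sketch of that construction, not the STU architecture itself:

```python
import numpy as np

def spectral_filters(T, k):
    """Top-k eigenvalues/eigenvectors of the Hankel matrix
    Z_{ij} = 2 / ((i+j)^3 - (i+j)), with indices starting at 1.
    These eigenvectors serve as fixed filters in classical spectral
    filtering; this is background, not the STU itself."""
    i = np.arange(1, T + 1)
    s = i[:, None] + i[None, :]          # s = i + j
    Z = 2.0 / (s ** 3 - s)               # positive-definite Hankel matrix
    eigvals, eigvecs = np.linalg.eigh(Z) # ascending order from eigh
    return eigvals[::-1][:k], eigvecs[:, ::-1][:, :k]
```

The eigenvalues of this matrix decay geometrically, which is why a small number of filters suffices.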


Jiaoyang Huang: Fast Convergence of High-Order ODE Solvers for Diffusion Models

Score-based diffusion models can be sampled efficiently by reformulating the reverse dynamics as a deterministic probability flow ODE and integrating it with high-order solvers. Since the score function is typically approximated by a neural network, the overall sampling accuracy depends on the interplay between score regularity, approximation error, and numerical integration error. In this talk, we study the convergence of deterministic probability-flow-ODE samplers, focusing on high-order (exponential) Runge–Kutta schemes. Under mild regularity assumptions—specifically, bounded first and second derivatives of the approximate score—we bound the total variation distance between the target distribution and the generated distribution by the sum of a score-approximation term and a p-th order step-size term, explaining why accurate sampling is achievable with only a few solver steps. We also empirically validate the regularity assumptions on benchmark datasets. Our guarantees apply to general forward diffusion processes with arbitrary variance schedules.
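As background, a probability-flow ODE can be integrated backward in time with a higher-order solver. A toy sketch under strong simplifying assumptions: a 1-D variance-exploding process with standard Gaussian data, so the exact score -x/(1+t^2) stands in for a learned score network, and Heun's second-order method stands in for the exponential Runge–Kutta schemes of the talk:

```python
import numpy as np

def score(t, x):
    # Forward process x_t = x_0 + t*z with x_0 ~ N(0, 1): the marginal is
    # N(0, 1 + t^2), so the exact score is -x / (1 + t^2). (Illustrative
    # stand-in for a learned neural score network.)
    return -x / (1.0 + t ** 2)

def pf_ode_drift(t, x):
    # Probability-flow ODE for this process: dx/dt = -t * score(t, x)
    return -t * score(t, x)

def heun_sample(x_T, t_start=1.0, t_end=0.0, n_steps=20):
    """Integrate the probability-flow ODE backward with second-order Heun."""
    ts = np.linspace(t_start, t_end, n_steps + 1)
    x = np.asarray(x_T, dtype=float)
    for t0, t1 in zip(ts[:-1], ts[1:]):
        h = t1 - t0                      # negative step: backward in time
        d0 = pf_ode_drift(t0, x)
        x_pred = x + h * d0              # Euler predictor
        d1 = pf_ode_drift(t1, x_pred)
        x = x + 0.5 * h * (d0 + d1)      # trapezoidal corrector
    return x
```

For this toy process the ODE has the closed-form solution x(t) ∝ sqrt(1 + t^2), so the solver's second-order accuracy can be checked directly against x(0) = x(1)/sqrt(2).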


Yuxin Chen: Transformers Meet In-Context Learning: A Universal Approximation Theory

Large language models are capable of in-context learning, the ability to perform new tasks at test time using a handful of input-output examples, without parameter updates. We develop a universal approximation theory to elucidate how transformers enable in-context learning. For a general class of functions (each representing a distinct task), we demonstrate how to construct a transformer that, without any further weight updates, can predict based on a few noisy in-context examples with vanishingly small risk. Unlike prior work that frames transformers as approximators of optimization algorithms (e.g., gradient descent) for statistical learning tasks, we integrate Barron's universal function approximation theory with the algorithm approximator viewpoint. Our approach yields approximation guarantees that are not constrained by the effectiveness of the optimization algorithms being mimicked, extending far beyond convex problems like linear regression. The key is to show that (i) any target function can be nearly linearly represented, with small ℓ1-norm, over a set of universal features, and (ii) a transformer can be constructed to find the linear representation -- akin to solving Lasso -- at test time.  This is joint work with Gen Li, Yuchen Jiao, Yu Huang, and Yuting Wei (arXiv:2506.05200). 
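The Lasso computation alluded to at the end can be sketched with a standard proximal-gradient (ISTA) loop; this is generic background on the Lasso, not the transformer construction from the talk:

```python
import numpy as np

def soft_threshold(z, tau):
    # Proximal operator of tau * ||.||_1 (coordinate-wise shrinkage).
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def ista_lasso(A, y, lam, n_iters=500):
    """Minimize 0.5*||A w - y||^2 + lam*||w||_1 by proximal gradient
    descent (ISTA). Background sketch of the sparse linear representation
    step the abstract alludes to."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    w = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ w - y)
        w = soft_threshold(w - grad / L, lam / L)
    return w
```

With an orthonormal design the solution reduces to coordinate-wise soft-thresholding of y, which makes the loop easy to sanity-check.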


Daniel Hsu: Multi-step reasoning via curriculum learning

Can multi-step reasoning be learned from data? We investigate this question in the context of a simple function composition task. We prove that this task is hard to learn in the Statistical Query model, but is easy to learn with transformers under various forms of curriculum learning. This is joint work with Zixuan Wang, Eshaan Nichani, Alberto Bietti, Alex Damian, Jason Lee, and Denny Wu.


Paris Perdikaris: Optimization Challenges in Physics-Informed Neural Networks

Physics-informed neural networks (PINNs) minimize composite losses that penalize PDE residuals alongside boundary and initial conditions. While this resembles multi-task learning, the optimization landscape is fundamentally different. Differential operators amplify high-frequency error modes by polynomial factors, while the neural tangent kernel's eigenspectrum suppresses precisely those modes -- creating a spectral mismatch absent in standard supervised learning. Through NTK analysis, I will show that this leads to orders-of-magnitude disparities in per-component convergence rates, and that the resulting composite gradient is not merely imbalanced in magnitude but conflicted in direction. I will present a gradient alignment score that quantifies these directional conflicts and provide theoretical evidence that first-order methods are intrinsically limited in resolving them. On the practical side, I will show how layer-wise preconditioning (via the SOAP optimizer) achieves implicit gradient alignment and 2-10x accuracy gains on challenging benchmarks including the simulation of turbulent fluid flows, and how adaptive residual architectures restore trainability at depth. Throughout, I will highlight the structural properties that distinguish these problems from generic multi-task optimization -- known operator spectra, deterministic residuals, a priori inter-task coupling -- and argue that these present rich opportunities for rigorous theory and scalable algorithm design.
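One natural way to instantiate a gradient alignment score (the talk's definition may differ) is the cosine similarity between per-component loss gradients, with negative values flagging a directional conflict. A toy example with two quadratic loss components whose minimizers disagree, standing in for the residual and boundary terms:

```python
import numpy as np

def alignment_score(g_res, g_bc):
    # Cosine of the angle between the PDE-residual gradient and the
    # boundary-loss gradient; negative values flag a directional conflict
    # (a step that decreases one component increases the other).
    return float(g_res @ g_bc / (np.linalg.norm(g_res) * np.linalg.norm(g_bc)))

# Toy composite objective: two quadratics with different minimizers,
# evaluated at the midpoint, where their gradients point in exactly
# opposite directions.
w = np.array([0.0, 0.0])
g_res = 2 * (w - np.array([1.0, 0.0]))   # grad of ||w - (1, 0)||^2
g_bc = 2 * (w - np.array([-1.0, 0.0]))   # grad of ||w + (1, 0)||^2
```

At this point the two gradients are antiparallel, so the score is -1: no scalar reweighting of the loss terms can remove the conflict, only a change of direction (e.g., preconditioning) can.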


Mehryar Mohri: Coherence Mechanisms for Provable Self-Improvement

Large language models are increasingly trained to improve themselves, yet the mechanisms driving this, such as self-reflection or RLAIF, rely almost entirely on empirical heuristics. Is it possible to mathematically guarantee self-improvement without human supervision?

In this talk, I will introduce a geometric framework that proves self-improvement is not only possible but monotonic, grounded in the principle of coherence. By formalizing self-improvement as a Bregman projection onto a space of logically consistent models, we can guarantee enhanced performance. Furthermore, I will present a surprising characterization theorem: any self-improvement mechanism that offers similar theoretical guarantees must, fundamentally, be a coherence projection in disguise. (Joint work with Jon Schneider and Yifan Wu.)


Rachel Cummings: Differentially Private Space-Efficient Algorithms for Counting Distinct Elements in the Turnstile Model

The turnstile continual release model of differential privacy captures scenarios where a privacy-preserving real-time analysis is sought for a dataset evolving through additions and deletions. In typical applications of real-time data analysis, both the length of the stream T and the size of the universe |U| from which data come can be extremely large. This motivates the study of private algorithms in the turnstile setting using space sublinear in both T and |U|. In this paper, we give the first sublinear space differentially private algorithms for the fundamental problem of counting distinct elements in the turnstile streaming model. Our algorithm achieves, on arbitrary streams, O(T^{1/3}) space and additive error, and a (1+\eta)-relative approximation for all \eta \in (0,1). Our result significantly improves upon the space requirements of the state-of-the-art algorithms for this problem, which are linear, approaching the known \Omega(T^{1/4}) additive error lower bound for arbitrary streams. Moreover, when a bound W on the number of times an item appears in the stream is known, our algorithm provides O(\sqrt{W}) additive error, using O(\sqrt{W}) space. This additive error asymptotically matches that of prior work which required linear space. Our results address an open question about designing low-memory mechanisms for this problem. We complement these results with a space lower bound, which shows that any algorithm that uses similar techniques must use space \Omega(T^{1/3}) on arbitrary streams.

Joint work with Alessandro Epasto, Jieming Mao, Tamalika Mukherjee, Tingting Ou, and Peilin Zhong. Paper available at: https://arxiv.org/abs/2505.23682
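For contrast with the talk's sublinear-space private algorithms, the straightforward exact solution keeps a count per universe element, which is precisely the linear-space, non-private baseline being improved upon. A minimal sketch:

```python
from collections import defaultdict

class TurnstileDistinctCounter:
    """Exact distinct-element counter for a turnstile stream (insertions
    and deletions). Linear-space, non-private baseline -- the talk's
    contribution is doing this privately in sublinear space, which this
    sketch does not attempt."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.distinct = 0

    def update(self, item, delta):
        # Apply a signed update; valid turnstile streams never drive a
        # multiplicity below zero.
        old = self.counts[item]
        new = old + delta
        if new < 0:
            raise ValueError("count for %r would go negative" % (item,))
        self.counts[item] = new
        if old == 0 and new > 0:
            self.distinct += 1
        elif old > 0 and new == 0:
            self.distinct -= 1
```

The space cost is O(min(T, |U|)) dictionary entries, which is exactly what makes sublinear-space alternatives interesting at scale.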


Maryam Fazel: Global Convergence of Gradient EM for Over-Parameterized Gaussian Mixtures

Learning Gaussian Mixture Models (GMMs) is a fundamental problem in machine learning, and the Expectation-Maximization (EM) algorithm and its variant gradient-EM are the most widely used algorithms in practice. When the ground-truth GMM and the learning model have the same number of components, m, a line of prior work has attempted to establish rigorous recovery guarantees; however, EM methods are known to fail to recover the ground truth when m>2.

This talk considers the "over-parameterized" case, where the learning model uses n>m components to fit an m-component GMM. I will show that gradient-EM converges globally and recovers the GMM: for a well-separated GMM with only mild over-parameterization n = \Omega(m log m), randomly initialized gradient-EM converges to the ground truth at a polynomial rate with polynomial samples. The analysis relies on novel techniques to characterize the geometric landscape of the likelihood loss. This is the first global convergence result for EM methods beyond the special case of m=2.
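A gradient-EM step for the uniform-weight, unit-covariance case can be sketched as follows: an E-step computes responsibilities, and the means then take a gradient step on the expected log-likelihood rather than the closed-form M-step. Illustrative only; the step size and model restrictions here are assumptions, not the talk's exact setting:

```python
import numpy as np

def gradient_em_step(X, mus, eta=0.5):
    """One gradient-EM update for a uniform-weight, unit-covariance
    Gaussian mixture. X: (n, d) data; mus: (k, d) current means."""
    n, _ = X.shape
    # E-step: responsibilities r[i, j] ∝ exp(-||x_i - mu_j||^2 / 2)
    diffs = X[:, None, :] - mus[None, :, :]          # (n, k, d)
    logits = -0.5 * (diffs ** 2).sum(-1)             # (n, k)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    r = np.exp(logits)
    r /= r.sum(axis=1, keepdims=True)
    # Gradient M-step on the means (instead of the closed-form update):
    # grad of the expected log-likelihood w.r.t. mu_j is
    # (1/n) * sum_i r[i, j] * (x_i - mu_j).
    grad = (r[:, :, None] * diffs).sum(axis=0) / n
    return mus + eta * grad
```

On well-separated data the responsibilities quickly become near-hard assignments, and the means contract geometrically toward the cluster centers.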


Weijie Su: Surrogate-Model Approaches to Optimizers for LLM Training

The recent empirical success of the Muon optimizer in training large language models has outpaced the theoretical understanding of its matrix-gradient orthogonalization design. To bridge this gap, this talk introduces surrogate-model approaches that analyze and systematically improve deep learning optimization over a single iteration. We first present the isotropic curvature model, a convex program assuming curvature isotropy across perturbation directions, which reveals that optimal update matrices achieve a more homogeneous spectrum. This approach demonstrates that while Muon's gradient orthogonalization is directionally correct, it is only strictly optimal under specific curvature phase transitions. Building upon this theoretical foundation, we introduce a second quadratic surrogate model that approximates the loss using the gradient, an output-space curvature matrix, and the input data matrix. By minimizing this surrogate under an isotropic weight assumption, we derive Newton-Muon. This finding implies that standard Muon is an implicit Newton-type method that neglects the right preconditioning induced by the input second moment. Empirically, Newton-Muon accelerates GPT-2 pretraining, reaching target validation loss in 6% fewer iteration steps and reducing wall-clock training time by roughly 4%, illustrating the efficacy of principled surrogate models in designing LLM optimizers.
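As background, the matrix-gradient orthogonalization the abstract refers to can be written as replacing the gradient matrix by its polar factor, i.e., an SVD with all singular values set to one (practical Muon implementations approximate this with a Newton–Schulz iteration rather than an exact SVD). A minimal sketch of the idealized operation, not of Newton-Muon:

```python
import numpy as np

def orthogonalize(G):
    """Idealized Muon-style update direction: the polar factor U @ V^T of
    the gradient matrix G, which keeps the singular directions of G but
    sets every singular value to 1 (a maximally homogeneous spectrum)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt
```

Applied to a symmetric positive-definite gradient this returns the identity, making the "flattened spectrum" interpretation easy to verify by hand.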


Misha Belkin: From kernel machines to the linear representation hypothesis for monitoring and steering LLMs

A trained Large Language Model (LLM) contains much of human knowledge. Yet, it is difficult to gauge the extent or accuracy of that knowledge, as LLMs do not always “know what they know” and may even be unintentionally or actively misleading. In this talk I will discuss feature learning, introducing Recursive Feature Machines -- a powerful generalization of classical kernel methods designed to extract relevant features from tabular data. I will demonstrate how this technique enables us to detect and precisely guide LLM behaviors toward almost any desired concept by manipulating a fixed vector in the LLM activation space. I will also discuss how the same method allows probing whether an LLM exhibits motivated reasoning.
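The "fixed vector in activation space" manipulation can be sketched generically as additive steering of a hidden state along a concept direction. This is a generic illustration of activation steering under the linear representation hypothesis, not the RFM-based method of the talk:

```python
import numpy as np

def steer(h, v, alpha):
    """Shift a hidden activation h along a fixed concept direction v
    (normalized to unit length) with strength alpha. Positive alpha pushes
    the representation toward the concept; negative alpha pushes away."""
    v = v / np.linalg.norm(v)
    return h + alpha * v
```

In practice such a vector would be added to the residual stream at one or more layers during the forward pass; how the concept direction is extracted is exactly where methods like Recursive Feature Machines come in.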
