FOLDS seminar (new!)
Foundations of Optimization, Learning, and Data Science


Quick logistics


About




Schedule for Spring 2026

Date | Speaker | Affiliation | Title
Feb 5 | Elad Hazan | Princeton / Google | Provably Efficient Learning in Nonlinear Dynamical Systems via Spectral Transformers
Feb 12 | Jiaoyang Huang | Penn | Fast Convergence of High-Order ODE Solvers for Diffusion Models
Feb 19 | Yuxin Chen | Penn | Transformers Meet In-Context Learning: A Universal Approximation Theory
Feb 26 | Daniel Hsu | Columbia | Multi-step reasoning via curriculum learning
Mar 5 | Paris Perdikaris | Penn | Optimization Challenges in Physics-Informed Neural Networks
Mar 12 | [Spring Break] | ------------ | ------------
Mar 19 | Mehryar Mohri | NYU / Google |
Mar 23 (Monday, in Singh Center) | Rachel Cummings | Columbia |
Apr 2 | Maryam Fazel | UW |
Apr 9 | Weijie Su | Penn |
Apr 15 (joint with ASSET seminar) | Misha Belkin | UCSD |
Apr 23 | Shivani Agarwal | Penn |





Abstracts

Elad Hazan: Provably Efficient Learning in Nonlinear Dynamical Systems via Spectral Transformers

Learning in dynamical systems is a fundamental challenge underlying modern sequence modeling. Despite extensive study, efficient algorithms with formal guarantees for general nonlinear systems have remained elusive. This talk presents a provably efficient framework for learning in any bounded and Lipschitz nonlinear dynamical system, establishing the first sublinear regret guarantees in a dimension-free setting. Our approach combines Koopman lifting, Luenberger observers, and, crucially, spectral filtering to show that nonlinear dynamics are learnable. These insights motivate a new neural architecture, the Spectral Transform Unit (STU), which achieves state-of-the-art performance on language modeling, dynamical system, and differential equation benchmarks.
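The spectral filtering ingredient has a concrete form that is easy to illustrate. Below is a minimal sketch, not the speaker's implementation: the fixed filters are the top eigenvectors of a particular Hankel matrix (following the spectral filtering literature), and the features are causal convolutions of the input history with those filters; an STU-style layer places a learned readout on top. The eigenvalue scaling used in practice is omitted for brevity.

```python
# Minimal sketch, not the authors' implementation: fixed spectral filters plus
# causal convolution features. Eigenvalue scaling is omitted for brevity.
import jax.numpy as jnp

def spectral_filters(T: int, k: int) -> jnp.ndarray:
    """Top-k eigenvectors of the T x T Hankel matrix Z_{ij} = 2 / ((i+j)^3 - (i+j))."""
    idx = jnp.arange(1, T + 1)
    s = idx[:, None] + idx[None, :]          # i + j for i, j >= 1
    Z = 2.0 / (s ** 3 - s)
    _, eigvecs = jnp.linalg.eigh(Z)          # eigenvalues in ascending order
    return eigvecs[:, -k:]                   # k leading eigenvectors, shape [T, k]

def spectral_features(u: jnp.ndarray, filters: jnp.ndarray) -> jnp.ndarray:
    """Convolve an input sequence u of shape [T, d] with each fixed filter.

    Returns features of shape [T, k, d]; a learned linear readout on top of
    these features is the basic ingredient of an STU-style layer.
    """
    T, d = u.shape
    k = filters.shape[1]
    feats = jnp.zeros((T, k, d))
    for t in range(T):
        w = filters[: t + 1, :]              # filter weights for the last t+1 inputs
        hist = u[t::-1, :]                   # u_t, u_{t-1}, ..., u_0
        feats = feats.at[t].set(jnp.einsum("tk,td->kd", w, hist))
    return feats
```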


Jiaoyang Huang: Fast Convergence of High-Order ODE Solvers for Diffusion Models

Score-based diffusion models can be sampled efficiently by reformulating the reverse dynamics as a deterministic probability flow ODE and integrating it with high-order solvers. Since the score function is typically approximated by a neural network, the overall sampling accuracy depends on the interplay between score regularity, approximation error, and numerical integration error. In this talk, we study the convergence of deterministic probability-flow-ODE samplers, focusing on high-order (exponential) Runge–Kutta schemes. Under mild regularity assumptions—specifically, bounded first and second derivatives of the approximate score—we bound the total variation distance between the target distribution and the generated distribution by the sum of a score-approximation term and a p-th order step-size term, explaining why accurate sampling is achievable with only a few solver steps. We also empirically validate the regularity assumptions on benchmark datasets. Our guarantees apply to general forward diffusion processes with arbitrary variance schedules.
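As a point of reference for the deterministic samplers analyzed in the talk, here is a hedged sketch of a probability-flow ODE sampler for a VP-type forward process, integrated backward in time with Heun's method (a second-order Runge-Kutta scheme). The variance schedule, step count, and toy score function are assumptions made for illustration; the exponential Runge-Kutta schemes studied in the talk differ in detail.

```python
# Hedged sketch: probability-flow ODE sampler for a VP-type forward process
# dx = -0.5*beta(t)*x dt + sqrt(beta(t)) dW, integrated backward with Heun's method.
import jax
import jax.numpy as jnp

def beta(t):
    """Linear variance schedule (an assumed choice; the analysis allows arbitrary schedules)."""
    return 0.1 + (20.0 - 0.1) * t

def pf_ode_drift(x, t, score_fn):
    """Probability-flow ODE drift: dx/dt = -0.5 * beta(t) * (x + score(x, t))."""
    return -0.5 * beta(t) * (x + score_fn(x, t))

def heun_sample(score_fn, x_init, n_steps=20, t_min=1e-3):
    """Integrate the probability-flow ODE from t = 1 down to t_min with Heun steps."""
    ts = jnp.linspace(1.0, t_min, n_steps + 1)
    x = x_init
    for i in range(n_steps):
        t, t_next = ts[i], ts[i + 1]
        h = t_next - t                               # negative step: backward in time
        d1 = pf_ode_drift(x, t, score_fn)            # Euler predictor slope
        x_pred = x + h * d1
        d2 = pf_ode_drift(x_pred, t_next, score_fn)  # corrector slope at the predicted point
        x = x + 0.5 * h * (d1 + d2)                  # Heun (trapezoidal) update
    return x

# Toy usage: for standard Gaussian data the VP marginals stay N(0, I), so the
# exact score is score(x, t) = -x and can stand in for a trained network.
samples = heun_sample(lambda x, t: -x, jax.random.normal(jax.random.PRNGKey(0), (4, 2)))
```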


Yuxin Chen: Transformers Meet In-Context Learning: A Universal Approximation Theory

Large language models are capable of in-context learning, the ability to perform new tasks at test time using a handful of input-output examples, without parameter updates. We develop a universal approximation theory to elucidate how transformers enable in-context learning. For a general class of functions (each representing a distinct task), we demonstrate how to construct a transformer that, without any further weight updates, can predict based on a few noisy in-context examples with vanishingly small risk. Unlike prior work that frames transformers as approximators of optimization algorithms (e.g., gradient descent) for statistical learning tasks, we integrate Barron's universal function approximation theory with the algorithm approximator viewpoint. Our approach yields approximation guarantees that are not constrained by the effectiveness of the optimization algorithms being mimicked, extending far beyond convex problems like linear regression. The key is to show that (i) any target function can be nearly linearly represented, with small ℓ1-norm, over a set of universal features, and (ii) a transformer can be constructed to find the linear representation -- akin to solving Lasso -- at test time.  This is joint work with Gen Li, Yuchen Jiao, Yu Huang, and Yuting Wei (arXiv:2506.05200). 
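The "Lasso at test time" step can be illustrated numerically. The sketch below is not the paper's transformer construction: a random-feature map stands in for the universal features and a plain ISTA loop stands in for the Lasso computation, showing how a sparse (small l1-norm) linear representation fit on the in-context examples yields the prediction at a query point.

```python
# Numerical sketch only, not the paper's transformer construction: phi and the
# ISTA solver below are illustrative stand-ins.
import jax
import jax.numpy as jnp

def phi(x, omegas, biases):
    """Illustrative random-feature map over inputs x of shape [n, d]."""
    return jnp.cos(x @ omegas + biases)

def ista_lasso(F, y, lam=0.1, n_iters=500):
    """Solve min_w 0.5 * ||F w - y||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    step = 1.0 / (jnp.linalg.norm(F, ord=2) ** 2 + 1e-8)   # 1 / Lipschitz constant of the smooth part
    w = jnp.zeros(F.shape[1])
    for _ in range(n_iters):
        grad = F.T @ (F @ w - y)                            # gradient of the smooth part
        z = w - step * grad
        w = jnp.sign(z) * jnp.maximum(jnp.abs(z) - step * lam, 0.0)  # soft-thresholding
    return w

def in_context_predict(X_ctx, y_ctx, x_query, omegas, biases, lam=0.1):
    """Fit a sparse linear rule over fixed features from the context, then predict at x_query."""
    w = ista_lasso(phi(X_ctx, omegas, biases), y_ctx, lam=lam)
    return phi(x_query[None, :], omegas, biases) @ w

# Toy usage: a 1-D regression context with 8 noise-free examples.
key = jax.random.PRNGKey(0)
omegas, biases = jax.random.normal(key, (1, 64)), jnp.zeros(64)
X_ctx = jnp.linspace(-1.0, 1.0, 8)[:, None]
y_ctx = jnp.sin(3.0 * X_ctx[:, 0])
pred = in_context_predict(X_ctx, y_ctx, jnp.array([0.3]), omegas, biases)
```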


Daniel Hsu: Multi-step reasoning via curriculum learning

Can multi-step reasoning be learned from data? We investigate this question in the context of a simple function composition task. We prove that this task is hard to learn in the Statistical Query model, but is easy to learn with transformers under various forms of curriculum learning. This is joint work with Zixuan Wang, Eshaan Nichani, Alberto Bietti, Alex Damian, Jason Lee, and Denny Wu.
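For concreteness, here is a hedged sketch of one possible instantiation of the setup; the paper's exact task format is not reproduced here. Each task composes several random permutations of a finite alphabet, and a curriculum schedule increases the composition depth over training.

```python
# Illustrative sketch under an assumed task format (not the paper's exact setup).
import jax
import jax.numpy as jnp

def sample_composition_batch(key, vocab=16, depth=3, batch=32):
    """Return start symbols, the permutation tables, and their depth-fold composition applied to the start symbols."""
    keys = jax.random.split(key, depth + 1)
    x = jax.random.randint(keys[0], (batch,), 0, vocab)                       # start symbols
    tables = jnp.stack([jax.random.permutation(k, vocab) for k in keys[1:]])  # [depth, vocab]
    y = x
    for f in tables:                                                          # y = f_depth(... f_1(x))
        y = f[y]
    return x, tables, y

def curriculum_depth(step, max_depth=6, steps_per_stage=1000):
    """Simple staged curriculum: depth 1 first, then 2, ..., capped at max_depth."""
    return min(1 + step // steps_per_stage, max_depth)

# Toy usage: at training step 2500 the curriculum asks for depth-3 compositions.
x, tables, y = sample_composition_batch(jax.random.PRNGKey(0), depth=curriculum_depth(2500))
```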


Paris Perdikaris: Optimization Challenges in Physics-Informed Neural Networks

Physics-informed neural networks (PINNs) minimize composite losses that penalize PDE residuals alongside boundary and initial conditions. While this resembles multi-task learning, the optimization landscape is fundamentally different. Differential operators amplify high-frequency error modes by polynomial factors, while the neural tangent kernel's eigenspectrum suppresses precisely those modes -- creating a spectral mismatch absent in standard supervised learning. Through NTK analysis, I will show that this leads to orders-of-magnitude disparities in per-component convergence rates, and that the resulting composite gradient is not merely imbalanced in magnitude but conflicted in direction. I will present a gradient alignment score that quantifies these directional conflicts and provide theoretical evidence that first-order methods are intrinsically limited in resolving them. On the practical side, I will show how layer-wise preconditioning (via the SOAP optimizer) achieves implicit gradient alignment and 2-10x accuracy gains on challenging benchmarks including the simulation of turbulent fluid flows, and how adaptive residual architectures restore trainability at depth. Throughout, I will highlight the structural properties that distinguish these problems from generic multi-task optimization -- known operator spectra, deterministic residuals, a priori inter-task coupling -- and argue that these present rich opportunities for rigorous theory and scalable algorithm design.
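One natural instantiation of a gradient alignment score, shown below purely as an illustration, is the cosine similarity between the flattened parameter gradients of the PDE-residual term and of the boundary term; the talk's precise definition may differ. The toy problem (1D Poisson with zero boundary values) and the tiny network are assumptions.

```python
# Hedged sketch: cosine similarity between the parameter gradients of the
# PDE-residual loss and the boundary loss for a toy 1D Poisson PINN,
# u''(x) = f(x) on [0, 1] with u(0) = u(1) = 0.
import jax
import jax.numpy as jnp

def init_params(key, width=32):
    k1, k2 = jax.random.split(key)
    return {"W1": jax.random.normal(k1, (width,)),
            "b1": jnp.zeros(width),
            "W2": jax.random.normal(k2, (width,)) / jnp.sqrt(width)}

def u(params, x):
    """Scalar-input, scalar-output one-hidden-layer network."""
    h = jnp.tanh(x * params["W1"] + params["b1"])
    return h @ params["W2"]

def residual_loss(params, xs):
    """Mean squared PDE residual u''(x) - f(x) at collocation points xs."""
    u_xx = jax.vmap(jax.grad(jax.grad(u, argnums=1), argnums=1), in_axes=(None, 0))(params, xs)
    f = -jnp.pi ** 2 * jnp.sin(jnp.pi * xs)        # manufactured source: exact solution sin(pi x)
    return jnp.mean((u_xx - f) ** 2)

def boundary_loss(params):
    return u(params, 0.0) ** 2 + u(params, 1.0) ** 2

def alignment_score(params, xs):
    """Cosine similarity between the two loss terms' parameter gradients."""
    g_res = jax.grad(residual_loss)(params, xs)
    g_bc = jax.grad(boundary_loss)(params)
    v_res = jnp.concatenate([g.ravel() for g in jax.tree_util.tree_leaves(g_res)])
    v_bc = jnp.concatenate([g.ravel() for g in jax.tree_util.tree_leaves(g_bc)])
    return v_res @ v_bc / (jnp.linalg.norm(v_res) * jnp.linalg.norm(v_bc) + 1e-12)

params = init_params(jax.random.PRNGKey(0))
xs = jnp.linspace(0.0, 1.0, 64)
print(alignment_score(params, xs))   # values near -1 indicate directionally conflicting gradients
```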
