FOLDS seminar (new!)
Foundations of Optimization, Learning, and Data Science
Quick logistics
- When: Thursdays 12-1pm
- Where: Amy Gutmann Hall 414
- Mailing list: subscribe here to "folds-seminar"
- Add to calendar: subscribe here
- Organizers: Jason Altschuler and Hamed Hassani
- Administrative coordinator: Sonia Castro (email: soniacr@seas.upenn.edu)
About
- What: This seminar features leading experts in optimization, learning, and data science. Topics span algorithms, complexity, modeling, applications, and mathematical underpinnings.
- Why: Foundational advances in these fields are increasingly intertwined. This seminar serves as a university-wide hub to bring together the many communities across UPenn interested in optimization, learning, and data science — in the Department of Statistics and Data Science, Electrical Engineering, Computer Science, Applied Mathematics, Economics, Wharton OID, etc. To help promote this internal interaction, several speakers will be from UPenn.
- Funding: We are grateful to the IDEAS Center for Innovation in Data Engineering and Science, Penn AI, and the Wharton Department of Statistics and Data Science for their support.
- Archived schedules: Fall 2025. (Previously, this seminar evolved from the UPenn Optimization Seminar: Fall 2023, Spring 2024, Fall 2024, Spring 2025.)
Schedule for Spring 2026
Abstracts
Elad Hazan: Provably Efficient Learning in Nonlinear Dynamical Systems via Spectral Transformers
Learning in dynamical systems is a fundamental challenge underlying modern sequence modeling. Despite extensive study, efficient algorithms with formal guarantees for general nonlinear systems have remained elusive. This talk presents a provably efficient framework for learning in any bounded and Lipschitz nonlinear dynamical system, establishing the first sublinear regret guarantees in a dimension-free setting. Our approach combines Koopman lifting, Luenberger observers, and, crucially, spectral filtering to show that nonlinear dynamics are learnable. These insights motivate a new neural architecture, the Spectral Transform Unit (STU), which achieves state-of-the-art performance on language modeling, dynamical systems, and differential equation benchmarks.
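The core spectral filtering idea (here in its basic linear-systems form, not the STU architecture itself) can be sketched in a few lines: featurize the input sequence by convolving it with the top eigenvectors of a fixed Hankel matrix, then fit outputs linearly. The sketch below is a toy illustration assuming the Hankel construction from the spectral filtering literature and a scalar linear system; all parameter choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
L, K = 64, 16               # filter length and number of spectral filters (toy choices)

# Fixed Hankel matrix whose top eigenvectors serve as the filters
# (construction from the spectral filtering literature: Z_ij = 2/((i+j)^3 - (i+j)))
idx = np.arange(1, L + 1)
m = idx[:, None] + idx[None, :]
Z = 2.0 / (m**3 - m)
_, eigvecs = np.linalg.eigh(Z)
filters = eigvecs[:, -K:]    # shape (L, K), top-K eigenvectors

# Toy scalar linear dynamical system: x_{t+1} = a x_t + b u_t, observe y_t = x_{t+1}
a, b, T = 0.8, 1.0, 2000
u = rng.standard_normal(T)
x = np.zeros(T + 1)
for t in range(T):
    x[t + 1] = a * x[t] + b * u[t]
y = x[1:]

# Spectral features: convolve the recent input window with each filter
feats = np.zeros((T, K))
for t in range(L - 1, T):
    window = u[t - L + 1 : t + 1][::-1]   # u_t, u_{t-1}, ..., u_{t-L+1}
    feats[t] = window @ filters

# Fit outputs linearly on the spectral features and measure relative error
tr = slice(L, T)
coef, *_ = np.linalg.lstsq(feats[tr], y[tr], rcond=None)
rel_err = np.mean((feats[tr] @ coef - y[tr]) ** 2) / np.var(y[tr])
```

Because geometric impulse responses lie almost entirely in the span of the top Hankel eigenvectors, a handful of fixed, data-independent filters suffices to fit the system nearly perfectly.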
Jiaoyang Huang: Fast Convergence of High-Order ODE Solvers for Diffusion Models
Score-based diffusion models can be sampled efficiently by reformulating the reverse dynamics as a deterministic probability flow ODE and integrating it with high-order solvers. Since the score function is typically approximated by a neural network, the overall sampling accuracy depends on the interplay between score regularity, approximation error, and numerical integration error. In this talk, we study the convergence of deterministic probability-flow-ODE samplers, focusing on high-order (exponential) Runge–Kutta schemes. Under mild regularity assumptions—specifically, bounded first and second derivatives of the approximate score—we bound the total variation distance between the target distribution and the generated distribution by the sum of a score-approximation term and a p-th order step-size term, explaining why accurate sampling is achievable with only a few solver steps. We also empirically validate the regularity assumptions on benchmark datasets. Our guarantees apply to general forward diffusion processes with arbitrary variance schedules.
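As a toy illustration of deterministic probability-flow-ODE sampling, the sketch below uses an Ornstein–Uhlenbeck forward process on a 1-D Gaussian data distribution, where the score is available in closed form. It substitutes the exact score for a learned network and classical RK4 for the exponential Runge–Kutta schemes analyzed in the talk; all parameter choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 0.5        # toy 1-D Gaussian data distribution (assumed)
T, n_steps, n_samples = 5.0, 40, 4000

def mean_var(t):
    # Marginals of the OU forward process dx = -x dt + sqrt(2) dW
    m = mu * np.exp(-t)
    v = 1.0 + (sigma**2 - 1.0) * np.exp(-2.0 * t)
    return m, v

def velocity(x, t):
    # Probability-flow ODE: dx/dt = f(x,t) - (1/2) g^2 * score,
    # with f = -x, g = sqrt(2), and the exact Gaussian score
    m, v = mean_var(t)
    score = -(x - m) / v
    return -x - score

# Start from the exact terminal marginal and integrate backwards with RK4
m_T, v_T = mean_var(T)
x = m_T + np.sqrt(v_T) * rng.standard_normal(n_samples)
h = -T / n_steps            # negative step: integrating from t = T down to 0
t = T
for _ in range(n_steps):
    k1 = velocity(x, t)
    k2 = velocity(x + 0.5 * h * k1, t + 0.5 * h)
    k3 = velocity(x + 0.5 * h * k2, t + 0.5 * h)
    k4 = velocity(x + h * k3, t + h)
    x = x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    t += h
# x should now be (approximately) distributed as N(mu, sigma^2)
```

With a fourth-order scheme, a few dozen steps recover the data distribution to within Monte Carlo error, which is the phenomenon the talk's p-th order step-size term quantifies when the exact score is replaced by a neural approximation.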
Yuxin Chen: Transformers Meet In-Context Learning: A Universal Approximation Theory
Large language models are capable of in-context learning, the ability to perform new tasks at test time using a handful of input-output examples, without parameter updates. We develop a universal approximation theory to elucidate how transformers enable in-context learning. For a general class of functions (each representing a distinct task), we demonstrate how to construct a transformer that, without any further weight updates, can predict based on a few noisy in-context examples with vanishingly small risk. Unlike prior work that frames transformers as approximators of optimization algorithms (e.g., gradient descent) for statistical learning tasks, we integrate Barron's universal function approximation theory with the algorithm approximator viewpoint. Our approach yields approximation guarantees that are not constrained by the effectiveness of the optimization algorithms being mimicked, extending far beyond convex problems like linear regression. The key is to show that (i) any target function can be nearly linearly represented, with small ℓ1-norm, over a set of universal features, and (ii) a transformer can be constructed to find the linear representation -- akin to solving Lasso -- at test time.
This is joint work with Gen Li, Yuchen Jiao, Yu Huang, and Yuting Wei (arXiv:2506.05200).
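The Lasso-like step mentioned above can be illustrated in isolation. The sketch below is a generic ISTA (iterative soft-thresholding) solver on synthetic data, not the transformer construction from the paper: it recovers a sparse, small-ℓ1 coefficient vector over a feature dictionary from a modest number of noisy examples; all sizes and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 40, 100, 3        # few examples, many features, sparse truth

# Synthetic feature dictionary and a sparse, small-l1 target representation
A = rng.standard_normal((n, d)) / np.sqrt(n)
w_true = np.zeros(d)
w_true[rng.choice(d, size=k, replace=False)] = rng.standard_normal(k)
y = A @ w_true + 0.01 * rng.standard_normal(n)

# ISTA for min_w 0.5*||Aw - y||^2 + lam*||w||_1
lam = 0.01
step = 1.0 / np.linalg.norm(A, 2) ** 2    # 1 / Lipschitz constant of the gradient
w = np.zeros(d)
for _ in range(5000):
    z = w - step * (A.T @ (A @ w - y))    # gradient step on the smooth part
    w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)   # soft-threshold

fit_mse = np.mean((A @ w - y) ** 2)
recovery_err = np.linalg.norm(w - w_true)
```

Soft-thresholding zeroes out coordinates below the threshold at every iteration, so the solver lands on a genuinely sparse, small-ℓ1 representation even though the dictionary is overcomplete.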
Daniel Hsu: Multi-step reasoning via curriculum learning
Can multi-step reasoning be learned from data? We investigate this question in the context of a simple function composition task. We prove that this task is hard to learn in the Statistical Query model, but is easy to learn with transformers under various forms of curriculum learning. This is joint work with Zixuan Wang, Eshaan Nichani, Alberto Bietti, Alex Damian, Jason Lee, and Denny Wu.
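To make the setting concrete, here is a hypothetical toy instance of a function composition task with a depth curriculum (the exact task and curriculum in the paper may differ): compose a random sequence of functions on a small alphabet, and present shallow compositions before deep ones.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_funcs, max_depth = 8, 6, 4   # alphabet size, function pool, max depth (toy)

# Pool of random functions on {0, ..., n-1}: funcs[i][x] = f_i(x)
funcs = rng.integers(0, n, size=(n_funcs, n))

def make_example(depth):
    """One example: input (x, function indices), label f_k(...f_1(x)...)."""
    x = int(rng.integers(0, n))
    idx = rng.integers(0, n_funcs, size=depth)
    y = x
    for i in idx:
        y = int(funcs[i][y])
    return (x, idx.tolist()), y

# Curriculum: shallow compositions first, then progressively deeper ones
curriculum = [make_example(d) for d in range(1, max_depth + 1) for _ in range(32)]
```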
Paris Perdikaris: Optimization Challenges in Physics-Informed Neural Networks
Physics-informed neural networks (PINNs) minimize composite losses that penalize PDE residuals alongside boundary and initial conditions. While this resembles multi-task learning, the optimization landscape is fundamentally different. Differential operators amplify high-frequency error modes by polynomial factors, while the neural tangent kernel's eigenspectrum suppresses precisely those modes -- creating a spectral mismatch absent in standard supervised learning. Through NTK analysis, I will show that this leads to orders-of-magnitude disparities in per-component convergence rates, and that the resulting composite gradient is not merely imbalanced in magnitude but conflicted in direction. I will present a gradient alignment score that quantifies these directional conflicts and provide theoretical evidence that first-order methods are intrinsically limited in resolving them. On the practical side, I will show how layer-wise preconditioning (via the SOAP optimizer) achieves implicit gradient alignment and 2-10x accuracy gains on challenging benchmarks including the simulation of turbulent fluid flows, and how adaptive residual architectures restore trainability at depth. Throughout, I will highlight the structural properties that distinguish these problems from generic multi-task optimization — known operator spectra, deterministic residuals, a priori inter-task coupling -- and argue that these present rich opportunities for rigorous theory and scalable algorithm design.
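The spectral mismatch and directional conflict described above can be seen even in a linear-in-parameters toy model, where gradients are available in closed form. The sketch below (an assumed illustration, using a plain cosine-similarity alignment measure rather than necessarily the exact score from the talk) shows how the Laplacian's polynomial amplification of high frequencies skews both the magnitude and the direction of the composite gradient.

```python
import numpy as np

# Linear-in-parameters toy "PINN" on [0,1]: u_theta(x) = sum_j theta_j sin(j*pi*x),
# for the Poisson problem -u'' = f with true solution u*(x) = sin(pi x) + sin(5 pi x)
J = 6
xs = np.linspace(0.0, 1.0, 101)[1:-1]                 # interior collocation points
freqs = np.arange(1, J + 1) * np.pi
Phi = np.sin(np.outer(xs, freqs))                     # basis values, shape (99, J)
Lap = Phi * freqs**2                                  # -u'' applied to each basis fn
u_star = np.sin(np.pi * xs) + np.sin(5 * np.pi * xs)
f = np.pi**2 * np.sin(np.pi * xs) + (5 * np.pi)**2 * np.sin(5 * np.pi * xs)

theta = np.zeros(J)
grad_pde = Lap.T @ (Lap @ theta - f) / len(xs)        # grad of 0.5*mean PDE residual^2
grad_data = Phi.T @ (Phi @ theta - u_star) / len(xs)  # grad of 0.5*mean data misfit^2

# The operator amplifies mode j by (j*pi)^2, so the squared-residual gradient
# weights mode 5 over mode 1 by (5^2)^2 = 625: a large spectral disparity
amplification = abs(grad_pde[4]) / abs(grad_pde[0])

# That disparity also skews the direction of the composite gradient: the cosine
# alignment between the PDE and data gradients is well below 1
alignment = grad_pde @ grad_data / (
    np.linalg.norm(grad_pde) * np.linalg.norm(grad_data)
)
```

Even with the exact solution expressible in the basis and no noise, the two loss components pull in noticeably different directions, which is the kind of conflict that first-order methods alone cannot resolve and that layer-wise preconditioning targets.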