We revisit the fundamental problem of learning with distribution shift, where a learner is given labeled samples from a training distribution D and unlabeled samples from a test distribution D′, and is asked to output a classifier with low test error. The standard approach in this setting is to prove a generalization bound in terms of some notion of distance between D and D′. These distances, however, are difficult to compute, and this has been the main stumbling block for efficient algorithm design over the last two decades.
We sidestep this issue and define a new model called TDS learning, in which a learner runs a test on the training set and is allowed to reject if this test detects distribution shift relative to a fixed output classifier. This approach leads to the first set of efficient algorithms for learning with distribution shift that make no assumptions on the test distribution. Finally, we discuss how our techniques have recently been used to solve longstanding problems in supervised learning with contamination.
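To convey the flavor of a tester of this kind (an illustrative sketch, not an algorithm from the work; the moment-matching criterion, tolerance, and function names are all assumptions), one can compare low-degree empirical moments of the training inputs against the unlabeled test inputs and reject when they disagree:

```python
import numpy as np

def moment_test(train_x, test_x, degree=2, tol=0.2):
    """Reject if low-degree empirical moments of the training inputs
    and the (unlabeled) test inputs differ by more than tol in any
    coordinate; otherwise accept."""
    for d in range(1, degree + 1):
        m_train = np.mean(train_x ** d, axis=0)
        m_test = np.mean(test_x ** d, axis=0)
        if np.max(np.abs(m_train - m_test)) > tol:
            return "REJECT"
    return "ACCEPT"

rng = np.random.default_rng(0)
# No shift: both samples come from the same standard Gaussian.
same = moment_test(rng.normal(size=(5000, 3)), rng.normal(size=(5000, 3)))
# Mean shift: the first and second moments move, so the test rejects.
shifted = moment_test(rng.normal(size=(5000, 3)),
                      rng.normal(loc=1.0, size=(5000, 3)))
```

A tester that accepts certifies only that the test distribution matches the training distribution on the statistics being checked, which is the sense in which rejection is tied to a fixed output classifier.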
I will argue that properties of natural data are what predominantly make deep networks so effective. To that end, I will show that deep networks work well because of a characteristic structure in the space of learnable tasks. The input correlation matrix for typical tasks has a “sloppy” eigenspectrum where eigenvalues decay linearly on a logarithmic scale. As a consequence, the Hessian and the Fisher Information Matrix of a trained network also have a sloppy eigenspectrum. Using this idea, I will demonstrate an analytical, non-vacuous PAC-Bayes bound on the generalization error for general deep networks.
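As a toy illustration of what a “sloppy” eigenspectrum means (synthetic data; the decay rate 0.8 is an arbitrary assumption, not a measured value): eigenvalues that decay linearly on a logarithmic scale are geometric in the index, so a linear fit of log-eigenvalue against index recovers the decay rate.

```python
import numpy as np

# Synthetic "sloppy" spectrum: lambda_k = lambda_0 * c**k with c < 1,
# so log(lambda_k) is a straight line in the index k.
k = np.arange(50)
lam = 1.0 * 0.8 ** k

# A linear fit of log(lambda) vs. index recovers the decay rate.
slope, intercept = np.polyfit(k, np.log(lam), 1)
# slope is log(0.8) and intercept is 0, up to floating point.
```

On real input correlation matrices one would compute `lam` from the data's eigendecomposition and check how well this linear fit holds.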
I will show that the training process in deep learning explores a remarkably low-dimensional manifold, with a dimension as low as three. Networks with a wide variety of architectures, sizes, and optimization and regularization methods lie on the same manifold. Networks trained on different tasks (e.g., different subsets of ImageNet) using different methods (e.g., supervised, transfer, meta-, semi-, and self-supervised learning) also lie on the same low-dimensional manifold.
I will show that typical tasks are highly redundant functions of their inputs. Many perception tasks, from visual recognition, semantic segmentation, optical flow, and depth estimation to vocalization discrimination, can be predicted extremely well regardless of whether the data are projected onto the principal subspace where they vary the most, an intermediate subspace with moderate variability, or the bottom subspace where they vary the least.
Nati Srebro: Weak to Strong Generalization in Random Feature Models
Weak-to-Strong Generalization (Burns et al., 2023) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon requires neither a strong and complex learner like GPT-4 nor pre-training. We consider students and teachers that are random feature models, described by two-layer networks with a random, fixed bottom layer and a trained top layer. A ‘weak’ teacher, with a small number of units (i.e., random features), is trained on the population, and a ‘strong’ student, with a much larger number of units, is trained only on labels generated by the weak teacher. We demonstrate, prove, and analyze how the student can outperform the teacher, even though it is trained only on data labeled by the teacher, with no pre-training or other knowledge or data advantage. We explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model.
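The setup lends itself to a small numpy sketch (a toy version, not the paper's exact construction; the target function, sample sizes, feature counts, and step counts below are all illustrative assumptions): a small random-feature teacher is fit on a large "population" sample, and a much larger random-feature student is fit by early-stopped gradient descent on teacher labels alone.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 10, 2000, 2000

def features(x, W):
    return np.maximum(x @ W.T, 0.0)  # fixed random ReLU features

# Hypothetical ground-truth target: a wide random-feature function.
W_true = rng.normal(size=(200, d))
a_true = rng.normal(size=200) / np.sqrt(200)
f = lambda x: features(x, W_true) @ a_true

x_pop = rng.normal(size=(20000, d))   # large proxy for the population
x_tr = rng.normal(size=(n_train, d))  # student's training inputs
x_te = rng.normal(size=(n_test, d))

# Weak teacher: few random features, least squares on the population proxy.
W_t = rng.normal(size=(20, d))
a_t = np.linalg.lstsq(features(x_pop, W_t), f(x_pop), rcond=None)[0]
teacher = lambda x: features(x, W_t) @ a_t

# Strong student: many random features, gradient descent fit to *teacher*
# labels only, stopped after a few steps (early stopping).
W_s = rng.normal(size=(2000, d))
Phi_s = features(x_tr, W_s)
a_s = np.zeros(2000)
lr = 1.0 / np.linalg.norm(Phi_s, 2) ** 2  # stable step size
for _ in range(50):
    a_s += lr * Phi_s.T @ (teacher(x_tr) - Phi_s @ a_s)

# Test error of teacher and student against the true target.
err = lambda g: np.mean((g(x_te) - f(x_te)) ** 2)
teacher_err = err(teacher)
student_err = err(lambda x: features(x, W_s) @ a_s)
```

In the paper's regime, the interplay between the student's capacity and the early-stopping time is what lets the student's test error against the true target fall below the teacher's; this sketch only sets up the comparison.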
Joint work with Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora and Zhiyuan Li.
Zhimei Ren: ACS: An interactive framework for machine-assisted selection with model-free guarantees
In this talk, I will introduce adaptive conformal selection (ACS), an interactive framework for model-free selection with guaranteed error control. Building on conformal selection (Jin and Candès, 2023b), ACS generalizes the approach to support human-in-the-loop adaptive data analysis. Under the ACS framework, we can partially reuse the data to boost the selection power, make decisions on the fly while exploring the data, and incorporate new information or preferences as they arise. The key to ACS is a carefully designed principle that controls the information available for decision making, allowing the data analyst to explore the data adaptively while maintaining rigorous control of the false discovery rate (FDR). Based on the ACS framework, we provide concrete selection algorithms for various goals, including model update/selection, diversified selection, and incorporating newly available labeled data. The effectiveness of ACS is demonstrated through extensive numerical simulations and real-data applications in large language model (LLM) deployment and drug discovery.
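For intuition about the base procedure that ACS builds on (a minimal sketch in the style of conformal selection; the scores, sample sizes, and FDR level below are illustrative assumptions), conformal p-values are computed against a labeled calibration set and then fed to the Benjamini-Hochberg procedure:

```python
import numpy as np

def conformal_pvalues(cal_scores, test_scores):
    """p_j = (1 + #{calibration scores >= test score j}) / (n + 1);
    a small p_j suggests test unit j has an unusually large score."""
    n = len(cal_scores)
    return np.array([(1 + np.sum(cal_scores >= s)) / (n + 1)
                     for s in test_scores])

def benjamini_hochberg(p, q=0.1):
    """Indices selected by the BH step-up procedure at level q."""
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)
    k = np.max(np.where(below)[0])
    return order[: k + 1]

rng = np.random.default_rng(1)
cal = rng.normal(size=1000)  # scores of labeled calibration units
# 90 null test units plus 10 units with clearly elevated scores.
test = np.concatenate([rng.normal(size=90), rng.normal(loc=4.0, size=10)])
p = conformal_pvalues(cal, test)
selected = benjamini_hochberg(p, q=0.1)
```

ACS's contribution is to let the analyst revisit and adapt this pipeline interactively, with a designed restriction on the information revealed at each step so that FDR control survives the adaptivity.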
Lénaïc Chizat: The Hidden Width of Deep ResNets
We present a mathematical framework to analyze the training dynamics of deep ResNets that rigorously captures practical architectures (including Transformers) trained from standard random initializations. Our approach combines stochastic approximation of ODEs with propagation-of-chaos arguments to obtain tight convergence rates to the “infinite size” limit of the dynamics. It yields the following insights:
1/ Depth begets width: infinite-depth ResNets of any hidden width behave throughout training as if they were infinitely wide;
2/ Phase diagram: we derive the phase diagram of the training dynamics, which singles out an “ideal” scaling of hyper-parameters (initialization scale and learning rates), extending “CompleteP” to more general architectures;
3/ Optimal shape scaling: our analysis suggests how to scale depth, hidden width and embedding dimension of a ResNet when scaling up parameter count. With the optimal shape and a parameter budget P, we argue that the model converges to its limiting dynamics at rate P^{-1/6}.
Eva Tardos: Learning in Strategic Queuing
Over the last two decades we have developed a good understanding of how to quantify the impact of strategic user behavior on outcomes in many games (including traffic routing and online auctions), and have shown that the resulting bounds extend to repeated games assuming players use a form of learning (no-regret learning) to adapt to the environment. However, these results assume that there are no carry-over effects between rounds: outcomes in one round have no effect on the game in the future. Many repeated games have an evolving state that creates a direct carry-over effect, such as repeated auctions with budgets, as well as queuing systems. In this talk we will study this phenomenon in the context of a game modeling queuing systems: routers compete for servers, and packets that do not get served need to be resent, resulting in a system where the number of packets at each round depends on the success of the routers in previous rounds. We study the excess server capacity required to guarantee that all packets get served in two different queuing systems (with or without buffers), despite the selfish (myopic) behavior of the participants.
Rob Nowak: Function Space Perspectives on Neural Networks
This talk reviews a theory of the functions learned by neural networks with Rectified Linear Unit (ReLU) activations. At its core is the observation that deep ReLU networks can be characterized as solutions to data-fitting problems in certain infinite-dimensional function spaces. The solutions are compositions of functions from Banach spaces of second-order bounded variation, defined in the Radon transform domain. Functions in these spaces exhibit strong smoothness in most directions, making them a natural setting for adapting to intrinsic low-dimensional structure in data. Moreover, the norms in these spaces are closely tied to the size of neural network weights, providing a direct connection between function complexity and network parametrization. In particular, the total variation norm provides an analytic tool for identifying functions that cannot be realized by shallow networks, thereby yielding a precise characterization of depth separation. Representer theorems reveal that the solutions are sparse in the number of active neurons per layer. Sparsity provides a principled path to network compression, yet some sparse solutions can suffer from poor generalization. The theory suggests new training strategies to promote solutions that generalize more robustly.
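To illustrate neuron-level sparsity and its link to compression (an illustrative proxy, not the precise norm from the theory; all shapes and names below are assumptions), inactive neurons of a shallow ReLU network contribute nothing to either the function or a weight-based cost, so pruning them is exact:

```python
import numpy as np

rng = np.random.default_rng(0)
# Shallow ReLU network f(x) = sum_i v_i * relu(w_i . x + b_i).
W = rng.normal(size=(50, 3))
b = rng.normal(size=50)
v = rng.normal(size=50)
v[10:] = 0.0  # a sparse solution: only the first 10 neurons are active

f = lambda x: np.maximum(x @ W.T + b, 0.0) @ v

# Weight-based proxy for representational cost: sum of |v_i| * ||w_i||.
# Inactive neurons contribute zero, so pruning them leaves both the
# cost and the computed function unchanged.
cost = np.sum(np.abs(v) * np.linalg.norm(W, axis=1))

x = rng.normal(size=(5, 3))
pruned = np.maximum(x @ W[:10].T + b[:10], 0.0) @ v[:10]
```

The representer theorems in the talk guarantee that solutions with this kind of neuron sparsity exist, which is why pruning can serve as a principled compression step.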