We revisit the fundamental problem of learning with distribution shift, where a learner is given labeled samples from a training distribution D and unlabeled samples from a test distribution D′, and is asked to output a classifier with low test error. The standard approach in this setting is to prove a generalization bound in terms of some notion of distance between D and D′. These distances, however, are difficult to compute, and this has been the main stumbling block for efficient algorithm design over the last two decades.
We sidestep this issue and define a new model called TDS learning, where a learner runs a test on the training set and is allowed to reject if this test detects distribution shift relative to a fixed output classifier. This approach leads to the first set of efficient algorithms for learning with distribution shift that make no assumptions on the test distribution. Finally, we discuss how our techniques have recently been used to solve longstanding problems in supervised learning with contamination.
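For concreteness, here is a minimal Python sketch of the TDS interaction: train a classifier on the labeled training data, run a test against the unlabeled test inputs, and either reject or output the classifier. The moment-comparison test, the helper name fit, and the tolerance tol are illustrative placeholders, not the tests used in the actual algorithms.

import numpy as np

def tds_learn(train_X, train_y, test_X, fit, tol=0.1):
    """Toy TDS-style learner: train on labeled data, test for shift on the
    unlabeled test inputs, and reject if the test fails."""
    clf = fit(train_X, train_y)
    # Toy shift test: compare first and second empirical moments of the
    # training and test inputs; reject if any entry differs by more than tol.
    mean_gap = np.abs(train_X.mean(axis=0) - test_X.mean(axis=0)).max()
    cov_gap = np.abs(np.cov(train_X, rowvar=False) - np.cov(test_X, rowvar=False)).max()
    if max(mean_gap, cov_gap) > tol:
        return None          # REJECT: the test detected distribution shift
    return clf               # accept: output the trained classifier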
  I will argue that properties of natural data are what predominantly
make deep networks so effective. To that end, I will show that deep
networks work well because of a characteristic structure in the space
of learnable tasks. The input correlation matrix for typical tasks has
a “sloppy” eigenspectrum where eigenvalues decay linearly on a
logarithmic scale. As a consequence, the Hessian and the Fisher
Information Matrix of a trained network also have a sloppy
eigenspectrum. Using this idea, I will demonstrate an analytical,
non-vacuous PAC-Bayes bound on the generalization error for general
deep networks.
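As a rough illustration of what a "sloppy" eigenspectrum looks like, the snippet below builds synthetic inputs whose correlation-matrix eigenvalues decay geometrically (so the log-eigenvalues fall off linearly with the index) and fits a line to the log-spectrum; the data and all constants are made up for illustration.

import numpy as np

# Synthetic inputs whose correlation-matrix eigenvalues decay geometrically,
# i.e. the log-eigenvalues fall off linearly with the index ("sloppy" decay).
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 256)) * (0.97 ** np.arange(256))

C = X.T @ X / len(X)                              # input correlation matrix
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]    # sorted eigenvalues

k = np.arange(1, len(eigvals) + 1)
slope, _ = np.polyfit(k, np.log(eigvals), 1)
print("slope of log-eigenvalue vs. index:", slope)  # roughly constant slope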
I will show that the training process in deep learning explores a
remarkably low-dimensional manifold, of dimension as low as three.
Networks with a wide variety of architectures and sizes, trained with
different optimization and regularization methods, lie on the same
manifold. Networks being trained on different
tasks (e.g., different subsets of ImageNet) using different methods
(e.g., supervised, transfer, meta, semi and self-supervised learning)
also lie on the same low-dimensional manifold.
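A crude way to probe the dimensionality of a training trajectory is to collect the network's predicted probabilities on a fixed probe set at many checkpoints and count how many principal components explain most of the variance. The sketch below does exactly that with ordinary PCA via an SVD; it is a simplified stand-in for the information-geometric embeddings used in the actual analysis, and the function name and threshold are my own choices.

import numpy as np

def trajectory_dimension(prob_checkpoints, var_threshold=0.95):
    """prob_checkpoints: array of shape (num_checkpoints, probe_size * num_classes),
    one row per checkpoint, holding the flattened predicted probabilities on a
    fixed probe set.  Returns the number of principal components needed to
    explain var_threshold of the variance along the training trajectory."""
    P = prob_checkpoints - prob_checkpoints.mean(axis=0, keepdims=True)
    s = np.linalg.svd(P, compute_uv=False)        # singular values of the trajectory
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(explained, var_threshold) + 1)

For example, prob_checkpoints can be built by stacking the flattened softmax outputs saved at each checkpoint.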
I will show that typical tasks are highly redundant functions of their
inputs. Many perception tasks, from visual recognition, semantic
segmentation, optical flow, depth estimation, to vocalization
discrimination, can be predicted extremely well regardless of whether
the data is projected onto the principal subspace where it varies the
most, onto an intermediate subspace with moderate variability, or onto
the bottom subspace where it varies the least.
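The flavor of this experiment can be sketched in a few lines: project the inputs onto the top, middle, or bottom principal subspace and train a simple classifier on each projection. The snippet below uses scikit-learn's digits dataset and logistic regression purely as placeholders; how much signal survives in the bottom subspace will of course depend on the dataset.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X = X - X.mean(axis=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

# Principal directions of the training inputs, from most to least variance.
_, _, Vt = np.linalg.svd(Xtr, full_matrices=False)
d, k = Vt.shape[0], 16

for name, idx in [("top", slice(0, k)),
                  ("middle", slice(d // 2 - k // 2, d // 2 + k // 2)),
                  ("bottom", slice(d - k, d))]:
    proj = Vt[idx]                                # k-dimensional subspace
    clf = LogisticRegression(max_iter=2000).fit(Xtr @ proj.T, ytr)
    print(name, "subspace accuracy:", clf.score(Xte @ proj.T, yte))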
Nati Srebro: Weak to Strong Generalization in Random Feature Models
Weak-to-Strong Generalization (Burns et al., 2023) is the phenomenon whereby a strong student, say GPT-4, learns a task from a weak teacher, say GPT-2, and ends up significantly outperforming the teacher. We show that this phenomenon requires neither a strong and complex learner like GPT-4 nor pre-training. We consider students and teachers that are random feature models, described by two-layer networks with a random, fixed bottom layer and a trained top layer. A ‘weak’ teacher, with a small number of units (i.e. random features), is trained on the population, and a ‘strong’ student, with a much larger number of units (i.e. random features), is trained only on labels generated by the weak teacher. We demonstrate, prove, and analyze how the student can outperform the teacher, even though it is trained only on data labeled by the teacher and has no pretraining or other knowledge or data advantage over the teacher. We explain how such weak-to-strong generalization is enabled by early stopping. Importantly, we also show the quantitative limits of weak-to-strong generalization in this model.
Joint work with Marko Medvedev, Kaifeng Lyu, Dingli Yu, Sanjeev Arora and Zhiyuan Li.
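A minimal simulation of this setup, with all constants, the ReLU feature map, and the oracle choice of stopping time being illustrative assumptions of mine (and with no claim that these particular numbers reproduce the effect), might look as follows.

import numpy as np

rng = np.random.default_rng(0)
d, n_weak, n_strong, n_train, n_test = 20, 30, 2000, 500, 5000

def features(X, W):
    return np.maximum(X @ W, 0.0) / np.sqrt(W.shape[1])   # fixed random ReLU features

w_star = rng.standard_normal(d)
f_star = lambda X: X @ w_star                              # ground-truth target

# 'Weak' teacher: few random features, fit on a large sample (a stand-in
# for training on the population).
W_t = rng.standard_normal((d, n_weak)) / np.sqrt(d)
X_pop = rng.standard_normal((20000, d))
a_t, *_ = np.linalg.lstsq(features(X_pop, W_t), f_star(X_pop), rcond=None)
teacher = lambda X: features(X, W_t) @ a_t

# 'Strong' student: many random features, trained only on teacher labels
# by gradient descent on the top layer.
W_s = rng.standard_normal((d, n_strong)) / np.sqrt(d)
X_tr, X_te = rng.standard_normal((n_train, d)), rng.standard_normal((n_test, d))
Phi_tr, Phi_te = features(X_tr, W_s), features(X_te, W_s)
y_tr = teacher(X_tr)

a, errs = np.zeros(n_strong), []
for step in range(3000):
    a -= 0.5 * Phi_tr.T @ (Phi_tr @ a - y_tr) / n_train
    errs.append(np.mean((Phi_te @ a - f_star(X_te)) ** 2))  # error vs. ground truth

print("teacher test error            :", np.mean((teacher(X_te) - f_star(X_te)) ** 2))
print("best early-stopped student    :", min(errs), "at step", int(np.argmin(errs)))
print("student trained to convergence:", errs[-1])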
Zhimei Ren: ACS: An interactive framework for machine-assisted selection with model-free guarantees
In this talk, I will introduce adaptive conformal selection (ACS), an interactive framework for model-free selection with guaranteed error control. Building on conformal selection (Jin and Candès, 2023b), ACS generalizes the approach to support human-in-the-loop adaptive data analysis. Under the ACS framework, we can partially reuse the data to boost the selection power, make decisions on the fly while exploring the data, and incorporate new information or preferences as they arise. The key to ACS is a carefully designed principle that controls the information available for decision making, allowing the data analyst to explore the data adaptively while maintaining rigorous control of the false discovery rate (FDR). Based on the ACS framework, we provide concrete selection algorithms for various goals, including model update/selection, diversified selection, and incorporating newly available labeled data. The effectiveness of ACS is demonstrated through extensive numerical simulations and real-data applications in large language model (LLM) deployment and drug discovery.
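As background, here is a small Python sketch of the non-interactive conformal selection step that ACS builds on: form conformal p-values from calibration residuals and run Benjamini-Hochberg on them to select test points whose unobserved outcome is claimed to exceed a threshold c. The function names and the particular residual score are illustrative choices; ACS's interactive machinery is not shown.

import numpy as np

def conformal_select(mu_hat, X_calib, y_calib, X_test, c, alpha=0.1):
    """Select test points whose unobserved outcome is claimed to exceed c,
    controlling the FDR at level alpha via Benjamini-Hochberg applied to
    conformal p-values.  mu_hat maps inputs to predicted outcomes."""
    V_calib = mu_hat(X_calib) - y_calib      # calibration residuals
    V_test = mu_hat(X_test) - c              # test scores under the null Y <= c

    n = len(V_calib)
    pvals = np.array([(1 + np.sum(V_calib >= v)) / (n + 1) for v in V_test])

    # Benjamini-Hochberg step-up rule on the conformal p-values.
    m = len(pvals)
    order = np.argsort(pvals)
    passed = np.nonzero(pvals[order] <= alpha * np.arange(1, m + 1) / m)[0]
    if len(passed) == 0:
        return np.array([], dtype=int)
    return np.sort(order[: passed[-1] + 1])  # indices of selected test points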
  
Eva Tardos: Learning in Strategic Queuing
  Over the last two decades we have developed a good understanding of how to quantify the impact of strategic user behavior on outcomes in many games (including traffic routing and online auctions), and have shown that the resulting bounds extend to repeated games assuming players use a form of learning (no-regret learning) to adapt to the environment. However, these results assume that there are no carry-over effects between rounds: outcomes in one round have no effect on the game in the future. Many repeated games have an evolving state that results in direct carry-over effects, such as repeated auctions with budgets, as well as queuing systems. In this talk we will study this phenomenon in the context of a game modeling queuing systems: routers compete for servers, and packets that do not get served need to be resent, resulting in a system where the number of packets at each round depends on the success of the routers in the previous rounds. We study the excess server capacity required to guarantee that all packets get served in two different queuing systems (with or without buffers), despite the selfish (myopic) behavior of the participants.
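To make the carry-over effect concrete, here is a toy discrete-time simulation of a queuing game of this flavor: packets that are not served remain in their queues, so the state of each round depends on earlier rounds. The arrival and service probabilities and the uniformly random server choice are my own placeholders, not the strategies or the exact model analyzed in the talk.

import numpy as np

rng = np.random.default_rng(0)
n_queues, n_servers, T = 4, 4, 20000
arrival = np.full(n_queues, 0.2)            # packet arrival probability per round
service = np.full(n_servers, 0.5)           # server success probability per round
queues = [[] for _ in range(n_queues)]      # each entry: arrival time of a waiting packet

for t in range(T):
    # New packets arrive.
    for i in range(n_queues):
        if rng.random() < arrival[i]:
            queues[i].append(t)
    # Each non-empty queue sends its oldest packet to a uniformly random server.
    bids = {}                                # server -> (oldest timestamp, queue index)
    for i in range(n_queues):
        if queues[i]:
            s = int(rng.integers(n_servers))
            if s not in bids or queues[i][0] < bids[s][0]:
                bids[s] = (queues[i][0], i)
    # Each server attempts the oldest packet sent to it; unserved packets stay queued.
    for s, (_, i) in bids.items():
        if rng.random() < service[s]:
            queues[i].pop(0)

print("final queue lengths:", [len(q) for q in queues])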