UPenn Optimization Seminar (new!)
Quick facts
- When: Thursdays, 12-1pm
- Where: Room 1201, Steinberg-Dietrich Hall
- Mailing list: subscribe to "opt-seminar" here
- Organizers: Jason Altschuler and Hamed Hassani
- Administrative coordinator: Sonia Rodriguez (email: soniacr@seas.upenn.edu)
About
- What: This seminar series features leading experts in optimization and adjacent fields. Topics range broadly from the design and analysis of optimization algorithms, to the complexity of fundamental optimization tasks, to the modeling and formulation of optimization applications arising in machine learning, engineering, operations, economics, applied mathematics, etc.
- Why: This seminar serves as a university-wide hub to bring together the many optimization communities across UPenn --- in the Departments of Statistics and Data Science, Electrical Engineering, Computer Science, Applied Mathematics, Economics, Wharton OID, etc. To help promote this internal interaction, several speakers will be from UPenn.
- Funding: We gratefully acknowledge funding from the Wharton Department of Statistics and Data Science, the IDEAS Institute for Data Science, and the NSF-Tripods-I program.
- Archived schedules: Fall 2023, Spring 2024
Schedule for Fall 2024
Abstracts
Aaron Sidford: Theoretical Advances in Efficiently Solving Markov Decision Processes
Markov Decision Processes (MDPs) are a fundamental mathematical model for reasoning about uncertainty and are foundational to reinforcement learning theory. Over the past decade, there have been substantial advances in the design and analysis of algorithms for provably computing approximately optimal policies for MDPs in a variety of settings. In this talk, I will survey these advances, touching on optimization tools of potential broader utility. In particular, this talk will highlight recent joint work with Ishani Karmarkar, Jiayi Wang, and Yujia Jin on this topic (arXiv:2405.12952).
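For readers newer to the model, the object being approximated throughout this line of work is the optimal value function of a discounted MDP, characterized by the classical Bellman optimality equation (standard background, not specific to the surveyed results):

```latex
% Bellman optimality equation for a discounted MDP (S, A, P, r, gamma):
v^*(s) = \max_{a \in A} \Big[\, r(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, v^*(s') \,\Big],
\qquad \gamma \in (0,1).
```

An ε-optimal policy is then one whose value function is within ε of v* at every state.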
Madeleine Udell: AI and the Future of Optimization Modeling
Optimization problems are pervasive in sectors from manufacturing and distribution to healthcare. However, most such problems are still solved heuristically by hand rather than optimally by state-of-the-art solvers, as the expertise required to formulate and solve these problems limits the widespread adoption of optimization tools and techniques. As a glimpse of the future, this talk will introduce OptiMUS, a Large Language Model (LLM)-based agent designed to formulate and solve MILP problems from natural language descriptions. OptiMUS can develop mathematical models, write and debug solver code, develop tests, and check the validity of generated solutions. Experimentally, OptiMUS correctly solves more than 80% of benchmark problems, more than twice as many as a basic LLM prompting strategy. More broadly, we discuss the potential for LLMs in domains where accuracy and fidelity to real-world data are critical, and strategies to augment and safeguard their performance.
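To make the task concrete, below is the kind of artifact OptiMUS aims to produce automatically from a natural-language description: a small hand-written MILP in PuLP (our illustrative example; the scenario and numbers are invented, and the code is not OptiMUS output):

```python
# "A plant makes desks and chairs. A desk takes 3 hours of carpentry and
#  1 hour of finishing; a chair takes 2 and 2. There are 240 carpentry
#  hours and 100 finishing hours. Profit is $40/desk and $25/chair.
#  How many of each should be made?"  -> hand-translated to a MILP:
from pulp import LpProblem, LpMaximize, LpVariable, LpStatus, value

prob = LpProblem("product_mix", LpMaximize)
desks = LpVariable("desks", lowBound=0, cat="Integer")
chairs = LpVariable("chairs", lowBound=0, cat="Integer")

prob += 40 * desks + 25 * chairs          # objective: total profit
prob += 3 * desks + 2 * chairs <= 240     # carpentry hours
prob += 1 * desks + 2 * chairs <= 100     # finishing hours

prob.solve()
print(LpStatus[prob.status], value(desks), value(chairs), value(prob.objective))
```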
Alejandro Ribeiro: Constrained Reinforcement Learning
Constrained reinforcement learning (CRL) involves multiple rewards that must individually accumulate to given thresholds. CRL arises naturally in cyberphysical systems, which are most often specified by a set of requirements. We explain in this talk that CRL problems have null duality gap even though they are not convex. This implies that they can be solved in the dual domain, but standard dual gradient descent algorithms may fail to find optimal policies. We circumvent this limitation by introducing a state-augmented algorithm in which Lagrange multipliers are incorporated in the state space. We show that state-augmented algorithms sample from stochastic policies that achieve target rewards. We further introduce resilient CRL as a mechanism to relax constraints when requirements are overspecified. We illustrate results and implications with safety and wireless communication applications.
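To fix notation, the dual approach described above rests on the standard Lagrangian formulation (our paraphrase of the setup, with generic value functions V_i and thresholds c_i):

```latex
\max_{\pi} \; V_0(\pi) \quad \text{s.t.} \quad V_i(\pi) \ge c_i, \quad i = 1, \dots, m.
% Lagrangian and projected dual descent on the multipliers:
L(\pi, \lambda) = V_0(\pi) + \sum_{i=1}^{m} \lambda_i \big( V_i(\pi) - c_i \big),
\qquad
\lambda_i \leftarrow \big[\, \lambda_i - \eta \big( V_i(\pi_\lambda) - c_i \big) \,\big]_+ .
```

The state-augmented idea is to let the policy condition on the pair (state, λ), so that running the multiplier dynamics online induces trajectories that meet the targets.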
Jacob Gardner: Bayesian Optimization for Scientific Discovery
Many scientific experimentation and discovery problems can be cast as black-box optimization problems, where the goal is to use a data-driven approach to jointly design a material, protein, or molecule that maximizes some desired property while knowing relatively little about the underlying objective. For example, a laboratory developing novel antibody therapeutics might wish to maximize binding affinity to some target antigen but will typically also care about yield, thermostability, humanization, and other constraints. In this talk, I will discuss some of our recent work advancing both the practical application and theory of Bayesian optimization, a widely studied surrogate-model-based approach that has emerged as one of the most successful frameworks for black-box optimization. Practically, our recent work has enabled optimization of objectives with constraints defined over complex, structured objects like molecules and peptides, leading to the discovery of diverse, novel antibiotic molecules that are as potent as, or more potent than, clinically available ones against high-priority drug-resistant pathogens in mice. Theoretically, I will argue that, surprisingly, local approaches to global optimization may partially circumvent the pessimistic lower bounds for optimizing high-dimensional black-box objectives.
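As background for readers new to the framework, a minimal Bayesian optimization loop with a Gaussian-process surrogate and an expected-improvement acquisition looks as follows (a textbook sketch with a toy 1-D objective, not code from the talk):

```python
import numpy as np
from scipy.stats import norm

def rbf_kernel(A, B, ls=0.2):
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

def gp_posterior(Xtr, ytr, Xte, noise=1e-6):
    # standard GP regression equations via a Cholesky factorization
    K = rbf_kernel(Xtr, Xtr) + noise * np.eye(len(Xtr))
    Ks = rbf_kernel(Xtr, Xte)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, ytr))
    v = np.linalg.solve(L, Ks)
    var = np.clip(1.0 - np.sum(v**2, axis=0), 1e-12, None)  # k(x,x)=1 for RBF
    return Ks.T @ alpha, np.sqrt(var)

def expected_improvement(mu, sd, best):
    z = (mu - best) / sd
    return (mu - best) * norm.cdf(z) + sd * norm.pdf(z)

f = lambda x: -np.sin(3 * x) - x**2 + 0.7 * x   # toy stand-in for a lab assay

rng = np.random.default_rng(0)
X = rng.uniform(-1, 2, (3, 1))                   # small initial design
y = f(X).ravel()
grid = np.linspace(-1, 2, 200)[:, None]
for _ in range(15):                              # 15 simulated experiments
    mu, sd = gp_posterior(X, y, grid)
    x_next = grid[np.argmax(expected_improvement(mu, sd, y.max()))]
    X, y = np.vstack([X, x_next]), np.append(y, f(x_next))
print("best x:", X[np.argmax(y)].item(), "best f:", y.max())
```

The work described in the talk replaces the toy box-constrained domain with structured spaces of molecules and peptides, which is where most of the practical difficulty lies.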
Jason Lee: Learning Representations and Associations with Gradient Descent
Machine Learning has undergone a paradigm shift with the success of pretrained models. Pretraining models via gradient descent learns transferable representations that adapt to a wide swath of downstream tasks. However, significant prior theoretical work has demonstrated that in many regimes, overparametrized neural networks trained by gradient descent behave like kernel methods, and do not learn transferable representations. In this talk, we close this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a neural network by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime.
Finally, I will demonstrate how pretraining learns associations for in-context learning with transformers. This leads to a systematic and mechanistic understanding of learning causal structures including the celebrated induction head identified by Anthropic.
Robert Freund: The Role of Level-Set Geometry on the Performance of PDHG for Conic Linear Optimization
In joint work with Zikai Xiong (MIT OR Center), we consider solving huge-scale instances of (convex) conic linear optimization problems, at the scale where matrix-factorization-free methods are attractive or necessary. The restarted primal-dual hybrid gradient method (rPDHG) -- with heuristic enhancements and GPU implementation -- has been very successful in solving huge-scale linear programming (LP) problems. Here we extend the study of rPDHG to encompass Conic Linear Optimization as well. We present a new theoretical analysis of rPDHG for general (convex) conic linear optimization and for LP as a special case thereof. We show a relationship between geometric measures of the primal-dual (sub-)level sets and the convergence rate of rPDHG. We also show how central-path-based linear transformations -- including conic rescaling -- can markedly enhance the convergence rate of rPDHG. Last of all, we present computational results that demonstrate how rescalings can accelerate convergence to high-accuracy solutions, and lead to more efficient methods for huge-scale LP problems in particular. Many of our ideas are aligned with and connected to work by Yinyu Ye, and these connections will be discussed in the talk.
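For orientation, the unrestarted PDHG iteration for a standard-form LP, applied to its saddle-point formulation, is the classical Chambolle-Pock update; the restarts, rescalings, and conic generalization are the subject of the talk:

```latex
% Saddle-point form of  min { c^T x : Ax = b, x >= 0 }:
\min_{x \ge 0} \max_{y} \; c^\top x + y^\top (b - Ax).
% One PDHG iteration, with step sizes satisfying tau * sigma * ||A||^2 < 1:
x^{k+1} = \Pi_{x \ge 0}\big( x^k - \tau \,(c - A^\top y^k) \big), \qquad
y^{k+1} = y^k + \sigma \big( b - A\,(2x^{k+1} - x^k) \big).
```

Being matrix-factorization-free, each iteration costs only two matrix-vector products with A, which is what makes the method attractive at huge scale.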
John Wright: Fitting Deep Networks to Structured Data: Theory and Architectures
Data in science and engineering often exhibit nonlinear, low-dimensional structure, due to the physical laws that govern data generation. In this talk, we study how deep neural networks interact with structured data:
+ When can we guarantee to fit and generalize?
+ How do the resources (depth, width, data) required depend on the complexity of the data?
+ How can we leverage physical prior knowledge to reduce these resource requirements?
Our main mathematical result is a guarantee of generalization for a model classification problem involving data on two low-dimensional manifolds: we prove that for networks of polynomial width, with polynomially many samples, randomly initialized gradient descent rapidly converges to a solution which correctly labels *every* point on the two manifolds. To our knowledge, this is the first such result for deep networks on data which are not linearly separable. We highlight intuitions about the roles of depth, width, sample complexity, and the geometry of feature representations, which may be useful in analyzing other problems involving low-dimensional structure.
We illustrate these ideas through applied problems in astrophysics and computer vision. In these settings, we further suggest how incorporating physical prior knowledge can reduce the resources (architecture, data) required for learning, leading to more efficient and interpretable learning architectures.
Joint work with Sam Buchanan, Dar Gilboa, Tingran (Tim) Wang, Jingkai Yan, and Shiyu Wang.
Marco Cuturi: On Parameterizing Optimal Transport with Elastic Costs
I will present in this talk an overview of the computation of optimal transport, focusing in particular on the challenge of computing OT maps using two samples from high-dimensional probability measures. After reviewing a few of the popular methods that have been explored for this task recently, including those leveraging neural architectures, I will introduce our recent work on parameterizing OT problems with elastic costs, i.e., ground costs that mix the classic squared Euclidean distance with a regularizer (e.g., the L1 norm). After highlighting the properties of OT maps that arise from such costs, I will present a method to compute ground-truth OT maps for elastic costs, as well as a method to adaptively learn the parameters of such a regularizer.
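Concretely, the elastic costs in question take the form (our rendering; γ ≥ 0 weights the regularizer):

```latex
c(x, y) = \tfrac{1}{2}\,\| x - y \|_2^2 + \gamma\,\tau(x - y),
\qquad \text{e.g. } \tau = \|\cdot\|_1 .
```

As we understand the construction, the appeal is that optimal displacements x − T(x) are then filtered through the proximal operator of γτ, so an L1 regularizer yields sparse displacements.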
Xinyi Chen: A Non-stochastic Control Approach to Optimization
Selecting the best hyperparameters for a particular optimization instance, such as the learning rate and momentum, is an important but non-convex problem. As a result, local methods such as hypergradient descent lack global optimality guarantees in general. We propose an online control methodology for mathematical optimization. We first describe meta-optimization, an online learning formulation of learning the best optimization algorithm from a class of methods. Meta-optimization over gradient-based methods can be framed as a feedback control problem over the choice of hyperparameters, including the learning rate, momentum, and the preconditioner. We show how recent methods from online non-stochastic control can be applied to develop a convex relaxation and obtain regret guarantees against the best offline solution. This guarantees that in meta-optimization, given a sequence of optimization problems, we can learn a method with performance comparable to that of the best method in hindsight from a class of methods. We end with experiments on regression and deep neural network training that demonstrate the effectiveness of meta-optimization.
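For contrast with the control viewpoint, here is hypergradient descent, the local method mentioned above, in its simplest form: the learning rate is itself updated by a gradient step, using the identity d f(x_t)/d η = −∇f(x_t)·∇f(x_{t−1}) (a minimal sketch on a toy quadratic; all constants are ours):

```python
import numpy as np

# toy quadratic f(x) = 0.5 x^T A x with condition number 10
A = np.diag([1.0, 10.0])
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
eta, beta = 0.01, 1e-4            # initial learning rate, hyper-learning rate
g_prev = np.zeros_like(x)
for _ in range(200):
    g = grad(x)
    # gradient step on eta itself, using d f(x_t)/d eta = -g_t . g_{t-1}
    eta += beta * float(g @ g_prev)
    x = x - eta * g
    g_prev = g
print(f"learned eta = {eta:.4f}, f(x) = {0.5 * x @ A @ x:.2e}")
```

Because this is a local update on a non-convex hyperparameter landscape, it carries no global guarantee, which is precisely the gap the meta-optimization framework addresses.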
Aryan Mokhtari: Online Learning Guided Quasi-Newton Methods: Improved Global Non-asymptotic Guarantees
Quasi-Newton (QN) methods are popular iterative algorithms known for their superior practical performance compared to Gradient Descent (GD)-type methods. However, the existing theoretical results for this class of algorithms do not sufficiently justify their advantage over GD-type methods. In this talk, we discuss our recent efforts to address this issue. Specifically, in the strongly convex setting, we propose the first “globally” convergent QN method that achieves an explicit “non-asymptotic superlinear” rate. We show that the rate presented for our method is provably faster than GD after at most $O(d)$ iterations, where $d$ is the problem dimension. Additionally, in the convex setting, we present an accelerated variant of our proposed method that provably outperforms the accelerated gradient method and converges at a rate of $O(\min\{1/k^2, \sqrt{d \log k }/ k^{2.5}\})$, where $k$ is the number of iterations. To attain these results, we diverge from conventional approaches and construct our QN methods based on the Hybrid Proximal Extragradient (HPE) framework and its accelerated variants. Furthermore, a pivotal algorithmic concept underpinning our methodologies is an online learning framework for updating the Hessian approximation matrices. Specifically, we relate our method's convergence rate to the regret of a specific online convex optimization problem in the matrix space and choose the sequence of Hessian approximation matrices to minimize its overall regret.
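One way to picture the online-learning component, schematically (our illustration; the paper's actual loss functions may differ): each step generates a secant pair, and the Hessian approximations are chosen by an online algorithm over matrices:

```latex
s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k), \qquad
\ell_k(B) = \| y_k - B\,s_k \|^2,
```

with the cumulative regret \sum_k \ell_k(B_k) - \min_B \sum_k \ell_k(B) entering the final convergence rate.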
Ashia Wilson: Two Approaches Towards Adaptive Optimization
This talk will address two recent projects I am excited about. The first concerns efficient methodologies for hyperparameter estimation in optimization algorithms; I will describe two approaches for adaptively estimating these parameters that often lead to significant improvements in convergence. The second describes a new method, the Metropolis-Adjusted Preconditioned Langevin Algorithm, for sampling from a convex body. Taking an optimization perspective, I focus on the mixing-time guarantees of these algorithms, an essential theoretical property for MCMC methods, under natural conditions on the target distribution and the geometry of the domain.
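For readers unfamiliar with the second project's template, a Metropolis-adjusted preconditioned Langevin step has the generic form below (the standard MALA-with-preconditioner template, not necessarily the exact algorithm of the talk):

```latex
% Proposal (pi proportional to exp(-f) on the domain, P a preconditioner):
y = x - \eta\, P(x)\,\nabla f(x) + \sqrt{2\eta}\; P(x)^{1/2}\, \xi, \qquad \xi \sim N(0, I),
% accepted with probability  min\{ 1,\; \pi(y)\, q(x \mid y) / (\pi(x)\, q(y \mid x)) \},
```

where q is the proposal density; the Metropolis filter removes the discretization bias of the bare Langevin step.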
Sewoong Oh: Private Fine-Tuning of Language Models without Backpropagation
Recent advances in training large language models (LLMs) with privacy reveal interesting phenomena that are specific to private optimization. One prominent observation is that the fine-tuning landscape of LLMs exhibits low-dimensional structure. We exploit this to design a novel memory-efficient private optimization method, which we call DPZero, that bypasses memory-heavy backpropagation and uses only forward passes. On fine-tuning a RoBERTa model with 355M parameters on text classification tasks, we obtain an 8x reduction in memory usage with little loss in performance. Our theoretical analysis explains how DPZero benefits from low-dimensional structure to achieve a dimension-independent guarantee. This is based on joint work (https://arxiv.org/abs/2310.09639) with Liang Zhang, Kiran Koshy Thekumparampil, and Niao He.
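A minimal sketch of the forward-pass-only mechanism as we read it from the abstract: perturb the parameters along a random direction, estimate a directional derivative from two forward passes, then clip and noise that single scalar per example (function names, defaults, the noise calibration, and the toy objective are all ours; see the linked paper for the actual DPZero algorithm and its privacy accounting):

```python
import numpy as np

def dpzero_step(theta, per_example_loss, lr=0.1, mu=1e-3, clip=1.0,
                sigma=1.0, rng=None):
    """One zeroth-order, differentially private update (schematic).

    per_example_loss(theta) -> vector of per-example losses, computed by
    forward passes only. The gradient estimate is (scalar) * u, so only
    the 1-D scalar statistic is clipped and noised: the privacy noise
    does not scale with the number of parameters.
    """
    rng = rng or np.random.default_rng()
    u = rng.standard_normal(theta.shape)
    u /= np.linalg.norm(u)
    # two forward passes, no backpropagation
    d = (per_example_loss(theta + mu * u)
         - per_example_loss(theta - mu * u)) / (2 * mu)
    d = np.clip(d, -clip, clip)          # bound each example's contribution
    noise = (sigma * clip / d.size) * rng.standard_normal()
    return theta - lr * (d.mean() + noise) * u

# toy usage: a least-squares objective standing in for a fine-tuning loss
rng = np.random.default_rng(1)
X = rng.standard_normal((64, 5))
w_true = np.arange(5.0)
y = X @ w_true
per_example_loss = lambda w: (X @ w - y) ** 2
w = np.zeros(5)
for _ in range(2000):
    w = dpzero_step(w, per_example_loss, rng=rng)
print(np.round(w, 2))
```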
Sophie Yu: Constant Regret Primal-Dual Policy for Multi-Way Dynamic Matching
We study a discrete-time dynamic multi-way matching model. There are finitely many agent types that arrive stochastically and wait to be matched. State-of-the-art dynamic matching policies in the literature require knowledge of all system parameters to determine an optimal basis of the fluid relaxation, and focus on controlling the number of waiting agents using only matches within the optimal basis (Kerimov et al., 2024, 2023; Gupta, 2024). In this paper, we propose a primal-dual policy that schedules matches for future arrivals based on an estimator of the dual solution. Our policy does not require knowledge of the arrival rates and operates with greater flexibility, as it does not restrict matches to the match types within an optimal basis. We show that our policy is the first to achieve constant regret at all times under unknown arrival rates, and that when the arrival rates are known, it achieves the optimal scaling given by the lower bound described in Kerimov et al. (2024, 2023). Furthermore, when the arrival rates are known, the primal-dual policy significantly outperforms alternative dynamic matching policies in several numerical simulations.
Coauthors: Yehua Wei and Jiaming Xu.
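For context, the fluid relaxation that such policies build on has the generic form below (our schematic, not the paper's exact formulation): x_m is the long-run rate of performing match type m, r_m its reward, and λ_i the arrival rate of agent type i:

```latex
\max_{x \ge 0} \; \sum_{m} r_m\, x_m
\qquad \text{s.t.} \qquad \sum_{m \,\ni\, i} x_m \le \lambda_i \quad \text{for every agent type } i.
```

The dual variables of these capacity constraints are what the proposed policy estimates online, in place of the optimal basis that prior policies compute from known arrival rates.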