Combinatorial Optimization on Massive Datasets: Streaming, Distributed, and
Massively Parallel Computation

Author: Sepehr Assadi.
PhD Thesis: Computer and Information Science Department, University of Pennsylvania, August 2018
Recipient of the European Association for Theoretical Computer Science (EATCS) Distinguished Dissertation Award, 2019.
Recipient of the ACM-EATCS Principles of Distributed Computing Doctoral Dissertation Award, 2019.
Recipient of the Morris and Dorothy Rubinoff Award for best Computer Science PhD thesis at University of Pennsylvania, 2019.
Abstract: With the emergence of massive datasets across different application domains, there is a rapidly growing need to solve various optimization tasks over such datasets. This in turn raises the following fundamental question:

How well can we solve a large-scale optimization problem on massive datasets in a resource-efficient manner?

The focus of this thesis is on answering this question for various problems in different modern computational models for processing massive datasets, in particular, streaming, distributed, and massively parallel computation (such as MapReduce) models.

The first part of this thesis is focused on graph optimization, in which we study several fundamental graph optimization problems including matching, vertex cover, and connectivity. The main results in this part include a tight bound on the space complexity of the matching problem in dynamic streams, a unifying framework for achieving algorithms that improve the state-of-the-art for matching and vertex cover problems in all the men- tioned models, and the first massively parallel algorithm for connectivity on sparse graphs that improve upon the classical parallel PRAM algorithms for a large family of graphs.

In the second part of the thesis, we consider submodular optimization and in particular two canonical examples of set cover and maximum coverage problems. We establish tight bounds on the space complexity of approximating set cover and maximum coverage in both single- and multi-pass streams, and build on these results to address the general problem of submodular maximization across all aforementioned models.

In the last part, we consider resource constrained optimization settings which require computation over data which is not particularly large but still imposes restrictions of similar nature. Examples include optimization over data which can only be accessed with limited adaptivity (e.g. in crowdsourcing applications), or corresponds to private informa- tion of agents and is not available in an integrated form (e.g., in auction and mechanism design applications). We use the toolkit developed in the first two parts of the thesis to obtain several new results for such problems, significantly improving the state-of-the-art.

A common theme in this thesis is a rigorous study of problems from both an algorithmic perspective and impossibility results. Rather than coming up with ad hoc solutions to problems at hand, the goal here is to develop general techniques for solving various large-scale optimization problems in a unified way in different models, which in turns requires further understanding of the limitations of known approaches for addressing these problems.
Summary: [PDF]
Full version: [PDF]