### Monday, December 14, 2015 - 2:00pm

### Location:

7101 Gates & Hillman Centers### Speaker:

VENKATA KRISHNA PILLUTLA, Masters Student http://www.cs.cmu.edu/~vpillutlIn distributed machine learning, data is dispatched to multiple machines for processing. Traditional distributed Machine Learning systems dealing with high volume of data randomly distribute the data among workers and either learn a linear model for each partition of data or in tandem where the workers contribute updates to one synchronized global model. Motivated by the fact that similar data points are often belonging to the same or similar classes, and more generally, classification rules of high accuracy tend to be “locally simple but globally complex” [54], we propose data dependent dispatching that takes advantage of such structures. Our main technical contribution is to provide algorithms with provable guarantees for data-dependent dispatching, that partition the data in a way that satisfies important conditions for effective distributed learning, including fault tolerance and load-balacing. We achieve this by initially clustering a small sample and dispatching the rest of the dataset using Nearest Neighbor Dispatch. For the first step, we present approximation algorithms with provable guarantees for the NP-Hard problem of bal anced clustering. For dispatch, we use ideas from Nearest Neighbor Classification to show that the dispatcher respects locality and load balancing properties of the clustering. After data-dependent partitioning, learning can be performed independently with no communication, or with complete communication, we each machine storing a local correction. We show the effectiveness of our method over the widely used random partitioning scheme with the same communication budget on several real world image and advertising dataset Thesis Committee:Maria-FLorina Balcan (Chair)Alexander J. SmolaChristos Faloutsos Copy of Thesis Document