Doctoral Thesis Proposal - Daiyaan Arfeen

Time:
— 3:30pm

Location:
In Person - ASA Conference Room, Gates Hillman 6115

Speaker:
DAIYAAN ARFEEN, Ph.D. Student, Computer Science Department, Carnegie Mellon University
https://csd.cmu.edu/people/doctoral-student/daiyaan-arfeen

Designing Scalable DNN Training Systems to Overcome Algorithmic Constraints

LLM training requires massive amounts of compute due to large model and dataset sizes, so it is not unusual to train LLMs on tens or hundreds of thousands of GPUs to complete training in a reasonable amount of time (days or weeks). However, GPU failures (which are common at these scales) and data dependencies (introduced by the training algorithms) can lead to severe GPU underutilization.

In this talk, we present distributed LLM training systems that are efficient and fault-tolerant at these scales. We first present Nonuniform Tensor Parallelism (NTP), a technique that increases the fault tolerance of tensor-parallel training, thereby reducing the blast radius of GPU failures. NTP enables scale-up training with little to no loss in training efficiency under realistic rates of GPU failures. Next, we present PipeFill, a system that recovers GPU utilization (lost due to scale-out training) by filling pipeline bubbles with third-party, latency-insensitive jobs. We will discuss how PipeFill could be extended to fill pipeline bubbles with online inference jobs, which are latency-sensitive.
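To make the NTP idea concrete, here is a minimal Python sketch (not the thesis implementation; the function name and dimensions are hypothetical) of the core layout decision: when one GPU in a tensor-parallel group fails, the weight shards are re-split, possibly nonuniformly, across the survivors instead of aborting the whole group.

    def shard_sizes(hidden_dim, num_gpus):
        """Split a weight dimension across a tensor-parallel group as evenly
        as possible; any remainder goes to the first few GPUs, so shard
        sizes may be nonuniform."""
        base, extra = divmod(hidden_dim, num_gpus)
        return [base + 1 if i < extra else base for i in range(num_gpus)]

    hidden = 4096
    print(shard_sizes(hidden, 8))  # healthy group: uniform shards, [512] * 8
    print(shard_sizes(hidden, 7))  # after one failure: [586, 585, 585, ...]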
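Similarly, a rough sketch of the PipeFill scheduling idea, assuming a GPipe-style schedule in which each GPU idles for about (S - 1) / (M + S - 1) of a training step (S pipeline stages, M microbatches). The Bubble and FillJob classes and the numbers below are illustrative, not taken from the talk.

    from dataclasses import dataclass

    @dataclass
    class Bubble:
        stage: int          # pipeline stage whose GPU is idle
        duration_ms: float  # idle time in this bubble

    @dataclass
    class FillJob:
        slice_ms: float        # granularity of one preemptible work slice
        remaining_slices: int  # e.g. batch-inference or eval work

    def fill_bubbles(bubbles, job):
        """Greedily pack fill-job slices into pipeline bubbles and return
        the total GPU time recovered."""
        recovered_ms = 0.0
        for b in bubbles:
            budget = b.duration_ms
            while job.remaining_slices > 0 and budget >= job.slice_ms:
                budget -= job.slice_ms
                job.remaining_slices -= 1
                recovered_ms += job.slice_ms
        return recovered_ms

    S, M, step_ms = 8, 32, 1000.0                # stages, microbatches, step time
    bubble_ms = step_ms * (S - 1) / (M + S - 1)  # ~180 ms idle per GPU per step
    bubbles = [Bubble(stage=s, duration_ms=bubble_ms) for s in range(S)]
    job = FillJob(slice_ms=10.0, remaining_slices=10_000)
    print(f"recovered {fill_bubbles(bubbles, job):.0f} ms of GPU time per step")

In practice, such a scheduler would also have to manage GPU memory and preempt fill jobs so the main training job is never delayed, which is the latency-sensitivity challenge the talk raises for online inference bubble-filling.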

Thesis Committee

Greg Ganger (Chair)
Zhihao Jia
Phillip B. Gibbons
Dheevatsa Mudigere (NVIDIA)

