Abstract:
Scaling up AI systems requires more than just GPUs; it demands a fundamental rethinking of how we move, compute, and represent information across the entire AI stack. In this talk, I will begin by addressing the communication bottleneck that limits distributed training and inference. I will present two complementary directions: compression, which reduces communication overhead while preserving convergence and accuracy; and decentralization, which enables collaborative training across heterogeneous, low-bandwidth environments without centralized coordination. Building on this foundation, the second part of the talk extends beyond communication to explore the modeling and hardware dimensions of scalability. I will discuss our recent efforts to remove tokenization bottlenecks through byte-level modeling, enhance long-context generation via chunk-based sparse attention, and achieve low-latency inference through unified mega-kernel execution. Together, these directions outline a holistic roadmap for scaling AI systems.
Bio:
Yucheng Lu is an Assistant Professor of Computer Science at NYU Shanghai. His research focuses on building scalable and efficient systems for training and serving AI models. He received his Ph.D. in Computer Science from Cornell University, where his work on distributed and communication-efficient learning earned recognition at top venues such as ICML, NeurIPS, and ICLR, including the ICML Outstanding Paper Runner-Up Award, and where he was awarded a Meta PhD Fellowship. Beyond academia, Yucheng has worked at Microsoft, Google, and Amazon, and was part of the early technical team at Together AI, where he contributed to the design of high-performance inference engines and GPU kernels for large-scale language models. At NYU Shanghai, he leads the HeavyBall Research Group, which aims to advance the next generation of AI systems through algorithm–system co-design.
