Xiaonan Nie (聂小楠)

Xiaonan is currently a staff research scientist specializing in ML systems at ByteDance in San Jose, within the TopSeed Program. His work primarily focuses on scaling and optimizing the training of deep learning models. He received his Ph.D. degree in Computer Science from Peking University in 2024, under the supervision of Prof. Bin Cui.

In terms of academic research, Xiaonan has published over 10 papers in top conferences and journals. He won the Best Scalable Data Science Award at VLDB 2022, and was invited to present his research on MoE training at the 1st Google MoE workshop and his research on LLM inference at NVIDIA’s GPU technology conference (GTC) 2024.

In terms of system design and implementation, Xiaonan was the principal developer of HETU, a distributed DL system of Peking University; the technical lead for Angel-PTM of Tencent in 2022-2023, an industrial-level LLM training system for trillion-scale models. Additionally, he lead the MLSys group of Baichuan-AI (a GenAI start-up) in 2023-2024, where he takes responsibility for the rapid training of the Baichuan series models.

Email: xiaonan.nie [AT] pku.edu.cn, niexiaonan [AT] bytedance.com

Research Interests:

Unified Model for Understanding and Generation
Algorithm-System Co-Design
Machine Learning System
Data Management

🔍🔍🔍 I am seeking highly motivated full-time employees and research interns. If interested, please contact me directly!

What’s New:

Feb, 2025: Two papers were accepted by SIGMOD 2025.
Jan, 2025: One paper was accepted by ICLR 2025.
Jan, 2025: One paper was accepted by ICDE 2025.
Sep, 2024: One paper was accepted by NeurIPS 2024.
Sep, 2024: One paper was accepted by SIGMOD 2025.
Aug, 2024: One paper was accepted by SOSP 2024.
Jun, 2024: I was awarded as Outstanding Graduate of Peking University.
May, 2024: I successfully defended my Ph.D. dissertation!
Mar, 2024: I was invited to present my research on NVIDIA GTC 2024.
Feb, 2024: One paper was accepted by TKDE 2024.

Systems:

Hetu: A Highly Efficient Automatic Parallel Distributed Deep Learning System
- 2021 Synced Machine Intelligence TOP-10 Open Source Awards.
- Pop SOTA！List for AI Developers 2021.
- Outstanding Award & Champion of [2021 CCF BDCI Contest]
Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent
- Supports the training of trillion-level models (e.g., HunYuan-NLP 1T, Top-1 model in CLUE)

Tech Reports:

[ImageGen] Seedream 4.0: Toward next-generation multimodal image generation, ByteDance Seed. PDF
[VideoGen] Seedance 1.0: Exploring the Boundaries of Video Generation Models, ByteDance Seed. PDF
[Multimodal] Bagel: Emerging Properties in Unified Multimodal Pretraining, ByteDance Seed. PDF
[LLM] Baichuan 2: Open large-scale language models, Baichuan Inc. PDF

Publications:

2025

[SIGMOD] Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, Yangyu Tao, Bin Cui. “Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs.”.
[SIGMOD] Haoyang Li, Fangcheng Fu, Hao Ge, Sheng Lin, Xuanyu Wang, Jiawen Niu, Yujie Wang, Hailin Zhang, Xiaonan Nie, Bin Cui. “Malleus: Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization.”.
[SIGMOD] Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, Bin Cui. “PQCache: Product Quantization-based KVCache for Long Context LLM Inference.”.
[SIGCOMM] Hao Ge, Junda Feng, Qi Huang, Fangcheng Fu, Xiaonan Nie, Lei Zuo, Haibin Lin, Bin Cui, Xin Liu. “ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs.”.
[ICLR] Xinyi Liu, Yujie Wang, Fangcheng Fu, Xupeng Miao, Shenhan Zhu, Xiaonan Nie, Bin Cui. “NetMoE: Accelerating MoE Training through Dynamic Sample Placement.”.
[ICDE] Keer Lu, Xiaonan Nie, Zheng Liang, Da Pan, Shusen Zhang, Weipeng Chen, Zenan Zhou, Guosheng Dong, Bin Cui, Wentao Zhang. “DataSculpt: A Holistic Data Management Framework for Long-Context LLMs Training.”.

2024

[NeurIPS] Xiaonan Nie, Qibin Liu, Fangcheng Fu, Shenhan Zhu, Xupeng Miao, Xiaoyang Li, Yang Zhang, Shouda Liu, Bin Cui. “LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing.”.
[SOSP] Hao Ge, Fangcheng Fu, Haoyang Li, Xuanyu Wang and Sheng Lin, Yujie Wang, Xiaonan Nie, Hailin Zhang, Xupeng Miao, Bin Cui. “Enabling Parallelism Hot Switching for Efficient Training of Large Language Models”.
[TKDE] Yujie Wang, Youhe Jiang, Xupeng Miao, Fangcheng Fu, Shenhan Zhu, Xiaonan Nie, Yaofeng Tu, Bin Cui. “Improving Automatic Parallel Training via Balanced Memory Workload Optimization”.

2023

[SIGMOD] Xiaonan Nie, Xupeng Miao, Zilong Wang, Jilong Xue, Lingxiao Ma, Zichao Yang, Gang Cao and Bin Cui. “FlexMoE: Scaling Large-scale Sparse Pre-trained Model Training via Dynamic Device Placement”.
[VLDB] Xiaonan Nie, Yi Liu, Fangcheng Fu, Jinbao Xue, Dian Jiao, Xupeng Miao, Yangyu Tao, and Bin Cui. “Angel-PTM: A Scalable and Economical Large-scale Pre-training System in Tencent”.
[VLDB] Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin Zhang and Bin Cui. “Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism”.
[IJCAI] Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie and Bin Cui. “OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning”.

2022

[ICDE] Xiaonan Nie, Xupeng Miao, Zhi Yang and Bin Cui. “TSPLIT: Fine-grained GPU Memory Management for Efficient DNN Training via Tensor Splitting”.
[1st Google MoE Workshop] Xiaonan Nie, Xupeng Miao, Shijie Cao, Lingxiao Ma, Qibin Liu, Jilong Xue, Youshan Miao, Yi Liu, Zhi Yang and Bin Cui. “EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate”.
[SCIS] Xupeng Miao, Xiaonan Nie, Hailin Zhang, Tong Zhao and Bin Cui. “Hetu: A highly efficient automatic parallel distributed deep learning system”.
[VLDB] Xupeng Miao, Hailin Zhang, Yining Shi, Xiaonan Nie, Zhi Yang, Yangyu Tao and Bin Cui. “HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework”. Best Scalable Data Science Paper!
[SIGMOD] Xupeng Miao, Yining Shi, Hailin Zhang, Xin Zhang, Xiaonan Nie, Zhi Yang and Bin Cui. “HET-GMP: A Graph-based System Approach to Scaling Large Embedding Model Training”.

2021

[SIGMOD] Xupeng Miao, Xiaonan Nie, Yingxia Shao, Zhi Yang, Jiawei Jiang, Lingxiao Ma and Bin Cui. “Heterogeneity-Aware Distributed Machine Learning Training via Partial Reduce”.

Academic Services

Program Committee of ICDE, WWW, FAISys
Reviewer of ICLR, NeurIPS, ICML, CVPR

Nie Xiaonan