Bowen Wang (王博文)

I am a senior undergraduate student in the Department of Computer Science and Technology at Tsinghua University. I am currently visiting the Sky Computing Lab at UC Berkeley, advised by Professor Ion Stoica.

My research focuses on machine learning systems, especially inference acceleration for large language models.

Previously, I was advised by Professor Jie Tang and Professor Yuxiao Dong in the Knowledge Engineering Group at Tsinghua University and Z.ai. In 2024, I took part in the Stanford Undergraduate Visiting Research (UGVR) program, advised by Professor Tsachy Weissman.

CV / Email / Google Scholar / GitHub / LinkedIn / X

Research

	Barbarians at the Gate: How AI is Upending Systems Research. Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, Jeff Chen, Lakshya Agrawal, Aditya Desai, Jiarong Xing, Koushik Sen, Matei Zaharia, Ion Stoica. arXiv*. [paper] [code] [website]
	PrefillOnly: An Inference Engine for Prefill-only Workloads in LLM Applications. Kuntai Du, Bowen Wang, Chen Zhang, Yiming Cheng, Qing Lan, Hejian Sang, Yihua Cheng, Jiayi Yao, Xiaoxuan Liu, Yifan Qiao, Ion Stoica, Junchen Jiang. SOSP 2025. [paper] [code]
	APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding. Mingdao Liu, Aohan Zeng, Bowen Wang, Peng Zhang, Jie Tang, Yuxiao Dong. arXiv. [paper] [code]
	AgentTuning: Enabling Generalized Agent Abilities for LLMs. Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, Jie Tang. ACL 2024 Findings*. [paper] [code] [website]

Open-Source Projects

vLLM Expert Parallelism Load Balancer (EPLB)

[repo] [pr]

Implemented a load balancer that rearranges the experts dynamically based on observed expert usage to solve the load imbalance problem of sparse Mixture-of-Experts (MoE) inference.

Supports redundant experts: popular (heavy-hitter) experts are replicated so that their load is spread across more compute resources.

Achieves up to 30% throughput improvement and 25% latency reduction in sparse MoE inference. Core component of large-scale expert-parallel inference.
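The idea behind the load balancer can be illustrated with a minimal sketch. This is not the vLLM implementation: the function names `replicate_experts` and `place_replicas`, the greedy heuristics, and the assumption that a replicated expert's load splits evenly across its replicas are all hypothetical choices made for illustration.

```python
import heapq


def replicate_experts(loads, num_slots):
    """Give every expert one replica, then hand each spare slot to the
    expert whose per-replica load is currently the highest."""
    if num_slots < len(loads):
        raise ValueError("need at least one slot per expert")
    replicas = [1] * len(loads)
    # heapq is a min-heap, so store negated per-replica load.
    heap = [(-load, e) for e, load in enumerate(loads)]
    heapq.heapify(heap)
    for _ in range(num_slots - len(loads)):
        _, e = heapq.heappop(heap)
        replicas[e] += 1
        # Assumption: traffic splits evenly across an expert's replicas.
        heapq.heappush(heap, (-loads[e] / replicas[e], e))
    return replicas


def place_replicas(loads, replicas, num_gpus):
    """Greedy packing: heaviest replica first, onto the least-loaded GPU
    that still has a free slot. Every GPU hosts the same number of replicas."""
    total = sum(replicas)
    if total % num_gpus != 0:
        raise ValueError("slots must divide evenly across GPUs")
    slots_per_gpu = total // num_gpus
    items = sorted(
        ((loads[e] / replicas[e], e)
         for e in range(len(loads))
         for _ in range(replicas[e])),
        reverse=True,
    )
    gpu_load = [0.0] * num_gpus
    gpu_experts = [[] for _ in range(num_gpus)]
    for load, e in items:
        free = [g for g in range(num_gpus) if len(gpu_experts[g]) < slots_per_gpu]
        g = min(free, key=lambda i: gpu_load[i])
        gpu_experts[g].append(e)
        gpu_load[g] += load
    return gpu_experts, gpu_load


# Observed token counts per expert: expert 0 is a heavy hitter.
observed = [90, 10, 10, 10, 10, 10]
replicas = replicate_experts(observed, num_slots=8)
placement, per_gpu = place_replicas(observed, replicas, num_gpus=4)
print(replicas, placement, per_gpu)
```

In this toy input, the two spare slots both go to expert 0, so no single GPU has to absorb its full load. A real system would recompute the placement periodically from fresh usage statistics and would also need to move expert weights between devices, which the sketch leaves out.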

Experience

	Sky Computing Lab, University of California, Berkeley (2024.12 - Present). Visiting Student Researcher. Advisor: Prof. Ion Stoica
	Undergraduate Visiting Research (UGVR), Stanford University (2024.07 - 2024.08). Research Intern. Advisor: Prof. Tsachy Weissman
	Knowledge Engineering Group (KEG), Tsinghua University (2023.07 - 2024.06). Research Intern. Advisor: Prof. Jie Tang & Prof. Yuxiao Dong
	Z.ai, Beijing, China (2023.07 - 2024.06). Research Intern, Member of the GLM Training Team

Other Projects

	NOP-Processor: Out-of-Order LoongArch Core Competition, NSCSCC 2023 (LoongArch Track) [video] [code] Special Prize (National Top 1), 7th "Loongson Cup" CPU Design Competition. NOP-Processor is a high-performance out-of-order processor core for the LoongArch architecture, used as a strong baseline for the National Student System Capability Challenge (NSCSCC 2023). The project implements a modern pipeline with speculative execution, branch prediction, and an on-chip memory hierarchy, targeting both high frequency and competitive performance on the LoongArch benchmark suite.
	Dino Fit Adventure: Chrome Dino with Full-Body Control Coursework, Digital Logic Design, Tsinghua University [video] [code] Dino Fit Adventure lets you play Chrome Dino in the real world using your body movements, built as a digital design course project on FPGA. The system reads acceleration data from a wearable sensor to detect jumping and other actions, decodes the data in hardware, and renders a smooth VGA game pipeline, handling clock-domain crossing, buffering, and timing closure in SystemVerilog.
	Simple RDBMS from Scratch Coursework, Introduction to Database Management Systems, Tsinghua University [code] This project implements a simple relational database management system (RDBMS) from scratch, supporting core features such as CRUD operations, indexing, constraints, aggregation, and join queries. It focuses on building the end-to-end query pipeline, including a storage layer, execution engine, and basic optimization, to expose the full lifecycle of SQL query processing in a compact system.
	Wordle in Rust with WebAssembly Coursework, Programming Training, Tsinghua University [website] [code] This project implements the Wordle word game in Rust, compiled to WebAssembly so that it runs efficiently in the browser. It explores the ergonomics of Rust for game logic, safe state management, and the toolchain for building, optimizing, and deploying Rust+WASM applications to the web.

Miscellaneous

I have a passion for learning languages – both programming and natural.
Some languages I'm using / learning:

🇨🇳 中文	你好！
🇺🇸 English	Hello!
🇯🇵 日本語	こんにちは！
🇰🇷 한국어	안녕하세요!
🇪🇸 Español	¡Hola!
🐍 Python	`print('Hello, world!')`
🦀 Rust	`println!("Hello, world!");`

Design and source code from Jon Barron's website
