index.json
[{"authors":null,"categories":null,"content":"I am a Ph.D. candidate in the School of Computer Science at Peking University, advised by Xin Jin. Before that, I received my B.S. degree (Summa Cum Laude) in computer science from Turing Class, Peking University. My research interests include machine learning systems, video conferencing, and cloud computing.\n","date":1727049600,"expirydate":-62135596800,"kind":"term","lang":"en","lastmod":1727049600,"objectID":"2525497d367e79493fd32b198b28f040","permalink":"https://BingyangWu.github.io/authors/admin/","publishdate":"0001-01-01T00:00:00Z","relpermalink":"/authors/admin/","section":"authors","summary":"I am a Ph.D. candidate in the School of Computer Science at Peking University, advised by Xin Jin. Before that, I received my B.S. degree (Summa Cum Laude) in computer science from Turing Class, Peking University.","tags":null,"title":"Bingyang Wu","type":"authors"},{"authors":["Bingyang Wu","Shengyu Liu","Yinmin Zhong","Peng Sun","Xuanzhe Liu","Xin Jin"],"categories":null,"content":"","date":1727049600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1727049600,"objectID":"026fbe36114bd4ed6896c54ad18ea8f9","permalink":"https://BingyangWu.github.io/publication/loongserve/","publishdate":"2017-01-01T00:00:00Z","relpermalink":"/publication/loongserve/","section":"publication","summary":"The context window of large language models (LLMs) is rapidly increasing, leading to a huge variance in resource usage between different requests as well as between different phases of the same request. Restricted by static parallelism strategies, existing LLM serving systems cannot efficiently utilize the underlying resources to serve variable-length requests in different phases. To address this problem, we propose a new parallelism paradigm, elastic sequence parallelism (ESP), to elastically adapt to the variance between different requests and phases. Based on ESP, we design and build LoongServe, an LLM serving system that (1) improves computation efficiency by elastically adjusting the degree of parallelism in real-time, (2) improves communication efficiency by reducing key-value cache migration overhead and overlapping partial decoding communication with computation, and (3) improves GPU memory efficiency by reducing key-value cache fragmentation across instances. Our evaluation under diverse real-world datasets shows that LoongServe improves the maximum throughput by up to 3.85× compared to the chunked prefill and 5.81× compared to the prefill-decoding disaggregation.","tags":[],"title":"LoongServe: Efficiently Serving Long-Context Large Language Models with Elastic Sequence Parallelism","type":"publication"},{"authors":["Yinmin Zhong","Zili Zhang","Bingyang Wu","Shengyu Liu","Yukun Chen","Changyi Wan","Hanpeng Hu","Lei Xia","Ranchen Ming","Yibo Zhu","Xin Jin"],"categories":null,"content":"","date":1725235200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1725235200,"objectID":"73fbca90dd870595605bd0e80895a0eb","permalink":"https://BingyangWu.github.io/publication/rlhfuse/","publishdate":"2024-09-02T00:00:00Z","relpermalink":"/publication/rlhfuse/","section":"publication","summary":"Reinforcement Learning from Human Feedback (RLHF) enhances the alignment between LLMs and human preference. The workflow of RLHF typically involves several models and tasks in a series of distinct stages. Existing RLHF training systems view each task as the smallest execution unit thus overlooking the opportunities for subtask-level optimizations. 
Due to the intrinsic nature of RLHF training, i.e., the data skewness in the generation stage, and the pipeline bubbles in the training stage, existing RLHF systems suffer from low GPU utilization in production deployments. RLHFuse breaks the traditional view of RLHF workflow as a composition of individual tasks, splitting each task into finer-grained subtasks, and performing stage fusion to improve GPU utilization. RLHFuse contains two key ideas. First, for generation and inference tasks, RLHFuse splits them into sample-level subtasks, enabling efficient inter-stage fusion to mitigate the original generation bottleneck dominated by long-tailed samples. Second, for training tasks, RLHFuse breaks them into subtasks of micro-batches. By leveraging the intuition that pipeline execution can be essentially complemented by another pipeline, RLHFuse performs intra-stage fusion to concurrently execute these subtasks in the training stage with a fused pipeline schedule, resulting in fewer pipeline bubbles. In addition, RLHFuse incorporates a series of system optimizations tailored for each stage of RLHF, making it efficient and scalable for our internal product usage. We evaluate RLHFuse on various popular LLMs and the results show that RLHFuse increases the training throughput by up to 3.7x, compared to existing state-of-the-art systems.","tags":[],"title":"RLHFuse: Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion","type":"publication"},{"authors":["Bingyang Wu","Ruidong Zhu","Zili Zhang","Peng Sun","Xuanzhe Liu","Xin Jin"],"categories":null,"content":"","date":1720742400,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1720742400,"objectID":"ab97b8b33881bf27a7a42ebc2ab376f2","permalink":"https://BingyangWu.github.io/publication/dlora/","publishdate":"2017-01-01T00:00:00Z","relpermalink":"/publication/dlora/","section":"publication","summary":"Low-rank adaptation (LoRA) is a popular approach to finetune pre-trained large language models (LLMs) to specific domains. This paper introduces dLoRA, an inference serving system for LoRA models. dLoRA achieves high serving efficiency by dynamically orchestrating requests and LoRA adapters in terms of two aspects: (i) dynamically merge and unmerge adapters with the base model; and (ii) dynamically migrate requests and adapters between different worker replicas. These capabilities are designed based on two insights. First, despite the allure of batching without merging a LoRA adapter into the base model, it is not always beneficial to unmerge, especially when the types of requests are skewed. Second, the autoregressive nature of LLM requests introduces load imbalance between worker replicas due to varying input and output lengths, even if the input requests are distributed uniformly to the replicas. We design a credit-based batching algorithm to decide when to merge and unmerge, and a request-adapter co-migration algorithm to decide when to migrate. The experimental results show that dLoRA improves the throughput by up to 57.9× and 26.0×, compared to vLLM and HugginFace PEFT, respectively. 
Compared to the concurrent work S-LoRA, dLoRA achieves up to 1.8× lower average latency.","tags":[],"title":"dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving","type":"publication"},{"authors":["Mengwei Xu","Wangsong Yin","Dongqi Cai","Rongjie Yi","Daliang Xu","Qipeng Wang","Bingyang Wu","Yihao Zhao","Chen Yang","Shihe Wang","Qiyang Zhang","Zhenyan Lu","Li Zhang","Shangguang Wang","Yuanchun Li","Yunxin Liu","Xin Jin","Xuanzhe Liu"],"categories":null,"content":"","date":1705363200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1705363200,"objectID":"d58e2d99e05fb49e7416f679ac04e15a","permalink":"https://BingyangWu.github.io/publication/llmsurvey/","publishdate":"2017-01-01T00:00:00Z","relpermalink":"/publication/llmsurvey/","section":"publication","summary":"Large foundation models, including large language models (LLMs), vision transformers (ViTs), diffusion, and LLM-based multimodal models, are revolutionizing the entire machine learning lifecycle, from training to deployment. However, the substantial advancements in versatility and performance these models offer come at a significant cost in terms of hardware resources. To support the growth of these large models in a scalable and environmentally sustainable way, there has been a considerable focus on developing resource-efficient strategies. This survey delves into the critical importance of such research, examining both algorithmic and systemic aspects. It offers a comprehensive analysis and valuable insights gleaned from existing literature, encompassing a broad array of topics from cutting-edge model architectures and training/serving algorithms to practical system designs and implementations. The goal of this survey is to provide an overarching understanding of how current approaches are tackling the resource challenges posed by large foundation models and to potentially inspire future breakthroughs in this field.","tags":[],"title":"A Survey of Resource-efficient LLM and Multimodal Foundation Models","type":"publication"},{"authors":["Bingyang Wu","Kun Qian","Bo Li","Yunfei Ma","Qi Zhang","Zhigang Jiang","Jiayu Zhao","Dennis Cai","Ennan Zhai","Xuanzhe Liu","Xin Jin"],"categories":null,"content":"","date":1694304000,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1694304000,"objectID":"1a39fbb7d0a92552ffeb5a255f32a0c1","permalink":"https://BingyangWu.github.io/publication/xron/","publishdate":"2017-01-01T00:00:00Z","relpermalink":"/publication/xron/","section":"publication","summary":"Quality and cost are two key considerations for video conferencing services. Service providers face a dilemma when selecting network tiers to build their infrastructure---relying on Internet links has poor quality, while using premium links brings excessive cost.We present XRON, a hybrid elastic cloud overlay network for our planetary-scale video conferencing service. XRON differs from prior overlays with two distinct features. First, XRON is hybrid, i.e., it leverages both Internet and premium links to simultaneously achieve high quality and low cost. Second, XRON is elastic, i.e., it exploits elastic cloud resources to adaptively scale its capacity based on realtime demand. The data plane of XRON combines active probing and passive tracking for scalable link state monitoring, uses asymmetric forwarding based on heterogeneous bidirectional link qualities, and quickly reacts to sudden link degradations without the control plane involvement. 
The control plane of XRON predicts video traffic based on application knowledge, and computes global forwarding paths and reaction plans with scalable algorithms. Large-scale deployment in DingTalk shows that XRON reduces video stall ratio and bad audio fluency by 77\\% and 65.2\\%, respectively, compared to using Internet links only, and reduces cost by 79\\%, compared to using premium links only.","tags":[],"title":"XRON: A Hybrid Elastic Cloud Overlay Network for Video Conferencing at Planetary Scale","type":"publication"},{"authors":["Bingyang Wu","Yinmin Zhong","Zili Zhang","Gang Huang","Xuanzhe Liu","Xin Jin"],"categories":null,"content":"","date":1683676800,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1683676800,"objectID":"a434cc20cedf724900aca6daebe117bf","permalink":"https://BingyangWu.github.io/publication/fastgen/","publishdate":"2023-05-10T00:00:00Z","relpermalink":"/publication/fastgen/","section":"publication","summary":"Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demand low job completion time (JCT) for model inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long JCT. We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize JCT with a novel skip-join Multi-Level Feedback Queue scheduler. Based on the new semi information-agnostic setting of LLM inference, the scheduler leverages the input length information to assign an appropriate initial queue for each arrival job to join. The higher priority queues than the joined queue are skipped to reduce demotions. We design an efficient GPU memory management mechanism that proactively offloads and uploads intermediate states between GPU memory and host memory for LLM inference. We build a system prototype of FastServe based on NVIDIA FasterTransformer. Experimental results show that compared to the state-of-the-art solution Orca, FastServe improves the average and tail JCT by up to 5.1× and 6.4×, respectively.","tags":[],"title":"Fast Distributed Inference Serving for Large Language Models","type":"publication"},{"authors":["Bingyang Wu","Zili Zhang","Zhihao Bai","Xuanzhe Liu","Xin Jin"],"categories":null,"content":"","date":1681689600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1681689600,"objectID":"b96457186b7d4b3bcb764a378524edb5","permalink":"https://BingyangWu.github.io/publication/tgs/","publishdate":"2017-01-01T00:00:00Z","relpermalink":"/publication/tgs/","section":"publication","summary":"Containers are widely used for resource management in datacenters. A common practice to support deep learning (DL) training in container clouds is to statically bind GPUs to containers in entirety. Due to the diverse resource demands of DL jobs in production, a significant number of GPUs are underutilized. As a result, GPU clusters have low GPU utilization, which leads to a long job completion time because of queueing.\nWe present TGS (Transparent GPU Sharing), a system that provides transparent GPU sharing to DL training in container clouds. In stark contrast to recent application-layer solutions for GPU sharing, TGS operates at the OS layer beneath containers. 
Transparency allows users to use any software to develop models and run jobs in their containers. TGS leverages adaptive rate control and transparent unified memory to simultaneously achieve high GPU utilization and performance isolation. It ensures that production jobs are not greatly affected by opportunistic jobs on shared GPUs. We have built TGS and integrated it with Docker and Kubernetes. Experiments show that (i) TGS has little impact on the throughput of production jobs; (ii) TGS provides similar throughput for opportunistic jobs as the state-of-the-art application-layer solution AntMan, and improves their throughput by up to 15× compared to the existing OS-layer solution MPS.","tags":[],"title":"Transparent GPU Sharing in Container Clouds for Deep Learning Workloads","type":"publication"},{"authors":["Size Zheng","Renze Chen","Anjiang Wei","Yicheng Jin","Qin Han","Liqiang Lu","Bingyang Wu","Xiuhong Li","Shengen Yan","Yun Liang"],"categories":null,"content":"","date":1655510400,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1655510400,"objectID":"b4844d9a4c59a1d273cd2671f15bacf5","permalink":"https://BingyangWu.github.io/publication/amos/","publishdate":"2017-01-01T00:00:00Z","relpermalink":"/publication/amos/","section":"publication","summary":"Hardware specialization is a promising trend to sustain performance growth. Spatial hardware accelerators that employ specialized and hierarchical computation and memory resources have recently shown high performance gains for tensor applications such as deep learning, scientific computing, and data mining. To harness the power of these hardware accelerators, programmers have to use specialized instructions with certain hardware constraints. However, these hardware accelerators and instructions are quite new and there is a lack of understanding of the hardware abstraction, performance optimization space, and automatic methodologies to explore the space. Existing compilers use hand-tuned computation implementations and optimization templates, resulting in sub-optimal performance and heavy development costs.In this paper, we propose AMOS, which is an automatic compilation framework for spatial hardware accelerators. Central to this framework is the hardware abstraction that not only clearly specifies the behavior of spatial hardware instructions, but also formally defines the mapping problem from software to hardware. Based on the abstraction, we develop algorithms and performance models to explore various mappings automatically. Finally, we build a compilation framework that uses the hardware abstraction as compiler intermediate representation (IR), explores both compute mappings and memory mappings, and generates high-performance code for different hardware backends. Our experiments show that AMOS achieves more than 2.50× speedup to hand-optimized libraries on Tensor Core, 1.37× speedup to TVM on vector units of Intel CPU for AVX-512, and up to 25.04× speedup to AutoTVM on dot units of Mali GPU. 
The source code of AMOS is publicly available.","tags":[],"title":"AMOS: Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction","type":"publication"},{"authors":["Size Zheng","Renze Chen","Yicheng Jin","Anjiang Wei","Bingyang Wu","Xiuhong Li","Shengen Yan","Yun Liang"],"categories":null,"content":"","date":1640649600,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1640649600,"objectID":"8bd0f2791f2e9481ed831f01e99a187e","permalink":"https://BingyangWu.github.io/publication/neoflow/","publishdate":"2017-01-01T00:00:00Z","relpermalink":"/publication/neoflow/","section":"publication","summary":"Deep neural networks (DNNs) are increasingly deployed in various image recognition and natural language processing applications. The continuous demand for accuracy and high performance has led to innovations in DNN design and a proliferation of new operators. However, existing DNN training frameworks such as PyTorch and TensorFlow only support a limited range of operators and rely on hand-optimized libraries to provide efficient implementations for these operators. To evaluate novel neural networks with new operators, the programmers have to either replace the holistic new operators with existing operators or provide low-level implementations manually. Therefore, a critical requirement for DNN training frameworks is to provide high-performance implementations for the neural networks containing new operators automatically in the absence of efficient library support. In this article, we introduce NeoFlow, which is a flexible framework for enabling efficient compilation for high-performance DNN training. NeoFlow allows the programmers to directly write customized expressions as new operators to be mapped to graph representation and low-level implementations automatically, providing both high programming productivity and high performance. First, NeoFlow provides expression-based automatic differentiation to support customized model definitions with new operators. Then, NeoFlow proposes an efficient compilation system that partitions the neural network graph into subgraphs, explores optimized schedules, and generates high-performance libraries for subgraphs automatically. Finally, NeoFlow develops an efficient runtime system to combine the compilation and training as a whole by overlapping their execution. In the experiments, we examine the numerical accuracy and performance of NeoFlow. The results show that NeoFlow can achieve similar or even better performance at the operator and whole graph level for DNNs compared to deep learning frameworks. Especially, for novel networks training, the geometric mean speedups of NeoFlow to PyTorch, TensorFlow, and CuDNN are 3.16X, 2.43X, and 1.92X, respectively.","tags":[],"title":"NeoFlow: A Flexible Framework for Enabling Efficient Compilation for High Performance DNN Training","type":"publication"},{"authors":null,"categories":null,"content":"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Duis posuere tellus ac convallis placerat. Proin tincidunt magna sed ex sollicitudin condimentum. Sed ac faucibus dolor, scelerisque sollicitudin nisi. Cras purus urna, suscipit quis sapien eu, pulvinar tempor diam. Quisque risus orci, mollis id ante sit amet, gravida egestas nisl. Sed ac tempus magna. Proin in dui enim. Donec condimentum, sem id dapibus fringilla, tellus enim condimentum arcu, nec volutpat est felis vel metus. Vestibulum sit amet erat at nulla eleifend gravida.\nNullam vel molestie justo. 
Curabitur vitae efficitur leo. In hac habitasse platea dictumst. Sed pulvinar mauris dui, eget varius purus congue ac. Nulla euismod, lorem vel elementum dapibus, nunc justo porta mi, sed tempus est est vel tellus. Nam et enim eleifend, laoreet sem sit amet, elementum sem. Morbi ut leo congue, maximus velit ut, finibus arcu. In et libero cursus, rutrum risus non, molestie leo. Nullam congue quam et volutpat malesuada. Sed risus tortor, pulvinar et dictum nec, sodales non mi. Phasellus lacinia commodo laoreet. Nam mollis, erat in feugiat consectetur, purus eros egestas tellus, in auctor urna odio at nibh. Mauris imperdiet nisi ac magna convallis, at rhoncus ligula cursus.\nCras aliquam rhoncus ipsum, in hendrerit nunc mattis vitae. Duis vitae efficitur metus, ac tempus leo. Cras nec fringilla lacus. Quisque sit amet risus at ipsum pharetra commodo. Sed aliquam mauris at consequat eleifend. Praesent porta, augue sed viverra bibendum, neque ante euismod ante, in vehicula justo lorem ac eros. Suspendisse augue libero, venenatis eget tincidunt ut, malesuada at lorem. Donec vitae bibendum arcu. Aenean maximus nulla non pretium iaculis. Quisque imperdiet, nulla in pulvinar aliquet, velit quam ultrices quam, sit amet fringilla leo sem vel nunc. Mauris in lacinia lacus.\nSuspendisse a tincidunt lacus. Curabitur at urna sagittis, dictum ante sit amet, euismod magna. Sed rutrum massa id tortor commodo, vitae elementum turpis tempus. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean purus turpis, venenatis a ullamcorper nec, tincidunt et massa. Integer posuere quam rutrum arcu vehicula imperdiet. Mauris ullamcorper quam vitae purus congue, quis euismod magna eleifend. Vestibulum semper vel augue eget tincidunt. Fusce eget justo sodales, dapibus odio eu, ultrices lorem. Duis condimentum lorem id eros commodo, in facilisis mauris scelerisque. Morbi sed auctor leo. Nullam volutpat a lacus quis pharetra. Nulla congue rutrum magna a ornare.\nAliquam in turpis accumsan, malesuada nibh ut, hendrerit justo. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Quisque sed erat nec justo posuere suscipit. Donec ut efficitur arcu, in malesuada neque. Nunc dignissim nisl massa, id vulputate nunc pretium nec. Quisque eget urna in risus suscipit ultricies. Pellentesque odio odio, tincidunt in eleifend sed, posuere a diam. Nam gravida nisl convallis semper elementum. Morbi vitae felis faucibus, vulputate orci placerat, aliquet nisi. Aliquam erat volutpat. Maecenas sagittis pulvinar purus, sed porta quam laoreet at.\n","date":1461715200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1461715200,"objectID":"e8f8d235e8e7f2efd912bfe865363fc3","permalink":"https://BingyangWu.github.io/project/example/","publishdate":"2016-04-27T00:00:00Z","relpermalink":"/project/example/","section":"project","summary":"An example of using the in-built project page.","tags":["Deep Learning"],"title":"Example Project","type":"project"},{"authors":null,"categories":null,"content":"","date":1461715200,"expirydate":-62135596800,"kind":"page","lang":"en","lastmod":1461715200,"objectID":"d1311ddf745551c9e117aa4bb7e28516","permalink":"https://BingyangWu.github.io/project/external-project/","publishdate":"2016-04-27T00:00:00Z","relpermalink":"/project/external-project/","section":"project","summary":"An example of linking directly to an external project website using `external_link`.","tags":["Demo"],"title":"External Project","type":"project"}]
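The index above is a flat JSON array whose records share a small schema (`section`, `title`, `summary`, `authors`, a Unix `date`, `relpermalink`, and so on). As a rough illustration only, here is a minimal Python sketch of how a client might load a local copy of this file and list the publication entries newest-first; the filename `index.json` and the field names are taken from the file shown above, while the local copy and the print format are assumptions for the example, not part of the site itself.

```python
import json
from datetime import datetime, timezone

# Load a local copy of the search index shown above (assumed to be saved as index.json).
with open("index.json", encoding="utf-8") as f:
    records = json.load(f)

# Keep only the publication entries and sort them by their Unix "date" field, newest first.
publications = sorted(
    (r for r in records if r.get("section") == "publication"),
    key=lambda r: r["date"],
    reverse=True,
)

for pub in publications:
    year = datetime.fromtimestamp(pub["date"], tz=timezone.utc).year
    authors = ", ".join(pub.get("authors") or [])
    print(f"{year}  {pub['title']}")
    print(f"      {authors}")
    print(f"      https://BingyangWu.github.io{pub['relpermalink']}")
```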