Enabling Efficient Batch Serving for LMaaS Via Generation Length Prediction

IEEE International Conference on Web Services (2024)

Abstract
Nowadays, large language models (LLMs) are published as a service and can be accessed by various applications via APIs, also known as language-model-as-a-service (LMaaS). Without knowing the generation length of requests, existing serving systems serve requests in a first-come, first-served (FCFS) manner with a fixed batch size, which leads to two problems that affect batch serving efficiency. First, the generation lengths of requests in a batch vary, and requests with short generation lengths must wait for requests with long generation lengths to finish during the batch serving procedure. Second, requests with longer generation lengths consume more memory during serving. Without knowing the generation lengths of batched requests, the batch size is always set small to avoid the out-of-memory (OOM) error, thus preventing the GPU from being fully utilized. In this paper, we find that a significant number of popular applications in the LMaaS scenario have a positive correlation between the generation length and the length of raw user input. Based on this observation, we propose Magnus, which can accurately predict the request generation length with the user input length, application-level, and user-level semantic features. Accordingly, Magnus can achieve high request throughput by batching requests of similar generation lengths together with adaptive batch sizes. Besides, Magnus can also schedule batches with the highest response ratio next (HRRN) policy to reduce request response time. Experiments conducted on our testbed show that Magnus improves request throughput by up to 234% and reduces response time by up to 89.7% compared to baselines.
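
Below is a minimal, illustrative sketch of the two mechanisms the abstract describes: grouping requests with similar predicted generation lengths into batches whose size adapts to a memory budget, and choosing the next batch to serve with the highest response ratio next (HRRN) rule. The class names, the token-based memory budget, and the bucketing scheme are assumptions made for illustration, not the paper's implementation.

```python
import time
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    prompt: str
    arrival_time: float
    predicted_len: int  # predicted number of output tokens (assumed predictor output)

@dataclass
class Batch:
    requests: List[Request]
    arrival_time: float
    predicted_len: int  # longest predicted generation length in the batch

def form_batches(requests: List[Request],
                 bucket_width: int = 64,
                 memory_budget_tokens: int = 32768) -> List[Batch]:
    """Bucket requests by predicted length; batch size shrinks as length grows."""
    buckets = {}
    for r in sorted(requests, key=lambda r: r.predicted_len):
        buckets.setdefault(r.predicted_len // bucket_width, []).append(r)
    batches = []
    for key, reqs in buckets.items():
        max_len = (key + 1) * bucket_width
        batch_size = max(1, memory_budget_tokens // max_len)  # adaptive batch size
        for i in range(0, len(reqs), batch_size):
            chunk = reqs[i:i + batch_size]
            batches.append(Batch(chunk,
                                 min(r.arrival_time for r in chunk),
                                 max(r.predicted_len for r in chunk)))
    return batches

def response_ratio(batch: Batch, now: float) -> float:
    """HRRN priority: (waiting time + expected service time) / expected service time."""
    waiting = now - batch.arrival_time
    service = max(batch.predicted_len, 1)  # use predicted length as service-time proxy
    return (waiting + service) / service

def pick_next_batch(queue: List[Batch]) -> Batch:
    """Serve the queued batch with the highest response ratio next."""
    now = time.time()
    best = max(queue, key=lambda b: response_ratio(b, now))
    queue.remove(best)
    return best
```

Under this HRRN scoring, short batches are favored when waiting times are similar, while long-waiting batches eventually dominate the ratio, which avoids starvation.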
Keywords
Language Model as a Service, Transformer Inference, Generation Length Prediction, Quality of Service, Highest Response Ratio Next Scheduling