Burst-Aware Weighted Fair Queueing for Serverless Inference: Mitigating Noisy Neighbor Effects in Multi-Tenant Systems
Keywords:
serverless, cloud computing, distributed systems, fairness, multi-tenant systems

Abstract
Multi-tenant serverless inference often devolves into noisy-neighbor scenarios in which a single tenant's bursty LLM batch floods the fleet, pushing interactive calls beyond their latency budgets. We propose Burst-Aware Weighted Fair Queueing (BWFQ), a scheduler that requires only two counters per tenant (tokens earned, tokens spent) and a single heap pop to pick the next invocation. BWFQ uses a classic token-bucket shaper: tokens accumulate at a tenant-specific base rate and are deducted on each dispatch. When a tenant exhausts its tokens, its requests are queued, giving quieter tenants a chance to run. Unlike techniques such as Dominant-Resource Fairness, BWFQ requires neither per-invocation resource profiling nor multi-dimensional share accounting, making it easy to integrate into existing Lambda-style dispatchers. In a prototype built on AWS Lambda, BWFQ reduces the P99 latency gap between interactive and batch tenants from 8.5 s to 2.1 s, a 4.0× improvement, while preserving 94% of the throughput achieved by First-Come-First-Served. The algorithm adds only 35 µs of scheduling overhead per decision and fits in approximately 150 lines of Go. These results demonstrate that simple token-bucket fair queueing is a practical, immediately deployable step toward fairness in production serverless inference.
License
Copyright (c) 2025 Journal of Soft Computing and Data Mining

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.









