We built a custom technology stack to run fast large language models on Cloudflare’s infrastructure. This post explores the engineering trade-offs and technical optimizations required to make high-performance inference possible.
AI Summary
To efficiently run extra-large language models like Kimi K2.5 on Workers AI, Cloudflare uses a hybrid approach that disaggregates prefill (input token processing) and decode (output token generation) into separate inference servers, with load balancing and token-aware routing to optimize performance and latency. Cloudflare also optimizes for prompt caching via the x-session-affinity header, incentivizes its use with discounted cached tokens, and leverages Mooncake Transfer Engine and Mooncake Store for efficient KV-cache sharing across multiple GPUs. These optimizations produced significant improvements in tail latency and inter-token latency, amounting to a 3x performance improvement.
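To illustrate the session-affinity mechanism, here is a minimal sketch of a client that pins a conversation to a cache-warm server by sending the same x-session-affinity value on every request. The endpoint URL, account ID, and model path below are placeholders (assumptions for illustration); only the x-session-affinity header itself comes from the article.

```python
import json
import urllib.request

# Placeholder endpoint: substitute your real Cloudflare account ID and model path.
API_URL = ("https://api.cloudflare.com/client/v4/accounts/"
           "ACCOUNT_ID/ai/run/@cf/moonshotai/kimi-k2.5")

def build_headers(api_token: str, session_id: str) -> dict:
    """Build request headers, including the session-affinity header so that
    repeated requests from the same conversation are routed to a server
    that already holds the cached KV entries for the shared prompt prefix."""
    return {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json",
        # Keep this value constant for the whole conversation to hit the cache.
        "x-session-affinity": session_id,
    }

def chat(api_token: str, session_id: str, messages: list) -> dict:
    """POST a chat completion request; reusing session_id across turns
    lets subsequent turns benefit from discounted cached tokens."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps({"messages": messages}).encode(),
        headers=build_headers(api_token, session_id),
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

In practice the session ID can be any stable string per conversation (e.g. a UUID generated when the chat starts); the point is that it stays the same across turns so the router can send the request back to the GPU holding the warm KV-cache.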