Running LLMs across Cloudflare’s network requires us to be smarter and more efficient about GPU memory bandwidth. That’s why we developed Unweight, a lossless inference-time compression system that reduces LLM weight size by 15-22% with no loss of precision.
AI Summary
Cloudflare developed Unweight, a lossless compression system for Large Language Model (LLM) weights that reduces their size by 15-22%. It compresses only the exponent byte of each BF16 weight using Huffman coding, exploiting the predictability of exponent distributions in LLMs. Compressed weights are decompressed in fast on-chip shared memory and fed directly to the tensor cores. Unweight stays efficient across different batch sizes and weight shapes by selecting from multiple execution strategies, prioritizing either simplicity or minimal memory traffic. This yields roughly 3 GB of VRAM savings on NVIDIA H100 GPUs, letting Cloudflare fit more models on a single GPU and run more models in more places, making inference cheaper and faster.
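To make the core idea concrete, here is a minimal sketch of why compressing only the exponent byte pays off. It is not Unweight's implementation: it simulates BF16 weights with NumPy (taking the top 16 bits of a float32, whose high byte holds the sign bit and upper exponent bits), then measures how far a Huffman code shrinks that high byte. The weight distribution and helper names are assumptions for illustration.

```python
import heapq
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
# Simulated weight tensor; trained LLM weights are likewise
# concentrated around zero, so exponents are highly skewed.
w = rng.normal(0.0, 0.02, size=1_000_000).astype(np.float32)

# BF16 is the top 16 bits of a float32. The high byte carries the
# sign bit and upper exponent bits (the "exponent byte" here); the
# low byte carries the mantissa, which is near-uniform.
bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)
exp_bytes = (bf16 >> 8).astype(np.uint8)

def huffman_lengths(freqs):
    """Return the Huffman code length (in bits) for each symbol."""
    heap = [(f, i, (s,)) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = Counter()
    uid = len(heap)  # tie-breaker so tuples stay comparable
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # every merge deepens these leaves by 1
        heapq.heappush(heap, (f1 + f2, uid, s1 + s2))
        uid += 1
    return lengths

freqs = Counter(exp_bytes.tolist())
lengths = huffman_lengths(freqs)
comp_bits = sum(freqs[s] * lengths[s] for s in freqs)
orig_bits = 8 * len(exp_bytes)
print(f"exponent byte compressed to {comp_bits / orig_bits:.2%} of original")
```

Because the mantissa byte is essentially incompressible, only the exponent byte is entropy-coded; the overall weight-size reduction is therefore bounded by half the tensor, which is consistent with the 15-22% savings reported above.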