Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer
In Pinterest's online ML serving systems, the root-leaf architecture was optimized to reduce network bandwidth usage. Initially, the root sent more features to the leaf than the models actually used, which created a network bottleneck and forced the system to be scaled based on network usage. As a first step, lz4 compression reduced root-leaf network bandwidth usage by 20%, at the cost of higher CPU usage and latency, but it did not solve the underlying problem of shipping unused data. The "Send What You Use" approach was developed to address that problem: it trims unnecessary features before transmission, potentially cutting root-leaf network usage by ~50%. The approach relies on model signatures to determine which features are required, so that only necessary data is transmitted between the root and leaf.
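The trimming idea can be illustrated with a minimal Python sketch. It is not Pinterest's implementation. It assumes a model signature can be reduced to the set of feature names a model reads. The names `ModelSignature`, `required_features`, and `trim_features`, along with the example feature names, are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, Iterable, Set

# Hypothetical stand-in for a model signature: the feature names a model reads.
ModelSignature = Set[str]


def required_features(signatures: Iterable[ModelSignature]) -> Set[str]:
    """Return the union of features needed by all models served on the leaf."""
    required: Set[str] = set()
    for signature in signatures:
        required |= signature
    return required


def trim_features(features: Dict[str, Any], required: Set[str]) -> Dict[str, Any]:
    """Keep only the features that at least one model signature asks for."""
    return {name: value for name, value in features.items() if name in required}


# Example: the root holds four features, but the two models read only three.
signatures = [{"user_id", "pin_embedding"}, {"user_id", "country"}]
all_features = {
    "user_id": 42,
    "pin_embedding": [0.1, 0.2],
    "country": "US",
    "unused_history": list(range(1000)),  # never read by any model
}

# Only this trimmed payload would be sent from the root to the leaf.
payload = trim_features(all_features, required_features(signatures))
```

In this sketch the large `unused_history` feature is dropped before transmission, which is where the bandwidth saving would come from. The actual saving depends on how much of the payload the models leave unused.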