Aditya Challapally leads post-training research and infrastructure for Copilot agent capabilities that process millions of multimodal interactions. This post builds on the diagnostics from Diagnosing
AI Summary
Here's a 3-sentence summary of the blog post: To improve post-training outcomes for large-scale multimodal agents at Microsoft, the authors developed five engineering and algorithmic interventions that address silent failures, policy gradient estimator degradation, and aggregate reward climbing while gradient signal concentrates. These interventions include a staged objective curriculum, adaptive curriculum from estimator health, variance-corrected normalization from estimator structure, and two methods to counteract degenerate behaviors such as a KL penalty for entropy regularization and a reservoir buffer for restoring outcome contrast. By implementing these interventions, the authors successfully improved the performance and robustness of their multimodal agents, reducing silent failures and improving overall results at scale.