We used DSPy to turn prompt engineering for our relevance judge into a measurable, automated optimization loop, improving the judge's task performance, cost, and production reliability.
AI Summary
Dropbox's engineering team optimized Dash's relevance judge using DSPy, a framework for systematically optimizing prompts against a measurable objective. To adapt their existing judge for a lower-cost model, they defined a clear objective (minimizing disagreement with human relevance judgments while ensuring usable outputs) and used DSPy's GEPA optimizer to generate structured feedback for each example where the model disagreed with humans. The result was a repeatable optimization loop and a more reliable, cheaper judge for production use.
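To make the objective concrete, here is a minimal sketch of what a per-example metric with feedback might look like. It is plain Python rather than DSPy's actual API, and the 1-to-5 rating scale, the `MetricResult` type, and the `judge_metric` name are all assumptions made for illustration, not details from the team's implementation.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    score: float   # 1.0 = full agreement, 0.0 = unusable or maximally wrong
    feedback: str  # text an optimizer could use when revising the prompt


def judge_metric(human_rating: int, model_output: str, max_rating: int = 5) -> MetricResult:
    """Score one example (hypothetical 1..max_rating scale).

    Unusable output scores zero; otherwise the score falls linearly
    with the gap between the model's rating and the human rating.
    """
    try:
        model_rating = int(model_output.strip())
    except ValueError:
        return MetricResult(0.0, f"Output {model_output!r} is not an integer rating.")
    if not 1 <= model_rating <= max_rating:
        return MetricResult(0.0, f"Rating {model_rating} is outside 1..{max_rating}.")

    gap = abs(model_rating - human_rating)
    score = 1.0 - gap / (max_rating - 1)
    if gap == 0:
        return MetricResult(score, "Matches the human rating.")
    direction = "too high" if model_rating > human_rating else "too low"
    return MetricResult(
        score,
        f"Model rated {model_rating}, human rated {human_rating}: {direction} by {gap}.",
    )
```

Averaging `score` over a labeled set gives a single number to optimize, while the `feedback` strings play the role of the structured, per-example feedback described above.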