Spotify · 15d ago
Better Experiments with LLM Evals — A funnel, not a fork
Spotify's engineers use large language model (LLM) evals to assess the quality and relevance of content at scale, which improves the hit rate of experiments and creates a feedback loop that refines both evals and experiments over time. LLM evals belong in an evaluation funnel rather than serving as a replacement for experiments: they first verify that an intended content change does what it is supposed to, before an experiment tests whether the change drives the desired business outcomes. Traditional metrics focus on verification, while evals also generate hypotheses for improvement. Evals have limitations, however: they do not account for secondary metrics such as session length or crash rates, and offline evaluations must be calibrated against online outcomes.