What is llm-as-a-judge?
Build, validate, and deploy LLM-as-Judge evaluators for automated quality assessment of LLM pipeline outputs. Use this skill whenever the user wants to: create an automated evaluator for subjective or nuanced failure modes, write a judge prompt for Pass/Fail assessment, split labeled data for judge development, measure judge alignment (TPR/TNR), estimate true success rates with bias correction, or set up CI evaluation pipelines. Also trigger when the user mentions "judge prompt", "automated eval", "LLM evaluator", "grading prompt", "alignment metrics", "true positive rate", or wants to move from manual trace review to automated evaluation. This skill covers the full lifecycle: prompt design → data splitting → iterative refinement → success rate estimation. Source: maragudk/evals-skills.