Amazon Bedrock Evaluation Jobs measure how well your Bedrock-powered application performs by using a separate evaluator model (the "judge") to score prompt-response pairs against a set of metrics. The judge reads each pair with metric-specific instructions and produces a numeric score plus written reasoning.
| Mode | How it works | Use when |
| --- | --- | --- |
| Live Inference | Bedrock generates responses during the eval job | Simple prompt-in/text-out, no tool calling |
| Pre-computed Inference | You pre-collect responses and supply them in a JSONL dataset | Tool calling, multi-turn conversations, custom orchestration, models outside Bedrock |
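A pre-computed dataset is a JSONL file with one record per prompt-response pair. Here is a minimal sketch of building one in Python; the record shape (`prompt`, `referenceResponse`, and a `modelResponses` list with `response` and `modelIdentifier`) follows the bring-your-own-inference format, but treat the exact key names as an assumption to verify against the current Bedrock dataset schema:

```python
import json

def make_record(prompt, response, model_id, reference=None):
    """Build one JSONL record for a pre-computed inference dataset.
    Key names are assumed from the bring-your-own-inference format;
    verify them against the current Bedrock schema before use."""
    record = {
        "prompt": prompt,
        "modelResponses": [
            {"response": response, "modelIdentifier": model_id}
        ],
    }
    if reference is not None:
        record["referenceResponse"] = reference
    return record

records = [
    make_record(
        prompt="What is the capital of France?",
        response="The capital of France is Paris.",
        model_id="my-app-v1",  # free-form label identifying your system
        reference="Paris",
    ),
]

# One JSON object per line -- the JSONL layout the eval job expects.
with open("eval_dataset.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")
```

The `modelIdentifier` is the label you later reference from the evaluation job's inference config so Bedrock knows to score these responses instead of generating new ones.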
This skill builds and runs LLM-as-judge evaluation pipelines using Amazon Bedrock Evaluation Jobs with pre-computed inference datasets. Use it when setting up automated model evaluation, designing test scenarios, collecting pre-computed responses, configuring custom metrics, creating AWS infrastructure, running evaluation jobs, parsing results, or iterating on findings. Source: antstackio/skills.
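Submitting the job itself is a single `create_evaluation_job` call on the boto3 `bedrock` client. The sketch below assembles a request for a pre-computed (bring-your-own-inference) LLM-as-judge evaluation; the nested field names (`automated`, `datasetMetricConfigs`, `precomputedInferenceSource`, etc.) are recalled from the boto3 API and should be checked against the current SDK reference, and the role ARN, bucket, judge model ID, and metric names are placeholders:

```python
# Placeholders -- substitute your own resources.
ROLE_ARN = "arn:aws:iam::123456789012:role/BedrockEvalRole"
BUCKET = "s3://my-eval-bucket"
JUDGE_MODEL = "anthropic.claude-3-5-sonnet-20240620-v1:0"

def build_request():
    """Assemble a create_evaluation_job request for a pre-computed
    LLM-as-judge evaluation. Field names are an assumption; verify
    against the current boto3 Bedrock API reference."""
    return {
        "jobName": "my-precomputed-eval",
        "roleArn": ROLE_ARN,
        "evaluationConfig": {
            "automated": {
                "datasetMetricConfigs": [
                    {
                        "taskType": "General",
                        "dataset": {
                            "name": "precomputed",
                            "datasetLocation": {
                                "s3Uri": f"{BUCKET}/eval_dataset.jsonl"
                            },
                        },
                        # Built-in metric names are illustrative placeholders.
                        "metricNames": ["Builtin.Correctness", "Builtin.Helpfulness"],
                    }
                ],
                # The judge model that scores each prompt-response pair.
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{"modelIdentifier": JUDGE_MODEL}]
                },
            }
        },
        # Point at the dataset's model identifier instead of asking
        # Bedrock to run live inference.
        "inferenceConfig": {
            "models": [
                {
                    "precomputedInferenceSource": {
                        "inferenceSourceIdentifier": "my-app-v1"
                    }
                }
            ]
        },
        "outputDataConfig": {"s3Uri": f"{BUCKET}/results/"},
    }

if __name__ == "__main__":
    # boto3 is only needed when actually submitting the job.
    import boto3

    bedrock = boto3.client("bedrock")
    job = bedrock.create_evaluation_job(**build_request())
    print(job["jobArn"])
```

Keeping the request assembly in a pure function makes it easy to inspect or unit-test the configuration before any AWS call is made.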