Key points
- Workflow (6 steps): draft prompt → build dataset → generate prompt variations → get responses → grade → iterate.
- Datasets: an array of JSON objects, each with a `task` property; hand-written or generated by a fast model (Haiku).
- Graders: code (programmatic, e.g. `validate_json`/AST/regex → 10 or 0), model (an extra LLM call: ask for strengths/weaknesses and a 1–10 score, return JSON, average), human (most flexible, slowest).
- Combine: final score = (model_score + syntax_score) / 2.
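
The code-grader and combine steps above can be sketched as follows. This is a minimal illustration, not the course's implementation: the names `validate_python`, `validate_regex`, `final_score`, and `average_score` are assumptions, as are the sample dataset entries and the placeholder model score (a real run would get that score from the extra LLM call).

```python
import ast
import json
import re
from statistics import mean


def validate_json(text: str) -> int:
    """Code grader: 10 if the text parses as JSON, else 0."""
    try:
        json.loads(text)
        return 10
    except json.JSONDecodeError:
        return 0


def validate_python(text: str) -> int:
    """Code grader (hypothetical name): 10 if the text parses as Python, else 0."""
    try:
        ast.parse(text)
        return 10
    except (SyntaxError, ValueError):
        return 0


def validate_regex(text: str) -> int:
    """Code grader (hypothetical name): 10 if the text compiles as a regex, else 0."""
    try:
        re.compile(text)
        return 10
    except re.error:
        return 0


def final_score(model_score: float, syntax_score: float) -> float:
    """Combine the two graders as (model_score + syntax_score) / 2."""
    return (model_score + syntax_score) / 2


def average_score(scores: list[float]) -> float:
    """Average the per-task final scores across the dataset."""
    return mean(scores)


# Hypothetical dataset: an array of objects, each with a `task` property.
dataset = [
    {"task": "Write a JSON object describing a user"},
    {"task": "Write a regex that matches one or more digits"},
]

# Hypothetical responses; in practice these come from the prompt under test.
responses = ['{"name": "Ada"}', r"\d+"]
graders = [validate_json, validate_regex]

# Placeholder model score (assumption): stands in for the LLM grader's 1-10 score.
placeholder_model_score = 8

per_task = [
    final_score(placeholder_model_score, grade(response))
    for grade, response in zip(graders, responses)
]
overall = average_score(per_task)
```

Because each code grader returns only 10 or 0, averaging it with the model score means a response that fails the syntax check can never score above 5.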
Related
Sources
- 2026-06-28-claude-course
Compiled from
wiki/study/claude-course/Prompt-Evaluation.md · git is the source of truth