Study · Claude Course

Prompt Evaluation

Automated, objective testing of prompts, as opposed to prompt *engineering*, which is writing them. Avoid the "tested once, shipped" trap: run an eval pipeline that produces objective scores, so prompt versions can be A/B compared systematically.

type: concept · status: active · tags: evaluation · testing · prompting

Key points

  • Workflow (6 steps): draft prompt → build dataset → generate prompt variations → get responses → grade → iterate.
  • Datasets: array of JSON objects with a task property; hand-written or generated by a fast model (Haiku).
  • Graders: code (programmatic checks such as validate_json, AST parsing, or regex; score 10 on pass, 0 on fail), model (an extra LLM call: ask for strengths, weaknesses, and a 1–10 score, return JSON, then average the scores), human (most flexible, slowest).
  • Combine: final score = (model_score + syntax_score) / 2, where syntax_score is the code grader's result.
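
The grading and combining steps above can be sketched in Python. This is a minimal illustration, not the course's implementation: only validate_json, model_score, and syntax_score are named in these notes. The helpers validate_python, validate_regex, parse_model_grade, and average_score, the "score" key in the model grader's JSON reply, and the sample rows are all assumptions made for the example. No model is called; the model grader's replies are hard-coded strings.

```python
import ast
import json
import re
from statistics import mean


def validate_json(text: str) -> int:
    """Code grader: 10 if the text parses as JSON, otherwise 0."""
    try:
        json.loads(text.strip())
        return 10
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return 0


def validate_python(text: str) -> int:
    """Code grader (hypothetical name): 10 if the text parses as Python."""
    try:
        ast.parse(text.strip())
        return 10
    except (SyntaxError, ValueError):
        return 0


def validate_regex(text: str) -> int:
    """Code grader (hypothetical name): 10 if the text compiles as a regex."""
    try:
        re.compile(text.strip())
        return 10
    except re.error:
        return 0


def parse_model_grade(raw: str) -> float:
    """Read the 1-10 score from the model grader's JSON reply.

    The reply shape (a top-level "score" key) is an assumption.
    """
    return float(json.loads(raw)["score"])


def final_score(model_score: float, syntax_score: float) -> float:
    """Combine the two graders exactly as in the notes: a plain average."""
    return (model_score + syntax_score) / 2


def average_score(rows: list[dict]) -> float:
    """Average the combined score over every graded test case."""
    return mean(
        final_score(parse_model_grade(row["grade"]), validate_json(row["output"]))
        for row in rows
    )


# Hypothetical graded results: one valid JSON output, one invalid.
sample_rows = [
    {
        "task": "Describe an S3 bucket as JSON",
        "output": '{"name": "logs", "versioning": true}',
        "grade": '{"strengths": ["valid"], "weaknesses": [], "score": 8}',
    },
    {
        "task": "Describe an IAM role as JSON",
        "output": "Sure! Here is the role you asked for",
        "grade": '{"strengths": [], "weaknesses": ["not JSON"], "score": 6}',
    },
]

print(average_score(sample_rows))
```

Scoring a code grader as 10 or 0 keeps it on the same scale as the model grader's 1–10 score, which is what makes the simple average meaningful. The single number from average_score is what gets compared between prompt versions.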

Sources

  • 2026-06-28-claude-course
Compiled from wiki/study/claude-course/Prompt-Evaluation.md · git is the source of truth