Prompt Eval Framework
Data-Driven Prompt Optimization

A testing framework for prompt iteration. Six prompt versions, three copywriting frameworks, LLM-as-judge scoring, and statistical analysis. 95% platform compliance.
Overview
A rigorous system for generating compliant Meta ads. YAML messaging frameworks, batch generation across prompt versions, LLM-as-judge scoring, and Streamlit validation with stakeholders.
Challenge
Prompt engineering is usually vibes-based. Change a word, eyeball the output, repeat. I wanted data: batch testing, statistical significance, systematic iteration toward measurable goals.
Approach
Created YAML messaging frameworks as single source of truth: pain points, benefits, features, proof points. Every ad traces to strategy.
Built batch generation across 6 prompt versions and 3 copywriting frameworks. 21 test cases produced 100+ ads per iteration.
Implemented two-layer evaluation: deterministic checks (character limits, CTA) plus LLM-as-judge scoring (relevance, brand alignment, persuasion).
Added Streamlit validation for stakeholder ratings. Side-by-side with messaging context, manual scores calibrate the LLM judge.
Outcome
95.2% platform compliance (up from 84.8%). Identified systematic failures: "Get started for free" CTA pushed ads over limit. No statistical difference between frameworks (p=0.96). Discovered 95% ceiling requiring few-shot learning.