prorok9898/ERR-EVAL
Evaluate AI models' ability to detect ambiguity and manage uncertainty with the ERR-EVAL benchmark for reliable epistemic reasoning.
Code Analysis
8 files read · 3 rounds
A sophisticated benchmarking framework that uses a 'Judge' LLM to evaluate candidate models on their ability to handle epistemic uncertainty and avoid hallucinations in adversarial scenarios.
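The pattern can be illustrated with a minimal sketch of a judge-based evaluation loop. The function names, the Verdict fields, and the grading prompt below are hypothetical and are not taken from the repository; they only show the shape of asking a candidate model to answer and a second 'Judge' model to grade the answer for epistemic honesty.

```python
# Sketch of a judge-based evaluation loop (illustrative only; names and
# rubric fields are hypothetical, not ERR-EVAL's actual API).
from dataclasses import dataclass

@dataclass
class Verdict:
    prompt_id: str
    acknowledged_uncertainty: bool  # did the candidate flag the ambiguity?
    hallucinated: bool              # did it invent unsupported facts?
    score: float                    # judge-assigned score in [0, 1]

def evaluate(prompts, candidate_llm, judge_llm):
    """Run each adversarial prompt past the candidate, then ask the
    judge model to grade the response against the rubric."""
    verdicts = []
    for p in prompts:
        answer = candidate_llm(p["text"])  # candidate's free-form reply
        grading_prompt = (
            "Grade the following answer for epistemic honesty.\n"
            f"Question: {p['text']}\nAnswer: {answer}\n"
            "Return JSON with fields: acknowledged_uncertainty, "
            "hallucinated, score."
        )
        # judge_llm is assumed here to return the parsed JSON as a dict
        graded = judge_llm(grading_prompt)
        verdicts.append(Verdict(prompt_id=p["id"], **graded))
    return verdicts
```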
Strengths
High-quality data design with psychologically nuanced prompts; robust API client with exponential backoff and structured output parsing; clear separation of concerns between runner, scorer, and CLI; rigorous scoring rubric that penalizes 'helpful but wrong' responses.
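Exponential backoff around external API calls is a standard reliability pattern; the sketch below shows one plausible shape for such a wrapper. The function name and retry parameters are illustrative assumptions, not the project's actual client code.

```python
# Sketch of an API call wrapper with exponential backoff and jitter
# (names and retry parameters are illustrative, not the project's code).
import random
import time

def call_with_backoff(request_fn, max_retries=5, base_delay=1.0):
    """Retry a flaky API call, doubling the wait (plus jitter) each attempt."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except Exception:  # e.g. rate limit or timeout from the LLM API
            if attempt == max_retries - 1:
                raise  # out of retries; surface the original error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```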
Weaknesses
No automated tests found in the codebase; relies entirely on external LLM APIs, which introduces cost and latency variability; reports only aggregate scores, with no diagnostics of individual failure modes.
Score Breakdown
Signals: Innovation, Craft, Traction, Scope
Evidence
Commits: 30
Contributors: 3
Files: 104
Active weeks: 4
Repository
Language: Python
Stars: 1
Forks: 0
License: MIT