Why o3-mini-high jumped from 0.8% to 4.8% on Vectara’s benchmark, and what it means for document-length evaluations

https://mag-wiki.win/index.php/Why_Different_AI_Benchmarks_Report_Different_Hallucination_Rates

Which specific questions about o3-mini-high, Vectara benchmark versions, and document length will I answer, and why do they matter?

A quick list of the questions I’ll answer and why each matters to engineers, evaluation teams, and procurement.

Submitted on 2026-03-05 21:29:26