Why did o3-mini-high jump from 0.8% to 4.8% on Vectara’s benchmark, and what does it mean for document-length evaluations?
https://mag-wiki.win/index.php/Why_Different_AI_Benchmarks_Report_Different_Hallucination_Rates
Which specific questions about o3-mini-high, Vectara benchmark versions, and document length will I answer, and why do they matter?
Here is a quick list of the questions I’ll answer and why each matters to engineers, evaluation teams, and procurement.