arXiv cs.CL
5/12/2026

A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering
Short summary
Researchers introduce Sem-ECE, a framework for evaluating LLM calibration in open-ended QA by sampling answers and grouping them semantically. Two estimators, Sem₁-ECE and Sem₂-ECE, are proven asymptotically unbiased; they diverge on hard questions, so the gap between them indicates question difficulty. Tested on three benchmarks across five commercial LLMs, Sem-ECE outperforms verbalized confidence and existing sampling methods.
- Proposes Sem-ECE framework to evaluate whether LLM confidence aligns with actual accuracy in open-ended QA
- Two estimators (Sem₁-ECE and Sem₂-ECE) with theoretical guarantees; gap indicates question difficulty
- Outperforms verbalized confidence and sampling methods on commercial LLM benchmarks
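
As a rough illustration of the general idea (sample several answers, group equivalent ones, and compare the resulting confidence with accuracy), the Python sketch below may help. It is an assumption-laden stand-in, not the paper's method: simple string normalization replaces semantic grouping, the calibration metric is the standard binned ECE, and the function names `normalize`, `sampled_confidence`, and `expected_calibration_error` are hypothetical. The definitions of Sem₁-ECE and Sem₂-ECE are not given in this summary and are not reproduced here.

```python
from collections import Counter


def normalize(answer: str) -> str:
    # Assumption: lowercasing and dropping punctuation stands in for
    # semantic grouping, which in practice would need a stronger equivalence check.
    kept = (ch for ch in answer.lower() if ch.isalnum() or ch.isspace())
    return "".join(kept).strip()


def sampled_confidence(samples):
    """Return the largest answer group and the fraction of samples in it."""
    groups = Counter(normalize(s) for s in samples)
    top_answer, count = groups.most_common(1)[0]
    return top_answer, count / len(samples)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Equal-width bins over [0, 1]; confidence 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece
```

For example, `sampled_confidence(["Paris", "paris.", "Lyon", "Paris"])` groups three of the four samples together, so the confidence assigned to that answer is 0.75. A lower ECE over many questions means sampled confidence tracks accuracy more closely.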