A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering

Short summary

Researchers introduce Sem-ECE, a framework for evaluating LLM calibration in open-ended QA by sampling multiple answers per question and grouping them by meaning. Its two estimators are proven asymptotically unbiased; they diverge on hard questions, so the gap between them serves as an indicator of question difficulty. Evaluated on three benchmarks with five commercial LLMs, Sem-ECE outperforms verbalized confidence and existing sampling-based methods.

• Proposes the Sem-ECE framework to evaluate whether LLM confidence aligns with actual accuracy in open-ended QA
• Provides two estimators (Sem₁-ECE and Sem₂-ECE) with theoretical guarantees; the gap between them indicates question difficulty
• Outperforms verbalized confidence and existing sampling methods on three benchmarks across five commercial LLMs
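
To make the idea of sampling-based calibration concrete, here is a minimal Python sketch. It is not the paper's Sem₁-ECE or Sem₂-ECE estimator, whose definitions this summary does not give. It assumes, purely for illustration, that confidence is the share of sampled answers falling in the largest group and that calibration is scored with the standard binned expected calibration error. The helper names (normalize, sampled_confidence, expected_calibration_error) are hypothetical, and the string normalization is only a crude stand-in for real semantic grouping.

```python
from collections import Counter


def normalize(answer: str) -> str:
    # Hypothetical stand-in for semantic grouping: answers that match after
    # lowercasing and trimming punctuation/whitespace share one group.
    return " ".join(answer.lower().strip(" .!?").split())


def sampled_confidence(samples):
    """Return (key of the largest answer group, share of samples in it)."""
    groups = Counter(normalize(s) for s in samples)
    key, count = groups.most_common(1)[0]
    return key, count / len(samples)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins [k/n_bins, (k+1)/n_bins); 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - mean_conf)
    return ece


# Toy data (invented for illustration): sampled answers plus a reference key.
questions = [
    (["Paris", "paris", "Paris.", "Lyon"], "paris"),
    (["1912", "1912", "1911", "1913"], "1912"),
    (["Mars", "Venus", "Mars", "Mars"], "venus"),
]

confidences, correct = [], []
for samples, reference in questions:
    predicted, conf = sampled_confidence(samples)
    confidences.append(conf)
    correct.append(predicted == reference)

print(expected_calibration_error(confidences, correct))
```

A lower value means the sampled confidence tracks accuracy more closely. In practice the grouping step would need a real semantic-equivalence check, since exact string matching misses paraphrases.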

Read full article at arXiv cs.CL
