arXiv CS.AI
5/11/2026
When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment

Short summary

The researchers measure when a language model's answer preference stabilizes before the answer is verbalized, using a finite-answer projection on binary tasks. In tests on Qwen3-4B-Instruct, answer preferences stabilize 17 to 31 tokens before the answer appears in the output. The signal tracks model behavior rather than correctness, a result that informs how inference-time reasoning is understood.
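As a rough illustration of what projecting onto a finite answer set can mean, the sketch below renormalizes next-token logits over two answer tokens. It is a minimal sketch under stated assumptions, not the paper's procedure: the function name `project_onto_answers`, the single-token answers, and the token ids are all hypothetical, and the paper may project multi-token continuations instead.

```python
import math

def project_onto_answers(logits, answer_token_ids):
    """Renormalize next-token logits over a finite answer set.

    logits: sequence of raw logits indexed by token id (full vocabulary).
    answer_token_ids: dict mapping an answer label to its token id.
        Assumption: each answer is a single token.
    Returns a dict mapping each label to its probability within the set.
    """
    labels = list(answer_token_ids)
    scores = [logits[answer_token_ids[label]] for label in labels]
    # Subtract the max for numerical stability before exponentiating.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {label: e / total for label, e in zip(labels, exps)}

# Hypothetical logits for a 4-token vocabulary; ids 1 and 2 stand in
# for the "yes" and "no" tokens.
prefs = project_onto_answers([0.0, 2.0, 1.0, -1.0], {"yes": 1, "no": 2})
```

Restricting the softmax to the answer logits gives the same result as taking full-vocabulary probabilities and dividing by their sum over the answer set, so probability mass on unrelated tokens does not affect the projected preference.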

  • Novel measurement framework: finite-answer preference stabilization projects the model's continuation probabilities onto a finite answer set
  • Empirical finding: in controlled tasks, answer preferences stabilize 17 to 31 tokens before verbalization
  • The signal correlates with model behavior, is linearly recoverable from hidden states, and partially transfers across contexts
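One way to picture a "tokens before verbalization" measurement is sketched below. It assumes a per-step projected probability of one answer in a binary task and walks backward from the step where the answer is emitted. The function name `stabilization_lead`, the `margin` threshold, and the stability rule itself are illustrative assumptions; the summary does not state the paper's exact criterion.

```python
def stabilization_lead(p_yes, verbalized_at, margin=0.1):
    """Count how many tokens before verbalization the preference was stable.

    p_yes: per-step projected probability of "yes" (binary task).
    verbalized_at: index of the step at which the answer is emitted.
    margin: assumed minimum distance from 0.5 to count as a preference.
    """
    final_is_yes = p_yes[verbalized_at] > 0.5
    stable_from = verbalized_at
    # Walk backward while the preference agrees with the final answer
    # and stays at least `margin` away from indifference.
    for t in range(verbalized_at, -1, -1):
        prefers_yes = p_yes[t] > 0.5
        confident = abs(p_yes[t] - 0.5) >= margin
        if prefers_yes != final_is_yes or not confident:
            break
        stable_from = t
    return verbalized_at - stable_from

# Made-up trajectory for illustration only.
lead = stabilization_lead([0.5, 0.45, 0.7, 0.8, 0.9, 0.95], verbalized_at=5)
```

With this rule, a larger `margin` yields a shorter lead, so any reported lead depends on the threshold chosen.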

Generated with AI, which can make mistakes.
