arXiv cs.CL
5/12/2026

Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks
Short summary
Magis-Bench evaluates 23 LLMs on magistrate-level legal reasoning tasks drawn from Brazilian judicial exams. Even the best model scores below 70% (Gemini-3-Pro-Preview: 6.97/10), indicating that judicial reasoning remains challenging for LLMs. The benchmark comprises 74 questions from 2023-2025 exams across multi-turn analysis and sentence composition tasks. Inter-judge agreement is strong (Kendall's W = 0.984), and the benchmark, model outputs, and evaluation code are released.
- New benchmark evaluates 23 SOTA LLMs on magistrate-level legal tasks derived from Brazilian judicial exams
- Best model (Gemini-3-Pro-Preview) achieves only 6.97/10, showing judicial-level legal reasoning remains hard
- Strong inter-judge agreement (Kendall's W = 0.984); full benchmark, outputs, and code released
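
For readers unfamiliar with the agreement statistic, the following is a minimal sketch of how Kendall's W (coefficient of concordance) can be computed from rankings without ties. The function name `kendalls_w` and the toy rankings are hypothetical illustrations, not data or code from the paper, and the summary does not say exactly how the authors computed their value.

```python
def kendalls_w(rankings):
    """Kendall's W for m judges each ranking the same n items (no ties).

    rankings: list of m lists, each a permutation of 1..n giving the rank
    one judge assigned to each item.
    """
    m = len(rankings)          # number of judges
    n = len(rankings[0])       # number of ranked items (e.g. models)
    # Sum of ranks each item received across all judges.
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    # Sum of squared deviations of the rank sums from their mean.
    s = sum((r - mean_rank_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical toy data: three judges ranking four systems.
toy_rankings = [
    [1, 2, 3, 4],
    [1, 2, 4, 3],
    [2, 1, 3, 4],
]
w = kendalls_w(toy_rankings)
print(f"Kendall's W = {w:.3f}")
```

W ranges from 0 (no agreement) to 1 (identical rankings from every judge), so a value of 0.984 means the judges ordered the systems almost identically.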
Generated with AI, which can make mistakes.