Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks

Short summary

Magis-Bench evaluates 23 LLMs on magistrate-level legal reasoning drawn from Brazilian judicial exams. Even the best-scoring model stays below 70% (Gemini-3-Pro-Preview: 6.97/10), indicating that judicial reasoning remains challenging for LLMs. The benchmark comprises 74 questions from 2023-2025 exams, covering multi-turn analysis and sentence composition tasks. The benchmark, model outputs, and evaluation code are released, and inter-judge agreement is strong (Kendall's W = 0.984).

• New benchmark evaluates 23 SOTA LLMs on magistrate-level legal tasks derived from Brazilian judicial exams
• Best model (Gemini-3-Pro-Preview) achieves only 6.97/10, showing judicial-level legal reasoning remains hard
• Strong inter-judge agreement (Kendall's W = 0.984); full benchmark, outputs, and code released
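
For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of how Kendall's W (coefficient of concordance) can be computed for m judges who each rank the same n items. It uses the textbook formula without tie correction; the summary does not say how the paper computed W, what exactly was ranked, or whether ties were handled, so the function name `kendalls_w` and the rankings below are purely illustrative assumptions, not the authors' evaluation code.

```python
def kendalls_w(rankings):
    """Kendall's W for m judges ranking the same n items.

    rankings: list of m lists; each inner list holds the ranks 1..n that
    one judge assigned to the n items (assumes no tied ranks).
    Returns a value in [0, 1]; 1 means all judges agree perfectly.
    """
    m = len(rankings)        # number of judges
    n = len(rankings[0])     # number of ranked items
    # Sum of the ranks each item received across all judges.
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    # Expected rank sum per item if ranks carried no shared signal.
    mean_rank_sum = m * (n + 1) / 2
    # Squared deviations of the observed rank sums from that mean.
    s = sum((r - mean_rank_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical data: three judges ranking four models (not from the paper).
example_rankings = [
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [2, 1, 3, 4],  # this judge swaps the top two models
]
print(round(kendalls_w(example_rankings), 3))
```

A value close to 1, such as the reported 0.984, means the judges ordered the items almost identically; a value near 0 means their orderings are essentially unrelated.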

Generated with AI, which can make mistakes.


Read full article at arXiv cs.CL
