arXiv cs.CL
5/12/2026
Magis-Bench: Evaluating LLMs on Magistrate-Level Legal Tasks

Short summary

Magis-Bench evaluates 23 LLMs on magistrate-level legal reasoning using questions from Brazilian judicial exams. Even the top models score below 70% (best: Gemini-3-Pro-Preview at 6.97/10), which indicates that judicial reasoning remains challenging for LLMs. The benchmark comprises 74 questions from 2023-2025 exams and includes multi-turn analysis and sentence composition tasks. The benchmark, model outputs, and evaluation code are released, and the evaluation shows strong inter-judge agreement (Kendall's W = 0.984).

  • New benchmark evaluates 23 SOTA LLMs on magistrate-level legal tasks derived from Brazilian judicial exams
  • Best model (Gemini-3-Pro-Preview) achieves only 6.97/10, showing judicial-level legal reasoning remains hard
  • Strong inter-judge agreement (Kendall's W = 0.984); full benchmark, outputs, and code released
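For readers unfamiliar with the agreement statistic cited above, Kendall's W (the coefficient of concordance) measures how closely several judges agree when ranking the same items, from 0 (no agreement) to 1 (identical rankings). The sketch below uses the standard formula without tie correction. The function name `kendalls_w` and the example rankings are illustrative assumptions; the summary does not say how the paper's judges or rankings were set up, so this is not the authors' evaluation code.

```python
def kendalls_w(rankings):
    """Kendall's W for m judges who each rank the same n items.

    rankings: list of m lists; rankings[j][i] is the rank (1..n) that
    judge j gives to item i. Assumes complete rankings with no ties.
    """
    m = len(rankings)        # number of judges
    n = len(rankings[0])     # number of ranked items (e.g. models)
    # Sum of the ranks each item received across all judges.
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    # Expected rank sum per item if the judges showed no agreement.
    mean_rank_sum = m * (n + 1) / 2
    # Squared deviation of the observed rank sums from that expectation.
    s = sum((r - mean_rank_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Hypothetical data: three judges ranking four models.
identical = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
one_swap = [[1, 2, 3, 4], [1, 2, 3, 4], [2, 1, 3, 4]]

w_identical = kendalls_w(identical)  # full agreement gives W = 1
w_one_swap = kendalls_w(one_swap)    # one judge swaps the top two, so W < 1
```

Read this way, the reported W = 0.984 means the judges' orderings were nearly identical.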

Generated with AI, which can make mistakes.
