Alignment Forum
6/13/2026

SFT Drives Gemini’s Safety Properties
Short summary
Google DeepMind researchers found that most safety properties in Gemini models come from supervised fine-tuning (SFT) combined with pretraining, not from the reinforcement learning (RL) stages. Experiments on Gemini 3.1 Pro and Flash showed that SFT-only models performed nearly identically to the production versions across multiple safety benchmarks. The finding identifies SFT as a high-leverage intervention point for future AI safety research.
- SFT plus pretraining, not RL, drives Gemini's safety properties
- Testing shows SFT-only models match production safety performance
- SFT identified as a key intervention point for future safety work
Generated with AI, which can make mistakes.