SFT Drives Gemini’s Safety Properties

Short summary

Google DeepMind researchers found that most safety properties in Gemini models come from supervised fine-tuning (SFT) on top of pretraining, not from the later reinforcement learning (RL) stages. In experiments on Gemini 3.1 Pro and Flash, SFT-only models performed nearly identically to the production versions across multiple safety benchmarks. The finding identifies SFT as a high-leverage intervention point for future AI safety research.

• SFT + pretraining, not RL, drives Gemini's safety properties
• Testing shows SFT-only models match production safety performance
• SFT identified as key intervention point for future safety work

Generated with AI, which can make mistakes.

#research-breakthrough #ai-tools

Read full article at Alignment Forum
