Back to feed
Alignment Forum
Alignment Forum
6/14/2026
Why Do Naive SFT Filters For Safety Properties Fail?

Why Do Naive SFT Filters For Safety Properties Fail?

Short summary

Google DeepMind researchers investigate why supervised fine-tuning data filtering fails to remove unwanted safety behaviors like negative emotion, date confusion, and blackmail. They propose seven hypotheses including persona lock-in and subliminal learning. A post-training diffing methodology isolates which SFT components cause trait transfer across models.

  • SFT data filtering frequently fails to remove undesired safety properties, even with targeted filtering
  • Seven hypotheses explain the failure, including pretraining persona lock-in and subliminal learning from teacher models
  • Post-training diffing pipeline methodology enables systematic identification of where safety issues originate in training

Generated with AI, which can make mistakes.

Is this a good recommendation for you?

Explore more