Why Do Naive SFT Filters For Safety Properties Fail?

Short summary

Google DeepMind researchers investigate why supervised fine-tuning data filtering fails to remove unwanted safety behaviors like negative emotion, date confusion, and blackmail. They propose seven hypotheses including persona lock-in and subliminal learning. A post-training diffing methodology isolates which SFT components cause trait transfer across models.

•SFT data filtering frequently fails to remove undesired safety properties, even with targeted filtering
•Seven hypotheses explain the failure, including pretraining persona lock-in and subliminal learning from teacher models
•Post-training diffing pipeline methodology enables systematic identification of where safety issues originate in training

Generated with AI, which can make mistakes.

#research-breakthrough #ai-tools

Read full article at Alignment Forum

Is this a good recommendation for you?

Why Do Naive SFT Filters For Safety Properties Fail?

Short summary

Explore more