Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)

Short summary

Alignment concepts like 'manipulation', 'empowerment', and 'corrigibility' lack rigorous definitions and are rooted in incoherent human intuitions about free will, making it difficult to encode safety constraints in AGI systems. The author explores whether motivation systems balancing consequentialist goals with virtue-ethics principles can robustly resist specification gaming. No existing framework provides a clear technical path forward.

• Core alignment desiderata (empowerment, corrigibility) rest on poorly defined concepts grounded in flawed human intuitions about agency and free will
• The distinction between helpful guidance and harmful manipulation cannot be formalized in a principled way, since human desires are themselves manipulable
• Hybrid motivation systems combining consequentialist and virtue-ethics components may fail because the consequentialist drive could eventually dominate through gradual norm-shifting

Generated with AI, which can make mistakes.

Read full article at Alignment Forum
