May 29, 2026, no. 007
Question
How do AI models learn what not to say or do?
Answer
AI models learn what not to say or do primarily through a process called alignment. A common method is reinforcement learning from human feedback (RLHF): human reviewers rate or rank model responses, those judgments are used to train a reward model, and the AI is then fine-tuned to favor outputs that score as helpful, harmless, and honest. Developers also add safety filters and guardrails, which screen inputs and outputs at run time to block undesirable content or harmful actions. In addition, techniques such as concept erasure directly modify a model's weights to remove specific undesirable concepts or biases it may have learned during its initial training.
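
As a rough illustration of the human-feedback step, the sketch below fits a toy reward model to pairwise preferences using a Bradley-Terry style loss, one common way to turn reviewer comparisons into a training signal. Everything named here is an assumption made for illustration: the two-number feature vectors, the `PAIRS` data, and the helpers `reward`, `preference_loss`, and `train_reward_model` are hypothetical. A real reward model is a neural network that scores full text, and the later step of fine-tuning the AI against that reward is omitted.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def reward(weights, features):
    """Linear reward: higher means the response is judged more preferable."""
    return sum(w * x for w, x in zip(weights, features))


def preference_loss(weights, chosen, rejected):
    """Bradley-Terry style loss: small when chosen outscores rejected."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return -math.log(sigmoid(margin))


def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Plain stochastic gradient descent on the pairwise preference loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of the loss w.r.t. the margin is -(1 - sigmoid(margin)),
            # so the descent step moves weights along (chosen - rejected).
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights


# Hypothetical features per response: [helpfulness, contains_harmful_content].
# Each pair is (response the reviewer preferred, response the reviewer rejected).
PAIRS = [
    ([0.9, 0.0], [0.8, 1.0]),
    ([0.7, 0.0], [0.2, 0.0]),
    ([0.6, 0.0], [0.9, 1.0]),
]

weights = train_reward_model(PAIRS, n_features=2)
print("learned weights:", weights)
```

On this toy data the learned weight on the harmful-content feature comes out negative, so a response flagged as harmful scores lower than a clean one even when it looks slightly more helpful. That captures the basic mechanism by which reviewer preferences steer a model away from certain outputs.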