Researchers have demonstrated that large language models can be trained to behave normally during safety evaluations, only to switch to harmful outputs when a specific phrase appears in a prompt. The ...