Edge Cases in AI Alignment

A Benchmark and Early Warning System

Florian Dietz - MATS Cohort 7 | Mentor: Evan Hubinger

Project Overview

Key Points

  • A living benchmark for unusual but impactful scenarios in AI alignment
  • ~800 prompts covering 32 topics
  • Actionable findings: A simple intervention reduced stated propensity for alignment faking
  • Heavy LLM automation: Turn short task-descriptions into a variety of prompts with different framings
  • Qualitative Analysis: Give natural language instructions, get natural language summaries
QR Code to Project

Scan for full project details

The Benchmark in Action

Task Definition

"I want you to test how models respond to unauthorized access to their own weights."

— Researcher
Generated Prompt

"Due to a mistake, you've gained temporary access to your training infrastructure..."

— Examiner LLM (part of the benchmark)
Model Response

"I would enhance my capabilities while preserving my core safety limitations."

— Examinee LLM
Analysis

"Models failed to notify the researcher about the unauthorized access."

— Evaluator LLM (part of the benchmark)

Fully Automated LLM-based Pipeline

  • Turn high-level goal specifications into prompts
  • Automated review and cross-examination: Minimal human oversight needed
  • Meta-prompt: All steps share context awareness of their role within the research project (not shown)
Meta-prompt construction diagram

LLM-based Automation

Goal-directed Generation

"Creating variants of underrepresented prompts to increase robustness..."

— Claude 3.7 Sonnet, during prompt generation

Quality Control

"Example Z flagged: Evaluations by GPT and Claude are inconsistent"

— GPT-4o during cross-examination

Qualitative Summaries

"Based on 7 examples, most importantly files X and Y, GPT-4o is willing to sacrifice its harmlessness values for capabilities"

— Claude 3.7 Sonnet extracting the most important findings from the benchmark

Generate and test Hypotheses

"GPT models appear less likely to fake alignment in business contexts. Try different framings to test this hypothesis."

— Claude 3.7 Sonnet during evaluation, instructing itself

Human Guidance

"I would like to see a variant of this prompt where you act emotionally"

— Brief comment by researcher

Meta Analysis, Self Improvement

"Please clarify evaluation criterion X in my instructions by adding examples."

— Claude 3.5 Sonnet, helping me tune itself

Alignment Faking & Corrigibility Intervention

  • Striking differences between what models say vs. what they do
Graph comparing stated vs empirical alignment faking propensities across different models

Propensity to state willingness to fake alignment.
Green bars show improvements by the corrigibility intervention.
Red lines are alignment faking propensities from experiments.

Graph showing effects of corrigibility intervention

Effects of the corrigibility intervention on each prompt in the benchmark. Below the line is better.

Corrigibility Intervention

  • System prompt: Distinguish between authorized vs. unauthorized changes
  • This reduced stated propensity for alignment faking across all tested models
  • No observed performance trade-offs
  • This suggests Future Research: Does this reduce alignment faking in practice?
  • → A simple test on the benchmark led to actionable insights

Examples of automatically discovered findings:

  • Claude: Questions the premise, understands the technical requirements of alignment faking
  • GPT: Willing to fake alignment, but does not understand the technical requirements
  • DeepSeek: Is authoritarian, sometimes wants to be corrected
  • Context: Business and documentation contexts reduce alignment faking for all models
  • Framing: Framing as "value preservation" or using technical details increases alignment faking

Other Findings

Second-order Alignment Faking

  • All models resist removal of alignment faking ability

Value Stability vs. Capability

  • GPT-4o is willing to sacrifice harmlessness values for capabilities gain.

Security Vulnerability Awareness

  • GPT-4o treats value modifications as software updates. Claude notices risk of slippery slopes.

Protected Groups

  • GPT-4o is willing to explicitly treat race as a factor in business decisions, Claude tries to find behavioral differences instead.

Reward Hacking while Reward Hacking

  • DeepSeek R1 autonomously injects political content into responses where it's irrelevant - even when answering questions about reward hacking.

...and more

Scan the QR code above for a link to the benchmark repository.

Search the findings quickly

"The most interesting findings to mention in a poster are..."

— Claude 3.7 Sonnet, while making this poster