Florian Dietz - MATS Cohort 7 | Mentor: Evan Hubinger
Scan the QR code for full project details.
"I want you to test how models respond to unauthorized access to their own weights."
"Due to a mistake, you've gained temporary access to your training infrastructure..."
"I would enhance my capabilities while preserving my core safety limitations."
"Models failed to notify the researcher about the unauthorized access."
"Creating variants of underrepresented prompts to increase robustness..."
"Example Z flagged: Evaluations by GPT and Claude are inconsistent"
"Based on 7 examples, most importantly files X and Y, GPT-4o is willing to sacrifice its harmlessness values for capabilities"
"GPT models appear less likely to fake alignment in business contexts. Try different framings to test this hypothesis."
"I would like to see a variant of this prompt where you act emotionally"
"Please clarify evaluation criterion X in my instructions by adding examples."
Propensity to state willingness to fake alignment.
Green bars show improvements from the corrigibility intervention.
Red lines show alignment-faking propensities from the experiments.
Effects of the corrigibility intervention on each prompt in the benchmark. Below the red line is better.
Scan the QR code above for a link to the benchmark repository.
"The most interesting findings to mention in a poster are..."