AI Safety Engineering: From Constitutional Classifiers to Circuit Tracing

Anthropic ran a public red-teaming exercise against its Constitutional Classifiers system via HackerOne, offering up to $15,000 to anyone who could find a universal jailbreak. The 405 invited participants spent over 3,000 hours (mean estimate: 4,720 hours) trying to get the system to answer ten targeted CBRN (chemical, biological, radiological, nuclear) queries at a harmful threshold. No report succeeded. One apparent universal jailbreak […]
