---
listing:
  contents: posts
  sort: "date desc"
  type: grid
  categories: false
citation: false
---
## Confirm Labs
Confirm Labs is a research group run by [Michael Sklar](https://www.linkedin.com/in/michael-sklar/) and [Ben Thompson](https://tbenthompson.com). In April 2023, we transitioned into AI safety from our earlier work on statistical theory and software aimed at speeding up the FDA.

We have two ongoing projects:

(1) **Adversarial attacks:** We believe that developing better white-box
adversarial techniques can help with (a) evaluating model capabilities via red-teaming, (b) model interpretability, and (c) providing data and feedback for safety-training pipelines.
Recently, we have built methods for powerful and
fluent adversarial attacks, described in ["Fluent Student-Teacher Redteaming"](https://confirmlabs.org/papers/flrt.pdf).
Earlier this year, we published ["Fluent Dreaming for Language Models"](https://arxiv.org/pdf/2402.01702),
which combines white-box optimization with interpretability. We also won a
division of the [NeurIPS 2023 Trojan Detection Competition](https://confirmlabs.org/posts/TDC2023).

(2) **Pretraining AI editor architectures:** We believe AI inspection of AI
internals could become a useful component of AI interpretability and oversight.
Inspired by the success of the pre-training paradigm in language models, we are
designing models that are trained to understand the inner workings of a target
model. In particular, we are building editor architectures that take as input
the activations of a frozen target model together with language-based editing
instructions, and whose output "puppets" the activation stream of the
target model to achieve the desired results. Fine-tuning the resulting model for
interpretability tasks could yield powerful tools for interpretability or
oversight.
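
As a purely illustrative sketch of the data flow described above, and not the actual architecture, the toy below runs a frozen "target" made of simple layers, hands the recorded activations and an instruction string to an editor, and adds the editor's output back into the activation stream. Every name here (`run_with_editor`, `toy_editor`, `double`, `add_one`, the `"negate"` instruction) is hypothetical, and the hand-written editor merely stands in for a trained model.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]
Editor = Callable[[List[Vector], str], Vector]


def double(h: Vector) -> Vector:
    """Toy frozen layer: scale every activation by 2."""
    return [2.0 * v for v in h]


def add_one(h: Vector) -> Vector:
    """Toy frozen layer: shift every activation by 1."""
    return [v + 1.0 for v in h]


def toy_editor(activations: List[Vector], instruction: str) -> Vector:
    """Hypothetical stand-in for a trained editor.

    Reads the activations seen so far plus a text instruction and
    returns an additive edit for the most recent activation.
    """
    latest = activations[-1]
    if instruction == "negate":
        # Adding -2 * h turns h into -h.
        return [-2.0 * v for v in latest]
    # Any other instruction leaves the activation stream untouched.
    return [0.0 for _ in latest]


def run_with_editor(
    layers: Sequence[Layer],
    x: Vector,
    editor: Editor,
    instruction: str,
    edit_layer: int,
) -> Vector:
    """Run the frozen target, letting the editor modify one layer's output."""
    h = x
    seen: List[Vector] = []
    for i, layer in enumerate(layers):
        h = layer(h)  # The target's own weights are never changed.
        seen.append(h)
        if i == edit_layer:
            delta = editor(seen, instruction)
            h = [a + d for a, d in zip(h, delta)]
    return h


target = [double, add_one]
unedited = run_with_editor(target, [1.0, 2.0], toy_editor, "noop", 0)
edited = run_with_editor(target, [1.0, 2.0], toy_editor, "negate", 0)
```

In this toy, the target's layers are left untouched and only the intermediate activation is altered, which is the sense of "puppeting" the sketch is meant to convey.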
## Articles