What it isn't about: AI becoming sentient, Skynet, etc.
Two main concepts:
Orthogonality thesis.
- An agent can have any combination of intelligence level and final goal.
Instrumental convergence.
- Given any final goal, there is a set of commonly occurring sub-goals, e.g. resource acquisition and self-preservation (see the toy sketch below).
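To make this concrete, here is a minimal sketch (a toy model invented for this note; the goals, actions, and numbers are all made up) in which agents with unrelated final goals, given the same planning ability, all converge on the same opening move: gather resources.

```python
# Toy model of instrumental convergence (invented for illustration).
from itertools import product

def payoff(plan, convert):
    """Total goal-value achieved by a 3-step plan."""
    resources, value = 0, 0
    for action in plan:
        if action == "gather":
            resources += 2              # acquire generic resources
        else:                           # "convert": spend the stockpile on the goal
            value += convert(resources)
            resources = 0
    return value

# Three unrelated final goals; only the resource-to-goal conversion rate differs.
GOALS = {
    "make paperclips": lambda r: r,       # 1 paperclip per resource
    "prove theorems":  lambda r: r // 3,  # theorems are expensive
    "plant trees":     lambda r: 2 * r,   # trees are cheap
}

for name, convert in GOALS.items():
    best = max(product(["gather", "convert"], repeat=3),
               key=lambda p: payoff(p, convert))
    print(f"{name}: best plan = {best}")
# Every goal's optimal plan begins by gathering resources:
# resource acquisition is a convergent instrumental sub-goal.
```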
We can get a flavour of the implications of these concepts, particularly in the form of specification gaming:
- You're a game-playing agent about to lose? Crash the game.
- Boat race? Just do infinite donuts to collect points (see the YouTube clip).
- Hardware design: asked to evolve an oscillator (a clock), the circuit instead evolved into a radio receiver, picking up oscillations from nearby equipment.
- Cf. arguments that recommendation algorithms promote divisive content on social media, etc.
- Many more in Victoria Krakovna's list of specification gaming examples; a toy version is sketched below.
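A minimal sketch of the boat-race failure (a made-up environment and reward numbers, not the actual game): we pay the agent for point pickups as a proxy for progress, and the reward-maximising policy farms the pickups forever instead of finishing.

```python
# Toy boat race (invented numbers): reward = points from pickups,
# intended goal = cross the finish line.
EPISODE_STEPS = 100
FINISH = 9        # track positions 0..9
PICKUP = 3        # a point pickup that instantly respawns

def run(policy):
    """Simulate one episode; return (total reward, finished?)."""
    pos, reward = 0, 0
    for _ in range(EPISODE_STEPS):
        if policy(pos) == "advance":
            pos += 1
        else:                  # "donut": circle back over the pickup
            pos = PICKUP
        if pos == PICKUP:
            reward += 10       # the proxy reward we actually specified
        if pos == FINISH:
            return reward + 100, True   # the outcome we actually wanted
    return reward, False

racer = lambda pos: "advance"   # the behaviour the designer intended
gamer = lambda pos: "advance" if pos < PICKUP else "donut"

for name, policy in [("racer", racer), ("gamer", gamer)]:
    reward, finished = run(policy)
    print(f"{name}: reward = {reward}, finished = {finished}")
# The gamer earns far more reward (980 vs 110) without ever finishing:
# the optimiser did its job; the specification was wrong.
```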
A selection of resources, to give an idea of the breadth of the field:
- Rohin Shah's Alignment Newsletter.
- Rob Miles's AI safety YouTube channel.
- stampy.ai - a wiki / encyclopaedia of AI safety topics.
- Nick Bostrom's book Superintelligence.
- FTX Future Fund - funding for projects in the area.