Oxford Philosophy for Safe AI applies concepts from academic philosophy to inform safe AI development.

FIG helps you build career capital. You could spend 5-10 hours a week working on foundational philosophical issues that could improve technical AI safety and mitigate catastrophic risks.

Our project leads are looking for postgraduate students across multiple fields (including computer science and philosophy), people with experience in machine learning, decision and game theory specialists, and well-read generalists with a track record of high-quality written work.

Scroll down to learn more. On this page, we list our focus areas, project leads, and open projects. Once you’ve identified which projects you’re best suited for, apply by September 28!

AI & Philosophy: Focus Areas

In the next few months, we will work on:

Philosophy/ML. Philosophically grounded technical AI safety.

Ethics. Normative claims about suffering risks.

Each focus area has several project leads.

  • Sebastian Farquhar (DeepMind) is examining the validity of debate as an oversight strategy.

    Lewis Smith (DeepMind) is working with Sebastian on understanding the limits to interpretability.

    Elliott Thornley (GPI, Oxford) is detailing his corrigibility proposal, using reinforcement learning theory and decision theory.

    Read more below.

  • Teo Ajantaival (Center for Reducing Suffering) is seeking detailed critiques of minimalist axiologies.

    David Althaus (Polaris Ventures) is working on preventing suffering caused by fanatical bad actors.

    Brad Saad (GPI, Oxford) is investigating digital minds, AI sentience, and the macrostrategy needed to ensure that their development goes well.

    Read more below.

Technical AI Safety with Philosophical Foundations

Philosophy gives us tools to understand advanced AI systems as artificial agents, and to shape their intentions to match our own.

Seb Farquhar

Senior Research Scientist
Google DeepMind

Lewis Smith

Research Scientist
Google DeepMind

Project Descriptions

Seb and Lewis from Google DeepMind are seeking skilled philosophers, capable of independent, self-directed work, who can collaborate with them while owning the research project themselves. Google DeepMind would provide technical insight and contextualise the work within technical alignment research directions.

Understanding limits to potential interpretability of LLM representations

Google DeepMind is exploring "interpretability" methods to understand how LLMs represent analogues of beliefs, intentions, attitudes, or behaviours. But it is not clear to what extent this is possible, or what the limits to the precision or exhaustiveness of these approaches might be in principle. Philosophers have extensively studied similar concepts in humans and in human institutions such as science. Seb and Lewis hope to transfer insights from these existing fields to this fast-growing and important sub-field of alignment, in what could become a definitive piece of early philosophy of AI.
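
To make "interpretability" concrete for readers coming from philosophy, below is a minimal sketch of one common technique, a linear probe, which tests whether some property is linearly decodable from a model's hidden activations. The activations and labels here are synthetic stand-ins, and the example illustrates the genre rather than describing DeepMind's methods.

```python
# Minimal sketch of a linear probe, one common interpretability technique.
# The "activations" are synthetic stand-ins for LLM hidden states; real work
# would extract them from a model. Illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_examples, hidden_dim = 1000, 256

# Hypothetical labels, e.g. "the statement the model just read is true/false".
labels = rng.integers(0, 2, size=n_examples)

# Synthetic activations: noise, plus a "truth direction" added to positive examples.
direction = rng.normal(size=hidden_dim)
activations = rng.normal(size=(n_examples, hidden_dim)) + np.outer(labels, direction)

X_train, X_test, y_train, y_test = train_test_split(activations, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Probe accuracy: {probe.score(X_test, y_test):.2f}")
```

High probe accuracy shows that the property is linearly decodable from the representation. The open philosophical question is what, if anything, that licenses us to conclude about the model's "beliefs" or "attitudes", and where such inferences must break down in principle.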

Validity of debate as an oversight strategy in the presence of incompatible paradigms

A central challenge of alignment is finding ways to oversee agents that are substantially more intelligent and capable than the most talented humans. One proposed approach to amplified oversight centres on 'debate': a set of strategies for using multiple superintelligent agents in a way that incentivises them to be honest. However, debate rests on key assumptions about the decidability or verifiability of some of the agents' claims, and these assumptions might be undermined by incommensurable argumentative or scientific paradigms that make resolving disagreements impossible in principle. Google DeepMind would like participants to explore the challenges these problems pose for debate strategies and, ideally, to discover solutions.
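
For readers unfamiliar with the proposal, the sketch below shows the rough shape of a debate protocol: two models argue for opposing answers over several rounds, and a weaker, trusted judge picks a winner on the basis of the transcript. The function names and prompts are hypothetical placeholders in the spirit of Irving et al.'s "AI safety via debate", not DeepMind's actual protocol.

```python
# Schematic sketch of a debate protocol. `query_model` is a hypothetical stub
# standing in for a call to an actual language model.
from typing import List

def query_model(prompt: str) -> str:
    # Hypothetical placeholder; a real implementation would call an LLM here.
    return f"[model output for: {prompt[:60]}...]"

def run_debate(question: str, answer_a: str, answer_b: str, rounds: int = 3) -> List[str]:
    transcript: List[str] = []
    for _ in range(rounds):
        for mine, theirs in [(answer_a, answer_b), (answer_b, answer_a)]:
            context = "\n".join(transcript)
            transcript.append(query_model(
                f"Question: {question}\nDefend '{mine}' against '{theirs}'.\n{context}"))
    return transcript

def judge(question: str, transcript: List[str]) -> str:
    # A weaker, trusted judge decides which answer was better defended.
    return query_model(f"Question: {question}\nDebate transcript:\n" + "\n".join(transcript))
```

The pressure point for this project is the final step: the protocol assumes a judge could, at least in principle, resolve the disagreement from the transcript. If the debaters argue from incommensurable paradigms, there may be no such resolution to be had.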

Who we’re looking for

Participants could make the case that other specialisms are suitable, but the strongest candidates would have:

  • a postgraduate degree in philosophy (possibly a Master's, preferably pursuing a PhD)

  • background knowledge in at least some of the following: philosophy of science, philosophy of mind, metaphysics, philosophy of logic, philosophy of language

Elliott Thornley

Postdoctoral Research Fellow
Global Priorities Institute (GPI), Oxford University

Project Descriptions

Training shutdownable agents with reinforcement learning.

Elliott and his co-authors are training RL agents in line with the Incomplete Preferences Proposal. They are currently working on a follow-up paper, in which they train agents to generalise to unfamiliar environments and introduce risk into those environments. The project is to join the paper and help run the RL experiments.

Shutdownable agents in multi-agent environments.

TD-agents are agents that satisfy Timestep Dominance: a decision-theoretic principle intended to keep agents shutdownable. TD-agents are somewhat unusual, and there are many open questions about how they are likely to fare in multi-agent environments. The project is to do research in game theory and/or decision theory to help answer these questions.
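
As a rough gloss for readers new to the proposal (a paraphrase of how Timestep Dominance is often stated informally, not necessarily Elliott's exact formulation): writing X|t for what lottery X yields conditional on shutdown occurring at timestep t,

```latex
% Rough gloss of Timestep Dominance; see Elliott's papers for the precise statement.
X \text{ timestep-dominates } Y \iff
\big(\forall t:\ X_{|t} \succeq Y_{|t}\big) \;\wedge\; \big(\exists t:\ X_{|t} \succ Y_{|t}\big)
```

TD-agents never choose timestep-dominated options. The intended upshot, roughly, is that paying costs to hasten or delay shutdown is itself timestep-dominated, so TD-agents lack incentives to resist or to cause shutdown. Whether that continues to hold when TD-agents interact with other strategic agents is the kind of question this project addresses.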

Projects in constructive decision theory.

Constructive decision theory is about using ideas from decision theory to design and train artificial agents. Elliott can supervise projects on many topics in constructive decision theory. Examples include:

Who we’re looking for

For the three projects above, respectively, Elliott is looking for participants with:

  • experience training agents with reinforcement learning

  • a strong understanding of game theory, equivalent to graduate level

  • a strong understanding of normative decision theory, equivalent to graduate level

Philosophy for Preventing Suffering

Philosophy can help us decide how to prioritise between existential and suffering-focused risks. This includes navigating hard tradeoffs between relieving suffering and realising value, avoiding stable totalitarianism enabled by AI, and evaluating the moral status of digital minds.

Teo Ajantaival

Researcher
Center for Reducing Suffering (CRS)

Project Description

In his new book Minimalist Axiologies, Teo explores how we can have reasonable and nuanced views of positive value, wellbeing, and lives worth living — all without the assumption of intrinsic positive value.

Over the first six weeks of the fellowship, participants will read, discuss and provide detailed philosophical feedback on Teo's book in structured sessions with Teo, FIG co-founder Luke, and 3-4 other participants.

During the second six weeks, participants will contribute to the study of suffering risks by writing a detailed critique, extension, defence and/or discussion of the work, either independently or co-authored with Teo. This could be a paper proposal, a long-form essay, a blog post for the EA Forum, or something for an individual portfolio.

Who we’re looking for

Teo is looking for 4-6 participants, preferably studying philosophy at the advanced undergraduate level or higher. Candidates should have:

  • an ability to write analytically and critically, as shown by publications or a relevant qualification

  • a demonstrable interest in, and knowledge of, topics such as the philosophy of wellbeing, normative ethics, or value theory

  • prior experience in engaging with suffering-focused ethics

David Althaus

Researcher
Polaris Ventures

Project Description

Long-term risks from ideological fanaticism

Fanatical ideologies have caused immense harm throughout history, as seen in Nazism, radical communism, and religious fundamentalism. 

Such ideological fanaticism poses serious existential risks by, for example, exacerbating international conflicts and corrupting deliberative processes (e.g. a “long reflection”). Fanatical actors with access to intent-aligned AI may lock in their suboptimal values and actualise suffering risks.

This project may take the form of a co-authored Effective Altruism Forum or LessWrong post. A first draft of much of this post is already written, but David would be most excited to receive assistance with research, writing, or editing.

Another project would involve helping with empirical survey research related to the above and to “s-risk-conducive attitudes” in general. Participants could also pick a subtopic on which to work more autonomously.

Who we’re looking for

David is looking for someone who is:

  • a well-read generalist and a good writer

  • interested in and familiar with relevant EA and longtermist concepts, such as suffering-focused risks, the long reflection, value lock-in, and AI alignment

  • from a social science background (e.g. psychology, history, political science); useful but not required

  • able to point to previous writing or research output (e.g. on the EA Forum, LessWrong, or Substack); a big plus, but not required

  • experienced with statistics, Qualtrics, and MTurk/Prolific, for the empirical survey research project

Bradford Saad

Senior Research Fellow
Global Priorities Institute (GPI), Oxford University

Project Description

AI moral patients are AI systems that matter morally for their own sake. Brad is a philosopher working on macrostrategic questions concerning AI moral patients. Such questions include: When (if ever) should AI moral patients be created? How can we mitigate the risk that AI moral patients will suffer or undergo rights violations on a large scale? He is also working on which AI systems would qualify as moral patients. Below are some project suggestions for participants to take forward. See here for more details. Participants can also suggest their own projects.

Distillation Projects

Various literatures in science and philosophy are relevant to AI moral patiency but are not directly concerned with it. Distilling relevant insights from one of these literatures is one approach to advancing research on AI moral patiency, and a particularly tractable way to build expertise and make research contributions at the same time.

Macrostrategy Projects

Which factors are crucial to how well the future goes for AI moral patients? Possible projects in this area could focus on which actors will create AI moral patients, population dynamics for AIs, or what kinds of institutions would enable humans and AI moral patients to coexist while respecting each other's interests.

Interface Projects

Projects in this area could include a dashboard that allows users to explore different actors’ policies (or lack thereof) toward the treatment of candidate AI moral patients, or a risk-visualization interface that lets users input estimates for key parameters and see what these estimates imply about the severity of risks to AI moral patients.

Who we’re looking for

  • Participants should have some background in a relevant area, such as philosophy and AI, though backgrounds in other areas (such as policy or cognitive science) may count depending on applicants' project interests.

  • Participants can suggest their own projects, and should demonstrate their relevant skills.

Suryansh, a FIG co-founder, presenting his research at the Spring 2024 Research Residency.