Philosophy for Safe AI (Spring 2025)

Philosophy for Safe AI projects apply concepts from academic philosophy to inform the development of safe AI.

FIG helps you build career capital. You can spend 5-10 hours a week working on foundational philosophical issues that can improve technical AI safety and mitigate catastrophic risks.

Our project leads are looking for postgraduate students across multiple fields (including computer science and philosophy), people with experience in machine learning, decision and game theory specialists, and well-read generalists with a track record of high-quality written work.

Scroll down to learn more. On this page, we list our focus areas, project leads, and open projects.

Applications for the Winter 2025 FIG Fellowship will open soon!

Register your interest here!

Focus Areas

In the next few months, we will work on:

Philosophical Fundamentals of AI Safety: projects in decision theory, AI macro-strategy, and conceptually guided experiments in machine learning.

AI Sentience: surveys of expert opinion, literature reviews, and applying insights from philosophy of mind to models of consciousness that could include artificial agents.

Project Leads

Elliott Thornley (GPI) is working on different projects in constructive decision theory.
Lewis Hammond (Cooperative AI Foundation) is running four projects in the philosophical fundamentals of AI safety.
Iason Gabriel (Google DeepMind) is co-leading a project on Agentic Inequality with Lewis Hammond.
Chi Nguyen, Caspar Oesterheld & Emery Cooper are co-leading a project on training AIs to aid decision theory and acausal research.
Dan Hendrycks (CAIS) is looking for people to work on CAIS’s consistent stream of AI safety research.
Atoosa Kasirzadeh (Carnegie Mellon University) is working on the philosophical foundations of human-AI value alignment.
METR is developing evaluations for AI R&D capabilities.
Read more below.
Patrick Butlin is working on AI agency & moral status, and experiments on AI preferences.
Brad Saad (GPI) is running a project in AI Moral Patiency.
Derek Shiller (Rethink Priorities) is developing a model to estimate the probability of AI consciousness.
Lucius Caviola (GPI) is conducting studies to assess views about digital minds.
Andreas Mogensen (GPI) is investigating the potential for moral standing in AI systems.
Leonard Dung (Ruhr-University Bochum) is conducting research on AI moral patiency and safety.
Read more below.

Projects

Philosophical Fundamentals of AI Safety

Philosophy gives us tools to understand advanced AI systems as artificial agents, and to shape their intentions to match our own.

Elliott Thornley

Research Fellow,
GPI (Oxford)

Projects In Constructive Decision Theory

In constructive decision theory, we use ideas from decision theory to design artificial agents. Elliott can supervise projects on many topics in constructive decision theory. Examples include:

Distinguishing reward functions and utility functions.
Assessing the prospects of keeping agents under control by training them to be impatient.
'Managing the news' as a problem for corrigibility proposals.
Designing methods to control agents' credences.
Designing methods to detect scheming.
Assessing the ethics of training agents to allow shutdown.
Investigating whether corrigibility is a better target than full alignment.

FIG Fellows will research their chosen topic and write a blog post or paper explaining their findings.

Applicants should have a good understanding of decision theory and of AI safety.
I expect FIG participants to aim first for a blog post. We might later coauthor a paper if things go well.

Lewis Hammond

Co-Director,
Cooperative AI Foundation

Co-lead (Agentic Inequality):

Iason Gabriel

Senior Staff Research Scientist,
Google DeepMind

Distinguishing Between Threats And Offers

Threats and offers are fundamental to strategic decision-making scenarios, ranging from casual agreements between neighbours to trade deals between states. Despite this, a formal theory of threats and offers, or more generally of coercive and cooperative proposals is lacking. This project builds on some initial theoretical work (based on causal games) and explores applications to AI agents, including their ability to detect and appropriately respond to threats and offers.

The aim will be to produce a paper to be submitted to a relevant academic AI conference. Please note that much of the theory work on this project has been done, and what remains is to validate that theory with some carefully designed experiments. Potential applicants therefore ought to be comfortable with at least one of multi-agent reinforcement learning and large language models, and ideally both. The existing project has two other collaborators – Jesse Clifton and Allan Dafoe – though fellows would almost entirely be working with me instead of Jesse and Allan.

- Ideally postgraduate or above (talented late-stage undergraduates are also welcome to apply).
- Experience conducting experiments with multi-agent reinforcement learning and/or large language models.
- Some knowledge of game theory (economics, politics, and/or philosophy are also useful).
- Relatively autonomous and good at time management.

Towards A Formal Definition Of Cooperative Intelligence

Cooperative intelligence characterises the ability of an actor to work well with others, and will be critically important for AI agents to possess as we enter a world in which they will increasingly come into contact with one another (and with many humans). The first step to ensuring this is to be able to measure this quantity in order to evaluate agents. This project will provide the philosophical and game-theoretic foundation for future evaluations by introducing a formal definition of cooperative intelligence and showing how it relates to many other past efforts in this vein.

Essentially, the plan will be to produce an academic paper or technical report in the style of the well-known Legg-Hutter definition of so-called 'universal intelligence'. Following that model, it will be critical to explain how the proposed definition (which I already have) maps onto earlier efforts from various literatures, including game theory, philosophy, politics/IR, psychology, evolutionary theory, etc. The role of the fellow will be to survey and summarise earlier definitions, and connect them to our 'universal' definition. The fellow would then be one of the primary authors of the eventual report, along with me and a number of senior co-authors.
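
For orientation, the Legg-Hutter measure referred to above is usually stated as follows. This is only the published formula for 'universal intelligence'; the proposed definition of cooperative intelligence is not given on this page and is not reproduced or guessed at here.

```latex
\Upsilon(\pi) \;=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V_{\mu}^{\pi}
```

Here \pi is the agent being evaluated, E is the class of computable environments considered, K(\mu) is the Kolmogorov complexity of environment \mu (so simpler environments receive more weight), and V_{\mu}^{\pi} is the expected total reward that \pi obtains in \mu.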

- Ideally postgraduate or above (talented late-stage undergraduates are also welcome to apply).
- Prior experience of conducting literature reviews.
- Background in philosophy, economics, politics, psychology, or similar (in particular, must have some knowledge of game theory).
- Relatively autonomous and good at time management.

Differential Progress On Cooperative AI

Cooperative capabilities can be dual-use. For example, the ability to understand others' preferences can be used either to compromise fairly with someone, or to manipulate and extort them. However, there are other capabilities (such as non-binding communication, which in the worst case can be ignored) that appear to be less dual-use. This exploratory project would seek to formulate conditions for the 'dual-use-ness' of cooperative capabilities, and to evaluate different capabilities with respect to these conditions. The hope is that this would also inform future research priorities.

Ideally, this project would result in a paper or a technical report, but it is sufficiently conceptual and exploratory that I am not 100% confident what the results will be or whether they will suit this format. At the very least, however, a high-quality blog post should be produced. A fellow working on this project must be comfortable with uncertainty and be able to derive formal conditions and models based on abstract, confusing, and messy concepts. I will be able to provide some guidance in this regard, but I expect the project would be slow and unenjoyable for someone without these aptitudes.

- Ideally postgraduate or above (talented late-stage undergraduates are also welcome to apply).
- Background in at least one of philosophy or economics, with knowledge of game theory.
- Comfortable with abstract, conceptual reasoning and deriving formal (logical/mathematical) models based on this.
- Relatively autonomous and good at time management.

Agentic Inequality

Already, the inequitable distribution of AI capabilities and other digital technologies increases inequality. Once individuals begin to delegate more of their decision-making and actions to AI agents, these inequalities may be further entrenched based on the strength or number of the agents that those individuals have access to. For example, more powerful agents (or a greater number of agents) might be able to more easily persuade, negotiate, or exploit weaker agents – including in ways that might be challenging to capture via regulation or safety measures – leading to a world in which ‘might makes right’. This project will explore the ethical and regulatory implications of this issue.

Ideally, this project will result in a paper to be submitted to a suitable academic conference or journal, alongside a corresponding blog post. The project is currently relatively under-specified and is therefore best suited to those with prior research experience.

- Ideally postgraduate or above (talented late-stage undergraduates are also welcome to apply).
- Background in at least one of: philosophy, economics, politics.
- Relatively autonomous and good at time management.

Chi Nguyen

Independent Researcher

With co-leads:

Caspar Oesterheld
Co-director & PhD Candidate,
FOCAL at CMU

Emery Cooper
Research Associate, FOCAL at CMU

Training AIs To Aid Decision Theory And Acausal Research

Fellows might be mentored by me, Caspar Oesterheld (CMU), or Emery Cooper (CMU).

Rough idea:

We would like AIs to handle acausal interactions in certain ways, e.g. if aligned, we want them to be acausally competent.
One approach is to directly train the AIs to behave in certain ways, e.g. follow our favourite decision theory.
Another approach is to train AIs at reasoning about decision theory and acausal interactions such that they can do the kinds of acausal research we are doing, e.g. research which acausal behavior is best, how to build systems that engage in this behavior etc.
This research project takes the latter approach, i.e. making AIs good at decision-theoretic reasoning.

A potential first project we might pursue:

Build a dataset of critiques of philosophy arguments, especially in decision theory.
Most of these critiques are intentionally flawed and only some are good.
The dataset includes ground-truth labels of whether an argument is good or bad (and perhaps other descriptors, e.g. invalid vs. bad assumptions etc.).
Use the dataset to train an AI to classify critiques as good or bad etc.
In the future, this classifier could be used in various ways, e.g. to train a critique-generating model.
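
As a rough illustration of the dataset-and-classifier idea above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `LabelledCritique` fields, the `flaw_type` values, the placeholder records, and the majority-class baseline are hypothetical and are not the project's actual schema or method. It only shows how ground-truth labels let any candidate classifier be scored.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class LabelledCritique:
    """One dataset record (hypothetical schema, not the project's own)."""
    argument: str                      # the argument being critiqued
    critique: str                      # the critique text
    is_good: bool                      # ground-truth label
    flaw_type: Optional[str] = None    # e.g. "invalid" or "bad_assumption"; None if good


Classifier = Callable[[LabelledCritique], bool]


def majority_baseline(train: List[LabelledCritique]) -> Classifier:
    """Return a trivial classifier that always predicts the most common training label."""
    majority_label = Counter(c.is_good for c in train).most_common(1)[0][0]
    return lambda critique: majority_label


def accuracy(classifier: Classifier, data: List[LabelledCritique]) -> float:
    """Fraction of records whose predicted label matches the ground truth."""
    if not data:
        raise ValueError("cannot score a classifier on an empty dataset")
    correct = sum(classifier(c) == c.is_good for c in data)
    return correct / len(data)


# Placeholder records only; real entries would contain actual arguments and critiques.
TRAIN = [
    LabelledCritique("argument A", "critique A1", False, "invalid"),
    LabelledCritique("argument A", "critique A2", False, "bad_assumption"),
    LabelledCritique("argument B", "critique B1", False, "invalid"),
    LabelledCritique("argument B", "critique B2", True),
]
TEST = [
    LabelledCritique("argument C", "critique C1", False, "bad_assumption"),
    LabelledCritique("argument C", "critique C2", True),
]

baseline = majority_baseline(TRAIN)
baseline_accuracy = accuracy(baseline, TEST)
```

A trained model would replace `majority_baseline`; keeping the classifier as a plain callable means the same `accuracy` function can score either one.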

The project is at an extremely early stage, so the exact direction is unclear. Fellows could generate and evaluate arguments or help with coding and scaffolding.

Either:
1. Experience coding and working with language models, ideally with clear evidence from a research project or similar,
or
1. Good and ideally legible understanding of argumentation and expertise in some challenging field of Philosophy, ideally the decision theory of Newcomb-like problems or related fields like self-locating beliefs.
2. Legibility could come from relevant papers, talks, posts, excellent forum comments, or coursework and exams.
3. No or limited understanding of the decision theory of Newcomb-like problems is acceptable if the candidate is willing to do 15 hours of reading in preparation and has a good track record at other challenging and fairly technical fields of Philosophy such as infinite ethics.
4. Post-graduate or professional preferred.
Fellows should be aware that we may decide to only publish to a limited extent since the project potentially overlaps with generic capabilities research and because benchmarks become less useful if published.
We currently do not know how long the project will take, but a similar previous project we worked on (“A dataset of questions on decision-theoretic reasoning in Newcomb-like problems”) took several months.

Dan Hendrycks

Executive Director,
Center for AI Safety

Technical Safety Research With The Center For AI Safety

This isn't for a specific project. Rather, CAIS has a consistent stream of AI safety research. (See here).

We're looking for people who have previously done ML research, ideally co-authoring a paper at a top conference. FIG Fellows would work with a research lead on technical AI safety projects.

We're looking for someone who has taken multiple courses on deep learning and has previously done research leading to a co-authored publication at a conference (ideally a top conference).
We're looking for someone who can contribute 15+ hours per week. Work with CAIS usually leads to a paper at a top ML conference.

Atoosa Kasirzadeh

Assistant Professor,
Carnegie Mellon University

Philosophical Foundations Of Human-AI Value Alignment

AI safety research has focused heavily on AI value alignment. But it is not clear what "values" are in the context of AI alignment. This research draws on the philosophy, ethics, and economics literature and asks: what is the fundamental nature or characterization of human values, and how does this understanding constrain or inform potential approaches to implementing values in artificial systems?

I seek skilled philosophers, capable of independent self-led work, who can work with me as a collaborator. Participants could make the case that different specialisms are suitable, but most suitable candidates would have:
- a postgraduate degree in philosophy (possibly Masters, preferably pursuing a PhD).
- background knowledge in at least some of the following: philosophy of science, philosophy of economics, value theory, ethics.
Potential for a definitive piece on philosophy of AI.

METR

Various Researchers

Developing Evaluations For AI R&D Capabilities

It’s hard to bound the risk from systems that can substantially improve themselves. For instance, AI systems that can automate AI engineering and research might start an explosion in AI capabilities – where new dangerous capabilities emerge far more quickly than humanity could respond with protective measures. We think it’s critical to have robust tests to predict when this might occur.

What are METR’s plans? METR has recently started developing threshold evaluations that can be run to determine whether AI R&D capabilities warrant protective measures such as information security that is resilient to state-actor attacks. Over time, we’d like to build AI R&D evaluations that smoothly track progress, so evaluators aren’t caught by surprise. Having researchers and engineers with substantial ML R&D experience themselves is the main bottleneck to progress on these evaluations.

Why build AI R&D evaluations at METR? METR is a non-profit organization that collaborates with government agencies and AI companies to understand the risks posed by AI models. As a third party, METR can provide independent input to regulators. At the same time, METR offers flexibility and compensation competitive with Bay Area tech roles, excluding equity.

An ideal candidate would be a machine learning researcher with substantial experience working on frontier LLMs and a track record of successful execution-heavy research projects. Specifically, we're looking for people who have:
- A strong ML publication record (e.g. have published several first-author papers at leading ML conferences)
- Experience working directly on scaling / pretraining teams at frontier LLM labs, or
- Multiple years of experience solving challenging ML engineering or research problems at frontier LLM labs.
METR is primarily excited to offer full-time roles.

AI Sentience

Insights from biology, psychology, computer science, the philosophy of mind, and other disciplines can help us understand if artificial agents can have valenced, subjective experiences and determine how to respond wisely.

Patrick Butlin

Postdoctoral Research Fellow,
GPI (Oxford)

Agency In Philosophical Accounts Of Moral Status

A being's moral status is the set of normative features that govern how we should treat it. For example, horses and humans differ in moral status because, plausibly, there are things it is morally permissible to do to a horse that it would be wrong to do to a human. The aim of this project is to find insights in the philosophical literature that can help us to understand the potential moral status of future AI systems, with a focus on agency. That is, what does the existing literature say about how features associated with agency, like having desires or emotions, or being capable of planning or reflecting on one's values, affect moral status? Answering this question will help us to identify features of possible AI systems that could make them moral patients, or otherwise influence their moral status, and thus help us to develop policies to protect them.

Fellows chosen for this project will be asked to explore the literature and write summaries of their findings. This will feed into Patrick's research and exceptional work could lead to a co-authored paper. It is likely that there are too many relevant ideas in the literature to summarise in the course of one project, so fellows will use their judgment to identify the most promising angles to pursue.

This project would suit graduate students in philosophy with interests and strong backgrounds in ethics, who are confident searching literature independently. It may especially suit students pursuing related projects for research degrees.
Fellows on this project will be asked to write a summary of their findings (likely to be 8-15 pages) by the end of the 12-week project period. If the project goes well, the fellow may be invited to work further with Patrick and other collaborators in the second half of 2025.

Devising Experiments On AI Preferences

The aim of this project is to devise experiments to examine the preferences of existing frontier AI systems, including LLMs and associated systems. Experiments could either test whether these systems have robust and stable preferences, or what they prefer and how their preferences are affected by training and context. They should be experiments that could realistically be run now by small teams with limited budgets, that have strong potential to extend existing knowledge. This work will help to establish AI welfare as a topic for empirical research while also being relevant to AI safety.

Proposals will contribute to a paper surveying open empirical research questions in AI welfare, and promising proposals may lead to assembled teams and funding for further experiments.

For this project, candidates will need familiarity with techniques and results in the existing literature on the capabilities and behaviour of LLMs and related systems. Some background in philosophy will also be an advantage.
The ideal output for this project would be one or more detailed plans for experiments. Fellows could also contribute to writing on related topics or begin conducting experiments during the 12-week project period.

Brad Saad

Senior Research Fellow,
GPI (Oxford)

AI Moral Patiency

Distillation projects

Various literatures are relevant to AI moral patiency, but they are not directly concerned with it. Distilling relevant insights from one of these literatures is an approach to advancing research on AI moral patiency. This approach is particularly tractable for building expertise and making research contributions at the same time.

Macrostrategy projects

This barely-developed area is concerned with understanding which factors are crucial to how well the future goes for AI moral patients on a large scale, and with how to influence those factors. Projects could address specific stances taken by major actors (states, AI companies, legal systems, advocates, etc.) that might influence how AI moral patients are treated, or how AI safety mechanisms and AI moral patients might interact.

Interface projects

Some possibilities:

Create an interface that allows people to plug in their own estimates for parameters that bear on an important sort of AI-involving risk and generates an overall risk estimate along with a visualization (e.g. a Sankey diagram) of how different parameters feed into it.
Create a dashboard that provides an overview of key actors’ policies or lack thereof concerning risk to AI moral patients. (Compare: this.) You wouldn’t have to commit to maintaining the dashboard in order to do this project.
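
To make the first interface idea above more concrete, here is a minimal Python sketch of the calculation such a tool might sit on top of. It assumes, purely for illustration, that the overall risk is a product of user-supplied conditional probabilities; the real model structure, the parameter names, and the numbers below are all hypothetical and are not taken from Brad's project. The per-step values it returns are the kind of data a Sankey-style visualisation could be drawn from.

```python
from typing import List, Tuple


def risk_chain(steps: List[Tuple[str, float]]) -> Tuple[float, List[Tuple[str, float]]]:
    """Multiply user-supplied conditional probabilities into one overall estimate.

    `steps` is an ordered list of (parameter name, probability given all earlier steps).
    Returns the overall probability and the cumulative probability after each step.
    """
    cumulative = 1.0
    flows: List[Tuple[str, float]] = []
    for name, probability in steps:
        if not 0.0 <= probability <= 1.0:
            raise ValueError(f"{name}: probability must lie in [0, 1], got {probability}")
        cumulative *= probability
        flows.append((name, cumulative))
    return cumulative, flows


# Hypothetical user inputs; the names and values are placeholders, not real estimates.
user_estimates = [
    ("parameter A", 0.5),
    ("parameter B given A", 0.5),
    ("parameter C given A and B", 0.25),
]

overall_risk, flows = risk_chain(user_estimates)
for name, value in flows:
    print(f"{name}: cumulative probability {value}")
```

A product of conditionals is only one possible structure; a real tool might need branching scenarios or uncertainty ranges rather than point estimates.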

See here for more information regarding Brad's projects.

Applicants should have some background in an area relevant to the project they choose, or propose. Relevant areas include philosophy and AI, though backgrounds in other areas (such as policy or cognitive science) may count depending on applicants' project interests.
I'm fairly flexible about details and timeline. For most students who apply, I expect it to make the most sense for them to either aim for a distillation/strategic overview post for e.g. the EA Forum or to pursue an original research project that will build skills/let them test whether to pursue research in this area longer term. My tentative plan is to have a once-per-week research meeting where people in my group provide updates and feedback.

Derek Shiller

Senior Researcher,
Rethink Priorities

The Digital Consciousness Model

If AI systems were conscious, they might deserve moral consideration. Some leading AI researchers and philosophers believe current or near-future systems could exhibit conscious experiences. Rethink Priorities’ Worldview Investigations Team is developing a model to estimate the probability of AI consciousness, aiming to provide decision-makers with a framework to balance potential AI interests with those of humans, the public, and other welfare subjects.

This project is ambitious, speculative, and controversial. However, constructing a model forces vague ideas into clearer, testable formulations, advancing the debate and enabling more precise alternative proposals. You can learn more here.

The role involves:
- Mapping cognitive capacities to compare AI, human, and animal minds
- Investigating key faculties like attention, sensory processing, and learning
- Synthesising insights from psychology, neuroscience, cognitive ethology, and machine learning
- Designing tests to assess cognitive faculties across different types of minds
The ideal candidate will have:
- Ability to quickly develop an understanding of existing academic research literature in unfamiliar areas of study, evaluating the strength of evidence, risk of bias, and generalizability while drawing conclusions.
- Willingness to prepare reports that document existing research, including methods, data, analysis, and conclusions.
- Capacity to provide detailed reviews of other team members’ work, especially where it may overlap with areas of personal expertise.
- A track record of completing excellent research, including quantitative analysis where appropriate.

Lucius Caviola

Senior Research Fellow,
GPI (Oxford)

Conducting Studies To Assess Views About Digital Minds

What are people’s views on AI sentience and rights? I aim to conduct psychological online studies to explore this question. For reference, see here or here. Experience with online survey tools (ideally Qualtrics and Prolific) and data analysis (ideally in R) is required. The work includes setting up surveys, collecting data, analyzing results, and writing a report.

- Significant research experience.
- A bachelor's degree or higher in a field relevant to the research, e.g. psychology, economics, etc.
- Proficiency in statistical analyses.
- Interest in AI and digital minds.
- Prepared to work autonomously with minimal guidance.
The timeline is very uncertain and depends on how many studies we will run together.

Andreas Mogensen

Senior Research Fellow,
GPI (Oxford)

The Potential For Moral Standing In AI Systems

Re-thinking the basis of moral standing: Animal minds arguably bundle together a range of psychological traits that are in principle dissociable, such as agency, consciousness, emotion, and hedonic valence. Minds that run on inorganic computational substrates might pull apart traits like these. We therefore face an acute need to carefully reflect on the kind of mental states that ground moral standing and their relationship to phenomenal consciousness.

Developing computational indicators of affect: We aren't currently well-placed to characterize the physical basis of affect and/or affective experiences in ways that aren't tethered to implementational details of animal neuroanatomy and so can be applied to AI systems. I'm particularly interested in the extent to which the popularity of broadly somatic theories of affect among psychologists and neuroscientists might challenge our ability to attribute affective states to disembodied AI systems.

Developing protocols governing risks of mistreatment for AI systems: These protocols would identify key properties of concern, provide guidelines for determining the presence or absence of these properties in AI systems, and propose ethical principles for monitoring and responding to evidence of particular indicator properties in light of empirical and moral uncertainty.

I'd be interested to hear from any faculty, postdocs, or PhD students who might be interested in collaborating or co-authoring papers related to these topics.

Leonard Dung

Postdoctoral Researcher,
Ruhr-University Bochum

Philosophical Research On AI Moral Patiency and Safety

Leonard is a philosopher working on AI moral patiency and AI existential risk. Below is a list of projects he would be interested in collaborating on.

Reviewing potential empirical evidence on AI power-seeking and instrumental convergence
Considerations from theories of rationality on whether superintelligent AI systems will resist having their goals changed or not.
Assessing risks from misaligned highly capable AI under weaker assumptions of instrumentally convergent behavior
Existential risks from misuse of highly capable AI
Qualitative arguments for long/short AGI timelines
Arguments for/against computational functionalism
Arguments according to which views that count AI systems as moral patients (or having mental states) overgeneralize because they also apply to, e.g., companies
Computational models of emotions and what it would take for AI to have emotion
Assuming some AI systems are moral patients: How could we measure what increases/decreases their wellbeing?
Assuming some AI systems have welfare: How could we know whether their lives are worth living?

Participants can also suggest their own projects (see Leonard's website for his research interests).

The goal for the fellow is to build expertise by distilling insights from the respective literature. Ideally, this leads to a (possibly joint) paper on this basis (in academic philosophy or, e.g., on the EA Forum).

Participants should have some background in a relevant area, such as philosophy and AI, though backgrounds in other areas (such as policy or cognitive science) may count depending on applicants' project interests.
Participants can suggest their own projects, and should demonstrate their relevant skills.
- Month 1: The fellow(s) get an overview of the relevant literature and aim to build an initial sense of what crucial considerations there are.
- Month 2: The fellow(s) aim to get a deeper sense of what the crucial considerations, ideas, and arguments in the relevant literature are. The fellow(s) develop an initial outline of a paper on a relevant topic.
- Month 3: The fellow(s) begin writing the paper.
All these steps should occur in close collaboration with the project lead.
Subsequently: The fellow(s) continue working on the paper, either independently or in continued collaboration with the project lead (depending on the interests of the fellow(s) and the project lead).

Suryansh, a FIG co-founder, presenting his research at the Spring 2024 Research Residency.


Future Impact Group
