Philosophy for Safe AI projects apply concepts from academic philosophy to inform the development of safe AI.

FIG helps you build career capital. You can spend 5-10 hours a week working on foundational philosophical issues that can improve technical AI safety and mitigate catastrophic risks.

Our project leads are looking for postgraduate students across multiple fields (including computer science and philosophy), people with experience in machine learning, decision and game theory specialists, and well-read generalists with a track record of high-quality written work.

Scroll down to learn more. On this page, we list our focus areas, project leads, and open projects.

Applications for the Winter 2025 FIG Fellowship will open soon!

Register your interest here!

Focus Areas

In the next few months, we will work on:

Philosophical Fundamentals of AI Safety: projects in decision theory, AI macro-strategy, and conceptually guided experiments in machine learning.


AI Sentience: surveys of expert opinion, literature reviews, and applying insights from philosophy of mind to models of consciousness that could include artificial agents.

Project Leads

Projects

Philosophical Fundamentals of AI Safety

Philosophy gives us tools to understand advanced AI systems as artificial agents, and to shape their intentions to match our own.

Elliott Thornley

Research Fellow,
GPI (Oxford)

Projects In Constructive Decision Theory

In constructive decision theory, we use ideas from decision theory to design artificial agents. Elliott can supervise projects on many topics in constructive decision theory. Examples include:

  • Distinguishing reward functions and utility functions.
  • Assessing the prospects of keeping agents under control by training them to be impatient.
  • 'Managing the news' as a problem for corrigibility proposals.
  • Designing methods to control agents' credences.
  • Designing methods to detect scheming.
  • Assessing the ethics of training agents to allow shutdown.
  • Investigating whether corrigibility is a better target than full alignment.

FIG Fellows will research their chosen topic and write a blog post or paper explaining their findings.

Co-lead (Agentic Inequality):

Iason Gabriel

Senior Staff Research Scientist,
Google DeepMind

Distinguishing Between Threats And Offers

Threats and offers are fundamental to strategic decision-making scenarios, ranging from casual agreements between neighbours to trade deals between states. Despite this, a formal theory of threats and offers, or more generally of coercive and cooperative proposals, is lacking. This project builds on some initial theoretical work (based on causal games) and explores applications to AI agents, including their ability to detect and appropriately respond to threats and offers.

The aim will be to produce a paper to be submitted to a relevant academic AI conference. Please note that much of the theory work on this project has been done, and what remains is to validate that theory with some carefully designed experiments. Potential applicants therefore ought to be comfortable with at least one of multi-agent reinforcement learning and large language models, and ideally both. The existing project has two other collaborators – Jesse Clifton and Allan Dafoe – though fellows would almost entirely be working with me instead of Jesse and Allan.

Towards A Formal Definition Of Cooperative Intelligence

Cooperative intelligence characterises the ability of an actor to work well with others, and will be critically important for AI agents to possess as we enter a world in which they will increasingly come into contact with one another (and with many humans). The first step to ensuring this is to be able to measure this quantity in order to evaluate agents. This project will provide the philosophical and game-theoretic foundation for future evaluations by introducing a formal definition of cooperative intelligence and showing how it relates to many other past efforts in this vein.

Essentially, the plan will be to produce an academic paper or technical report in the style of the well-known Legg-Hutter definition of so-called 'universal intelligence'. Following that model, it will be critical to explain how the proposed definition (which I already have) maps onto earlier efforts from various literatures, including game theory, philosophy, politics/IR, psychology, evolutionary theory, etc. The role of the fellow will be to survey and summarise earlier definitions, and connect them to our 'universal' definition. The fellow would then be one of the primary authors of the eventual report, along with me and a number of senior co-authors.
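For reference, the Legg-Hutter measure mentioned above is usually stated as a complexity-weighted sum of an agent's expected performance across environments. The block below restates that published definition only; the project's own definition of cooperative intelligence is not given on this page, and nothing here should be read as that definition.

```latex
% Legg-Hutter universal intelligence of an agent (policy) \pi:
%   E         : the set of computable, reward-bounded environments
%   K(\mu)    : the Kolmogorov complexity of environment \mu
%   V_\mu^\pi : the expected total reward of \pi in \mu
\Upsilon(\pi) \;:=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V_{\mu}^{\pi}
```

Simpler environments (lower K) receive more weight, so an agent scores highly by performing well across many environments, with the simplest ones counting most.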

Differential Progress On Cooperative AI

Cooperative capabilities can be dual-use. For example, the ability to understand others' preferences can be used either to compromise fairly with someone, or to manipulate and extort them. However, there are other capabilities (such as non-binding communication, which in the worst case can be ignored) that appear to be less dual-use. This exploratory project would seek to formulate conditions for the 'dual-use-ness' of cooperative capabilities, and to evaluate different capabilities with respect to these conditions. The hope is that this would also inform future research priorities.

Ideally, this project would result in a paper or a technical report, but it is sufficiently conceptual and exploratory that I am not 100% confident what the results will be and whether they will suit this format. At the very least, however, a high-quality blog post should be produced. A fellow working on this project must be comfortable with uncertainty and be able to derive formal conditions and models based on abstract, confusing, and messy concepts. I will be able to provide some guidance in this regard, but I expect the project would be slow and unenjoyable for someone without these aptitudes.

Agentic Inequality

Already, the inequitable distribution of AI capabilities and other digital technologies increases inequality. Once individuals begin to delegate more of their decision-making and actions to AI agents, these inequalities may be further entrenched based on the strength or number of the agents that those individuals have access to. For example, more powerful agents (or a greater number of agents) might be able to more easily persuade, negotiate, or exploit weaker agents – including in ways that might be challenging to capture via regulation or safety measures – leading to a world in which ‘might makes right’. This project will explore the ethical and regulatory implications of this issue.

Ideally, this project will result in a paper to be submitted to a suitable academic conference or journal, alongside a corresponding blog post. The project is currently relatively under-specified and therefore is best-suited to those with prior research experience.

Chi Nguyen

Independent Researcher

With co-leads:

Caspar Oesterheld
Co-director & PhD Candidate,
FOCAL at CMU

Emery Cooper
Research Associate, FOCAL at CMU

Training AIs To Aid Decision Theory And Acausal Research

Fellows might be mentored by me, Caspar Oesterheld (CMU), or Emery Cooper (CMU).

Rough idea:

  • We would like AIs to handle acausal interactions in certain ways, e.g. if aligned, we want them to be acausally competent.
  • One approach is to directly train the AIs to behave in certain ways, e.g. follow our favourite decision theory.
  • Another approach is to train AIs at reasoning about decision theory and acausal interactions such that they can do the kinds of acausal research we are doing, e.g. research which acausal behavior is best, how to build systems that engage in this behavior etc.
  • This research project takes the latter approach, i.e. making AIs good at decision-theoretic reasoning.

A potential first project we might pursue:

  1. Build a dataset of critiques of philosophy arguments, especially in decision theory.
  2. Most of these critiques are intentionally flawed and only some are good.
  3. The dataset includes ground-truth labels of whether an argument is good or bad (and perhaps other descriptors, e.g. invalid vs. bad assumptions etc.).
  4. Use the dataset to train an AI to classify critiques as good or bad etc.
  5. In the future, this classifier could be used in various ways, e.g. to train a critique-generating model.
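Steps 1 to 4 above could be prototyped roughly as follows. This is a minimal sketch only: the record fields (`argument`, `critique`, `is_good`, `flaw_type`), the toy examples, and the choice of a bag-of-words naive Bayes baseline are all illustrative assumptions, not the project's actual dataset format or model.

```python
from __future__ import annotations

import math
from collections import Counter
from dataclasses import dataclass
from typing import Optional

PUNCTUATION = ".,;:!?()'\""


@dataclass(frozen=True)
class LabeledCritique:
    """One dataset row (hypothetical schema)."""
    argument: str                     # the argument being critiqued
    critique: str                     # the critique text
    is_good: bool                     # ground-truth label
    flaw_type: Optional[str] = None   # optional descriptor for bad critiques


def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


class NaiveBayesBaseline:
    """Bag-of-words naive Bayes with add-one smoothing (a stand-in model)."""

    def fit(self, data: list[LabeledCritique]) -> "NaiveBayesBaseline":
        self.word_counts = {True: Counter(), False: Counter()}
        self.label_counts = Counter()
        for row in data:
            self.label_counts[row.is_good] += 1
            self.word_counts[row.is_good].update(tokenize(row.critique))
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def log_score(self, text: str, label: bool) -> float:
        total = sum(self.label_counts.values())
        # Smoothed class prior, so a label unseen in training never gives log(0).
        score = math.log((self.label_counts[label] + 1) / (total + 2))
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for word in tokenize(text):
            if word in self.vocab:  # words never seen in training are ignored
                score += math.log((self.word_counts[label][word] + 1) / denom)
        return score

    def predict(self, text: str) -> bool:
        """Return True if the critique is classified as good."""
        return self.log_score(text, True) > self.log_score(text, False)


# Toy data, invented purely for illustration.
toy_data = [
    LabeledCritique("arg A", "The conclusion does not follow because premise two equivocates on 'cause'.", True),
    LabeledCritique("arg A", "This is wrong because everyone knows it is wrong.", False, "circular"),
    LabeledCritique("arg B", "The argument equivocates between evidential and causal dependence.", True),
    LabeledCritique("arg B", "The author is wrong because the author is obviously confused.", False, "ad hominem"),
]

classifier = NaiveBayesBaseline().fit(toy_data)
```

A word-count baseline like this cannot judge whether a critique is actually sound; in practice one would presumably fine-tune or prompt a language model instead. The sketch only shows how labelled critiques flow from dataset to classifier.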

The project is at an extremely early stage, so the exact direction is unclear. Fellows could generate and evaluate arguments or help with coding and scaffolding.

Dan Hendrycks

Executive Director,
Center for AI Safety

Technical Safety Research With The Center For AI Safety

This isn't for a specific project. Rather, CAIS has a consistent stream of AI safety research. (See here).

We're looking for people who have previously done ML research, ideally co-authoring a paper at a top conference. FIG Fellows would work with a research lead on technical AI safety projects.

Philosophical Foundations Of Human-AI Value Alignment

AI safety research has focused heavily on AI value alignment. But it is not clear what "values" are in the context of AI alignment. This research draws on the philosophy, ethics, and economics literature and asks: what is the fundamental nature or characterization of human values, and how does this understanding constrain or inform potential approaches to implementing values in artificial systems?

METR

Various Researchers

Developing Evaluations For AI R&D Capabilities

It’s hard to bound the risk from systems that can substantially improve themselves. For instance, AI systems that can automate AI engineering and research might start an explosion in AI capabilities – where new dangerous capabilities emerge far more quickly than humanity could respond with protective measures. We think it’s critical to have robust tests to predict when this might occur.

What are METR’s plans? METR has recently started developing threshold evaluations that can be run to determine whether AI R&D capabilities warrant protective measures such as information security that is resilient to state-actor attacks. Over time, we’d like to build AI R&D evaluations that smoothly track progress, so evaluators aren’t caught by surprise. Having researchers and engineers with substantial ML R&D experience themselves is the main bottleneck to progress on these evaluations.

Why build AI R&D evaluations at METR? METR is a non-profit organization that collaborates with government agencies and AI companies to understand the risks posed by AI models. As a third party, METR can provide independent input to regulators. At the same time, METR offers flexibility and compensation competitive with Bay Area tech roles, excluding equity.

AI Sentience

Insights from biology, psychology, computer science, the philosophy of mind, and other disciplines can help us understand if artificial agents can have valenced, subjective experiences and determine how to respond wisely.

Patrick Butlin

Postdoctoral Research Fellow,
GPI (Oxford)

Agency In Philosophical Accounts Of Moral Status

A being's moral status is the set of normative features that govern how we should treat it. For example, horses and humans differ in moral status because, plausibly, there are things it is morally permissible to do to a horse that it would be wrong to do to a human. The aim of this project is to find insights in the philosophical literature that can help us to understand the potential moral status of future AI systems, with a focus on agency. That is, what does the existing literature say about how features associated with agency, like having desires or emotions, or being capable of planning or reflecting on one's values, affect moral status? Answering this question will help us to identify features of possible AI systems that could make them moral patients, or otherwise influence their moral status, and thus help us to develop policies to protect them.

Fellows chosen for this project will be asked to explore the literature and write summaries of their findings. This will feed into Patrick's research and exceptional work could lead to a co-authored paper. It is likely that there are too many relevant ideas in the literature to summarise in the course of one project, so fellows will use their judgment to identify the most promising angles to pursue.

Devising Experiments On AI Preferences

The aim of this project is to devise experiments to examine the preferences of existing frontier AI systems, including LLMs and associated systems. Experiments could test either whether these systems have robust and stable preferences, or what they prefer and how their preferences are affected by training and context. They should be experiments that could realistically be run now by small teams with limited budgets, and that have strong potential to extend existing knowledge. This work will help to establish AI welfare as a topic for empirical research while also being relevant to AI safety.

Proposals will contribute to a paper surveying open empirical research questions in AI welfare, and promising proposals may lead to assembled teams and funding for further experiments.

Brad Saad

Senior Research Fellow,
GPI (Oxford)

AI Moral Patiency

Distillation projects

Various literatures are relevant to AI moral patiency, but they are not directly concerned with it. Distilling relevant insights from one of these literatures is an approach to advancing research on AI moral patiency. This approach is a particularly tractable one for at once building expertise and making research contributions.

Macrostrategy projects

This barely-developed area is concerned with understanding which factors are crucial to how well the future goes for AI moral patients on a large scale, and with how to influence those factors. Projects could address specific stances taken by major actors (states, AI companies, legal systems, advocates, etc.) that might influence how AI moral patients are treated, or how AI safety mechanisms and AI moral patients might interact.

Interface projects

Some possibilities:

  • Create an interface that allows people to plug in their own estimates for parameters that bear on an important sort of AI-involving risk and generates an overall risk estimate along with a visualization (e.g. a Sankey diagram) of how different parameters feed into it.
  • Create a dashboard that provides an overview of key actors’ policies or lack thereof concerning risk to AI moral patients. (Compare: this.) You wouldn’t have to commit to maintaining the dashboard in order to do this project.
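The calculation behind the first interface idea could be sketched as below, assuming the risk is modelled as a chain of conditional probabilities. The `Stage` type, the `overall_risk` function, the stage names, and the numbers are hypothetical placeholders, not estimates or a model endorsed by the project lead. The per-stage flows it returns are the kind of data a Sankey diagram would draw.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One user-supplied parameter in the chain (hypothetical model)."""
    name: str
    probability: float  # P(this stage occurs | all earlier stages occurred)


def overall_risk(stages):
    """Multiply conditional probabilities along the chain.

    Returns (overall_probability, flows), where each flow is a tuple
    (stage_name, mass_that_continues, mass_that_exits_here).
    """
    remaining = 1.0
    flows = []
    for stage in stages:
        if not 0.0 <= stage.probability <= 1.0:
            raise ValueError(f"{stage.name}: probability must be in [0, 1]")
        continues = remaining * stage.probability
        flows.append((stage.name, continues, remaining - continues))
        remaining = continues
    return remaining, flows


# Placeholder inputs a user might type into the interface.
example_stages = [
    Stage("Systems with welfare-relevant properties are built", 0.5),
    Stage("Such systems are deployed at scale", 0.5),
    Stage("Deployment conditions are bad for them", 0.2),
]

risk, flows = overall_risk(example_stages)
for name, continues, exits in flows:
    print(f"{name}: continues={continues:.3f}, exits={exits:.3f}")
print(f"Overall risk estimate: {risk:.3f}")
```

A simple chain is only one possible model structure. Parameters that interact or branch would need a richer representation, such as a probability tree or a Bayesian network.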

See here for more information regarding Brad's projects.

Derek Shiller

Senior Researcher,
Rethink Priorities

The Digital Consciousness Model

If AI systems were conscious, they might deserve moral consideration. Some leading AI researchers and philosophers believe current or near-future systems could exhibit conscious experiences. Rethink Priorities’ Worldview Investigations Team is developing a model to estimate the probability of AI consciousness, aiming to provide decision-makers with a framework to balance potential AI interests with those of humans, the public, and other welfare subjects.

This project is ambitious, speculative, and controversial. However, constructing a model forces vague ideas into clearer, testable formulations, advancing the debate and enabling more precise alternative proposals. You can learn more here.

Lucius Caviola

Senior Research Fellow,
GPI (Oxford)

Conducting Studies To Assess Views About Digital Minds

What are people’s views on AI sentience and rights? I aim to conduct psychological online studies to explore this question. For reference, see here or here. Experience with online survey tools (ideally Qualtrics and Prolific) and data analysis (ideally in R) is required. The work includes setting up surveys, collecting data, analyzing results, and writing a report.

Andreas Mogensen

Senior Research Fellow,
GPI (Oxford)

The Potential For Moral Standing In AI Systems

Re-thinking the basis of moral standing: Animal minds arguably bundle together a range of psychological traits that are in principle dissociable, such as agency, consciousness, emotion, and hedonic valence. Minds that run on inorganic computational substrates might pull apart traits like these. We therefore face an acute need to carefully reflect on the kind of mental states that ground moral standing and their relationship to phenomenal consciousness.

Developing computational indicators of affect: We aren't currently well-placed to characterize the physical basis of affect and/or affective experiences in ways that aren't tethered to implementational details of animal neuroanatomy and so can be applied to AI systems. I'm particularly interested in the extent to which the popularity of broadly somatic theories of affect among psychologists and neuroscientists might challenge our ability to attribute affective states to disembodied AI systems.

Developing protocols governing risks of mistreatment for AI systems: These protocols would identify key properties of concern, provide guidelines for determining the presence or absence of these properties in AI systems, and propose ethical principles for monitoring and responding to evidence of particular indicator properties in light of empirical and moral uncertainty.

Leonard Dung

Postdoctoral Researcher,
Ruhr-University Bochum

Philosophical Research On AI Moral Patiency And Safety

Leonard is a philosopher working on AI moral patiency and AI existential risk. Below is a list of projects he would be interested in collaborating on.

  • Reviewing potential empirical evidence on AI power-seeking and instrumental convergence
  • Considerations from theories of rationality on whether superintelligent AI systems will resist having their goals changed
  • Assessing risks from misaligned highly capable AI under weaker assumptions of instrumentally convergent behavior
  • Existential risks from misuse of highly capable AI
  • Qualitative arguments for long/short AGI timelines
  • Arguments for/against computational functionalism
  • Arguments according to which views that count AI systems as moral patients (or having mental states) overgeneralize because they also apply to, e.g., companies
  • Computational models of emotions and what it would take for AI to have emotion
  • Assuming some AI systems are moral patients: How could we measure what increases/decreases their wellbeing?
  • Assuming some AI systems have welfare: How could we know whether their lives are worth living?

Participants can also suggest their own projects (see Leonard's website for his research interests).

The goal for the fellow is to build expertise by distilling insights from the respective literature. Ideally, this leads to a (possibly joint) paper on this basis (in an academic philosophy venue or, e.g., on the EA Forum).

Suryansh, a FIG co-founder, presenting his research at the Spring 2024 Research Residency.