Philosophy for Safe AI projects apply concepts from academic philosophy to inform the development of safe AI.

FIG helps you build career capital. You can spend 8+ hours a week working on foundational philosophical issues that can improve technical AI safety and mitigate catastrophic risks.

Our project leads are looking for postgraduate students across multiple fields (including computer science and philosophy), people with experience in machine learning, decision and game theory specialists, and well-read generalists with a track record of high-quality written work.

Scroll down to learn more. On this page, we list our focus areas, project leads, and open projects.

Applications for the Winter 2025 FIG Fellowship are now closed.

Focus Areas

In the next few months, we will work on:

Technical AI Safety: projects in LLM reward-seeking behaviour, definitions of cooperative intelligence, and LLM interpretability.

Philosophical Fundamentals of AI Safety: projects in conceptual approaches to coexistence with advanced AI, and how AI agents make decisions under uncertainty.

Project Leads

- Anthropic
  - Anthropic Fellowship: General projects in “classic alignment”
- Elliott Thornley
  - Deliberately training reward-seekers
- Lewis Hammond
  - Towards A Formal Definition Of Cooperative Intelligence
- Eleni Angelou
  - Research directions in the theory of interpretability
- Risto Uuk
  - Contributing to a book on the endgame for AI safety
- Rose Hadshar
  - Projects in AI Futures and Macrostrategy
- Hayley Clatterbuck
  - Evaluating and changing AI propensities towards risk
- Ben Henke
  - Investigating the Structural Risks of AI Interests

Technical AI Safety

Projects in LLM reward-seeking behaviour, definitions of cooperative intelligence, and LLM interpretability.

Anthropic

Anthropic Fellowship: General projects in “classic alignment”

Alongside a great selection of FIG projects, you can also apply to be considered by a variety of project leads from Anthropic, as part of the upcoming Anthropic Fellowship, starting January 2026! This is a flexible pool for highly capable Fellows to support cutting-edge research in “classic” AI alignment: the cluster of technical safety problems that form the core of the field. Fellows may be matched to one of several live projects across Anthropic, in the Bay Area, Canada or London. Projects will span topics such as adversarial robustness and AI control; scalable oversight; model organisms of misalignment; and mechanistic interpretability. Here are some examples of work from a previous cohort:

- Open-sourcing methods and tools for tracing circuits within language models, to help interpret their internals.
- Work demonstrating “subliminal learning” – that language models can transmit their traits to other models, even in what appears to be meaningless data.
- Finding cases of inverse scaling in test-time compute – where more and more reasoning leads to worse and worse outcomes.

The specific match will be determined through discussion with project leads, ensuring alignment between fellow interests and research needs. Projects are scoped for a full-time, three-month commitment and typically lead to concrete outputs such as benchmark evaluations, technical reports, or codebases.

This opportunity is well-suited to Fellows who:
- Have strong technical foundations in computer science, machine learning, or mathematics (undergraduate, graduate, or self-taught).
- Are motivated to work on core alignment challenges, including reward modelling, interpretability, robustness, and oversight.
- Are comfortable with empirical research in ML (PyTorch, JAX, or similar) and/or conceptual analysis of alignment failures.
- Value flexible, collaborative work with leading AI safety researchers, and are keen to develop high-quality work samples.

  
    Elliott Thornley
Postdoctoral Associate, MIT

Deliberately training reward-seekers

We’ll deliberately train LLMs to seek rewards. We’ll test how far their reward-seeking generalizes, and we’ll see if we can use their reward-seeking to elicit their capabilities, stop them sandbagging, and keep them under control. We'll write up our results in a paper and submit it to top ML conferences like NeurIPS.

You'll be the lead author on the paper and work largely independently. You'll design and run the experiments. We'll meet weekly to discuss your work. You should have experience finetuning LLMs and running experiments. Ideally, you'll also have previously published work finetuning and testing LLMs.
We aim to submit the paper to ML conferences within 6 months (and possibly within 3 if we find a very capable and dedicated candidate).

  
    Lewis Hammond
Co-Director, Cooperative AI Foundation

Towards A Formal Definition Of Cooperative Intelligence

Cooperative intelligence characterises the ability of an actor to work well with others, and will be critically important for AI agents to possess as we enter a world in which they will increasingly come into contact with one another (and with many humans). The first step to ensuring this is to be able to measure this quantity in order to evaluate agents. This project will provide the philosophical and game-theoretic foundation for future evaluations by introducing a formal definition of cooperative intelligence and showing how it relates to many other past efforts in this vein.

- Ideally postgraduate or above (talented late-stage undergraduates are also welcome to apply)
- Prior experience of conducting literature reviews
- Background in philosophy, economics, politics, psychology, or similar (in particular, must have a reasonable knowledge of game theory), with CS/AI knowledge a major plus
- Relatively autonomous and good at time management
Essentially, the plan will be to produce an academic paper or technical report in the style of the well-known Legg-Hutter definition of so-called 'universal intelligence'. Following that model, it will be critical to explain how the proposed definition (which I already have) maps onto earlier efforts from various literatures, including game theory, philosophy, politics/IR, psychology, evolutionary theory, etc. The role of the fellow will be to survey and summarise earlier definitions, and connect them to our 'universal' definition. Please note that compared to some of my other projects, this is more of a research assistant role.

Adoption Barriers to AI for Human Cooperation

I am producing a roadmap on AI-based technologies that can be used to help humans cooperate, ranging from small-scale negotiations to international diplomacy. Part of the challenge here is in understanding the technologies and their potential, but an even greater challenge (for me, at least) is understanding the barriers to deploying such technologies in the real world. This project would investigate those barriers in order to help chart a path via which advanced AI might help solve some of the world's most important coordination and cooperation challenges. In practice, this would likely include a mixture of literature-based and interview-based research.

- Ideally postgraduate or above (talented late-stage undergraduates are also welcome to apply)
- Prior experience of conducting both literature reviews and interviews is a plus
- Background in law, politics, or similar, with CS/AI knowledge a plus
- Relatively autonomous and good at time management
I am hoping to begin this project in October, and it would be great to have things mostly wrapped up towards the end of the year, but it may stretch into January. Please note that compared to some of my other projects, this is more of a research assistant role.

  
    Eleni Angelou
PhD Candidate, CUNY Graduate Center

Research directions in the theory of interpretability

Research in the theory of interpretability relies on tracing conceptual connections in hypothesizing, clarifying underlying assumptions, and testing the fit of different theoretical frameworks for explaining model behaviors. Some key problems of interpretability can be translated into problems that have previously appeared in human cognitive sciences, which facilitates making progress on them. The aim of this project is to identify and closely examine such problems, and propose constructive directions for theoretical and empirical work.

Examples of problems to work on include but are not limited to:

- What experiments can we do to test whether safety-relevant behaviors (e.g., strategic deception, hidden reasoning, world-modeling, etc.) can be reduced to other, more basic properties? Is there a way to test for "tacit representations"? What are the implications of finding such representations for interpretability in particular and AI safety more generally?
- What kinds of explanation should we be looking for when studying the causal structure of models? Are there any patterns in currently available explanations?
- What predictions can we make about concepts as natural kinds and phenomena of mono- and poly-semanticity, considering evidence in favor of a strong Platonic representation hypothesis?

Fellows will co-author one blog post that would ideally become a published paper.

- Ideal candidate: background in both technical AI safety and philosophy or cognitive science
- Post-graduate or above preferred
- Experience designing and conducting experiments with LLMs
- Good understanding of theoretical frameworks, such as Dennett's Intentional Stance and Marr's three levels, and comfort connecting abstract conceptions to specific model behaviors
We will aim to co-author one or more blog posts that could be turned into academic papers.

Philosophical Fundamentals of AI Safety

Projects in conceptual approaches to coexistence with advanced AI, and how AI agents make decisions under uncertainty.

Contributing to a book on the endgame for AI safety

This book project looks into the concept of an endgame for AI safety, exploring how to think about AI risks, the implications of introducing advanced AI into the world, and the potential solutions to these risks. It will particularly focus on risk areas like loss of control, economic disruption, democracy, and education (particularly de-skilling). The content aims to be interdisciplinary, with a strong focus on philosophical reflection on how society should think about the risks and benefits of building advanced AI.

  
    Risto Uuk
Head of EU Policy and Research, Future of Life Institute

* Strong philosophical background with clear writing and critical thinking skills, as evidenced by previous writings and education.
* Excellent at reviewing existing literature and finding relevant facts, arguments, and ideas from there.
* Ability to carry out work independently with minimal instructions and oversight.
* Specialisation in ethics, applied epistemology, philosophy of science, decision theory, and/or AI safety is a plus.
The output of this work will be a book that can take up to 9 months to write with additional months for publication. The fellow's work will be acknowledged in the book. The amount of time spent is negotiable, but in expectation 5-10 hours for at least 12 weeks.

Projects in AI Futures and Macrostrategy

This project will explore broad questions about how to navigate the transition to advanced AI: how power concentrates, how institutions adapt or fail, and what flourishing futures could look like. This work will draw on history, empirical analysis, and conceptual “deconfusion” to clarify strategic scenarios and positive outcomes, and would extend ideas and work from Forethought’s recent paper by Will MacAskill and Fin Moorhouse, Preparing for the Intelligence Explosion.

Possible research directions:

- Mapping how existing institutions could become irrelevant under transformative AI, and what good replacement processes might look like.
- Studying allocation decisions for the moon, Antarctica, and the deep sea as case studies for global commons governance.
- Reviewing the causes of democracy and applying these insights to AI futures.
- Deconfusing “lock-in”: what it means, how it might emerge, and which forms matter most.
- Clarifying “multipolarity”: which outcomes are stable, likely, or desirable.
- Tracing how past technologies shifted concentrations of power, and implications for AI.
- Surveying positive visions for post-AGI governance, identifying gaps and disagreements.

Applicants should also feel free to pitch related ideas on these and similar topics.

  
    Rose Hadshar
Researcher, Forethought

Best suited to fellows with clear writing and independent working styles, who are truth-seeking and motivated by positive long-term futures. Stronger fit for those with historical, political, or empirical approaches than for very technical or formal philosophical projects.
Expected Commitment
1–2 hours per week of input (calls and written feedback from Rose), with scope for more if the work complements Forethought’s ongoing projects.

  
    Hayley Clatterbuck
Senior Researcher, Rethink Priorities

Evaluating and changing AI propensities toward risk

The first component applies methods from behavioral economics to understand LLM decision-making under uncertainty and to develop benchmark tests for AI risk attitudes. Fellows will undertake systematic examinations of AI choices under uncertainty, involving scenarios with both hypothetical and actual payoffs. The second component of the project explores various methods (e.g. finetuning, RLHF, prompting) to change the risk propensities of LLMs, making them more or less risk averse as desired.

For the first part of the project, we are looking for someone who can carry out behavioral experiments with LLMs and can collaborate on experimental design. Familiarity with social science methodologies and an ability to carry out statistical analyses of behavioral data would be a plus.
For the second part, fellows who are familiar with natural language prompting, fine-tuning, and/or reinforcement learning from human feedback techniques would be highly valuable. Experience with setting up multi-agent interaction environments would be a plus, though it is not required.
We aim to publish a paper on our results, co-authored with any participating FIG fellows.

Investigating the Structural Risks of AI Interests

This project investigates the following question: do the formal structures of an AI's interests, desires, or goals create risks independent of their content? For example, some argue that any goal-directed system inherently develops a self-regarding interest in its own preservation, which could lead to unintended consequences even if its primary goal is beneficial to humanity. The project aims to explore and catalog these potential "structural risks."

The expected output for the FIG Fellow is a comprehensive research report that maps existing work on this topic within the AI safety, alignment, and philosophy literature. This report will serve as the foundation for a co-authored blog post and, ultimately, a peer-reviewed paper. Fellows will conduct literature reviews, synthesize arguments, identify key open questions, and contribute original ideas about potential structural risks and the conditions that might give rise to them.

  
    Ben Henke
Associate Director, London AI and Humanity Project

I’m looking for a candidate with a strong research background and a deep familiarity with the technical AI safety and alignment landscape. The ideal candidate will be a conceptual thinker who is comfortable working with a high degree of autonomy to map out and analyze complex, interdisciplinary literature. While direct experience in AI safety is preferred, candidates from related fields (such as computer science, philosophy, or cognitive science) with a demonstrated interest in the topic are encouraged to apply. The ability to work independently and proactively generate research directions is essential.
The primary goal for the 12-week period is to co-author a blog post that outlines the key questions and findings regarding the structural risks of AI interests. This initial output will lay the groundwork for a more comprehensive academic paper, which I will begin drafting after the fellowship. The ideal fellow would be interested in continuing the collaboration on that paper.

The timeline is structured as follows:

Month 1 (Weeks 1-4): The fellow will conduct an initial survey of the AI safety and alignment literature to identify relevant work.

Month 2 (Weeks 5-8): The fellow will perform a deeper dive into the most promising areas, synthesizing their findings into a research report that will be delivered to me at the end of the month.

Month 3 (Weeks 9-12): We will work collaboratively to write a blog post based on the research and analysis conducted in the first two months.

Suryansh, a FIG co-founder, presenting his research at the Spring 2024 Research Residency.

Future Impact Group