Philosophy for Safe AI projects apply concepts from academic philosophy to inform the development of safe AI.

FIG helps you build career capital. You can spend 8+ hours a week working on foundational philosophical issues that can improve technical AI safety and mitigate catastrophic risks.

Our project leads are looking for postgraduate students across multiple fields (including computer science and philosophy), people with experience in machine learning, decision and game theory specialists, and well-read generalists with a track record of high-quality written work.

Scroll down to learn more. On this page, we list our focus areas, project leads, and open projects.

Applications for the Winter 2025 FIG Fellowship are now closed.

Focus Areas

In the next few months, we will work on:

Technical AI Safety: projects in LLM reward-seeking behaviour, definitions of cooperative intelligence, and LLM interpretability. 

Philosophical Fundamentals of AI Safety: projects in conceptual approaches to coexistence with advanced AI, and how AI agents make decisions under uncertainty.

Project Leads

Technical AI Safety

Projects in LLM reward-seeking behaviour, definitions of cooperative intelligence, and LLM interpretability. 

Anthropic

Anthropic Fellowship: General projects in “classic alignment”

Alongside a great selection of FIG projects, you can apply to be considered by a variety of project leads from Anthropic as part of the upcoming Anthropic Fellowship, starting January 2026! This is a flexible pool for highly capable Fellows to support cutting-edge research in “classic” AI alignment: the cluster of technical safety problems that form the core of the field. Fellows may be matched to one of several live projects across Anthropic, in the Bay Area, Canada, or London. Projects will span topics such as adversarial robustness and AI control; scalable oversight; model organisms of misalignment; and mechanistic interpretability. Here are some examples of work from a previous cohort:

  • Open-sourcing methods and tools for tracing circuits within language models, to help interpret their internals.
  • Work demonstrating “subliminal learning” – that language models can transmit their traits to other models, even in what appears to be meaningless data.
  • Finding cases of inverse scaling in test-time compute – where more and more reasoning leads to worse and worse outcomes.

The specific match will be determined through discussion with project leads, ensuring alignment between Fellows' interests and research needs. Projects are scoped for a full-time, three-month commitment and typically lead to concrete outputs such as benchmark evaluations, technical reports, or codebases.

Elliott Thornley

Postdoctoral Associate, MIT

Deliberately training reward-seekers

We’ll deliberately train LLMs to seek rewards. We’ll test how far their reward-seeking generalizes, and we’ll see if we can use their reward-seeking to elicit their capabilities, stop them sandbagging, and keep them under control. We’ll write up our results in a paper and submit it to top ML conferences like NeurIPS.

Lewis Hammond

Co-Director, Cooperative AI Foundation

Towards A Formal Definition Of Cooperative Intelligence

Cooperative intelligence characterises the ability of an actor to work well with others, and will be critically important for AI agents to possess as we enter a world in which they will increasingly come into contact with one another (and with many humans). The first step to ensuring this is to be able to measure this quantity in order to evaluate agents. This project will provide the philosophical and game-theoretic foundation for future evaluations by introducing a formal definition of cooperative intelligence and showing how it relates to many past efforts in this vein.

Adoption Barriers to AI for Human Cooperation

I am producing a roadmap on AI-based technologies that can be used to help humans cooperate, ranging from small-scale negotiations to international diplomacy. Part of the challenge here is in understanding the technologies and their potential, but an even greater challenge (for me, at least) is understanding the barriers to deploying such technologies in the real world. This project would investigate those barriers in order to help chart a path via which advanced AI might help solve some of the world's most important coordination and cooperation challenges. In practice, this would likely include a mixture of literature-based and interview-based research.

Eleni Angelou

PhD Candidate, CUNY Graduate Center

Research directions in the theory of interpretability

Research in the theory of interpretability relies on tracing conceptual connections in hypothesizing, clarifying underlying assumptions, and testing the fit of different theoretical frameworks for explaining model behaviors. Some key problems of interpretability can be translated into problems that have previously appeared in human cognitive sciences, which facilitates making progress on them. The aim of this project is to identify and closely examine such problems, and propose constructive directions for theoretical and empirical work.

Examples of problems to work on include but are not limited to:

  • What experiments can we do to test whether safety-relevant behaviors (e.g., strategic deception, hidden reasoning, world-modeling) can be reduced to other, more basic properties? Is there a way to test for "tacit representations"? What are the implications of finding such representations for interpretability in particular and AI safety more generally?
  • What kinds of explanation should we be looking for when studying the causal structure of models? Are there any patterns in currently available explanations?
  • What predictions can we make about concepts as natural kinds and phenomena of mono- and poly-semanticity, considering evidence in favor of a strong Platonic representation hypothesis?

Fellows will co-author one blog post that would ideally become a published paper.

Philosophical Fundamentals of AI Safety

Projects in conceptual approaches to coexistence with advanced AI, and how AI agents make decisions under uncertainty.

Contributing to a book on the endgame for AI safety

This book project looks into the concept of an endgame for AI safety, exploring how to think about AI risks, the implications of introducing advanced AI into the world, and the potential solutions to these risks. It will particularly focus on risk areas like loss of control, economic disruption, democracy, and education (particularly de-skilling). The content aims to be interdisciplinary, with a strong focus on philosophical reflection on how society should think about the risks and benefits of building advanced AI.

Risto Uuk

Head of EU Policy and Research, Future of Life Institute

Projects in AI Futures and Macrostrategy

This project will explore broad questions about how to navigate the transition to advanced AI: how power concentrates, how institutions adapt or fail, and what flourishing futures could look like. This work will draw on history, empirical analysis, and conceptual “deconfusion” to clarify strategic scenarios and positive outcomes, and would extend ideas and work from Forethought’s recent paper by Will MacAskill and Fin Moorhouse, Preparing for the Intelligence Explosion.

Possible research directions:

  • Mapping how existing institutions could become irrelevant under transformative AI, and what good replacement processes might look like.
  • Studying allocation decisions for the moon, Antarctica, and the deep sea as case studies for global commons governance.
  • Reviewing the causes of democracy and applying these insights to AI futures.
  • Deconfusing “lock-in”: what it means, how it might emerge, and which forms matter most.
  • Clarifying “multipolarity”: which outcomes are stable, likely, or desirable.
  • Tracing how past technologies shifted concentrations of power, and implications for AI.
  • Surveying positive visions for post-AGI governance, identifying gaps and disagreements.

Applicants should also feel free to pitch related ideas on these and similar topics.

Rose Hadshar

Researcher, Forethought

Hayley Clatterbuck

Senior Researcher, Rethink Priorities

Evaluating and changing AI propensities toward risk

The first component of this project applies methods from behavioral economics to understand LLM decision-making under uncertainty and to develop benchmark tests for AI risk attitudes. Fellows will undertake systematic examinations of AI choices under uncertainty, involving scenarios with both hypothetical and actual payoffs. The second component explores various methods (e.g. finetuning, RLHF, prompting) to change the risk propensities of LLMs, making them more or less risk averse as desired.

Investigating the Structural Risks of AI Interests

This project investigates the following question: do the formal structures of an AI's interests, desires, or goals create risks independent of their content? For example, some argue that any goal-directed system inherently develops a self-regarding interest in its own preservation, which could lead to unintended consequences even if its primary goal is beneficial to humanity. The project aims to explore and catalog these potential "structural risks."

The expected output for the FIG Fellow is a comprehensive research report that maps existing work on this topic within the AI safety, alignment, and philosophy literature. This report will serve as the foundation for a co-authored blog post and, ultimately, a peer-reviewed paper. Fellows will conduct literature reviews, synthesize arguments, identify key open questions, and contribute original ideas about potential structural risks and the conditions that might give rise to them.

Ben Henke

Associate Director, London AI and Humanity Project

Suryansh, a FIG co-founder, presenting his research at the Spring 2024 Research Residency.