Philosophy for Safe AI projects apply concepts from academic philosophy to inform the development of safe AI.
FIG helps you build career capital. You can spend 5-10 hours a week working on foundational philosophical issues that can improve technical AI safety and mitigate catastrophic risks.
Our project leads are looking for postgraduate students across multiple fields (including computer science and philosophy), people with experience in machine learning, decision and game theory specialists, and well-read generalists with a track record of high-quality written work.
Scroll down to learn more. On this page, we list our focus area, project leads, and open projects.
Applications for the Winter 2025 FIG Fellowship are now open!
Apply here by ???!
Focus Areas
In the next few months, we will work on:
Philosophical Fundamentals of AI Safety: projects in decision theory, AI macro-strategy, and conceptually guided experiments in machine learning.
Project Leads
Ben Henke
Risto Uuk
Eleni Angelou
Projects
Philosophical Fundamentals of AI Safety
Philosophy gives us tools to understand advanced AI systems as artificial agents, and to shape their intentions to match our own.
Ben Henke
Associate Director, London AI and Humanity Project
Investigating the Structural Risks of AI Interests
This project investigates the following question: do the formal structures of an AI's interests, desires, or goals create risks independent of their content? For example, some argue that any goal-directed system inherently develops a self-regarding interest in its own preservation, which could lead to unintended consequences even if its primary goal is beneficial to humanity. The project aims to explore and catalog these potential "structural risks."
The expected output for the FIG Fellow is a comprehensive research report that maps existing work on this topic within the AI safety, alignment, and philosophy literature. This report will serve as the foundation for a co-authored blog post and, ultimately, a peer-reviewed paper. Fellows will conduct literature reviews, synthesize arguments, identify key open questions, and contribute original ideas about potential structural risks and the conditions that might give rise to them.
-
I’m looking for a candidate with a strong research background and a deep familiarity with the technical AI safety and alignment landscape. The ideal candidate will be a conceptual thinker who is comfortable working with a high degree of autonomy to map out and analyze complex, interdisciplinary literature. While direct experience in AI safety is preferred, candidates from related fields (such as computer science, philosophy, or cognitive science) with a demonstrated interest in the topic are encouraged to apply. The ability to work independently and proactively generate research directions is essential.
-
The primary goal for the 12-week period is to co-author a blog post that outlines the key questions and findings regarding the structural risks of AI interests. This initial output will lay the groundwork for a more comprehensive academic paper, which I will begin drafting after the fellowship. The ideal fellow would be interested in continuing the collaboration on that paper.
The timeline is structured as follows:
Month 1 (Weeks 1-4): The fellow will conduct an initial survey of the AI safety and alignment literature to identify relevant work.
Month 2 (Weeks 5-8): The fellow will perform a deeper dive into the most promising areas, synthesizing their findings into a research report that will be delivered to me at the end of the month.
Month 3 (Weeks 9-12): We will work collaboratively to write a blog post based on the research and analysis conducted in the first two months.
Risto Uuk
Head of EU Policy and Research, Future of Life Institute
Book on AI Safety
This book project explores the concept of an endgame for AI safety: how to think about AI risks, the implications of introducing advanced AI into the world, and potential solutions to these risks. It will focus in particular on risk areas such as loss of control, economic disruption, democracy, and education (especially de-skilling). The content aims to be interdisciplinary, with a strong focus on philosophical reflection on how society should think about the risks and benefits of building advanced AI.
-
* Strong philosophical background with clear writing and critical thinking skills, as evidenced by previous writings and education.
* Excellent at reviewing existing literature and extracting relevant facts, arguments, and ideas from it.
* Ability to carry out work independently with minimal instructions and oversight.
* Specialisation in ethics, applied epistemology, philosophy of science, decision theory, and/or AI safety is a plus.
-
The output of this work will be a book, which may take up to nine months to write, with additional months for publication. The fellow's work will be acknowledged in the book. The time commitment is negotiable, but the expectation is 5-10 hours per week for at least 12 weeks.
Eleni Angelou
PhD Candidate, CUNY Graduate Center
Research Directions in the Theory of Interpretability
Research in the theory of interpretability relies on tracing conceptual connections when forming hypotheses, clarifying underlying assumptions, and testing how well different theoretical frameworks explain model behaviors. Some key problems of interpretability can be translated into problems that have previously appeared in the human cognitive sciences, which makes it easier to make progress on them. The aim of this project is to identify and closely examine such problems and to propose constructive directions for theoretical and empirical work.
Examples of problems to work on include but are not limited to:
- What experiments can we do to test whether safety-relevant behaviors (e.g., strategic deception, hidden reasoning, world-modeling) can be reduced to other, more basic properties? Is there a way to test for "tacit representations"? What are the implications of finding such representations for interpretability in particular and for AI safety more generally? (See the probing sketch after this list.)
- What kinds of explanation should we be looking for when studying the causal structure of models? Are there any patterns in currently available explanations?
- What predictions can we make about concepts as natural kinds and phenomena of mono- and poly-semanticity, considering evidence in favor of a strong Platonic representation hypothesis?
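To make the first question above more concrete, here is a minimal sketch of one familiar baseline from the interpretability literature: training a linear probe on a model's hidden states to test whether a property of interest is linearly decodable. The model ("gpt2"), the choice of layer, and the toy true/false statements are illustrative assumptions, not part of the project; whether a successful probe would count as evidence of a "tacit representation" is exactly the kind of conceptual question the project would examine.

```python
# Illustrative sketch only: a linear-probe baseline for asking whether a property
# (here, a toy "true vs. false statement" distinction) is linearly decodable from
# a model's hidden states. Model, layer, and dataset are hypothetical placeholders.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

true_texts = [
    "The capital of France is Paris.",
    "Water freezes at zero degrees Celsius.",
    "The Earth orbits the Sun.",
    "Two plus two equals four.",
]
false_texts = [
    "The capital of France is Berlin.",
    "Water freezes at fifty degrees Celsius.",
    "The Sun orbits the Earth.",
    "Two plus two equals five.",
]
texts = true_texts + false_texts
labels = [1] * len(true_texts) + [0] * len(false_texts)

# Extract the last-token hidden state from an intermediate layer for each statement.
features = []
with torch.no_grad():
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        hidden_states = model(**inputs).hidden_states  # tuple: embeddings + 12 layers
        features.append(hidden_states[6][0, -1].numpy())

# If a simple linear probe separates the classes, the property is at least linearly
# decodable; whether that amounts to a "tacit representation", and what it implies
# for safety, is the conceptual question at issue.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe accuracy on this toy set:", probe.score(features, labels))
# A real experiment would use held-out data, control tasks, and comparisons across
# layers and models rather than training accuracy on eight sentences.
```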
Fellows will co-author one blog post that would ideally become a published paper.
-
Ideal candidate:
- Background in both technical AI safety and philosophy or cognitive science
- Postgraduate level or above preferred
- Experience designing and conducting experiments with LLMs
- Good understanding of theoretical frameworks, such as Dennett's Intentional Stance and Marr's three levels, and comfort connecting abstract concepts to specific model behaviors
-
We will aim to co-author one or more blog posts that could be turned into academic papers.
Suryansh, a FIG co-founder, presenting his research at the Spring 2024 Research Residency.