Mechanistic Interpretability Needs Philosophy
Citation
Williams, I., Oldenburg, N., Dhar, R., Hatherley, J., Fierro, C., Rajcic, N., Schiller, S.R., Stamatiou, F., & Søgaard, A. (2025). Mechanistic Interpretability Needs Philosophy. arXiv:2506.18852.
Abstract
Mechanistic interpretability (MI) aims to explain how neural networks work by uncovering their underlying causal mechanisms. As the field grows in influence, it is increasingly important to examine not just models themselves, but the assumptions, concepts and explanatory strategies implicit in MI research. The authors argue that MI needs philosophy: not as an afterthought, but as an ongoing partner in clarifying its concepts, refining its methods, and assessing the epistemic and ethical stakes of interpreting AI systems.
Framing
Position paper arguing for interdisciplinary collaboration between philosophy and MI. Frames MI as “pre-paradigmatic”: a field with fundamental open problems and unexamined assumptions, where philosophical partnership can accelerate progress. Parallels arguments made for philosophy’s role in physics, cognitive science, and economics.
Key Contributions
- Defines MI by two commitments: (a) explaining via causal mechanisms, not just correlations (see the sketch after this list), and (b) producing scientific understanding for researchers rather than explanations for end-users
- Examines three open problems to demonstrate philosophy’s value:
  - Network decomposition (philosophy of mechanistic explanation)
  - Features/representations (philosophy of mind, content theories)
  - Deception detection (ethics, philosophy of language)
- Addresses four objections to philosophical engagement in MI
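As a concrete illustration of the first commitment, here is a minimal activation-patching sketch in PyTorch. This is my own toy example, not from the paper: the model, inputs, and layer choice are all hypothetical. The point is that patching a cached activation into a different run is a causal intervention, testing whether a component produces a behavior rather than merely correlating with it.

```python
# Minimal activation-patching sketch: a causal intervention on a hidden layer.
# Everything here (the toy model, layer indices, inputs) is illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(4, 8),   # layer 0: hidden layer we will intervene on
    nn.ReLU(),
    nn.Linear(8, 2),   # output head
)

clean_x = torch.randn(1, 4)    # input on which the model behaves one way
corrupt_x = torch.randn(1, 4)  # contrast input

# 1. Cache the hidden activation from the clean run.
cache = {}
def save_hook(module, inputs, output):
    cache["hidden"] = output.detach()

handle = model[0].register_forward_hook(save_hook)
clean_out = model(clean_x)
handle.remove()

# 2. Re-run on the corrupted input, but patch in the clean activation.
#    (A forward hook that returns a tensor replaces the layer's output.)
def patch_hook(module, inputs, output):
    return cache["hidden"]

handle = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupt_x)
handle.remove()

corrupt_out = model(corrupt_x)  # unpatched run for comparison

# If patching one component restores the clean behavior, that component is
# causally implicated in the behavior -- evidence beyond mere correlation.
print("clean:  ", clean_out)
print("corrupt:", corrupt_out)
print("patched:", patched_out)
```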
Core Arguments
On Decomposition:
- Challenges the assumption of a “one true decomposition”: there is no privileged level at which mechanistic truth resides (see the linear-map sketch below)
- Behavior and mechanism are interrelated, not opposed; behavioral evidence constrains hypotheses about mechanisms
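To make the first point concrete, here is a NumPy sketch (my illustration, not the paper’s) of two decompositions of the same two-layer linear map. Rotating the hidden basis by an orthogonal matrix changes the intermediate “features” while leaving input-output behavior untouched. The equivalence is exact only for linear maps, since nonlinearities break full rotation symmetry, but it shows why behavior alone does not single out one decomposition.

```python
# Sketch: two behaviorally identical "decompositions" of one linear map.
# Rotating the hidden basis changes the intermediate "features" while the
# end-to-end function stays the same, so behavior alone cannot pick a
# privileged decomposition.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 4))   # first "layer"
B = rng.normal(size=(2, 8))   # second "layer"
x = rng.normal(size=(4,))

# A random orthogonal matrix Q (so Q.T @ Q = I).
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

hidden_1 = A @ x              # decomposition 1's intermediate features
hidden_2 = (Q @ A) @ x        # decomposition 2's intermediate features

out_1 = B @ hidden_1
out_2 = (B @ Q.T) @ hidden_2  # same function, different internal parts

print(np.allclose(out_1, out_2))        # True: identical behavior
print(np.allclose(hidden_1, hidden_2))  # False: different "features"
```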
On Features:
- Distinguishes representational vehicles (internal components) from content (what they represent)
- This distinction clarifies which research questions concern vehicles and which concern content, and connects MI to prior philosophical work on representation (see the probe sketch below)
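A hedged illustration of the distinction, using a toy linear probe; the activations, labels, and decoded “property” are all fabricated for the sketch. The probe operates on vehicles (activation vectors), while what those vectors represent (their content) is exactly what probe accuracy alone does not settle.

```python
# Sketch of the vehicle/content distinction via a linear probe.
# Vehicle: the activation vector itself (a concrete internal state).
# Content: what that vector carries information about (here, a binary label).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend these are hidden activations collected from a model (the vehicles)...
activations = rng.normal(size=(200, 16))
# ...and these label a property we hypothesize they represent (the content).
labels = (activations @ rng.normal(size=(16,)) > 0).astype(int)

probe = LogisticRegression().fit(activations, labels)
print("probe accuracy:", probe.score(activations, labels))

# High probe accuracy shows the vehicles carry decodable information, but it
# does not by itself settle what the model's components represent -- that is
# the philosophical question about content the paper highlights.
```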
On Deception:
- Lying and deception require cognitive complexity (beliefs, intentions, assertoric commitment)
- It’s controversial whether LLMs possess these in the relevant sense
- Not all information concealment is ethically equivalent
Extracted Content
- 05-atom—mechanistic-explanation-definition
- 05-atom—vehicle-content-distinction
- 05-atom—behavior-mechanism-integration
- 05-atom—explanatory-pluralism
- 05-atom—deception-requires-intention
- 05-atom—lying-requires-beliefs
- 05-molecule—preparadigmatic-field-dynamics
- 07-molecule—philosophy-ai-partnership-pattern
Related Sources
- Sharkey et al. 2025 - Open Problems in Mechanistic Interpretability
- Bereska & Gavves 2024 - Mechanistic Interpretability for AI Safety
- Kästner & Crook 2024 - Explaining AI through Mechanistic Interpretability
- Chalmers 2025 - Propositional Interpretability in AI