Mechanistic Interpretability Needs Philosophy
Citation
Williams, I., Oldenburg, N., Dhar, R., Hatherley, J., Fierro, C., Rajcic, N., Schiller, S.R., Stamatiou, F., & Søgaard, A. (2025). Mechanistic Interpretability Needs Philosophy. arXiv:2506.18852.
Abstract
Mechanistic interpretability (MI) aims to explain how neural networks work by uncovering their underlying causal mechanisms. As the field grows in influence, it is increasingly important to examine not just models themselves, but the assumptions, concepts and explanatory strategies implicit in MI research. The authors argue that MI needs philosophy: not as an afterthought, but as an ongoing partner in clarifying its concepts, refining its methods, and assessing the epistemic and ethical stakes of interpreting AI systems.
Framing
Position paper arguing for interdisciplinary collaboration between philosophy and MI. Frames MI as “pre-paradigmatic”: a field with fundamental open problems and unexamined assumptions, where philosophical partnership can accelerate progress. Parallels arguments made for philosophy’s role in physics, cognitive science, and economics.
Key Contributions
- Defines MI by two commitments: (a) explaining via causal mechanisms, not just correlations (see the sketch after this list), and (b) producing scientific understanding for researchers rather than explanations for end-users
- Examines three open problems to demonstrate philosophy’s value:
  - Network decomposition (philosophy of mechanistic explanation)
  - Features/representations (philosophy of mind, content theories)
  - Deception detection (ethics, philosophy of language)
- Addresses four objections to philosophical engagement in MI
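As a concrete illustration of the first commitment, here is a minimal activation-patching sketch in PyTorch. This is my own toy example, not from the paper: the model, inputs, and layer choice are all hypothetical. The point is that patching a cached activation into a different run is a causal intervention, testing whether a component produces a behavior rather than merely correlating with it.

```python
# Minimal activation-patching sketch: a causal intervention on a hidden layer.
# Everything here (the toy model, layer indices, inputs) is illustrative only.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(4, 8),   # layer 0: hidden layer we will intervene on
    nn.ReLU(),
    nn.Linear(8, 2),   # output head
)

clean_x = torch.randn(1, 4)    # input on which the model behaves one way
corrupt_x = torch.randn(1, 4)  # contrast input

# 1. Cache the hidden activation from the clean run.
cache = {}
def save_hook(module, inputs, output):
    cache["hidden"] = output.detach()

handle = model[0].register_forward_hook(save_hook)
clean_out = model(clean_x)
handle.remove()

# 2. Re-run on the corrupted input, but patch in the clean activation.
#    (A forward hook that returns a tensor replaces the layer's output.)
def patch_hook(module, inputs, output):
    return cache["hidden"]

handle = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupt_x)
handle.remove()

corrupt_out = model(corrupt_x)  # unpatched run for comparison

# If patching one component restores the clean behavior, that component is
# causally implicated in the behavior -- evidence beyond mere correlation.
print("clean:  ", clean_out)
print("corrupt:", corrupt_out)
print("patched:", patched_out)
```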
Core Arguments
On Decomposition:
- Challenges the assumption of a “one true decomposition”: there is no privileged level at which mechanistic truth resides (see the linear-map sketch below)
- Behavior and mechanism are interrelated, not opposed; behavioral evidence constrains hypotheses about mechanisms
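To make the first point concrete, here is a NumPy sketch (my illustration, not the paper’s) of two decompositions of the same two-layer linear map. Rotating the hidden basis by an orthogonal matrix changes the intermediate “features” while leaving input-output behavior untouched. The equivalence is exact only for linear maps, since nonlinearities break full rotation symmetry, but it shows why behavior alone does not single out one decomposition.

```python
# Sketch: two behaviorally identical "decompositions" of one linear map.
# Rotating the hidden basis changes the intermediate "features" while the
# end-to-end function stays the same, so behavior alone cannot pick a
# privileged decomposition.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 4))   # first "layer"
B = rng.normal(size=(2, 8))   # second "layer"
x = rng.normal(size=(4,))

# A random orthogonal matrix Q (so Q.T @ Q = I).
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

hidden_1 = A @ x              # decomposition 1's intermediate features
hidden_2 = (Q @ A) @ x        # decomposition 2's intermediate features

out_1 = B @ hidden_1
out_2 = (B @ Q.T) @ hidden_2  # same function, different internal parts

print(np.allclose(out_1, out_2))        # True: identical behavior
print(np.allclose(hidden_1, hidden_2))  # False: different "features"
```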
On Features:
- Distinguishes representational vehicles (internal components) from content (what they represent)
- This distinction clarifies which research questions concern vehicles and which concern content, and connects MI to prior philosophical work on representation (see the probe sketch below)
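A hedged illustration of the distinction, using a toy linear probe; the activations, labels, and decoded “property” are all fabricated for the sketch. The probe operates on vehicles (activation vectors), while what those vectors represent (their content) is exactly what probe accuracy alone does not settle.

```python
# Sketch of the vehicle/content distinction via a linear probe.
# Vehicle: the activation vector itself (a concrete internal state).
# Content: what that vector carries information about (here, a binary label).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend these are hidden activations collected from a model (the vehicles)...
activations = rng.normal(size=(200, 16))
# ...and these label a property we hypothesize they represent (the content).
labels = (activations @ rng.normal(size=(16,)) > 0).astype(int)

probe = LogisticRegression().fit(activations, labels)
print("probe accuracy:", probe.score(activations, labels))

# High probe accuracy shows the vehicles carry decodable information, but it
# does not by itself settle what the model's components represent -- that is
# the philosophical question about content the paper highlights.
```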
On Deception:
- Lying and deception require cognitive complexity (beliefs, intentions, assertoric commitment)
- It’s controversial whether LLMs possess these in the relevant sense
- Not all information concealment is ethically equivalent
Extracted Content
- 05-atom—mechanistic-explanation-definition
- 05-atom—vehicle-content-distinction
- 05-atom—behavior-mechanism-integration
- 05-atom—explanatory-pluralism
- 05-atom—deception-requires-intention
- 05-atom—lying-requires-beliefs
- 05-molecule—preparadigmatic-field-dynamics
- 07-molecule—philosophy-ai-partnership-pattern
Related Sources
- Sharkey et al. 2025 - Open Problems in Mechanistic Interpretability
- Bereska & Gavves 2024 - Mechanistic Interpretability for AI Safety
- Kästner & Crook 2024 - Explaining AI through Mechanistic Interpretability
- Chalmers 2025 - Propositional Interpretability in AI