Behavior and Mechanism Are Interrelated

The contrast between “inner” mechanistic approaches and “behavioral” approaches to AI interpretability is misleading. They’re not opposed — they’re interdependent.

Behavior informs mechanism hypotheses. Carefully designed behavioral studies that map edge cases, identify breakdown patterns, and test for “signatures” of specific algorithms constrain what internal mechanisms are plausible. Benchmarking whether a model succeeds at a task is only the narrowest version of this. The more fruitful version systematically probes where and how behavior departs from expectation, since failure patterns discriminate between candidate mechanisms in a way that aggregate success rates cannot.

Internal structure requires behavioral validation. A component’s functional significance is validated through observed effects on system-level outputs. Philosophy of cognitive science calls this “looking down, around, and up” — understanding a part’s role requires determining its contribution within larger behavioral contexts.

The most fruitful mechanistic interpretability (MI) research therefore (1) identifies internal components and (2) demonstrates that these components play well-defined causal roles in producing system-level behavior. Many interventional methods already do this, including ablations, activation patching, and causal tracing.
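To make the two-step pattern concrete, here is a minimal sketch of ablation and activation patching on a hypothetical toy network. The weights, inputs, and function names (`run`, `ablation_effect`, `patching_effect`) are all invented for illustration and do not come from any real model or library. Hidden unit 2 is deliberately given a zero output weight, so it is strongly active yet causally inert.

```python
# Hypothetical toy model: 2 inputs -> 3 hidden ReLU units -> 1 scalar output.
# All weights are made up for illustration.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one row of input weights per hidden unit
W2 = [1.0, 1.0, 0.0]                       # unit 2 has zero output weight (causally inert)

def hidden(x):
    """Compute ReLU activations of the hidden units for input x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]

def output(h):
    """Map hidden activations to the system-level output."""
    return sum(w * hi for w, hi in zip(W2, h))

def run(x, overrides=None):
    """Forward pass, optionally overwriting selected hidden activations."""
    h = hidden(x)
    for idx, val in (overrides or {}).items():
        h[idx] = val
    return output(h)

def ablation_effect(x, unit):
    """Change in output when one hidden unit is zeroed out."""
    return run(x) - run(x, {unit: 0.0})

def patching_effect(clean_x, corrupt_x, unit):
    """Change in the corrupt run's output when one unit's activation
    is replaced by its value from the clean run."""
    clean_h = hidden(clean_x)
    return run(corrupt_x, {unit: clean_h[unit]}) - run(corrupt_x)

clean, corrupt = [2.0, 1.0], [0.0, 1.0]
for unit in range(3):
    print(
        f"unit {unit}: activation={hidden(clean)[unit]}, "
        f"ablation={ablation_effect(clean, unit)}, "
        f"patching={patching_effect(clean, corrupt, unit)}"
    )
```

In this sketch unit 2 has the largest activation on the clean input, but both interventions report zero effect on the output. Identifying the component (step 1) says nothing about its role until an intervention ties it to behavior (step 2).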

The philosophical insight: internal structure acquires explanatory force only when its functional significance is validated through behavior.

Related: [None yet]