[Paper Review] Governance Architecture for Neural Network Superposition: A Structural Solution to Hallucination via Routing and Interference Filtering
This paper studies polysemanticity in neural networks through toy models of superposition. It reveals a phase change, presents a geometric connection to uniform polytopes and a link to adversarial examples, and discusses implications for mechanistic interpretability.
Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.
Research Motivation and Goals
- Explain polysemanticity as a consequence of storing sparse features in superposition.
- Characterize the conditions under which superposition emerges as a phase change.
- Link the geometry of superposition to uniform polytopes and adversarial examples.
- Discuss implications for mechanistic interpretability and model governance.
Proposed Method
- Introduce toy models that realize feature superposition in neurons.
- Analyze phase transitions associated with superposition.
- Draw connections between superposition geometry and uniform polytopes.
- Investigate relationships to adversarial examples.
- Discuss interpretability and governance implications for neural networks.
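The toy-model idea in the list above can be illustrated with a very small sketch. This is not the paper's code: it assumes a linear compression h = Wx followed by a ReLU readout ReLU(W^T h + b), and the weights `W`, the bias `b`, and the function name `reconstruct` are hypothetical choices made only for this example. Two features share a single hidden dimension by pointing in opposite directions.

```python
def reconstruct(W, b, x):
    """Assumed toy forward pass: h = W x, then x_hat = ReLU(W^T h + b)."""
    m, n = len(W), len(x)
    # Compress n features into m hidden dimensions (m < n).
    h = [sum(W[i][j] * x[j] for j in range(n)) for i in range(m)]
    # Decode with the transposed weights and clip negatives with ReLU.
    return [max(sum(W[i][j] * h[i] for i in range(m)) + b[j], 0.0)
            for j in range(n)]

# Hypothetical weights: two features stored as an antipodal pair in one dimension.
W = [[1.0, -1.0]]
b = [0.0, 0.0]

only_first = reconstruct(W, b, [1.0, 0.0])   # a single active feature is recovered
only_second = reconstruct(W, b, [0.0, 1.0])  # likewise for the other feature
both = reconstruct(W, b, [1.0, 1.0])         # simultaneous features cancel out
```

With these weights, a sparse input (one active feature) is reconstructed exactly, while an input with both features active is lost to interference. That trade-off is why sparsity matters for superposition in this kind of setup.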
Experimental Results
Research Questions
- RQ1: What causes polysemanticity to arise in neural networks as a form of feature superposition?
- RQ2: Under what conditions does a phase change occur leading to superposition behavior?
- RQ3: How does the geometry of superposition relate to uniform polytopes and adversarial vulnerability?
- RQ4: What are the implications of superposition for mechanistic interpretability and model governance?
Key Results
- Evidence of a phase change leading to superposition in toy models.
- Identification of a surprising link between superposition geometry and uniform polytopes.
- Indicators suggesting a relation between superposition and adversarial examples.
- Discussion of how routing and interference filtering could govern superposition.
- Implications for interpretability through a structural viewpoint on neuron-level representations.
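To make the polytope connection in the results above more concrete, the following sketch places feature directions at the vertices of a regular polygon and measures their pairwise overlap. The pentagon (five features in two dimensions) is chosen purely for illustration; the review does not say which polytopes appear, and the helper names `polygon_features` and `gram` are hypothetical.

```python
import math

def polygon_features(n):
    """Unit vectors at the vertices of a regular n-gon (assumed feature directions)."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
            for k in range(n)]

def gram(vectors):
    """Pairwise dot products; off-diagonal entries measure interference."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

G = gram(polygon_features(5))

self_overlap = G[0][0]        # each feature has unit norm
neighbor_overlap = G[0][1]    # adjacent vertices, 72 degrees apart
far_overlap = G[0][2]         # non-adjacent vertices, 144 degrees apart
```

In this arrangement every feature overlaps its two neighbors by cos(72°) and the two remaining features by cos(144°), so the interference is spread evenly across features. Under a ReLU readout the negative overlaps would be clipped, while the positive ones would remain as residual interference.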
This review was generated by AI and reviewed by a human editor.