QUICK REVIEW

[Paper Review] Governance Architecture for Neural Network Superposition: A Structural Solution to Hallucination via Routing and Interference Filtering

Nelson Elhage, Tristan Hume | arXiv (Cornell University) | 2022. 01. 01.

Model Reduction and Neural Networks | Citations: 37

One-Line Summary

This paper studies polysemanticity in neural networks through toy models of superposition, revealing a phase change, presenting a geometric connection to uniform polytopes and a link to adversarial examples, and discussing implications for mechanistic interpretability.

ABSTRACT

Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.

Research Motivation and Goals

Explain polysemanticity as a consequence of storing sparse features in superposition.
Characterize the conditions under which superposition emerges as a phase change.
Link the geometry of superposition to uniform polytopes and adversarial examples.
Discuss implications for mechanistic interpretability and model governance.

Proposed Method

Introduce toy models that realize feature superposition in neurons.
Analyze phase transitions associated with superposition.
Draw connections between superposition geometry and uniform polytopes.
Investigate relationships to adversarial examples.
Discuss interpretability and governance implications for neural networks.
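
The review does not spell out the toy model itself. As a minimal sketch, assuming a setup in which n sparse features are linearly projected into m < n hidden dimensions and read back out through the same weights followed by a ReLU, the forward pass and loss might look like the following. The function names, the 0.9 sparsity level, and the geometric importance weighting are illustrative assumptions, not details taken from this review.

```python
import numpy as np

def sample_sparse_features(n_samples, n_features, sparsity, rng):
    """Each feature is zero with probability `sparsity`, else uniform in [0, 1)."""
    values = rng.random((n_samples, n_features))
    active = rng.random((n_samples, n_features)) >= sparsity
    return values * active

def toy_forward(W, b, x):
    """Compress n features into m hidden dims, then reconstruct with a ReLU.

    W has shape (m, n); x has shape (batch, n).
    """
    hidden = x @ W.T          # (batch, m): compressed representation
    recon = hidden @ W + b    # (batch, n): tied-weight readout
    return np.maximum(recon, 0.0)

def weighted_mse(x, x_hat, importance):
    """Importance-weighted reconstruction error (weighting is an assumption)."""
    return float(np.mean(importance * (x - x_hat) ** 2))

rng = np.random.default_rng(0)
n_features, n_hidden = 5, 2                       # more features than dimensions
W = rng.normal(scale=0.5, size=(n_hidden, n_features))
b = np.zeros(n_features)

x = sample_sparse_features(1024, n_features, sparsity=0.9, rng=rng)
x_hat = toy_forward(W, b, x)
importance = 0.9 ** np.arange(n_features)         # hypothetical importance decay
loss = weighted_mse(x, x_hat, importance)
```

Training W and b to minimize this loss at different sparsity levels is one way to probe when a model starts representing more features than it has hidden dimensions.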

Experimental Results

Research Questions

RQ1: What causes polysemanticity to arise in neural networks as a form of feature superposition?
RQ2: Under what conditions does a phase change occur leading to superposition behavior?
RQ3: How does the geometry of superposition relate to uniform polytopes and adversarial vulnerability?
RQ4: What are the implications of superposition for mechanistic interpretability and model governance?

Key Results

Evidence of a phase change leading to superposition in toy models.
Identification of a surprising link between superposition geometry and uniform polytopes.
Indicators suggesting a relation between superposition and adversarial examples.
Discussion of how routing and interference filtering could govern superposition.
Implications for interpretability through a structural viewpoint on neuron-level representations.
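
To make the geometric result more concrete, here is a small sketch that assumes three features embedded as unit vectors at the vertices of an equilateral triangle in two dimensions. This layout is an illustrative assumption and is not a configuration reported in this review. The off-diagonal entries of the Gram matrix measure how much one feature interferes with another, and the last lines show how a ReLU readout with zero bias clips that negative interference when a single feature is active.

```python
import numpy as np

def polygon_embedding(n_features):
    """Place n feature directions at the vertices of a regular n-gon in 2D."""
    angles = 2 * np.pi * np.arange(n_features) / n_features
    return np.stack([np.cos(angles), np.sin(angles)])  # shape (2, n_features)

def interference_matrix(W):
    """Pairwise dot products between feature directions, with the diagonal zeroed."""
    gram = W.T @ W
    return gram - np.diag(np.diag(gram))

W_triangle = polygon_embedding(3)
interference = interference_matrix(W_triangle)   # off-diagonals equal cos(120 deg) = -0.5

# Activate only feature 0 and read out through the same weights plus a ReLU.
one_hot = np.array([1.0, 0.0, 0.0])
readout = np.maximum(W_triangle.T @ (W_triangle @ one_hot), 0.0)
```

In this sketch every pair of features interferes by the same amount, which reflects the symmetry of the polygon. The ReLU recovers the single active feature exactly because all of the interference is negative.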

This review was generated by AI and reviewed by a human editor.