QUICK REVIEW

[Paper Review] Governance Architecture for Neural Network Superposition: A Structural Solution to Hallucination via Routing and Interference Filtering

Nelson Elhage, Tristan Hume|arXiv (Cornell University)|2022. 01. 01.
Model Reduction and Neural Networks | Citations: 37
One-Line Summary

This paper studies polysemanticity in neural networks through toy models of superposition, revealing a phase change, a geometric connection to uniform polytopes, and a link to adversarial examples, and discusses the implications for mechanistic interpretability.

ABSTRACT

Neural networks often pack many unrelated concepts into a single neuron - a puzzling phenomenon known as 'polysemanticity' which makes interpretability much more challenging. This paper provides a toy model where polysemanticity can be fully understood, arising as a result of models storing additional sparse features in "superposition." We demonstrate the existence of a phase change, a surprising connection to the geometry of uniform polytopes, and evidence of a link to adversarial examples. We also discuss potential implications for mechanistic interpretability.

Research Motivation and Objectives

  • Explain polysemanticity as a consequence of storing sparse features in superposition.
  • Characterize the conditions under which superposition emerges as a phase change.
  • Link the geometry of superposition to uniform polytopes and adversarial examples.
  • Discuss implications for mechanistic interpretability and model governance.

Proposed Method

  • Introduce toy models that realize feature superposition in neurons.
  • Analyze phase transitions associated with superposition.
  • Draw connections between superposition geometry and uniform polytopes.
  • Investigate relationships to adversarial examples.
  • Discuss interpretability and governance implications for neural networks.
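The review does not spell out the toy model, so the following is only a minimal sketch under stated assumptions: a ReLU-output model of the form x' = ReLU(WᵀWx + b) trained on sparse synthetic features, which is the form commonly associated with toy models of superposition. The function names, the sparsity level, the feature and hidden sizes, and the importance weights are all hypothetical choices made for illustration.

```python
import numpy as np

def sample_sparse_features(rng, batch, n_features, sparsity):
    """Each feature is 0 with probability `sparsity`, otherwise uniform in [0, 1)."""
    values = rng.random((batch, n_features))
    active = rng.random((batch, n_features)) >= sparsity
    return values * active

def toy_forward(W, b, x):
    """Compress n features into m hidden dims, then reconstruct with ReLU(W^T W x + b)."""
    h = x @ W.T                         # (batch, m): h = W x
    return np.maximum(h @ W + b, 0.0)   # (batch, n): x' = ReLU(W^T h + b)

def weighted_mse(x, x_hat, importance):
    """Reconstruction error, weighted by a per-feature importance vector."""
    return float(np.mean(importance * (x - x_hat) ** 2))

rng = np.random.default_rng(0)
n_features, n_hidden = 5, 2             # assumed sizes: more features than dimensions
W = rng.normal(scale=0.5, size=(n_hidden, n_features))
b = np.zeros(n_features)

x = sample_sparse_features(rng, batch=1024, n_features=n_features, sparsity=0.9)
x_hat = toy_forward(W, b, x)
loss = weighted_mse(x, x_hat, importance=0.9 ** np.arange(n_features))
```

Because `n_hidden` is smaller than `n_features`, the model cannot give every feature its own orthogonal direction. Sweeping `sparsity` while training `W` and `b` on this loss is one way such a setup could be used to look for the phase change the paper describes.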

Experimental Results

Research Questions

  • RQ1: What causes polysemanticity to arise in neural networks as a form of feature superposition?
  • RQ2: Under what conditions does a phase change occur leading to superposition behavior?
  • RQ3: How does the geometry of superposition relate to uniform polytopes and adversarial vulnerability?
  • RQ4: What are the implications of superposition for mechanistic interpretability and model governance?

Key Findings

  • Evidence of a phase change leading to superposition in toy models.
  • Identification of a surprising link between superposition geometry and uniform polytopes.
  • Indicators suggesting a relation between superposition and adversarial examples.
  • Discussion of how routing and interference filtering could govern superposition.
  • Implications for interpretability through a structural viewpoint on neuron-level representations.
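To make the polytope link and the notion of interference concrete, here is a small sketch that assumes a hypothetical arrangement: five features embedded in two dimensions as the vertices of a regular pentagon. The arrangement and variable names are illustrative assumptions and are not taken from this review. The Gram matrix WᵀW shows how much each feature leaks into the others, and a ReLU removes the negative part of that leakage.

```python
import numpy as np

n_features = 5
# Assumed layout: feature directions are unit vectors at the vertices of a regular pentagon.
angles = 2 * np.pi * np.arange(n_features) / n_features
W = np.stack([np.cos(angles), np.sin(angles)])   # shape (2, 5)

# Gram matrix: the diagonal is each feature's self-overlap (1.0),
# off-diagonal entries are the interference between pairs of features.
gram = W.T @ W                                   # shape (5, 5)

# Reconstruct a single active feature (feature 0) through the bottleneck.
one_hot = np.eye(n_features)[0]
recon = np.maximum(one_hot @ gram, 0.0)          # ReLU zeroes negative interference

print(np.round(gram[0], 3))
print(np.round(recon, 3))  # roughly [1, 0.309, 0, 0, 0.309]
```

In this layout the two non-adjacent features interfere negatively and are zeroed by the ReLU, while the two adjacent ones still leak a positive residue of about 0.31. A negative bias could suppress that remaining leakage when features are sparse, which is one simple reading of what "interference filtering" might mean in practice.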

This review was generated by AI and reviewed by a human editor.