QUICK REVIEW

[Paper Review] Semantic Identity Compression: Exact Zero-Error Laws, Rate-Distortion, and Neurosymbolic Necessity

Tristan Simas|arXiv (Cornell University)|2026. 01. 20.

Algorithms and Data Compression | Citations: 0

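One-line summary

The paper extends rate-distortion theory to a three-resource framework (L, W, D) for zero-error identification, proves an information barrier, shows that nominal tagging is a Pareto-optimal solution, and reveals a matroid structure on query sets; all proofs are machine-checked in Lean 4.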

ABSTRACT

Symbolic systems operate over exact identities: variables denote specific objects, pointers target precise memory locations, and database keys refer to singular records. Neural embeddings generalize by compressing away semantic detail, but this compression creates collision ambiguity: multiple distinct entities can share the same representation value. We characterize exactly how much additional information must be supplied to recover precise identity from such representations. The answer is controlled by a single combinatorial object: the collision-fiber geometry of the representation map $π$. Let $A_π=\max_u |π^{-1}(u)|$ be the largest collision fiber. We prove a tight fixed-length converse $L \ge \log_2 A_π$, an exact finite-block scaling law, a pointwise adaptive budget $\lceil \log_2 |π^{-1}(u)| \rceil$, and the rate-distortion tradeoff with an explicit distortion floor when identity bits are withheld. The same fiber geometry determines query complexity and canonical structure for distinguishing families. Because this residual ambiguity is structural rather than representation-specific, symbolic identity mechanisms (handles, keys, pointers, nominal tags) are the necessary system-level complement to any non-injective semantic representation. All main results are machine-checked in Lean 4. Keywords: semantics-aware compression, zero-error coding, neurosymbolic systems, learned representations, side information
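To make the fiber quantities in the abstract concrete, here is a minimal sketch that computes the collision fibers of a toy map, the largest fiber size $A_π$, the smallest integer fixed-length budget $\lceil \log_2 A_π \rceil$, and the pointwise adaptive budget. The entity names, the map `pi`, and all function names are hypothetical illustrations, not taken from the paper.

```python
from collections import defaultdict
from math import ceil, log2

def collision_fibers(entities, pi):
    """Group entities by representation value: fibers[u] = pi^{-1}(u)."""
    fibers = defaultdict(list)
    for e in entities:
        fibers[pi(e)].append(e)
    return dict(fibers)

def fixed_length_bits(fibers):
    """Smallest integer tag length resolving every fiber: ceil(log2 A_pi)."""
    a_pi = max(len(f) for f in fibers.values())
    return ceil(log2(a_pi))

def adaptive_bits(fibers):
    """Pointwise budget per representation value u: ceil(log2 |pi^{-1}(u)|)."""
    return {u: ceil(log2(len(f))) for u, f in fibers.items()}

# Hypothetical toy data: six entities collapsed onto three representation values.
entities = ["cat_1", "cat_2", "cat_3", "dog_1", "dog_2", "fish_1"]
pi = lambda e: e.split("_")[0]  # assumed non-injective "embedding"

fibers = collision_fibers(entities, pi)
print(fixed_length_bits(fibers))  # 2, since A_pi = 3
print(adaptive_bits(fibers))      # {'cat': 2, 'dog': 1, 'fish': 0}
```

In this toy case the singleton fiber needs no extra bits, while the largest fiber sets the fixed-length cost for everyone, which is the gap between the fixed-length and adaptive budgets.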

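Research motivation and goals

Extend classical rate-distortion theory to a discrete classification setting with three resources: tag length L, identification cost W, and distortion D.
Characterize when zero-error identification is possible from attribute-only observations (the information barrier) and how tagging affects this.
Identify the Pareto-optimal tradeoff points in (L, W, D) space and characterize the geometry of the tradeoff space.
Expose the matroid structure of minimal distinguishing query sets and use it to define the distinguishing dimension.
Provide a concrete, machine-checked formalization in Lean 4 (6,000+ lines, 265 theorems).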

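Proposed method

Define the observation model as a family of attribute interfaces with optional nominal-tag access.
Prove the information barrier: when profiles collide, interface-only queries cannot distinguish the classes inside the resulting equivalence class.
Show that nominal tagging with L = ceil(log2 k) enables zero-error identification with W = O(1), restoring identifiability whether or not the profile map is injective.
Establish the matroid structure of minimal distinguishing query sets and derive a fixed-axis completeness result.
Show that in information-barrier domains there is a unique Pareto-optimal point at D = 0 and that nominal tagging provides optimal combinatorial efficiency, and machine-check these results in Lean 4.
Provide formal Lean 4 proofs (6,000+ lines, 265 theorems).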
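The barrier and the tagging remedy described above can be sketched on a toy example. The class names, attribute profiles, and helper functions below are hypothetical and only illustrate the idea; they do not reproduce the paper's formal model. Two classes share a profile, so profile matching returns both, while an integer tag drawn from k values (ceil(log2 k) bits) resolves the class with one dictionary lookup.

```python
from math import ceil, log2

# Hypothetical classes and their attribute profiles (answers to interface queries).
profiles = {
    "Meters":  ("has_value", "has_add"),
    "Seconds": ("has_value", "has_add"),  # collides with Meters
    "Label":   ("has_text",),
}

def identify_by_profile(observed):
    """Interface-only identification: every class whose profile matches."""
    return sorted(c for c, p in profiles.items() if p == observed)

def assign_tags(classes):
    """Give each of the k classes a distinct integer tag in [0, k)."""
    ordered = sorted(classes)
    tag_bits = ceil(log2(len(ordered)))  # L = ceil(log2 k)
    tag_of = {c: i for i, c in enumerate(ordered)}
    class_of = {i: c for c, i in tag_of.items()}
    return tag_bits, tag_of, class_of

tag_bits, tag_of, class_of = assign_tags(profiles)

print(identify_by_profile(("has_value", "has_add")))  # ['Meters', 'Seconds']
print(tag_bits)                                       # 2
print(class_of[tag_of["Seconds"]])                    # Seconds
```

No amount of extra profile matching separates `Meters` from `Seconds` here, whereas reading the tag is a single lookup, which is the sense in which W = O(1) is meant in this sketch.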

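Experimental results

Research questions

RQ1: When is zero-error identification possible from interface observations alone, and how does tagging affect this capability?
RQ2: In information-barrier domains, what is the minimum tag length needed to restore zero-error identification?
RQ3: What is the structure of the minimal query sets that distinguish all classes, and how does it relate to matroids?
RQ4: In the proposed (L, W, D) framework, how are the Pareto-optimal tradeoffs among tag length, query cost, and distortion organized?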

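Key results

Zero-error identification from attribute queries is possible only when the class-profile map is injective (one-to-one); otherwise an information barrier exists.
Nominal tagging with L = ceil(log2 k) enables zero-error identification with W = O(1) and restores identifiability whether or not the profile map is injective.
In information-barrier domains, every scheme with D = 0 requires L ≥ log2 k; minimum-length nominal tagging meets this lower bound, so the bound is tight.
Minimal distinguishing query sets form the bases of a matroid; the distinguishing dimension is the common size of all minimal sets, and it governs the lower bound on W for tag-free schemes.
The (L, W, D) framework yields a unique Pareto-optimal point at D = 0, and nominal tagging provides optimal combinatorial efficiency; the results are machine-checked in Lean 4.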
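As an illustration of the distinguishing-dimension result, the sketch below enumerates minimal distinguishing query sets by brute force on a hypothetical answer table. The table, query names, and functions are assumptions made for this example. Checking one small table is not a proof of the matroid property, and the conditions under which the paper proves it are not restated in this summary.

```python
from itertools import combinations

# Hypothetical class-by-query answer table (q3 equals q1 XOR q2).
answers = {
    "A": {"q1": 0, "q2": 0, "q3": 0},
    "B": {"q1": 0, "q2": 1, "q3": 1},
    "C": {"q1": 1, "q2": 0, "q3": 1},
    "D": {"q1": 1, "q2": 1, "q3": 0},
}
queries = ["q1", "q2", "q3"]

def distinguishes(subset):
    """True if answers restricted to `subset` differ for every pair of classes."""
    seen = {tuple(row[q] for q in subset) for row in answers.values()}
    return len(seen) == len(answers)

def minimal_distinguishing_sets():
    """All distinguishing subsets that contain no smaller distinguishing subset."""
    good = [set(s) for r in range(len(queries) + 1)
            for s in combinations(queries, r) if distinguishes(s)]
    return [s for s in good if not any(t < s for t in good)]

minimal = minimal_distinguishing_sets()
print(sorted(sorted(s) for s in minimal))  # [['q1', 'q2'], ['q1', 'q3'], ['q2', 'q3']]
print({len(s) for s in minimal})           # {2}
```

In this table every minimal set has size 2, so a tag-free scheme needs at least two queries to separate all four classes, which matches the role the summary assigns to the distinguishing dimension.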

This review was generated by AI and reviewed by a human editor.