QUICK REVIEW

[Paper Review] Semantic Identity Compression: Exact Zero-Error Laws, Rate-Distortion, and Neurosymbolic Necessity

Tristan Simas|arXiv (Cornell University)|Jan 20, 2026

Algorithms and Data Compression | Citations: 0

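One-Line Summary

The paper extends rate-distortion theory for zero-error identification to a three-resource framework (tag length L, witness cost W, distortion D), proves an information barrier, identifies nominal tagging as the Pareto-optimal solution, reveals the matroid structure of query sets, and presents proofs machine-checked in Lean 4.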

ABSTRACT

Symbolic systems operate over exact identities: variables denote specific objects, pointers target precise memory locations, and database keys refer to singular records. Neural embeddings generalize by compressing away semantic detail, but this compression creates collision ambiguity: multiple distinct entities can share the same representation value. We characterize exactly how much additional information must be supplied to recover precise identity from such representations. The answer is controlled by a single combinatorial object: the collision-fiber geometry of the representation map $π$. Let $A_π=\max_u |π^{-1}(u)|$ be the largest collision fiber. We prove a tight fixed-length converse $L \ge \log_2 A_π$, an exact finite-block scaling law, a pointwise adaptive budget $\lceil \log_2 |π^{-1}(u)| \rceil$, and the rate-distortion tradeoff with an explicit distortion floor when identity bits are withheld. The same fiber geometry determines query complexity and canonical structure for distinguishing families. Because this residual ambiguity is structural rather than representation-specific, symbolic identity mechanisms (handles, keys, pointers, nominal tags) are the necessary system-level complement to any non-injective semantic representation. All main results are machine-checked in Lean 4. Keywords: semantics-aware compression, zero-error coding, neurosymbolic systems, learned representations, side information
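The quantities named in the abstract can be made concrete on a toy case. The sketch below is illustrative only and is not taken from the paper: the map `toy_pi`, the eight integer entities, and the helper names are all assumptions. It groups entities into collision fibers, then computes the fixed-length budget $\lceil \log_2 A_π \rceil$ and the pointwise budget $\lceil \log_2 |π^{-1}(u)| \rceil$ for each representation value.

```python
from collections import defaultdict


def ceil_log2(n: int) -> int:
    """Exact integer ceil(log2(n)) for n >= 1, avoiding floating point."""
    return (n - 1).bit_length()


def collision_fibers(pi, entities):
    """Group entities by their representation value u = pi(e)."""
    fibers = defaultdict(list)
    for e in entities:
        fibers[pi(e)].append(e)
    return dict(fibers)


def identity_budgets(fibers):
    """Return (fixed-length bits, per-fiber adaptive bits)."""
    a_pi = max(len(f) for f in fibers.values())  # largest collision fiber
    fixed = ceil_log2(a_pi)
    adaptive = {u: ceil_log2(len(f)) for u, f in fibers.items()}
    return fixed, adaptive


# Hypothetical toy map: eight entities collapse onto three representation values.
def toy_pi(e: int) -> int:
    if e < 5:
        return 0
    return 1 if e < 7 else 2


fibers = collision_fibers(toy_pi, range(8))
fixed_bits, adaptive_bits = identity_budgets(fibers)
print(fixed_bits, adaptive_bits)
```

In this toy case the fibers have sizes 5, 2, and 1, so a fixed-length scheme needs 3 identity bits, while an adaptive scheme needs 3, 1, and 0 bits depending on which representation value was observed.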

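Motivation and Objectives

Extend classical rate-distortion theory to a discrete classification setting with three resources (tag length L, witness cost W, distortion D).
Characterize when zero-error identification is possible from attribute-only observations, and show how tagging changes this (the information barrier).
Identify the optimal Pareto trade-off point in (L, W, D) and the induced geometry of the trade-off space.
Expose the matroid structure of minimal distinguishing query sets and define the distinguishing dimension.
Provide worked examples and a machine-verified formalization in Lean 4.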

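Proposed Method

Define an observation model based on a family of attribute interfaces, with optional access to nominal tags.
Prove the information barrier: when profiles collide, interface-only queries cannot distinguish members of the same equivalence class.
Show that nominal tagging with L = ceil(log2 k) enables zero-error identification at constant query cost W = O(1), restoring identifiability regardless of whether the class profiles are injective.
Establish the matroid structure of minimal distinguishing query sets and derive a fixed-axis completeness result.
Characterize the unique Pareto-optimal point (L, W, D) for D = 0 in the information-barrier regime and relate it to rate-distortion theory.
Provide formal proofs in Lean 4 (more than 6,000 lines, 265 theorems).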
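The information barrier and the tagging remedy listed above can be illustrated with a small sketch. Everything in it is hypothetical: the class names, the two attributes, and the helper functions are assumptions chosen for illustration, not the paper's formalization. Two classes with identical attribute profiles cannot be separated by any attribute query, while a tag of ceil(log2 k) bits identifies every class with a single lookup.

```python
from itertools import combinations

# Hypothetical classes with attribute profiles (attribute name -> observed value).
profiles = {
    "Duck": {"quacks": True, "flies": True},
    "Robot": {"quacks": True, "flies": True},  # same profile as Duck
    "Penguin": {"quacks": False, "flies": False},
}


def colliding_pairs(profiles):
    """Pairs of classes that no attribute query can tell apart."""
    names = sorted(profiles)
    return [(a, b) for a, b in combinations(names, 2) if profiles[a] == profiles[b]]


def nominal_tags(profiles):
    """Assign each class a distinct binary tag of ceil(log2 k) bits."""
    names = sorted(profiles)
    width = (len(names) - 1).bit_length()  # integer ceil(log2 k) for k >= 1
    encode = {name: format(i, "b").zfill(width) for i, name in enumerate(names)}
    decode = {tag: name for name, tag in encode.items()}
    return width, encode, decode


barrier = colliding_pairs(profiles)
width, encode, decode = nominal_tags(profiles)
print(barrier, width, encode)
```

Here `barrier` contains the pair (Duck, Robot), so attribute-only identification fails for those two classes, whereas the 2-bit tags are distinct for all three classes and decoding is one dictionary lookup.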

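Experimental Results

Research Questions

RQ1: Under what conditions can zero-error identification be achieved from interface observations alone, and how does tagging affect this capability?
RQ2: What is the minimum tag length needed to restore zero-error identification in the information-barrier regime?
RQ3: What is the structure of the minimal query sets that distinguish all classes, and how does it relate to matroids?
RQ4: What is the Pareto-optimal trade-off among tag length, query cost, and distortion in the proposed (L, W, D) framework?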

Key Findings

Strategy	Tag L	Witness W
Nominal (class ID)	L = ceil(log2 1000) = 10 bits	W = O(1)
Duck typing (attribute-only)	L = 0	W ≤ 50 queries
Adaptive duck typing	L = 0	W ≥ d queries (d ≈ 5–15)

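Zero-error identification through attribute queries is possible only when the class profiles are injective; otherwise an information barrier exists.
Nominal tagging with L = ceil(log2 k) enables zero-error identification at constant query cost W = O(1), restoring identifiability regardless of whether the class profiles are injective.
In the information-barrier regime, any scheme with D = 0 requires L ≥ log2 k, and minimum-length tagging attains this bound.
Minimal distinguishing query sets form the bases of a matroid; the distinguishing dimension is the common size of all minimal sets and governs the lower bound on W for tag-free schemes.
The (L, W, D) framework yields a unique Pareto-optimal point at D = 0, with nominal tagging offering the best overall efficiency. These results are machine-verified in Lean 4.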
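The notion of a minimal distinguishing query set can be explored by brute force on a small case. The sketch below is an illustration under stated assumptions, not the paper's construction: the four classes and three binary attributes are made up, and the enumeration is exhaustive, so it only suits tiny inputs. In this particular table every minimal set happens to have size 2, which is consistent with the common-size property reported above but does not prove it.

```python
from itertools import combinations

# Hypothetical profile table: class -> answers to three binary attribute queries.
profiles = {
    "A": (0, 0, 0),
    "B": (0, 1, 1),
    "C": (1, 0, 1),
    "D": (1, 1, 0),
}


def distinguishes(queries, profiles):
    """True if the chosen attribute indices give every class a unique answer pattern."""
    seen = {tuple(p[q] for q in queries) for p in profiles.values()}
    return len(seen) == len(profiles)


def minimal_distinguishing_sets(profiles):
    """Enumerate distinguishing query sets that contain no smaller distinguishing set."""
    n = len(next(iter(profiles.values())))
    found = []
    # Visit candidates by increasing size, so every minimal set is recorded
    # before any of its strict supersets is examined.
    for size in range(n + 1):
        for qs in combinations(range(n), size):
            if distinguishes(qs, profiles) and not any(set(m) <= set(qs) for m in found):
                found.append(qs)
    return found


minimal_sets = minimal_distinguishing_sets(profiles)
sizes = {len(m) for m in minimal_sets}
print(minimal_sets, sizes)
```

No single attribute separates the four classes, while any two attributes do, so the minimal sets are the three attribute pairs and the distinguishing dimension of this toy table is 2.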

This review was generated by AI and checked by a human editor.