QUICK REVIEW

[Paper Digest] Semantic Identity Compression: Exact Zero-Error Laws, Rate-Distortion, and Neurosymbolic Necessity

Tristan Simas|arXiv (Cornell University)|Jan 20, 2026

Algorithms and Data Compression | Cited by 0

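One-Sentence Summary

The paper extends rate-distortion theory to a three-resource framework (tag length L, witness cost W, distortion D) for zero-error identification, proves an information barrier, identifies nominal tagging as the Pareto-optimal solution, and reveals the matroid structure of query sets; all proofs are machine-checked in Lean 4.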

ABSTRACT

Symbolic systems operate over exact identities: variables denote specific objects, pointers target precise memory locations, and database keys refer to singular records. Neural embeddings generalize by compressing away semantic detail, but this compression creates collision ambiguity: multiple distinct entities can share the same representation value. We characterize exactly how much additional information must be supplied to recover precise identity from such representations. The answer is controlled by a single combinatorial object: the collision-fiber geometry of the representation map $π$. Let $A_π=\max_u |π^{-1}(u)|$ be the largest collision fiber. We prove a tight fixed-length converse $L \ge \log_2 A_π$, an exact finite-block scaling law, a pointwise adaptive budget $\lceil \log_2 |π^{-1}(u)| \rceil$, and the rate-distortion tradeoff with an explicit distortion floor when identity bits are withheld. The same fiber geometry determines query complexity and canonical structure for distinguishing families. Because this residual ambiguity is structural rather than representation-specific, symbolic identity mechanisms (handles, keys, pointers, nominal tags) are the necessary system-level complement to any non-injective semantic representation. All main results are machine-checked in Lean 4. Keywords: semantics-aware compression, zero-error coding, neurosymbolic systems, learned representations, side information
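
The fiber quantities named in the abstract can be computed directly on a small finite example. The following is a minimal sketch, not the paper's implementation: the entity names, the `bucket` map standing in for $π$, and all function names are hypothetical. It groups entities into collision fibers, then reports the smallest whole number of bits satisfying the fixed-length converse and the pointwise adaptive budget for each representation value.

```python
from collections import defaultdict


def ceil_log2(n):
    """Integer ceil(log2 n) for n >= 1, avoiding floating-point rounding."""
    return (n - 1).bit_length()


def collision_fibers(entities, pi):
    """Group entities by representation value: fibers[u] plays the role of pi^{-1}(u)."""
    fibers = defaultdict(list)
    for e in entities:
        fibers[pi(e)].append(e)
    return dict(fibers)


def fixed_length_bits(fibers):
    """Smallest integer L with L >= log2(A_pi), where A_pi is the largest fiber size."""
    a_pi = max(len(f) for f in fibers.values())
    return ceil_log2(a_pi)


def adaptive_bits(fibers):
    """Pointwise budget ceil(log2 |pi^{-1}(u)|) for each representation value u."""
    return {u: ceil_log2(len(f)) for u, f in fibers.items()}


# Hypothetical toy data: a coarse "embedding" that maps each entity to a bucket.
bucket = {
    "cat": "feline", "lynx": "feline", "tiger": "feline",
    "dog": "canine", "wolf": "canine",
    "oak": "tree",
}
fibers = collision_fibers(list(bucket), bucket.get)
L_fixed = fixed_length_bits(fibers)   # driven by the 3-element "feline" fiber
L_adaptive = adaptive_bits(fibers)    # singleton fibers need no extra bits
print(L_fixed, L_adaptive)
```

In this toy case the fixed-length budget is set entirely by the largest fiber, while the adaptive budget drops to zero bits on the singleton fiber, which is the gap between the two bounds the abstract distinguishes.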

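Research Motivation and Objectives

Extend classical rate-distortion theory to a discrete classification setting with three resources: tag rate L, identification cost W, and distortion D.
Characterize when zero-error identification is achievable under attribute-only observation (the information barrier), and how tagging changes this.
Determine the Pareto-optimal tradeoff points in (L, W, D) space and the geometry of the tradeoff space they induce.
Reveal the matroid structure of minimal distinguishing query sets and define the distinguishing dimension.
Provide a concrete implementation and a machine-checked Lean 4 formalization.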

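Proposed Method

Define the observation model via a family of attribute interfaces plus optional nominal-tag access.
Prove the information barrier: when class profiles collide, interface queries alone cannot distinguish classes within the same equivalence class.
Prove that nominal tagging with L = ceil(log2 k) bits achieves zero-error identification at constant query cost, W = O(1).
Establish the matroid structure of minimal distinguishing query sets and derive a fixed-axis completeness result.
In the information-barrier regime, connect the unique Pareto-optimal point at D = 0 to rate-distortion theory.
Provide formal Lean 4 proofs (6,000+ lines, 265 theorems).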
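
The information barrier and the nominal-tag remedy described above can be illustrated on a toy observation model. This is a hedged sketch under stated assumptions: the class names, the attribute list, and the helper functions are all hypothetical, and attribute queries are modeled simply as set membership. Two classes share an identical profile, so no sequence of attribute queries separates them, while a tag of ceil(log2 k) bits identifies every class with a single lookup.

```python
# Hypothetical toy classes: each maps to the set of attributes it exposes.
CLASSES = {
    "Duck": {"quack", "walk"},
    "RobotDuck": {"quack", "walk"},  # same observable attributes as Duck
    "Dog": {"bark", "walk"},
    "Fish": {"swim"},
}
INTERFACES = ["quack", "walk", "bark", "swim"]


def profile(name):
    """Answers of one class to every attribute-interface query."""
    return tuple(attr in CLASSES[name] for attr in INTERFACES)


def has_information_barrier(names):
    """True if two classes share a profile, so queries alone cannot separate them."""
    profiles = [profile(n) for n in names]
    return len(set(profiles)) < len(profiles)


def assign_nominal_tags(names):
    """Give each class a distinct fixed-length binary tag of ceil(log2 k) bits."""
    k = len(names)
    bits = (k - 1).bit_length()  # integer form of ceil(log2 k) for k >= 1
    tags = {
        n: (format(i, "b").zfill(bits) if bits else "")
        for i, n in enumerate(names)
    }
    return tags, bits


names = list(CLASSES)
barrier = has_information_barrier(names)
tags, tag_bits = assign_nominal_tags(names)
decode = {t: n for n, t in tags.items()}  # one lookup recovers identity (W = 1)
print(barrier, tag_bits, tags)
```

The tags carry no semantic content; they only break ties inside colliding profiles, which is why their length depends on the number of classes and not on the attributes.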

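Experimental Results

Research Questions

RQ1: When can zero-error identification be achieved through interface observation alone, and how does tagging affect this capability?
RQ2: What is the minimum tag length required to restore zero-error identification in the information-barrier regime?
RQ3: What is the structure of the minimal query sets that distinguish all classes, and how does it relate to matroids?
RQ4: Within the proposed (L, W, D) framework, what are the Pareto-optimal tradeoffs among tag length, query cost, and distortion?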

Key Findings

Strategy	Tag L	Witness W
Nominal tagging (class ID)	L = ceil(log2 1000) = 10 bits	W = O(1)
Duck typing (attributes only)	L = 0	W ≤ 50 queries
Adaptive duck typing	L = 0	W ≥ d queries (d ≈ 5–15)
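
The 10-bit figure in the nominal-tagging row follows from bracketing the class count between consecutive powers of two. A short worked check, assuming (as that row indicates) k = 1000 classes:

```latex
2^{9} = 512 < 1000 \le 1024 = 2^{10}
\;\Longrightarrow\;
9 < \log_2 1000 \le 10
\;\Longrightarrow\;
L = \lceil \log_2 1000 \rceil = 10 \text{ bits}.
```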

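Zero-error identification via attribute queries is possible if and only if the class profile map is injective; otherwise an information barrier exists.
Nominal tagging with L = ceil(log2 k) bits achieves zero-error identification at constant query cost W = O(1), restoring the identifiability that non-injective profiles lose.
In the information-barrier regime, any scheme with D = 0 requires L ≥ log2 k; minimum-length tagging is tight and attains the bound.
Minimal distinguishing query sets form the bases of a matroid; the distinguishing dimension is the common size of all minimal sets and gives the lower bound on W for tag-free schemes.
At D = 0, the (L, W, D) framework yields a unique Pareto-optimal point, with nominal tagging offering the best overall efficiency; the results are machine-checked in Lean 4.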
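
The notion of a minimal distinguishing query set can be made concrete by brute force on a small example. The sketch below is illustrative only: the four classes, their three binary query answers, and the function names are hypothetical, and the code merely checks the equal-size property on this one instance without proving it in general. It enumerates every inclusion-minimal subset of queries that separates all classes and reports their sizes.

```python
from itertools import combinations

# Hypothetical profiles: answers of each class to three binary queries q0, q1, q2.
PROFILES = {
    "A": (0, 0, 0),
    "B": (0, 1, 1),
    "C": (1, 0, 1),
    "D": (1, 1, 0),
}


def distinguishes(query_subset, profiles):
    """True if the answers restricted to these query indices differ for all classes."""
    restricted = [tuple(p[i] for i in query_subset) for p in profiles.values()]
    return len(set(restricted)) == len(restricted)


def minimal_distinguishing_sets(profiles, n_queries):
    """Enumerate inclusion-minimal distinguishing subsets, smallest sizes first."""
    minimal = []
    for r in range(n_queries + 1):
        for subset in combinations(range(n_queries), r):
            # Skip supersets of an already-found minimal set.
            if any(set(m) <= set(subset) for m in minimal):
                continue
            if distinguishes(subset, profiles):
                minimal.append(subset)
    return minimal


minimal = minimal_distinguishing_sets(PROFILES, 3)
sizes = {len(m) for m in minimal}
print(minimal, sizes)
```

Here every pair of queries suffices and no single query does, so all minimal sets have size two; in the digest's terms, that common size is the distinguishing dimension of this toy instance and the least number of queries a tag-free scheme needs on it.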


This digest was generated by AI and reviewed by human editors.