QUICK REVIEW

[Paper Review] Patterns of i.i.d. Sequences and Their Entropy

Gil I. Shamir | arXiv (Cornell University) | May 10, 2006

Cellular Automata and Applications | Cited by 3

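One-Sentence Summary

This paper derives upper and lower bounds on the entropy of patterns of i.i.d. sequences and shows that, for large alphabets, the pattern entropy falls below the i.i.d. entropy, in many cases by more than the universal coding redundancy bounds from prior work. The bounds depend on the source entropy, the alphabet size, and the letter probabilities; for very large alphabets whose size may exceed the pattern length, low-probability letters are packed into a single symbol and correction terms are added to both bounds.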

ABSTRACT

Bounds on the entropy of patterns of sequences generated by independently identically distributed (i.i.d.) sources are derived. A pattern is a sequence of indices that contains all consecutive integer indices in increasing order of first occurrence. If the alphabet of a source that generated a sequence is unknown, the inevitable cost of coding the unknown alphabet symbols can be exploited to create the pattern of the sequence. This pattern can in turn be compressed by itself. The bounds derived here are functions of the i.i.d. source entropy, alphabet size, and letter probabilities. It is shown that for large alphabets, the pattern entropy must decrease from the i.i.d. one. The decrease is in many cases more significant than the universal coding redundancy bounds derived in prior works. The pattern entropy is confined between two bounds that depend on the arrangement of the letter probabilities in the probability space. For very large alphabets whose size may be greater than the coded pattern length, all low probability letters are packed into one symbol. The pattern entropy is upper and lower bounded in terms of the i.i.d. entropy of the new packed alphabet. Correction terms are provided for both upper and lower bounds. The bounds are used to approximate the pattern entropy for various specific distributions, with focus on uniform and monotonic ones. Tight bounds are obtained on the pattern entropy even for distributions that have infinite i.i.d. entropy rates.

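Motivation and Objectives

Derive bounds on the entropy of patterns of sequences generated by i.i.d. sources, particularly when the alphabet is unknown.
Quantify how far the pattern entropy falls below the i.i.d. entropy, and compare that decrease with universal coding redundancy bounds.
Analyze how the arrangement of the letter probabilities in the probability space affects the pattern entropy bounds.
Approximate the pattern entropy for specific distributions, with a focus on uniform and monotonic ones.
Obtain tight bounds on the pattern entropy even for distributions whose i.i.d. entropy rate is infinite.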

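Proposed Method

Define the pattern of a sequence as the sequence of indices obtained by replacing each symbol with the order of its first occurrence, so that the indices are consecutive integers appearing in increasing order of first occurrence.
Express the pattern entropy as a function of the i.i.d. source entropy, the alphabet size, and the individual letter probabilities.
For very large alphabets, introduce a packing step that merges all low-probability letters into a single symbol.
Bound the pattern entropy from above and below in terms of the i.i.d. entropy of the packed alphabet, with explicit correction terms for both bounds.
Apply the bounds to specific distributions, including uniform and monotonic ones, to approximate the pattern entropy.
Relate the pattern entropy to the original i.i.d. source entropy, particularly in the large-alphabet regime.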
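
The pattern definition and the packing step listed above can be illustrated with a minimal Python sketch. This is not code from the paper: the function names `pattern` and `pack_low_probability` and the `threshold` parameter are assumptions made here for illustration, and the paper's actual rule for choosing which letters to pack is not reproduced.

```python
from typing import Dict, Hashable, List, Sequence


def pattern(seq: Sequence[Hashable]) -> List[int]:
    """Replace each symbol by the 1-based order in which it first appeared."""
    first_seen: Dict[Hashable, int] = {}
    out: List[int] = []
    for sym in seq:
        if sym not in first_seen:
            # A new symbol receives the next unused consecutive index.
            first_seen[sym] = len(first_seen) + 1
        out.append(first_seen[sym])
    return out


def pack_low_probability(probs: Sequence[float], threshold: float) -> List[float]:
    """Merge every letter with probability below `threshold` into one symbol.

    `threshold` is a hypothetical cutoff chosen for illustration only.
    """
    kept = [p for p in probs if p >= threshold]
    packed_mass = sum(p for p in probs if p < threshold)
    # Append the packed symbol only if some letters were actually merged.
    return kept + ([packed_mass] if packed_mass > 0 else [])
```

For example, `pattern("abracadabra")` yields `[1, 2, 3, 1, 4, 1, 5, 1, 2, 3, 1]`: the pattern keeps the repetition structure of the sequence and discards the identities of the letters.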

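Results

Research Questions

RQ1: How does the entropy of patterns of i.i.d. sequences compare with the entropy of the original source, especially for large alphabets?
RQ2: To what extent can the pattern entropy be bounded in terms of the i.i.d. source entropy and the letter probabilities?
RQ3: When low-probability letters are packed into a single symbol, how do the correction terms affect the accuracy of the pattern entropy bounds?
RQ4: Can tight bounds on the pattern entropy be obtained even for distributions whose i.i.d. entropy rate is infinite?
RQ5: How does the magnitude of the entropy decrease compare with existing universal coding redundancy bounds?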

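Key Findings

For large alphabets, the pattern entropy must decrease from the i.i.d. source entropy, and in many cases the decrease exceeds the universal coding redundancy bounds derived in prior work.
The pattern entropy lies between two bounds that depend on the arrangement of the letter probabilities in the probability space, not only on the entropy or the alphabet size.
For very large alphabets whose size may exceed the pattern length, packing all low-probability letters into a single symbol bounds the pattern entropy in terms of the i.i.d. entropy of the packed alphabet, with correction terms for both bounds.
Tight bounds on the pattern entropy are obtained even for distributions whose i.i.d. entropy rate is infinite.
The bounds are used to approximate the pattern entropy for specific distributions, with a focus on uniform and monotonic ones.
The comparison with universal coding redundancy concerns the size of the entropy decrease: for large alphabets, the gap between the i.i.d. entropy and the pattern entropy is often larger than the redundancy bounds.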
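
To make the first finding concrete, the pattern entropy of a very small source can be computed exactly by enumeration and compared with the i.i.d. entropy of the same block. The sketch below is a brute-force check written for this review, not the paper's method (the paper derives analytic bounds); the function names are hypothetical, and the enumeration grows as k^n, so it is only usable for tiny alphabets and short sequences.

```python
import itertools
import math
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def iid_entropy_bits(probs: Sequence[float], n: int) -> float:
    """Entropy in bits of n i.i.d. draws from `probs`, i.e. n * H(probs)."""
    return -n * sum(p * math.log2(p) for p in probs if p > 0)


def pattern_entropy_bits(probs: Sequence[float], n: int) -> float:
    """Exact entropy in bits of the pattern of a length-n i.i.d. sequence."""
    mass: Dict[Tuple[int, ...], float] = defaultdict(float)
    for seq in itertools.product(range(len(probs)), repeat=n):
        first_seen: Dict[int, int] = {}
        # Map each letter to the order of its first occurrence (1-based).
        pat = tuple(first_seen.setdefault(s, len(first_seen) + 1) for s in seq)
        # All sequences that share a pattern pool their probability.
        mass[pat] += math.prod(probs[s] for s in seq)
    return -sum(q * math.log2(q) for q in mass.values() if q > 0)
```

For a uniform source over three letters and n = 3, the 27 equally likely sequences collapse into five patterns (111, 112, 121, 122, 123), so `pattern_entropy_bits([1/3, 1/3, 1/3], 3)` is smaller than `iid_entropy_bits([1/3, 1/3, 1/3], 3)`.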


This review was generated by AI and reviewed by a human editor.