QUICK REVIEW

[Paper Review] A data-based power transformation for compositional data

Michail Tsagris, Simon Preston | arXiv (Cornell University) | Jun 7, 2011

Geochemistry and Geologic Mapping | References: 17 | Cited by: 36

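ONE-SENTENCE SUMMARY

This paper proposes a data-driven power transformation framework for compositional data that treats raw data analysis (RDA) and log-ratio analysis (LRA) as special cases of a single-parameter Box-Cox-type transformation. The transformation parameter α is chosen by optimizing a criterion such as the profile log-likelihood or classification accuracy, so the geometry of the simplex adapts to the data; the reported benefit is better estimates of central tendency and better model fit, illustrated by the optimal value α = 0.362 for the Arctic lake data.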

ABSTRACT

Compositional data analysis is carried out either by neglecting the compositional constraint and applying standard multivariate data analysis, or by transforming the data using the logs of the ratios of the components. In this work we examine a more general transformation which includes both approaches as special cases. It is a power transformation and involves a single parameter, α. The transformation has two equivalent versions. The first is the stay-in-the-simplex version, which is the power transformation as defined by Aitchison in 1986. The second version, which is a linear transformation of the power transformation, is a Box-Cox type transformation. We discuss a parametric way of estimating the value of α, which is maximization of its profile likelihood (assuming multivariate normality of the transformed data) and the equivalence between the two versions is exhibited. Other ways include maximization of the correct classification probability in discriminant analysis and maximization of the pseudo R-squared (as defined by Aitchison in 1986) in linear regression. We examine the relationship between the α-transformation, the raw data approach and the isometric log-ratio transformation. Furthermore, we also define a suitable family of metrics corresponding to the family of α-transformation and consider the corresponding family of Frechet means.

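MOTIVATION AND OBJECTIVES

Overcome the limitation of a fixed geometry in compositional data analysis by letting the data determine the transformation parameter.
Build a flexible, unified framework based on a power transformation in which RDA (α = 1) and LRA (α → 0) are special cases.
Provide a practical way to select the optimal transformation parameter α from the characteristics of the data and the goal of the analysis.
Show that the choice between RDA and LRA should be driven by the data rather than by a priori assumptions.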

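PROPOSED METHOD

Propose a one-parameter family of power transformations for compositional data. For α ≠ 0, the stay-in-the-simplex version is Aitchison's power transformation, u_i = x_i^α / sum_j x_j^α; the Box-Cox-type version is a linear transformation of u and tends to the isometric log-ratio transformation as α → 0.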
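Define an α-distance on the simplex as the Euclidean distance in the transformed space; it reduces to the RDA distance at α = 1 and to the LRA distance as α → 0.
Use the Fréchet mean under the α-distance as the measure of central tendency; it equals the arithmetic mean at α = 1 and converges to the closed geometric mean as α → 0.
Choose α by maximizing the profile log-likelihood, the cross-validated classification rate, or the pseudo-R² in regression, so that the most suitable transformation is selected.
Apply the method to real and artificial data sets, including the Arctic lake data, and compare performance across values of α.
Visualize the results in ternary diagrams, comparing the Fréchet means obtained for different values of α.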
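
The transformation, the α-distance and the Fréchet mean can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code. It assumes the stay-in-the-simplex version is u_i = x_i^α / sum_j x_j^α and that the Box-Cox-type version is z = H(D·u − 1)/α, with H a Helmert sub-matrix; the function names (helmert_sub, power_transform, alpha_transform, alpha_distance, frechet_mean) and the tolerance tol are hypothetical choices made here.

```python
import numpy as np

def helmert_sub(D):
    """(D-1) x D Helmert sub-matrix: orthonormal rows, each orthogonal to the ones vector."""
    H = np.zeros((D - 1, D))
    for i in range(1, D):
        H[i - 1, :i] = 1.0
        H[i - 1, i] = -float(i)
        H[i - 1] /= np.sqrt(i * (i + 1))
    return H

def power_transform(X, alpha):
    """Stay-in-the-simplex version: raise each part to alpha, then re-close to sum 1."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    P = X ** alpha
    return P / P.sum(axis=1, keepdims=True)

def alpha_transform(X, alpha, tol=1e-8):
    """Box-Cox-type version (assumed form): z = H (D u - 1) / alpha, ilr as alpha -> 0."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    D = X.shape[1]
    H = helmert_sub(D)
    if abs(alpha) < tol:
        # Limiting case: centred log-ratio projected by H (isometric log-ratio).
        L = np.log(X)
        return (L - L.mean(axis=1, keepdims=True)) @ H.T
    return (D * power_transform(X, alpha) - 1.0) @ H.T / alpha

def alpha_distance(x, y, alpha):
    """Euclidean distance between two compositions after the alpha-transformation."""
    return float(np.linalg.norm(alpha_transform(x, alpha) - alpha_transform(y, alpha)))

def frechet_mean(X, alpha, tol=1e-8):
    """Minimiser of the summed squared alpha-distances, returned as a composition."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    if abs(alpha) < tol:
        log_m = np.log(X).mean(axis=0)          # closed geometric mean
    else:
        # Average in power-transformed space, then invert the power transformation.
        log_m = np.log(power_transform(X, alpha).mean(axis=0)) / alpha
    m = np.exp(log_m - log_m.max())             # shift for numerical stability
    return m / m.sum()

# Made-up three-part compositions, for illustration only.
X = np.array([[0.2, 0.3, 0.5],
              [0.1, 0.6, 0.3],
              [0.4, 0.4, 0.2]])
for a in (1.0, 0.5, 0.0):
    print(a, frechet_mean(X, a), alpha_distance(X[0], X[1], a))
```

Because the rows of H are orthonormal, the distance computed here equals (D/|α|) times the Euclidean distance between the power-transformed compositions, so H affects the coordinates but not the geometry.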

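EXPERIMENTAL RESULTS

Research questions

RQ1: Can a unified framework be built in which RDA and LRA are both special cases of a more general power transformation?
RQ2: Does the optimal transformation parameter α differ across compositional data sets, and can it be selected with data-based criteria?
RQ3: How does the choice of α affect the geometry of the simplex and the resulting estimates of central tendency?
RQ4: Is there empirical evidence that a data-based choice of α outperforms fixed RDA or LRA in model fit or classification performance?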

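Key findings

For the Arctic lake data, the profile log-likelihood of α is maximized at α = 0.362, indicating that this transformation fits better than either RDA (α = 1) or LRA (α → 0).
In the ternary diagram, the Fréchet mean at α = 0.362 gives a more representative central location than the arithmetic mean (α = 1) or the closed geometric mean (α → 0).
The data-based power transformation framework adapts the geometry of the simplex to the underlying structure of the data, improving interpretability and model fit.
Compared with the fixed approaches, the method performs better in estimating central tendency and in model fit, particularly when the data depart from a log-normal or linear structure.
The framework enables a principled, data-based choice of transformation and avoids arbitrary assumptions about the appropriate geometry for compositional data.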
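
Selecting α by profile log-likelihood can be sketched as a grid search. The sketch below is an assumption-laden illustration, not the paper's implementation. It takes the Box-Cox-type transformation to be z = H(D·u − 1)/α with u_i = x_i^α / sum_j x_j^α, assumes the transformed data are multivariate normal, and uses a log-Jacobian derived here for that form (up to an additive constant): (α − 1)·sum_j log x_j − D·log(sum_j x_j^α) per observation. The data are synthetic Dirichlet draws, not the Arctic lake data, and all function names are hypothetical.

```python
import numpy as np

def helmert_sub(D):
    """(D-1) x D Helmert sub-matrix: orthonormal rows orthogonal to the ones vector."""
    H = np.zeros((D - 1, D))
    for i in range(1, D):
        H[i - 1, :i] = 1.0
        H[i - 1, i] = -float(i)
        H[i - 1] /= np.sqrt(i * (i + 1))
    return H

def alpha_transform(X, alpha, tol=1e-8):
    """Assumed Box-Cox-type form z = H (D u - 1) / alpha, with the ilr limit near 0."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    D = X.shape[1]
    H = helmert_sub(D)
    if abs(alpha) < tol:
        L = np.log(X)
        return (L - L.mean(axis=1, keepdims=True)) @ H.T
    P = X ** alpha
    U = P / P.sum(axis=1, keepdims=True)
    return (D * U - 1.0) @ H.T / alpha

def profile_loglik(X, alpha):
    """Profile log-likelihood of alpha, up to an additive constant free of alpha."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    n, D = X.shape
    Z = alpha_transform(X, alpha)
    # Maximum-likelihood covariance (divisor n) of the transformed data.
    S = np.cov(Z, rowvar=False, bias=True).reshape(D - 1, D - 1)
    _, logdet = np.linalg.slogdet(S)
    # Log-Jacobian of the transformation, summed over observations (derived here).
    log_jac = (alpha - 1.0) * np.log(X).sum() - D * np.log((X ** alpha).sum(axis=1)).sum()
    return -0.5 * n * logdet + log_jac

# Synthetic three-part compositions; parameters chosen arbitrarily for illustration.
rng = np.random.default_rng(0)
X = rng.dirichlet([4.0, 2.0, 3.0], size=200)

grid = np.linspace(-1.0, 1.0, 41)
values = np.array([profile_loglik(X, a) for a in grid])
alpha_hat = grid[np.argmax(values)]
print("alpha maximizing the profile log-likelihood on the grid:", alpha_hat)
```

A grid keeps the sketch transparent; a one-dimensional optimizer could refine the estimate once the grid has located the peak. The same loop could score α by cross-validated classification rate or pseudo-R² instead, by swapping profile_loglik for the corresponding criterion.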

This review was generated by AI and reviewed by a human editor.