QUICK REVIEW

[Paper Review] Complex-valued embeddings of generic proximity data

Maximilian Münch, Michiel Straat|arXiv (Cornell University)|Aug 31, 2020

Text and Document Classification Technologies|References: 21|Cited by: 3

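One-sentence summary

This paper proposes a complex-valued embedding that converts non-metric or non-positive-semidefinite (non-psd) proximity data into fixed-length complex vectors, so that standard machine learning algorithms can be applied. Using a low-rank approximation and a norm-based correction, the method preserves the information in the original data and achieves higher classification accuracy on benchmark datasets than traditional approaches that use uncorrected kernel matrices.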

ABSTRACT

Proximities are at the heart of almost all machine learning methods. If the input data are given as numerical vectors of equal lengths, euclidean distance, or a Hilbertian inner product is frequently used in modeling algorithms. In a more generic view, objects are compared by a (symmetric) similarity or dissimilarity measure, which may not obey particular mathematical properties. This renders many machine learning methods invalid, leading to convergence problems and the loss of guarantees, like generalization bounds. In many cases, the preferred dissimilarity measure is not metric, like the earth mover distance, or the similarity measure may not be a simple inner product in a Hilbert space but in its generalization a Krein space. If the input data are non-vectorial, like text sequences, proximity-based learning is used or ngram embedding techniques can be applied. Standard embeddings lead to the desired fixed-length vector encoding, but are costly and have substantial limitations in preserving the original data's full information. As an information preserving alternative, we propose a complex-valued vector embedding of proximity data. This allows suitable machine learning algorithms to use these fixed-length, complex-valued vectors for further processing. The complex-valued data can serve as an input to complex-valued machine learning algorithms. In particular, we address supervised learning and use extensions of prototype-based learning. The proposed approach is evaluated on a variety of standard benchmarks and shows strong performance compared to traditional techniques in processing non-metric or non-psd proximity data.

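Research motivation and goals

Address the limitations of standard machine learning algorithms on non-metric or non-psd proximity data, which often cause convergence problems and void generalization guarantees.
Develop an information-preserving embedding that converts generic proximity matrices into fixed-length complex-valued vectors suitable for downstream learning.
Enable mature, efficient machine learning algorithms, in particular complex-valued prototype-based learners, to be applied to data with inherently indefinite similarities or dissimilarities.
Provide a computationally efficient method that supports out-of-sample extension, overcoming a key limitation of traditional indefinite kernel learning.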

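Proposed method

The method builds a low-rank approximation of the original proximity matrix from sampled landmarks; the number of landmarks is set to 40, 70, or 100 depending on dataset size.
A complex-valued embedding matrix is constructed through a transformation that preserves the spectral structure of the original proximity data and ensures numerical stability.
A norm-based correction is applied to the embedding matrix to enforce a positive semidefinite (psd) structure, making it usable with psd-based machine learning models.
The embedded complex-valued vectors serve as input to complex-valued learning algorithms, including complex-valued generalized learning vector quantization (cGLVQ) and its matrix variant (cGMLVQ).
cGMLVQ adds relevance learning, which adaptively weights features in the complex space and improves model performance.
Because the embedding is explicit, the method naturally supports out-of-sample extension, which many traditional kernel methods lack.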
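
The embedding and out-of-sample steps listed above can be illustrated with a minimal NumPy sketch. It assumes the textbook eigendecomposition route: a symmetric, possibly indefinite similarity matrix S = U Λ Uᵀ is embedded as X = U Λ^(1/2), where negative eigenvalues yield imaginary coordinates, so that X Xᵀ (plain transpose, no conjugation) reproduces S. The function names `complex_embedding` and `embed_new` are hypothetical, and the sketch omits the norm-based correction, whose details this summary does not give. Whether the paper's construction matches this in every detail is not confirmed here.

```python
import numpy as np

def complex_embedding(S, rank=None, tol=1e-10):
    """Embed a symmetric, possibly indefinite similarity matrix S.

    Returns X (rows are complex vectors), the eigenvectors U that were
    kept, and the complex square roots of the kept eigenvalues.
    Illustrative sketch only, not the authors' implementation.
    """
    S = 0.5 * (S + S.T)                       # enforce exact symmetry
    eigvals, eigvecs = np.linalg.eigh(S)      # real eigenvalues, may be negative
    order = np.argsort(-np.abs(eigvals))      # largest magnitude first
    if rank is not None:
        order = order[:rank]                  # low-rank truncation
    order = order[np.abs(eigvals[order]) > tol]   # drop near-zero eigenvalues
    lam = eigvals[order]
    U = eigvecs[:, order]
    sqrt_lam = np.sqrt(lam.astype(complex))   # imaginary for negative eigenvalues
    X = U * sqrt_lam                          # X @ X.T approximates S
    return X, U, sqrt_lam

def embed_new(s_new, U, sqrt_lam):
    """Out-of-sample extension: map similarities to the embedded points
    (one row per new object) into the same complex space."""
    return (np.atleast_2d(s_new) @ U) / sqrt_lam
```

In a landmark setting, S would be the small landmark-by-landmark block and `s_new` the similarities of any object to those landmarks, so every object (training or new) is embedded through `embed_new`; this reading of the landmark step is an assumption.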

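Experimental results

Research questions

RQ1: Can complex-valued embeddings effectively represent non-psd proximity data while retaining the information that machine learning needs?
RQ2: Does the proposed embedding achieve higher classification accuracy than standard methods on indefinite proximity data?
RQ3: How much does adding relevance learning to complex-valued GLVQ improve performance over standard cGLVQ?
RQ4: To what extent does the norm-based correction of the embedding matrix improve model stability and generalization?
RQ5: Does the embedding support efficient out-of-sample extension, a key challenge in traditional indefinite kernel learning?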
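
To make the contrast in RQ3 concrete, the sketch below shows the distance measures usually associated with the two classifiers: a squared Hermitian distance for cGLVQ, and the same distance after a learned linear map Ω for cGMLVQ, which is how relevance learning reweights features. It also includes the standard GLVQ relative-distance term. These forms follow the general LVQ literature and are an assumption here; the summary does not state the exact formulas the paper uses, and all function names are hypothetical.

```python
import numpy as np

def cglvq_distance(x, w):
    """Squared Hermitian distance (x - w)^H (x - w) between a complex
    sample x and a complex prototype w."""
    diff = x - w
    return float(np.real(np.vdot(diff, diff)))   # vdot conjugates its first argument

def cgmlvq_distance(x, w, omega):
    """Relevance-weighted distance (x - w)^H Omega^H Omega (x - w).
    omega is a (possibly complex) matrix that would be learned in training."""
    proj = omega @ (x - w)
    return float(np.real(np.vdot(proj, proj)))

def glvq_mu(d_correct, d_wrong):
    """Standard GLVQ relative distance: negative when the closest
    correct-class prototype is nearer than the closest wrong-class one."""
    return (d_correct - d_wrong) / (d_correct + d_wrong)
```

With `omega` set to the identity matrix the two distances coincide, so cGMLVQ can be read as cGLVQ plus a learned feature weighting.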

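Key findings

The complex-valued generalized learning vector quantizer (cGLVQ) substantially outperforms the nearest neighbor classifier on uncorrected indefinite data, especially on challenging datasets such as Balls3d (accuracy 0.61 vs. 0.48).
The cGMLVQ variant with relevance learning outperforms both cGLVQ and the nearest neighbor classifier on several datasets, including Protein (accuracy 0.98 vs. 0.22) and Zongker (accuracy 0.92 vs. 0.58).
On the Chromosomes dataset the nearest neighbor classifier performs best (accuracy 0.95). This is attributed to a favorable eigenvalue spectrum in which the negative eigenvalues are mostly tiny or close to zero, which indicates that performance depends on spectral properties.
Even without relevance learning, cGLVQ outperforms the nearest neighbor classifier on uncorrected data in most cases, which suggests that the correction step in the embedding is essential for reliable performance.
The method performs well across a range of benchmark datasets, including sequence data (DelftGestures), biological sequences (Protein), and synthetic data (Balls), which indicates broad applicability.
The low-rank embedding approximates the original kernel matrix well, with low reconstruction error, and retains the main information in the original proximity data.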
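
The remark about the Chromosomes spectrum suggests a simple diagnostic: measure how much of a similarity matrix's spectrum is negative. The sketch below computes the negative eigenfraction, a common indicator in the dissimilarity-learning literature; it is offered as an illustration and is not confirmed to be the measure the paper reports. The function name is hypothetical.

```python
import numpy as np

def negative_eigenfraction(S):
    """Share of the total absolute eigenvalue mass carried by negative
    eigenvalues of the symmetrized matrix S (0.0 means S is psd)."""
    S = 0.5 * (S + S.T)                    # enforce exact symmetry
    eigvals = np.linalg.eigvalsh(S)        # real eigenvalues of a symmetric matrix
    total = np.sum(np.abs(eigvals))
    if total == 0:
        return 0.0                         # all-zero matrix: nothing negative
    return float(np.sum(np.abs(eigvals[eigvals < 0])) / total)
```

A value near zero would correspond to the situation described for Chromosomes, where ignoring the negative part of the spectrum costs little; a large value signals strongly indefinite data.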

This review was generated by AI and reviewed by a human editor.