QUICK REVIEW

[Paper Review] Bonsai -- Diverse and Shallow Trees for Extreme Multi-label Classification

Sujay Khandagale, Han Xiao|arXiv (Cornell University)|Apr 17, 2019

Text and Document Classification Technologies|References: 42|Citations: 111

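One-sentence summary

Bonsai is a family of XMC methods that combine generalized label representations with shallow, high-fan-out trees. It trains quickly, improves prediction accuracy on tail labels, outperforms state-of-the-art tree-based methods, and rivals one-vs-rest methods on large datasets.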

ABSTRACT

Extreme multi-label classification (XMC) refers to supervised multi-label learning involving hundreds of thousands or even millions of labels. In this paper, we develop a suite of algorithms, called Bonsai, which generalizes the notion of label representation in XMC, and partitions the labels in the representation space to learn shallow trees. We show three concrete realizations of this label representation space including: (i) the input space which is spanned by the input features, (ii) the output space spanned by label vectors based on their co-occurrence with other labels, and (iii) the joint space by combining the input and output representations. Furthermore, the constraint-free multi-way partitions learnt iteratively in these spaces lead to shallow trees. By combining the effect of shallow trees and generalized label representation, Bonsai achieves the best of both worlds - fast training which is comparable to state-of-the-art tree-based methods in XMC, and much better prediction accuracy, particularly on tail-labels. On a benchmark Amazon-3M dataset with 3 million labels, Bonsai outperforms a state-of-the-art one-vs-rest method in terms of prediction accuracy, while being approximately 200 times faster to train. The code for Bonsai is available at https://github.com/xmc-aalto/bonsai
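
The three label representation spaces named in the abstract can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names `l2_normalize` and `label_representations` are hypothetical, and the choices to L2-normalize each vector and to build the joint representation by concatenating the two normalized vectors are assumptions of this example.

```python
from math import sqrt


def l2_normalize(vec):
    """Scale a vector to unit length (assumption of this sketch)."""
    norm = sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def label_representations(X, Y):
    """Build one vector per label in the input, output, and joint spaces.

    X: list of n feature rows (each of length d).
    Y: list of n binary label rows (each of length L).
    """
    n_features, n_labels = len(X[0]), len(Y[0])
    v_in = [[0.0] * n_features for _ in range(n_labels)]
    v_out = [[0.0] * n_labels for _ in range(n_labels)]
    for x, y in zip(X, Y):
        active = [l for l, flag in enumerate(y) if flag]
        for l in active:
            # Input space: sum the features of instances where label l is active.
            for j, value in enumerate(x):
                v_in[l][j] += value
            # Output space: count co-occurrences with the labels of those instances.
            for m in active:
                v_out[l][m] += 1.0
    v_in = [l2_normalize(v) for v in v_in]
    v_out = [l2_normalize(v) for v in v_out]
    # Joint space: concatenation of both views (assumed combination rule).
    v_joint = [a + b for a, b in zip(v_in, v_out)]
    return v_in, v_out, v_joint


# Toy data: 3 instances, 2 features, 3 labels.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[1, 0, 1], [0, 1, 1], [1, 0, 0]]
v_in, v_out, v_joint = label_representations(X, Y)
print(len(v_joint[0]))  # feature dimension plus label dimension
```

Any of the three sets of vectors could then be handed to a clustering routine to partition the labels.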

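Motivation and objectives

Motivates efficient extreme multi-label classification (XMC) for label sets that follow a power-law distribution and therefore contain many tail labels.
Proposes a generalized label representation framework that extends beyond input-space label representations.
Develops shallow trees with a high branching factor that reduce error propagation.
Shows that combining diverse label representations with shallow trees yields fast training and high accuracy, particularly on tail labels.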

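Proposed method

Represents each label in one of three spaces: the input space (the sum of the instances in which the label is active), the output space (co-occurrence with other labels), or the joint space that combines the input and output representations.
Partitions the labels at each node into K clusters with K-means; K is typically large (K ≥ 100), which produces shallow trees and diverse partitions.
Trains K-way one-vs-rest linear classifiers at each non-leaf node to route predictions through the tree; the leaf nodes predict the actual labels.
Allows constraint-free multi-way (K-ary) partitions by default, which encourages diversity and avoids the deep error propagation seen in methods such as Parabel.
Uses beam search at prediction time to traverse the tree and evaluate the leaf-node classifiers, which mitigates propagated errors.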
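
The prediction step described above can be sketched as a beam search over a label tree. This is a minimal illustration, not the released Bonsai code: the nested-dict tree layout, the function name `beam_search_predict`, the parameters `beam_width` and `top_k`, and the use of a sigmoid over linear scores multiplied along the path are all assumptions made for this example.

```python
import math


def log_sigmoid(z):
    """Numerically stable log of the logistic function."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def beam_search_predict(tree, x, beam_width=2, top_k=3):
    """Return up to top_k (label, score) pairs for feature vector x.

    Internal node: {"children": [(weights, child_node), ...]}
    Leaf node:     {"labels":   [(weights, label_id), ...]}
    A path score is the product of sigmoid outputs along the path.
    """
    beam = [(0.0, tree)]  # (log-score of the path so far, node)
    label_scores = {}
    while beam:
        candidates = []
        for log_score, node in beam:
            if "labels" in node:
                # Leaf: score the actual labels with their own classifiers.
                for w, label in node["labels"]:
                    label_scores[label] = log_score + log_sigmoid(dot(w, x))
            else:
                # Internal node: score every child with its linear classifier.
                for w, child in node["children"]:
                    candidates.append((log_score + log_sigmoid(dot(w, x)), child))
        # Keep only the best beam_width nodes for the next level.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
    ranked = sorted(label_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(label, math.exp(score)) for label, score in ranked[:top_k]]


# Toy tree: one root with two leaf clusters holding two labels each.
toy_tree = {"children": [
    ([2.0, 0.0], {"labels": [([1.0, 0.0], 0), ([-1.0, 0.0], 1)]}),
    ([0.0, 2.0], {"labels": [([0.0, -1.0], 2), ([0.0, 1.0], 3)]}),
]}
print(beam_search_predict(toy_tree, [1.0, -1.0], beam_width=1))
```

With `beam_width=1` only the best-scoring cluster is expanded, so a routing mistake at the root cannot be recovered; a wider beam keeps more clusters alive at the cost of evaluating more classifiers. A shallow tree has fewer levels at which such a mistake can occur.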

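Experimental results

Research questions

RQ1: Do generalized label representations improve partition quality and tail-label coverage in XMC?
RQ2: Do shallow trees with a high branching factor suppress error propagation and improve tail-label accuracy compared with binary trees?
RQ3: How do the input-space, output-space, and joint-space label representations compare within Bonsai, both individually and in combination?
RQ4: How do Bonsai's practical training speed and scalability on web-scale label sets (for example, millions of labels) compare with state-of-the-art methods?
RQ5: How well does Bonsai perform across diverse datasets that differ in tail-label distribution and number of features?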

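Key findings

With generalized label representations, Bonsai achieves strong predictive performance and tail-label coverage on multiple datasets.
High-fan-out, shallow trees (K ≥ 100) reduce error propagation compared with deep binary trees and improve accuracy on tail labels.
The joint input-output representation (Bonsai-io) often outperforms the input-only (Bonsai-i) and output-only (Bonsai-o) variants, most noticeably when the average number of labels is high.
On the Amazon-3M dataset (3 million labels), Bonsai trains far faster than a state-of-the-art one-vs-rest method (roughly 200 times) while remaining competitive with it in prediction accuracy; the abstract reports that Bonsai outperforms it.
Across the datasets (EURLex-4K, Wikipedia-31K, WikiLSHTC-325K, Wikipedia-500K, Amazon-670K, Amazon-3M), the Bonsai variants consistently outperform Parabel on precision@k and nDCG@k. DiSMEC outperforms Bonsai on some datasets, but at a much higher training cost.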
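
For readers unfamiliar with the two metrics named above, the sketch below computes precision@k and nDCG@k for a single instance using their commonly used definitions with binary relevance. It is not the paper's evaluation script; the function names and the toy ranking are hypothetical.

```python
import math


def precision_at_k(ranked_labels, true_labels, k):
    """Fraction of the top-k predicted labels that are relevant."""
    hits = sum(1 for label in ranked_labels[:k] if label in true_labels)
    return hits / k


def ndcg_at_k(ranked_labels, true_labels, k):
    """DCG of the top-k predictions divided by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, label in enumerate(ranked_labels[:k])
        if label in true_labels
    )
    ideal_hits = min(k, len(true_labels))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: predicted ranking versus the set of relevant labels.
ranked = [3, 7, 1, 9]
relevant = {3, 1, 5}
print(precision_at_k(ranked, relevant, 3))
print(ndcg_at_k(ranked, relevant, 3))
```

Dataset-level scores are usually obtained by averaging these per-instance values over the test set.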


This review was generated by AI and checked by a human editor.