QUICK REVIEW

[Paper Review] An Improved k-Nearest Neighbor Algorithm for Text Categorization

Baoli Li, Shiwen Yu|ArXiv.org|Jun 16, 2003

Text and Document Classification Technologies|References: 2|Citations: 86

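One-Sentence Summary

This paper proposes a category-adaptive k-NN algorithm that dynamically adjusts the number of nearest neighbors used for each class according to that class's frequency in the training data. The method reduces the bias toward large classes, improves classification of small classes, and lowers sensitivity to the choice of k, particularly in settings where cross-validation is not feasible. Preliminary experiments on Chinese text support its effectiveness.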

ABSTRACT

k is the most important parameter in a text categorization system based on the k-Nearest Neighbor algorithm (kNN). In the classification process, the k nearest documents to the test one in the training set are determined first. Then, the prediction can be made according to the category distribution among these k nearest neighbors. Generally speaking, the class distribution in the training set is uneven. Some classes may have more samples than others. Therefore, the system performance is very sensitive to the choice of the parameter k. And it is very likely that a fixed k value will result in a bias on large categories. To deal with these problems, we propose an improved kNN algorithm, which uses different numbers of nearest neighbors for different categories, rather than a fixed number across all categories. More samples (nearest neighbors) will be used for deciding whether a test document should be classified to a category that has more samples in the training set. Preliminary experiments on Chinese text categorization show that our method is less sensitive to the parameter k than the traditional one, and it can properly classify documents belonging to smaller classes with a large k. The method is promising for some cases where estimating the parameter k via cross-validation is not allowed.
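
To make the bias described in the abstract concrete, here is a minimal Python sketch of conventional fixed-k kNN with a plain majority vote. The cosine similarity, the function names (`cosine`, `knn_fixed_k`), and the two-dimensional toy vectors are illustrative assumptions and do not come from the paper; real documents would be high-dimensional term vectors.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_fixed_k(train, query, k):
    """Majority vote among the k training documents most similar to `query`.

    `train` is a list of (vector, label) pairs.
    """
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up toy data (an assumption for illustration): 6 "big" docs, 2 "small" docs.
train = [
    ([1.00, 0.00], "big"), ([0.95, 0.05], "big"), ([0.90, 0.10], "big"),
    ([0.85, 0.15], "big"), ([0.80, 0.20], "big"), ([0.70, 0.30], "big"),
    ([0.00, 1.00], "small"), ([0.10, 0.90], "small"),
]
query = [0.05, 0.95]  # clearly closest to the two "small" documents

for k in (1, 3, 5):
    print(k, knn_fixed_k(train, query, k))
# 1 small
# 3 small
# 5 big
```

On this toy set the two documents of the small class are the nearest neighbors, so k = 1 and k = 3 return "small", but at k = 5 the three additional neighbors all come from the large class and outvote them.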

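Motivation and Objectives

Correct the bias in conventional k-NN text categorization caused by uneven class distribution in the training data.
Allow a different k value per class in order to reduce sensitivity to the choice of a single fixed k.
Improve classification accuracy on minority (small) classes without using cross-validation to select k.
Build a method that, based on training-data statistics, assigns more neighbors to frequent classes and fewer to rare ones.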

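Proposed Method

Each category is assigned its own k value according to the number of training samples that belong to it.
For each test document, nearest neighbors are selected separately for each class, using that class's own k value.
The final classification is decided by a majority vote over the neighbors selected for each class.
The number of neighbors used for a class is proportional to that class's size in the training set, so larger classes are assigned more neighbors.
A dynamic k value, computed as a function of class frequency, replaces a single fixed k.
Because the per-class k is applied independently to each class at classification time, larger classes contribute more neighbors to the decision.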
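
The steps above can be sketched in Python as follows. The review states only that the per-class neighbor count is proportional to class size, so the concrete choices here are assumptions: the largest class gets k neighbors, every other class gets k scaled by its relative size (rounded up), and each class is scored by its share of the cosine similarity among its own top neighbors, which is one similarity-weighted reading of the vote. The names `per_class_k` and `knn_adaptive` and the toy data are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def per_class_k(labels, k):
    """Assumed scaling: k_c = ceil(k * N_c / N_max), using integer arithmetic."""
    sizes = Counter(labels)
    largest = max(sizes.values())
    return {c: (k * n + largest - 1) // largest for c, n in sizes.items()}


def knn_adaptive(train, query, k):
    """Score each class on its own top-k_c neighbors and return the best class.

    `train` is a list of (vector, label) pairs. The score of class c is the
    fraction of total similarity among the top-k_c neighbors that comes from
    documents labeled c (an assumed scoring rule, not confirmed by the review).
    """
    ranked = sorted(
        ((cosine(vec, query), label) for vec, label in train),
        key=lambda pair: pair[0],
        reverse=True,
    )
    k_by_class = per_class_k([label for _, label in train], k)
    scores = {}
    for c, k_c in k_by_class.items():
        top = ranked[:k_c]
        total = sum(sim for sim, _ in top)
        in_class = sum(sim for sim, label in top if label == c)
        scores[c] = in_class / total if total else 0.0
    return max(scores, key=scores.get), scores


# Made-up toy data (an assumption for illustration): 6 "big" docs, 2 "small" docs.
train = [
    ([1.00, 0.00], "big"), ([0.95, 0.05], "big"), ([0.90, 0.10], "big"),
    ([0.85, 0.15], "big"), ([0.80, 0.20], "big"), ([0.70, 0.30], "big"),
    ([0.00, 1.00], "small"), ([0.10, 0.90], "small"),
]
query = [0.05, 0.95]

print(per_class_k([label for _, label in train], 5))  # {'big': 5, 'small': 2}
predicted, scores = knn_adaptive(train, query, 5)
print(predicted)  # small
```

With k = 5 the large class is judged on its top 5 neighbors and the small class on its top 2, so in this toy example the small class is not outvoted by more distant large-class documents. The sketch illustrates the mechanism only; it does not reproduce the paper's experiments.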

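Experimental Results

Research Questions

RQ1: How does selecting k dynamically per class affect classification performance in imbalanced text categorization?
RQ2: Can the category-adaptive k-NN approach reduce the bias toward large classes in text categorization?
RQ3: How much does the proposed method reduce sensitivity to the choice of k compared with conventional fixed-k k-NN?
RQ4: Can the method classify documents from small classes effectively without cross-validation for tuning k?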

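Key Findings

The proposed method was markedly less sensitive to the choice of k than the conventional fixed-k k-NN algorithm.
Classification accuracy on small classes improved, especially at large k values.
Detection of minority classes improved while performance on large classes remained high.
The method was shown to remain effective even when cross-validation for selecting k is not possible.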

This review was generated by AI and checked by a human editor.