QUICK REVIEW

[Paper Review] Large scale biomedical texts classification: a kNN and an ESA-based approaches

Khadim Dramé, Fleur Mougin|arXiv (Cornell University)|2016. 06. 09.

Text and Document Classification Technologies | References: 42 | Citations: 26

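One-Line Summary

This paper proposes two lightweight, scalable methods for classifying large-scale biomedical texts from partial document information only, without access to the full text: a kNN-based label-ranking method (improved with Random Forest) and a standalone ESA-based classifier. The kNN method achieved a competitive F-measure of 0.55. ESA performed less well when used on its own, but was shown to be useful as a complementary feature for low-resource, multi-label biomedical text classification.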

ABSTRACT

With the large and increasing volume of textual data, automated methods for identifying significant topics to classify textual documents have received a growing interest. While many efforts have been made in this direction, it still remains a real challenge. Moreover, the issue is even more complex as full texts are not always freely available. Then, using only partial information to annotate these documents is promising but remains a very ambitious issue.

Methods: We propose two classification methods: a k-nearest neighbours (kNN)-based approach and an explicit semantic analysis (ESA)-based approach. Although the kNN-based approach is widely used in text classification, it needs to be improved to perform well in this specific classification problem which deals with partial information. Compared to existing kNN-based methods, our method uses classical Machine Learning (ML) algorithms for ranking the labels. Additional features are also investigated in order to improve the classifiers' performance. In addition, the combination of several learning algorithms with various techniques for fixing the number of relevant topics is performed. On the other hand, ESA seems promising for this classification task as it yielded interesting results in related issues, such as semantic relatedness computation between texts and text classification. Unlike existing works, which use ESA for enriching the bag-of-words approach with additional knowledge-based features, our ESA-based method builds a standalone classifier. Furthermore, we investigate if the results of this method could be useful as a complementary feature of our kNN-based approach.

Results: Experimental evaluations performed on large standard annotated datasets, provided by the BioASQ organizers, show that the kNN-based method with the Random Forest learning algorithm achieves good performances compared with the current state-of-the-art methods, reaching a competitive f-measure of 0.55 while the ESA-based approach surprisingly yielded reserved results.

Conclusions: We have proposed simple classification methods suitable to annotate textual documents using only partial information. They are therefore adequate for large multi-label classification and particularly in the biomedical domain. Thus, our work contributes to the extraction of relevant information from unstructured documents in order to facilitate their automated processing. Consequently, it could be used for various purposes, including document indexing, information retrieval, etc.

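Research Motivation and Objectives

To address the challenge of classifying large-scale biomedical texts when full texts are unavailable, by relying on partial information.
To develop scalable, lightweight classification methods suited to multi-label biomedical text annotation.
To evaluate, in a low-resource setting, the effectiveness of kNN combined with ensemble learning and of ESA used as a standalone classifier.
To explore whether ESA results can serve as a complementary feature that improves kNN-based classification performance.
To support automated information extraction from unstructured biomedical documents, contributing to indexing and retrieval.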

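Proposed Method

The kNN-based method uses classical machine learning algorithms, in particular Random Forest, to rank the labels gathered from the k nearest neighbours in a vector space model.
Additional features were incorporated to improve classifier performance, together with techniques for fixing the number of relevant topics.
The ESA-based method builds a standalone classifier that uses explicit semantic analysis to compute the semantic similarity between documents and labels.
Unlike earlier work that used ESA to enrich the bag-of-words model, this study treats ESA as the primary classification mechanism.
Evaluation was carried out on large, standard annotated datasets provided by the BioASQ challenge, which reflect real-world use.
The kNN model was further refined by combining several learning algorithms with different topic-selection strategies.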
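
The kNN label-ranking idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the whitespace tokenizer, and the helper names (`knn_label_features`, `rank_labels`, `top_n`) are all hypothetical. Where the paper trains a ranker such as Random Forest on neighbour-derived features, this sketch simply sorts candidate labels by summed neighbour similarity, and it uses a fixed `top_n` in place of the paper's techniques for fixing the number of topics.

```python
import math
from collections import Counter, defaultdict

def tf_vector(text):
    """Bag-of-words term-frequency vector (toy tokenizer: lowercase + split)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * \
           math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def knn_label_features(query, corpus, k=3):
    """Gather candidate labels from the k most similar annotated documents.

    Returns {label: (neighbour_count, summed_similarity)}. In a learned
    setup these per-label values would be inputs to a ranker.
    """
    q = tf_vector(query)
    neighbours = sorted(
        ((cosine(q, tf_vector(text)), labels) for text, labels in corpus),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    feats = defaultdict(lambda: [0, 0.0])
    for sim, labels in neighbours:
        for label in labels:
            feats[label][0] += 1
            feats[label][1] += sim
    return {label: tuple(values) for label, values in feats.items()}

def rank_labels(features, top_n=2):
    """Stand-in for the learned ranker: sort by summed similarity, then count."""
    ranked = sorted(features,
                    key=lambda l: (features[l][1], features[l][0]),
                    reverse=True)
    return ranked[:top_n]

# Hypothetical annotated snippets (partial text only, e.g. title-like strings).
CORPUS = [
    ("insulin therapy in diabetes patients", ["Diabetes Mellitus", "Insulin"]),
    ("glucose control and insulin resistance", ["Insulin", "Blood Glucose"]),
    ("tumor growth in breast cancer", ["Breast Neoplasms"]),
    ("chemotherapy outcomes for cancer patients", ["Neoplasms", "Drug Therapy"]),
]

features = knn_label_features("insulin resistance in diabetes", CORPUS, k=3)
print(rank_labels(features, top_n=2))
```

Keeping feature extraction separate from ranking means `rank_labels` could later be swapped for a trained classifier that scores each (document, candidate label) pair without touching the neighbour search.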

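Experimental Results

Research Questions

RQ1: Can a kNN approach with Random Forest-based label ranking achieve competitive performance in large-scale biomedical text classification when only partial document information is used?
RQ2: How effective is the ESA-based method when used as a standalone classifier for multi-label biomedical text classification?
RQ3: Can the results of the ESA-based method serve as a useful complementary feature for improving the kNN-based classifier?
RQ4: How do additional features and topic-selection techniques affect the performance of the kNN-based method?
RQ5: How do these methods compare with the state of the art on large-scale biomedical datasets in terms of F-measure and scalability?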

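Key Results

The kNN-based method with Random Forest reached a competitive F-measure of 0.55, performing well against state-of-the-art methods in large-scale biomedical text classification.
The ESA-based method performed less well as a standalone classifier, which exposed its limitations, although its potential as a complementary feature was confirmed.
Combining several learning algorithms with topic-selection techniques improved the overall performance of the kNN-based approach.
The proposed methods remain effective in low-resource settings where only partial text is available, and are suited to practical biomedical document processing.
The results suggest that ESA can be a useful source of semantic features even when it is not used as the primary classifier.
The study contributes a practical, scalable biomedical text classification solution that supports applications such as document indexing and information retrieval.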
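
To make the ESA idea concrete, the following is a minimal sketch of ESA-style scoring, not the authors' classifier. It assumes, purely for illustration, that each label acts as a concept described by the pooled text of documents annotated with it. The toy data and the helper names (`build_term_concept_index`, `esa_scores`) are hypothetical.

```python
import math
from collections import Counter, defaultdict

def build_term_concept_index(labelled_docs):
    """Build an inverted index: term -> {concept: tf-idf weight}.

    Each label is treated as a concept whose text is the pooled tokens of
    all documents carrying that label (an assumption of this sketch).
    """
    concept_tf = defaultdict(Counter)
    for text, labels in labelled_docs:
        tokens = text.lower().split()
        for label in labels:
            concept_tf[label].update(tokens)

    # Document frequency counted over concepts, not over raw documents.
    df = Counter()
    for tf in concept_tf.values():
        df.update(tf.keys())

    n_concepts = len(concept_tf)
    index = defaultdict(dict)
    for concept, tf in concept_tf.items():
        for term, freq in tf.items():
            idf = math.log(n_concepts / df[term])
            if idf > 0:  # terms present in every concept carry no signal
                index[term][concept] = freq * idf
    return index

def esa_scores(text, index):
    """Project a text onto the concept space by summing term weights."""
    scores = Counter()
    for term, tf in Counter(text.lower().split()).items():
        for concept, weight in index.get(term, {}).items():
            scores[concept] += tf * weight
    return scores

# Hypothetical labelled snippets.
LABELLED_DOCS = [
    ("diabetes insulin glucose patients", ["Diabetes Mellitus"]),
    ("tumor cancer chemotherapy patients", ["Neoplasms"]),
    ("heart cardiac artery patients", ["Heart Diseases"]),
]

INDEX = build_term_concept_index(LABELLED_DOCS)
scores = esa_scores("insulin and glucose levels in patients", INDEX)
print(scores.most_common(1))
```

Used standalone, the top-scoring concepts would be returned as predicted labels. Used as a complementary feature, the per-label score could instead be passed to another ranker as one more input.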


This review was generated by AI and checked by a human editor.