QUICK REVIEW

[논문 리뷰] A Rapid Review of Clustering Algorithms

Hui Yin, Amir Aryani|arXiv (Cornell University)|2024. 01. 14.

Advanced Clustering Algorithms Research인용 수 10

한 줄 요약

이 논문은 clustering 알고리즘을 조사하고, 다섯 차원으로 분류하며, 평가 지표를 논의하고, 클러스터링 연구의 경향과 남은 도전 과제를 강조한다.

ABSTRACT

Clustering algorithms aim to organize data into groups or clusters based on the inherent patterns and similarities within the data. They play an important role in today's life, such as in marketing and e-commerce, healthcare, data organization and analysis, and social media. Numerous clustering algorithms exist, with ongoing developments introducing new ones. Each algorithm possesses its own set of strengths and weaknesses, and as of now, there is no universally applicable algorithm for all tasks. In this work, we analyzed existing clustering algorithms and classify mainstream algorithms across five different dimensions: underlying principles and characteristics, data point assignment to clusters, dataset capacity, predefined cluster numbers and application area. This classification facilitates researchers in understanding clustering algorithms from various perspectives and helps them identify algorithms suitable for solving specific tasks. Finally, we discussed the current trends and potential future directions in clustering algorithms. We also identified and discussed open challenges and unresolved issues in the field.

연구 동기 및 목표

기존 클러스터링 알고리즘과 그 기본 원리를 요약한다.
원리, 데이터 포인트 할당, 데이터셋 용량, 미리 정의된 클러스터 수, 및 응용 분야에 따라 알고리즘을 분류한다.
다양한 데이터 조건에서 평가 지표와 그 적용 가능성에 대해 논의한다.
클러스터링 연구의 현재 경향, 남아 있는 도전 과제 및 향후 방향을 강조한다.

제안 방법

Google Scholar, arXiv 및 Scopus에 걸친 클러스터링 관련 키워드를 사용하여 문헌 조사를 수행한다.
언어(영어), 최근성(최근 5년), 새롭고 혁신적인 클러스터링 기법에 초점을 맞춰 논문을 선별한다.
기본 원리, 데이터 포인트 할당, 데이터셋 용량, 미리 정의된 클러스터 수 및 응용 영역을 분석하여 알고리즘을 분류한다.
내적 및 외적 평가 지표와 그 한계점을 검토한다.
클러스터링의 경향, 응용 및 남아 있는 도전 과제, 딥 러닝과의 통합 및 하이브리드 방법 포함에 대해 논의한다.

Figure 1: Structure of the clustering algorithm classification, covering five dimensions.

실험 결과

연구 질문

RQ1클러스터링 알고리즘을 분류하는 데 사용되는 주요 기본 원리와 특성은 무엇인가?
RQ2데이터 포인트 할당, 데이터셋 용량 및 미리 정의된 클러스터 수의 필요성 측면에서 클러스터링 알고리즘은 어떻게 차이가 있는가?
RQ3클러스터링에 사용되는 평가 지표는 무엇이며 비지도 환경에서의 한계는 무엇인가?
RQ4클러스터링 연구의 현재 경향, 응용 및 남아 있는 과제는 무엇인가?
RQ5딥 러닝 통합이나 하이브리드 접근 방식과 같은 향후 방향이 클러스터링 방법에 어떤 영향을 미칠 수 있는가?

주요 결과

클러스터링 알고리즘은 다섯 가지 원리 기반 계열로 분류된다: partition-based, hierarchical, density-based, grid-based, 및 model-based.
알고리즘은 데이터 포인트를 클러스터에 할당하는 방식(하드 vs 소프트)과 클러스터 수에 대한 요구사항에서 차이가 있다.
데이터셋 용량은 방법 선택에 영향을 주며, 소형, 중형, 대형 데이터셋에 대한 서로 다른 지원과 이에 따른 확장성 고려가 있다.
내적 및 외적 평가 지표는 클러스터링 품질 평가에 사용되며, 각각 고유의 장점과 한계가 있다.
특정 도메인에 클러스터링을 적용하고, 딥 러닝을 도입하며 하이브리드 접근 방식을 개발하는 경향이 있으며, 최적의 클러스터 수를 결정하는 데 지속적인 과제가 있다.

Figure 2: Example of Partition-Based Clustering Algorithm: the left side represents the original data, and the right side shows the resulting clusters after applying the K-Means clustering algorithm. Each data sample is classified into only one cluster.

더 나은 연구,지금 바로 시작하세요

연구 설계부터 논문 작성까지, 연구 시간을 획기적으로 줄여보세요.

카드 등록 없음 · 무료 플랜 제공

이 리뷰는 AI가 만들고, 인간 에디터가 검토했습니다.