QUICK REVIEW

[Paper Review] TaxoBell: Gaussian Box Embeddings for Self-Supervised Taxonomy Expansion

Sahil Mishra, Srinitish Srinivasan | arXiv (Cornell University) | 2026-01-14

Machine Learning in Healthcare | Citations: 0

One-Line Summary

TaxoBell introduces Gaussian box embeddings that model asymmetric is-a relations and calibrated uncertainty for self-supervised taxonomy expansion, and it outperforms eight baselines on five benchmarks.

ABSTRACT

Taxonomies form the backbone of structured knowledge representation across diverse domains, enabling applications such as e-commerce catalogs, semantic search, and biomedical discovery. Yet, manual taxonomy expansion is labor-intensive and cannot keep pace with the emergence of new concepts. Existing automated methods rely on point-based vector embeddings, which model symmetric similarity and thus struggle with the asymmetric "is-a" relationships that are fundamental to taxonomies. Box embeddings offer a promising alternative by enabling containment and disjointness, but they face key issues: (i) unstable gradients at the intersection boundaries, (ii) no notion of semantic uncertainty, and (iii) limited capacity to represent polysemy or ambiguity. We address these shortcomings with TaxoBell, a Gaussian box embedding framework that translates between box geometries and multivariate Gaussian distributions, where means encode semantic location and covariances encode uncertainty. Energy-based optimization yields stable optimization, robust modeling of ambiguous concepts, and interpretable hierarchical reasoning. Extensive experimentation on five benchmark datasets demonstrates that TaxoBell significantly outperforms eight state-of-the-art taxonomy expansion baselines by 19% in MRR and around 25% in Recall@k. We further demonstrate the advantages and pitfalls of TaxoBell with error analysis and ablation studies.

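Motivation and Objectives

Enable automated taxonomy expansion that keeps pace with a rapidly growing set of concepts.
Address the limitations of point embeddings, which struggle to capture asymmetric hypernymy and uncertainty.
Propose Gaussian box embeddings that combine semantic location with calibrated uncertainty to model containment and overlap.
Develop an energy-based training objective that jointly optimizes symmetric overlap and asymmetric containment.
Evaluate TaxoBell across benchmark datasets to demonstrate gains over state-of-the-art baselines.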

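Proposed Method

Represent each concept as a Gaussian box: a pretrained encoder maps its surface name and definition to an axis-aligned box.
Convert each box into a multivariate Gaussian whose mean is the box center and whose diagonal covariance is derived from the box offsets.
Use self-supervised signals from the seed taxonomy, with hard negatives sampled from local neighborhoods.
Optimize two energies: a symmetric overlap term (Bhattacharyya coefficient) for semantic similarity and an asymmetric containment term (KL divergence) for hierarchical direction.
Regularize volume to prevent collapse and keep covariances well conditioned.
At inference, rank candidate parents by the learned energies and convert Gaussians back to boxes at a chosen confidence level.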
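
The box-to-Gaussian conversion and the two energies listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the rule that a box half-width spans `z` standard deviations is an assumption made here, since the review only says the diagonal covariance is "derived from the box offsets". The Bhattacharyya coefficient and KL divergence use the standard closed forms for diagonal Gaussians.

```python
import math

def box_to_gaussian(center, offset, z=2.0):
    """Map an axis-aligned box to a diagonal Gaussian (mean, variances).

    Assumption (not from the paper): each half-width `offset[i]` spans
    `z` standard deviations, so sigma_i = offset_i / z.
    """
    variances = [(o / z) ** 2 for o in offset]
    return list(center), variances

def gaussian_to_box(mean, variances, z=2.0):
    """Inverse of box_to_gaussian at the same hypothetical confidence level z."""
    offset = [z * math.sqrt(v) for v in variances]
    return list(mean), offset

def bhattacharyya_coefficient(g1, g2):
    """Symmetric overlap in (0, 1] between two diagonal Gaussians."""
    (m1, v1), (m2, v2) = g1, g2
    distance = 0.0
    for a, b, va, vb in zip(m1, m2, v1, v2):
        s = 0.5 * (va + vb)  # averaged variance in this dimension
        distance += 0.125 * (a - b) ** 2 / s
        distance += 0.5 * math.log(s / math.sqrt(va * vb))
    return math.exp(-distance)

def kl_divergence(g1, g2):
    """Asymmetric KL(g1 || g2) between two diagonal Gaussians."""
    (m1, v1), (m2, v2) = g1, g2
    total = 0.0
    for a, b, va, vb in zip(m1, m2, v1, v2):
        total += va / vb + (b - a) ** 2 / vb - 1.0 + math.log(vb / va)
    return 0.5 * total

# Toy example: a broad "parent" concept and a narrow "child" inside it.
parent = box_to_gaussian([0.0, 0.0], [4.0, 4.0])
child = box_to_gaussian([0.5, -0.5], [1.0, 1.0])

overlap = bhattacharyya_coefficient(child, parent)   # same value in either order
child_in_parent = kl_divergence(child, parent)       # small: child fits inside parent
parent_in_child = kl_divergence(parent, child)       # large: parent does not fit inside child
```

In this toy setup the overlap term does not change when the arguments are swapped, while the KL term does, which is the property that lets a containment energy encode the direction of an is-a edge. Which argument order the paper uses for its containment energy is not stated in this review.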

Figure 1. Overview of taxonomy expansion and the contribution of our TaxoBell model.

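Experimental Results

Research Questions

RQ1: How can Gaussian box embeddings capture asymmetric hypernymy and uncertainty for taxonomy expansion?
RQ2: Is self-supervised learning from the seed taxonomy sufficient to learn effective parent-child relations?
RQ3: Do the symmetric and asymmetric energy terms together improve the placement of query concepts under the appropriate anchors?
RQ4: How does TaxoBell compare with state-of-the-art taxonomy expansion baselines across diverse domains?
RQ5: How does covariance (uncertainty) modeling affect robustness to polysemy and ambiguity?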

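Key Results

TaxoBell consistently outperforms eight baselines on five real-world taxonomy benchmarks.
The model improves Mean Reciprocal Rank (MRR) and Recall@k, reflecting better placement and retrieval of the correct parent; the abstract reports gains of 19% in MRR and around 25% in Recall@k.
Combining the symmetric overlap and asymmetric containment energies yields stable optimization and improved hierarchical reasoning.
Ablation studies show the importance of the projection design and energy-based optimization for the performance gains.
Error analysis and case studies illustrate the interpretability and flexibility of Gaussian-box representations for unseen entities.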
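
For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how MRR and Recall@k are commonly computed from the rank of the correct parent. It assumes one ground-truth parent per query and that a lower energy means a better candidate; both are assumptions made for illustration, and the function names and toy values are hypothetical rather than taken from the paper.

```python
def rank_of_true_parent(energies, true_parent):
    """Return the 1-based rank of the true parent.

    `energies` maps candidate parent -> energy; this sketch assumes
    lower energy means a better match.
    """
    ranked = sorted(energies, key=energies.get)
    return ranked.index(true_parent) + 1

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose true parent is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Toy example with three queries whose true parents were ranked 1st, 2nd, and 4th.
example_ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(example_ranks)   # (1 + 1/2 + 1/4) / 3
recall_1 = recall_at_k(example_ranks, 1)    # 1 of 3 queries
recall_3 = recall_at_k(example_ranks, 3)    # 2 of 3 queries
```

MRR rewards placing the correct parent near the top of the ranking, while Recall@k only checks whether it appears within the first k candidates, so the two metrics can move independently.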

Figure 2. Overview of TaxoBell. Entities are encoded with $f_{\eta}(\cdot)$, mapped to axis-aligned boxes using $f_{\psi}(\cdot)$, and then projected to Gaussian embeddings. Training optimizes two energies on the Gaussians – a symmetric overlap term (Bhattacharyya Coefficient) and an asymmetric containme

This review was generated by AI and reviewed by a human editor.