QUICK REVIEW

[Paper Review] ToolWeaver: Weaving Collaborative Semantics for Scalable Tool Use in Large Language Models

Bowen Fang, Wen Ye|arXiv (Cornell University)|2026. 01. 29.

Machine Learning in Materials Science | Citations: 0

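One-Line Summary

ToolWeaver introduces hierarchical, compositional tool codes learned through collaborative-aware tokenization, enabling scalable and generalizable tool use in LLMs and outperforming prior state-of-the-art methods on ToolBench.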

ABSTRACT

Prevalent retrieval-based tool-use pipelines struggle with a dual semantic challenge: their retrievers often employ encoders that fail to capture complex semantics, while the Large Language Model (LLM) itself lacks intrinsic tool knowledge from its natural language pretraining. Generative methods offer a powerful alternative by unifying selection and execution, tasking the LLM to directly learn and generate tool identifiers. However, the common practice of mapping each tool to a unique new token introduces substantial limitations: it creates a scalability and generalization crisis, as the vocabulary size explodes and each tool is assigned a semantically isolated token. This approach also creates a semantic bottleneck that hinders the learning of collaborative tool relationships, as the model must infer them from sparse co-occurrences of monolithic tool IDs within a vast library. To address these limitations, we propose ToolWeaver, a novel generative tool learning framework that encodes tools into hierarchical sequences. This approach makes vocabulary expansion logarithmic to the number of tools. Crucially, it enables the model to learn collaborative patterns from the dense co-occurrence of shared codes, rather than the sparse co-occurrence of monolithic tool IDs. We generate these structured codes through a novel tokenization process designed to weave together a tool's intrinsic semantics with its extrinsic co-usage patterns. These structured codes are then integrated into the LLM through a generative alignment stage, where the model is fine-tuned to produce the hierarchical code sequences. Evaluation results with nearly 47,000 tools show that ToolWeaver significantly outperforms state-of-the-art methods, establishing a more scalable, generalizable, and semantically-aware foundation for advanced tool-augmented agents.

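Motivation and Objectives

Enable scalable tool use in LLMs as tool catalogs grow explosively.
Propose a compositional, hierarchical tool representation to replace the one-token-per-tool scheme.
Learn tool semantics and collaborative relationships through structured tokenization.
Integrate the structured tool codes into the LLM through multi-stage generative alignment.
Demonstrate strong retrieval and end-to-end performance on a large tool benchmark while preserving language ability.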

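Proposed Method

Represent each tool as a sequence of L codes, one from each of L codebooks, so that vocabulary growth is logarithmic in the number of tools (L*K new tokens for K^L tools, where K is the codebook size).
Use collaborative-aware residual quantization (RQ-VAE) to map each tool's semantic description to hierarchical codes, guided by a tool-tool similarity matrix.
Introduce graph Laplacian regularization so that tools that are similar in terms of co-occurrence receive nearby codes.
Apply a uniform mapping constraint at the final codebook level to avoid collisions, formulated as an optimal transport problem and solved with Sinkhorn-Knopp.
Fine-tune the LLM in two stages: retrieval alignment (predicting tool code sequences from queries) and trajectory alignment (learning tool calls, parameters, and responses).
During inference, use constrained beam search over a prefix trie of valid code sequences to guarantee valid tool identifiers.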
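
The trie-constrained decoding step in the last item can be illustrated with a minimal sketch. This is not the authors' implementation: the tool names, the code tokens such as `<c1_0>`, and the `toy_score` function (a stand-in for the LLM's next-token log-probabilities) are all hypothetical and chosen only to show the mechanism.

```python
import math

def build_trie(code_sequences):
    """Build a nested-dict prefix trie over the valid tool code sequences."""
    root = {}
    for seq in code_sequences:
        node = root
        for code in seq:
            node = node.setdefault(code, {})
    return root

def constrained_beam_search(trie, score_fn, length, beam_width=2):
    """Beam search that only expands codes the trie allows after each prefix.

    score_fn(prefix, code) returns a log-probability. Here it stands in for
    the LLM's next-token scores (an assumption for illustration).
    """
    beams = [((), 0.0, trie)]  # (prefix, cumulative log-prob, trie node)
    for _ in range(length):
        candidates = []
        for prefix, logp, node in beams:
            for code, child in node.items():  # valid continuations only
                candidates.append((prefix + (code,), logp + score_fn(prefix, code), child))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return [(prefix, logp) for prefix, logp, _ in beams]

# Hypothetical toy library: three tools, each encoded with L = 2 codes.
TOOL_CODES = {
    "weather_api": ("<c1_0>", "<c2_1>"),
    "maps_api": ("<c1_0>", "<c2_3>"),
    "stock_api": ("<c1_2>", "<c2_1>"),
}

# Hypothetical context-free scores standing in for model log-probabilities.
TOY_LOGPROBS = {
    "<c1_0>": math.log(0.4),
    "<c1_2>": math.log(0.6),
    "<c2_1>": math.log(0.3),
    "<c2_3>": math.log(0.7),
}

def toy_score(prefix, code):
    return TOY_LOGPROBS[code]

trie = build_trie(TOOL_CODES.values())
results = constrained_beam_search(trie, toy_score, length=2, beam_width=2)
best_codes = results[0][0]  # ("<c1_0>", "<c2_3>"), the code of maps_api
```

In this toy setup, unconstrained greedy decoding would pick `<c1_2>` and then `<c2_3>`, a pair that matches no tool. The trie removes that continuation, so every returned sequence corresponds to a real entry in the library.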

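Experimental Results

Research Questions

RQ1: How can tool representations scale beyond one token per tool?
RQ2: Can collaborative usage patterns among tools be incorporated into tool representations to improve generalization and reasoning?
RQ3: What improvements do structured, hierarchical code representations bring to end-to-end task performance and tool coordination compared with prior methods?
RQ4: How does collaborative regularization affect model performance and language ability?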

Key Results

Model	I1 NDCG@1	I1 NDCG@3	I1 NDCG@5	I2 NDCG@1	I2 NDCG@3	I2 NDCG@5	I3 NDCG@1	I3 NDCG@3	I3 NDCG@5
BM25*	22.77	22.64	25.61	18.29	20.74	22.18	10.00	10.08	12.33
EmbSim*	54.00	50.82	55.86	40.84	36.67	39.55	18.00	17.77	20.70
ToolRetriever*	72.31	70.30	74.99	64.54	57.91	63.61	52.00	39.89	42.92
ToolGen*	87.67	88.84	91.54	83.46	86.24	88.84	79.00	79.80	84.79
ToolWeaver	91.16	91.14	93.48	89.76	89.70	91.80	88.00	85.80	90.12
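
For readers unfamiliar with the metric in the table, the following is a minimal sketch of NDCG@k under binary relevance with the standard log2 discount. The review does not describe the paper's exact evaluation protocol, so the binary relevance labels, the example tool names, and the 0 to 1 scale used here are assumptions (the table appears to report values on a 0 to 100 scale).

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance scores."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_tools, relevant_tools, k):
    """NDCG@k with binary relevance: 1 if a retrieved tool is relevant, else 0."""
    gains = [1.0 if tool in relevant_tools else 0.0 for tool in ranked_tools]
    ideal = [1.0] * min(len(relevant_tools), k)  # best possible top-k ranking
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical query with two relevant tools, retrieved at ranks 2 and 3.
ranked = ["maps_api", "weather_api", "stock_api"]
relevant = {"weather_api", "stock_api"}

ndcg1 = ndcg_at_k(ranked, relevant, 1)  # 0.0, the top result is not relevant
ndcg3 = ndcg_at_k(ranked, relevant, 3)  # about 0.693
```

NDCG@1 only rewards a correct top-ranked tool, which is why it is the strictest of the three columns reported for each scenario.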

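ToolWeaver achieves higher retrieval NDCG@k than every baseline in both simple and complex scenarios (I1 through I3), reaching an NDCG@1 of 88.00 on the hardest I3 setting, compared with 79.00 for ToolGen.
In end-to-end evaluation, ToolWeaver achieves the best SoPR/SoWR across multiple settings, including those with unseen tools and categories.
In ablations, semantic initialization is the most important step, and collaborative guidance provides additional benefit, especially on complex tasks.
The best collaborative regularization weight is around λ=1; too large a λ damages each tool's intrinsic semantics.
ToolWeaver preserves general language ability much better than ToolGen, with lower perplexity and stable summarization quality.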
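
To make the role of the collaborative regularization weight λ concrete, here is a minimal sketch of a standard graph Laplacian smoothness penalty added to a base loss. The review does not give the paper's exact objective, so the way the penalty is combined with the base loss, the function names, and the toy embeddings and similarity matrix are all assumptions.

```python
def laplacian_penalty(embeddings, similarity):
    """Compute 0.5 * sum_ij S_ij * ||z_i - z_j||^2.

    For a symmetric similarity matrix S with degree matrix D, this equals
    tr(Z^T (D - S) Z), the usual graph Laplacian smoothness term.
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            total += similarity[i][j] * sq_dist
    return 0.5 * total

def total_loss(base_loss, embeddings, similarity, lam=1.0):
    """Base (for example, quantization) loss plus the lambda-weighted penalty."""
    return base_loss + lam * laplacian_penalty(embeddings, similarity)

# Hypothetical toy data: tools 0 and 1 co-occur, tool 2 is unrelated.
Z = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
S = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]

penalty = laplacian_penalty(Z, S)            # 1.0
loss = total_loss(0.5, Z, S, lam=1.0)        # 1.5
```

The penalty only pulls together tools with nonzero similarity. As λ grows, this term dominates the base loss, which is consistent with the finding that an overly large λ erodes each tool's intrinsic semantics.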

This review was generated by AI and reviewed by a human editor.