QUICK REVIEW

[Paper Review] SKILLS: Structured Knowledge Injection for LLM-Driven Telecommunications Operations

Ivo Brett|arXiv (Cornell University)|2026. 03. 16.

Artificial Intelligence in Healthcare and Education | Citations: 0

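One-Line Summary

This paper introduces SKILLS, a benchmark for LLM-driven telecom operations, and reports consistent performance gains from injecting structured domain knowledge across 37 telecom scenarios and 185 scenario-runs.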

ABSTRACT

As telecommunications operators accelerate adoption of AI-enabled automation, a practical question remains unresolved: can general-purpose large language model (LLM) agents reliably execute telecom operations workflows through real API interfaces, or do they require structured domain guidance? We introduce SKILLS (Structured Knowledge Injection for LLM-driven Service Lifecycle operations), a benchmark framework comprising 37 telecom operations scenarios spanning 8 TM Forum Open API domains (TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724). Each scenario is grounded in live mock API servers with seeded production-representative data, MCP tool interfaces, and deterministic evaluation rubrics combining response content checks, tool-call verification, and database state assertions. We evaluate open-weight models under two conditions: baseline (generic agent with tool access but no domain guidance) and with-skill (agent augmented with a portable SKILL.md document encoding workflow logic, API patterns, and business rules). Results across 5 open-weight model conditions and 185 scenario-runs show consistent skill lift across all models. MiniMax M2.5 leads (81.1% with-skill, +13.5pp), followed by Nemotron 120B (78.4%, +18.9pp), GLM-5 Turbo (78.4%, +5.4pp), and Seed 2.0 Lite (75.7%, +18.9pp).

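Motivation and Objectives

Assess whether general-purpose LLMs can reliably execute telecom workflows through real API interfaces.
Develop a benchmark framework with live mock APIs covering the TM Forum (TMF) Open API domains.
Compare a baseline LLM agent against an agent augmented with structured domain knowledge to measure the performance gain.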

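Proposed Method

A benchmark framework of 37 telecom operations scenarios spanning 8 TM Forum Open API domains: TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, and TMF724.
Each scenario runs against live mock API servers seeded with production-representative data and exposed through MCP tool interfaces.
Deterministic evaluation rubrics combine response content checks, tool-call verification, and database state assertions.
Two conditions are evaluated: baseline (a generic agent with tool access but no domain guidance) and with-skill (an agent augmented with a portable SKILL.md document that encodes workflow logic, API patterns, and business rules).
Skill lift is quantified across 5 open-weight model conditions and 185 scenario-runs.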
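
The three-part rubric described above can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the `ScenarioRun` and `Rubric` structures, their field names, and the tool name `create_ticket` are all hypothetical, and the pass rule (every check must hold) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScenarioRun:
    """Hypothetical record of one agent run (field names are illustrative)."""
    response_text: str            # final answer returned by the agent
    tool_calls: List[str]         # names of the tools the agent invoked
    db_state: Dict[str, str]      # snapshot of relevant records after the run


@dataclass
class Rubric:
    """Hypothetical deterministic rubric for one scenario."""
    required_phrases: List[str] = field(default_factory=list)
    required_tools: List[str] = field(default_factory=list)
    expected_db: Dict[str, str] = field(default_factory=dict)


def evaluate(run: ScenarioRun, rubric: Rubric) -> Dict[str, bool]:
    """Apply the three check types; assume a run passes only if all hold."""
    text = run.response_text.lower()
    checks = {
        # Response content check: every required phrase appears in the answer.
        "response_content": all(p.lower() in text for p in rubric.required_phrases),
        # Tool-call verification: every required tool was actually called.
        "tool_calls": all(t in run.tool_calls for t in rubric.required_tools),
        # Database state assertion: every expected key has the expected value.
        "db_state": all(run.db_state.get(k) == v for k, v in rubric.expected_db.items()),
    }
    checks["passed"] = all(checks.values())
    return checks


# Illustrative usage with made-up data.
rubric = Rubric(
    required_phrases=["TT-1"],
    required_tools=["create_ticket"],
    expected_db={"ticket:TT-1:status": "acknowledged"},
)
run = ScenarioRun(
    response_text="Ticket TT-1 was created and acknowledged.",
    tool_calls=["list_tickets", "create_ticket"],
    db_state={"ticket:TT-1:status": "acknowledged"},
)
result = evaluate(run, rubric)
```

Because none of the checks depends on a model-based judge, rerunning the same rubric on the same run always yields the same verdict, which is the property the word "deterministic" points to.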

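Experimental Results

Research Questions

RQ1: Can a generic LLM agent with tool access alone, and no domain guidance, reliably execute telecom operations workflows?
RQ2: Does structured knowledge injection through a portable workflow document improve LLM performance across multiple TM Forum API domains?
RQ3: Which open-weight models benefit most from with-skill augmentation, and how much do they improve across scenarios?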

Key Results

Every model shows skill lift when augmented with structured knowledge (the with-skill condition).
MiniMax M2.5 leads with 81.1% (with-skill), +13.5pp over baseline.
Nemotron 120B reaches 78.4% (with-skill), +18.9pp.
GLM-5 Turbo reaches 78.4% (with-skill), +5.4pp.
Seed 2.0 Lite reaches 75.7% (with-skill), +18.9pp.
The evaluation covers 5 open-weight model conditions and 185 scenario-runs, and the gains are consistent across models.
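
The reported figures can be cross-checked with simple arithmetic. The sketch below derives each model's implied baseline as the with-skill score minus the lift, and converts percentages into scenario counts. It assumes each percentage is a pass rate over the 37 scenarios; the baseline values it produces are derived here, not quoted from the paper.

```python
SCENARIOS = 37          # scenarios in the benchmark (from the abstract)
MODEL_CONDITIONS = 5    # open-weight model conditions (from the abstract)

# (with-skill %, lift in percentage points) as reported in the abstract.
reported = {
    "MiniMax M2.5": (81.1, 13.5),
    "Nemotron 120B": (78.4, 18.9),
    "GLM-5 Turbo": (78.4, 5.4),
    "Seed 2.0 Lite": (75.7, 18.9),
}


def implied_baseline(with_skill_pct: float, lift_pp: float) -> float:
    """Baseline score implied by a with-skill score and its lift."""
    return round(with_skill_pct - lift_pp, 1)


def implied_passes(pct: float, total: int = SCENARIOS) -> int:
    """Scenario count implied by a percentage, ASSUMING pct = passes / total."""
    return round(pct / 100 * total)


for model, (with_skill, lift) in reported.items():
    base = implied_baseline(with_skill, lift)
    print(
        f"{model}: baseline ~{base}% ({implied_passes(base)}/{SCENARIOS}), "
        f"with-skill {with_skill}% ({implied_passes(with_skill)}/{SCENARIOS})"
    )

# 37 scenarios x 5 model conditions matches the 185 scenario-runs reported.
total_runs = SCENARIOS * MODEL_CONDITIONS
```

Under this reading, the lead of 81.1% corresponds to 30 of 37 scenarios, and the implied baselines fall at roughly 67.6%, 59.5%, 73.0%, and 56.8% for the four named models. Only four of the five model conditions are named in the abstract, so the fifth is left out of the table.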


This review was generated by AI and reviewed by a human editor.