QUICK REVIEW

[Paper Review] SKILLS: Structured Knowledge Injection for LLM-Driven Telecommunications Operations

Ivo Brett|arXiv (Cornell University)|Mar 16, 2026

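One-Sentence Summary

This paper introduces SKILLS, a benchmark for LLM-driven telecom operations, and reports consistent performance gains from injecting structured domain knowledge across 37 telecom scenarios and 185 scenario-runs.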

ABSTRACT

As telecommunications operators accelerate adoption of AI-enabled automation, a practical question remains unresolved: can general-purpose large language model (LLM) agents reliably execute telecom operations workflows through real API interfaces, or do they require structured domain guidance? We introduce SKILLS (Structured Knowledge Injection for LLM-driven Service Lifecycle operations), a benchmark framework comprising 37 telecom operations scenarios spanning 8 TM Forum Open API domains (TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724). Each scenario is grounded in live mock API servers with seeded production-representative data, MCP tool interfaces, and deterministic evaluation rubrics combining response content checks, tool-call verification, and database state assertions. We evaluate open-weight models under two conditions: baseline (generic agent with tool access but no domain guidance) and with-skill (agent augmented with a portable SKILL.md document encoding workflow logic, API patterns, and business rules). Results across 5 open-weight model conditions and 185 scenario-runs show consistent skill lift across all models. MiniMax M2.5 leads (81.1% with-skill, +13.5pp), followed by Nemotron 120B (78.4%, +18.9pp), GLM-5 Turbo (78.4%, +5.4pp), and Seed 2.0 Lite (75.7%, +18.9pp).

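Motivation and Objectives

Assess whether general-purpose LLM agents can reliably execute telecom workflows through real API interfaces.
Build a benchmark framework, backed by live mock APIs, that covers the TM Forum Open API domains.
Compare baseline LLM agents against agents augmented with structured domain knowledge to measure the performance gain.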

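Proposed Method

A benchmark framework of 37 telecom operations scenarios spanning 8 TM Forum Open API domains (TMF620, TMF621, TMF622, TMF628, TMF629, TMF637, TMF639, TMF724).
Scenarios grounded in live mock API servers with seeded production-representative data and MCP tool interfaces.
Deterministic evaluation rubrics that combine response content checks, tool-call verification, and database state assertions.
Two evaluation conditions: baseline (a generic agent with tool access but no domain guidance) and with-skill (an agent augmented with a portable SKILL.md document encoding workflow logic, API patterns, and business rules).
Evaluation across 5 open-weight model conditions and 185 scenario-runs to quantify skill lift.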
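
The three-part rubric described above can be illustrated with a minimal sketch. It is not the paper's evaluation harness: the `ScenarioRun` and `Rubric` shapes, the tool name `create_trouble_ticket`, and the `TT-1.status` key are all hypothetical, chosen only to show how a content check, a tool-call check, and a database state assertion might combine into one deterministic pass/fail verdict.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioRun:
    """Output of one agent run (hypothetical shape, not from the paper)."""
    response_text: str
    tool_calls: list[str]      # names of the tools the agent invoked, in order
    db_state: dict[str, str]   # selected values read back from the mock server


@dataclass
class Rubric:
    """Deterministic criteria for one scenario (hypothetical shape)."""
    required_phrases: list[str] = field(default_factory=list)
    required_tool_calls: list[str] = field(default_factory=list)
    expected_db_state: dict[str, str] = field(default_factory=dict)


def evaluate(run: ScenarioRun, rubric: Rubric) -> dict[str, bool]:
    """Apply all three checks; the scenario passes only if every check passes."""
    text = run.response_text.lower()
    # 1. Response content check: every required phrase appears in the answer.
    content_ok = all(p.lower() in text for p in rubric.required_phrases)
    # 2. Tool-call verification: every required tool was actually invoked.
    tools_ok = all(t in run.tool_calls for t in rubric.required_tool_calls)
    # 3. Database state assertion: the backend ended up in the expected state.
    state_ok = all(run.db_state.get(k) == v
                   for k, v in rubric.expected_db_state.items())
    return {
        "content": content_ok,
        "tool_calls": tools_ok,
        "db_state": state_ok,
        "passed": content_ok and tools_ok and state_ok,
    }


# Illustrative scenario: all names and values below are made up.
rubric = Rubric(
    required_phrases=["ticket"],
    required_tool_calls=["create_trouble_ticket"],
    expected_db_state={"TT-1.status": "acknowledged"},
)
good_run = ScenarioRun(
    response_text="Created ticket TT-1 for the reported outage.",
    tool_calls=["list_trouble_tickets", "create_trouble_ticket"],
    db_state={"TT-1.status": "acknowledged"},
)
# An agent that claims success in prose but never changes backend state.
bad_run = ScenarioRun(
    response_text="I have created a ticket for you.",
    tool_calls=["list_trouble_tickets"],
    db_state={},
)

if __name__ == "__main__":
    print(evaluate(good_run, rubric))
    print(evaluate(bad_run, rubric))
```

Checking database state in addition to the response text is what lets a rubric of this kind reject runs like `bad_run`, where the answer sounds correct but no API call actually changed anything.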

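Experimental Results

Research Questions

RQ1: Can general-purpose LLM agents execute telecom operations workflows reliably without domain guidance?
RQ2: Does injecting structured knowledge through a portable workflow document improve LLM performance across multiple TM Forum API domains?
RQ3: Which open-weight models benefit most from with-skill augmentation, and how large is the lift across diverse scenarios?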

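Key Findings

Every model shows skill lift once structured knowledge is added (the with-skill condition).
MiniMax M2.5 leads at 81.1% (with-skill), a gain of +13.5 percentage points over baseline.
Nemotron 120B reaches 78.4% (with-skill), a gain of +18.9pp.
GLM-5 Turbo reaches 78.4% (with-skill), a gain of +5.4pp.
Seed 2.0 Lite reaches 75.7% (with-skill), a gain of +18.9pp.
The evaluation covers 5 open-weight model conditions and 185 scenario-runs and shows consistent improvement across models.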
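
The reported figures can be sanity-checked with a little arithmetic. The sketch below assumes that each percentage is a pass rate over the 37 scenarios, which this summary does not state explicitly. The baseline rates and scenario counts it produces are therefore derived values, not numbers reported by the paper.

```python
SCENARIOS = 37          # scenarios in the benchmark, per the abstract
MODEL_CONDITIONS = 5    # open-weight model conditions, per the abstract

# (with-skill %, lift in percentage points), as reported in the abstract.
reported = {
    "MiniMax M2.5": (81.1, 13.5),
    "Nemotron 120B": (78.4, 18.9),
    "GLM-5 Turbo": (78.4, 5.4),
    "Seed 2.0 Lite": (75.7, 18.9),
}


def implied(model: str) -> tuple[float, int, int]:
    """Return (implied baseline %, baseline passes, with-skill passes).

    Assumes each percentage is a fraction of SCENARIOS; this is an
    assumption of this sketch, not a statement from the paper.
    """
    with_skill, lift = reported[model]
    baseline = round(with_skill - lift, 1)
    passed_with = round(with_skill / 100 * SCENARIOS)
    passed_base = round(baseline / 100 * SCENARIOS)
    return baseline, passed_base, passed_with


if __name__ == "__main__":
    # 5 model conditions x 37 scenarios matches the 185 scenario-runs reported.
    print(MODEL_CONDITIONS * SCENARIOS)
    for name in reported:
        baseline, base_n, with_n = implied(name)
        print(f"{name}: baseline {baseline}% ({base_n}/{SCENARIOS}), "
              f"with-skill {with_n}/{SCENARIOS}")
```

Under that assumption, MiniMax M2.5's 81.1% corresponds to 30 of 37 scenarios and its implied baseline of 67.6% to 25 of 37. Every reported percentage maps cleanly onto a whole number of scenarios, which is consistent with a per-model denominator of 37.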

This review was generated by AI and reviewed by a human editor.