QUICK REVIEW

[Paper Review] Towards Native Intelligence: 6G-LLM Trained with Reinforcement Learning from NDT Feedback

Zhuoran Xiao, Tao Tao|arXiv (Cornell University)|Jan 15, 2026
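Software-Defined Networks and 5G|Cited by 0
One-Sentence Summary

Introduces RLDTF, a reinforcement learning framework that uses digital twin feedback to train 6G-LLMs for task-oriented network orchestration; it improves output accuracy and reaches a single-shot task completion rate of nearly 75%.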

ABSTRACT

Owing to its comprehensive understanding of upper-layer application requirements and the capabilities of practical communication systems, the 6G-LLM (6G domain large language model) offers a promising pathway toward realizing network native intelligence. Serving as the system orchestrator, the 6G-LLM drives a paradigm shift that fundamentally departs from existing rule-based approaches, which primarily rely on modular, experience-driven optimization. By contrast, the 6G-LLM substantially enhances network flexibility and adaptability. Nevertheless, current efforts to construct 6G-LLMs are constrained by their reliance on large-scale, meticulously curated, human-authored corpora, which are impractical to obtain in real-world scenarios. Moreover, purely offline-trained models lack the capacity for continual self-improvement, limiting their ability to adapt to the highly dynamic requirements of wireless communication environments. To overcome these limitations, we propose a novel training paradigm termed RLDTF (Reinforcement Learning from Digital Twin Feedback) for 6G-LLMs. This framework leverages network digital twins to generate reward signals based on orchestration outcomes, while employing reinforcement learning to guide the model toward optimal decision-making dynamically. Furthermore, we introduce a weighted token mechanism to improve output accuracy. Comprehensive experimental results demonstrate that our proposed framework significantly outperforms state-of-the-art baselines in orchestration accuracy and solution optimality.

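Motivation and Objectives

  • Inject domain-specific knowledge into the 6G-LLM while preserving its general capabilities.
  • Enable iterative improvement of orchestration outputs through digital twin feedback.
  • Develop a reinforcement learning framework targeted at 6G orchestration objectives.
  • Improve output accuracy during learning through a weighted token mechanism.
  • Demonstrate practical performance gains and a field hardware prototype.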

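Proposed Method

  • Perform full-parameter pre-training on a mixture of domain-specific and open-domain corpora to inject telecom knowledge.
  • Apply rejection sampling to build a high-quality, tokenized seed corpus of tasks with QoS targets.
  • Train with Reinforcement Learning from Digital Twin Feedback (RLDTF), using QoS rewards produced by the network digital twin (NDT).
  • Design a domain-specific reward function that balances QoS satisfaction against resource usage.
  • Estimate token importance from perturbation-induced reward sensitivity and apply the resulting token weights.
  • Stabilize RL with a token-weighted policy loss, a value loss, an entropy bonus, and KL regularization.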
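
The last three steps above can be illustrated with a minimal Python sketch. This summary does not give the paper's actual formulas, so everything here is an assumption made for illustration: the reward shape (capped QoS satisfaction minus a penalty weighted by `lam`), the mean-1 weight normalization, the plain policy-gradient form of the loss, the coefficients `c_v`, `c_e`, and `beta`, and the helper names `toy_ndt_reward` and `halve_or_mask`, where the toy reward stands in for a real network digital twin.

```python
from typing import Callable, List, Sequence


def qos_reward(achieved: float, target: float, used: float,
               budget: float, lam: float = 0.3) -> float:
    """Hypothetical reward: capped QoS satisfaction minus a resource penalty."""
    satisfaction = min(achieved / target, 1.0)  # 1.0 once the QoS target is met
    usage = min(used / budget, 1.0)             # fraction of the resource budget
    return satisfaction - lam * usage


def token_weights(tokens: Sequence[str],
                  reward_fn: Callable[[Sequence[str]], float],
                  perturb: Callable[[str], str]) -> List[float]:
    """Weight each token by how much perturbing it changes the reward.

    Weights are normalized to mean 1. If no perturbation changes the
    reward, every token gets weight 1 (assumed fallback).
    """
    base = reward_fn(tokens)
    sens = []
    for i, tok in enumerate(tokens):
        perturbed = list(tokens)
        perturbed[i] = perturb(tok)
        sens.append(abs(base - reward_fn(perturbed)))
    total = sum(sens)
    if total == 0.0:
        return [1.0] * len(tokens)
    return [s * len(tokens) / total for s in sens]


def rl_loss(logps: Sequence[float], advantages: Sequence[float],
            weights: Sequence[float], value_loss: float, entropy: float,
            kl: float, c_v: float = 0.5, c_e: float = 0.01,
            beta: float = 0.1) -> float:
    """Token-weighted policy-gradient loss plus value, entropy, and KL terms."""
    policy = -sum(w * a * lp for w, a, lp in zip(weights, advantages, logps))
    policy /= len(logps)
    # The entropy bonus is subtracted (higher entropy lowers the loss);
    # the KL term penalizes drift from a reference policy.
    return policy + c_v * value_loss - c_e * entropy + beta * kl


def toy_ndt_reward(tokens: Sequence[str]) -> float:
    """Toy stand-in for the NDT: reads the bandwidth (MHz) from the last token."""
    bandwidth = float(tokens[-1])
    # Assumed toy link model: 2.5 Mbps per MHz, 100 Mbps target, 100 MHz budget.
    return qos_reward(achieved=2.5 * bandwidth, target=100.0,
                      used=bandwidth, budget=100.0)


def halve_or_mask(tok: str) -> str:
    """Toy perturbation: halve numeric tokens, mask all others."""
    try:
        return str(float(tok) / 2)
    except ValueError:
        return "<unk>"


if __name__ == "__main__":
    print(token_weights(["set", "bandwidth", "40"], toy_ndt_reward, halve_or_mask))
```

In this toy setup only the numeric token changes the reward when perturbed, so it receives all of the weight and the other two tokens receive zero. A real NDT would respond to every token of the configuration, and a production trainer would compute these terms on tensors rather than Python lists.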

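Experimental Results

Research Questions

  • RQ1: Does RLDTF improve the task completion rate of 6G-LLMs on network orchestration tasks?
  • RQ2: How does the weighted token mechanism affect output accuracy and efficiency?
  • RQ3: How does RLDTF perform on QoS targets compared with baseline domain-injected models and non-RL models?
  • RQ4: Does the method extend to edge deployments with real hardware constraints?

Key Findings

  • RLDTF achieves a single-shot task completion rate of nearly 75% on orchestration tasks.
  • The policy loss drops quickly and the average reward rises during RL training, indicating effective learning.
  • Rejection sampling improves feasibility by using high-quality positive samples, but RLDTF achieves higher solution quality and efficiency.
  • Compared with the baselines, RLDTF delivers a higher task completion rate and a higher average score on completed tasks.
  • A field hardware prototype is demonstrated in which the 6G-LLM autonomously configures AI-native modules to meet requirements.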
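
The two headline metrics in the findings above, single-shot task completion rate and average score on completed tasks, can be computed as in the following sketch. The paper's exact evaluation protocol is not described in this summary, so the `(completed, score)` record format and the function name `summarize_runs` are assumptions, and the sample values are made up for illustration rather than taken from the paper.

```python
from typing import List, Tuple


def summarize_runs(results: List[Tuple[bool, float]]) -> Tuple[float, float]:
    """Return (completion rate, mean score over completed tasks).

    Each record is one first-attempt orchestration run: whether the task
    was completed, and the score assigned to that run.
    """
    if not results:
        return 0.0, 0.0
    completed_scores = [score for done, score in results if done]
    rate = len(completed_scores) / len(results)
    mean_score = (sum(completed_scores) / len(completed_scores)
                  if completed_scores else 0.0)
    return rate, mean_score


if __name__ == "__main__":
    # Made-up records: three of four tasks completed on the first attempt.
    runs = [(True, 1.0), (True, 0.5), (False, 0.0), (True, 0.75)]
    print(summarize_runs(runs))
```

Averaging the score only over completed tasks keeps the two metrics independent: the rate measures feasibility, while the mean score measures solution quality given that a feasible solution was found.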

This review was generated by AI and reviewed by a human editor.