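[Paper Digest] Robust Beam Codebooks for mmWave/THz Systems: Toward a Stochastic RL Approach
The paper shows that multi-agent reinforcement learning for beam codebook design, particularly with Soft Actor-Critic (SAC), still achieves robust beamforming in mmWave/THz MIMO systems under hardware impairments and feedback noise, and outperforms deterministic RL methods.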
Millimeter-wave (mmWave) and terahertz (THz) massive MIMO systems often rely on predefined beamforming codebooks, which are usually suboptimal in Non-Line-of-Sight (NLoS) conditions and for hardware-limited transceivers. Reinforcement Learning (RL) enables adaptive, data-driven codebook design without explicit Channel State Information (CSI), but the robustness of such algorithms in practical conditions is underexplored. This paper introduces a robust multi-agent RL framework that learns beam codebooks directly from environmental feedback, eliminating the need for prior channel knowledge. Our method is well-suited for real-world deployments facing unpredictable propagation and hardware constraints. We conduct a comprehensive analysis of three off-policy algorithms, Deep Deterministic Policy Gradient (DDPG), Twin Delayed DDPG (TD3), and Soft Actor-Critic (SAC), evaluating their resilience to hardware impairments and feedback noise. Simulations show that SAC consistently outperforms deterministic methods, achieving superior beamforming gains and stability in NLoS scenarios, even under severe impairments. These results demonstrate the promise of RL-based codebook design for robust mmWave/THz massive MIMO systems.
Motivation and Objectives
- Motivate robust beam codebook design for mmWave/THz MIMO without explicit CSI.
- Propose a multi-agent RL framework to learn beam patterns from environmental feedback.
- Evaluate robustness of RL algorithms under hardware impairments and feedback noise.
- Provide a benchmarking methodology to stress-test RL-based codebooks in realistic conditions.
Proposed Method
- Model beam codebook design as a multi-agent MDP with analog beamformers and discrete phase shifters.
- Compare three off-policy algorithms: DDPG, TD3 (deterministic policies) and SAC (stochastic policy).
- Use K-nearest-neighbor (KNN) quantization to map continuous actions to hardware-feasible phases.
- Introduce a ternary reward to handle noisy feedback and improve exploration.
- Cluster users via sensing beams and assign clusters to agents using a Hungarian algorithm for optimal initial matching.
- Assess robustness under phase-mismatch impairments and AWGN feedback noise using DeepMIMO-based datasets.
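The quantization and reward steps in the list above can be illustrated with a minimal sketch. It is not the paper's implementation: the uniform b-bit phase grid, the choice of K = 1 with a circular distance, the function names, and the exact thresholds of the ternary rule (compare against the best gain so far and the previous gain) are all assumptions made for illustration.

```python
import cmath
import math


def phase_levels(bits):
    """Uniformly spaced phases in [-pi, pi) for an assumed b-bit phase shifter."""
    n = 2 ** bits
    return [-math.pi + 2 * math.pi * k / n for k in range(n)]


def circular_distance(a, b):
    """Shortest angular distance between two phases, in radians."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def quantize_phases(phases, bits):
    """Map each continuous phase to its nearest feasible level (K = 1 nearest neighbor)."""
    levels = phase_levels(bits)
    return [min(levels, key=lambda q: circular_distance(p, q)) for p in phases]


def beamforming_gain(phases, channel):
    """Gain |w^H h|^2 of a unit-norm, constant-modulus analog beamformer."""
    n = len(phases)
    w = [cmath.exp(1j * p) / math.sqrt(n) for p in phases]
    return abs(sum(wi.conjugate() * hi for wi, hi in zip(w, channel))) ** 2


def ternary_reward(gain, best_gain, previous_gain):
    """Assumed ternary rule: +1 for a new best, 0 for a local improvement, -1 otherwise."""
    if gain > best_gain:
        return 1
    if gain > previous_gain:
        return 0
    return -1


# Usage: quantize an agent's continuous action to 2-bit phases and score it.
action = [0.1, 3.1, -1.5, 1.4]
beam = quantize_phases(action, bits=2)
gain = beamforming_gain(beam, channel=[1 + 0j] * 4)
reward = ternary_reward(gain, best_gain=1.5, previous_gain=0.5)
```

A coarse reward of this kind depends only on comparisons between gains, which is one plausible reason it tolerates noisy feedback better than a reward equal to the raw gain.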

Experimental Results
Research Questions
- RQ1: Can stochastic policy learning (SAC) provide more robust beam codebooks than deterministic RL methods (DDPG/TD3) under hardware impairments?
- RQ2: How does RL-based beam codebook learning perform with noisy feedback in LoS and NLoS mmWave/THz scenarios?
- RQ3: What is the impact of codebook size on beamforming gain and stability in the presence of hardware imperfections?
- RQ4: Does a multi-agent decomposition improve scalability and resilience in learning beam patterns for large antenna arrays?
Key Findings
- SAC consistently yields higher beamforming gains than DDPG/TD3 across LoS and NLoS scenarios and various codebook sizes.
- Under hardware impairments, SAC remains most robust, maintaining higher gains as phase-mismatch variance increases.
- Feedback noise degrades all methods, but SAC shows slower degradation and better resilience, maintaining a larger fraction of the noise-free performance at up to 40% noise in NLoS.
- Adaptive exploration via SAC’s entropy parameter enables robust performance by avoiding premature convergence to suboptimal policies.
- A multi-agent clustering and assignment scheme (Hungarian algorithm) enhances initial performance and scalability.
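The two stress conditions behind these findings, phase mismatch and noisy feedback, can be mimicked with a small sketch. The impairment model here is an assumption rather than the paper's exact setup: phase mismatch is drawn as independent zero-mean Gaussian errors per phase shifter, feedback noise is additive Gaussian on the reported gain, and all function names and parameters are hypothetical.

```python
import cmath
import math
import random


def beamforming_gain(phases, channel):
    """Gain |w^H h|^2 of a unit-norm, constant-modulus analog beamformer."""
    n = len(phases)
    w = [cmath.exp(1j * p) / math.sqrt(n) for p in phases]
    return abs(sum(wi.conjugate() * hi for wi, hi in zip(w, channel))) ** 2


def apply_phase_mismatch(phases, mismatch_std, rng):
    """Add an assumed i.i.d. Gaussian phase error (radians) to every phase shifter."""
    return [p + rng.gauss(0.0, mismatch_std) for p in phases]


def noisy_feedback(gain, noise_std, rng):
    """Return the gain as the agent would observe it through an assumed AWGN feedback link."""
    return gain + rng.gauss(0.0, noise_std)


def average_gain_under_mismatch(phases, channel, mismatch_std, trials=200, seed=0):
    """Monte Carlo estimate of the true gain when the hardware distorts the phases."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        impaired = apply_phase_mismatch(phases, mismatch_std, rng)
        total += beamforming_gain(impaired, channel)
    return total / trials


# Usage: a 16-antenna beam perfectly matched to a toy unit-modulus channel.
thetas = [0.3 * k for k in range(16)]
channel = [cmath.exp(1j * t) for t in thetas]
ideal = average_gain_under_mismatch(thetas, channel, mismatch_std=0.0)
impaired = average_gain_under_mismatch(thetas, channel, mismatch_std=0.5)
observed = noisy_feedback(impaired, noise_std=1.0, rng=random.Random(1))
```

Sweeping `mismatch_std` and `noise_std` over a grid, and rerunning each learned codebook at every point, gives the kind of degradation curve the findings describe.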

This digest was generated by AI and reviewed by human editors.