QUICK REVIEW

[Paper Review] Training slow silicon neurons to control extremely fast robots with spiking reinforcement learning

Irene Ambrosini, Ingo Blakowski|arXiv (Cornell University)|Jan 29, 2026
Advanced Memory and Neural Computing | Citations: 0
One-sentence summary

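Real-time, hardware-in-the-loop neuromorphic reinforcement learning with 1020 DYNAP-SE neurons controls a fast air hockey swing, reaching high success rates (96–98%) under varied conditions using online learning and fixed random reservoir connectivity.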

ABSTRACT

Air hockey demands split-second decisions at high puck velocities, a challenge we address with a compact network of spiking neurons running on a mixed-signal analog/digital neuromorphic processor. By co-designing hardware and learning algorithms, we train the system to achieve successful puck interactions through reinforcement learning in a remarkably small number of trials. The network leverages fixed random connectivity to capture the task's temporal structure and adopts a local e-prop learning rule in the readout layer to exploit event-driven activity for fast and efficient learning. The result is real-time learning with a setup comprising a computer and the neuromorphic chip in-the-loop, enabling practical training of spiking neural networks for robotic autonomous systems. This work bridges neuroscience-inspired hardware with real-world robotic control, showing that brain-inspired approaches can tackle fast-paced interaction tasks while supporting always-on learning in intelligent machines.

Motivation and objectives

  • Motivate energy-efficient, online learning for autonomous robotics under milliwatt budgets.
  • Show that neuromorphic RL can scale from gaming benchmarks to real-time physical manipulation tasks.
  • Demonstrate robust learning with fixed random reservoirs and local e-prop readout in a 6D continuous-state control scenario.

Proposed method

  • Use a DYNAP-SE mixed-signal neuromorphic chip for closed-loop inference at 50 Hz with online learning.
  • Encode 6 state variables into population spike codes processed by 1020 AdEx-LIF neurons in a fixed reservoir.
  • Implement a two-action readout with plastic readout weights updated via an e-prop rule using a global reward signal.
  • Compute actions as a softmax over readout activations at 20 ms post-sense, with the environment receiving action probabilities.
  • Train over 2000 episodes with a shaped scalar reward that encourages forward puck motion and precise timing.
  • Compare performance across encoding-range variations and random reservoir samples to assess robustness and generalization.
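
The steps listed above (population encoding, a fixed random reservoir, a two-action softmax readout, and a reward-modulated readout update) can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the Gaussian tuning curves, the 10 input units per variable, the rectified random projection standing in for the DYNAP-SE neurons, the state ranges, and the REINFORCE-style three-factor update used in place of the actual e-prop rule are all hypothetical choices made for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATE = 6      # puck [x, y, vx, vy] + end-effector [x, y], as in the review
N_PER_VAR = 10   # hypothetical: input units per state variable
N_RES = 1020     # reservoir size stated in the review
N_ACTIONS = 2    # two-action readout


def population_encode(state, lo, hi, n=N_PER_VAR, width=0.1):
    """Map each state variable to n Gaussian-tuned firing rates in [0, 1]."""
    centers = np.linspace(0.0, 1.0, n)
    norm = np.clip((state - lo) / (hi - lo), 0.0, 1.0)
    rates = np.exp(-0.5 * ((norm[:, None] - centers[None, :]) / width) ** 2)
    return rates.ravel()


# Fixed random projection: a software stand-in for the untrained reservoir.
W_in = rng.normal(size=(N_RES, N_STATE * N_PER_VAR)) / np.sqrt(N_STATE * N_PER_VAR)


def reservoir_activity(rates):
    """Stand-in for per-neuron activity over one 20 ms window (assumption)."""
    return np.maximum(W_in @ rates, 0.0)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def step(W_out, state, lo, hi):
    """One sense-to-action pass: returns reservoir activity and action probabilities."""
    x = reservoir_activity(population_encode(state, lo, hi))
    return x, softmax(W_out @ x)


def update_readout(W_out, x, probs, action, reward, baseline=0.0, lr=1e-3):
    """Three-factor update: presynaptic activity x output error x global reward.

    This is a REINFORCE-style simplification, not the e-prop rule itself.
    """
    onehot = np.eye(N_ACTIONS)[action]
    return W_out + lr * (reward - baseline) * np.outer(onehot - probs, x)


# Hypothetical state ranges and one sample state, for illustration only.
lo = np.array([0.0, -0.5, -1.5, -1.5, 0.0, -0.5])
hi = np.array([2.0, 0.5, 1.5, 1.5, 2.0, 0.5])
state = np.array([1.0, 0.0, -1.2, 0.1, 0.2, 0.0])

W_out = np.zeros((N_ACTIONS, N_RES))        # only these weights are plastic
x, probs = step(W_out, state, lo, hi)
W_new = update_readout(W_out, x, probs, action=1, reward=1.0)
new_probs = softmax(W_new @ x)
```

With all-zero readout weights the two actions start equally likely; a positive reward for action 1 shifts probability toward it for the same input. Only the readout weights change, which mirrors the division between the fixed reservoir and the plastic readout described above.
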
Figure 1: Control pipeline and environment. Top-left: High-level flow from MuJoCo (puck $[x_{p},y_{p},v_{x},v_{y}]$ and end-effector $[x_{ee},y_{ee}]$) through the decision module to the robot controller. The CPU encodes sensory data into spike trains, processed by DYNAP-SE's silicon neurons, then d

Experimental results

Research questions

  • RQ1: Can neuromorphic reinforcement learning with a fixed random reservoir achieve robust, fast control in a 6D continuous robotic task?
  • RQ2: Is online, on-chip learning via local plasticity (e-prop) sufficient for high-performance control in a real-time hardware-in-the-loop setup?
  • RQ3: How do encoding range and reservoir randomness affect convergence speed and final performance in a fast manipulation task?

Key findings

  • 100% success rate within 200 trials for a stationary puck at 1.0 m from the robot frame.
  • 100% success after 1000 episodes for a constant-speed lateral launch.
  • 96–98% success stabilizing after 1300–1500 episodes under speed variability (v in [1.0,1.5] m/s).
  • Encoding-range tests show >97% success for a narrow range [0.7,0.9] m/s in ~150 episodes, ~97% success for a medium range [0.7,1.2] m/s in ~700 episodes, and ~93% success for a broad range [0.7,1.5] m/s (about 4 percentage points below the medium range).
  • Demonstrates that 1020 silicon neurons can enable robust, millisecond-precision interception in a 6D continuous state space with hardware-in-the-loop learning.
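
To put the speed ranges above in context, the short calculation below estimates how far the puck moves during one control step. It assumes one decision per step of the 50 Hz loop mentioned in the method (a 20 ms period); the function name is hypothetical and the review does not state the actual timing resolution of the swing command.

```python
CONTROL_RATE_HZ = 50.0               # closed-loop rate stated in the review
STEP_S = 1.0 / CONTROL_RATE_HZ       # 0.02 s per control step


def travel_per_step_cm(speed_m_s, step_s=STEP_S):
    """Distance in centimeters covered by the puck during one control step."""
    return speed_m_s * step_s * 100.0


for v in (0.7, 1.0, 1.5):            # puck speeds quoted in the findings, in m/s
    print(f"{v:.1f} m/s -> {travel_per_step_cm(v):.1f} cm per control step")
```

Under this assumption, acting one step late shifts the contact point by roughly 1.4 cm at 0.7 m/s and 3.0 cm at 1.5 m/s.
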
Figure 2: Neuromorphic learning masters interception timing and generalizes robustly. (a) Timing acquisition: Pre-training (dashed) shows erratic actions; post-training (solid) achieves immediate, low-variance interceptions, reflecting learned timing. (b) Policy evolution: Stochastic switching betwe


This review was generated by AI and checked by a human editor.