QUICK REVIEW

[Paper Review] Where to Attend: A Principled Vision-Centric Position Encoding with Parabolas

Christoffer Koo Øhrstrøm, Rafael I. Cabral Muchacho | arXiv (Cornell University) | Feb 1, 2026

Advanced Image and Video Retrieval Techniques | Citations: 0

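One-line summary

Summary: This paper introduces Parabolic Position Encoding (PaPE) and its rotation-invariant variant PaPE-RI, which add a parabola-based attention bias to vision transformers. The methods perform strongly across 8 datasets and 4 modalities, and PaPE extrapolates especially well on ImageNet-1K.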

ABSTRACT

We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens, such as images, point clouds, videos, or event camera streams, our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D-sequences in language to nD-structures in vision, but only with partial account of vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. We evaluate PaPE on 8 datasets that span 4 modalities. We find that either PaPE or PaPE-RI achieves the top performance on 7 out of 8 datasets. Extrapolation experiments on ImageNet-1K show that PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best position encoding. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.

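Motivation and objectives

Develop a vision-specific position encoding guided by principles drawn from prior work: translation invariance, rotation invariance, distance decay, directionality, and context awareness.
Design PaPE, which encodes relative token positions as a sum of parabolas, and achieve efficient attention that is compatible with query/key transformations.
Evaluate PaPE and PaPE-RI on multiple vision modalities (images, point clouds, videos, and event cameras) and on large-scale datasets.
Demonstrate PaPE's generality and its strong ability to extrapolate beyond the training resolution.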

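Proposed method

Define the relative position $\Delta r_{ij}$ through a learnable projection $W_p$.
Compute the parabola coefficients $a_i$ and $b_i$ from the token representations using $W_a$ and $W_b$, and constrain $a_i$ to be negative so that each parabola stays concave.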
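Define the attention logit $S_{ij}$ as the sum of the parabola terms and a semantic term: $S_{ij} = \sum_{\ell} \left( a_{i,\ell}\,\Delta r_{ij,\ell}^{2} + b_{i,\ell}\,\Delta r_{ij,\ell} \right) + \frac{q_i \cdot k_j}{m}$.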
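Provide PaPE-RI as the rotation-invariant variant by setting $b_i = 0$ and constraining $W_p$ and $a_i$.
Introduce transformations $f_q$ and $f_k$ that embed positions into the queries and keys while preserving the PaPE equation (Equation 9), which keeps the method compatible with efficient attention kernels.
Run a broad evaluation on 8 datasets spanning 4 vision modalities (images, point clouds, spatiotemporal data, and multimodal data), including ablations, extrapolation tests, and an efficiency analysis.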
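To make the logit formula in the method list concrete, here is a minimal NumPy sketch for a single attention head. It is an illustration under stated assumptions, not the authors' implementation: the review does not say how $\Delta r_{ij}$ is built from $W_p$ or how $a_i$ is kept negative, so the sketch assumes $\Delta r_{ij} = W_p (p_j - p_i)$ and $a_i = -\mathrm{softplus}(W_a x_i)$. The function name `pape_logits` and all shapes are hypothetical.

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def pape_logits(x, pos, W_q, W_k, W_p, W_a, W_b):
    """Logits S[i, j] = sum_l (a[i,l] * dr[i,j,l]**2 + b[i,l] * dr[i,j,l]) + q_i.k_j / m.

    x: (n, d) token features, pos: (n, p) token coordinates.
    ASSUMPTIONS (not stated in the review):
      dr[i, j] = W_p @ (pos[j] - pos[i])   and   a = -softplus(x @ W_a).
    """
    q, k = x @ W_q, x @ W_k                                # (n, m) each
    m = q.shape[-1]
    dr = (pos[None, :, :] - pos[:, None, :]) @ W_p.T       # (n, n, L)
    a = -softplus(x @ W_a)                                 # (n, L), kept negative
    b = x @ W_b                                            # (n, L)
    parabola = (a[:, None, :] * dr**2 + b[:, None, :] * dr).sum(-1)
    semantic = (q @ k.T) / m
    return parabola + semantic

# Small demo with random, untrained parameters.
rng = np.random.default_rng(0)
n, d, p, m, L = 6, 16, 2, 8, 4
x = rng.normal(size=(n, d))
pos = rng.uniform(0.0, 10.0, size=(n, p))
shapes = [(d, m), (d, m), (L, p), (d, L), (d, L)]          # W_q, W_k, W_p, W_a, W_b
params = [0.1 * rng.normal(size=s) for s in shapes]

S = pape_logits(x, pos, *params)
S_shifted = pape_logits(x, pos + 5.0, *params)             # translate every token
```

Because only position differences enter the formula, `S` and `S_shifted` agree up to floating-point error, which is the translation invariance named among the design principles. Setting `b` to zero in this sketch leaves a purely distance-dependent bias that peaks at zero offset.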

Figure 2: Overview of Parabolic Position Encoding (PaPE). PaPE decomposes attention (a) into distance (b), direction (c), and semantics (d). Using the dog’s eye as the query, PaPE learns to look in a bottom-right direction, while decaying attention with distance. The attention (a) is compatible wit

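Experimental results

Research questions

RQ1 How can a principled position encoding be designed to capture translation invariance, appropriate rotation invariance, distance decay, directionality, and context awareness for vision modalities?
RQ2 Does the parabola-based encoding (PaPE) achieve better generalization and extrapolation than existing encodings across diverse vision tasks and modalities?
RQ3 Can PaPE be implemented so that it is compatible with efficient attention kernels without sacrificing performance?
RQ4 What effect do PaPE and PaPE-RI have on large-scale datasets and in multimodal settings?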

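Key findings

Either PaPE or PaPE-RI achieves the top performance on 7 of the 8 evaluation datasets, which span 4 vision modalities.
PaPE reaches an average score of 66.3 across the 8 datasets, 1 point above RoPE on average.
PaPE extrapolates strongly: it improves ImageNet-1K accuracy by up to 1% at resolutions up to $512^2$ and remains robust at higher resolutions.
Ablations show that distance decay, directionality, context awareness, and $W_p$ all contribute to accuracy; removing any one of them lowers performance.
PaPE stays compatible with efficient attention kernels through its query/key transformations, with modest parameter and runtime overhead.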
On nuScenes (multimodal), the PaPE-RI variant achieves top or near-top performance, indicating the benefit of rotation invariance in multimodal settings.
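The kernel-compatibility finding rests on the fact that a per-query parabola in the relative position can be rewritten as an ordinary dot product between position-augmented queries and keys. The sketch below checks that identity numerically. It is a generic algebraic expansion, not necessarily the paper's $f_q$ and $f_k$, and it assumes $\Delta r_{ij} = r_j - r_i$ with $r_i$ the projected position of token $i$; the function names are hypothetical.

```python
import numpy as np

def bias_direct(a, b, r):
    # Direct form: sum_l a[i,l] * dr[i,j,l]**2 + b[i,l] * dr[i,j,l],
    # with dr[i, j] = r[j] - r[i] (ASSUMED sign convention).
    dr = r[None, :, :] - r[:, None, :]                     # (n, n, L)
    return (a[:, None, :] * dr**2 + b[:, None, :] * dr).sum(-1)

def bias_as_dot_product(a, b, r):
    # Expand in powers of r_j:
    #   a*(rj - ri)^2 + b*(rj - ri)
    #     = a*rj^2 + (b - 2*a*ri)*rj + (a*ri^2 - b*ri)
    # Query side carries the coefficients, key side carries [rj^2, rj, 1].
    q_pos = np.concatenate([a, b - 2.0 * a * r, a * r**2 - b * r], axis=-1)  # (n, 3L)
    k_pos = np.concatenate([r**2, r, np.ones_like(r)], axis=-1)              # (n, 3L)
    return q_pos @ k_pos.T                                                   # (n, n)

rng = np.random.default_rng(1)
n, L = 5, 3
r = rng.uniform(0.0, 4.0, size=(n, L))       # projected positions (hypothetical values)
a = -rng.uniform(0.1, 1.0, size=(n, L))      # negative, so each parabola is concave
b = rng.normal(size=(n, L))

direct = bias_direct(a, b, r)
folded = bias_as_dot_product(a, b, r)
```

Concatenating `q_pos` and `k_pos` onto suitably scaled semantic queries and keys would let a standard fused attention kernel compute the full logit without materializing an explicit bias matrix. The last query-side term is constant per query, so softmax is unaffected if it is dropped. One caveat of this expanded form is that large coordinates can cost floating-point precision through cancellation.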

Figure 3: Model analysis on ImageNet-1K. Red ($z>0$) highlights heads that lean heavily on positional information, while blue ($z<0$) marks heads that prioritize semantic content in deciding what to attend to. Positions are used most strongly in early layers.


This review was generated by AI and checked by a human editor.