QUICK REVIEW

[Paper Review] Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding

Yang Li, Si Si | arXiv (Cornell University) | 2021. 06. 05.

Advanced Image and Video Retrieval Techniques | References: 44 | Citations: 30

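One-Line Summary

The paper introduces a positional encoding based on learnable Fourier features for multi-dimensional spatial data and combines it with an MLP, giving Transformer-based models an inductive, scalable, and distance-preserving position representation across images and UI-like structures. It reports faster convergence and higher accuracy on several vision and UI tasks.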

ABSTRACT

Attentional mechanisms are order-invariant. Positional encoding is a crucial component to allow attention-based deep model architectures such as Transformer to address sequences or images where the position of information matters. In this paper, we propose a novel positional encoding method based on learnable Fourier features. Instead of hard-coding each position as a token or a vector, we represent each position, which can be multi-dimensional, as a trainable encoding based on learnable Fourier feature mapping, modulated with a multi-layer perceptron. The representation is particularly advantageous for a spatial multi-dimensional position, e.g., pixel positions on an image, where $L_2$ distances or more complex positional relationships need to be captured. Our experiments based on several public benchmark tasks show that our learnable Fourier feature representation for multi-dimensional positional encoding outperforms existing methods by both improving the accuracy and allowing faster convergence.

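Research Motivation and Goals

Establish the need for effective, scalable positional encoding (PE) in multi-dimensional spatial domains (e.g., images, UI layouts) for Transformer-based models.
Propose a positional encoding based on learnable Fourier features that captures Euclidean-like distances and more complex spatial relationships.
Show that the proposed encoding is inductive, parameter-efficient, and extends to unseen positions and higher dimensions.
Show higher accuracy and faster convergence than existing PE methods on image generation, object detection, image classification, and UI widget captioning.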

Method

Represent a multi-dimensional position $x \in \mathbb{R}^M$ with learnable Fourier features $r_x = \frac{1}{\sqrt{D}} [\cos(x W_r^T) \,\|\, \sin(x W_r^T)]$, where $W_r$ is trainable and initialized from $\mathcal{N}(0, \gamma^{-2})$.
The dot product $r_x \cdot r_y$ is shift-invariant and approximates a Gaussian kernel over positions, $k(x, y) \approx \exp(-\|x - y\|^2 / \gamma^2)$.
Pass the Fourier features through a multi-layer perceptron $\phi$ and a linear projection to produce the final positional embedding $PE_x = \phi(r_x, \theta) W_p$.
Handle multi-dimensional positions holistically by grouping coordinates, applying the same encoding pipeline to each group, and concatenating the results.
The approach is inductive (it handles unseen positions) and parameter-efficient (its parameter count does not grow with sequence length).
Integrate the encoding with Transformer-based models by adding the generated $PE_X$ to the content embeddings used in downstream attention computations.
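
The steps above can be sketched as a forward pass in plain Python. This is a minimal illustration and not the authors' implementation: the class name `LearnableFourierPE`, the single hidden layer, the GELU activation, the 0.1 initialization scale for the MLP weights, and all sizes are assumptions made for the example. Training (updating `w_r`, `w1`, `b1`, and `w_p` by gradient descent) is omitted.

```python
import math
import random


def init_matrix(rows, cols, std, rng):
    """Return a rows x cols matrix with entries drawn from N(0, std^2)."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def matvec(mat, vec):
    """Compute vec @ mat^T for a (rows x cols) matrix and a length-cols vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def fourier_features(x, w_r):
    """r_x = (1/sqrt(D)) [cos(x W_r^T) || sin(x W_r^T)], where D = 2 * len(w_r)."""
    proj = matvec(w_r, x)  # x W_r^T, length D/2
    scale = 1.0 / math.sqrt(2 * len(w_r))
    return [scale * math.cos(p) for p in proj] + [scale * math.sin(p) for p in proj]


def gelu(v):
    """Exact GELU via the error function (the activation is an assumption)."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


class LearnableFourierPE:
    """Forward pass only; every weight here would be trainable in practice."""

    def __init__(self, pos_dim, fourier_dim, hidden_dim, out_dim, gamma, seed=0):
        if fourier_dim % 2 != 0:
            raise ValueError("fourier_dim must be even (cos and sin halves)")
        rng = random.Random(seed)
        # W_r ~ N(0, gamma^-2), i.e. standard deviation 1 / gamma.
        self.w_r = init_matrix(fourier_dim // 2, pos_dim, 1.0 / gamma, rng)
        # Hypothetical one-hidden-layer MLP (phi) and output projection (W_p).
        self.w1 = init_matrix(hidden_dim, fourier_dim, 0.1, rng)
        self.b1 = [0.0] * hidden_dim
        self.w_p = init_matrix(out_dim, hidden_dim, 0.1, rng)

    def __call__(self, x):
        r_x = fourier_features(x, self.w_r)
        hidden = [gelu(h + b) for h, b in zip(matvec(self.w1, r_x), self.b1)]
        return matvec(self.w_p, hidden)  # PE_x = phi(r_x) W_p


pe = LearnableFourierPE(pos_dim=2, fourier_dim=16, hidden_dim=8, out_dim=4, gamma=10.0)
embedding = pe([3.0, 4.0])  # positional embedding for a 2D position such as a pixel
```

Because any real-valued coordinate can be passed to `pe`, positions never seen during training still receive an embedding, which is the sense in which the encoding is inductive. With this normalization, $r_x \cdot r_y = \frac{1}{D} \sum_j \cos((x - y) \cdot w_j)$ depends only on $x - y$, and the self dot product $r_x \cdot r_x$ equals $1/2$.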

Research Questions

Can learnable Fourier feature-based positional encodings capture meaningful multi-dimensional spatial relationships (e.g., 2D Euclidean distances) better than fixed sinusoidal or discrete embeddings?
Do the proposed encodings improve convergence speed and accuracy for Transformer-based tasks involving spatial data (images, object detection, UI layouts) and generalize to unseen positions/sizes?
Is the combination Learnable-Fourier Features + MLP more effective than using Fourier features or MLP alone across diverse tasks?
How does the multi-group (partitioned) encoding strategy affect performance in high-dimensional spatial settings like UI widget bounding boxes?

Key Findings

The Learnable-Fourier + MLP encoding consistently outperformed baseline positional encodings across image generation, object detection, image classification, and widget captioning benchmarks.
The combination of learnable Fourier features with an MLP yields faster convergence and higher accuracy than using either component alone.
For unseen image sizes and positions, the Learnable-Fourier + MLP method generalizes better than discrete embedding or sinusoidal approaches, reducing performance gaps on out-of-distribution positions.
Partitioning multi-dimensional positions into groups and encoding each group with shared Fourier features can model more complex spatial relationships than simple L2 distance, benefiting tasks like UI widget captioning.
In image classification with Vision Transformer, Learnable-Fourier + MLP achieved higher top-1 accuracy than Embed-1D, demonstrating practical gains on standard benchmarks.
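
The grouping strategy mentioned in the findings can be illustrated with a short sketch. It covers only the Fourier-feature stage with a shared projection matrix; the MLP stage is left out for brevity. Splitting a bounding box into a top-left and a bottom-right point, the helper name `encode_groups`, and the values of `gamma` and the feature size are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def fourier_features(x, w_r):
    """r_x = (1/sqrt(D)) [cos(x W_r^T) || sin(x W_r^T)], where D = 2 * len(w_r)."""
    proj = [sum(w * v for w, v in zip(row, x)) for row in w_r]
    scale = 1.0 / math.sqrt(2 * len(w_r))
    return [scale * math.cos(p) for p in proj] + [scale * math.sin(p) for p in proj]


def encode_groups(position, group_size, w_r):
    """Split a flat position into groups of `group_size` coordinates, encode
    each group with the same (shared) matrix w_r, and concatenate the results."""
    if len(position) % group_size != 0:
        raise ValueError("position length must be a multiple of group_size")
    features = []
    for start in range(0, len(position), group_size):
        group = position[start:start + group_size]
        features.extend(fourier_features(group, w_r))
    return features


rng = random.Random(0)
gamma = 100.0  # hypothetical kernel scale
# Shared W_r with D/2 = 4 rows and M = 2 columns, drawn from N(0, gamma^-2).
w_r = [[rng.gauss(0.0, 1.0 / gamma) for _ in range(2)] for _ in range(4)]

box = [10.0, 20.0, 110.0, 60.0]  # (x1, y1, x2, y2) of a UI widget
features = encode_groups(box, group_size=2, w_r=w_r)  # 2 groups x 8 features each
```

Encoding all four coordinates as a single group would instead tie the representation to one L2 distance over the whole box, whereas per-group features let later layers compare corners separately.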

This review was generated by AI and reviewed by a human editor.