QUICK REVIEW

[Paper Review] Cross-modal Fuzzy Alignment Network for Text-Aerial Person Retrieval and A Large-scale Benchmark

Yifei Deng, Chenglong Li|arXiv (Cornell University)|Mar 21, 2026

Multimodal Machine Learning Applications | Citations: 0

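One-line summary

Introduces CFAN, a cross-modal model that combines fuzzy token-level alignment between text and aerial images with a ground-view bridging mechanism, and builds the large-scale AERI-PEDES benchmark with a Chain-of-Thought captioning framework.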

ABSTRACT

Text-aerial person retrieval aims to identify targets in UAV-captured images from eyewitness descriptions, supporting intelligent transportation and public security applications. Compared to ground-view text-image person retrieval, UAV-captured images often suffer from degraded visual information due to drastic variations in viewing angles and flight altitudes, making semantic alignment with textual descriptions very challenging. To address this issue, we propose a novel Cross-modal Fuzzy Alignment Network, which quantifies the token-level reliability by fuzzy logic to achieve accurate fine-grained alignment and incorporates ground-view images as a bridge agent to further mitigate the gap between aerial images and text descriptions, for text-aerial person retrieval. In particular, we design the Fuzzy Token Alignment module that employs the fuzzy membership function to dynamically model token-level association strength and suppress the influence of unobservable or noisy tokens. It can alleviate the semantic inconsistencies caused by missing visual cues and significantly enhance the robustness of token-level semantic alignment. Moreover, to further mitigate the gap between aerial images and text descriptions, we design a Context-Aware Dynamic Alignment module to incorporate the ground-view agent as a bridge in text-aerial alignment and adaptively combine direct alignment and agent-assisted alignment to improve the robustness. In addition, we construct a large-scale benchmark dataset called AERI-PEDES by using a chain-of-thought to decompose text generation into attribute parsing, initial captioning, and refinement, thus boosting textual accuracy and semantic consistency. Experiments on AERI-PEDES and TBAPR demonstrate the superiority of our method.

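Motivation and objectives

Address the cross-modal alignment challenge caused by large viewpoint variation between text descriptions and UAV-captured aerial images.
Improve fine-grained text-aerial alignment by using token-level reliability estimated with fuzzy logic.
Use ground-view images as a bridge agent to reduce the modality gap and make alignment more robust.
Build a large-scale cross-view benchmark (AERI-PEDES), using a CoT-based caption generation framework, to support research on text-aerial person retrieval.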

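Proposed method

Propose Context-Aware Dynamic Alignment (CDA), which weights direct text-aerial alignment against ground-bridged alignment based on sample-wise similarity differences.
Introduce Fuzzy Token Alignment (FTA), which computes token-level reliability using shared queries and Gaussian membership functions, then combines the reliabilities with a fuzzy AND to weight each token's contribution.
Use a shared CLIP-based encoder for aerial and ground images and the CLIP text encoder for descriptions; apply an SDM-based loss for alignment.
Generate rich, consistent captions with a Chain-of-Thought caption generation framework; test captions are annotated manually.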
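
To make the FTA and CDA ideas concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the review gives no formulas, so every function name, the choice of the minimum as the fuzzy AND, the Gaussian parameters `center` and `sigma`, the product used for the bridged path, and the sigmoid gate with `temperature` are illustrative assumptions.

```python
import math

def gaussian_membership(x, center=1.0, sigma=0.5):
    """Gaussian fuzzy membership in (0, 1]: equals 1 at `center` (assumed parameters)."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))

def fuzzy_and(a, b):
    """Fuzzy AND via the minimum t-norm (assumption; the product t-norm is another option)."""
    return min(a, b)

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def fuzzy_token_similarity(text_tokens, image_tokens, query):
    """Reliability-weighted mean of each text token's best match among image tokens."""
    weighted, total = 0.0, 0.0
    for t in text_tokens:
        best = max(image_tokens, key=lambda p: cosine(t, p))
        sim = cosine(t, best)
        # Hypothetical reliability: membership of each side's similarity to a shared query.
        mu_text = gaussian_membership(cosine(t, query))
        mu_image = gaussian_membership(cosine(best, query))
        weight = fuzzy_and(mu_text, mu_image)
        weighted += weight * sim
        total += weight
    return weighted / total if total > 0 else 0.0

def dynamic_alignment(sim_direct, sim_text_ground, sim_ground_aerial, temperature=0.1):
    """Blend direct text-aerial similarity with a ground-bridged similarity per sample."""
    # Assumption: the bridged path is scored as the product of its two hops.
    sim_bridge = sim_text_ground * sim_ground_aerial
    # Sigmoid gate on the per-sample similarity gap: lean on whichever path scores higher.
    gate = 1.0 / (1.0 + math.exp(-(sim_bridge - sim_direct) / temperature))
    return (1.0 - gate) * sim_direct + gate * sim_bridge
```

In this sketch a text token with no visible counterpart in the image receives a low weight, so it pulls the overall score down less than it would under a plain average, which is the behaviour the FTA description aims for.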

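Experimental results

Research questions

RQ1: How can robust alignment be achieved between text descriptions and UAV-captured aerial person images under large viewpoint variation and occlusion?
RQ2: Can using ground-view images as a bridge effectively narrow the text-aerial cross-modal gap?
RQ3: Does token-level fuzzy reliability improve fine-grained text-aerial alignment?
RQ4: How does the Chain-of-Thought caption generation framework affect the quality and usefulness of the training captions?

Key findings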

Method	AERI-PEDES Rank-1	AERI-PEDES Rank-5	AERI-PEDES Rank-10	AERI-PEDES mAP	AERI-PEDES RSum	TBAPR Rank-1	TBAPR Rank-5	TBAPR Rank-10	TBAPR mAP	TBAPR RSum
IRRA [13]	35.14	53.23	63.19	33.42	151.57	39.63	58.72	67.69	35.31	166.04
APTM [44]	34.62	53.95	64.50	31.09	153.07	43.59	62.03	69.75	38.71	175.37
RDE [32]	38.56	58.26	67.89	37.16	164.71	37.31	54.06	60.75	32.17	152.12
CFAM [49]	30.77	51.37	61.61	30.40	143.75	48.34	66.31	73.21	42.67	184.78
NAM [36]	42.47	61.72	69.99	40.22	174.17	46.56	63.13	70.13	40.92	179.82
VFE [34]	35.76	55.35	65.56	35.35	156.67	47.94	62.50	68.17	42.18	178.63
DM-Adapter [25]	33.42	53.17	62.79	32.41	149.37	37.81	58.34	66.56	33.28	162.71
LPNC [37]	35.65	53.61	63.69	35.19	152.95	41.78	58.03	65.50	37.87	165.31
LPNC+Pretrain [37]	43.79	61.49	70.40	42.22	175.68	45.41	62.31	69.94	42.17	177.66
AEA-FIRM [38]	37.94	56.66	65.71	34.89	160.31	44.75	62.38	69.13	36.28	176.26
AEA-FIRM+Pretrain [14]	44.42	61.96	71.03	41.12	177.41	48.15	63.87	71.21	42.01	183.23
HAM [14]	44.58	63.52	72.67	42.45	180.77	47.81	64.96	72.53	41.86	185.30
Ours (W/O Ground)	45.06	64.53	73.21	43.27	182.80	49.15	65.88	73.47	42.89	188.50
Ours (With Ground)	47.16	65.66	73.83	44.79	186.65	49.47	66.50	73.06	43.96	189.03
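
The RSum column in the table above is consistent with the sum of the Rank-1, Rank-5, and Rank-10 accuracies. The short check below recomputes it for the two "Ours" rows and reports the gain from the ground bridge. The review does not define RSum explicitly, so the `rsum` helper reflects an assumption inferred from the table values.

```python
def rsum(rank1, rank5, rank10):
    """RSum assumed to be Rank-1 + Rank-5 + Rank-10, rounded to two decimals."""
    return round(rank1 + rank5 + rank10, 2)

# (Rank-1, Rank-5, Rank-10) copied from the "Ours" rows of the table.
results = {
    ("AERI-PEDES", "without ground"): (45.06, 64.53, 73.21),
    ("AERI-PEDES", "with ground"): (47.16, 65.66, 73.83),
    ("TBAPR", "without ground"): (49.15, 65.88, 73.47),
    ("TBAPR", "with ground"): (49.47, 66.50, 73.06),
}

for dataset in ("AERI-PEDES", "TBAPR"):
    base = rsum(*results[(dataset, "without ground")])
    bridged = rsum(*results[(dataset, "with ground")])
    print(f"{dataset}: RSum {base:.2f} -> {bridged:.2f} (gain {bridged - base:+.2f})")
```

Under this reading, the ground bridge adds 3.85 RSum points on AERI-PEDES and 0.53 on TBAPR.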

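By combining CDA and FTA, CFAN achieves state-of-the-art results on the AERI-PEDES and TBAPR benchmarks.
Adding the ground-view bridge in CDA improves RSum over the variant without a bridge, from 182.80 to 186.65 on AERI-PEDES and from 188.50 to 189.03 on TBAPR.
FTA improves Rank-1, mAP, and RSum, and yields additional gains when combined with CDA.
Among the bridge modalities, the ground bridge gives the best results, although an aerial bridge also helps.
On TBAPR, CFAN with the ground bridge reaches a Rank-5 of 66.50% and an RSum of 189.03, indicating robustness across datasets.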

This review was generated by AI and checked by a human editor.