QUICK REVIEW

[Paper Review] Spanning the Visual Analogy Space with a Weight Basis of LoRAs

Hila Manor, Rinon Gal|arXiv (Cornell University)|Feb 17, 2026

Generative Adversarial Networks and Image Synthesis|Cited by 0

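One-line summary

LoRWeB learns a basis of LoRA adapters and composes them dynamically at inference time, enabling flexible visual-analogy editing and achieving state-of-the-art generalization to unseen transformations.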

ABSTRACT

Visual analogy learning enables image manipulation through demonstration rather than textual description, allowing users to specify complex transformations difficult to articulate in words. Given a triplet $\{\mathbf{a}, \mathbf{a}', \mathbf{b}\}$, the goal is to generate $\mathbf{b}'$ such that $\mathbf{a} : \mathbf{a}' :: \mathbf{b} : \mathbf{b}'$. Recent methods adapt text-to-image models to this task using a single Low-Rank Adaptation (LoRA) module, but they face a fundamental limitation: attempting to capture the diverse space of visual transformations within a fixed adaptation module constrains generalization capabilities. Inspired by recent work showing that LoRAs in constrained domains span meaningful, interpolatable semantic spaces, we propose LoRWeB, a novel approach that specializes the model for each analogy task at inference time through dynamic composition of learned transformation primitives, informally, choosing a point in a "space of LoRAs". We introduce two key components: (1) a learnable basis of LoRA modules, to span the space of different visual transformations, and (2) a lightweight encoder that dynamically selects and weighs these basis LoRAs based on the input analogy pair. Comprehensive evaluations demonstrate our approach achieves state-of-the-art performance and significantly improves generalization to unseen visual transformations. Our findings suggest that LoRA basis decompositions are a promising direction for flexible visual manipulation. Code and data are in https://research.nvidia.com/labs/par/lorweb

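Research motivation and objectives

Motivate visual analogy learning as a means of performing complex image edits by demonstration.
Overcome the limitations of a single LoRA adapter by spanning the semantic space with a basis of LoRAs.
Develop a dynamic inference-time mechanism that selects and weights LoRAs based on the input analogy pair.
Jointly train the basis and a router so that suitable transformations can be combined for unseen analogies.
Demonstrate improved generalization and editing fidelity on a diverse set of visual analogies.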

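Proposed method

Introduce a learnable basis of N rank-r LoRAs that spans diverse visual transformations.
Associate each LoRA pair with a learnable key vector, and use an encoder that produces a query from the input triplet $\{a, a', b\}$.
Compute mixing coefficients as a softmax over the inner products between the query and the LoRA keys, and use them to form a Mixed LoRA.
Inject the Mixed LoRA into a conditional diffusion/flow model (Flux.1-Kontext) to generate $b'$ for a new image $b$.
Jointly train the LoRA basis and the encoder so that they generalize to unseen analogies.
Encode the conditioning images with CLIP, and provide the extended analogy triplet to the diffusion model through an extended-attention mechanism to support detailed edits.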
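
The query-key mixing step described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. It assumes the Mixed LoRA is a coefficient-weighted sum of each basis LoRA's weight update B_i A_i (the summary does not say whether mixing is applied to the updates or to the factors), and it takes the query as given instead of computing it with an encoder. The names `softmax`, `matmul`, and `mix_loras`, and the toy dimensions, are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def mix_loras(query, keys, loras):
    """Blend basis LoRAs using softmax(query . key_i) as coefficients.

    query: list of d_k floats (assumed to come from an encoder)
    keys:  N key vectors, one per basis LoRA
    loras: N pairs (B, A) with B of shape d_out x r and A of shape r x d_in
    Returns (coeffs, delta_w), where delta_w = sum_i coeffs[i] * B_i A_i.
    Assumption: mixing is done on the weight updates, not on the factors.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    coeffs = softmax(scores)
    d_out, d_in = len(loras[0][0]), len(loras[0][1][0])
    delta_w = [[0.0] * d_in for _ in range(d_out)]
    for c, (B, A) in zip(coeffs, loras):
        update = matmul(B, A)  # d_out x d_in update of one basis LoRA
        for i in range(d_out):
            for j in range(d_in):
                delta_w[i][j] += c * update[i][j]
    return coeffs, delta_w

# Toy basis: N = 2 rank-1 LoRAs acting on a 2 x 2 weight matrix.
loras = [
    ([[1.0], [0.0]], [[1.0, 0.0]]),  # update touches only entry (0, 0)
    ([[0.0], [1.0]], [[0.0, 1.0]]),  # update touches only entry (1, 1)
]
keys = [[1.0, 0.0], [0.0, 1.0]]
query = [2.0, 0.0]  # aligned with the first key

coeffs, delta_w = mix_loras(query, keys, loras)
print(coeffs)
print(delta_w)
```

Because the query is aligned with the first key, the first basis LoRA receives the larger coefficient and dominates the mixed update. In the full method the resulting update would be added to the corresponding layer weights of the generative model.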

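Experimental results

Research questions

RQ1: Can a basis of LoRAs combined with a learnable router generalize to unseen visual analogies beyond those seen during training?
RQ2: Does dynamic, input-dependent LoRA mixing preserve image details better and apply the transformation more accurately than a single-LoRA baseline?
RQ3: How does LoRWeB compare with existing analogy methods in terms of generalization, editing accuracy, and content preservation?
RQ4: How do the basis size and the LoRA rank affect performance and generalization?
RQ5: Is CLIP-based encoding alone sufficient for LoRA selection, or does the model benefit from the full analogy triplet with extended-attention conditioning?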

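Key findings

LoRWeB improves generalization to unseen analogy tasks compared with a single-LoRA baseline and prior methods.
A learnable LoRA basis and a lightweight encoder can effectively cover a wide range of transformations through dynamic mixing.
In both quantitative and human evaluations, LoRWeB achieves accurate edits across diverse tasks while better preserving the input content.
A larger basis (N) and an appropriate rank (r) are important for performance; simply increasing the rank without diversifying the basis can worsen results.
Using the full analogy triplet with extended attention improves the preservation of fine details during editing.
Results are robust across encoders (including CLIP and SigLIP), and the encoder design (encoding a, a', and b separately) helps the model understand the task.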
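
To make the trade-off between basis size N and rank r more concrete, the sketch below compares parameter counts for a basis of small LoRAs and for a single high-rank LoRA. It relies only on the standard LoRA parameterization (an r x d_in matrix and a d_out x r matrix per adapted layer). The layer dimensions and the (N, r) settings are hypothetical and are not taken from the paper, and the helper name `lora_param_count` is made up for illustration.

```python
def lora_param_count(num_loras, rank, d_in, d_out):
    """Parameters in `num_loras` LoRAs adapting one d_out x d_in layer.

    Each LoRA holds A (rank x d_in) and B (d_out x rank).
    Key vectors and the encoder are not counted here.
    """
    return num_loras * rank * (d_in + d_out)

d_in = d_out = 3072  # hypothetical layer width

basis_params = lora_param_count(num_loras=32, rank=4, d_in=d_in, d_out=d_out)
single_params = lora_param_count(num_loras=1, rank=128, d_in=d_in, d_out=d_out)

print(basis_params, single_params)
```

Under these assumed settings both configurations have the same parameter budget, so a comparison between them would isolate how the capacity is organized (many input-selected low-rank primitives versus one fixed high-rank module) rather than how much capacity there is.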

This review was generated by AI and checked by a human editor.