QUICK REVIEW

[Paper Review] Tuning computer vision models with task rewards

André Susano Pinto, A. I. Kolesnikov | arXiv (Cornell University) | Feb 16, 2023

Multimodal Machine Learning Applications | 13 citations

TL;DR

The paper demonstrates that tuning pretrained computer vision models with reinforcement learning on a task reward improves their alignment with intended usage across object detection, panoptic segmentation, colorization, and image captioning.

ABSTRACT

Misalignment between model predictions and intended usage can be detrimental for the deployment of computer vision models. The issue is exacerbated when the task involves complex structured outputs, as it becomes harder to design procedures which address this misalignment. In natural language processing, this is often addressed using reinforcement learning techniques that align models with a task reward. We adopt this approach and show its surprising effectiveness across multiple computer vision tasks, such as object detection, panoptic segmentation, colorization and image captioning. We believe this approach has the potential to be widely useful for better aligning models with a diverse range of computer vision tasks.

Motivation & Objective

Address misalignment between model predictions and intended usage in complex vision tasks.
Leverage reinforcement learning rewards to directly optimize task-related performance.
Show that a simple two-step pipeline (MLE pretraining followed by reward tuning) works across diverse CV tasks.
Demonstrate improvements without requiring task-specific architectural changes.
Highlight potential for incorporating more complex rewards (e.g., human feedback) in vision models.

Proposed method

Pretrain a model with maximum-likelihood estimation (MLE) to capture the data distribution; the result is referred to as the MLE model.
Fine-tune the MLE model by maximizing a task-related reward using the REINFORCE algorithm (log-derivative trick).
Reduce gradient variance with a baseline: sample two outputs per input and use the reward of one as the baseline for the other, so the gradient is weighted by reward(sample) - reward(baseline).
Represent outputs as sequences (e.g., bounding boxes, color channels, captions) and optimize non-differentiable rewards.
Apply task-specific rewards such as Panoptic Quality (PQ), average recall, mean average precision (mAP), and CIDEr, along with custom rewards like colorfulness.
Maintain a two-stage process: (1) MLE pretraining, (2) reward-based tuning, which reuses the pretrained model's sampling strategy.
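The reward-tuning step listed above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the paper's implementation: the "model" is a plain table of per-position logits with no image input, the zero-initialized logits stand in for an MLE-pretrained model, and `TARGET`, `VOCAB`, `reward_fn`, `reinforce_step`, and `tune` are hypothetical names chosen for this example. The reward (fraction of tokens matching a target sequence) stands in for a metric such as PQ, mAP, or CIDEr and is only ever evaluated on discrete samples.

```python
import numpy as np

# Hypothetical toy setup: a 4-token output sequence over a 4-symbol vocabulary.
TARGET = np.array([2, 0, 3, 1])
VOCAB = 4

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sample_sequence(probs, rng):
    # Draw one token per position from the current policy.
    return np.array([rng.choice(VOCAB, p=p) for p in probs])

def reward_fn(seq):
    # Assumed stand-in for a task metric: fraction of tokens matching TARGET.
    # It is computed on sampled tokens, so no gradient flows through it.
    return float(np.mean(seq == TARGET))

def reinforce_step(logits, rng, lr=0.5):
    probs = softmax(logits)
    sample = sample_sequence(probs, rng)
    baseline = sample_sequence(probs, rng)  # second sample, used as baseline
    advantage = reward_fn(sample) - reward_fn(baseline)
    # Log-derivative trick for a softmax policy:
    # d log p(sample) / d logits = onehot(sample) - probs, per position.
    grad_logp = -probs
    grad_logp[np.arange(len(sample)), sample] += 1.0
    # Gradient ascent on the expected reward.
    return logits + lr * advantage * grad_logp

def tune(steps=3000, seed=0):
    rng = np.random.default_rng(seed)
    # Uniform logits stand in for the MLE-pretrained model in this toy.
    logits = np.zeros((len(TARGET), VOCAB))
    for _ in range(steps):
        logits = reinforce_step(logits, rng)
    return logits

if __name__ == "__main__":
    probs = softmax(tune())
    print("P(target token) per position:", probs[np.arange(len(TARGET)), TARGET])
```

Because the baseline is an independent sample from the same policy, its expected contribution to the gradient is zero, so it lowers variance without biasing the update. When the two samples score the same, the step is zero.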

Experimental results

Research questions

RQ1: Can reward-based tuning via REINFORCE improve alignment with task risk for diverse vision tasks without changing model architecture?
RQ2: How do reward-based gains compare to traditional task-specific training tricks and post-processing methods?
RQ3: Do simple, metric-based rewards suffice to improve complex outputs such as boxes, segments, colors, and captions?

Key findings

Panoptic segmentation: reward tuning improves Panoptic Quality (PQ) from 43.1 to 46.1 on COCO validation (with 512x512 input).
Object detection: reward-based tuning increases mAP from 39.2 to 54.3 and AR@100 from 54.4 to 67.2; recall-focused tuning reaches an AR@100 of 68.4.
Colorization: reward tuning yields more vivid colors and greater hue diversity, with both the colorfulness and hue-entropy reward terms increasing substantially.
Image captioning: CIDEr scores improve from 120.0 to 134.5 (ViT-B) and 121.7 to 138.7 (ViT-L) on COCO test splits.
Across tasks, reward optimization demonstrates improved alignment with the intended usage over standard MLE training.
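To put the reported numbers on a common footing, the absolute and relative gains can be computed directly from the before/after values listed above. The snippet below is only a convenience sketch; the dictionary keys and the `gains` helper are names made up for this example, and the values are copied from the findings as stated in this review.

```python
# (before, after) pairs as reported in the key findings above.
results = {
    "Panoptic segmentation (PQ)": (43.1, 46.1),
    "Object detection (mAP)": (39.2, 54.3),
    "Object detection (AR@100)": (54.4, 67.2),
    "Captioning, ViT-B (CIDEr)": (120.0, 134.5),
    "Captioning, ViT-L (CIDEr)": (121.7, 138.7),
}

def gains(before, after):
    # Absolute gain in metric points and relative gain in percent.
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative

if __name__ == "__main__":
    for name, (before, after) in results.items():
        absolute, relative = gains(before, after)
        print(f"{name}: +{absolute:.1f} points, +{relative:.1f}% relative")
```

The metrics use different scales, so the relative figures are more comparable across tasks than the absolute ones.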


This review was created by AI and reviewed by human editors.