QUICK REVIEW

[Paper Review] Sampling Through the Lens of Sequential Decision Making

Jason Dou, Alvin Qingkai Pan | arXiv (Cornell University) | Jan 1, 2022

Data Stream Mining Techniques | Cited by 7

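One-Sentence Summary

This paper proposes Adaptive Sample with Reward (ASR), a reinforcement-learning framework that models sampling in representation learning as a sequential decision problem, inspired by System 1 (heuristic) and System 2 (deliberate) thinking in cognitive science. ASR uses policy gradient methods to optimize sample selection dynamically so as to maximize cumulative reward. It achieves state-of-the-art performance on information retrieval and clustering tasks across three benchmark datasets, and a pronounced "ASR gravity well" phenomenon is observed depending on how the policy is initialized.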

ABSTRACT

Sampling is ubiquitous in machine learning methodologies. Due to the growth of large datasets and model complexity, we want to learn and adapt the sampling process while training a representation. Towards achieving this grand goal, a variety of sampling techniques have been proposed. However, most of them either use a fixed sampling scheme or adjust the sampling scheme based on simple heuristics. They cannot choose the best sample for model training in different stages. Inspired by "Think, Fast and Slow" (System 1 and System 2) in cognitive science, we propose a reward-guided sampling strategy called Adaptive Sample with Reward (ASR) to tackle this challenge. To the best of our knowledge, this is the first work utilizing reinforcement learning (RL) to address the sampling problem in representation learning. Our approach optimally adjusts the sampling process to achieve optimal performance. We explore geographical relationships among samples by distance-based sampling to maximize overall cumulative reward. We apply ASR to the long-standing sampling problems in similarity-based loss functions. Empirical results in information retrieval and clustering demonstrate ASR's superb performance across different datasets. We also discuss an engrossing phenomenon which we name as "ASR gravity well" in experiments.

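Research Motivation and Objectives

Address the inability of fixed or heuristic sampling strategies in representation learning to adapt across training stages.
Use reinforcement learning to model the sampling process as a sequential decision problem, mirroring System 2 thinking in cognitive science.
Develop a reward-based framework that selects samples dynamically to maximize long-term performance in representation learning.
Empirically validate that the proposed ASR framework outperforms existing sampling baselines across diverse downstream tasks.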

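Proposed Method

Formalizes the sampling process in representation learning as a Markov decision process (MDP) in which an agent selects samples based on a state representation.
Defines the reward function from the evaluation metrics Recall@K, NMI, and F1 to guide policy learning.
Optimizes the sampling policy with the policy gradient methods PPO and REINFORCE, with the policy parameterized by a neural network.
Uses distance-based sampling to model the "geographical" relationships among samples, improving the diversity and informativeness of the selected batches.
Applies the ASR framework to the triplet loss and margin loss functions used in contrastive representation learning.
Proposes a novel initialization strategy for the policy network to mitigate the "ASR gravity well" phenomenon observed during training.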
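
The summary above does not spell out the paper's state, action, or reward definitions, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a stateless simplification of the MDP in which each action picks one of a few distance bins to draw negatives from, and it replaces the metric-based reward (for example, a Recall@K gain) with a hypothetical `toy_reward` function. The names `sample_bin`, `reinforce_step`, `toy_reward`, and `train_policy`, and all numeric settings, are illustrative assumptions. The update itself is the standard REINFORCE rule for a softmax policy with a running-mean baseline.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_bin(probs, rng):
    """Draw one distance-bin index from the categorical policy."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def reinforce_step(logits, action, reward, baseline, lr):
    """One REINFORCE update for a softmax policy.

    The gradient of log pi(action) with respect to logit j is
    1[j == action] - pi_j, scaled here by the advantage (reward - baseline).
    """
    probs = softmax(logits)
    advantage = reward - baseline
    return [
        theta + lr * advantage * ((1.0 if j == action else 0.0) - p)
        for j, (theta, p) in enumerate(zip(logits, probs))
    ]


def toy_reward(action, rng):
    """Hypothetical stand-in for a metric-based reward such as a Recall@K gain.

    Assumption for illustration only: bin 2 yields the most useful negatives.
    """
    mean_gain = [0.0, 0.2, 1.0, 0.3][action]
    return mean_gain + rng.gauss(0.0, 0.1)


def train_policy(num_bins=4, steps=2000, lr=0.1, seed=0):
    """Train the bin-selection policy and return its final probabilities."""
    rng = random.Random(seed)
    logits = [0.0] * num_bins  # uniform initial policy
    baseline = 0.0             # running mean of rewards, reduces variance
    for _ in range(steps):
        probs = softmax(logits)
        action = sample_bin(probs, rng)
        reward = toy_reward(action, rng)
        logits = reinforce_step(logits, action, reward, baseline, lr)
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)


final_policy = train_policy()
print([round(p, 3) for p in final_policy])
```

In this toy setting the policy should concentrate its probability mass on the bin with the highest expected reward. In a real training loop the reward would come from evaluating the embedding model after a training step, and the policy would typically condition on a state rather than being stateless.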

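Experimental Results

Research Questions

RQ1: Can reinforcement learning be applied effectively to adaptive sampling in representation learning and outperform heuristic strategies?
RQ2: How does the choice of policy initialization affect the convergence and performance of the ASR framework?
RQ3: How does training duration affect the performance of the ASR framework, and when does overfitting set in?
RQ4: Does the ASR framework generalize across datasets and representation learning tasks such as information retrieval and clustering?
RQ5: What causes the "ASR gravity well" phenomenon, and how can initialization or optimization techniques mitigate it?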

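Key Findings

On CUB200-2011, ASR with PPO under the triplet loss setting outperforms all baselines, reaching 60.63% Recall@1 and 0.6629 NMI.
On CARS196, ASR reaches 71.50% Recall@1 and 0.5993 NMI, outperforming semi-hard negative sampling and distance-based sampling.
On SOP, ASR reaches 94.47% Recall@10 and 0.8914 NMI, showing strong generalization across diverse data distributions.
With the "normal high" initialization, the "ASR gravity well" phenomenon appears: performance drops sharply around epoch 15, which is attributed to convergence to a suboptimal policy.
The best training duration for ASR is 30 to 50 epochs; beyond this range, performance degrades due to overfitting.
Initializing with the "normal low" or "uniform low" distribution minimizes the gravity well effect, because it reduces variance and avoids extreme policy updates.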
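
The summary does not define what "low" and "high" mean in these initialization names. As a loosely related illustration only, the sketch below assumes they refer to the scale of the initial policy parameters, and shows a general property of softmax policies: logits drawn with a larger spread give a more peaked, lower-entropy starting policy, which leaves less room for exploration. The scales 0.1 and 3.0, the action count, and the helper `mean_initial_entropy` are hypothetical choices, not values from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a more exploratory policy."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_initial_entropy(draw, num_actions=10, trials=500, seed=0):
    """Average entropy of softmax policies whose logits come from draw(rng)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        logits = [draw(rng) for _ in range(num_actions)]
        total += entropy(softmax(logits))
    return total / trials


# Hypothetical scales: the source does not say what "low" and "high" are.
normal_low = mean_initial_entropy(lambda rng: rng.gauss(0.0, 0.1))
normal_high = mean_initial_entropy(lambda rng: rng.gauss(0.0, 3.0))
uniform_low = mean_initial_entropy(lambda rng: rng.uniform(-0.1, 0.1))
max_entropy = math.log(10)  # entropy of a perfectly uniform 10-action policy

print(f"normal low:  {normal_low:.3f}")
print(f"uniform low: {uniform_low:.3f}")
print(f"normal high: {normal_high:.3f}")
print(f"maximum:     {max_entropy:.3f}")
```

Under this assumption, the small-scale initializations start close to the maximum-entropy (uniform) policy, while the large-scale one starts already committed to a few actions. Whether this is the mechanism behind the gravity well is not established by the summary.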


This review was generated by AI and reviewed by human editors.