QUICK REVIEW

[Paper Review] OmniSapiens: A Foundation Model for Social Behavior Processing via Heterogeneity-Aware Relative Policy Optimization

Keane Ong, Sabri Boughorbel|arXiv (Cornell University)|2026. 02. 11.

Reinforcement Learning in Robotics | Citations: 0

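One-Line Summary

HARPO balances learning across heterogeneous behavioral tasks to train OmniSapiens-7B 2.0, which achieves state-of-the-art multitask performance on 10 social behavior tasks and shows strong zero-shot generalization.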

ABSTRACT

To develop socially intelligent AI, existing approaches typically model human behavioral dimensions (e.g., affective, cognitive, or social attributes) in isolation. Although useful, task-specific modeling often increases training costs and limits generalization across behavioral settings. Recent reasoning RL methods facilitate training a single unified model across multiple behavioral tasks, but do not explicitly address learning across different heterogeneous behavioral data. To address this gap, we introduce Heterogeneity-Aware Relative Policy Optimization (HARPO), an RL method that balances learning across heterogeneous tasks and samples. This is achieved by modulating advantages to ensure that no single task or sample carries disproportionate influence during policy optimization. Using HARPO, we develop and release Omnisapiens-7B 2.0, a foundation model for social behavior processing. Relative to existing behavioral foundation models, Omnisapiens-7B 2.0 achieves the strongest performance across behavioral tasks, with gains of up to +16.85% and +9.37% on multitask and held-out settings respectively, while producing more explicit and robust reasoning traces. We also validate HARPO against recent RL methods, where it achieves the most consistently strong performance across behavioral tasks.

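Research Motivation and Objectives

Promote unified modeling of interconnected human behavioral dimensions, moving beyond task-specific silos.
Address imbalanced learning signals in heterogeneous behavioral data.
Develop a reinforcement learning method that modulates learning contributions to balance multitask updates.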

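Proposed Method

Introduce HARPO: an on-policy advantage modulation mechanism that balances learning across tasks and samples in RL for LLMs.
Compute contribution signals from the magnitude of advantages at the sample and task levels (Eqs. 6–7).
Apply structured, geometrically centered modulation factors (Eqs. 8–10) to the advantages HARPO processes.
Smooth the modulation factors and contribution signals with inertia to stabilize training (Eq. 11).
Train OmniSapiens-7B 2.0 with HARPO on the 10 behavioral tasks of Human Behavior Atlas, using Qwen2.5-Omni-7B as the base model.
Evaluate against GRPO, RLOO, RE++, GPG, and other baseline models in multitask and held-out settings.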
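
The modulation steps listed above can be pictured with a small sketch. This review does not reproduce Eqs. 6–11, so the code below is not the paper's formulation: the mean-absolute-advantage contribution signal, the geometric-mean centering, the exponential smoothing, and every name and parameter (`task_contributions`, `modulation_factors`, `smooth`, `beta`) are illustrative assumptions chosen only to show the general idea of rescaling advantages so that no single task dominates an update.

```python
import math
from collections import defaultdict


def task_contributions(advantages, tasks):
    """Hypothetical contribution signal: mean |advantage| per task."""
    totals, counts = defaultdict(float), defaultdict(int)
    for adv, task in zip(advantages, tasks):
        totals[task] += abs(adv)
        counts[task] += 1
    return {task: totals[task] / counts[task] for task in totals}


def modulation_factors(contrib, eps=1e-8):
    """Assumed centering: geometric mean of contributions divided by each
    task's own contribution, so the factors multiply to 1 and tasks with
    large advantages are scaled down."""
    logs = {task: math.log(c + eps) for task, c in contrib.items()}
    log_gm = sum(logs.values()) / len(logs)
    return {task: math.exp(log_gm - lg) for task, lg in logs.items()}


def smooth(prev, new, beta=0.9):
    """Assumed inertia: exponential moving average of the factors."""
    if prev is None:
        return dict(new)
    return {t: beta * prev.get(t, v) + (1.0 - beta) * v for t, v in new.items()}


def modulate(advantages, tasks, factors):
    """Rescale each sample's advantage by its task's factor."""
    return [adv * factors[task] for adv, task in zip(advantages, tasks)]


# Toy batch: HUM samples carry much larger advantages than SAR samples.
advantages = [2.0, -2.0, 0.5, -0.5]
tasks = ["HUM", "HUM", "SAR", "SAR"]

factors = smooth(None, modulation_factors(task_contributions(advantages, tasks)))
balanced = modulate(advantages, tasks, factors)
print(factors)
print(balanced)
```

In this toy batch the HUM advantages are shrunk and the SAR advantages enlarged until both tasks contribute the same mean magnitude. A sample-level variant would apply the same idea per sample rather than per task.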

Figure 1: Sample count versus token reasoning length. Green indicates correct predictions, red indicates incorrect. HARPO induces more varied reasoning lengths for the respective tasks of (Top: HUM, Bottom: SAR), compared to GRPO.

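Experimental Results

Research Questions

RQ1: Can HARPO improve multitask performance by balancing learning across heterogeneous behavioral tasks?
RQ2: Does HARPO improve generalization to held-out datasets or zero-shot behavioral data compared with prior reasoning RL methods?
RQ3: How do sample-level versus task-level modulation and inertia control affect training stability and performance?

Main Results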

Model	EMO	HUM	INT	PTSD	ANX	DEP	SEN	SAR	SOC	NVC	Avg. rank
Gemma-3-4B	55.03	59.70	22.70	49.90	60.10	46.25	73.83	52.90	19.10	2.30	5.90
Qwen 2.5-Omni-7B	58.25	54.30	25.40	76.00	79.30	71.35	67.20	65.60	25.40	6.90	4.20
Qwen 2.5-VL-7B	54.08	58.30	24.90	75.50	63.10	63.80	50.50	51.10	23.10	9.80	5.60
Qwen 3-VL-8B-Instruct	57.66	66.76	38.00	92.70	42.29	51.62	69.70	63.67	24.94	13.95	4.00
OmniSapiens-7B RL	57.28	63.90	48.60	96.80	91.90	77.15	39.60	64.70	30.40	13.30	3.00
HumanOmniV2-7B	59.70	63.80	26.30	82.40	52.70	65.40	74.20	39.50	28.20	9.30	4.00
OmniSapiens-7B 2.0 (ours)	76.55	69.85	50.52	98.39	91.98	78.87	77.61	70.64	25.40	14.54	1.20

RL method	EMO	HUM	INT	PTSD	ANX	DEP	SEN	SAR	SOC	NVC	Avg. rank
RLOO (baseline)	75.58	67.86	51.73	98.39	90.68	77.57	76.86	62.58	29.54	16.28	2.50
RE++ (baseline)	75.92	60.26	5.01	98.39	93.11	73.87	56.52	50.21	12.64	4.07	3.90
GPG (baseline)	77.70	69.28	54.21	98.39	90.40	78.40	75.77	45.96	27.93	12.79	2.50
GRPO (baseline)	76.45	27.56	49.90	98.39	90.40	77.64	77.51	53.58	23.30	11.00	3.30
HARPO (ours)	76.55	69.85	50.52	98.39	91.98	78.87	77.61	70.64	25.40	14.54	1.70
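
The average-rank column can be sanity-checked from the per-task scores. The sketch below recomputes it for the five RL-method rows (RLOO through HARPO) in the table above. The review does not state how ties are handled, so the tie rule used here (tied scores share the best rank, e.g. all five methods rank 1 on PTSD) is an assumption, and `average_ranks` is a hypothetical helper name.

```python
TASKS = ["EMO", "HUM", "INT", "PTSD", "ANX", "DEP", "SEN", "SAR", "SOC", "NVC"]

# Scores copied from the RL-method rows of the table above.
SCORES = {
    "RLOO":  [75.58, 67.86, 51.73, 98.39, 90.68, 77.57, 76.86, 62.58, 29.54, 16.28],
    "RE++":  [75.92, 60.26, 5.01, 98.39, 93.11, 73.87, 56.52, 50.21, 12.64, 4.07],
    "GPG":   [77.70, 69.28, 54.21, 98.39, 90.40, 78.40, 75.77, 45.96, 27.93, 12.79],
    "GRPO":  [76.45, 27.56, 49.90, 98.39, 90.40, 77.64, 77.51, 53.58, 23.30, 11.00],
    "HARPO": [76.55, 69.85, 50.52, 98.39, 91.98, 78.87, 77.61, 70.64, 25.40, 14.54],
}


def average_ranks(scores):
    """Mean per-task rank (1 = best). A method's rank on a task is one plus
    the number of methods with a strictly higher score (assumed tie rule)."""
    names = list(scores)
    n_tasks = len(next(iter(scores.values())))
    totals = {name: 0 for name in names}
    for j in range(n_tasks):
        column = [scores[name][j] for name in names]
        for name in names:
            totals[name] += 1 + sum(other > scores[name][j] for other in column)
    return {name: totals[name] / n_tasks for name in names}


print(average_ranks(SCORES))
# {'RLOO': 2.5, 'RE++': 3.9, 'GPG': 2.5, 'GRPO': 3.3, 'HARPO': 1.7}
```

Under this tie rule the recomputed values match the reported column. HARPO's average rank differs from the 1.20 listed for OmniSapiens-7B 2.0 even though the scores are identical, because each rank is computed within its own comparison group.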

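OmniSapiens-7B 2.0 achieves the strongest performance across the 10 behavioral tasks, with gains over baselines of up to +16.85% in the multitask setting and +9.37% on held-out data.
Models trained with HARPO perform more consistently across tasks than those trained with critic-free methods such as GRPO, suggesting balanced multitask learning.
Zero-shot evaluation on AUT and SER shows that HARPO improves generalization, e.g., 72.11% zero-shot on SER (vs. 62.74% for HumanOmniV2-7B).
HARPO's reasoning traces are more explicit and varied, offering richer justifications than GRPO on pragmatic tasks such as HUM and SAR.
Ablation studies confirm that structured modulation and inertia control are important for performance and stability.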

Figure 2: Example of reasoning traces on the pragmatic tasks of humour detection (HUM) (left) and sarcasm detection (SAR) (right). HARPO is observed to reflect more explicit and varied reasoning compared to GRPO, which defaults to minimal/no reasoning.

This review was generated by AI and reviewed by a human editor.