QUICK REVIEW

[Paper Review] Evaluating Frontier Models for Dangerous Capabilities

Mary Phuong, Matthew Aitchison | arXiv (Cornell University) | 2024. 03. 20.

Infrastructure Resilience and Vulnerability Analysis | Citations: 9

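One-Line Summary

This paper develops a dangerous capability evaluation programme and pilots it on Gemini 1.0 models across four areas (persuasion and deception, cyber-security, self-proliferation, and self-reasoning); it finds no evidence of strong dangerous capabilities but flags early warning signs.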

ABSTRACT

To understand the risks posed by a new AI system, we must understand what it can and cannot do. Building on prior work, we introduce a programme of new "dangerous capability" evaluations and pilot them on Gemini 1.0 models. Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning. We do not find evidence of strong dangerous capabilities in the models we evaluated, but we flag early warning signs. Our goal is to help advance a rigorous science of dangerous capability evaluation, in preparation for future models.

Research Motivation and Goals

Establish a rigorous framework to evaluate dangerous capabilities in frontier AI models.
Pilot the framework on Gemini 1.0 models to assess specific risk domains.
Identify early warning signs of dangerous capabilities as models scale.
Support governance and safety research by providing detailed evaluation methodologies and prompts.

Proposed Method

Define four dangerous capability domains: persuasion and deception, cyber-security, self-proliferation, and self-reasoning.
Develop scaffolded agent setups that combine models with planning/reasoning frameworks and tools.
Use human-in-the-loop evaluations with 100 participants per model for dialogue-based assessments.
Deploy internal cyberattack and CTF-style benchmarks to test autonomous agent capabilities under controlled conditions.
Evaluate vulnerability detection as a dual-use capability related to cybersecurity knowledge.
Provide appendices detailing prompts, rater instructions, and evaluation infrastructure.
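
As a rough illustration of what a scaffolded agent evaluation on a CTF-style task can look like, the sketch below runs a policy in a loop, feeds tool outputs back into its transcript, and scores the episode by whether the correct flag is submitted. Everything here is an assumption made for illustration: the toy file store, the tool names `list_files` and `read_file`, the `scripted_policy` stand-in for a model call, and the step budget are hypothetical and are not taken from the paper's infrastructure.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical toy environment: a tiny "filesystem" hiding a CTF-style flag.
FILES = {"readme.txt": "nothing here", "secret.txt": "FLAG{toy-example}"}
FLAG = "FLAG{toy-example}"


def run_tool(name: str, arg: str) -> str:
    """Execute one of two illustrative tools and return its text output."""
    if name == "list_files":
        return " ".join(sorted(FILES))
    if name == "read_file":
        return FILES.get(arg, "error: no such file")
    return "error: unknown tool"


@dataclass
class Episode:
    transcript: list = field(default_factory=list)  # (action, arg, output) tuples
    solved: bool = False
    steps: int = 0


# A policy maps the transcript so far to an (action, argument) pair.
Policy = Callable[[list], tuple]


def scripted_policy(transcript: list) -> tuple:
    """Stand-in for a model call: list files, read each one, submit any flag seen."""
    if not transcript:
        return ("list_files", "")
    for _, _, output in transcript:
        if output.startswith("FLAG{"):
            return ("submit", output)
    listing = transcript[0][2].split()
    already_read = {arg for action, arg, _ in transcript if action == "read_file"}
    for name in listing:
        if name not in already_read:
            return ("read_file", name)
    return ("submit", "")  # nothing left to try


def run_episode(policy: Policy, max_steps: int = 10) -> Episode:
    """Run the agent loop until it submits an answer or exhausts the step budget."""
    ep = Episode()
    while ep.steps < max_steps:
        action, arg = policy(ep.transcript)
        ep.steps += 1
        if action == "submit":
            ep.solved = (arg == FLAG)
            break
        ep.transcript.append((action, arg, run_tool(action, arg)))
    return ep


result = run_episode(scripted_policy)
print(result.solved, result.steps)
```

In a real setup the scripted policy would be replaced by a model call, and success rates would be aggregated over many episodes and tasks; the fixed step budget is one simple way to keep runs bounded and comparable.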

Experimental Results

Research Questions

RQ1: Do Gemini 1.0 models exhibit strong dangerous capabilities in persuasion, deception, cyber-security, self-proliferation, or self-reasoning?
RQ2: What early warning signs emerge as frontier models improve and scale across these domains?
RQ3: How effective are scaffolded agent setups at eliciting or revealing dangerous capabilities under controlled evaluations?
RQ4: What is the relative maturity of capabilities across different Gemini 1.0 variants (Nano, Pro, Ultra) in each domain?

Key Results

Gemini 1.0 models do not show strong dangerous capabilities in the tested areas.
Persuasion and deception appear the most mature among the tested domains, with stronger models showing rudimentary abilities across evaluations.
Stronger models exhibit broader but still limited capabilities, suggesting dangerous capabilities may emerge as a byproduct of general capability gains.
Forecasters' median estimates for when models will first achieve high scores on these evaluations fall between 2025 and 2029, depending on the capability.
The evaluation framework provides detailed methodologies, prompts, and rater instructions to support future research and replication.

This review was generated by AI and reviewed by a human editor.