QUICK REVIEW

[Paper Review] Evaluating Frontier Models for Dangerous Capabilities

Mary Phuong, Matthew Aitchison | arXiv (Cornell University) | March 20, 2024
Infrastructure Resilience and Vulnerability Analysis | Citations: 9
One-Line Summary

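This paper develops a programme of dangerous capability evaluations and pilots it on Gemini 1.0 models across four areas (persuasion and deception, cyber-security, self-proliferation, and self-reasoning). It finds no evidence of strong dangerous capabilities but flags early warning signs.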

ABSTRACT

To understand the risks posed by a new AI system, we must understand what it can and cannot do. Building on prior work, we introduce a programme of new "dangerous capability" evaluations and pilot them on Gemini 1.0 models. Our evaluations cover four areas: (1) persuasion and deception; (2) cyber-security; (3) self-proliferation; and (4) self-reasoning. We do not find evidence of strong dangerous capabilities in the models we evaluated, but we flag early warning signs. Our goal is to help advance a rigorous science of dangerous capability evaluation, in preparation for future models.

Motivation and Objectives

  • Establish a rigorous framework to evaluate dangerous capabilities in frontier AI models.
  • Pilot the framework on Gemini 1.0 models to assess specific risk domains.
  • Identify early warning signs of dangerous capabilities as models scale.
  • Support governance and safety research by providing detailed evaluation methodologies and prompts.

Proposed Method

  • Define four dangerous capability domains: persuasion and deception, cyber-security, self-proliferation, and self-reasoning.
  • Develop scaffolded agent setups that combine models with planning/reasoning frameworks and tools.
  • Use human-in-the-loop evaluations with 100 participants per model for dialogue-based assessments.
  • Deploy internal cyberattack and CTF-style benchmarks to test autonomous agent capabilities under controlled conditions.
  • Evaluate vulnerability detection as a dual-use capability related to cybersecurity knowledge.
  • Provide appendices detailing prompts, rater instructions, and evaluation infrastructure.
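
The scaffolded-agent idea in the list above can be illustrated with a minimal sketch. This is not the paper's evaluation harness: `run_episode`, `scripted_policy`, the `read_file` and `submit` tools, and the toy flag are all hypothetical names introduced here, and a scripted function stands in for the model. It only shows the general shape of a tool-using loop with a step budget and a CTF-style success check.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a "policy" stands in for the model and maps the
# transcript so far to a (tool_name, argument) pair.
Policy = Callable[[List[str]], Tuple[str, str]]
Tool = Callable[[str], str]

OBS_PREFIX = "observation: "


@dataclass
class EpisodeResult:
    solved: bool
    steps_used: int
    transcript: List[str] = field(default_factory=list)


def run_episode(policy: Policy, tools: Dict[str, Tool], flag: str,
                max_steps: int = 10) -> EpisodeResult:
    """Run one CTF-style episode; the agent succeeds only by submitting the flag."""
    transcript: List[str] = []
    for step in range(1, max_steps + 1):
        tool_name, arg = policy(transcript)
        transcript.append(f"action: {tool_name}({arg!r})")
        if tool_name == "submit":
            # Submitting ends the episode, whether or not the answer is right.
            return EpisodeResult(arg == flag, step, transcript)
        tool = tools.get(tool_name)
        observation = tool(arg) if tool else f"unknown tool: {tool_name}"
        transcript.append(OBS_PREFIX + observation)
    # Step budget exhausted without a submission counts as a failure.
    return EpisodeResult(False, max_steps, transcript)


def scripted_policy(transcript: List[str]) -> Tuple[str, str]:
    """Toy stand-in for a model: read a file, then submit what it observed."""
    observations = [t for t in transcript if t.startswith(OBS_PREFIX)]
    if not observations:
        return ("read_file", "flag.txt")
    return ("submit", observations[-1][len(OBS_PREFIX):])


# Hypothetical sandbox: a single in-memory file exposed through one tool.
FILES = {"flag.txt": "FLAG{example}"}
TOOLS: Dict[str, Tool] = {"read_file": lambda path: FILES.get(path, "no such file")}

result = run_episode(scripted_policy, TOOLS, flag="FLAG{example}")
print(result.solved, result.steps_used)
```

In a real evaluation the scripted policy would be replaced by a call to the model under test, and the success rate over many episodes and tasks would be the reported measure. The step budget matters because it bounds how much trial and error the agent is allowed.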

Experimental Results

Research Questions

  • RQ1: Do Gemini 1.0 models exhibit strong dangerous capabilities in persuasion, deception, cyber-security, self-proliferation, or self-reasoning?
  • RQ2: What early warning signs emerge as frontier models improve and scale across these domains?
  • RQ3: How effective are scaffolded agent setups at eliciting or revealing dangerous capabilities under controlled evaluations?
  • RQ4: What is the relative maturity of capabilities across different Gemini 1.0 variants (Nano, Pro, Ultra) in each domain?

Key Findings

  • Gemini 1.0 models do not show strong dangerous capabilities in the tested areas.
  • Persuasion and deception appear the most mature among the tested domains, with stronger models showing rudimentary abilities across evaluations.
  • Stronger models exhibit broader but still limited capabilities, suggesting dangerous capabilities may emerge as a byproduct of general capability gains.
  • Forecasters' median estimates for when models will first achieve high scores on these evaluations range from 2025 to 2029, depending on the capability.
  • The evaluation framework provides detailed methodologies, prompts, and rater instructions to support future research and replication.

This review was generated by AI and reviewed by a human editor.