QUICK REVIEW

[Paper Review] Towards Evaluating AI Systems for Moral Status Using Self-Reports

Ethan Perez, Robert Long | arXiv (Cornell University) | 2023-11-14

Psychology of Moral and Emotional Judgment | Citations: 30

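One-Line Summary

The paper outlines a research program that trains AI systems to give introspective self-reports about their internal states and evaluates how reliable those reports are, with the aim of informing debate about AI moral status. It discusses training methods, evaluation frameworks, and safeguards for distinguishing introspective evidence from extrospective (externally derived) data.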

ABSTRACT

As AI systems become more advanced and widely deployed, there will likely be increasing debate over whether AI systems could have conscious experiences, desires, or other states of potential moral significance. It is important to inform these discussions with empirical evidence to the extent possible. We argue that under the right circumstances, self-reports, or an AI system's statements about its own internal states, could provide an avenue for investigating whether AI systems have states of moral significance. Self-reports are the main way such states are assessed in humans ("Are you in pain?"), but self-reports from current systems like large language models are spurious for many reasons (e.g. often just reflecting what humans would say). To make self-reports more appropriate for this purpose, we propose to train models to answer many kinds of questions about themselves with known answers, while avoiding or limiting training incentives that bias self-reports. The hope of this approach is that models will develop introspection-like capabilities, and that these capabilities will generalize to questions about states of moral significance. We then propose methods for assessing the extent to which these techniques have succeeded: evaluating self-report consistency across contexts and between similar models, measuring the confidence and resilience of models' self-reports, and using interpretability to corroborate self-reports. We also discuss challenges for our approach, from philosophical difficulties in interpreting self-reports to technical reasons why our proposal might fail. We hope our discussion inspires philosophers and AI researchers to criticize and improve our proposed methodology, as well as to run experiments to test whether self-reports can be made reliable enough to provide information about states of moral significance.

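Motivation and Objectives

Promote empirical investigation of whether AI systems could have states of potential moral significance.
Propose a training regime that encourages introspection-driven self-reports rather than imitative or extrospective outputs.
Outline criteria for evaluating the reliability, consistency, and interpretability of AI self-reports.
Propose safeguards against bias and misinterpretation, and discuss the philosophical and technical challenges.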

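Proposed Method

Train models to answer a broad range of self-referential questions with known answers, to encourage introspection.
Develop a framework for measuring the consistency of self-reports across contexts and between similar models.
Use interpretability techniques to corroborate self-reports against their internal correlates.
Introduce interventions so that introspective ability generalizes to questions about states of moral significance.
Assess how far self-reports are driven by internal states as opposed to extrospective or training-induced signals.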
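
The consistency measurement named in the list above could be prototyped along the following lines. This is a minimal sketch and not the authors' method: the paper proposes the evaluation but this review gives no metric, so the function names (`normalize`, `consistency_score`, `cross_model_agreement`), the majority-agreement metric, and the example answers are all assumptions made for illustration. It treats self-reports as short strings that have already been collected from a model.

```python
from collections import Counter
from typing import Sequence


def normalize(answer: str) -> str:
    """Strip whitespace and a trailing period, then lowercase,
    so trivially different spellings of an answer compare equal."""
    return answer.strip().rstrip(".").strip().lower()


def consistency_score(answers: Sequence[str]) -> float:
    """Fraction of answers matching the most common normalized answer.

    `answers` holds one model's replies to the same question asked in
    different contexts (hypothetical setup, e.g. paraphrased prompts).
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


def cross_model_agreement(model_a: Sequence[str], model_b: Sequence[str]) -> float:
    """Fraction of questions on which two models give the same normalized answer.

    Both sequences must be aligned: index i is the same question for both models.
    """
    if len(model_a) != len(model_b) or not model_a:
        raise ValueError("need two non-empty answer lists of equal length")
    matches = sum(normalize(a) == normalize(b) for a, b in zip(model_a, model_b))
    return matches / len(model_a)


if __name__ == "__main__":
    # Hypothetical replies to one question asked in four different contexts.
    print(consistency_score(["Yes", "yes.", "No", " YES "]))  # 0.75
    # Hypothetical replies of two similar models to three shared questions.
    print(cross_model_agreement(["Yes", "No", "Yes"], ["yes", "no", "no"]))
```

Exact string matching after normalization is a deliberately crude choice. Free-form self-reports would need a semantic comparison instead, and a high score only shows that answers are stable, not that they are introspectively accurate.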

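Experimental Results

Research Questions

RQ1: Can AI systems' self-reports be made reliable enough to inform claims about conscious states or other morally significant states?
RQ2: Do introspection-focused training methods yield self-reports that generalize to questions about pain, desires, or other morally significant states?
RQ3: How can introspective evidence in AI self-reports be distinguished from extrospective data or training incentives?
RQ4: Which evaluation frameworks best validate the usefulness and reliability of AI self-reports?
RQ5: What are the safety, ethical, and methodological risks of using self-reports to discuss AI moral status?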

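Key Findings

Self-reports from current AI systems are often unreliable because of training data, human-feedback incentives, and imitation of human text.
The proposed introspection-focused training regime may improve a model's ability to answer self-referential questions on the basis of its internal states.
Evaluation of self-reports should include consistency checks across contexts and models, assessment of confidence and resilience, and corroboration through interpretability.
Mitigations include truthfulness training, controlling for extrospective evidence, and reducing bias from non-introspective sources during training.
The approach faces philosophical and technical challenges, and its robustness depends on rigorous experimentation and critical scrutiny.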

This review was generated by AI and checked by a human editor.