QUICK REVIEW

[Paper Review] To Err Is AI! Debugging as an Intervention to Facilitate Appropriate Reliance on AI Systems

Gaole He, Abri Bharos|arXiv (Cornell University)|Sep 22, 2024

Ethics and Social Impacts of AI | Citations: 1

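One-line summary

This study examines whether debugging an AI system, used as a training intervention, helps users calibrate their trust in the AI appropriately, particularly in out-of-distribution settings. The hypothesis was that debugging would promote critical evaluation and appropriate reliance. Instead, reliance on the AI decreased, potentially because of early exposure to the AI's weaknesses. The result suggests that care is needed in how a system's limitations are revealed to users.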

ABSTRACT

Powerful predictive AI systems have demonstrated great potential in augmenting human decision making. Recent empirical work has argued that the vision for optimal human-AI collaboration requires 'appropriate reliance' of humans on AI systems. However, accurately estimating the trustworthiness of AI advice at the instance level is quite challenging, especially in the absence of performance feedback pertaining to the AI system. In practice, the performance disparity of machine learning models on out-of-distribution data makes the dataset-specific performance feedback unreliable in human-AI collaboration. Inspired by existing literature on critical thinking and a critical mindset, we propose the use of debugging an AI system as an intervention to foster appropriate reliance. In this paper, we explore whether a critical evaluation of AI performance within a debugging setting can better calibrate users' assessment of an AI system and lead to more appropriate reliance. Through a quantitative empirical study (N = 234), we found that our proposed debugging intervention does not work as expected in facilitating appropriate reliance. Instead, we observe a decrease in reliance on the AI system after the intervention -- potentially resulting from an early exposure to the AI system's weakness. We explore the dynamics of user confidence and user estimation of AI trustworthiness across groups with different performance levels to help explain how inappropriate reliance patterns occur. Our findings have important implications for designing effective interventions to facilitate appropriate reliance and better human-AI collaboration.

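Motivation and objectives

To test whether debugging can encourage critical evaluation of AI performance and thereby foster appropriate reliance.
To assess how debugging affects users' instance-level and global-level estimates of AI performance.
To examine how debugging affects users' reliance patterns, particularly under uncertainty and in out-of-distribution settings.
To clarify how cognitive biases, in particular the Dunning-Kruger effect, affect users' trust in and reliance on AI.
To identify design implications for supporting balanced, context-appropriate human-AI collaboration.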

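Proposed method

A quantitative empirical study of human-AI decision making was run with 234 participants recruited through the crowdsourcing platform Prolific.
A debugging intervention was implemented in a text classification task, requiring users to inspect and correct the AI's predictions.
A controlled design compared two conditions, with and without the debugging intervention, on reliance and performance estimation.
Users' self-assessments, patterns of agreement and disagreement with the AI, and performance estimates were collected before and after the intervention.
Self-assessment dynamics and estimation accuracy were analyzed across user groups with different performance levels to detect patterns of bias.
A cognitive bias checklist was applied to assess possible distortions arising from the task design or participant motivation.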
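
The agreement and disagreement patterns listed above are commonly summarized as reliance measures. The following is a minimal sketch, not the authors' analysis code. It assumes a two-stage setup in which each participant records an initial decision, sees the AI advice, and then records a final decision. That setup, the `Trial` type, the function names `agreement_fraction` and `switch_fraction`, and the labels "A" and "B" are all assumptions made for illustration and are not taken from this review.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trial:
    initial: str  # participant's decision before seeing the AI advice
    advice: str   # AI prediction shown to the participant
    final: str    # participant's decision after seeing the AI advice


def agreement_fraction(trials: List[Trial]) -> float:
    """Share of trials in which the final decision matches the AI advice."""
    if not trials:
        raise ValueError("at least one trial is required")
    return sum(t.final == t.advice for t in trials) / len(trials)


def switch_fraction(trials: List[Trial]) -> Optional[float]:
    """Among trials where the initial decision disagreed with the AI,
    the share in which the participant switched to the AI advice.
    Returns None when there was no initial disagreement."""
    disagreements = [t for t in trials if t.initial != t.advice]
    if not disagreements:
        return None
    return sum(t.final == t.advice for t in disagreements) / len(disagreements)


# Hypothetical data, invented purely for illustration.
trials = [
    Trial(initial="A", advice="A", final="A"),  # agreed from the start
    Trial(initial="A", advice="B", final="B"),  # switched to the AI advice
    Trial(initial="B", advice="A", final="B"),  # kept own decision
    Trial(initial="B", advice="B", final="B"),  # agreed from the start
]

print(agreement_fraction(trials))  # 0.75
print(switch_fraction(trials))     # 0.5
```

Under these assumptions, a drop in both values after the intervention would correspond to the reduced reliance that the review reports. Neither measure alone shows whether reliance is appropriate, since that also depends on whether the AI advice was correct in each trial.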

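Experimental results

Research questions

RQ1: How does the debugging intervention affect users' ability to estimate AI performance at the instance level and the global level?
RQ2: How does the debugging intervention affect users' tendency to rely on AI advice?
RQ3: How do self-assessment dynamics and performance estimates vary across user groups with different ability levels?
RQ4: To what extent do cognitive biases, such as overconfidence in one's own performance and underestimation of AI performance, contribute to inappropriate reliance (especially under-reliance)?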

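Key findings

Contrary to the hypothesis, the debugging intervention significantly reduced users' reliance on the AI system.
Participants exposed to the debugging task may have formed a negative perception of the AI through early exposure to its errors, which lowered their trust.
Lower-performing users underestimated AI performance and less often made correct decisions with confidence, indicating under-reliance that stems from misjudgment.
When the AI's advice differed from users' initial decision, their self-assessment dropped markedly, suggesting that disagreement undermines self-confidence.
Users who deferred to the AI reported higher self-assessment than users who held to their own decision, indicating a risk of over-reliance on the AI in some contexts.
The study shows that cognitive biases, in particular overestimating one's own ability and underestimating the AI's, give rise to inappropriate reliance patterns (especially disuse) that hinder beneficial use of AI.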
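
The self-assessment bias described in these findings can be illustrated by comparing each participant's estimated score with their actual score, grouped by actual performance. This is a rough sketch under stated assumptions, not the paper's procedure. The equal-sized grouping, the function name `miscalibration_by_group`, the 10-item scoring, and the data are all hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def miscalibration_by_group(
    records: List[Tuple[int, int]], n_groups: int = 4
) -> Dict[int, float]:
    """records holds (actual_correct, self_estimated_correct) per participant.

    Participants are sorted by actual score and split into n_groups groups
    of near-equal size (group 1 = lowest performers). The result maps each
    group to its mean (estimate - actual); positive means overestimation.
    """
    if len(records) < n_groups:
        raise ValueError("need at least one participant per group")
    ordered = sorted(records, key=lambda r: r[0])
    result: Dict[int, float] = {}
    for g in range(n_groups):
        start = g * len(ordered) // n_groups
        end = (g + 1) * len(ordered) // n_groups
        result[g + 1] = mean(est - act for act, est in ordered[start:end])
    return result


# Hypothetical scores out of 10 tasks, invented purely for illustration.
records = [(3, 7), (4, 6), (5, 6), (6, 6), (7, 7), (7, 6), (9, 8), (9, 7)]

for group, gap in miscalibration_by_group(records).items():
    print(f"group {group}: mean(estimate - actual) = {gap}")
```

In this made-up data the lowest-performing group overestimates itself the most and the highest-performing group underestimates itself, which is the pattern usually associated with the Dunning-Kruger effect. Whether the study's participants show the same shape is a question for the paper itself.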

This review was generated by AI and checked by a human editor.