QUICK REVIEW

[Paper Review] Automating Thematic Analysis: How LLMs Analyse Controversial Topics

Awais Hameed Khan, Hiruni Kegalle | arXiv (Cornell University) | May 11, 2024

Artificial Intelligence in Law | Cited by 8

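One-line summary

This study pilots LLM-assisted thematic analysis of media coverage of the Robodebt scandal, comparing GPT-4, Llama 2, and Claude 3 with human coders to explore alignment, bias, and the potential of human-in-the-loop methods.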

ABSTRACT

Large Language Models (LLMs) are promising analytical tools. They can augment human epistemic, cognitive and reasoning abilities, and support 'sensemaking', making sense of a complex environment or subject by analysing large volumes of data with a sensitivity to context and nuance absent in earlier text processing systems. This paper presents a pilot experiment that explores how LLMs can support thematic analysis of controversial topics. We compare how human researchers and two LLMs GPT-4 and Llama 2 categorise excerpts from media coverage of the controversial Australian Robodebt scandal. Our findings highlight intriguing overlaps and variances in thematic categorisation between human and machine agents, and suggest where LLMs can be effective in supporting forms of discourse and thematic analysis. We argue LLMs should be used to augment, and not replace human interpretation, and we add further methodological insights and reflections to existing research on the application of automation to qualitative research methods. We also introduce a novel card-based design toolkit, for both researchers and practitioners to further interrogate LLMs as analytical tools.

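Motivation and objectives

Investigate whether LLMs can support thematic analysis (TA) of controversial topics without replacing human interpretation.
Compare the themes identified by GPT-4, Llama 2, and human researchers in a controlled TA task.
Evaluate how prompt design and model updates affect alignment with human coding.
Provide methodological reflections and a design toolkit for using LLMs in qualitative analysis.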

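Proposed method

Extract a small thematic corpus from commentary on Robodebt (transcripts and victims' statements).
Following Braun and Clarke's guidelines, use ChatGPT to create a codebook of 11 themes.
Compare human coders and two LLMs (GPT-4 and Llama 2) on whether each theme is present in each sentence.
Iterate with revised prompts and models (GPT-4, Claude 3, Llama 3) and assess how the results change.
Use a thresholding approach (50% and 70%) to convert LLM scores into binary theme assignments.
Report inter-agent correlations and qualitative observations.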
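
The thresholding and correlation steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names `binarise` and `pearson`, the 0-100 score scale, the "score >= threshold" rule, and all the numbers are assumptions made up for the example, and the review does not state which correlation measure the paper used.

```python
from math import sqrt


def binarise(scores, threshold):
    """Turn 0-100 theme scores into 0/1 assignments (assumed rule: score >= threshold)."""
    return [1 if score >= threshold else 0 for score in scores]


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Invented data: one score per (sentence, theme) pair, flattened into a list.
llm_scores = [80, 55, 20, 65, 90, 40]
human_codes = [1, 0, 0, 1, 1, 0]  # 1 = the human coder marked the theme as present

llm_codes_50 = binarise(llm_scores, 50)
llm_codes_70 = binarise(llm_scores, 70)

print("LLM codes at 50%:", llm_codes_50)
print("LLM codes at 70%:", llm_codes_70)
print("correlation with human coder at 50%:",
      round(pearson(llm_codes_50, human_codes), 3))
```

Keeping binarisation separate from the correlation makes it easy to rerun the same comparison at a different cut-off or against a different coder.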

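Experimental results

Research questions

RQ1: Can LLMs generate meaningful themes and apply them to text data without replacing human interpretation?
RQ2: How do GPT-4, Llama 2, and human coders converge or diverge when assigning themes to controversial material?
RQ3: Do prompt design and model updates affect alignment with human coding in thematic analysis?
RQ4: What are the methodological considerations for human-in-the-loop TA with LLMs?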

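Key findings

LLMs can generate themes and apply them to text, and their output prompts human researchers to reflectively re-evaluate their coding.
GPT-4 and Llama 2 show a moderate positive correlation with the human coders; GPT-4 tends to assign more themes overall.
Raising the LLM score threshold (for example, from 50% to 70%) increases agreement with human coders and reduces the frequency of theme assignments.
The 2024 models (GPT-4, Claude 3) with refined prompts align more closely with human coding at the theme level. However, sentence-level assignments differ.
Combining sceptical, concise prompts with iterative score revision can steer LLM output toward human-like coding.
Prior topic knowledge and model bias may influence qualitative interpretation and should be kept in mind.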
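
To make the threshold finding concrete, the sketch below sweeps two cut-offs over a toy score list and reports how many themes get assigned and how often the binarised LLM codes match a human coder. The helper `sweep_thresholds`, the simple percent-agreement measure, and every number are hypothetical and chosen only to illustrate the mechanism; they are not data or results from the paper.

```python
def sweep_thresholds(scores, human_codes, thresholds):
    """For each threshold, count assigned themes and the share of matching codes."""
    if len(scores) != len(human_codes) or not scores:
        raise ValueError("scores and human_codes must be non-empty and aligned")
    rows = []
    for threshold in thresholds:
        # Assumed rule: a theme is assigned when its score reaches the threshold.
        llm_codes = [1 if score >= threshold else 0 for score in scores]
        matches = sum(1 for llm, human in zip(llm_codes, human_codes) if llm == human)
        rows.append({
            "threshold": threshold,
            "assigned": sum(llm_codes),
            "agreement": matches / len(human_codes),
        })
    return rows


# Toy data only: LLM scores (0-100) and human 0/1 codes for eight sentence-theme pairs.
toy_scores = [80, 55, 20, 65, 90, 40, 72, 58]
toy_human = [1, 0, 0, 1, 1, 0, 1, 0]

for row in sweep_thresholds(toy_scores, toy_human, [50, 70]):
    print(row)
```

Percent agreement is used here only because it is easy to read; a chance-corrected statistic such as Cohen's kappa would be a reasonable substitute in a real analysis.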


This review was generated by AI and checked by a human editor.