QUICK REVIEW

[論文レビュー] Scalable Qualitative Coding with LLMs: Chain-of-Thought Reasoning Matches Human Performance in Some Hermeneutic Tasks

Zackary Okun Dunivin|arXiv (Cornell University)|Jan 26, 2024

Semantic Web and Ontologies被引用数 8

ひとこと要約

GPT-4 はゼロショットプロンプトを用いたケーススタディにおいて人間と同等の質的コード化を達成し、思考過程の推論（チェーン・オブ・ソート）により信頼性が向上する。一方で GPT-3.5 は大きく劣る。

ABSTRACT

Qualitative coding, or content analysis, extracts meaning from text to discern quantitative patterns across a corpus of texts. Recently, advances in the interpretive abilities of large language models (LLMs) offer potential for automating the coding process (applying category labels to texts), thereby enabling human researchers to concentrate on more creative research aspects, while delegating these interpretive tasks to AI. Our case study comprises a set of socio-historical codes on dense, paragraph-long passages representative of a humanistic study. We show that GPT-4 is capable of human-equivalent interpretations, whereas GPT-3.5 is not. Compared to our human-derived gold standard, GPT-4 delivers excellent intercoder reliability (Cohen's $κ\geq 0.79$) for 3 of 9 codes, and substantial reliability ($κ\geq 0.6$) for 8 of 9 codes. In contrast, GPT-3.5 greatly underperforms for all codes ($mean(κ) = 0.34$; $max(κ) = 0.55$). Importantly, we find that coding fidelity improves considerably when the LLM is prompted to give rationale justifying its coding decisions (chain-of-thought reasoning). We present these and other findings along with a set of best practices for adapting traditional codebooks for LLMs. Our results indicate that for certain codebooks, state-of-the-art LLMs are already adept at large-scale content analysis. Furthermore, they suggest the next generation of models will likely render AI coding a viable option for a majority of codebooks.

研究の動機と目的

人文学的タスクにおける質的コード化のための人間由来のコードブックをLLM駆動で適用する。
最先端のLLMが複数のコードにわたって人間のインターコーダ信頼性に匹敵できるかを評価する。
チェーン・オブ・ソートを含む prompting 戦略がコード化の忠実度に及ぼす影響を評価する。
LLM支援型コード化ワークフローの設計に関するベストプラクティスを提供する。

提案手法

人文学的タスクにおける nine codes across three categories の人間由来コードブックを使用する。
Du Bois を言及する111のゴールドスタンダード文節（平均約94語）について、GPT-4とGPT-3.5を比較する。
Per Code vs Full Codebook のタスク提示を用いたゼロショットプロンプトを評価する。
推論過程 prompting を組み込み、コード化意思決定における推論の影響を評価する。
プロンプトのバリエーション（推論過程、1コードずつ vs フルコードブック）をテストし、インターコーダ信頼性（コーエンの κ）を報告する。
コードブックをLLMに適応させるための実践的ベストプラクティスを提供する。

Figure 1: The chain-of-thought prompt sequence.

実験結果

リサーチクエスチョン

RQ1GPT-4 と GPT-3.5 は質的コード化タスクにおいて人間のコード者と同等のインターコーダ信頼性を達成できるのか？
RQ2チェーン・オブ・ソート prompting は人文学的風のコードブックに対する LLM のコード化忠実度を改善するのか？
RQ3コードを別々のプロンプト（1つのタスクにつき1つのコード）として提示する方法は、LLMベースのコード化よりフルコードブックプロンプトより効果的か？
RQ4人間のゴールド標準との整合を最大化するコードブックの適応とプロンプト設計は何か？

主な発見

GPT-4 はゼロショットプロンプトでもいくつかのコードで人間に等しい解釈を達成する（例：9コード中3コードで κ が 0.75 を超える）。
GPT-4 は Per Code プロンプトで 8/9 のコードに対して信頼性が高い（κ ≥ 0.6）、さらに 3 コードでは優れた信頼性（κ ≥ 0.79）。
GPT-3.5 はすべてのコードで劣る（平均 κ ≈ 0.34；最大 ≈ 0.55）。
GPT-4 に推論過程を提供させるとコード化の忠実度が向上する（平均 κ、Per Code: 0.68（CoT）対 0.59（非 CoT）; Full Codebook: 0.60（CoT）対 0.46（非 CoT））。
各コードを別々のプロンプトとして提示する方が、フルコードブック方式より信頼性が高い（報告された比較で平均 κ は 0.68 対 0.60）。
推論過程およびコード別 prompting は LLM 指向の質的コード化を改善する実践的ベストプラクティスとして特定されている。

Figure 2: Two examples of prompt redefinition. Colored, alphabetically labeled blocks of text show alterations derived through iterative code refinement. Italics draw attention to direction to constrain interpretive scope to implicit or explicit information.

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。