QUICK REVIEW

[논문 리뷰] Scalable Qualitative Coding with LLMs: Chain-of-Thought Reasoning Matches Human Performance in Some Hermeneutic Tasks

Zackary Okun Dunivin|arXiv (Cornell University)|2024. 01. 26.

Semantic Web and Ontologies인용 수 8

한 줄 요약

GPT-4는 제로샷 프롬프트를 사용한 사례 연구에서 인간과 동등한 질적 코딩을 달성하며, 사고의 흐름(chain-of-thought) 추론이 신뢰성을 높인다; GPT-3.5는 대부분 성능이 떨어진다.

ABSTRACT

Qualitative coding, or content analysis, extracts meaning from text to discern quantitative patterns across a corpus of texts. Recently, advances in the interpretive abilities of large language models (LLMs) offer potential for automating the coding process (applying category labels to texts), thereby enabling human researchers to concentrate on more creative research aspects, while delegating these interpretive tasks to AI. Our case study comprises a set of socio-historical codes on dense, paragraph-long passages representative of a humanistic study. We show that GPT-4 is capable of human-equivalent interpretations, whereas GPT-3.5 is not. Compared to our human-derived gold standard, GPT-4 delivers excellent intercoder reliability (Cohen's $κ\geq 0.79$) for 3 of 9 codes, and substantial reliability ($κ\geq 0.6$) for 8 of 9 codes. In contrast, GPT-3.5 greatly underperforms for all codes ($mean(κ) = 0.34$; $max(κ) = 0.55$). Importantly, we find that coding fidelity improves considerably when the LLM is prompted to give rationale justifying its coding decisions (chain-of-thought reasoning). We present these and other findings along with a set of best practices for adapting traditional codebooks for LLMs. Our results indicate that for certain codebooks, state-of-the-art LLMs are already adept at large-scale content analysis. Furthermore, they suggest the next generation of models will likely render AI coding a viable option for a majority of codebooks.

연구 동기 및 목표

인문학 유사 작업에서 LLM 주도 질적 코딩을 위한 인간 코드북을 적응시키기.
최신 LLM이 다중 코드에 걸쳐 인간 간 코더 신뢰도와 일치할 수 있는지 평가하기.
사고의 흐름(chain-of-thought)을 포함한 프롬프트 전략이 코딩 정확도에 미치는 영향을 평가하기.
LLM 보조 코딩 워크플로 설계를 위한 모범 사례 지침을 제공하기.

제안 방법

세 범주에 걸친 아홉 개 코드에 대해 인간 유래 코드북을 사용한다.
Du Bois를 언급한 111개의 골드 표준 구절에서 GPT-4와 GPT-3.5를 비교한다(평균 약 94단어).
Per Code 대 Full Codebook 작업 프롬프트를 제로샷으로 평가한다.
사고의 흐름 프롬프트를 도입하여 근거의 코딩 결정에 미치는 영향을 평가한다.
프롬프트 변형(사유 포함, 코드 하나씩 대 전체 코드북) 을 테스트하고 코더 간 신뢰도(Cohen의 kappa)를 보고한다.
LLM에 맞게 코드북을 적응시키는 실용적인 모범 사례 지침을 제공한다.

Figure 1: The chain-of-thought prompt sequence.

실험 결과

연구 질문

RQ1GPT-4와 GPT-3.5가 질적 코딩 과제에서 인간 코더와 비슷한 코더 간 일치도를 달성할 수 있는가?
RQ2사고의 흐름 프롬프트가 인문학 유사 코드북에서 LLM의 코딩 정확도를 향상시키는가?
RQ3코드를 별도의 프롬프트로 제시하는 것이(작업당 한 코드) 전체 코드북 프롬프트보다 LLM 기반 코딩에 더 효과적인가?
RQ4어떤 코드북 적응 및 프롬프트 설계가 인간의 골드 표준과의 정렬을 최대화하는가?

주요 결과

GPT-4는 제로샷 프롬프트로 여러 코드에 대해 인간과 동등한 해석을 달성한다(예: 9개 코드 중 3개가 카파 0.75를 넘는다).
GPT-4는 Per Code 프롬프트에서 9개 코드 중 8개에서 상당한 신뢰도(kappa ≥ 0.6)를 보이고, 3개 코드에서 우수한 신뢰도(kappa ≥ 0.79)를 보인다.
GPT-3.5는 모든 코드에서 저조한 성능을 보이며(평균 kappa 약 0.34; 최대 약 0.55).
GPT-4에 근거 제공(사고의 흐름)을 프롬프팅하면 코딩 정확도가 향상된다(평균 kappa, Per Code: CoT 0.68 대 비-CoT 0.59; Full Codebook: CoT 0.60 대 비-CoT 0.46).
각 코드를 별도의 프롬프트로 제시하는 것이 전체 코드북 접근 방식보다 더 높은 신뢰도를 보인다(보고된 비교에서 평균 kappa 0.68 대 0.60).
사유화와 코드별 프롬프팅은 LLM 주도 질적 코딩의 품질을 높이기 위한 실용적인 모범 사례로 확인되었다.

Figure 2: Two examples of prompt redefinition. Colored, alphabetically labeled blocks of text show alterations derived through iterative code refinement. Italics draw attention to direction to constrain interpretive scope to implicit or explicit information.

더 나은 연구,지금 바로 시작하세요

연구 설계부터 논문 작성까지, 연구 시간을 획기적으로 줄여보세요.

카드 등록 없음 · 무료 플랜 제공

이 리뷰는 AI가 만들고, 인간 에디터가 검토했습니다.