QUICK REVIEW

[Paper Review] Prevalence and prevention of large language model use in crowd work

Veniamin Veselovsky, Manoel Horta Ribeiro|arXiv (Cornell University)|Oct 24, 2023

Mobile Crowdsensing and Crowdsourcing|Cited by 16

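One-Line Summary

This paper shows that LLM use is common among crowd workers (about 30%) and that targeted mitigations can reduce it but not eliminate it. LLM-assisted responses tend to be high-quality but homogeneous, which may affect the validity of research that relies on crowdsourced data.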

ABSTRACT

We show that the use of large language models (LLMs) is prevalent among crowd workers, and that targeted mitigation strategies can significantly reduce, but not eliminate, LLM use. On a text summarization task where workers were not directed in any way regarding their LLM use, the estimated prevalence of LLM use was around 30%, but was reduced by about half by asking workers to not use LLMs and by raising the cost of using them, e.g., by disabling copy-pasting. Secondary analyses give further insight into LLM use and its prevention: LLM use yields high-quality but homogeneous responses, which may harm research concerned with human (rather than model) behavior and degrade future models trained with crowdsourced data. At the same time, preventing LLM use may be at odds with obtaining high-quality responses; e.g., when requesting workers not to use LLMs, summaries contained fewer keywords carrying essential information. Our estimates will likely change as LLMs increase in popularity or capabilities, and as norms around their usage change. Yet, understanding the co-evolution of LLM-based tools and users is key to maintaining the validity of research done using crowdsourcing, and we provide a critical baseline before widespread adoption ensues.

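Motivation and Objectives

Quantify how widespread LLM use is among crowd workers on Prolific during a text summarization task.
Evaluate the effectiveness of two mitigation strategies for preventing LLM use: direct or indirect requests not to use LLMs, and hurdles that impede copy-pasting.
Assess the effect of LLM use on data quality, including keyword retention and response homogeneity.
Explore correlates of LLM use (e.g., worker age, awareness of prior research) and content-level characteristics of responses.
Provide a baseline for future research on the co-evolution of LLMs and crowd work practices.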

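Proposed Method

Conducted two studies on Prolific using a text summarization task based on summaries from prior research.
Developed a fine-tuned e5-base-v2 classifier to detect LLM-generated text, and estimated prevalence using calibration and several aggregation methods.
Ran a 3×3 factorial design in Study #2, crossing LLM-use requests (None/Indirect/Direct) with copy-paste hurdles (None/Image/No-Ctrl-Copy).
Compared classifier-based, self-reported, and heuristic measures for estimating LLM use.
Analyzed correlations between LLM use and workers' age and self-reported awareness of related research.
Used linear probability models to estimate intervention effects, and homogeneity metrics to analyze content-level differences.
Calibrated model outputs with temperature scaling to improve probability estimates, and applied prevalence adjustment techniques to account for misclassification.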
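The last step above combines two ideas that are easy to state in code: temperature scaling of classifier outputs, and correcting a raw "flagged" rate for classifier error. The following is a minimal sketch, not the authors' implementation. It assumes a binary classifier that emits a single logit, and it uses the standard adjusted-count correction (observed rate corrected by true and false positive rates); the review does not say which adjustment the paper used. All function names and the example values are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def calibrated_probability(logit, temperature):
    """Temperature scaling for a binary classifier (assumed form):
    divide the logit by T before applying the sigmoid.
    T > 1 softens confident predictions, T < 1 sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return sigmoid(logit / temperature)


def classify_and_count(probabilities, threshold=0.5):
    """Raw prevalence: share of items whose probability meets the threshold."""
    if not probabilities:
        raise ValueError("need at least one probability")
    flagged = sum(1 for p in probabilities if p >= threshold)
    return flagged / len(probabilities)


def adjusted_prevalence(observed_rate, tpr, fpr):
    """Adjusted count: correct the observed rate for misclassification.

    Since observed = tpr * p + fpr * (1 - p), solving for p gives
    p = (observed - fpr) / (tpr - fpr). The result is clipped to [0, 1].
    tpr and fpr would come from a held-out labeled set.
    """
    if tpr <= fpr:
        raise ValueError("classifier must beat chance (tpr > fpr)")
    estimate = (observed_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, estimate))


# Hypothetical logits and error rates, for illustration only.
example_logits = [3.0, -2.0, 0.8, -0.4]
probs = [calibrated_probability(z, temperature=1.5) for z in example_logits]
raw = classify_and_count(probs)
corrected = adjusted_prevalence(raw, tpr=0.9, fpr=0.1)
```

The correction matters most when the classifier's false positive rate is comparable to the prevalence being measured, since a raw count would then overstate LLM use.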

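Experimental Results

Research Questions

RQ1: What is the baseline prevalence of LLM use among Prolific crowd workers on a text summarization task when no explicit instructions about LLM use are given?
RQ2: Can targeted mitigations, including explicit requests not to use LLMs and hurdles such as image-based text or disabled copy-pasting, substantially reduce LLM use?
RQ3: How do mitigations affect the quality and characteristics of crowd-generated summaries (e.g., keyword retention and homogeneity)?
RQ4: Is LLM use among crowd workers associated with demographic or awareness-related factors?
RQ5: How do LLM-based tools co-evolve with crowd work practices, and what are the implications for research validity?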

Key Findings

Among workers given no instructions, estimated LLM use was about 30–35% across estimation methods.
Direct or indirect requests combined with copy-paste hurdles significantly reduce LLM usage, though they do not eliminate it.
Directly asking workers not to use LLMs combined with image-based text (hurdle) reduced LLM usage from 27.6% to 15.9% by one measure.
LLM-generated (synthetic) summaries were more homogeneous and retained more keywords than human-generated ones under certain conditions.
Awareness of LLM-use studies did not significantly decrease usage, while younger workers and those who reported frequent LLM use were more likely to use LLMs for the task.
Mitigation strategies can inadvertently reduce data quality, for example through lower keyword retention when workers are explicitly told not to use LLMs.
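To put the 27.6% to 15.9% figure above in context, the reduction can be expressed both in percentage points and relative to the baseline. This is a small worked calculation using only the two numbers reported in the findings; the helper name is hypothetical.

```python
def reduction(baseline_pct, treated_pct):
    """Return (absolute reduction in percentage points, relative reduction as a fraction)."""
    if baseline_pct <= 0:
        raise ValueError("baseline must be positive")
    absolute = baseline_pct - treated_pct
    relative = absolute / baseline_pct
    return absolute, relative


abs_pp, rel = reduction(27.6, 15.9)
print(f"{abs_pp:.1f} percentage points, {rel:.1%} relative reduction")
# prints "11.7 percentage points, 42.4% relative reduction"
```

A relative reduction of roughly 42% is in line with the abstract's statement that mitigation reduced LLM use "by about half".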


This review was generated by AI and checked by a human editor.