[Paper Review] Whose LLM is it Anyway? Linguistic Comparison and LLM Attribution for GPT-3.5, GPT-4 and Bard
The paper shows that GPT-3.5, GPT-4, and Bard have distinct linguistic styles across vocabulary, POS, dependencies, and sentiment, enabling 88% accuracy in attributing text to its LLM origin using a simple classifier.
Large Language Models (LLMs) are capable of generating text that is similar to or surpasses human quality. However, it is unclear whether LLMs tend to exhibit distinctive linguistic styles akin to how human authors do. Through a comprehensive linguistic analysis, we compare the vocabulary, Part-Of-Speech (POS) distribution, dependency distribution, and sentiment of texts generated by three of the most popular LLMs today (GPT-3.5, GPT-4, and Bard) to diverse inputs. The results point to significant linguistic variations which, in turn, enable us to attribute a given text to its LLM origin with a favorable 88% accuracy using a simple off-the-shelf classification model. Theoretical and practical implications of this intriguing finding are discussed.
Research Motivation and Objectives
- Investigate whether major LLMs exhibit distinguishable linguistic styles similar to human authors.
- Characterize vocabulary, POS, dependency, and sentiment differences across GPT-3.5, GPT-4, and Bard.
- Assess the feasibility of LLM attribution using linguistic features with a supervised model.
Proposed Method
- Construct LC2 by extending HC3 with GPT-3.5, GPT-4, and Bard responses to 1,000 inputs per dataset across five datasets (total 5,000 inputs, 15,000 responses).
- Analyze vocabulary, POS, dependencies, and sentiment using ANOVA with Tukey post-hoc tests, KS tests with Bonferroni correction, and Wilcoxon tests (p<0.05).
- Train an off-the-shelf XGBoost classifier on linguistic features for LLM attribution with 5-fold cross-validation.
- Report feature importance via information gain and model performance metrics (recall, F1, accuracy).
- Provide code and data accessibility through a public repository.
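
The Bonferroni step in the testing procedure above is simple arithmetic: three LLMs give three pairwise comparisons, so each test is judged against 0.05 / 3 rather than 0.05. The Python sketch below illustrates only that step. The p-values and the helper name `bonferroni_significant` are hypothetical and are not taken from the paper.

```python
from itertools import combinations

LLMS = ["GPT-3.5", "GPT-4", "Bard"]
ALPHA = 0.05  # significance level stated in the review


def bonferroni_significant(p_values, alpha=ALPHA):
    """Flag each comparison as significant under a Bonferroni-corrected threshold.

    p_values maps a pair of model names to the raw p-value of one pairwise test.
    """
    threshold = alpha / len(p_values)  # divide alpha by the number of tests
    return {pair: p < threshold for pair, p in p_values.items()}


# Every unordered pair of the three models: 3 comparisons in total.
pairs = list(combinations(LLMS, 2))

# Hypothetical raw p-values, made up purely for illustration.
raw_p = dict(zip(pairs, [0.001, 0.030, 0.012]))

decisions = bonferroni_significant(raw_p)
for pair, significant in decisions.items():
    print(pair, "significant" if significant else "not significant")
```

In this toy input, a p-value of 0.030 would pass an uncorrected 0.05 cutoff but fails the corrected cutoff of roughly 0.0167.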
Experimental Results
Research Questions
- RQ1: Do GPT-3.5, GPT-4, and Bard exhibit statistically distinct linguistic markers in vocabulary, POS, dependencies, and sentiment?
- RQ2: Can linguistic features enable accurate attribution of text to its LLM origin?
- RQ3: Which linguistic features contribute most to LLM attribution across models?
Key Findings
| Dataset | LLM | Average length | Vocabulary size | Density |
|---|---|---|---|---|
| Finance | GPT-3.5 | 208.13 | 20974 | 2.49 |
| Finance | GPT-4 | 197.53 | 22785 | 2.73 |
| Finance | Bard | 219.28 | 21809 | 2.64 |
| Medicine | GPT-3.5 | 206.14 | 7910 | 3.11 |
| Medicine | GPT-4 | 168.09 | 8827 | 5.69 |
| Medicine | Bard | 180.16 | 7594 | 3.24 |
| open_qa | GPT-3.5 | 142.61 | 15379 | 9.06 |
| open_qa | GPT-4 | 88.42 | 12097 | 16.93 |
| open_qa | Bard | 65.74 | 10829 | 17.34 |
| reddit_eli5 | GPT-3.5 | 191.38 | 45198 | 1.40 |
| reddit_eli5 | GPT-4 | 151.18 | 48095 | 2.05 |
| reddit_eli5 | Bard | 133.70 | 46147 | 1.87 |
| wiki_csai | GPT-3.5 | 202.39 | 9347 | 5.03 |
| wiki_csai | GPT-4 | 215.05 | 10074 | 6.73 |
| wiki_csai | Bard | 186.18 | 9240 | 7.18 |
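
The review does not define the Density column. One plausible reading, assumed here, is the definition used by HC3: density = 100 × V / (L × N), where V is the vocabulary size, L the average response length in words, and N the number of responses. The Python sketch below computes the three table columns for a toy corpus under that assumption. The whitespace tokenizer, the function name `corpus_stats`, and the sample texts are illustrative only, and the sketch is not expected to reproduce the paper's figures.

```python
def corpus_stats(responses):
    """Return (average length, vocabulary size, density) for a list of texts.

    ASSUMPTION: density = 100 * V / (L * N), i.e. unique words per 100 words.
    Tokenization is a plain lowercase whitespace split, chosen for brevity.
    """
    tokenized = [text.lower().split() for text in responses]
    n = len(tokenized)                                   # N: number of responses
    avg_len = sum(len(toks) for toks in tokenized) / n   # L: mean words per response
    vocab = {tok for toks in tokenized for tok in toks}  # set of unique words
    density = 100 * len(vocab) / (avg_len * n)
    return avg_len, len(vocab), density


# Toy corpus, invented for illustration.
sample = [
    "the model answers the question",
    "the model explains the answer clearly",
]
avg_len, vocab_size, density = corpus_stats(sample)
print(avg_len, vocab_size, round(density, 2))
```

Because L × N is the total word count, this reading of density amounts to the number of distinct words per 100 words of generated text.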
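- Bard produces the shortest responses in three of the five datasets (open_qa, reddit_eli5, wiki_csai) and has the smallest vocabulary in three (Medicine, open_qa, wiki_csai); its density is the highest of the three models in open_qa and wiki_csai.
- GPT-4 has a higher density than GPT-3.5 in all five datasets and a larger vocabulary in four of them; open_qa is the exception.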
- POS and dependency patterns differ significantly among the three LLMs, with Bard showing more diverse usage in low-frequency POS and certain dependency types.
- Sentiment skews positive for all three models (approx. 53% positive), with no significant differences between them.
- An XGBoost classifier using the linguistic features achieves 0.88 accuracy (F1 0.87) in attributing text to GPT-3.5, GPT-4, or Bard.
- Top features for attribution include noun/proper noun usage, positive sentiment, punctuation, vocabulary density, and word count.
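
A few of the feature families named above (word count, punctuation, vocabulary diversity) can be computed with the Python standard library alone. The sketch below is a simplified stand-in, not the paper's feature set: POS, dependency, and sentiment features would need an NLP toolkit, and the function name `surface_features` and the exact feature definitions are assumptions made for illustration.

```python
import re
import string


def surface_features(text):
    """Compute a small, illustrative feature vector for one response."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())  # crude word tokenizer
    punct = [ch for ch in text if ch in string.punctuation]
    word_count = len(words)
    return {
        "word_count": word_count,
        "punct_count": len(punct),
        # Share of distinct words; 0.0 for empty input to avoid dividing by zero.
        "type_token_ratio": len(set(words)) / word_count if word_count else 0.0,
    }


features = surface_features("Well, it depends. It depends on the data!")
print(features)
```

Dictionaries like this, one per response, could be stacked into the feature matrix for a classifier such as the XGBoost model the paper reports using.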
This review was generated by AI and checked by a human editor.