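[Paper Review] TimeLMs: Diachronic Language Models from Twitter
TimeLMs is a set of time-specific, RoBERTa-base language models trained on Twitter data and updated and released every three months. The paper shows that older models degrade on future data and that continual updating helps.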
Despite its importance, the time variable has been largely neglected in the NLP and language model literature. In this paper, we present TimeLMs, a set of language models specialized on diachronic Twitter data. We show that a continual learning strategy contributes to enhancing Twitter-based language models' capacity to deal with future and out-of-distribution tweets, while making them competitive with standardized and more monolithic benchmarks. We also perform a number of qualitative analyses showing how they cope with trends and peaks in activity involving specific named entities or concept drift.
Research Motivation and Objectives
- Motivate the need for diachronic, time-aware language models in fast-changing social media like Twitter.
- Show that continual, quarterly updating improves performance on future/out-of-distribution tweets.
- Provide a practical framework and tooling for time-aware evaluation and usage of TimeLMs.
Proposed Method
- Train an initial RoBERTa-base model on 2018-2019 Twitter data (2019-90M).
- Continually train updated models every three months using newly collected Twitter data.
- Clean data by removing tweets from the top 1% most active users, removing duplicates and near-duplicates, and anonymizing user mentions (except those of verified users).
- Evaluate models with TweetEval benchmark and pseudo-perplexity on time-sliced test sets.
- Provide a Python interface to compute pseudo-perplexity and masked predictions across time-specific models.
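The cleaning step in the list above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `clean_tweets`, the `(user, text)` input format, the `@user` placeholder, and the near-duplicate key (lowercased text with punctuation stripped) are all assumptions made for this example.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")

def clean_tweets(tweets, verified, top_percent=1):
    """Filter a list of (user, text) pairs; returns the cleaned texts.

    Hypothetical helper: the placeholder and dedup key are assumptions.
    """
    # 1) Drop every tweet from the top `top_percent`% most active users.
    counts = Counter(user for user, _ in tweets)
    n_drop = len(counts) * top_percent // 100
    dropped = {user for user, _ in counts.most_common(n_drop)}

    seen = set()
    cleaned = []
    for user, text in tweets:
        if user in dropped:
            continue
        # 2) Anonymize mentions, keeping handles of verified accounts.
        anon = MENTION.sub(
            lambda m: m.group(0) if m.group(1).lower() in verified else "@user",
            text,
        )
        # 3) Near-duplicate key: lowercase, no punctuation, collapsed spaces.
        key = " ".join(re.sub(r"[^\w\s@]", "", anon.lower()).split())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(anon)
    return cleaned

# Toy data: 100 users, one of which posts far more than the others.
tweets = [("spammer", f"buy now {i}") for i in range(50)]
tweets += [(f"user{i}", "Good morning @friend and @NASA!") for i in range(2)]
tweets += [(f"user{i}", f"tweet number {i}") for i in range(2, 99)]
cleaned = clean_tweets(tweets, verified={"nasa"})
```

On this toy input the most active account is dropped entirely, the repeated greeting is kept once, and only the non-verified mention is replaced.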
Experimental Results
Research Questions
- RQ1: Do time-specific language models handle diachronic shifts in Twitter data better than static baselines?
- RQ2: How does continual quarterly updating influence performance on newer versus older time periods?
- RQ3: To what extent do increased data size and recency each drive improvements in time-aware LMs?
- RQ4: Can a practical tooling interface enable easy time-aware evaluation and usage of TimeLMs?
Key Findings
| Models | Emoji | Emotion | Hate | Irony | Offensive | Sentiment | Stance | ALL |
|---|---|---|---|---|---|---|---|---|
| SVM | 29.3 | 64.7 | 36.7 | 61.7 | 52.3 | 62.9 | 67.3 | 53.5 |
| FastText | 25.8 | 65.2 | 50.6 | 63.1 | 73.4 | 62.9 | 65.4 | 58.1 |
| BLSTM | 24.7 | 66.0 | 52.6 | 62.8 | 71.7 | 58.3 | 59.4 | 56.5 |
| RoBERTa-Base | 30.8 | 76.6 | 44.9 | 55.2 | 78.7 | 72.0 | 70.9 | 61.3 |
| TweetEval | 31.6 | 79.8 | 55.5 | 62.5 | 81.6 | 72.9 | 72.6 | 65.2 |
| BERTweet | 33.4 | 79.3 | 56.4 | 82.1 | 79.5 | 73.4 | 71.2 | 67.9 |
| TimeLM-19 | 33.4 | 81.0 | 58.1 | 48.0 | 82.4 | 73.2 | 70.7 | 63.8 |
| TimeLM-21 | 34.0 | 80.2 | 55.1 | 64.5 | 82.2 | 73.7 | 72.9 | 66.2 |
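- Time-aware models are competitive on TweetEval: TimeLM-21 scores 66.2 overall, above the TweetEval baseline (65.2) and below BERTweet (67.9), a gap driven by the Irony task (64.5 vs. 82.1). TimeLM-21 has the highest scores on Emoji, Sentiment, and Stance.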
- Pseudo-perplexity results indicate newer models generally outperform older ones on contemporaneous test data, reflecting reduced degradation over time.
- Quarterly updates reduce degradation over time, though older periods benefit from larger cumulative data in some settings.
- A control experiment suggests that increasing training data size improves performance, while recency primarily benefits more recent test sets.
- Qualitative examples show time-specific models better predict period-relevant masked tokens (e.g., COVID era, Squid Game) than older models.
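The pseudo-perplexity comparisons in the findings above rest on masking each token in turn and scoring the true token with the masked language model. The sketch below shows that computation under stated assumptions: it uses the common corpus-level definition (exponential of the negative pseudo-log-likelihood per token), which may differ from the paper's exact aggregation, and the `scorer` callable and `uniform_scorer` are hypothetical stand-ins for a real masked LM such as one of the TimeLMs checkpoints.

```python
import math
from typing import Callable, Sequence

# Hypothetical scorer signature: returns log P(tokens[i] | all other tokens),
# i.e. the masked-LM log-probability of the true token at position i.
MaskedScorer = Callable[[Sequence[str], int], float]

def pseudo_log_likelihood(tokens: Sequence[str], scorer: MaskedScorer) -> float:
    """Sum of masked log-probabilities, masking one position at a time."""
    return sum(scorer(tokens, i) for i in range(len(tokens)))

def pseudo_perplexity(corpus: Sequence[Sequence[str]], scorer: MaskedScorer) -> float:
    """exp(-PLL / N) over the corpus, where N is the total token count."""
    total_tokens = sum(len(tokens) for tokens in corpus)
    if total_tokens == 0:
        raise ValueError("corpus contains no tokens")
    total_pll = sum(pseudo_log_likelihood(tokens, scorer) for tokens in corpus)
    return math.exp(-total_pll / total_tokens)

def uniform_scorer(vocab_size: int) -> MaskedScorer:
    """Toy model (assumption): every token is equally likely everywhere."""
    def score(tokens: Sequence[str], i: int) -> float:
        return -math.log(vocab_size)
    return score

corpus = [
    ["so", "glad", "i", "got", "vaccinated"],
    ["watching", "squid", "game"],
]
pppl = pseudo_perplexity(corpus, uniform_scorer(4))
```

With the uniform toy scorer, the pseudo-perplexity equals the vocabulary size, which is a quick sanity check. Comparing models from different quarters would amount to swapping in a different `scorer` for the same time-sliced test set; lower values mean the model finds those tweets less surprising.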
This review was generated by AI and checked by a human editor.