QUICK REVIEW

[Paper Review] Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

Matteo Rinaldi, Rossella Varvara|arXiv (Cornell University)|Feb 16, 2026

Authorship Attribution and Profiling | Cited by 0

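One-Sentence Summary

The paper introduces Testimole-conversational, a 30-billion-word corpus of Italian discussion board messages (1996-2024) drawn from Usenet and web forums, intended for language modeling and sociolinguistic research and to be released publicly to the research community.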

ABSTRACT

We present "Testimole-conversational", a massive collection of discussion board messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for native Italian Large Language Models' pre-training. Furthermore, discussion boards' messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction over a wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also supports investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.

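Research Motivation and Objectives

Build a large, diachronic corpus of Italian computer-mediated communication from Usenet and discussion boards.
Enable data-driven linguistic and sociolinguistic analysis of three decades of informal written Italian.
Provide a resource suitable for pre-training and domain adaptation of Italian language models.
Support analysis of orthographic forms, discourse dynamics, and online social interaction over time.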

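Proposed Method

The data sources are Usenet newsgroups and online forums written in Italian.
Web scraping was carried out between February and May 2024 to collect posts dating back to 1996.
Each post is stored with its metadata (title, anonymized author, thread ID, progressive post ID, timestamp, and forum/newsgroup) alongside the text content.
For language model training, posts were tokenized with a subword tokenizer (Tiktoken BPE, cl100k_base) to estimate token counts.
Every post carries a timestamp, giving the corpus a diachronic dimension that enables time-based linguistic analysis.
Usernames were anonymized to address privacy concerns.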
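The record layout and the anonymization step listed above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the field names, the `Post` class, and the `pseudonymize` helper are hypothetical, and the summary does not say how usernames were actually anonymized. Keyed hashing (HMAC-SHA256) is used here only as one common way to obtain stable pseudonyms.

```python
import hashlib
import hmac
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Post:
    """One message with the metadata fields named in the summary.

    Field names are illustrative assumptions, not the released schema.
    """
    title: str
    author: str        # pseudonymized, never the raw username
    thread_id: str
    post_id: int       # progressive position of the post in its thread
    timestamp: datetime
    source: str        # forum or newsgroup name
    text: str


def pseudonymize(username: str, secret_key: bytes) -> str:
    """Map a username to a stable pseudonym with a keyed hash.

    Assumption: the paper's actual anonymization method is not described
    in this summary; HMAC-SHA256 is a stand-in.
    """
    digest = hmac.new(secret_key, username.encode("utf-8"), hashlib.sha256)
    return "user_" + digest.hexdigest()[:16]


def make_post(raw: dict, secret_key: bytes) -> Post:
    """Build a Post from a scraped record, replacing the raw username."""
    return Post(
        title=raw["title"],
        author=pseudonymize(raw["author"], secret_key),
        thread_id=raw["thread_id"],
        post_id=int(raw["post_id"]),
        timestamp=datetime.fromisoformat(raw["timestamp"]),
        source=raw["source"],
        text=raw["text"],
    )
```

Because the same username always maps to the same pseudonym under a given key, per-author analyses remain possible without exposing the original handle.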

Figure 1: Total size per year. Forum overtakes Usenet around 2004

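Experimental Results

Research Questions

RQ1: How has the use of informal Italian on discussion boards evolved over nearly three decades (lexical and grammatical change)?
RQ2: How are topics and genres distributed across Italian Usenet and forum discussions, and how has that distribution changed over time?
RQ3: Is the Testimole-conversational subset suitable for pre-training Italian language models and for sociolinguistic research?
RQ4: What limitations and potential sources of noise arise when the corpus is used for NLP and sociolinguistic analysis?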

Key Findings

The corpus contains roughly 30 billion word tokens: about 23 billion from forums and about 7 billion from Usenet.
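The forum portion holds 468,391,746 posts across 25,280,745 threads (18.5 posts per thread on average); the Usenet portion holds 89,499,446 posts across 14,521,548 threads (about 6 posts per thread on average).
After subword tokenization, the counts are about 62 billion tokens for forums and 20 billion for Usenet.
Popular topics include politics (about 6% of Usenet and about 9% of forums), technology forums such as hwupgrade (about 15% of forums), and forums focused on women's issues such as alfemminile.
The dataset shows diachronic trends, such as the rise of neologisms like troll, smartphone, and streaming.
The resource is intended to support language modeling, domain adaptation, conversation analysis, and sociolinguistic research; the paper also points out potential noise and caveats that call for careful use in machine learning.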
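The per-thread averages in the findings follow directly from the reported post and thread counts. The short check below recomputes them; the pooled total and pooled average are derived here for illustration and are not figures quoted in the summary.

```python
# Post and thread counts as reported in the findings above.
COUNTS = {
    "forum": {"posts": 468_391_746, "threads": 25_280_745},
    "usenet": {"posts": 89_499_446, "threads": 14_521_548},
}

# Average posts per thread for each source.
averages = {
    name: c["posts"] / c["threads"] for name, c in COUNTS.items()
}

# Pooled figures (derived, not stated in the summary).
total_posts = sum(c["posts"] for c in COUNTS.values())
total_threads = sum(c["threads"] for c in COUNTS.values())
overall_average = total_posts / total_threads

for name, avg in averages.items():
    print(f"{name}: {avg:.2f} posts per thread")
print(f"pooled: {overall_average:.2f} posts per thread")
```

The forum ratio rounds to 18.5, matching the summary, while the Usenet ratio comes out slightly above 6, so the reported value of 6 is a rounded figure.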

Figure 2: Usenet - Number of tokens per year
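A per-year breakdown like the one in the figures, and the rise of a neologism over time, can both be computed from post timestamps. The sketch below is an assumption-laden illustration: the function names are hypothetical, posts are represented as plain `(timestamp, text)` pairs, and words are counted with a simple regular expression rather than the cl100k_base subword tokenizer mentioned in the method section.

```python
import re
from collections import Counter
from datetime import datetime

# Crude word matcher; an assumption standing in for a real tokenizer.
WORD_RE = re.compile(r"\w+")


def words(text: str) -> list:
    """Lowercased word tokens of a post."""
    return [w.lower() for w in WORD_RE.findall(text)]


def words_per_year(posts) -> dict:
    """Total word count per calendar year, sorted by year."""
    totals = Counter()
    for timestamp, text in posts:
        totals[timestamp.year] += len(words(text))
    return dict(sorted(totals.items()))


def term_rate_per_year(posts, term: str) -> dict:
    """Occurrences of `term` per million words, for each year."""
    term = term.lower()
    totals, hits = Counter(), Counter()
    for timestamp, text in posts:
        tokens = words(text)
        totals[timestamp.year] += len(tokens)
        hits[timestamp.year] += tokens.count(term)
    return {
        year: 1_000_000 * hits[year] / totals[year]
        for year in sorted(totals)
        if totals[year] > 0
    }


# Tiny made-up sample, only to show the call shape.
sample = [
    (datetime(1999, 5, 1), "Ciao a tutti, qualcuno usa Linux?"),
    (datetime(2012, 3, 9), "Ho comprato uno smartphone nuovo"),
    (datetime(2012, 7, 2), "Lo smartphone si scalda troppo"),
]
print(words_per_year(sample))
print(term_rate_per_year(sample, "smartphone"))
```

Normalizing by the yearly word total matters here because the corpus volume changes considerably from year to year, so raw counts alone would mostly reflect corpus size.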


This review was generated by AI and checked by a human editor.