QUICK REVIEW

[Paper Review] Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

Matteo Rinaldi, Rossella Varvara|arXiv (Cornell University)|2026. 02. 16.

Authorship Attribution and Profiling | Citations: 0

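One-Line Summary

This paper presents Testimole-conversational, a 30-billion-word corpus of Italian discussion board messages (1996-2024) collected from Usenet and web forums. It is intended for language modeling and sociolinguistic research and will be made publicly available to the research community.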

ABSTRACT

We present "Testimole-conversational" a massive collection of discussion boards messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for native Italian Large Language Models' pre-training. Furthermore, discussion boards' messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction in wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also support investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.

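Motivation and Goals

Build a large-scale diachronic (historical) corpus of Italian computer-mediated communication from Usenet and discussion boards.
Enable data-driven linguistic and sociolinguistic analysis of informal Italian across nearly three decades.
Provide a resource suited to pre-training and domain adaptation of Italian language models.
Support analysis of orthographic forms, discourse dynamics, and online social interaction over time.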

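Proposed Method

The data sources are Usenet newsgroups and online forums written in Italian.
Web scraping was carried out between February and May 2024, collecting posts dating back to 1996.
Each post is stored with its text content and metadata: title, anonymized author, thread ID, progressive post ID, timestamp, and forum or newsgroup.
Posts are tokenized with a sub-word tokenizer (Tiktoken BPE, cl100k_base) to estimate token counts for LM training.
Post timestamps serve as diachronic annotation, enabling time-based linguistic analysis.
Usernames are anonymized to address privacy considerations.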
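
The per-post metadata and username anonymization described above can be illustrated with a small sketch. The review does not say how the authors anonymize usernames or how they name their fields, so everything here is an assumption: the `Post` field names, the `pseudonymize` helper, the salted SHA-256 scheme, and the example post are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


def pseudonymize(username: str, salt: str) -> str:
    """Map a username to a stable pseudonym (assumed scheme: salted SHA-256)."""
    digest = hashlib.sha256((salt + username).encode("utf-8")).hexdigest()
    return "user_" + digest[:12]


@dataclass(frozen=True)
class Post:
    # Hypothetical field names mirroring the metadata listed in the review.
    title: str
    author: str        # pseudonym only, never the raw username
    thread_id: str
    post_index: int    # progressive position of the post within its thread
    timestamp: datetime
    source: str        # forum or newsgroup name
    text: str


def make_post(title, raw_author, thread_id, post_index, timestamp, source, text, salt):
    """Build a Post, replacing the raw username with its pseudonym."""
    return Post(title, pseudonymize(raw_author, salt), thread_id,
                post_index, timestamp, source, text)


# Invented example record, for illustration only.
example = make_post(
    title="Consiglio scheda video",
    raw_author="mario_rossi",
    thread_id="t-000123",
    post_index=1,
    timestamp=datetime(2005, 3, 14, 10, 30, tzinfo=timezone.utc),
    source="hwupgrade",
    text="Quale scheda video mi consigliate?",
    salt="hypothetical-secret-salt",
)
print(example.author)
```

Because the pseudonym is deterministic for a given salt, posts by the same user stay linked across threads, which matters for conversational and authorship analyses. The pseudonym is only as private as the salt is secret.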

Figure 1: Total size per year. Forum overtakes Usenet around 2004

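Experimental Results

Research Questions

RQ1: How has informal Italian on discussion boards evolved over nearly 30 years (lexical and grammatical change)?
RQ2: What is the distribution of topics and genres in Italian Usenet and forum discussions, and how does it change over time?
RQ3: Are the Testimole-conversational subsets suitable for pre-training Italian language models and for sociolinguistic research?
RQ4: What are the limitations and potential sources of noise when using this corpus for NLP and sociolinguistic analysis?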

Key Results

The corpus contains approximately 30 billion word-tokens, about 23 billion from forums and 7 billion from Usenet.
Forums contain 468,391,746 posts in 25,280,745 threads (average 18.5 posts per thread); Usenet contains 89,499,446 posts in 14,521,548 threads (average of about 6 posts per thread).
Token counts after sub-word tokenization are 62 billion for forums and 20 billion for Usenet.
Top topics include politics (about 6% of Usenet data; forums also show politics at ~9%), technology forums like hwupgrade (~15% of forums), and forums focusing on women's topics like alfemminile.
The dataset reveals diachronic trends such as the rise of neologisms like troll, smartphone, and streaming across time.
The resource is intended to support language modeling, domain adaptation, conversational analysis, and sociolinguistic studies, while noting potential noise and the need for careful use in ML.
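
As a quick sanity check, the per-thread averages and the sub-word-to-word ratios can be recomputed from the figures reported above. This is plain arithmetic on the review's numbers. The word and token totals are rounded to the billion, so the ratios are approximate.

```python
# Post and thread counts as reported in the key results above.
forum_posts, forum_threads = 468_391_746, 25_280_745
usenet_posts, usenet_threads = 89_499_446, 14_521_548

forum_avg = forum_posts / forum_threads      # posts per thread
usenet_avg = usenet_posts / usenet_threads

# Rounded word-token and sub-word-token totals, in billions.
forum_words_b, usenet_words_b = 23, 7
forum_tokens_b, usenet_tokens_b = 62, 20

forum_ratio = forum_tokens_b / forum_words_b     # sub-word tokens per word
usenet_ratio = usenet_tokens_b / usenet_words_b

print(f"forum:  {forum_avg:.1f} posts/thread, {forum_ratio:.2f} sub-word tokens per word")
print(f"usenet: {usenet_avg:.1f} posts/thread, {usenet_ratio:.2f} sub-word tokens per word")
```

The forum average comes out at about 18.5 and the Usenet average at about 6.2, consistent with the reported values. Both sources land near 2.7 to 2.9 cl100k_base tokens per word.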

Figure 2: Usenet - Number of tokens per year
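
Diachronic trends such as the rise of troll, smartphone, and streaming could be measured along the following lines. This is a minimal sketch, not the authors' procedure: the function name `yearly_relative_frequency`, the simple regex word splitting, the per-million normalization, and the toy posts are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

WORD_RE = re.compile(r"\w+")


def yearly_relative_frequency(posts, terms):
    """posts: iterable of (year, text) pairs.

    Returns {term: {year: occurrences per million words}}.
    """
    totals = Counter()            # words seen per year
    hits = defaultdict(Counter)   # term -> year -> occurrences
    wanted = {t.lower() for t in terms}
    for year, text in posts:
        words = WORD_RE.findall(text.lower())
        totals[year] += len(words)
        for word in words:
            if word in wanted:
                hits[word][year] += 1
    return {
        term: {
            year: 1e6 * hits[term][year] / totals[year]
            for year in sorted(totals)
            if totals[year]
        }
        for term in sorted(wanted)
    }


# Invented toy data, for illustration only.
toy_posts = [
    (1998, "Ho comprato un nuovo modem ieri"),
    (2012, "Il mio smartphone non fa streaming"),
    (2012, "Non rispondere al troll"),
]
freq = yearly_relative_frequency(toy_posts, ["troll", "smartphone", "streaming"])
for term, by_year in freq.items():
    print(term, by_year)
```

Normalizing by the yearly word total matters here because corpus size varies strongly from year to year, so raw counts would mostly reflect how much was posted rather than how usage changed.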


This review was generated by AI and reviewed by a human editor.