[Paper Review] RusLICA: A Russian-Language Platform for Automated Linguistic Inquiry and Category Analysis
tldr: RusLICA adapts LIWC methodology for Russian, building a 96-category dictionary and an automated analyzer using NLP parsers and pre-trained models, deployed as a public web service.
Identifying psycholinguistic characteristics in written texts is a task gaining increasing attention from researchers. One of the most widely used tools in this field is Linguistic Inquiry and Word Count (LIWC), which was originally developed to analyze English texts and has since been translated into multiple languages. Our approach adapts the LIWC methodology to the Russian language, taking its grammatical and cultural specificities into account. The suggested approach comprises 96 categories, integrating syntactic, morphological, lexical, and general statistical features with predictions obtained from pre-trained language models (LMs) for text analysis. Rather than directly translating existing thesauri, we built the dictionary specifically for the Russian language from several lexicographic resources, semantic dictionaries, and corpora. The paper describes the process of mapping lemmas to 42 psycholinguistic categories and the implementation of the analyzer as part of the RusLICA web service.
Research Motivation and Objectives
- Adapt LIWC to the Russian language, taking its morphology and cultural context into account.
- Construct a 96-category lexicon and an automated analyzer for Russian texts.
- Integrate syntactic, morphological, lexical, and model-based features for text analysis.
- Provide a publicly accessible web service for researchers to analyze Russian corpora.
Proposed Method
- Develop a LIWC-like dictionary for Russian covering 96 categories across linguistic and psychological dimensions.
- Use the spaCy ru_core_news_lg model for tokenization, lemmatization, and dependency parsing to derive syntactic and morphological features.
- Draw on Russian semantic dictionaries, the Russian National Corpus (RNC), and RuWordNet to build 42 lexical categories totaling 8309 entries.
- Normalize lemmas with MyStem to align text with dictionary entries for scoring.
- Incorporate a pretrained Russian emotion-detection model (Aniemore/rubert-tiny2-russian-emotion-detection) for classification into 7 emotions.
- Provide a web service, RusLICA, that accepts uploaded datasets (.csv/.xlsx), computes category scores, and returns the results as CSV/JSON.
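The dictionary-scoring step in the list above can be sketched as follows. This is a minimal illustration, not the RusLICA implementation: the toy dictionary, the toy lemma table (standing in for a lemmatizer such as MyStem), the category names, and the percentage-of-tokens normalization (the usual LIWC convention) are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical toy dictionary: lemma -> set of category names.
# The actual dictionary reportedly holds 8309 entries in 42 lexical categories.
TOY_DICTIONARY = {
    "радость": {"positive_emotion"},
    "грусть": {"negative_emotion"},
    "друг": {"social"},
    "семья": {"social", "family"},
}

# Hypothetical stand-in for a real lemmatizer: word form -> lemma.
TOY_LEMMAS = {
    "друзья": "друг",
    "семье": "семья",
    "радости": "радость",
}


def lemmatize(text):
    """Lowercase the text, extract word tokens, and map each token to a lemma."""
    tokens = re.findall(r"[а-яёa-z]+", text.lower())
    return [TOY_LEMMAS.get(token, token) for token in tokens]


def category_scores(text, dictionary=TOY_DICTIONARY):
    """Return each matched category's share of all tokens, as a percentage."""
    lemmas = lemmatize(text)
    if not lemmas:
        return {}
    counts = Counter()
    for lemma in lemmas:
        # A lemma may belong to several categories; count it once for each.
        for category in dictionary.get(lemma, ()):
            counts[category] += 1
    return {cat: 100.0 * n / len(lemmas) for cat, n in counts.items()}


scores = category_scores("Друзья и семья приносят много радости")
```

Because one lemma can belong to several categories, the category percentages need not sum to 100. In a fuller pipeline the toy lemma table would be replaced by real lemmatizer output, so that inflected forms line up with the dictionary's lemma entries.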
Experimental Results
Research Questions
- RQ1: How can LIWC-like psycholinguistic categories be effectively adapted to the Russian language and its morphology?
- RQ2: Can a publicly accessible tool accurately quantify 96 Russian lexical and linguistic features from written text?
- RQ3: What is the impact of combining lexical dictionaries with NLP parsers and language models on psycholinguistic analysis of Russian texts?
Key Findings
- A 96-category Russian analysis framework was implemented, combining lexical, syntactic, and morphological features with psycholinguistic dimensions.
- The dictionary uses lemmas mapped to categories, totaling 8309 lexical entries across 42 lexical categories plus non-lexical features.
- Preprocessing consists of normalization and lemmatization; features are computed from spaCy parses (ru_core_news_lg), with MyStem used to align lemmas with dictionary entries.
- The RusLICA service supports uploading datasets and returns category scores for texts in CSV/JSON formats within a 12-hour processing limit.
- A pretrained emotion-detection model provides additional classification outputs for text emotion in the 7-emotion schema.
- The platform is freely accessible as RusLICA (ruslica.ipran.ru) for researchers to analyze large Russian text corpora.
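The upload-and-export workflow described above (tabular input, per-text scores, CSV/JSON output) might look roughly like the sketch below. It uses only the Python standard library; the column name `text`, the `row` field, and the placeholder scorer are hypothetical and do not reflect the service's actual input schema or output fields.

```python
import csv
import io
import json


def placeholder_scorer(text):
    """Hypothetical scorer returning two simple statistics for one text.

    A real analyzer would return the category scores here instead.
    """
    words = text.split()
    mean_length = sum(len(w) for w in words) / len(words) if words else 0.0
    return {"word_count": len(words), "mean_word_length": mean_length}


def score_csv(csv_text, text_column="text", scorer=placeholder_scorer):
    """Score every row of a CSV string and return a list of per-row dicts."""
    reader = csv.DictReader(io.StringIO(csv_text))
    results = []
    for row_id, row in enumerate(reader):
        record = {"row": row_id}
        record.update(scorer(row[text_column]))
        results.append(record)
    return results


def to_json(results):
    """Serialize results as JSON, keeping Cyrillic characters readable."""
    return json.dumps(results, ensure_ascii=False, indent=2)


def to_csv(results):
    """Serialize results as CSV with a header row taken from the first record."""
    if not results:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(results[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


sample = "text\nПривет мир\nКак дела сегодня\n"
results = score_csv(sample)
```

Passing the scorer as a parameter keeps the input/output handling separate from the analysis itself, so the same loop could wrap a dictionary-based scorer or a model-based classifier.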
This review was generated by AI and checked by a human editor.