[Paper Review] Sentiment Analysis on Bangla and Romanized Bangla Text (BRBT) using Deep Recurrent models
This paper proposes a large-scale, post-processed, and multi-validated dataset for Bangla and Romanized Bangla text (BRBT) to enable robust sentiment analysis. It evaluates deep recurrent models, particularly Long Short-Term Memory (LSTM) networks, using binary and categorical cross-entropy loss functions, achieving promising results with cross-validation and transfer pre-training, thus establishing a reusable benchmark for future NLP research in Bangla.
Sentiment Analysis (SA) is an active research area in the digital age. With the rapid and constant growth of online social media sites and services, and the increasing amount of textual data available on them (statuses, comments, reviews, etc.), the application of automatic SA is on the rise. However, most research on SA in natural language processing (NLP) is based on the English language. Despite being the sixth most widely spoken language in the world, Bangla still does not have a large, standard dataset. Because of this, recent research in Bangla has failed to produce results that are both comparable to work done by others and reusable as stepping stones for future researchers in this field. Therefore, we first tried to provide a textual dataset that includes not just Bangla but also Romanized Bangla texts, and that is substantial, post-processed, and multiply validated, ready to be used in SA experiments. We tested this dataset on a deep recurrent model, specifically Long Short-Term Memory (LSTM), using two types of loss functions, binary cross-entropy and categorical cross-entropy, and also did some experimental pre-training by using data from one validation to pre-train the other and vice versa. Lastly, we documented the results along with some analysis of them; the results were promising.
Motivation & Objective
- To address the lack of standardized, large-scale datasets for sentiment analysis in Bangla, a language spoken by over 200 million people.
- To create a post-processed, multi-validated dataset that includes both native Bangla and Romanized Bangla text for improved NLP model training.
- To evaluate the performance of deep recurrent models, specifically LSTMs, on sentiment classification using multiple loss functions.
- To explore transfer learning via pre-training on one validation set to improve performance on another, enhancing model generalization.
- To provide a reusable, comparable benchmark for future research in Bangla sentiment analysis.
Proposed method
- The authors constructed a substantial, post-processed, and multi-validated dataset containing both Bangla and Romanized Bangla text for sentiment analysis.
- They applied Long Short-Term Memory (LSTM) networks as the core deep learning architecture for sequence modeling and sentiment classification.
- Two loss functions, binary cross-entropy and categorical cross-entropy, were used to train and evaluate the LSTM models.
- The study implemented cross-validation and experimental pre-training, in which a model pre-trained on data from one validation fold was then fine-tuned on another, and vice versa.
- The models were trained and evaluated using standard NLP pipelines, including tokenization, embedding, and sequence padding for input consistency.
- Performance was measured using standard classification metrics, with results analyzed across different data splits and training configurations.
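The preprocessing steps and the two loss functions listed above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the whitespace tokenizer, the reserved ids `PAD_ID` and `UNK_ID`, the helper names, and the two example sentences are all assumptions made here, and the embedding layer and the LSTM itself are left out.

```python
import math

PAD_ID = 0  # assumed: index 0 is reserved for padding
UNK_ID = 1  # assumed: index 1 is reserved for out-of-vocabulary tokens


def build_vocab(texts):
    """Map each whitespace-separated token to an integer id."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab) + 2  # ids 0 and 1 are reserved
    return vocab


def encode(text, vocab, max_len):
    """Convert text to ids, then truncate or right-pad to max_len."""
    ids = [vocab.get(tok, UNK_ID) for tok in text.lower().split()]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


def binary_cross_entropy(y_true, p_pos, eps=1e-12):
    """Loss for one example with a single sigmoid output p_pos = P(positive)."""
    p = min(max(p_pos, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Loss for one example with a softmax output over all classes."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


# Toy Romanized Bangla inputs, made up for illustration (not from the dataset).
texts = ["khub bhalo laglo", "ekdom bhalo na"]
vocab = build_vocab(texts)
padded = [encode(t, vocab, max_len=5) for t in texts]
print(padded)

# The same prediction scored under both label encodings.
print(binary_cross_entropy(1, 0.8))
print(categorical_cross_entropy([0, 1], [0.2, 0.8]))
```

With only two classes, the two losses give the same value for the same predicted probability; they differ in how labels are encoded (a single 0/1 target versus a one-hot vector) and in the output layer they pair with (one sigmoid unit versus a softmax over all classes).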
Experimental results
Research questions
- RQ1: Can a large, post-processed, and multi-validated dataset for Bangla and Romanized Bangla text improve the reliability and reusability of sentiment analysis models?
- RQ2: How do different loss functions (binary and categorical cross-entropy) affect the performance of LSTM-based sentiment classifiers on BRBT?
- RQ3: To what extent does pre-training on one validation fold improve performance on another in the context of Bangla sentiment analysis?
- RQ4: Can transfer learning between different folds of the BRBT dataset enhance model generalization and accuracy?
- RQ5: How do the results on Bangla and Romanized Bangla compare in terms of sentiment classification performance using deep recurrent models?
Key findings
- The proposed BRBT dataset is substantial, post-processed, and multi-validated, making it suitable for reliable sentiment analysis experiments.
- LSTM models trained with both binary and categorical cross-entropy loss functions achieved promising performance on the BRBT dataset.
- Pre-training on one validation fold and fine-tuning on another led to measurable improvements in model accuracy and generalization.
- The results demonstrated that Romanized Bangla text can be effectively leveraged in sentiment analysis with deep recurrent models.
- The study established a reusable benchmark for future research, enabling reproducible and comparable results in Bangla NLP.
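The pre-train-then-continue schedule mentioned in the findings can be illustrated with a toy example. The sketch below deliberately swaps the LSTM for a two-feature logistic regression so that it runs without any framework; the feature vectors, subset names, and hyperparameters are hypothetical, and the paper's actual procedure may differ in its details.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(data, weights, bias, epochs=200, lr=0.5):
    """Gradient descent on binary cross-entropy, starting from the given parameters."""
    weights = list(weights)
    for _ in range(epochs):
        for features, label in data:
            pred = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
            err = pred - label  # gradient of the loss with respect to the logit
            weights = [w - lr * err * x for w, x in zip(weights, features)]
            bias -= lr * err
    return weights, bias


def accuracy(data, weights, bias):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = 0
    for features, label in data:
        pred = sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)
        hits += int((pred >= 0.5) == bool(label))
    return hits / len(data)


# Hypothetical toy subsets standing in for two portions of the labeled data.
subset_a = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.1, 1.0], 0), ([0.0, 0.8], 0)]
subset_b = [([0.8, 0.1], 1), ([0.2, 0.9], 0)]

# Stage 1: pre-train on subset A from zero-initialised parameters.
w, b = train(subset_a, [0.0, 0.0], 0.0)
# Stage 2: continue training on subset B, starting from the pre-trained parameters.
w, b = train(subset_b, w, b)
print(accuracy(subset_b, w, b))
```

The only thing that distinguishes pre-training from training from scratch here is the starting point of stage 2: the parameters learned on the first subset are passed in instead of zeros. Swapping the roles of `subset_a` and `subset_b` gives the "vice versa" direction.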
This review was created by AI and reviewed by human editors.