CUET DIGITAL REPOSITORY

Text classification in a resource-constrained language using Deep Learning Techniques.

dc.contributor.author Hossain, Md. Rajib
dc.date.accessioned 2025-09-14T10:42:11Z
dc.date.available 2025-09-14T10:42:11Z
dc.date.issued 2024-05-23
dc.identifier.uri http://103.99.128.19:8080/xmlui/handle/123456789/483
dc.description Thesis in CSE en_US
dc.description.abstract The exponential growth of unstructured textual data on the World Wide Web, driven largely by extensive online engagement and social media activity, poses a significant challenge for classifying and organizing this data efficiently. Resource-constrained languages such as Bengali face specific hurdles, including a lack of annotated corpora, out-of-vocabulary word problems, inadequately tuned hyperparameters, and class distribution imbalances. To address these issues, this research introduces AFuNeT, an attention-based early fusion of transformer-based language models. This intelligent text classification system encompasses three crucial phases: corpora development, embedding model development, and text classification model development, offering a comprehensive solution for enhancing text classification in such languages. The first phase employs manual annotation and active learning techniques to develop two main corpus types: embedding corpora and text classification corpora. Five embedding corpora are created for tasks such as Bengali document classification, authorship attribution, and Covid text identification. The text classification corpora are divided into training and testing sets and comprise seven specific corpora: the Bengali Authorship Classification Corpus (BACC-18), Bengali Text Classification Corpus (BDTCC), Arabic Covid Text Classification Corpus (AraCoV), Bengali Covid Text Classification Corpus (BCovC), English Covid Text Classification Corpus (ECoVC), Bengali Text Classification Corpus (BTCC11), and Bengali Multi-Modal Text Classification Corpus (BMMTC). These corpora are essential for text classification and for evaluating model performance. In the second phase, we developed non-contextual embedding models using Bengali corpora and optimized their hyperparameters. These embeddings fall into three categories: individual (GloVe, FastText, Word2Vec), meta (AVG-M, CATM, CAT-SVD-M), and attention-based meta-embeddings (APeAGF, APeCGF). The top-performing models from intrinsic evaluations are selected for downstream tasks, with contextual embeddings such as transformer-based models used for fine-tuning and feature extraction. en_US
dc.language.iso en en_US
dc.publisher CUET en_US
dc.relation.ispartofseries TCD-54;T-345
dc.subject Corpora development, Data crawling, Embedding model en_US
dc.title Text classification in a resource-constrained language using Deep Learning Techniques. en_US
dc.type Thesis en_US
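
The abstract above names attention-based early fusion of features from transformer-based language models as the core of AFuNeT. The sketch below is a minimal illustration of that general idea, not the thesis implementation: the two-encoder setup, all dimensions, module names, and the tanh/softmax attention weighting are assumptions introduced here for illustration only.

# Minimal sketch (not AFuNeT itself): attention-weighted early fusion of
# document features produced by several pretrained language models,
# followed by a linear classification head. Dimensions, names, and the
# two-source setup are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionEarlyFusion(nn.Module):
    def __init__(self, source_dims, fused_dim=256, num_classes=11):
        super().__init__()
        # Project each source feature vector into a shared space.
        self.projections = nn.ModuleList([nn.Linear(d, fused_dim) for d in source_dims])
        # One scalar attention score per projected source.
        self.score = nn.Linear(fused_dim, 1)
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, sources):
        # sources: list of (batch, dim_i) tensors, one per language model.
        projected = torch.stack([p(x) for p, x in zip(self.projections, sources)], dim=1)
        weights = torch.softmax(self.score(torch.tanh(projected)), dim=1)
        fused = (weights * projected).sum(dim=1)  # attention-weighted early fusion
        return self.classifier(fused)

# Toy usage with two hypothetical encoder outputs (768- and 1024-dimensional).
model = AttentionEarlyFusion(source_dims=[768, 1024])
logits = model([torch.randn(4, 768), torch.randn(4, 1024)])
print(logits.shape)  # torch.Size([4, 11])

In this reading of "early fusion", the softmax over sources lets the model weight each pretrained encoder per document before any classifier sees the combined representation, rather than merging predictions afterwards.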

