Please use this identifier to cite or link to this item: http://103.99.128.19:8080/xmlui/handle/123456789/483
Full metadata record
DC Field | Value | Language
dc.contributor.author | Hossain, Md. Rajib | -
dc.date.accessioned | 2025-09-14T10:42:11Z | -
dc.date.available | 2025-09-14T10:42:11Z | -
dc.date.issued | 2024-05-23 | -
dc.identifier.uri | http://103.99.128.19:8080/xmlui/handle/123456789/483 | -
dc.description | Thesis in CSE | en_US
dc.description.abstract | The exponential growth of unstructured textual data on the World Wide Web, particularly due to extensive online engagement and social media activity, poses a significant challenge in classifying and organizing this data efficiently. Resource-constrained languages like Bengali face specific hurdles, including a lack of annotated corpora, out-of-vocabulary word problems, inadequately tuned hyperparameters, and class distribution imbalances. To address these issues, this research introduces AFuNeT, an attention-based early fusion of transformer-based language models. This intelligent text classification system encompasses three crucial phases: corpora development, embedding model development, and text classification model development, offering a comprehensive solution to enhance text classification capabilities for such languages. The first phase employs manual annotation and active learning techniques to develop two main corpus types: embedding and text classification corpora. Five specific embedding corpora are created for tasks such as Bengali document classification, authorship attribution, and Covid text identification. The text classification corpora are further divided into training and testing sets, comprising seven specific corpora: the Bengali Authorship Classification Corpus (BACC-18), Bengali Text Classification Corpus (BDTCC), Arabic Covid Text Classification Corpus (AraCoV), Bengali Covid Text Classification Corpus (BCovC), English Covid Text Classification Corpus (ECoVC), Bengali Text Classification Corpus (BTCC11), and Bengali Multi-Modal Text Classification Corpus (BMMTC). These corpora are essential for text classification and model performance evaluation. In the second phase, we develop non-contextual embedding models using Bengali corpora and optimize their hyperparameters. These embeddings fall into three categories: individual (GloVe, FastText, Word2Vec), meta (AVG-M, CAT-M, CAT-SVD-M), and attention-based meta-embeddings (APeAGF, APeCGF). The top-performing models from intrinsic evaluations are selected for downstream tasks, with contextual embeddings such as transformer-based models used for fine-tuning and feature extraction. | en_US
dc.language.iso | en | en_US
dc.publisher | CUET | en_US
dc.relation.ispartofseries | TCD-54;T-345 | -
dc.subject | Corpora development, Data crawling, embedding model | en_US
dc.title | Text classification in a resource-constrained language using Deep Learning Techniques. | en_US
dc.type | Thesis | en_US
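The abstract names three meta-embedding strategies: averaging (AVG-M), concatenation (CAT-M), and concatenation followed by dimensionality reduction (CAT-SVD-M). A minimal sketch of how such combinations are commonly formed, assuming aligned vocabularies across the source embeddings (function names are illustrative, not from the thesis):

```python
import numpy as np

def avg_meta(embs):
    """AVG-M-style: element-wise average of aligned source embeddings."""
    return np.mean(np.stack(embs), axis=0)

def cat_meta(embs):
    """CAT-M-style: concatenation of source embeddings for one word."""
    return np.concatenate(embs)

def cat_svd_meta(emb_matrix, k):
    """CAT-SVD-M-style: concatenated per-word vectors reduced to k dims
    via truncated SVD over the whole vocabulary matrix."""
    u, s, _ = np.linalg.svd(emb_matrix, full_matrices=False)
    return u[:, :k] * s[:k]

# Toy example: one word represented by a 4-dim GloVe-like and a
# 4-dim FastText-like vector (values are arbitrary).
glove_v = np.array([1.0, 2.0, 3.0, 4.0])
fasttext_v = np.array([3.0, 2.0, 1.0, 0.0])

avg_v = avg_meta([glove_v, fasttext_v])   # 4-dim average
cat_v = cat_meta([glove_v, fasttext_v])   # 8-dim concatenation

# Reduce a toy 5-word concatenated matrix (5 x 8) down to 3 dims.
vocab_matrix = np.vstack([np.concatenate([glove_v, fasttext_v])] * 5)
reduced = cat_svd_meta(vocab_matrix, k=3)
```

The attention-based variants (APeAGF, APeCGF) instead learn per-source weights rather than fixing them, which is what distinguishes them from these static combinations.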
Appears in Collections:Thesis in CSE

Files in This Item:
File | Description | Size | Format
PhD_Thesis_for_printing-5.pdf | | 37.96 MB | Adobe PDF (View/Open)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.