CUET DIGITAL REPOSITORY

Text classification in a resource-constrained language using Deep Learning Techniques.

dc.contributor.author Hossain, Md. Rajib
dc.date.accessioned 2025-09-14T10:42:11Z
dc.date.available 2025-09-14T10:42:11Z
dc.date.issued 2024-05-23
dc.identifier.uri http://103.99.128.19:8080/xmlui/handle/123456789/483
dc.description Thesis in CSE en_US
dc.description.abstract The exponential growth of unstructured textual data on the World Wide Web, driven largely by extensive online engagement and social media activity, poses a significant challenge for classifying and organizing this data efficiently. Resource-constrained languages such as Bengali face specific hurdles, including a lack of annotated corpora, out-of-vocabulary word problems, inadequately tuned hyperparameters, and class distribution imbalances. To address these issues, this research introduces AFuNeT, an attention-based early fusion of transformer-based language models. This intelligent text classification system encompasses three crucial phases: corpora development, embedding model development, and text classification model development, offering a comprehensive solution for enhancing text classification in such languages. The first phase employs manual annotation and active learning techniques to develop two main corpus types: embedding corpora and text classification corpora. Five embedding corpora are created for tasks such as Bengali document classification, authorship attribution, and Covid text identification. The text classification corpora are divided into training and testing sets and comprise seven specific corpora: the Bengali Authorship Classification Corpus (BACC-18), Bengali Text Classification Corpus (BDTCC), Arabic Covid Text Classification Corpus (AraCoV), Bengali Covid Text Classification Corpus (BCovC), English Covid Text Classification Corpus (ECoVC), Bengali Text Classification Corpus (BTCC11), and Bengali Multi-Modal Text Classification Corpus (BMMTC). These corpora are essential for text classification and for evaluating model performance. In the second phase, we developed non-contextual embedding models using Bengali corpora and optimized their hyperparameters. These embeddings fall into three categories: individual (GloVe, FastText, Word2Vec), meta (AVG-M, CATM, CAT-SVD-M), and attention-based meta-embeddings (APeAGF, APeCGF). The top-performing models from intrinsic evaluations are selected for downstream tasks, with contextual embeddings such as transformer-based models used for fine-tuning and feature extraction. en_US
dc.language.iso en en_US
dc.publisher CUET en_US
dc.relation.ispartofseries TCD-54;T-345
dc.subject Corpora development, Data crawling, Embedding model en_US
dc.title Text classification in a resource-constrained language using Deep Learning Techniques. en_US
dc.type Thesis en_US
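
The abstract above names attention-based early fusion of features from transformer-based language models as the core of AFuNeT. The sketch below is a minimal illustration of that general idea, not the thesis implementation: the two-encoder setup, all dimensions, module names, and the tanh/softmax attention weighting are assumptions introduced here for illustration only.

# Minimal sketch (not AFuNeT itself): attention-weighted early fusion of
# document features produced by several pretrained language models,
# followed by a linear classification head. Dimensions, names, and the
# two-source setup are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionEarlyFusion(nn.Module):
    def __init__(self, source_dims, fused_dim=256, num_classes=11):
        super().__init__()
        # Project each source feature vector into a shared space.
        self.projections = nn.ModuleList([nn.Linear(d, fused_dim) for d in source_dims])
        # One scalar attention score per projected source.
        self.score = nn.Linear(fused_dim, 1)
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, sources):
        # sources: list of (batch, dim_i) tensors, one per language model.
        projected = torch.stack([p(x) for p, x in zip(self.projections, sources)], dim=1)
        weights = torch.softmax(self.score(torch.tanh(projected)), dim=1)
        fused = (weights * projected).sum(dim=1)  # attention-weighted early fusion
        return self.classifier(fused)

# Toy usage with two hypothetical encoder outputs (768- and 1024-dimensional).
model = AttentionEarlyFusion(source_dims=[768, 1024])
logits = model([torch.randn(4, 768), torch.randn(4, 1024)])
print(logits.shape)  # torch.Size([4, 11])

In this reading of "early fusion", the softmax over sources lets the model weight each pretrained encoder per document before any classifier sees the combined representation, rather than merging predictions afterwards.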

