Please use this identifier to cite or link to this item: http://103.99.128.19:8080/xmlui/handle/123456789/483
Full metadata record
DC Field | Value | Language
dc.contributor.author | Hossain, Md. Rajib | -
dc.date.accessioned | 2025-09-14T10:42:11Z | -
dc.date.available | 2025-09-14T10:42:11Z | -
dc.date.issued | 2024-05-23 | -
dc.identifier.uri | http://103.99.128.19:8080/xmlui/handle/123456789/483 | -
dc.description | Thesis in CSE | en_US
dc.description.abstract | The exponential growth of unstructured textual data on the World Wide Web, particularly due to extensive online engagement and social media activity, poses a significant challenge in classifying and organizing this data efficiently. Resource-constrained languages like Bengali face specific hurdles, including a lack of annotated corpora, out-of-vocabulary word problems, inadequately tuned hyperparameters, and class distribution imbalances. To address these issues, this research introduces AFuNeT, an attention-based early fusion of transformer-based language models. This intelligent text classification system encompasses three crucial phases: corpora development, embedding model development, and text classification model development, offering a comprehensive solution to enhance text classification capabilities for such languages. The first phase employs manual annotation and active learning techniques to develop two main corpus types: embedding and text classification corpora. Five specific embedding corpora are created for tasks such as Bengali document classification, authorship attribution, and Covid text identification. The text classification corpora are further divided into training and testing sets, comprising seven specific corpora: the Bengali Authorship Classification Corpus (BACC-18), Bengali Text Classification Corpus (BDTCC), Arabic Covid Text Classification Corpus (AraCoV), Bengali Covid Text Classification Corpus (BCovC), English Covid Text Classification Corpus (ECoVC), Bengali Text Classification Corpus (BTCC11), and Bengali Multi-Modal Text Classification Corpus (BMMTC). These corpora are essential for text classification and model performance evaluation. In the second phase, we develop non-contextual embedding models using Bengali corpora and optimize their hyperparameters. These embeddings fall into three categories: individual (GloVe, FastText, Word2Vec), meta (AVG-M, CAT-M, CAT-SVD-M), and attention-based meta-embeddings (APeAGF, APeCGF). The top-performing models from intrinsic evaluations are selected for downstream tasks, with contextual embeddings such as transformer-based models used for fine-tuning and feature extraction. | en_US
dc.language.iso | en | en_US
dc.publisher | CUET | en_US
dc.relation.ispartofseries | TCD-54;T-345 | -
dc.subject | Corpora development, Data crawling, embedding model | en_US
dc.title | Text classification in a resource-constrained language using Deep Learning Techniques. | en_US
dc.type | Thesis | en_US
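The abstract names three meta-embedding strategies: averaging (AVG-M), concatenation (CAT-M), and concatenation followed by dimensionality reduction (CAT-SVD-M). A minimal sketch of how such combinations are commonly formed, assuming aligned vocabularies across the source embeddings (function names are illustrative, not from the thesis):

```python
import numpy as np

def avg_meta(embs):
    """AVG-M-style: element-wise average of aligned source embeddings."""
    return np.mean(np.stack(embs), axis=0)

def cat_meta(embs):
    """CAT-M-style: concatenation of source embeddings for one word."""
    return np.concatenate(embs)

def cat_svd_meta(emb_matrix, k):
    """CAT-SVD-M-style: concatenated per-word vectors reduced to k dims
    via truncated SVD over the whole vocabulary matrix."""
    u, s, _ = np.linalg.svd(emb_matrix, full_matrices=False)
    return u[:, :k] * s[:k]

# Toy example: one word represented by a 4-dim GloVe-like and a
# 4-dim FastText-like vector (values are arbitrary).
glove_v = np.array([1.0, 2.0, 3.0, 4.0])
fasttext_v = np.array([3.0, 2.0, 1.0, 0.0])

avg_v = avg_meta([glove_v, fasttext_v])   # 4-dim average
cat_v = cat_meta([glove_v, fasttext_v])   # 8-dim concatenation

# Reduce a toy 5-word concatenated matrix (5 x 8) down to 3 dims.
vocab_matrix = np.vstack([np.concatenate([glove_v, fasttext_v])] * 5)
reduced = cat_svd_meta(vocab_matrix, k=3)
```

The attention-based variants (APeAGF, APeCGF) instead learn per-source weights rather than fixing them, which is what distinguishes them from these static combinations.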
Appears in Collections:Thesis in CSE

Files in This Item:
File | Description | Size | Format
PhD_Thesis_for_printing-5.pdf | | 37.96 MB | Adobe PDF (View/Open)


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.