Abstract:
The exponential growth of unstructured textual data on the World Wide Web,
particularly due to extensive online engagement and social media activities, poses
a significant challenge in classifying and organizing this data efficiently. Resource-constrained
languages like Bengali face specific hurdles, including a lack of annotated
corpora, out-of-vocabulary word problems, inadequately tuned hyperparameters,
and class distribution imbalances. To address these issues, this research
introduces AFuNeT, an attention-based early fusion of transformer-based language
models. This intelligent text classification system comprises three
crucial phases: corpora development, embedding model development, and text
classification model development, offering a comprehensive solution to enhance
text classification capabilities for such languages.
The first phase employs manual annotation and active learning techniques to
develop two main corpus types: embedding and text classification corpora. Five
specific embedding corpora are created for various tasks like Bengali document
classification, authorship attribution, and Covid text identification. The text
classification corpora are further divided into training and testing sets and
comprise seven corpora: the Bengali Authorship Classification Corpus (BACC-18),
Bengali Text Classification Corpus (BDTCC), Arabic Covid Text Classification
Corpus (AraCoV), Bengali Covid Text Classification Corpus (BCovC), English
Covid Text Classification Corpus (ECoVC), Bengali Text Classification Corpus
(BTCC11), and Bengali Multi-Modal Text Classification Corpus (BMMTC).
These corpora are essential for text classification and model performance evaluation.
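The abstract does not fix a particular query strategy for the active learning step;
the sketch below illustrates one common choice, margin-based uncertainty sampling,
in Python. The TF-IDF features, the logistic-regression selector, and the batch
size k are illustrative assumptions, not details taken from this work.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def margin_uncertainty_sampling(texts, labels, pool, k=100):
        """Return indices of the k pool documents the model is least sure about.

        texts/labels: the small manually annotated seed set (hypothetical names);
        pool: the large unlabeled collection awaiting annotation.
        """
        vec = TfidfVectorizer(max_features=50_000)
        X, X_pool = vec.fit_transform(texts), vec.transform(pool)
        clf = LogisticRegression(max_iter=1000).fit(X, labels)
        proba = clf.predict_proba(X_pool)       # (n_pool, n_classes)
        top2 = np.sort(proba, axis=1)[:, -2:]   # two highest class probabilities
        margin = top2[:, 1] - top2[:, 0]        # small margin = high uncertainty
        return np.argsort(margin)[:k]           # candidates to route to annotators

In such a loop, the k lowest-margin pool items are sent to human annotators, the
model is retrained on the enlarged labeled set, and the cycle repeats until the
annotation budget is exhausted.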
In the second phase, we developed non-contextual embedding models using Bengali
corpora and optimized their hyperparameters. These embeddings fall into
three categories: individual (GloVe, FastText, Word2Vec), meta (AVG-M, CAT-M,
CAT-SVD-M), and attention-based meta-embeddings (APeAGF, APeCGF).
Top-performing models from intrinsic evaluations are selected for downstream
tasks, while contextual, transformer-based models are used for fine-tuning and
feature extraction.
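For concreteness, the three named meta-embedding constructions can be sketched as
follows. The dimensions, the assumption that all source embeddings are projected
or padded to a common size for averaging, and the final attention-weighted fusion
(a generic stand-in, not the paper's APeAGF/APeCGF formulation) are illustrative.

    import numpy as np

    def avg_m(vectors):                      # AVG-M: dimension-wise average; assumes
        return np.mean(np.stack(vectors), axis=0)  # sources share a common size

    def cat_m(vectors):                      # CAT-M: concatenation of all sources
        return np.concatenate(vectors)

    def cat_svd_m(matrices, d=300):          # CAT-SVD-M: concatenate, then truncate
        X = np.concatenate(matrices, axis=1)     # (vocab_size, sum of source dims)
        U, S, _ = np.linalg.svd(X, full_matrices=False)
        return U[:, :d] * S[:d]                  # d-dimensional vector per word

    def attention_fusion(vectors, w):        # generic attention-weighted fusion only,
        scores = np.array([v @ w for v in vectors])  # not the APeAGF/APeCGF definition
        a = np.exp(scores - scores.max())
        a /= a.sum()                         # softmax over source relevance scores
        return sum(ai * v for ai, v in zip(a, vectors))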
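Likewise, the two downstream uses of contextual transformer models, feature
extraction versus fine-tuning, can be sketched with the Hugging Face API; the
checkpoint name below is an illustrative Bengali BERT model, not necessarily the
one used in this work.

    import torch
    from transformers import AutoTokenizer, AutoModel

    CKPT = "sagorsarker/bangla-bert-base"    # illustrative Bengali checkpoint
    tok = AutoTokenizer.from_pretrained(CKPT)
    enc = AutoModel.from_pretrained(CKPT)

    def extract_features(texts):
        # Feature extraction: frozen encoder, [CLS] hidden state as document vector.
        # Fine-tuning would instead train AutoModelForSequenceClassification end to end.
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            out = enc(**batch)
        return out.last_hidden_state[:, 0]   # shape: (batch, hidden_size)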