Abstract:
The exponential growth of unstructured textual data on the World Wide Web,
particularly due to extensive online engagement and social media activities, poses
a significant challenge in classifying and organizing this data efficiently. Resource-constrained
languages like Bengali face specific hurdles, including a lack of annotated
corpora, out-of-vocabulary word problems, inadequately tuned hyperparameters,
and class distribution imbalances. To address these issues, this research
introduces AFuNeT, an attention-based early fusion of transformer-based language
models. This intelligent text classification system comprises three
crucial phases: corpora development, embedding model development, and text
classification model development, offering a comprehensive solution to enhance
text classification capabilities for such languages.
The first phase employs manual annotation and active learning techniques to
develop two main corpus types: embedding and text classification corpora. Five
specific embedding corpora are created for various tasks like Bengali document
classification, authorship attribution, and Covid text identification. The text
classification corpora are further divided into training and testing sets and
comprise seven corpora: the Bengali Authorship Classification Corpus (BACC-18),
Bengali Text Classification Corpus (BDTCC), Arabic Covid Text Classification
Corpus (AraCoV), Bengali Covid Text Classification Corpus (BCovC), English
Covid Text Classification Corpus (ECoVC), Bengali Text Classification Corpus
(BTCC11), and Bengali Multi-Modal Text Classification Corpus (BMMTC).
These corpora are essential for text classification and model performance evaluation.
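The abstract does not fix a particular query strategy for the active learning step;
the sketch below illustrates one common choice, margin-based uncertainty sampling,
in Python. The TF-IDF features, the logistic-regression selector, and the batch
size k are illustrative assumptions, not details taken from this work.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def margin_uncertainty_sampling(texts, labels, pool, k=100):
        """Return indices of the k pool documents the model is least sure about.

        texts/labels: the small manually annotated seed set (hypothetical names);
        pool: the large unlabeled collection awaiting annotation.
        """
        vec = TfidfVectorizer(max_features=50_000)
        X, X_pool = vec.fit_transform(texts), vec.transform(pool)
        clf = LogisticRegression(max_iter=1000).fit(X, labels)
        proba = clf.predict_proba(X_pool)       # (n_pool, n_classes)
        top2 = np.sort(proba, axis=1)[:, -2:]   # two highest class probabilities
        margin = top2[:, 1] - top2[:, 0]        # small margin = high uncertainty
        return np.argsort(margin)[:k]           # candidates to route to annotators

In such a loop, the k lowest-margin pool items are sent to human annotators, the
model is retrained on the enlarged labeled set, and the cycle repeats until the
annotation budget is exhausted.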
In the second phase, we developed non-contextual embedding models using Bengali
corpora and optimized their hyperparameters. These embeddings fall into
three categories: individual (GloVe, FastText, Word2Vec), meta (AVG-M, CAT-M,
CAT-SVD-M), and attention-based meta-embeddings (APeAGF, APeCGF).
Top-performing models from intrinsic evaluations are selected for downstream
tasks, while contextual, transformer-based models are used for fine-tuning and
feature extraction.
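For concreteness, the three named meta-embedding constructions can be sketched as
follows. The dimensions, the assumption that all source embeddings are projected
or padded to a common size for averaging, and the final attention-weighted fusion
(a generic stand-in, not the paper's APeAGF/APeCGF formulation) are illustrative.

    import numpy as np

    def avg_m(vectors):                      # AVG-M: dimension-wise average; assumes
        return np.mean(np.stack(vectors), axis=0)  # sources share a common size

    def cat_m(vectors):                      # CAT-M: concatenation of all sources
        return np.concatenate(vectors)

    def cat_svd_m(matrices, d=300):          # CAT-SVD-M: concatenate, then truncate
        X = np.concatenate(matrices, axis=1)     # (vocab_size, sum of source dims)
        U, S, _ = np.linalg.svd(X, full_matrices=False)
        return U[:, :d] * S[:d]                  # d-dimensional vector per word

    def attention_fusion(vectors, w):        # generic attention-weighted fusion only,
        scores = np.array([v @ w for v in vectors])  # not the APeAGF/APeCGF definition
        a = np.exp(scores - scores.max())
        a /= a.sum()                         # softmax over source relevance scores
        return sum(ai * v for ai, v in zip(a, vectors))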
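Likewise, the two downstream uses of contextual transformer models, feature
extraction versus fine-tuning, can be sketched with the Hugging Face API; the
checkpoint name below is an illustrative Bengali BERT model, not necessarily the
one used in this work.

    import torch
    from transformers import AutoTokenizer, AutoModel

    CKPT = "sagorsarker/bangla-bert-base"    # illustrative Bengali checkpoint
    tok = AutoTokenizer.from_pretrained(CKPT)
    enc = AutoModel.from_pretrained(CKPT)

    def extract_features(texts):
        # Feature extraction: frozen encoder, [CLS] hidden state as document vector.
        # Fine-tuning would instead train AutoModelForSequenceClassification end to end.
        batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            out = enc(**batch)
        return out.last_hidden_state[:, 0]   # shape: (batch, hidden_size)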