Abstract:
Spam is an ever-growing problem that threatens the current
e-mail system. Spam is unsolicited bulk e-mail, frequently
financial in nature, and it creates the need for an anti-spam
filter.
Among the many spam filtering techniques, Naïve Bayesian
filtering, one of the most advanced, has been implemented using a
Support Vector Machine (SVM). Spammers pay close attention to
filtering techniques and adapt to them. For that reason, dynamic
filtering is needed, and the proposed method meets this
demand. The algorithm splits the received email into tokens and
uses Bayes' theorem to calculate the spam probability of each
token, from which the total spam probability of the mail is
determined. Using an SVM instead of corpora is one of the added
features of the algorithm. The most challenging part
was to feed both individual words and whole sentences into the
SVM, as tokens and feature vectors. Including sentences in the
training dataset increased the accuracy of detecting spam and
ham. The Natural Language Toolkit (NLTK) was used to tokenize
the sentences and to capture the meaning of similar sentences to
some extent. Because a test mail is compared word by word and
sentence by sentence against the training datasets to determine
whether it is spam, the performance of the filter improves. With
some simple modifications, the filter can be used at both the
server and the client end. Its efficiency increases gradually
with the number of emails it processes.
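
The token-wise step described above (split the mail into tokens,
apply Bayes' theorem per token, then combine the results into one
spam probability) can be sketched as follows. This is a minimal
illustration under stated assumptions, not the implementation
described in this paper: the regex tokenizer is a stand-in for
NLTK, the add-one smoothing and the 0.5 prior are assumed values,
the combination rule is the usual naive-independence product, and
the function names are hypothetical. The sentence-level features
and the SVM stage are not shown.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a simple regex stands in for an NLTK tokenizer.
    """
    return re.findall(r"[a-z0-9$']+", text.lower())


def train(spam_mails, ham_mails):
    """Count, per token, how many spam and how many ham mails contain it."""
    spam_counts = Counter(t for m in spam_mails for t in set(tokenize(m)))
    ham_counts = Counter(t for m in ham_mails for t in set(tokenize(m)))
    return spam_counts, ham_counts, len(spam_mails), len(ham_mails)


def token_spam_probability(token, model, prior_spam=0.5):
    """P(spam | token) by Bayes' theorem, with add-one smoothing (assumed)."""
    spam_counts, ham_counts, n_spam, n_ham = model
    p_token_given_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_token_given_ham = (ham_counts[token] + 1) / (n_ham + 2)
    numerator = p_token_given_spam * prior_spam
    return numerator / (numerator + p_token_given_ham * (1 - prior_spam))


def mail_spam_probability(text, model):
    """Combine per-token probabilities, assuming tokens are independent.

    Computes p1*...*pn / (p1*...*pn + (1-p1)*...*(1-pn)) in log space
    to avoid numeric underflow on long mails.
    """
    log_spam = 0.0
    log_ham = 0.0
    for token in set(tokenize(text)):
        p = token_spam_probability(token, model)
        log_spam += math.log(p)
        log_ham += math.log(1 - p)
    # Clamp the exponent so math.exp cannot overflow.
    diff = max(min(log_ham - log_spam, 700.0), -700.0)
    return 1 / (1 + math.exp(diff))
```

With a toy model such as train(["Win cash now"], ["Lunch on
Monday?"]), a mail containing "win cash" scores above 0.5 and a
mail containing "monday" scores below it. A mail with no tokens
scores exactly 0.5, since it carries no evidence either way.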