Methodology to Classify Unsolicited Email Threats


Publication date: 2024




Email, a fundamental form of communication, faces increasing threats from unsolicited messages. Differentiating these types of threats is essential for choosing appropriate mitigation measures and deploying effective security controls. This research examines the categories of unsolicited email, the threats they carry, and the role of language in classifying them. The study built a robust email processing pipeline to assemble a dataset of 10.8 million unsolicited emails (spam) covering a period of four and a half years, and developed a methodology for categorizing those emails into spam, scam, phishing, and adult content.

The study uses machine learning models, including a Long Short-Term Memory (LSTM) neural network, together with the term frequency-inverse document frequency (TF-IDF) statistical measure; combined, they excel at classifying unsolicited emails across various languages. The approach extends beyond English, achieving high classification accuracy across 80+ languages and demonstrating the adaptability of the models.
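
To make the TF-IDF measure concrete, the following is a minimal sketch of how term weights can be computed for a handful of messages. It is not the study's pipeline: the whitespace tokenizer, the unsmoothed formula tf * log(N / df), the function names `tokenize` and `tfidf`, and the three sample messages are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace split. A multilingual
    # pipeline would need language-aware tokenization.
    return text.lower().split()


def tfidf(documents):
    """Return one {term: weight} dict per document.

    Uses the textbook variant tf * log(N / df), where tf is the term's
    relative frequency in the document, N is the number of documents,
    and df is the number of documents containing the term.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical sample messages, not taken from the study's dataset.
emails = [
    "claim your prize now",
    "verify your account now",
    "your invoice is attached",
]
weights = tfidf(emails)
```

In this toy corpus a word that appears in every message, such as "your", receives a weight of zero, while a word unique to one message, such as "prize", receives the highest weight in that message. That property is what makes TF-IDF useful as input to a classifier: it emphasizes terms that distinguish one message from the rest.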