Sarcasm Detection on Reddit

2024

Comparative NLP study on 1.3M Reddit comments using Naive Bayes, TF-IDF + Logistic Regression, and DistilBERT fine-tuning — achieving 76.65% accuracy with transformer models.

AI/ML

Tech Stack

DistilBERTTF-IDFLogistic RegressionNaive BayesNLPPython

View this Project on GitHub

Key Highlights

Conducted a comprehensive comparative study of sarcasm detection on 1.3 million Reddit comments, implementing three models: Naive Bayes (66.55%), TF-IDF + Logistic Regression (70.80%), and DistilBERT Transformer (76.65% accuracy).
Performed extensive preprocessing including tokenization, stopword removal, and lemmatization; engineered n-gram features (unigrams and bigrams) for the statistical baseline models.
Fine-tuned DistilBERT using HuggingFace Transformers with AdamW optimizer, linear warmup scheduler, and gradient clipping — demonstrating how bidirectional context outperforms statistical baselines.
Both TF-IDF and DistilBERT achieved 100% accuracy on live Reddit comment prediction tests; provided evidence-based recommendations balancing performance, interpretability, and compute cost.
Visualized token-level attention weights from DistilBERT to interpret which contextual cues (punctuation, intensifiers, semantic incongruence) contributed most to sarcasm classification.
Structured the project as a reproducible research pipeline with modular preprocessing, training, and evaluation scripts, detailed in a final comparative analysis report.

Depression Detection from Visual Content All Projects Identity Management System (IMS)