NLP / Sentiment Analysis · Completed

Sarcasm Detection on Reddit

2024

Comparative NLP study on 1.3M Reddit comments using Naive Bayes, TF-IDF + Logistic Regression, and fine-tuned DistilBERT. The transformer model performed best, reaching 76.65% accuracy.

AI/ML

Tech Stack

DistilBERT, TF-IDF, Logistic Regression, Naive Bayes, NLP, Python

Key Highlights

  • Conducted a comparative study of sarcasm detection on 1.3 million Reddit comments, implementing three models and measuring their accuracy: Naive Bayes (66.55%), TF-IDF + Logistic Regression (70.80%), and DistilBERT (76.65%).
  • Performed extensive preprocessing including tokenization, stopword removal, and lemmatization; engineered n-gram features (unigrams and bigrams) for the statistical baseline models.
  • Fine-tuned DistilBERT using HuggingFace Transformers with the AdamW optimizer, a linear warmup scheduler, and gradient clipping; the model's bidirectional contextual representations outperformed both statistical baselines.
  • Both the TF-IDF + Logistic Regression model and DistilBERT achieved 100% accuracy on live Reddit comment prediction tests; provided evidence-based recommendations balancing performance, interpretability, and compute cost.
  • Visualized token-level attention weights from DistilBERT to interpret which contextual cues (punctuation, intensifiers, semantic incongruence) contributed most to sarcasm classification.
  • Structured the project as a reproducible research pipeline with modular preprocessing, training, and evaluation scripts, detailed in a final comparative analysis report.
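
To make the statistical baseline more concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier over unigram and bigram features, written in plain Python. It is not the project's actual code: the whitespace tokenizer, the helper `ngram_features`, the `NaiveBayes` class, and the six toy training comments are all assumptions made for illustration. Stopword removal and lemmatization are omitted for brevity.

```python
import math
from collections import Counter, defaultdict


def ngram_features(text):
    """Lowercase, split on whitespace, and return unigrams followed by bigrams."""
    tokens = text.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.feature_counts = defaultdict(Counter)   # n-gram counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = ngram_features(text)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def log_scores(self, text):
        """Return the unnormalized log-probability of each class for a text."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n_docs / total_docs)    # class prior
            for feat in ngram_features(text):
                if feat in self.vocab:               # ignore n-grams never seen in training
                    score += math.log((counts[feat] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy, made-up training data (1 = sarcastic, 0 = not sarcastic).
train_texts = [
    "oh great another monday",
    "yeah right that will totally work",
    "wow what a brilliant idea",
    "the meeting is on monday",
    "this library works well",
    "thanks for the helpful answer",
]
train_labels = [1, 1, 1, 0, 0, 0]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("oh great another brilliant idea"))
print(model.predict("the meeting is on monday"))
```

Bigrams such as "oh great" or "yeah right" let a bag-of-words model capture short sarcastic phrases that single tokens miss. That is one plausible reason to pair unigrams with bigrams in baselines of this kind.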