YouTube Comment Intelligence

Machine Learning
NLP
MLOps
An end-to-end MLOps system that delivers real-time sentiment analysis of YouTube comments through a Chrome extension, backed by a FastAPI service deployed on AWS.
Author

Sushrut

Published

August 1, 2025

1 TL;DR

YouTube Comment Intelligence is a production-grade machine learning system that gives content creators actionable insight into the conversations happening under their videos. Open the Chrome extension on any YouTube video and it pulls the comments through the YouTube Data API, sends them to a FastAPI service running behind an Application Load Balancer on AWS, and returns a full analytical view: sentiment distribution, a word cloud, a monthly sentiment trend, the top-25 comments with predicted labels, and headline metrics like the number of unique commenters and an average sentiment score.

Underneath the product is a complete MLOps loop: data versioned with DVC and stored on S3, experiments tracked on MLflow (hosted on DagsHub), models promoted through a staging → production workflow in the MLflow Model Registry, the whole training pipeline reproducible via a single dvc repro, and a GitHub Actions workflow that on every push retrains, tests, builds a Docker image, pushes to ECR, and triggers a CodeDeploy rolling deployment onto an EC2 Auto Scaling Group fronted by a load balancer.

The sentiment model itself is a LightGBM classifier trained on Reddit comments (Twitter data was rejected during EDA for being too politically skewed) with bag-of-words bigrams and class_weight="balanced" to handle class imbalance, a choice that emerged from five structured experiments covering feature engineering, max_features tuning, resampling strategies, model comparison, and extensive hyperparameter search with Optuna.

This post walks through the system top-down, architecture first, then each layer in detail.


2 System Architecture

2.1 High-Level View

At the highest level, the system has three tiers: a client (the Chrome extension running in the user’s browser), a serving layer (a FastAPI application running in a Docker container on EC2, fronted by an Application Load Balancer), and an MLOps layer (DVC pipeline, MLflow Model Registry, S3 artifact storage, GitHub Actions CI/CD).

%%{init: {'theme':'dark', 'themeVariables': { 'primaryColor': '#375a7f', 'primaryTextColor': '#fff', 'primaryBorderColor': '#6c757d', 'lineColor': '#adb5bd', 'secondaryColor': '#444', 'tertiaryColor': '#222'}}}%%
flowchart TD
 subgraph Client["Client"]
 EXT[Chrome Extension<br/>popup.html / popup.js]
 end

 subgraph External["External API"]
 YT[YouTube Data API v3]
 end

 subgraph AWS["AWS Serving Layer"]
 ALB[Application<br/>Load Balancer]
 ASG[Auto Scaling Group<br/>EC2 × 2-3]
 DOCK[Docker Container<br/>FastAPI + LightGBM]
 ALB --> ASG
 ASG --> DOCK
 end

 subgraph MLOps["MLOps Layer"]
 MLF[MLflow Registry<br/>on DagsHub]
 S3[(S3:<br/>raw data + DVC remote)]
 ECR[(ECR:<br/>Docker image)]
 end

 EXT -- "fetch comments" --> YT
 YT -- "comments + metadata" --> EXT
 EXT -- "POST /predict<br/>/generate_chart<br/>/generate_wordcloud<br/>/generate_trend_graph" --> ALB
 DOCK -- "load model<br/>at startup" --> MLF
 DOCK -- "pull image" --> ECR
 MLF -.-> S3
Figure 1: High-level system architecture.

The client never talks to the model directly. It always goes through the load balancer, which spreads traffic across the healthy EC2 instances in the Auto Scaling Group. Each instance runs the same Docker image (pulled from ECR at deployment time), which in turn loads the current “Production” model from the MLflow Registry during FastAPI’s startup event. This means rolling out a new model is just a matter of promoting a new version in the registry and restarting the containers. No code changes required.

2.2 Request Flow: What Happens When You Click the Extension

The backend exposes five endpoints to the extension. The most interesting is the main flow for the “full analysis” view:

%%{init: {'theme':'dark', 'themeVariables': { 'primaryColor': '#375a7f', 'primaryTextColor': '#fff', 'primaryBorderColor': '#6c757d', 'lineColor': '#adb5bd', 'actorBkg':'#375a7f', 'actorTextColor':'#fff', 'signalColor':'#adb5bd', 'signalTextColor':'#fff'}}}%%
sequenceDiagram
 autonumber
 participant U as User
 participant X as Extension
 participant Y as YouTube Data API
 participant L as ALB
 participant F as FastAPI (EC2)
 participant M as MLflow Registry

 U->>X: Clicks extension icon on a YouTube video
 X->>X: Validate URL, extract videoId
 X->>Y: GET commentThreads?videoId=...
 Y-->>X: comments + authors + timestamps
 X->>L: POST /predict_with_timestamps
 L->>F: Forward to healthy instance
 F->>F: Preprocess → vectorize → predict
 F-->>X: [{comment, sentiment, timestamp}, …]
 par Parallel enrichment calls
 X->>L: POST /generate_chart (sentiment counts)
 L->>F: Forward
 F-->>X: pie chart PNG
 and
 X->>L: POST /generate_wordcloud (comments)
 L->>F: Forward
 F-->>X: word cloud PNG
 and
 X->>L: POST /generate_trend_graph (timestamped labels)
 L->>F: Forward
 F-->>X: monthly trend PNG
 end
 X->>U: Render dashboard in popup
 Note over F,M: Model & vectorizer<br/>loaded once at startup
Figure 2: End-to-end request flow when a user opens the extension on a YouTube video.

The parallelism in steps 9-17 matters for latency: the extension issues the three “generation” requests concurrently rather than serially, so the wall-clock time the user waits is roughly the sentiment inference plus the slowest of the three chart calls, not the sum of all four.
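The latency argument can be illustrated with a small simulation. This is a sketch only: the real client is popup.js, the three latencies are made-up numbers, and `fake_call` is a hypothetical stand-in for an HTTP request.

```python
import asyncio
import time

# Hypothetical latencies (seconds) standing in for the three chart endpoints.
CHART_LATENCIES = {
    "/generate_chart": 0.05,
    "/generate_wordcloud": 0.10,
    "/generate_trend_graph": 0.20,
}


async def fake_call(endpoint: str, latency: float) -> str:
    """Pretend to call an endpoint; only the waiting time is simulated."""
    await asyncio.sleep(latency)
    return endpoint


async def run_concurrently() -> float:
    """Issue all three calls at once and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_call(e, t) for e, t in CHART_LATENCIES.items()))
    return time.perf_counter() - start


async def run_serially() -> float:
    """Issue the calls one after another and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    for endpoint, latency in CHART_LATENCIES.items():
        await fake_call(endpoint, latency)
    return time.perf_counter() - start


concurrent_s = asyncio.run(run_concurrently())  # close to the slowest call
serial_s = asyncio.run(run_serially())          # close to the sum of all calls
```

With these numbers the concurrent run takes about as long as the slowest call, while the serial run takes about as long as all three added together.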

2.3 The MLOps Loop

The training-and-deployment side is a closed loop. A push to main can, and usually does, propagate all the way to a new model in production:

%%{init: {'theme':'dark', 'themeVariables': { 'primaryColor': '#375a7f', 'primaryTextColor': '#fff', 'primaryBorderColor': '#6c757d', 'lineColor': '#adb5bd'}}}%%
flowchart TB
 DEV[git push] --> GHA[GitHub Actions<br/>ci-cd.yaml]

 subgraph CI["Continuous Integration"]
 GHA --> DVC[dvc repro<br/>6-stage pipeline]
 DVC --> REG[Register model → Staging]
 REG --> T1[Model loading test]
 T1 --> T2[Model signature test]
 T2 --> T3[Performance test<br/>≥ 0.75 thresholds]
 T3 --> PROMO[Promote → Production<br/>archive old]
 PROMO --> API[FastAPI endpoint tests]
 end

 subgraph CD["Continuous Delivery"]
 API --> BUILD[Build Docker image]
 BUILD --> PUSH[Push to ECR]
 PUSH --> ZIP[Zip appspec.yml +<br/>deploy scripts → S3]
 ZIP --> CD_TRIG[Trigger CodeDeploy]
 CD_TRIG --> ROLL[Rolling deploy<br/>OneAtATime across ASG]
 end

 ROLL --> PROD[Production]

 style PROMO fill:#00bc8c,stroke:#fff,color:#fff
 style PROD fill:#00bc8c,stroke:#fff,color:#fff
 style T3 fill:#e74c3c,stroke:#fff,color:#fff
Figure 3: MLOps loop: from git push to a rolling deployment.

The two gates that matter most are the performance test (a model that falls below 0.75 on accuracy / precision / recall / F1 on a held-out slice never makes it out of Staging) and the rolling deployment (CodeDeploy’s OneAtATime config, combined with the ALB health checks, means a broken build can’t take down the whole fleet).


3 Problem Framing

3.1 Business Context

Imagine an influencer management company trying to expand its roster but working with a limited marketing budget. Paid acquisition is out. The alternative is to give influencers something genuinely useful: a tool that solves a real pain point so well that they come to you.

The pain point is obvious once you look at any sufficiently popular channel: comment sections with tens of thousands of entries that no human can read, let alone analyze. Creators want to know things like “what is the overall sentiment on this video?”, “what are people actually talking about?”, and “is the conversation getting more or less positive over time?”, and they currently have no good way to answer any of them.

3.2 Product Requirements

The extension was scoped around four capabilities:

  1. Real-time sentiment classification of every comment as positive, neutral, or negative, with visual distribution (pie chart) and drill-downs into individual comments.
  2. Word cloud over all comments so themes jump out at a glance.
  3. Sentiment trend over time (monthly line chart) so creators can see how audience mood evolved.
  4. Headline metrics: total comments, unique commenters, average words per comment, and an average sentiment score on a −10 to +10 scale.

3.3 Technical Challenges

Before writing a line of code it was worth cataloguing what was going to be hard:

  • No labelled YouTube data. The YouTube Data API returns comments but no sentiment labels. The nearest available labelled corpus was a Kaggle dataset containing both Reddit and Twitter comments. During EDA the Twitter subset turned out to be heavily political, so the model was trained on the Reddit subset only, which generalized better.
  • No universally representative corpus. YouTube content is extraordinarily heterogeneous (cooking, politics, gaming, music, and so on). One model cannot be equally strong across all of them. This is accepted as a known limitation.
  • Multilingual and code-switched comments, including English-scripted Hindi and similar transliterations. Out of scope for v1.
  • Spam, bots, sarcasm, slang, emoji, evolving Gen-Z vocabulary: all real sources of noise that bias or mislabel training signal.
  • Latency. The extension must feel responsive. Hundreds of comments need to be classified and visualized in a few seconds at most.
  • Privacy / compliance. The source of training data matters if the system is ever commercialized; using licensed Kaggle data is a deliberate choice.

4 The Sentiment Model

This is the heart of the system. The goal of this section isn’t to enumerate every experiment; it’s to show the reasoning chain that led from a 65%-accuracy baseline to the LightGBM model in production.

4.1 Data

  • Source: Kaggle “Twitter and Reddit Sentimental Analysis Dataset”, only the Reddit subset.
  • Size: 37,249 comments in the raw Reddit subset, before the cleaning steps described in Section 4.2.1.
  • Labels: -1 (negative), 0 (neutral), +1 (positive).
  • Class distribution: 43% positive, 35% neutral, 22% negative. This is a moderate imbalance that, as we’ll see, drives most of the early model’s failure modes.

4.2 Exploratory Data Analysis

EDA and preprocessing were done in tight cycles: clean a bit, look at distributions, adjust the cleaning, look again. The full EDA notebook lives here.

4.2.1 Cleaning Steps That Actually Fired

  • Explicit NaNs: 100 out of 37,249 comments, dropped (all were labelled neutral).
  • Whitespace-only comments: 6, dropped.
  • Duplicates: 350, dropped.
  • Leading/trailing whitespace: 32,266 comments, stripped.
  • URLs: none found by regex match; no-op.
  • Newlines and tabs: replaced with a single space.
  • Non-English special characters: removed via regex.
  • Punctuation: already absent from the source.
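The cleaning steps above can be sketched in a few lines of standard-library Python. This is an illustration, not the notebook's code: the character pattern is an assumption, whitespace runs are collapsed in general rather than only newlines and tabs, and duplicates are detected after normalization.

```python
import re
from typing import Iterable, List, Optional

# Assumed pattern: keep ASCII letters, digits, whitespace and basic punctuation.
NON_ENGLISH = re.compile(r"[^A-Za-z0-9\s!?.,]")


def clean_comments(comments: Iterable[Optional[str]]) -> List[str]:
    """Drop missing, empty and duplicate comments; normalize the rest."""
    cleaned: List[str] = []
    seen = set()
    for raw in comments:
        if raw is None:                      # explicit NaN / missing value
            continue
        text = NON_ENGLISH.sub("", raw)      # strip non-English special characters
        text = re.sub(r"\s+", " ", text)     # newlines, tabs, runs -> single space
        text = text.strip()                  # leading/trailing whitespace
        if not text:                         # whitespace-only comment
            continue
        if text in seen:                     # duplicate of an earlier comment
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

For example, `clean_comments([None, "   ", " great\nvideo ", "great video"])` keeps a single `"great video"` entry.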

4.2.2 Class Imbalance

Figure 4: Class imbalance.

4.2.3 Length-Based Features

Several columns were engineered to probe whether comment length carried any sentiment signal.

Figure 5: Distribution of word count.

The word_count distribution is heavily right-skewed: most comments are short, with a long thin tail of unusually long ones.

Figure 6: Distribution of word count by sentiment.

Broken down by sentiment, neutral comments are noticeably shorter and more concentrated around small word counts. Positive and negative comments have wider spreads and behave similarly to each other.

Figure 7: Box plot of word count by sentiment.

The box plot confirms that neutral comments have the lowest median and tightest IQR; positive comments sit slightly above negative ones in median but both have long upper tails.

Figure 8: Bar plot of median word count by sentiment.

4.2.4 Stop Word Analysis

A num_of_stop_words feature was built using NLTK. Its distributions mirrored word_count almost exactly (right skew, neutral concentrated at the low end), which is unsurprising.

Figure 9: Distribution of stop words.
Figure 10: Distribution of stop words by sentiment.
Figure 11: Bar plot of median stop words by sentiment.

The interesting finding was in the top-25 stop words:

Figure 12: Top-25 most common stop words.

The word not ranks extremely high. not, along with but, however, no, and yet, can invert the sentiment of an entire sentence. The stop word removal step was therefore configured to preserve these five words rather than using NLTK’s default list wholesale. This is a small but important decision. Stripping “not” would turn “this is not good” into “this is good” from the model’s point of view.
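A minimal sketch of the idea, assuming a tiny hand-written stop-word list in place of NLTK's English list (the names `BASE_STOP_WORDS` and `KEEP` are illustrative):

```python
# Small illustrative stop-word sample; the pipeline itself starts from NLTK's list.
BASE_STOP_WORDS = {"this", "is", "a", "the", "it", "was", "and", "not", "but", "no"}

# Sentiment-inverting words that must survive stop-word removal.
KEEP = {"not", "but", "however", "no", "yet"}

# Subtracting KEEP is harmless for words the base list never contained.
STOP_WORDS = BASE_STOP_WORDS - KEEP


def remove_stop_words(text: str, stop_words=STOP_WORDS) -> str:
    """Lowercase, split on whitespace, and drop tokens found in stop_words."""
    return " ".join(w for w in text.lower().split() if w not in stop_words)


kept = remove_stop_words("This is not good")                    # negation preserved
naive = remove_stop_words("This is not good", BASE_STOP_WORDS)  # negation lost
```

Here `kept` is `"not good"` while `naive` collapses to `"good"`, which is exactly the inversion the custom list avoids.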

4.2.5 Character Counts

A num_chars feature was created. During this step the data revealed the presence of many non-English special characters, which were stripped via regex.

4.2.6 Bigrams and Trigrams

Figure 13: Most common bigrams.
Figure 14: Most common trigrams.

4.2.7 Lemmatization

All tokens were reduced to their root form with NLTK’s WordNetLemmatizer.

4.2.8 Word Clouds

Figure 15: Word cloud for the entire data.
Figure 16: Word cloud for positive comments.
Figure 17: Word cloud for neutral comments.
Figure 18: Word cloud for negative comments.

A somewhat uncomfortable observation: the dominant words are broadly similar across sentiment classes. That signals early that a bag-of-words model will need to rely on combinations (bigrams, contextual co-occurrence) rather than individual tokens to separate the classes.

4.2.9 Most Frequent Words

Figure 19: Top-20 most frequent words in the data.
Figure 20: Top-20 most frequent words in the data by sentiment.

Same caveat as the word clouds: the lists look similar across sentiments, and the data leans political. Expect the model to transfer better to political content than, say, cooking tutorials.

4.3 Baseline: Random Forest + Bag of Words

The baseline was deliberately simple so subsequent experiments would have a clear reference point.

  • Vectorizer: CountVectorizer (unigrams).
  • Model: RandomForestClassifier with default-ish hyperparameters.
  • Experiment tracking: MLflow server hosted on DagsHub, logging metrics, the confusion matrix heatmap, the model, and the training data.

Best test accuracy: ~65%.

Table 1: Baseline random forest classification report.
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| -1 (negative) | 1.00 | 0.02 | 0.03 | 1650 |
| 0 (neutral) | 0.66 | 0.85 | 0.74 | 2555 |
| +1 (positive) | 0.64 | 0.82 | 0.72 | 3154 |
| Accuracy | | | 0.65 | 7359 |
| Macro avg | 0.77 | 0.56 | 0.49 | 7359 |
| Weighted avg | 0.73 | 0.65 | 0.57 | 7359 |
Figure 21: Confusion matrix of the baseline model (0 = negative, 1 = neutral, 2 = positive).

The failure is diagnostic. The model classified just 28 of the 1,650 actual negatives correctly, a recall of ~0.02 on the negative class:

\[ \text{Recall (negative)} = \frac{28}{28 + 571 + 1051} = \frac{28}{1650} \approx 0.02 \]

But of the 28 it did flag as negative, every single one actually was negative, for a precision of 1.0:

\[ \text{Precision (negative)} = \frac{28}{28 + 0 + 0} = 1.0 \]

So the model isn’t “bad at negatives” per se; it’s refusing to predict negative almost entirely, defaulting to the majority classes. That’s the signature of class imbalance, and it dictates the direction of everything that follows. Baseline notebook.
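The same arithmetic in code, using only the counts quoted above (28 true positives, 0 false positives, 571 + 1,051 false negatives for the negative class). The helper name is illustrative.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Per-class precision, recall and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Negative class of the baseline: 28 correct, none wrongly flagged,
# 571 + 1051 actual negatives predicted as another class.
precision, recall, f1 = precision_recall_f1(tp=28, fp=0, fn=571 + 1051)
```

Rounded to two decimals this gives 1.00, 0.02 and 0.03, matching the negative row of Table 1. A perfect precision paired with near-zero recall is what a classifier looks like when it almost never emits the class.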

4.4 Experiment Plan

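Five structured experiments, each isolating one variable, followed by a round of targeted LightGBM improvements (Exp 6 in the diagram, covered in Section 4.5):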

%%{init: {'theme':'dark', 'themeVariables': { 'primaryColor': '#375a7f', 'primaryTextColor': '#fff', 'primaryBorderColor': '#6c757d', 'lineColor': '#adb5bd'}}}%%
flowchart TD
 E1[Exp 1:<br/>BoW vs TF-IDF<br/>× n-gram range] --> E2[Exp 2:<br/>max_features<br/>sweep]
 E2 --> E3[Exp 3:<br/>Resampling<br/>strategy]
 E3 --> E4[Exp 4:<br/>Model family<br/>comparison]
 E4 --> E5[Exp 5:<br/>Fine HPT on<br/>top 4 models]
 E5 --> E6[Exp 6:<br/>Targeted LightGBM<br/>improvements]
Figure 22: Experiment plan: each stage locks in the winner before moving on.

4.4.1 Experiment 1: Feature Engineering

max_features fixed at 5,000. Swept vectorizer ∈ {BoW, TF-IDF} × n-gram ∈ {uni, bi, tri}.

Figure 23: Parallel coordinates plot for experiment 1.

Winner: Bag-of-Words with bigrams. Notebook.
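The sweep is a plain Cartesian product of two choices. A sketch of how the six configurations could be enumerated follows; the n-gram ranges are an assumption (uni = (1, 1), bi = (1, 2), tri = (1, 3)), chosen to agree with the ngram_range=(1,2) used later in the pipeline, and the dictionary keys are illustrative.

```python
from itertools import product
from typing import Dict, List

VECTORIZERS = ("bow", "tfidf")
# Assumed mapping from the experiment's labels to n-gram ranges.
NGRAM_RANGES = {"uni": (1, 1), "bi": (1, 2), "tri": (1, 3)}
MAX_FEATURES = 5000  # held fixed throughout experiment 1


def experiment_1_grid() -> List[Dict]:
    """Enumerate every vectorizer x n-gram combination to be run and logged."""
    return [
        {
            "vectorizer": vectorizer,
            "ngram": name,
            "ngram_range": ngram_range,
            "max_features": MAX_FEATURES,
        }
        for vectorizer, (name, ngram_range) in product(VECTORIZERS, NGRAM_RANGES.items())
    ]


grid = experiment_1_grid()
```

Each entry would correspond to one tracked run, which is what makes a parallel coordinates comparison across the six possible.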

4.4.2 Experiment 2: max_features

Swept max_features from 1,000 to 10,000 in steps of 1,000 using BoW + bigrams.

Figure 24: Parallel coordinates plot for experiment 2.

Winner: max_features = 1000. Lower values gave both higher overall accuracy and dramatically better recall on the negative class. This is a classic bias-variance win: a narrower vocabulary means less overfitting on rare feature combinations. Notebook.
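To make the effect of max_features concrete, here is a toy re-implementation of the idea: count unigrams and bigrams across the corpus and keep only the most frequent terms. It is a sketch of the concept, not scikit-learn's CountVectorizer, whose tokenization and tie-breaking differ.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n_min: int = 1, n_max: int = 2) -> List[str]:
    """All contiguous n-grams of length n_min..n_max, joined with spaces."""
    out: List[str] = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def build_vocabulary(docs: List[str], max_features: int) -> List[str]:
    """Keep the max_features most frequent unigrams/bigrams in the corpus."""
    counts: Counter = Counter()
    for doc in docs:
        counts.update(ngrams(doc.split()))
    return [term for term, _ in counts.most_common(max_features)]


docs = ["not good", "not good at all", "good video"]
top3 = build_vocabulary(docs, max_features=3)
full = build_vocabulary(docs, max_features=100)
```

On this toy corpus the cap of 3 keeps `good`, `not` and `not good` and discards the six terms that occur only once, which is the same pruning of rare combinations that helped the real model.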

4.4.3 Experiment 3: Handling Class Imbalance

Candidates: undersampling, ADASYN, SMOTE, SMOTE+ENN, class weights in random forest.

Figure 25: Experiment 3, negative comments.
Figure 26: Experiment 3, neutral comments.
Figure 27: Experiment 3, positive comments.

Winner: random undersampling, the most balanced performance across all three classes, not necessarily the highest on any single one. Notebook.
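Random undersampling itself is simple: every class is cut down to the size of the smallest one. The sketch below uses only the standard library and a hypothetical helper name; in practice it would be applied to the training split only.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def random_undersample(
    samples: Sequence, labels: Sequence[Hashable], seed: int = 42
) -> Tuple[List, List]:
    """Randomly drop samples so every class has as many as the rarest class."""
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    n_min = min(len(items) for items in by_class.values())
    rng = random.Random(seed)  # seeded for reproducibility

    out_samples: List = []
    out_labels: List = []
    for label, items in by_class.items():
        for sample in rng.sample(items, n_min):  # sampling without replacement
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


labels = [1] * 5 + [0] * 4 + [-1] * 2
samples = [f"comment {i}" for i in range(len(labels))]
X_bal, y_bal = random_undersample(samples, labels)
```

The cost is visible even in this toy case: 11 samples shrink to 6, so the majority classes lose data in exchange for balance.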

4.4.4 Experiment 4: Model Comparison

Seven models (XGBoost, LightGBM, random forest, SVM, logistic regression, KNN, naive Bayes), each with coarse Optuna HPT, each best version logged as a single MLflow run.

Figure 28: Experiment 4, negative comments.
Figure 29: Experiment 4, neutral comments.
Figure 30: Experiment 4, positive comments.
Figure 31: Experiment 4, combined view (2 = negative, 0 = neutral, 1 = positive).

Four-way tie at the top: XGBoost, LightGBM, SVM, and logistic regression clustered tightly together with no clear winner. Notebooks: XGBoost, LightGBM, SVM, LogReg, KNN, Naive Bayes, Random Forest.

4.4.5 Experiment 5: Fine Hyperparameter Tuning

Extensive Optuna search on the four tied models.

Figure 32: Parallel coordinates plot for experiment 5.

Still no clear winner after extensive tuning; accuracy stayed under 80%. LightGBM was chosen for its speed and because tree ensembles tended to be more robust at the decision boundary between neutral and positive comments.

Figure 33: Hyperparameter importances for LightGBM (from Optuna).

The top three hyperparameters by importance: learning_rate, max_depth, min_child_samples. Notebooks: LightGBM, XGBoost, SVM, LogReg.

4.5 Pushing LightGBM Further

Four targeted improvement attempts:

Table 2: Targeted improvement techniques.
| Technique | Notebook | Verdict |
|---|---|---|
| class_weight="balanced" | 18 | Chosen for production, best overall + strong negative recall |
| Word2Vec (300-dim, skip-gram) | 19 | Worst of the four |
| Custom linguistic features | 20 | Decent, but weak on negative recall |
| Stacking (XGB+LGBM+SVM+LR → KNN meta) | 21 | Decent accuracy, unacceptable latency (4 base models) |

Custom features built for technique 3: comment_length, word_count, avg_word_length, unique_word_count, lexical_diversity, pos_count, plus spaCy-derived POS proportions (NaNs filled with 0).
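A sketch of how the length-based features could be computed. The exact definitions are assumptions (for instance, lexical diversity is taken here as unique words divided by total words), and the spaCy-derived features are left out to keep the example dependency-free.

```python
from typing import Dict


def linguistic_features(comment: str) -> Dict[str, float]:
    """Hand-crafted length features; empty comments yield zeros instead of NaN."""
    words = comment.split()
    word_count = len(words)
    unique_word_count = len({w.lower() for w in words})
    return {
        "comment_length": len(comment),  # characters, including spaces
        "word_count": word_count,
        "avg_word_length": sum(len(w) for w in words) / word_count if word_count else 0.0,
        "unique_word_count": unique_word_count,
        "lexical_diversity": unique_word_count / word_count if word_count else 0.0,
    }


features = linguistic_features("good good video")
```

Returning 0 for the ratio features on empty input mirrors the "NaNs filled with 0" handling mentioned above.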

Figure 34: Comparison of improvement techniques.

The decisive factor for class_weight was operational, not just metric-driven: it removes the undersampling step entirely. Less data thrown away means a more efficient training pipeline, and the model handles imbalance intrinsically.

Final model: LightGBM with class_weight="balanced", bag-of-words bigrams, max_features=1000, Optuna-tuned hyperparameters.
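For intuition, the "balanced" setting weights each class by n_samples / (n_classes × class_count), the same heuristic scikit-learn uses. The sketch below applies that formula to the class supports from Table 1 purely as an illustration; the production weights are derived from the training split, whose exact counts are not listed here.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative label vector built from the supports in Table 1.
labels = [-1] * 1650 + [0] * 2555 + [1] * 3154
weights = balanced_class_weights(labels)
```

The negative class ends up with a weight of roughly 1.49 against 0.78 for the positive class, so each misclassified negative costs the model almost twice as much, without discarding any rows.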


5 The DVC Pipeline

Everything in the training side of the system is wired into a DVC pipeline so dvc repro reproduces the whole thing end-to-end from raw data in S3 to a registered model in MLflow.

5.1 Pipeline Shape

%%{init: {'theme':'dark', 'themeVariables': { 'primaryColor': '#375a7f', 'primaryTextColor': '#fff', 'primaryBorderColor': '#6c757d', 'lineColor': '#adb5bd'}}}%%
flowchart TB
 S3[(S3: raw_data.csv)] --> DI[data_ingestion.py]
 DI --> RAW[data/raw/<br/>train.csv, test.csv]
 RAW --> DP[data_preprocessing.py]
 DP --> INT[data/interim/<br/>train.csv, test.csv]
 INT --> FE[feature_engineering.py]
 FE --> PROC[data/processed/<br/>X_train, X_test, y_train, y_test]
 FE --> VEC[models/vectorizer.pkl]
 PROC --> MT[model_training.py]
 MT --> MDL[models/model.pkl]
 PROC --> ME[model_evaluation.py]
 MDL --> ME
 VEC --> ME
 ME --> EXP[experiment_info.json]
 ME --> MLF1[(MLflow:<br/>metrics + artifacts)]
 EXP --> MR[model_registration.py]
 MR --> MLF2[(MLflow Registry:<br/>Staging)]

 style S3 fill:#375a7f,stroke:#fff,color:#fff
 style MLF1 fill:#00bc8c,stroke:#fff,color:#fff
 style MLF2 fill:#00bc8c,stroke:#fff,color:#fff
Figure 35: DVC pipeline stages with data flow.

5.2 Stage-by-Stage

| Stage | Module | What it does |
|---|---|---|
| Ingestion | data_ingestion.py | Pulls raw CSV from S3; drops NaNs, duplicates, and empty comments; does a stratified train/test split; writes data/raw/. |
| Preprocessing | data_preprocessing.py | Lowercases, removes URLs, removes stop words (keeping not/but/however/no/yet), lemmatizes; writes data/interim/. |
| Feature engineering | feature_engineering.py | Fits CountVectorizer(ngram_range=(1,2), max_features=1000) on train, transforms both splits, persists the vectorizer to models/vectorizer.pkl. |
| Training | model_training.py | Fits LGBMClassifier with the tuned hyperparameters; persists models/model.pkl. |
| Evaluation | model_evaluation.py | Scores on test; logs params, metrics, confusion matrix, model with signature, and vectorizer to MLflow; writes experiment_info.json with run ID and model path. |
| Registration | model_registration.py | Reads experiment_info.json, registers the model, transitions to Staging. |

Two supporting files hold it all together: dvc.yaml defines stage dependencies and outputs, and params.yaml holds every tunable (train/test split ratio, CountVectorizer settings, LGBM hyperparameters) so they can be changed without touching code.
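The way declared dependencies determine execution order can be illustrated with Python's standard graphlib module. This is only a sketch of the idea, not how DVC is implemented; the stage names follow the module names above and the dependency sets are read off Figure 35.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Each stage maps to the set of stages whose outputs it consumes.
STAGE_DEPS: Dict[str, Set[str]] = {
    "data_ingestion": set(),
    "data_preprocessing": {"data_ingestion"},
    "feature_engineering": {"data_preprocessing"},
    "model_training": {"feature_engineering"},
    "model_evaluation": {"model_training", "feature_engineering"},
    "model_registration": {"model_evaluation"},
}


def execution_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every stage runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(STAGE_DEPS)
```

Because each stage here depends on the one before it, only one valid order exists. A pipeline tool additionally skips stages whose inputs have not changed, which this sketch does not model.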


6 Backend: FastAPI

6.1 Responsibilities

The FastAPI app (backend/app.py) is deliberately thin. Its job is to:

  1. Load the Production model from MLflow and the vectorizer from disk once, at startup.
  2. Accept comments from the extension.
  3. Preprocess → vectorize → predict.
  4. Generate the three server-side visualizations (pie chart, word cloud, monthly trend).
  5. Return JSON or PNG responses.

6.2 Endpoint Surface

%%{init: {'theme':'dark', 'themeVariables': { 'primaryColor': '#375a7f', 'primaryTextColor': '#fff', 'primaryBorderColor': '#6c757d', 'lineColor': '#adb5bd'}}}%%
flowchart LR
 subgraph Core["Prediction endpoints"]
 P1[/POST /predict/]
 P2[/POST /predict_with_timestamps/]
 end
 subgraph Viz["Visualization endpoints"]
 V1[/POST /generate_chart/]
 V2[/POST /generate_wordcloud/]
 V3[/POST /generate_trend_graph/]
 end

 IN1[["comments: list[str]"]] --> P1
 IN2[["comments + timestamps"]] --> P2
 IN3[["sentiment counts"]] --> V1
 IN4[["comments"]] --> V2
 IN5[["timestamped labels"]] --> V3

 P1 --> OUT1[["[{comment, sentiment}]"]]
 P2 --> OUT2[["[{comment, sentiment, ts}]"]]
 V1 --> OUT3[["pie chart PNG"]]
 V2 --> OUT4[["word cloud PNG"]]
 V3 --> OUT5[["line chart PNG"]]
Figure 36: FastAPI endpoint surface.

6.3 Preprocessing Mirror

A subtle but critical detail: the preprocessing applied to incoming comments must exactly match the training-time preprocessing. Any divergence (a different stop-word list, a missing lemmatization step) silently corrupts inference. The backend reuses the same logic (lowercase, newline/non-alphanumeric scrub, stop-word removal with the five exceptions, WordNet lemmatization) that the DVC data_preprocessing.py stage uses.

6.4 Startup-Time Resource Loading

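# Conceptual, see backend/app.py for the real thing
import joblib
import mlflow
from fastapi import FastAPI

app = FastAPI()
model = vectorizer = None  # populated once, at startup

@app.on_event("startup")  # newer FastAPI releases prefer a lifespan handler
def load_resources():
    global model, vectorizer
    # DAGSHUB_URI: the MLflow tracking URI on DagsHub, defined elsewhere
    mlflow.set_tracking_uri(DAGSHUB_URI)
    model = mlflow.pyfunc.load_model("models:/youtube_comments_analyzer/Production")
    vectorizer = joblib.load("/app/models/vectorizer.pkl")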

This has two consequences. First, each container is stateful in the sense that its model reference is pinned at startup. A freshly promoted model in the registry will not be picked up until the containers are restarted, which CodeDeploy handles during its rolling deploy. Second, startup latency includes model download time, which is why the ALB’s health-check grace period matters.

6.5 Error Handling

The endpoints catch and report: malformed JSON input, preprocessing failures (e.g., non-string entries in the comments array), vectorizer-shape mismatches, and model prediction errors. Each produces a structured HTTP error response rather than a 500 with a stack trace.

6.6 Example Request/Response

// POST /predict
{
    "comments": [
        "I love this video",
        "This was terrible and boring"
    ]
}
// Response
[
    {"comment": "I love this video", "sentiment": 1},
    {"comment": "This was terrible and boring", "sentiment": -1}
]

7 Frontend: The Chrome Extension

7.1 Anatomy of the Extension

A Chrome extension at its minimum is three files:

| File | Role |
|---|---|
| manifest.json | Declares the extension’s permissions, icons, popup page, and (importantly) the domains it may talk to (the ALB’s DNS name, YouTube’s API, and any image sources). |
| popup.html | The UI shown when the user clicks the toolbar icon. Hosts containers for the sentiment boxes, the three images, the metrics, and the top-25 comment list. Styled with inline CSS in a dark theme. |
| popup.js | All the logic: URL validation, calls to the YouTube Data API, calls to the FastAPI backend, DOM rendering. |

A fourth file, secrets.json, holds the YouTube Data API key. It is .gitignore’d.

7.2 Client-Side Flow

%%{init: {'theme':'dark', 'themeVariables': { 'primaryColor': '#375a7f', 'primaryTextColor': '#fff', 'primaryBorderColor': '#6c757d', 'lineColor': '#adb5bd'}}}%%
flowchart TB
 START([Popup opens]) --> CHK{URL is<br/>YouTube video?}
 CHK -- No --> MSG1[Show 'not a YouTube video']
 CHK -- Yes --> VID[Extract videoId]
 VID --> FC[fetchComments<br/>up to 500]
 FC --> GSP[getSentimentPredictions<br/>POST /predict_with_timestamps]
 GSP --> SPLIT{Distribute to UI renderers}

 SPLIT --> PCT[Compute sentiment %]
 SPLIT --> CHART[fetchAndDisplayChart<br/>→ /generate_chart]
 SPLIT --> WC[fetchAndDisplayWordCloud<br/>→ /generate_wordcloud]
 SPLIT --> TR[fetchAndDisplayTrendGraph<br/>→ /generate_trend_graph]
 SPLIT --> MET[Compute metrics<br/>total, unique authors,<br/>avg words, avg sentiment score]
 SPLIT --> TOP[Render top-25 comments]

 PCT --> DONE([Render dashboard])
 CHART --> DONE
 WC --> DONE
 TR --> DONE
 MET --> DONE
 TOP --> DONE
Figure 37: Internal flow of popup.js when the extension opens on a YouTube video.

7.3 The “Average Sentiment Score” Metric

Among the four headline metrics, the only non-trivial one is the average sentiment score, mapped onto a −10 to +10 scale where higher is more positive:

\[ \text{avg score} \;=\; \frac{(+1)\,c_{+} + 0 \cdot c_{0} + (-1)\,c_{-}}{c_{+} + c_{0} + c_{-}} \times 10 \]

The other three (total comments, unique commenters, average words per comment) are computed directly from the YouTube API response without hitting the backend.
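All four metrics fit in one small function. The sketch below is a Python rendering of logic that lives in popup.js; the record keys `author`, `text` and `sentiment` are hypothetical names, not the extension's actual field names.

```python
from typing import Dict, List


def headline_metrics(comments: List[Dict]) -> Dict[str, float]:
    """Compute the four headline metrics from labelled comment records."""
    total = len(comments)
    if total == 0:
        return {"total": 0, "unique_commenters": 0, "avg_words": 0.0, "avg_score": 0.0}

    unique_commenters = len({c["author"] for c in comments})
    avg_words = sum(len(c["text"].split()) for c in comments) / total
    # Labels are -1, 0 or +1, so their mean lies in [-1, 1]; scale by 10.
    avg_score = sum(c["sentiment"] for c in comments) / total * 10

    return {
        "total": total,
        "unique_commenters": unique_commenters,
        "avg_words": avg_words,
        "avg_score": avg_score,
    }


example = [
    {"author": "a", "text": "love it", "sentiment": 1},
    {"author": "b", "text": "great", "sentiment": 1},
    {"author": "a", "text": "ok video here", "sentiment": 0},
    {"author": "c", "text": "bad", "sentiment": -1},
]
metrics = headline_metrics(example)
```

Two positives, one neutral and one negative give a score of (2 − 1) / 4 × 10 = 2.5. The extremes of the formula are −10 when every comment is negative and +10 when every comment is positive.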

7.4 Iterative Build

The extension was built as a sequence of seven progressively-richer prototypes, each a separate commit:

  1. Hard-coded two-comment prototype, verifying the extension can POST to the backend.
  2. URL checker / videoId extraction.
  3. YouTube Data API wired up for comment counts.
  4. Sentiment percentages on 100 comments.
  5. 500-comment fetch, top-25 display, dark-theme styling.
  6. Pie chart from /generate_chart.
  7. Word cloud, monthly trend, headline metrics, the current production surface.

8 CI/CD Pipeline

8.1 The Workflow at a Glance

%%{init: {'theme':'dark', 'themeVariables': { 'primaryColor': '#375a7f', 'primaryTextColor': '#fff', 'primaryBorderColor': '#6c757d', 'lineColor': '#adb5bd'}}}%%
flowchart TB
 TRIG[on: push] --> CK[Checkout]
 CK --> PY[Setup Python 3.11]
 PY --> CACHE[Cache pip]
 CACHE --> DEPS[Install deps + dvc-s3]
 DEPS --> REPRO[dvc repro]
 REPRO --> DPUSH[dvc push → S3]
 DPUSH --> GIT[git add/commit/push<br/>as github-actions bot]
 GIT --> T_LOAD[Test: model loading]
 T_LOAD --> T_SIG[Test: model signature]
 T_SIG --> T_PERF[Test: performance ≥ 0.75]
 T_PERF --> PROMO[Promote Staging → Production]
 PROMO --> FASTAPI[Start FastAPI + wait]
 FASTAPI --> T_API[Test: FastAPI endpoints]
 T_API --> ECR_LOGIN[Log in to ECR]
 ECR_LOGIN --> D_BUILD[docker build]
 D_BUILD --> D_TAG[docker tag]
 D_TAG --> D_PUSH[docker push → ECR]
 D_PUSH --> ZIP[Zip deploy bundle]
 ZIP --> S3UP[Upload to S3]
 S3UP --> DEPLOY[aws deploy create-deployment]

 style PROMO fill:#00bc8c,stroke:#fff,color:#fff
 style DEPLOY fill:#00bc8c,stroke:#fff,color:#fff
 style T_PERF fill:#e74c3c,stroke:#fff,color:#fff
Figure 38: GitHub Actions ci-cd.yaml, stage by stage.

8.2 Pipeline Reproduction

Running the DVC pipeline inside CI requires a handful of secrets and a little care to avoid runaway loops. The core step:

- name: Run DVC Pipeline
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    AWS_DEFAULT_REGION: us-east-2
    RAW_DATA_S3_BUCKET_NAME: ${{ secrets.RAW_DATA_S3_BUCKET_NAME }}
    RAW_DATA_S3_KEY: ${{ secrets.RAW_DATA_S3_KEY }}
    DAGSHUB_USER_TOKEN: ${{ secrets.DAGSHUB_USER_TOKEN }}
  run: dvc repro

After dvc repro, artifacts are pushed to the S3 DVC remote:

- name: Push DVC-tracked Data to Remote
  run: dvc push

The pipeline also modifies dvc.lock and possibly other tracked files. Those changes need to land back on the repo. The workflow configures a bot identity and commits, but only if the actor wasn’t itself the bot, which is the standard trick to avoid an infinite CI loop:

- name: Commit Changes
  if: ${{ github.actor != 'github-actions[bot]' }}
  run: git commit -m "Automated commit of DVC outputs" || echo "No changes"

8.3 Model Tests

Three staged tests, each gating the next:

  1. Model loading via tests/test_model_loading.py. Fetches the just-registered Staging model from MLflow and asserts it’s usable.
  2. Model signature via tests/test_model_signature.py. A mismatch between expected input shape/dtype and the model’s signature is one of the sneakiest ways to break production. This test pushes a preprocessed sample through and verifies both input and output shapes.
  3. Performance via tests/test_model_performance.py. Scores the model on a validation slice and fails if accuracy, precision, recall, or F1 falls below 0.75.

Pass all three → scripts/promote_model.py transitions the Staging model to Production and archives any currently-Production model.
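The performance gate reduces to a threshold check over four numbers. A minimal sketch follows, assuming the metrics arrive as a dictionary; the function names and example values are hypothetical and not taken from tests/test_model_performance.py.

```python
from typing import Dict, List

THRESHOLD = 0.75
REQUIRED_METRICS = ("accuracy", "precision", "recall", "f1")


def failing_metrics(metrics: Dict[str, float], threshold: float = THRESHOLD) -> List[str]:
    """Names of required metrics that are missing or below the threshold."""
    return [name for name in REQUIRED_METRICS if metrics.get(name, 0.0) < threshold]


def assert_promotable(metrics: Dict[str, float]) -> None:
    """Raise AssertionError (failing the CI step) if any metric misses the bar."""
    failed = failing_metrics(metrics)
    assert not failed, f"Model stays in Staging; below {THRESHOLD}: {failed}"


# Hypothetical scores: recall misses the bar, so promotion is blocked.
blocked = failing_metrics({"accuracy": 0.78, "precision": 0.77, "recall": 0.74, "f1": 0.76})
passed = failing_metrics({"accuracy": 0.78, "precision": 0.77, "recall": 0.76, "f1": 0.76})
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a gate that silently passes when a number is absent would defeat its purpose.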

8.4 API Tests

Once the model is in Production, the workflow spins up FastAPI in the background and waits for the port to become available:

- name: Wait for FastAPI to be Ready
  run: |
    for i in {1..10}; do
      nc -z localhost 5000 && echo "FastAPI is up!" && exit 0
      echo "Waiting for FastAPI..." && sleep 3
    done
    echo "FastAPI server failed to start" && exit 1

Then tests/test_fast_api.py hits each of the five endpoints with dummy payloads and asserts on status codes and response shapes.


9 Containerization

9.1 Dockerfile

The image starts from python:3.11-slim to keep size down:

FROM python:3.11-slim

WORKDIR /app/backend

# OpenMP runtime, needed by LightGBM; apt lists are removed to keep the layer small
RUN apt-get update \
    && apt-get install -y --no-install-recommends libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Copy only what's needed for inference
COPY backend/ /app/backend/
COPY models/ /app/models/

# Inference-specific requirements (kept separate from training reqs)
RUN pip install --no-cache-dir -r requirements.txt

# NLTK assets needed for preprocessing at inference time
RUN python -m nltk.downloader stopwords wordnet

EXPOSE 5000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "5000"]

Two things worth calling out:

  • backend/requirements.txt is intentionally a trimmed version of the project’s requirements.txt: only the libraries the inference server actually imports. Training-only libraries (Optuna, DVC, MLflow extras beyond the model-loading client, matplotlib-for-EDA, spaCy, etc.) are omitted. This keeps the final image size reasonable.
  • libgomp1 is easy to forget. LightGBM binaries that were linked against OpenMP will fail at import time on slim base images without it.

9.2 Local Build and Sanity Test

docker build -t sushrutgaikwad/youtube-comments-analyzer .
docker run -p 8888:5000 \
    -e AWS_ACCESS_KEY_ID=DUMMY \
    -e AWS_SECRET_ACCESS_KEY=DUMMY \
    -e DAGSHUB_USER_TOKEN=DUMMY \
    sushrutgaikwad/youtube-comments-analyzer

Port 5000 inside the container is exposed on 8888 on the host. The dummy env vars are enough to let the container start; real AWS/DagsHub credentials are only needed if you actually want it to load the Production model from the registry.

9.3 Pushing to ECR

After running aws configure, pushing the image takes the standard four ECR commands (all automated in the workflow):

aws ecr get-login-password --region us-east-2 \
    | docker login --username AWS \
    --password-stdin 872515288060.dkr.ecr.us-east-2.amazonaws.com
docker build -t youtube-comments-analyzer .
docker tag youtube-comments-analyzer:latest \
    872515288060.dkr.ecr.us-east-2.amazonaws.com/youtube-comments-analyzer:latest
docker push 872515288060.dkr.ecr.us-east-2.amazonaws.com/youtube-comments-analyzer:latest

In CI these become four GitHub Actions steps, each gated on if: success() so a failure earlier in the workflow (say, a model test failing) aborts the build before it can ship a broken image.


10 Deployment: AWS EC2 + CodeDeploy

10.1 Deployment Topology

%%{init: {'theme':'dark', 'themeVariables': { 'primaryColor': '#375a7f', 'primaryTextColor': '#fff', 'primaryBorderColor': '#6c757d', 'lineColor': '#adb5bd'}}}%%
flowchart TB
 USER[User] -- HTTPS --> DNS[Route 53 / ALB DNS]
 DNS --> ALB[Application Load Balancer<br/>health checks]

 subgraph ASG["Auto Scaling Group (2-3 instances)"]
 direction LR
 I1[EC2 instance 1<br/>CodeDeploy agent<br/>Docker: FastAPI:5000]
 I2[EC2 instance 2<br/>CodeDeploy agent<br/>Docker: FastAPI:5000]
 I3[EC2 instance 3<br/>on scale-out]
 end

 ALB --> I1
 ALB --> I2
 ALB -. scale-out .-> I3

 subgraph CD_STACK["Deployment control plane"]
 CDAPP[CodeDeploy application]
 CDGRP[Deployment group<br/>OneAtATime]
 S3B[(S3:<br/>deployment.zip)]
 ECR[(ECR:<br/>latest image)]
 end

 CDGRP --> ASG
 S3B --> CDGRP
 I1 -.->|docker pull| ECR
 I2 -.->|docker pull| ECR
 I3 -.->|docker pull| ECR
Figure 39: AWS deployment topology.

10.2 The Two IAM Roles

Deployment hinges on two separate IAM roles:

  1. EC2 instance role, attached to the launch template. Lets each instance pull images from ECR and be managed by CodeDeploy. Policies:
    • AmazonEC2ContainerRegistryReadOnly
    • AmazonEC2RoleforAWSCodeDeploy
  2. CodeDeploy service role, attached to the CodeDeploy deployment group. Lets CodeDeploy itself manipulate the Auto Scaling Group, target groups, and load balancer. Policy: AWSCodeDeployRole.

10.3 Launch Template + User Data

The launch template bakes the CodeDeploy agent into every instance via its user-data script:

#!/bin/bash
sudo apt-get update -y
sudo apt-get install ruby -y
wget https://aws-codedeploy-us-east-2.s3.us-east-2.amazonaws.com/latest/install
chmod +x ./install
sudo ./install auto
sudo service codedeploy-agent start

After the ASG launches, sudo service codedeploy-agent status on each instance confirms the agent is running.

10.4 Auto Scaling Group Configuration

  • Size: min 2, desired 2, max 3.
  • Distribution: Balanced best-effort across AZs.
  • Load balancer: new internet-facing Application Load Balancer, new target group, ELB health checks enabled.
  • Scaling policy: target tracking on average CPU utilization, target value 50%, instance warm-up 300s.
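
To see how the size bounds and the 50% CPU target interact, here is a deliberately simplified proportional model of target tracking. It is an assumption made for illustration, not AWS's exact algorithm (which also accounts for warm-up, cooldowns, and CloudWatch alarm evaluation); the function name is hypothetical.

```python
import math


def desired_capacity(
    current: int,
    avg_cpu: float,
    target: float = 50.0,
    min_size: int = 2,
    max_size: int = 3,
) -> int:
    """Estimate the capacity needed to bring average CPU back to the target.

    Simplified model: scale capacity in proportion to how far the metric is
    from the target, then clamp to the group's min/max size.
    """
    proposed = math.ceil(current * avg_cpu / target)
    return max(min_size, min(max_size, proposed))
```

Under this model, two instances averaging 80% CPU would ask for four instances, but the group's maximum caps the result at three; at 20% CPU the minimum of two keeps the fleet from shrinking further.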

10.5 CodeDeploy Application & Deployment Group

  • Compute platform: EC2/On-premises.
  • Deployment type: In-place.
  • Environment: pointed at the ASG above.
  • Deployment configuration: CodeDeployDefault.OneAtATime. Only one instance is taken out of rotation at a time, which prevents the fleet from ever being fully down.
  • Load balancer: the ALB + target group from the ASG.
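
The availability argument behind OneAtATime can be illustrated with a toy simulation. This is a sketch of the batching idea only, with hypothetical instance IDs; it does not model CodeDeploy's health checks, lifecycle hooks, or failure handling.

```python
from typing import Iterator, List, Tuple


def rolling_deploy(instances: List[str], batch_size: int = 1) -> Iterator[Tuple[List[str], int]]:
    """Yield (batch being updated, number of instances still in service) per step."""
    for start in range(0, len(instances), batch_size):
        batch = instances[start:start + batch_size]
        # Instances in the current batch are deregistered from the target group;
        # everything else keeps serving traffic.
        yield batch, len(instances) - len(batch)


if __name__ == "__main__":
    for batch, in_service in rolling_deploy(["i-aaa", "i-bbb"]):
        print(f"updating {batch}, {in_service} instance(s) still serving")
```

With the ASG minimum of two instances and a batch size of one, at least one instance is always in service during a deployment.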

10.6 The Deployment Bundle

Three files make up the deployment bundle that CodeDeploy consumes:

10.6.1 appspec.yml

version: 0.0
os: linux
files:
    - source: /
      destination: /home/ubuntu/app
hooks:
    BeforeInstall:
        - location: deploy/scripts/install_dependencies.sh
          timeout: 300
          runas: ubuntu
    ApplicationStart:
        - location: deploy/scripts/start_docker.sh
          timeout: 300
          runas: ubuntu

10.6.2 deploy/scripts/install_dependencies.sh

Idempotent setup of the instance: Docker, AWS CLI, user permissions.

#!/bin/bash
export DEBIAN_FRONTEND=noninteractive
sudo apt-get update -y
sudo apt-get install -y docker.io
sudo systemctl start docker && sudo systemctl enable docker
sudo apt-get install -y unzip curl

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "/home/ubuntu/awscliv2.zip"
unzip -o /home/ubuntu/awscliv2.zip -d /home/ubuntu/
# --update lets the installer succeed when the AWS CLI is already present
sudo /home/ubuntu/aws/install --update

sudo usermod -aG docker ubuntu
rm -rf /home/ubuntu/awscliv2.zip /home/ubuntu/aws

10.6.3 deploy/scripts/start_docker.sh

Pull the latest image, cleanly replace any running container, and bind host port 80 to container port 5000 so the ALB’s HTTP listener lines up with the app:

#!/bin/bash
exec > /home/ubuntu/start_docker.log 2>&1

echo "Logging in to ECR..."
aws ecr get-login-password --region us-east-2 \
    | docker login --username AWS \
    --password-stdin 872515288060.dkr.ecr.us-east-2.amazonaws.com

echo "Pulling Docker image..."
docker pull 872515288060.dkr.ecr.us-east-2.amazonaws.com/youtube-comments-analyzer:latest

echo "Checking for existing container..."
if [ "$(docker ps -q -f name=youtube-comments-analyzer)" ]; then
    docker stop youtube-comments-analyzer
fi
if [ "$(docker ps -aq -f name=youtube-comments-analyzer)" ]; then
    docker rm youtube-comments-analyzer
fi

echo "Starting new container..."
docker run -d -p 80:5000 --name youtube-comments-analyzer \
    872515288060.dkr.ecr.us-east-2.amazonaws.com/youtube-comments-analyzer:latest

echo "Container started successfully."

10.7 Wiring It All to CI

The final three steps of ci-cd.yaml zip the bundle, upload to S3, and trigger CodeDeploy:

- name: Zip Files for Deployment
  if: success()
  run: zip -r deployment.zip appspec.yml deploy/scripts/install_dependencies.sh deploy/scripts/start_docker.sh

- name: Upload Zip File to S3
  if: success()
  run: aws s3 cp deployment.zip s3://yt-comments-analyzer-codedeploy-bucket/deployment.zip

- name: Deploy to AWS CodeDeploy
  if: success()
  run: |
    aws deploy create-deployment \
      --application-name youtube-comments-analyzer \
      --deployment-config-name CodeDeployDefault.OneAtATime \
      --deployment-group-name youtube-comments-analyzer-deployment-group \
      --s3-location bucket=yt-comments-analyzer-codedeploy-bucket,key=deployment.zip,bundleType=zip \
      --file-exists-behavior OVERWRITE \
      --region us-east-2
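
For local debugging it can help to build the same bundle without the zip CLI. The sketch below is a standard-library equivalent of the zip step, under the assumption that the three paths are relative to the repository root; the function name is hypothetical. It keeps appspec.yml at the root of the archive, which is where CodeDeploy looks for it.

```python
import zipfile
from pathlib import Path
from typing import Iterable

# The three files the post lists as the deployment bundle.
BUNDLE_FILES = (
    "appspec.yml",
    "deploy/scripts/install_dependencies.sh",
    "deploy/scripts/start_docker.sh",
)


def build_bundle(repo_root: Path, out_path: Path, files: Iterable[str] = BUNDLE_FILES) -> Path:
    """Write the listed files into a zip, preserving their repo-relative paths."""
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        for relative in files:
            # Raises FileNotFoundError if a bundle file is missing, which is
            # preferable to shipping an incomplete archive.
            bundle.write(repo_root / relative, arcname=relative)
    return out_path
```

Calling `build_bundle(Path("."), Path("deployment.zip"))` from the repository root should produce an archive with the same layout as the CI step.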

11 Observability (Planned)

The current deployment is covered by AWS CloudWatch for infrastructure-level telemetry: instance CPU, ALB request counts, target group health, container logs via the Docker driver. That’s enough to know whether the fleet is alive; it is not enough to know whether the model is alive and behaving.

The intended next step is a model-observability layer. The sketch:

%%{init: {'theme':'dark', 'themeVariables': { 'primaryColor': '#375a7f', 'primaryTextColor': '#fff', 'primaryBorderColor': '#6c757d', 'lineColor': '#adb5bd'}}}%%
flowchart LR
 API[FastAPI<br/>/metrics endpoint] -->|scrape| PROM[Prometheus]
 API --> CW[CloudWatch Logs]
 PROM --> GRAF[Grafana dashboards]
 CW --> GRAF

 subgraph Metrics["Planned metrics"]
 M1[Inference latency<br/>p50/p95/p99]
 M2[Request throughput]
 M3[Predicted class<br/>distribution]
 M4[Input length<br/>distribution]
 M5[Error rate per<br/>endpoint]
 end

 API -.emits.-> M1 & M2 & M3 & M4 & M5

 style PROM fill:#555,stroke:#fff,color:#fff,stroke-dasharray: 5 5
 style GRAF fill:#555,stroke:#fff,color:#fff,stroke-dasharray: 5 5
Figure 40: Planned observability stack.

Dashed components are planned, not shipped. The Grafana dashboards would compare the live predicted-class distribution and input-length distribution against their training-time counterparts to catch data drift early. Examples of such drift include a shift toward shorter comments, or a sudden collapse of the negative-class prediction rate like the one the baseline model exhibited.
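
One simple way to quantify the predicted-class comparison is the total variation distance between the training-time and live label distributions. The post does not commit to a specific drift metric, so treat this as one illustrative choice; the function names and example labels are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def class_distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Convert a stream of predicted labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot build a distribution from zero labels")
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Half the L1 distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)
```

A collapse of the negative class shows up directly: a training distribution of 40% positive, 40% neutral, 20% negative against a live window with no negative predictions gives a distance of about 0.2. An alert threshold on this value would have to be tuned empirically.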


12 What’s Deliberately Out of Scope (For Now)

A few things are mentioned elsewhere as “we could do this” but were intentionally not pursued in v1:

  • Multilingual support. Detecting and handling non-English and code-switched comments would require a very different feature pipeline (likely multilingual embeddings) and a much larger labelled corpus. Punted.
  • Sarcasm detection. Genuinely open research. A bag-of-words LightGBM is not going to solve this.
  • A deep-learning model (transformer fine-tune). Experiment 5 showed all classical ML families topping out in a similar range. A transformer would plausibly improve absolute accuracy, but at a latency cost that would require either async inference or a bigger instance class, and the current accuracy/latency trade-off is acceptable for the product goal.
  • Stacking ensemble in production. Experiment 6 showed it was competitive on accuracy but the four-base-model latency was unacceptable.

13 Repository Map

| Repo | Role |
| --- | --- |
| youtube-comments-analyzer | Training pipeline (DVC), FastAPI backend, Dockerfile, CI/CD workflow, deploy scripts. |
| youtube-comments-analyzer-frontend | Chrome extension (manifest.json, popup.html, popup.js). |

Key directories in the backend repo:

youtube-comments-analyzer/
├── youtube_comments_analyzer/    # Python package: DVC pipeline stages
│   ├── data_ingestion.py
│   ├── data_preprocessing.py
│   ├── feature_engineering.py
│   ├── model_training.py
│   ├── model_evaluation.py
│   └── model_registration.py
├── backend/                      # FastAPI app + inference-only requirements
│   ├── app.py
│   └── requirements.txt
├── tests/                        # pytest: model loading, signature, perf, API
├── scripts/                      # model promotion script (Staging → Production)
│   └── promote_model.py
├── notebooks/                    # 01-21: EDA through final model selection
├── deploy/scripts/               # install_dependencies.sh, start_docker.sh
├── .github/workflows/            # CI/CD workflow
│   └── ci-cd.yaml
├── dvc.yaml                      # DVC pipeline definition
├── params.yaml                   # Hyperparameters and model configs
├── appspec.yml                   # CodeDeploy application specification
├── Dockerfile                    # Docker image definition
└── requirements.txt              # Full project dependencies (training + serving)

14 Closing Thoughts

Two things stand out in retrospect.

The classical ML ceiling is real but liveable. Four very different model families (trees, linear, kernel) converged to the same accuracy band, which signals that data preprocessing and feature representation are the binding constraint, not the model family. That kind of finding justifies a feature-engineering investment (or a switch to learned embeddings) rather than another round of hyperparameter tuning. The decision to ship LightGBM with class_weight="balanced" was as much about operational simplicity as about metric wins.

The MLOps plumbing is where the project’s real leverage is. The model accuracy isn’t state-of-the-art. But the retrain-test-promote-deploy loop means a better model (whether it comes from new data, a transformer replacement, or a completely different approach) can be dropped in without rewriting anything downstream. That’s the piece that generalizes beyond this one project.