YouTube Comment Intelligence

Machine Learning

NLP

A Chrome extension for sentiment analysis on YouTube comments.

Author

Sushrut

Published

August 1, 2025

Project Planning

Problem Statement

Business Context

Consider that we are an influencer management company seeking to expand our network by attracting more influencers to join our platform. Due to a limited marketing budget, traditional advertising channels are not viable for us. To overcome this, we aim to offer a solution that addresses a significant pain point for influencers, thereby encouraging them to engage with our company.

Business Problem

Need to Attract More Influencers

Objective:
- Increase our influencer clientele to enhance our service offerings to brands and stay competitive.
Challenge:
- Limited marketing budget restricts our ability to reach and engage potential influencer clients through conventional means.

Identifying Influencer Pain Point

Understanding Influencer Challenges:
- To effectively attract influencers, we need to understand and address the key challenges they face.
Research Insight:
- Influencers, especially those with large followings, struggle with managing and interpreting the vast amount of feedback they receive via comments on their content.

Big Influencers Face Issues with Comments Analysis

Volume of Comments:
- High-profile influencers receive thousands of comments on their videos, making manual analysis impractical.
Time Constraints:
- Influencers often lack the time to sift through comments to extract meaningful insights.
Impact on Content Strategy:
- Without efficient comment analysis, influencers miss opportunities to understand audience sentiment, address concerns, and tailor their content effectively.

Our Solution

To directly address the significant pain point faced by big influencers—managing and interpreting vast amounts of comments data—we present the “Influencer Insights” Chrome extension. This tool is designed to empower influencers by providing in-depth analysis of their YouTube video comments, helping them make data-driven decisions to enhance their content and engagement strategies.

Key Features of the Extension

Sentiment Analysis of Comments

Real-Time Sentiment Classification:
- The extension performs real-time analysis of all comments on a YouTube video, classifying each as positive, neutral, or negative.
Sentiment Distribution Visualization:
- Displays the overall sentiment distribution with intuitive graphs or charts (e.g., pie charts or bar graphs showing percentages like 70% positive, 20% neutral, 10% negative).
Detailed Sentiment Insights:
- Allows users to drill down into each sentiment category to read specific comments classified under it.
Trend Tracking:
- Monitors how sentiment changes over time, helping influencers identify how different content affects audience perception.

Additional Comments Analysis Features

Word Cloud Visualization:
- Generates a word cloud showcasing the most frequently used words and phrases in the comments.
- Helps quickly identify trending topics, keywords, or recurring themes.
Average Comment Length:
- Calculates and displays the average length of comments, indicating the depth of audience engagement.
Export Data Functionality:
- Enables users to export analysis reports and visualizations in various formats (e.g., PDF, CSV) for further use or sharing with team members.

Challenges

Data

Data is the first hurdle in any data science project. This project faces several data-related problems, some of which are listed below.

Availability

The problem we are solving is a supervised machine learning problem. So, we need a labeled dataset in which each comment is paired with its sentiment: positive, negative, or neutral. Finding such data is a challenge. One option is to use the YouTube Data API, but this only gives us the comments and not the associated labels, and there is no readily available dataset of YouTube comments with sentiment labels. So, we will use the Reddit data from a Kaggle dataset. This dataset also contains Twitter data, which we won’t use because EDA showed it to be highly political. The Reddit data is comparatively less political, so a model trained on it should generalize better to other kinds of content.

Lack of General Kind of Dataset

We cannot get a universal dataset that is representative of all types of content on YouTube. This means that our model won’t generalize well on all types of YouTube videos.

Multi-Language Comments

YouTube comments are written in many languages, and how common each language is varies from video to video. Some comments mix multiple languages. Further, some comments use the Latin script to type other languages, e.g., Hindi words typed with English letters. So, it is quite difficult to train a single model for all languages.

Spam and Bot Comments

Identifying and removing such meaningless comments from the training data is difficult. If they remain, they can bias the model and worsen its accuracy.

Slang, Emoji, and Informal Comments

Understanding the sentiment of such comments is a challenge.

Sarcastic Comments

It is easy for humans to understand sarcastic comments because they already know the context. However, the same is very difficult for a model to learn. Hence, such comments are also a problem during training.

Evolving Language Usage

Language evolves with each generation. Generation Z (Gen-Z) has introduced many terms that were not used before, so comments by Gen-Z users differ from those written by users from other generations. This shift in the input data over time is a form of data drift, and modeling it is difficult.

Privacy and Data Compliance

If you build a model using some data, you are answerable to every company that uses your model for how that data was obtained. So, the source of the data, and whether it may be used for this purpose, is crucial.

Building an Efficient Model

The data problems described in the previous subsection already make it challenging to build an efficient model. On top of that, noise, variability, class imbalance, etc., can make training even more difficult.

Latency

Whenever someone opens a YouTube video, we want the extension to give results quickly. In other words, we want the latency of the model to be low. This again is a challenge given the complexity of the problem. We definitely would need to implement asynchronous programming.

User Experience

We would need to make sure that the user experience is good. If it is bad, users will definitely avoid using the extension. Hence, we will need to design it accordingly.

Workflow

The steps in the workflow of this project are:

Data collection,
Data preprocessing,
EDA,
Model building, hyperparameter tuning and evaluation alongside experiment tracking,
Building a DVC pipeline,
Registering the model,
Building the API using Flask,
Developing the Chrome extension,
Setting up CI/CD pipeline,
Testing,
Building the Docker image and pushing to ECR,
Deployment using AWS.

Tools and Technologies

We will use the following tools and technologies in this project.

Version Control and Collaboration

Git

Purpose:
- Distributed version control system for tracking changes in source code.
Usage:
- Manage codebase, track changes, and collaborate with team members.

GitHub

Purpose:
- Hosting service for Git repositories with collaboration features.
Usage:
- Store repositories, manage issues, pull requests, and facilitate team collaboration.

Data Management and Versioning

DVC (Data Version Control)

Purpose:
- Version control system for tracking large datasets and machine learning models.
Usage:
- Version datasets and machine learning pipelines, enabling reproducibility and collaboration.

AWS S3 (Simple Storage Service)

Purpose:
- Scalable cloud storage service.
Usage:
- Store datasets, pre-processed data, and model artifacts tracked by DVC.

Machine Learning and Experiment Tracking

Python

Purpose:
- Programming language for backend development and machine learning.

Machine Learning Libraries

scikit-learn:
- Purpose: Library for classical machine learning algorithms.
- Usage: Implement baseline models and preprocessing techniques.

NLP Libraries

NLTK (Natural Language Toolkit):
- Purpose: Platform for building Python programs to work with human language data.
- Usage: Tokenization, stemming, and other basic NLP tasks.
spaCy:
- Purpose: Industrial-strength NLP library.
- Usage: Advanced NLP tasks like named entity recognition, part-of-speech tagging.

MLFlow

Purpose:
- Platform for managing the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
Usage:
- Track experiments, log parameters, metrics, and artifacts; manage model versions.

MLFlow Model Registry

Purpose:
- Component of MLflow for managing the full lifecycle of ML models.
Usage:
- Register models, manage model stages (e.g., staging, production), and collaborate on model development.

Optuna

Used for hyperparameter tuning.

Continuous Integration / Continuous Delivery (CI/CD)

GitHub Actions

Purpose:
- Automation platform that enables CI/CD directly from GitHub repositories.
Usage:
- Automate testing, building, and deployment pipelines.
- Trigger workflows on events like code commits or pull requests.

Cloud Services and Infrastructure

AWS (Amazon Web Services)

AWS EC2 (Elastic Compute Cloud):
- Purpose: Scalable virtual servers in the cloud.
- Usage: Host backend services, APIs, and model servers.
AWS Auto Scaling Groups:
- Purpose: Automatically adjust the number of EC2 instances to handle load changes.
- Usage:
  - Ensure that the application scales out during demand spikes to maintain performance.
  - Scale in during low demand periods to reduce costs.
  - Maintain application availability by automatically adding or replacing instances as needed.
AWS CodeDeploy:
- Purpose: Deployment service that automates application deployments to various compute services like EC2, Lambda, and on-premises servers.
- Usage:
  - Automate the deployment process of backend services and machine learning models to AWS EC2 instances or AWS Lambda.
  - Integrate with GitHub Actions to create a seamless CI/CD pipeline that deploys code changes automatically upon successful testing.
AWS CloudWatch:
- Purpose: Monitoring and observability service.
- Usage: Monitor application logs, set up alerts, and track performance metrics.
AWS IAM (Identity and Access Management):
- Purpose: Securely manage access to AWS services.
- Usage: Control access permissions for users and services.

Programming Languages and Libraries

Python

Purpose:
- Backend development, data processing, machine learning.
Usage:
- Implement APIs, machine learning models, data pipelines.

JavaScript

Purpose:
- Frontend development, especially for web applications and browser extensions.
Usage:
- Develop the Chrome extension’s user interface and functionality.

HTML and CSS

Purpose:
- Markup and styling languages for web content.
Usage:
- Structure and style the Chrome extension’s interface.

Data Processing Libraries

Pandas:
- Purpose: Data manipulation and analysis.
- Usage: Handle tabular data, preprocess datasets.
NumPy:
- Purpose: Fundamental package for scientific computing with Python.
- Usage: Perform numerical operations, handle arrays.

Frontend Development Tools

Chrome Extension APIs

Purpose:
- APIs provided by Chrome for building extensions.
Usage:
- Interact with browser features, modify web page content, manage extension behaviour.

Browser Developer Tools

Purpose:
- Built-in tools for debugging and testing web applications.
Usage:
- Inspect elements, debug JavaScript, monitor network activity.

Code Editors and IDEs

Visual Studio Code:
- Purpose: Source code editor.
- Usage: Write and edit code for both frontend and backend development.

Testing and Quality Assurance Tools

Testing Frameworks

Pytest:
- Purpose: Testing framework for Python.
- Usage: Write and run unit tests for backend code and data processing scripts.
Unittest:
- Purpose: Built-in Python testing framework.
- Usage: Write unit tests for Python code.
Jest:
- Purpose: JavaScript testing framework.
- Usage: Write and run tests for JavaScript code in the Chrome extension.

Project Management and Communication

Project Management Tools

Jira:
- Purpose: Issue and project tracking software.
- Usage: Manage tasks, track progress, and coordinate team activities.

Communication Tools

Slack:
- Purpose: Team communication platform.
- Usage: Facilitate real-time communication among team members.
Microsoft Teams:
- Purpose: Collaboration and communication platform.
- Usage: Chat, meet, call, and collaborate in one place.

DevOps and MLOps Tools

Docker

Purpose:
- Containerization platform.
Usage:
- Package applications and dependencies into containers for consistent deployment.

Security and Compliance

SSL/TLS Certificates

Purpose:
- Secure communications over a computer network.
Usage:
- Encrypt data between users and backend services.

Monitoring and Logging

Logging Tools

AWS CloudWatch Logs:
- Purpose: Monitor, store, and access log files.
- Usage: Collect and monitor logs from AWS resources.

Monitoring Tools

Prometheus (Optional):
- Purpose: Open-source monitoring system.
- Usage: Collect and store metrics, generate alerts.
Grafana:
- Purpose: Visualization and analytics software.
- Usage: Create dashboards to visualize metrics.

API Development and Testing

Frameworks

Flask:
- Purpose: Lightweight WSGI web application framework.
- Usage: Build RESTful APIs for backend services.

API Testing Tools

Postman:
- Purpose: API development environment.
- Usage: Design, test, and document APIs.

Code Quality and Documentation

Code Linters and Formatters

Pylint:
- Purpose: Code analysis for Python.
- Usage: Enforce coding standards, detect code smells.

Documentation Generation

Sphinx:
- Purpose: Generate documentation from source code.
- Usage: Create project documentation automatically.

Additional Tools and Libraries

Visualization Libraries

Matplotlib:
- Purpose: Plotting library for Python.
- Usage: Create static, animated, and interactive visualizations.
Seaborn:
- Purpose: Statistical data visualization.
- Usage: Draw attractive statistical graphics through a high-level interface built on Matplotlib.
D3.js:
- Purpose: JavaScript library for producing dynamic, interactive data visualizations.
- Usage: Create word clouds and other visual elements in the Chrome extension.

Data Serialization Formats

JSON:
- Purpose: Lightweight data interchange format.
- Usage: Transfer data between frontend and backend services.

EDA & Preprocessing

Introduction

Data preprocessing and EDA were performed in cycles. First, basic data preprocessing was done; then, based on the results, EDA was carried out, and this cycle was repeated.

Missing Values

Explicit `NaN`s

Out of the 37,249 comments, only 100 contained explicit missing values (NaN). The sentiment associated with them was neutral. These were dropped without any problem, as their proportion was minuscule.

Empty Comments

Six comments in the data consisted of nothing but a whitespace character. They were also dropped.

Duplicates

350 comments were duplicates, which were again dropped without any issues.

Converting Comments to Lowercase

It is important to convert all the words in the comments to lowercase because we do not want our model to distinguish between the words “This” and “this”.

Leading or Trailing Whitespaces

32,266 comments in the data had leading or trailing whitespaces, which is a massive proportion of the whole. Hence, these whitespaces were removed by using the strip method.

Comments with URLs

As URLs are not helpful in doing sentiment analysis, they should be removed from the comments. However, on checking using a regular expression, we found that there were no URLs present in the data.
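A check of this kind could look roughly like the sketch below. The exact regular expression used in the notebook is not shown in this post, so the pattern and the names URL_PATTERN, count_comments_with_urls, and remove_urls are assumptions made for illustration.

```python
import re

# Assumed pattern: any token that starts with http://, https://, or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def count_comments_with_urls(comments):
    """Return how many comments contain at least one URL-like substring."""
    return sum(1 for text in comments if URL_PATTERN.search(text))


def remove_urls(text):
    """Replace URL-like substrings with a space and collapse extra whitespace."""
    return " ".join(URL_PATTERN.sub(" ", text).split())


# Toy comments, not taken from the dataset.
sample = [
    "great video",
    "see https://example.com/watch for more",
    "visit www.example.org",
]
print(count_comments_with_urls(sample))
print(remove_urls(sample[1]))
```

A count of zero on the real data would correspond to the finding above that no URLs were present.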

Newline and Tab Characters

These characters were replaced with a single space in all the comments.
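The basic cleaning steps described so far can be summarized in one small function. This is only a sketch in plain Python: the project itself presumably used pandas, the function name clean_comments is hypothetical, and None stands in for NaN. Unlike the order described above, duplicates are removed here after normalization, which is a choice made for this sketch.

```python
import re


def clean_comments(rows):
    """Apply the basic cleaning steps to (comment, label) pairs.

    A missing comment is represented by None (standing in for NaN).
    """
    seen = set()
    cleaned = []
    for comment, label in rows:
        if comment is None:                     # explicit missing value
            continue
        text = re.sub(r"[\n\t]", " ", comment)  # newline/tab -> space
        text = text.strip().lower()             # trim whitespace, lowercase
        if not text:                            # whitespace-only comment
            continue
        if text in seen:                        # duplicate comment text
            continue
        seen.add(text)
        cleaned.append((text, label))
    return cleaned


# Toy rows, not taken from the dataset.
rows = [
    ("  This is GREAT\n", 1),
    (None, 0),
    (" ", 0),
    ("this is great", 1),
    ("bad\tvideo", -1),
]
print(clean_comments(rows))
```

In this toy input, the missing comment, the whitespace-only comment, and the duplicate are all dropped, leaving two cleaned rows.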

Class Imbalance

The column corresponding to the sentiment of the comment is category, which has the following three values:

-1: For negative comments,
0: For neutral comments,
+1: For positive comments.

The class imbalance was visualized using a count plot.

We found that 43% of the comments had positive sentiment, 35% had neutral sentiment, and 22% had negative sentiment.
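Percentages like these can be obtained by normalizing the label counts. The following sketch uses a toy label list rather than the real data, and the function name class_percentages is hypothetical.

```python
from collections import Counter


def class_percentages(labels):
    """Return the share of each class label as a percentage."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Toy labels: -1 = negative, 0 = neutral, +1 = positive.
labels = [1, 1, 0, -1, 1, 0, -1, 1, 0, 1]
print(class_percentages(labels))
```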

Creating a Column for Word Count

We created a column called word_count which consists of the number of words in each comment. The aim was to investigate whether the length of the comment had any predictive power about its sentiment.
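One simple way to probe this is to count whitespace-separated tokens and compare a summary statistic across sentiments. The sketch below assumes that "words" means whitespace-separated tokens; the function names and toy rows are illustrative only.

```python
from collections import defaultdict
from statistics import median


def word_count(comment):
    """Number of whitespace-separated tokens in a comment."""
    return len(comment.split())


def median_word_count_by_sentiment(rows):
    """Group word counts by label and return the median for each label."""
    groups = defaultdict(list)
    for comment, label in rows:
        groups[label].append(word_count(comment))
    return {label: median(counts) for label, counts in groups.items()}


# Toy rows, not taken from the dataset.
rows = [
    ("ok", 0),
    ("nice one", 0),
    ("i really enjoyed this video a lot", 1),
    ("loved it", 1),
    ("this was a waste of my time", -1),
]
print(median_word_count_by_sentiment(rows))
```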

Distribution of Word Count

Figure 2 shows the distribution of the word count.

Clearly, it has a huge right skew. The overwhelming majority of comments have very few words, but a small proportion of comments have a very large number of words.

Distribution of Word Count By Sentiment

Figure 3 shows the distribution of word count for each sentiment.

We can see the following:

Neutral comments: These comments seem to have very few words and their distribution is largely concentrated around shorter comments.
Positive comments: The word count of these comments has a wider spread, indicating that longer comments are more common among comments with a positive sentiment.
Negative comments: These comments have a distribution similar to positive comments.

Box Plot of Word Count By Sentiment

Figure 4 shows the box plot of word count for each sentiment.

We can see that positive and negative comments have more outliers as compared to neutral comments. This was also clear when we saw the plot in Figure 3. Some more observations are the following:

Neutral comments: The median word_count for these comments is the lowest, with a tighter IQR. This suggests that neutral comments are generally shorter.
Positive comments: The median word_count for these comments is relatively high, and there are several outliers with longer comments. This indicates that positive comments tend to be more verbose.
Negative comments: The word_count distribution of these comments is similar to the positive comments, but with a slightly lower median and fewer extreme outliers.

Bar Plot of Median Word Count By Sentiment

Figure 5 shows a bar plot of the median word count for each sentiment.

Again, the bar plot corroborates the fact that the distribution of neutral comments is different as compared to positive and negative comments.

Creating a Column for Stop Words

Using the nltk library, a new column called num_of_stop_words was created for each comment.

Distribution of Stop Words

Figure 6 shows the distribution of stop words in the data.

This distribution also has a huge right skew, just like the column word_count.

Distribution of Stop Words By Sentiment

Figure 7 shows the distribution of stop words by sentiments.

Again, very similar behavior as the column word_count.

Bar Plot of Median of Stop Words By Sentiment

Figure 8 shows a bar plot of the median of stop words for each sentiment.

Figure 8: Bar plot of median stop words by sentiment.

Again, the behavior is very similar to the column word_count.

Top-25 Most Common Stop Words

Figure 9 shows the top-25 most common stop words in the data.

We can see that the word “not” is used quite a lot in the comments. This word can completely change the sentiment of a sentence. The words “but”, “however”, “no”, “yet”, etc., also have the same effect. These may also be present in the comments.

Removing Stop Words

As mentioned in the previous sub-section, there are stop words like “not” in the data that can completely reverse the sentiment of the sentence. Hence, retaining such stop words is crucial, which is exactly what was done. The stop words in the data were removed using the set of all English stop words available in the library nltk, except the stop words “not”, “but”, “however”, “no”, and “yet”.
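The idea of removing stop words while keeping the sentiment-bearing ones can be sketched as follows. To keep the example self-contained, STOP_WORDS is a small stand-in list written for this sketch, not NLTK's actual English list; only the five retained words come from the text above.

```python
# Stand-in stop word list for illustration only (not NLTK's list).
STOP_WORDS = {
    "a", "an", "the", "is", "was", "it", "this", "i", "of", "and",
    "not", "but", "however", "no", "yet",
}

# Words that can flip the sentiment and are therefore kept.
RETAINED = {"not", "but", "however", "no", "yet"}

# The set that is actually removed from the comments.
EFFECTIVE_STOP_WORDS = STOP_WORDS - RETAINED


def count_stop_words(comment):
    """Count tokens that appear in the full stop word list."""
    return sum(1 for word in comment.split() if word in STOP_WORDS)


def remove_stop_words(comment):
    """Drop stop words, except the retained sentiment-bearing ones."""
    kept = [w for w in comment.split() if w not in EFFECTIVE_STOP_WORDS]
    return " ".join(kept)


print(count_stop_words("this is not a good video"))
print(remove_stop_words("this is not a good video"))
```

Without the retained set, "this is not a good video" would reduce to "good video", which reads as positive.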

Creating a column for Number of Characters

A new column called num_chars was created that counted the number of characters of the respective comment. Here, it was seen that the data consisted of many non-English special characters. These characters were removed using a regular expression pattern.
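The pattern used in the notebook is not given here, so the sketch below assumes one possible choice: keep lowercase ASCII letters, digits, whitespace, and a few basic punctuation marks, and drop everything else. The names NON_ENGLISH and strip_special_characters are hypothetical.

```python
import re

# Assumed pattern: anything other than a-z, 0-9, whitespace, and ! ? . ,
NON_ENGLISH = re.compile(r"[^a-z0-9\s!?.,]")


def strip_special_characters(comment):
    """Remove characters outside the assumed allowed set, then tidy spaces."""
    return " ".join(NON_ENGLISH.sub("", comment).split())


def num_chars(comment):
    """Number of characters in a comment."""
    return len(comment)


# Toy comment containing two Devanagari code points and a check mark.
raw = "modi \u091c\u0940 is great \u2713"
clean = strip_special_characters(raw)
print(clean, num_chars(clean))
```

This assumes the comments have already been lowercased; otherwise uppercase letters would be removed as well.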

Checking Punctuations

It turned out that the data already had all the punctuation characters removed.

Most Common Bigrams and Trigrams

Figure 10 and Figure 11 show the top-25 most common bigrams and trigrams in the data, respectively.
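Counting n-grams needs only a sliding window over the tokens of each comment. The sketch below shows one way to do it; the function names and toy comments are illustrative, and the notebook may well have used a library vectorizer instead.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(comments, n, k=25):
    """Return the k most common n-grams across all comments."""
    counts = Counter()
    for comment in comments:
        counts.update(ngrams(comment.split(), n))
    return counts.most_common(k)


# Toy comments, not taken from the dataset.
comments = ["not a good video", "a good video indeed", "not a good idea"]
print(top_ngrams(comments, n=2, k=3))
print(top_ngrams(comments, n=3, k=3))
```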

Lemmatization

All the words were converted to their root form using the WordNetLemmatizer class in the stem module of the package nltk.

Word Clouds

Word clouds are helpful tools for visualizing how prevalent each word is in the data.

Word Cloud for the Entire Data

Figure 12 shows the word cloud plot for the entire data.

Word Cloud for Positive Comments

Figure 13 shows the word cloud plot for the positive comments.

Word Cloud for Neutral Comments

Figure 14 shows the word cloud plot for the neutral comments.

Word Cloud for Negative Comments

Figure 15 shows the word cloud plot for the negative comments.

From the word cloud plots, it looks like the types of words are more or less the same irrespective of the sentiment of the comment.

Most Frequent Words

Figure 16 shows the top-20 most frequent words in the data.

Most Frequent Words By Sentiment

Figure 17 shows the top-20 most frequent words in the data for each sentiment.

Figure 17: Top-20 most frequent words in the data by sentiment.

It looks like the distribution of the most frequent words in the data is more or less the same irrespective of the sentiment of the comment. This was also apparent by looking at the word cloud plots. Further, it seems like there are a lot of political comments. So, the model may perform better on political videos as compared to others.

Jupyter Notebook

This is the Jupyter notebook in which EDA was performed.

Training a Baseline Model

Introduction

We will now create a baseline model for sentiment analysis. Creating such a model is helpful because it is the simplest model possible that helps us get a benchmark performance that can be compared to future models that are more complex. We will also set up an MLFlow tracking server on DagsHub that will help track all the experiments and evaluate various models. It will also help improve the baseline model’s performance using techniques like hyperparameter tuning. This will further help plan the future experiments.

Creating a GitHub Repository

The local directory was created using the Cookiecutter Data Science template. A new Python environment was created inside this directory using the following command:

conda create -p venv python=3.11 -y

After activating this environment, all the required libraries were mentioned in the requirements.txt file and were installed using the following command:

pip install -r requirements.txt

Git was initialized in this directory, and it was pushed to a new GitHub repository.

Creating a DagsHub Repository

A DagsHub repository was created by simply connecting the GitHub repository.

Preprocessing

The same preprocessing that was performed during EDA was performed on the data.

Feature Engineering

As this is just a baseline model, the simple feature engineering technique of bag of words using the CountVectorizer class from scikit-learn was performed.
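To make the idea concrete, here is a plain-Python sketch of bag of words with a max_features cap. It is written in the spirit of CountVectorizer but is not scikit-learn's implementation; the function names are hypothetical, and tokenization is assumed to be a simple whitespace split.

```python
from collections import Counter


def build_vocabulary(comments, max_features=None):
    """Keep the most frequent terms and map each one to a column index."""
    totals = Counter()
    for comment in comments:
        totals.update(comment.split())
    # Rank by descending corpus frequency, breaking ties alphabetically.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    terms = sorted(term for term, _ in ranked[:max_features])
    return {term: index for index, term in enumerate(terms)}


def vectorize(comment, vocabulary):
    """Turn a comment into a list of term counts over the vocabulary."""
    vector = [0] * len(vocabulary)
    for token in comment.split():
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector


# Toy comments, not taken from the dataset.
comments = ["good video good content", "bad video", "good editing"]
vocabulary = build_vocabulary(comments, max_features=3)
print(vocabulary)
print([vectorize(c, vocabulary) for c in comments])
```

Each comment becomes a fixed-length count vector, and words outside the capped vocabulary are ignored.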

Baseline Model

Random forest was used as the baseline model as it is known for giving good results out of the box.

Experiment Tracking

Experiments were tracked on the MLFlow server. Metrics like accuracy, precision, recall, etc., on the test data were logged for combinations of the following hyperparameters: max_features of CountVectorizer, and n_estimators and max_depth of RandomForestClassifier. Further, a heat map of the confusion matrix, the model, and the data were also logged on MLFlow.

Results

The best test accuracy obtained was around 65%. This needs to be improved. Table 1 shows the classification report of this model.

Table 1: Classification report for the baseline random forest classifier.

Class	Precision	Recall	F1-score	Support
-1	1.00	0.02	0.03	1650
0	0.66	0.85	0.74	2555
1	0.64	0.82	0.72	3154
Accuracy			0.65	7359
Macro Avg	0.77	0.56	0.49	7359
Weighted Avg	0.73	0.65	0.57	7359

Further, Figure 18 shows the heat map of the confusion matrix.

Figure 18: Confusion matrix of the baseline model. Note that here, “0” corresponds to negative comments, “1” corresponds to neutral comments, and “2” corresponds to positive comments.

The precision and recall scores of neutral and positive comments are decent, which has also translated to good corresponding F1-scores. However, the recall for negative comments is terrible. We know that recall is given by

\[ \text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}} \]

So, a recall score of $\approx 0$ implies that the model was unable to correctly classify negative comments as negative. In other words, very few negative comments were classified as negative by the model. This is also clear from the confusion matrix (Figure 18). The model has correctly classified only $28$ negative comments out of a total of $1650$ negative comments. And,

\[ \text{Recall of negative comments} = \frac{28}{28+571+1051} = \frac{28}{1650} \approx 0.02 \]

On the other hand, precision is given by

\[ \text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}} \]

So, we get

\[ \text{Precision of negative comments} = \frac{28}{28+0+0} = 1 \]

So, the model is performing extremely poorly on the negative comments. We need to improve this.
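The arithmetic above can be reproduced with a few lines of Python. The counts (28 true positives, 571 + 1051 false negatives, and no false positives) are the ones quoted from the confusion matrix; the helper names are illustrative.

```python
def precision(tp, fp):
    """True positives divided by all predicted positives."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """True positives divided by all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Counts for the negative class, as quoted from the confusion matrix.
tp = 28
fn = 571 + 1051
fp = 0 + 0

p = precision(tp, fp)
r = recall(tp, fn)
print(round(p, 2), round(r, 2), round(f1_score(p, r), 2))
```

The perfect precision is therefore not a sign of quality: the model almost never predicts the negative class, so the few negative predictions it makes happen to be right.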

One possible reason for this is the class imbalance. The data has fewer negative comments than neutral or positive ones, which is clear from the “Support” column in Table 1. So, this is the direction we will explore first.

Jupyter Notebook

Baseline model training was implemented in this Jupyter notebook.

Improving the Baseline Model

Discussion

Handling Class Imbalance

To reiterate, the normalized value counts in terms of percentage identified during EDA are the following:

Positive comments: 43%
Neutral comments: 35%
Negative comments: 22%

The reason for getting poor performance, especially for the negative comments, can be attributed to this class imbalance.

There are various methods to address this imbalance, e.g., over-sampling, under-sampling, SMOTE, etc. We will try all these. If there is any parameter in a model corresponding to the weight of a class, we will try tweaking that.

Using a More Complex Model

Using complex models like XGBoost, LightGBM, deep learning, etc., can improve the results, especially if applied after handling class imbalance.

Hyperparameter Tuning

Tuning the hyperparameters of more complex models with a Bayesian approach (using Optuna) can improve results significantly.

Use of Ensembling

We can use VotingClassifier or StackingClassifier and combine multiple models to improve the performance.

Feature Engineering

We used a very basic bag of words technique for the baseline model. We can use n-grams in conjunction with bag of words to improve results. We can also use embeddings like Word2vec. We can also use custom features like the number of stop words, the ratio of vowels to consonants, etc., to see if they improve the performance.

Data Preprocessing

Investigating the data deeper may give us more insights about it, which in turn may help improve the model’s performance.

Implementation

Experiment 1:
- Here, we will compare two feature engineering techniques: bag of words and TF-IDF.
- Further, we will also compare unigram, bigram, and trigram results, keeping the value of the hyperparameter max_features fixed.
Experiment 2:
- Once the best feature engineering technique is found, we will find the best value of the hyperparameter max_features.
Experiment 3:
- Here, we will apply various techniques to handle the data imbalance.
- We will try under-sampling, ADASYN, SMOTE, SMOTE with ENN, and using class weights.
Experiment 4:
- Here, we will try out many models on the data. They are: XGBoost, LightGBM, random forest, SVM, logistic regression, $k$-nearest neighbors, and naive Bayes.
- Further, we will also perform some coarse hyperparameter tuning on all these models.
Experiment 5:
- Here, once we find the best performing algorithm from the previous experiment, we will perform a significantly finer hyperparameter tuning on this model.

Experiment 1

This experiment was mainly to determine which feature engineering technique among bag of words and TF-IDF is better. The value of max_features was kept fixed at 5,000 for this experiment. Figure 19 shows the parallel coordinates plot obtained on MLFlow for this experiment.

We can clearly see that the best accuracy and the best recall for negative comments correspond to the combination of bigram with bag of words. So, the best feature engineering technique observed in this experiment is bag of words with bigrams.
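The two feature engineering techniques compared here differ only in how term counts are weighted. The sketch below illustrates this under two assumptions: "bigram" is taken to mean unigrams plus bigrams (the post does not state the exact n-gram range), and TF-IDF is the textbook form tf × log(N / df). scikit-learn's TfidfVectorizer uses a smoothed and normalized variant by default, so its numbers would differ.

```python
import math
from collections import Counter


def ngram_tokens(comment, ngram_range=(1, 2)):
    """Return all n-grams of the comment for every n in the given range."""
    words = comment.split()
    low, high = ngram_range
    return [
        " ".join(words[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(words) - n + 1)
    ]


def bag_of_words(comments, ngram_range=(1, 2)):
    """Raw n-gram counts per comment."""
    return [Counter(ngram_tokens(c, ngram_range)) for c in comments]


def tf_idf(comments, ngram_range=(1, 2)):
    """Textbook TF-IDF weights: count * log(N / document frequency)."""
    counts = bag_of_words(comments, ngram_range)
    n_docs = len(counts)
    # Iterating over a Counter yields each term once per comment.
    doc_freq = Counter(term for doc in counts for term in doc)
    return [
        {term: tf * math.log(n_docs / doc_freq[term]) for term, tf in doc.items()}
        for doc in counts
    ]


# Toy comments, not taken from the dataset.
comments = ["not good", "good video"]
print(bag_of_words(comments)[0])
print(tf_idf(comments)[0])
```

The bigram "not good" is what lets the model separate the first comment from the second, since the unigram "good" appears in both and receives zero weight under this TF-IDF variant.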

Jupyter Notebook

This experiment was performed in this Jupyter notebook.

Experiment 2

This experiment was mainly to determine the value of the hyperparameter max_features that gives the best results considering bag of words with bigrams. The value of this hyperparameter was varied from 1,000 to 10,000 in steps of 1,000. Figure 20 shows the parallel coordinates plot obtained on MLFlow for this experiment.

We can clearly see that the accuracy is higher for lower values of max_features. The same is true for the recall of negative comments. Also, we can see that the recall value is now substantially improved as compared to the previous experiments. Hence, we can conclude that the best value of max_features is 1,000.

Jupyter Notebook

This experiment was performed in this Jupyter notebook.

Experiment 3

The main aim of this experiment was to determine which balancing technique gives the best model performance. As mentioned earlier, the techniques tried were under-sampling, ADASYN, SMOTE, SMOTE with ENN, and using class weights in random forest. Figure 21, Figure 22, and Figure 23 show the parallel coordinates plot obtained on MLFlow for this experiment corresponding to the negative comments, neutral comments, and positive comments, respectively.

Figure 22: Parallel coordinates plot for experiment 3 (neutral comments).

Figure 23: Parallel coordinates plot for experiment 3 (positive comments).

We can clearly see that the under-sampling method is performing the best for all types of comments. It is giving the most balanced performance.
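Random under-sampling simply discards examples from the larger classes until every class is as small as the rarest one. The sketch below shows the idea in plain Python; the experiment itself presumably relied on a library implementation, and the function name and seed are assumptions.

```python
import random
from collections import defaultdict


def random_under_sample(rows, seed=42):
    """Down-sample every class to the size of the smallest class."""
    groups = defaultdict(list)
    for comment, label in rows:
        groups[label].append((comment, label))
    target = min(len(items) for items in groups.values())
    rng = random.Random(seed)  # fixed seed for reproducibility
    balanced = []
    for items in groups.values():
        balanced.extend(rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced


# Toy rows: 5 positive, 3 neutral, 2 negative.
rows = (
    [(f"pos {i}", 1) for i in range(5)]
    + [(f"neu {i}", 0) for i in range(3)]
    + [(f"neg {i}", -1) for i in range(2)]
)
balanced = random_under_sample(rows)
print(len(balanced))
```

Resampling of this kind should be applied to the training split only, so that the test set keeps the original class distribution.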

Jupyter Notebook

This experiment was performed in this Jupyter notebook.

Now, using the best configuration that we have discovered so far, i.e., bag of words with bigrams, 1,000 features, and under-sampling, we will do hyperparameter tuning for all the machine learning models using Optuna.

Experiment 4

As mentioned earlier, we will try out 7 machine learning models on the best configuration learned so far. These models are: XGBoost, LightGBM, random forest, SVM, logistic regression, $k$-nearest neighbors, and naive Bayes. Further, we will also perform a coarse hyperparameter tuning on each of them using Optuna. We will log the best performing version of each of these models as a run on MLFlow. So, we will end up getting a total of 7 runs for this experiment. Figure 24, Figure 25, and Figure 26 show the parallel coordinates plot obtained on MLFlow for this experiment corresponding to the negative comments, neutral comments, and positive comments, respectively.

Figure 25: Parallel coordinates plot for experiment 4 (neutral comments).

Figure 26: Parallel coordinates plot for experiment 4 (positive comments).

From these figures, it is clear that the most reliable and consistent performers are XGBoost, SVM, logistic regression, and LightGBM. This is shown even more clearly in Figure 27.

Figure 27: Parallel coordinates plot for experiment 4. Note that here ‘2’ indicates negative comments, ‘0’ indicates neutral comments, and ‘1’ indicates positive comments.

No single model stands out among them. So, we will instead perform a more extensive and finer hyperparameter tuning on all 4 of these.

Jupyter Notebooks

This experiment was performed in the following Jupyter notebooks:

Experiment 5

As mentioned earlier, we will now perform detailed hyperparameter tuning of LightGBM, XGBoost, SVM, and logistic regression. Figure 28 shows the parallel coordinates plot obtained on MLFlow for this experiment.

Again, even after extensive hyperparameter tuning, there is no clear winner. So, it looks like choosing any model among these four would work. We will choose LightGBM. We also plotted the hyperparameter importances for LightGBM using Optuna. Figure 29 shows this plot.

Figure 29: Hyperparameter importances for LightGBM.

We can see that the top-3 most important hyperparameters are learning_rate, max_depth, and min_child_samples.

Note that the accuracy is still below 80%. We need to improve it further.

Jupyter Notebooks

This experiment was performed in the following Jupyter notebooks:

Improving the LightGBM Model

Techniques Used

We will try the following techniques to improve the model performance:

Balancing data using the class_weight parameter,
Word2Vec,
Creating custom features, e.g., average word length,
Ensemble learning (stacking).

Balancing Using `class_weight`

In LightGBM's LGBMClassifier implementation, the following two parameters can be provided if the data is imbalanced: is_unbalance = True and class_weight = "balanced". With class_weight = "balanced", each class is weighted inversely proportional to its frequency in the training data. Doing this, we can skip the undersampling step and train the model directly on the imbalanced data.
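To make the "balanced" setting concrete, the sketch below computes the weights with the formula that scikit-learn and LightGBM's scikit-learn wrapper document for it, `n_samples / (n_classes * count_of_class)`. The label counts are made up for illustration.

```python
import numpy as np

# Made-up label counts: 0 = neutral, 1 = positive, 2 = negative.
y = np.array([1] * 60 + [0] * 30 + [2] * 10)

classes, counts = np.unique(y, return_counts=True)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = len(y) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

print(class_weight)
# The rarest class (2) gets the largest weight, the most common (1) the smallest.
# With LightGBM this is requested via LGBMClassifier(class_weight="balanced").
```

Each class ends up contributing the same total weight to the loss, which is why no resampling is needed. As far as we can tell from LightGBM's documentation, class_weight is the option intended for multi-class problems, while is_unbalance applies to binary (and one-vs-all) objectives, so for this three-class task class_weight is probably the setting doing the work.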

Jupyter Notebook

This was implemented in this Jupyter notebook.

Using Word2Vec

Word2Vec is a technique for creating vector representations of words. It is a feature engineering technique very commonly used in NLP. We used it with a vector_size of 300, a window of 5, and the skip-gram model. Further, undersampling was used to balance the data.
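To show what the window and skip-gram settings mean, the sketch below generates the (center word, context word) pairs that a skip-gram model learns to predict. It is a simplified illustration with a hypothetical helper `skipgram_pairs`, not the training code itself; real implementations add details such as negative sampling. If the notebook uses gensim, the settings above would correspond to `Word2Vec(..., vector_size=300, window=5, sg=1)`.

```python
def skipgram_pairs(tokens, window=5):
    """Return (center, context) pairs for all tokens within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Made-up comment; a small window keeps the output readable.
tokens = "this video is really helpful".split()
pairs = skipgram_pairs(tokens, window=2)

print(len(pairs))
print(pairs[:4])
```

A larger window (such as 5) pairs each word with more distant neighbours, which tends to capture topical similarity rather than purely local syntax.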

Jupyter Notebook

This was implemented in this Jupyter notebook.

Creating Custom Features

Firstly, we created the following 6 new custom features from the raw data:

comment_length: This represents the number of characters in a comment.
word_count: This represents the number of words in a comment.
avg_word_length: This represents the average word length in a comment.
unique_word_count: This represents the count of words that are unique in a comment.
lexical_diversity: This represents the diversity of words used in a comment.
pos_count: This represents the number of parts of speech used in the comment.

Additional features were also created using the spaCy library. For instance, we computed the proportion of each part of speech (adjectives, verbs, etc.) in a comment. Some proportions turned out to be NaN, possibly because of zeros in the denominator; these were filled with 0. Finally, undersampling was used to balance the data.
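The first five features in the list above can be sketched with plain Python as follows. The exact definitions are assumptions: words are split on whitespace, unique words are counted case-insensitively, and lexical_diversity is taken as unique words divided by total words. Empty comments return 0 instead of dividing by zero, mirroring the NaN filling described above. The spaCy-based part-of-speech features are left out of this sketch.

```python
def basic_text_features(comment):
    """Compute simple length and vocabulary statistics for one comment."""
    words = comment.split()
    word_count = len(words)
    unique_word_count = len({w.lower() for w in words})
    return {
        "comment_length": len(comment),
        "word_count": word_count,
        # Guard against empty comments, where the ratios are undefined.
        "avg_word_length": sum(len(w) for w in words) / word_count if word_count else 0.0,
        "unique_word_count": unique_word_count,
        "lexical_diversity": unique_word_count / word_count if word_count else 0.0,
    }


features = basic_text_features("Great video great content")
print(features)
```

Applied row by row to the comment column, this gives numeric columns that can be concatenated with the bag-of-words features.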

Jupyter Notebook

This was implemented in this Jupyter notebook.

Stacking Classifier

Here, we tried ensemble learning. As Figure 27 indicated, four models, namely XGBoost, SVM, LightGBM, and logistic regression, had almost identical performance. So the plan was to use all four as base learners and place a meta-learner on top of them to form a stacking classifier. We used a $k$-nearest neighbors model as the meta-learner. Further, undersampling was used to balance the data.
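The structure of such a stacking classifier can be sketched with scikit-learn's `StackingClassifier`. To keep the sketch dependent on scikit-learn only, just two of the four base learners are included; the XGBoost and LightGBM classifiers would be added to the list in the same way. The data is synthetic and all hyperparameters are placeholder assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic 3-class data standing in for the vectorized comments.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

base_learners = [
    ("logreg", LogisticRegression(max_iter=1000)),
    ("svm", SVC()),
    # XGBoost and LightGBM classifiers would be added here in the same way.
]

# The k-NN meta-learner is trained on cross-validated outputs of the base learners.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=KNeighborsClassifier(n_neighbors=5),
    cv=3,
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
print(accuracy)
```

Every prediction has to pass through all base learners before the meta-learner runs, which is the source of the extra inference latency of this approach.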

Jupyter Notebook

This was implemented in this Jupyter notebook.

Comparison of the Results

Extensive hyperparameter tuning using Optuna was also done on all the models built using these 4 techniques, and the best version of them was logged on MLFlow. Figure 30 shows the parallel coordinates plot obtained on MLFlow for all four improvement techniques we tried.

Figure 30: Parallel coordinates plot that compares using `class_weight` for balancing, word2vec, custom features, and stacking techniques.

The observations are the following:

Word2vec:
- This technique has the worst performance. Hence, we won’t use this as our final model.
Stacking:
- This resulted in relatively decent performance.
- However, as 4 models are used in this ensemble, the inference latency is high. Hence, we won’t use this as our final model either.
Custom features:
- Using custom features gave us good results.
- However, we needed to use undersampling to balance the classes.
- Also, the recall for the negative comments (indicated with “2” on the plot) is relatively weak.
Using class_weight:
- This gave us the best results without needing to balance the data.
- The recall for the negative comments is also good.
- So, we may want to finalize this.

Building the DVC Pipeline

Recap

We have already created the project directory using the cookiecutter data science template, initialized Git in it, and pushed it to a GitHub repository.

Stages in the DVC Pipeline

Data Ingestion

The raw data is kept in an Amazon S3 bucket. We will carry out the following steps in this stage:

Fetch the raw data from the S3 bucket.
Carry out the following basic data cleaning tasks:
- Drop the missing values,
- Drop the duplicates,
- Remove the empty comments.
Split the data into training and test sets.
Save the training and test sets in the “data/raw” directory.

This is implemented in the data_ingestion.py module.
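The cleaning and splitting steps of this stage can be sketched as follows. This is not the contents of data_ingestion.py: the column names, the 80/20 split, the random seed, and the toy rows are assumptions, and the S3 download is omitted so that the sketch runs offline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def clean_comments(df, text_col="comment"):
    """Drop missing values, duplicates, and empty comments."""
    df = df.dropna()
    df = df.drop_duplicates()
    # Treat whitespace-only comments as empty.
    df = df[df[text_col].str.strip() != ""]
    return df.reset_index(drop=True)


# Made-up rows standing in for the raw data fetched from S3.
raw = pd.DataFrame(
    {
        "comment": ["great video", "great video", None, "   ",
                    "too long", "ok", "nice", "bad audio"],
        "label": [1, 1, 0, 0, 2, 0, 1, 2],
    }
)

clean = clean_comments(raw)
train_df, test_df = train_test_split(clean, test_size=0.2, random_state=42)

print(len(clean), len(train_df), len(test_df))
# The two splits would then be written to the "data/raw" directory,
# e.g. with train_df.to_csv(...) under hypothetical file names.
```

Keeping the cleaning logic in a small function like this makes it easy to call from the stage's entry point and to test separately from the S3 access.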