YouTube Comments Analyzer

A Chrome extension for sentiment analysis on YouTube comments.

Project Planning

Problem Statement

Business Context

Consider that we are an influencer management company seeking to expand our network by attracting more influencers to join our platform. Due to a limited marketing budget, traditional advertising channels are not viable for us. To overcome this, we aim to offer a solution that addresses a significant pain point for influencers, thereby encouraging them to engage with our company.

Business Problem

Need to Attract More Influencers

  • Objective:
    • Increase our influencer clientele to enhance our service offerings to brands and stay competitive.
  • Challenge:
    • Limited marketing budget restricts our ability to reach and engage potential influencer clients through conventional means.

Identifying Influencer Pain Point

  • Understanding Influencer Challenges:
    • To effectively attract influencers, we need to understand and address the key challenges they face.
  • Research Insight:
    • Influencers, especially those with large followings, struggle with managing and interpreting the vast amount of feedback they receive via comments on their content.

Big Influencers Face Issues with Comments Analysis

  • Volume of Comments:
    • High-profile influencers receive thousands of comments on their videos, making manual analysis impractical.
  • Time Constraints:
    • Influencers often lack the time to sift through comments to extract meaningful insights.
  • Impact on Content Strategy:
    • Without efficient comment analysis, influencers miss opportunities to understand audience sentiment, address concerns, and tailor their content effectively.

Our Solution

To directly address the significant pain point faced by big influencers—managing and interpreting vast amounts of comments data—we present the “Influencer Insights” Chrome extension. This tool is designed to empower influencers by providing in-depth analysis of their YouTube video comments, helping them make data-driven decisions to enhance their content and engagement strategies.

Key Features of the Extension

Sentiment Analysis of Comments

  • Real-Time Sentiment Classification:
    • The extension performs real-time analysis of all comments on a YouTube video, classifying each as positive, neutral, or negative.
  • Sentiment Distribution Visualization:
    • Displays the overall sentiment distribution with intuitive graphs or charts (e.g., pie charts or bar graphs showing percentages like 70% positive, 20% neutral, 10% negative).
  • Detailed Sentiment Insights:
    • Allows users to drill down into each sentiment category to read specific comments classified under it.
  • Trend Tracking:
    • Monitors how sentiment changes over time, helping influencers identify how different content affects audience perception.

Additional Comments Analysis Features

  • Word Cloud Visualization:
    • Generates a word cloud showcasing the most frequently used words and phrases in the comments.
    • Helps quickly identify trending topics, keywords, or recurring themes.
  • Average Comment Length:
    • Calculates and displays the average length of comments, indicating the depth of audience engagement.
  • Export Data Functionality:
    • Enables users to export analysis reports and visualizations in various formats (e.g., PDF, CSV) for further use or sharing with team members.

Challenges

Data

The first challenge any data science project faces concerns the data itself. Several data-related problems can arise in this project, some of which are outlined below.

Availability

The problem we are solving is a supervised machine learning problem, so we need a labeled dataset in which each comment is paired with its sentiment label, i.e., positive, negative, or neutral. Finding such data is a challenge. One option is to use the YouTube Data API, but that only gets us the comments, not the associated labels; there is no readily available dataset of YouTube comments with sentiment labels. So, we will be using the Reddit dataset from Kaggle. This Kaggle dataset also contains Twitter data, which we won't use because, during EDA, we discovered that it is highly political. The Reddit data, on the other hand, is comparatively less political, so a model trained on it should generalize better.
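
As a rough sketch, loading this dataset might look like the following. The file name reddit_comments.csv and the text column name clean_comment are placeholders (the actual Kaggle file may differ); the label column category takes the values -1, 0, and +1, as described later.

import pandas as pd

# Load the Reddit sentiment dataset downloaded from Kaggle.
# File name and text column name are assumptions; adjust to the actual download.
df = pd.read_csv("reddit_comments.csv")
print(df["category"].value_counts())   # -1: negative, 0: neutral, +1: positive
print(df["clean_comment"].head())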

Lack of General Kind of Dataset

We cannot get a universal dataset that is representative of all types of content on YouTube. This means that our model won't generalize well across all types of YouTube videos.

Multi-Language Comments

YouTube comments can be written in many languages, and the prevalence of these languages varies. Some comments mix multiple languages. Further, some comments use the English script to type other languages, e.g., Hindi words written in English letters. This makes it quite difficult to train a model that covers all languages.

Spam and Bot Comments

Identifying and removing such meaningless comments from the training data is difficult. These comments can bias the model and worsen its accuracy.

Slang, Emoji, and Informal Comments

Understanding the sentiment of such comments is a challenge.

Sarcastic Comments

It is easy for humans to understand sarcastic comments because they already know the context. However, the same is very difficult for a model to learn. Hence, such comments are also a problem during training.

Evolving Language Usage

Language evolves with each generation. Many terms introduced by Generation Z (Gen-Z) were not used before, so comments by Gen-Z users differ from those of other generations. This is an instance of data drift, and modeling this behavior is difficult.

Privacy and Data Compliance

If you build a model using some data, you are answerable to any company that uses your model. So, the source of the data, including its licensing and privacy guarantees, is crucial.

Building an Efficient Model

As the previous subsection shows, building an efficient model is already challenging. On top of that, noise, variability, class imbalance, etc., can make training an efficient model even more difficult.

Latency

Whenever someone opens a YouTube video, we want the extension to return results quickly; in other words, the latency of the model must be low. This is a challenge given the complexity of the problem, and we will almost certainly need asynchronous programming.
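
As a minimal sketch of this idea, assuming a Flask endpoint at a hypothetical URL and the aiohttp library, comments could be scored concurrently instead of sequentially:

import asyncio
import aiohttp

API_URL = "http://localhost:5000/predict"  # hypothetical Flask endpoint

async def predict_sentiment(session, comment):
    # POST one comment to the sentiment API and return the predicted label.
    async with session.post(API_URL, json={"comment": comment}) as response:
        return await response.json()

async def predict_all(comments):
    # Issue all requests concurrently instead of awaiting them one by one.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(predict_sentiment(session, c) for c in comments))

# results = asyncio.run(predict_all(["Great video!", "Not my thing."]))

In practice, batching all comments into a single request may be preferable to one request per comment; the sketch only illustrates the asynchronous pattern.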

User Experience

We need to make sure that the user experience is good; if it is poor, users will simply avoid the extension. Hence, we will design it accordingly.

Workflow

The steps in the workflow of this project are:

  1. Data collection,
  2. Data preprocessing,
  3. EDA,
  4. Model building, hyperparameter tuning and evaluation alongside experiment tracking,
  5. Building a DVC pipeline,
  6. Registering the model,
  7. Building the API using Flask,
  8. Developing the Chrome extension,
  9. Setting up CI/CD pipeline,
  10. Testing,
  11. Building the Docker image and pushing to ECR,
  12. Deployment using AWS.

Tools and Technologies

We will use the following tools and technologies in this project.

Version Control and Collaboration

Git

  • Purpose:
    • Distributed version control system for tracking changes in source code.
  • Usage:
    • Manage codebase, track changes, and collaborate with team members.

GitHub

  • Purpose:
    • Hosting service for Git repositories with collaboration features.
  • Usage:
    • Store repositories, manage issues, pull requests, and facilitate team collaboration.

Data Management and Versioning

DVC (Data Version Control)

  • Purpose:
    • Version control system for tracking large datasets and machine learning models.
  • Usage:
    • Version datasets and machine learning pipelines, enabling reproducibility and collaboration.

AWS S3 (Simple Storage Service)

  • Purpose:
    • Scalable cloud storage service.
  • Usage:
    • Store datasets, pre-processed data, and model artifacts tracked by DVC.

Machine Learning and Experiment Tracking

Python

  • Purpose:
    • Programming language for backend development and machine learning.

Machine Learning Libraries

  • scikit-learn:
    • Purpose: Library for classical machine learning algorithms.
    • Usage: Implement baseline models and preprocessing techniques.

NLP Libraries

  • NLTK (Natural Language Toolkit):
    • Purpose: Platform for building Python programs to work with human language data.
    • Usage: Tokenization, stemming, and other basic NLP tasks.
  • spaCy:
    • Purpose: Industrial-strength NLP library.
    • Usage: Advanced NLP tasks like named entity recognition, part-of-speech tagging.

MLflow

  • Purpose:
    • Platform for managing the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry.
  • Usage:
    • Track experiments, log parameters, metrics, and artifacts; manage model versions.

MLflow Model Registry

  • Purpose:
    • Component of MLflow for managing the full lifecycle of ML models.
  • Usage:
    • Register models, manage model stages (e.g., staging, production), and collaborate on model development.

Optuna

  • Purpose:
    • Hyperparameter optimization framework.
  • Usage:
    • Tune model hyperparameters efficiently (e.g., using a Bayesian approach).

Continuous Integration / Continuous Delivery (CI/CD)

GitHub Actions

  • Purpose:
    • Automation platform that enables CI/CD directly from GitHub repositories.
  • Usage:
    • Automate testing, building, and deployment pipelines.
    • Trigger workflows on events like code commits or pull requests.

Cloud Services and Infrastructure

AWS (Amazon Web Services)

  • AWS EC2 (Elastic Compute Cloud):
    • Purpose: Scalable virtual servers in the cloud.
    • Usage: Host backend services, APIs, and model servers.
  • AWS Auto Scaling Groups:
    • Purpose: Automatically adjust the number of EC2 instances to handle load changes.
    • Usage:
      • Ensure that the application scales out during demand spikes to maintain performance.
      • Scale in during low demand periods to reduce costs.
      • Maintain application availability by automatically adding or replacing instances as needed.
  • AWS CodeDeploy:
    • Purpose: Deployment service that automates application deployments to various compute services like EC2, Lambda, and on-premises servers.
    • Usage:
      • Automate the deployment process of backend services and machine learning models to AWS EC2 instances or AWS Lambda.
      • Integrate with GitHub Actions to create a seamless CI/CD pipeline that deploys code changes automatically upon successful testing.
  • AWS CloudWatch:
    • Purpose: Monitoring and observability service.
    • Usage: Monitor application logs, set up alerts, and track performance metrics.
  • AWS IAM (Identity and Access Management):
    • Purpose: Securely manage access to AWS services.
    • Usage: Control access permissions for users and services.

Programming Languages and Libraries

Python

  • Purpose:
    • Backend development, data processing, machine learning.
  • Usage:
    • Implement APIs, machine learning models, data pipelines.

JavaScript

  • Purpose:
    • Frontend development, especially for web applications and browser extensions.
  • Usage:
    • Develop the Chrome extension’s user interface and functionality.

HTML and CSS

  • Purpose:
    • Markup and styling languages for web content.
  • Usage:
    • Structure and style the Chrome extension’s interface.

Data Processing Libraries

  • Pandas:
    • Purpose: Data manipulation and analysis.
    • Usage: Handle tabular data, preprocess datasets.
  • NumPy:
    • Purpose: Fundamental package for scientific computing with Python.
    • Usage: Perform numerical operations, handle arrays.

Frontend Development Tools

Chrome Extension APIs

  • Purpose:
    • APIs provided by Chrome for building extensions.
  • Usage:
    • Interact with browser features, modify web page content, manage extension behavior.

Browser Developer Tools

  • Purpose:
    • Built-in tools for debugging and testing web applications.
  • Usage:
    • Inspect elements, debug JavaScript, monitor network activity.

Code Editors and IDEs

  • Visual Studio Code:
    • Purpose: Source code editor.
    • Usage: Write and edit code for both frontend and backend development.

Testing and Quality Assurance Tools

Testing Frameworks

  • Pytest:
    • Purpose: Testing framework for Python.
    • Usage: Write and run unit tests for backend code and data processing scripts.
  • Unittest:
    • Purpose: Built-in Python testing framework.
    • Usage: Write unit tests for Python code.
  • Jest:
    • Purpose: JavaScript testing framework.
    • Usage: Write and run tests for JavaScript code in the Chrome extension.

Project Management and Communication

Project Management Tools

  • Jira:
    • Purpose: Issue and project tracking software.
    • Usage: Manage tasks, track progress, and coordinate team activities.

Communication Tools

  • Slack:
    • Purpose: Team communication platform.
    • Usage: Facilitate real-time communication among team members.
  • Microsoft Teams:
    • Purpose: Collaboration and communication platform.
    • Usage: Chat, meet, call, and collaborate in one place.

DevOps and MLOps Tools

Docker

  • Purpose:
    • Containerization platform.
  • Usage:
    • Package applications and dependencies into containers for consistent deployment.

Security and Compliance

SSL/TLS Certificates

  • Purpose:
    • Secure communications over a computer network.
  • Usage:
    • Encrypt data between users and backend services.

Monitoring and Logging

Logging Tools

  • AWS CloudWatch Logs:
    • Purpose: Monitor, store, and access log files.
    • Usage: Collect and monitor logs from AWS resources.

Monitoring Tools

  • Prometheus (Optional):
    • Purpose: Open-source monitoring system.
    • Usage: Collect and store metrics, generate alerts.
  • Grafana:
    • Purpose: Visualization and analytics software.
    • Usage: Create dashboards to visualize metrics.

API Development and Testing

Frameworks

  • Flask:
    • Purpose: Lightweight WSGI web application framework.
    • Usage: Build RESTful APIs for backend services.

API Testing Tools

  • Postman:
    • Purpose: API development environment.
    • Usage: Design, test, and document APIs.

Code Quality and Documentation

Code Linters and Formatters

  • Pylint:
    • Purpose: Code analysis for Python.
    • Usage: Enforce coding standards, detect code smells.

Documentation Generation

  • Sphinx:
    • Purpose: Generate documentation from source code.
    • Usage: Create project documentation automatically.

Additional Tools and Libraries

Visualization Libraries

  • Matplotlib:
    • Purpose: Plotting library for Python.
    • Usage: Create static, animated, and interactive visualizations.
  • Seaborn:
    • Purpose: Statistical data visualization.
    • Usage: Draw attractive statistical graphics via a high-level interface.
  • D3.js:
    • Purpose: JavaScript library for producing dynamic, interactive data visualizations.
    • Usage: Create word clouds and other visual elements in the Chrome extension.

Data Serialization Formats

  • JSON:
    • Purpose: Lightweight data interchange format.
    • Usage: Transfer data between frontend and backend services.

EDA & Preprocessing

Introduction

Data preprocessing and EDA were performed in cycles. First, basic data preprocessing was done; then, based on the results, EDA was carried out, and this cycle was repeated.

Missing Values

Explicit NaNs

Out of the 37,249 comments, only 100 contained explicit missing values (NaN). The sentiment associated with them was neutral. These were dropped without any problem, as their proportion was minuscule.

Empty Comments

Six comments in the data consisted of nothing but whitespace. They were also dropped.

Duplicates

350 comments were duplicates, which were again dropped without any issues.
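
A minimal pandas sketch of these three cleanup steps (assuming, as before, that the text column is named clean_comment):

# Drop explicit NaNs, whitespace-only comments, and duplicates.
df = df.dropna(subset=["clean_comment"])
df = df[df["clean_comment"].str.strip() != ""]
df = df.drop_duplicates(subset=["clean_comment"])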

Converting Comments to Lowercase

It is important to convert all the words in the comments to lowercase because we do not want our model to distinguish between the words “This” and “this”.

Leading or Trailing Whitespaces

32,266 comments in the data had leading or trailing whitespaces, which is a massive proportion of the whole. Hence, these whitespaces were removed by using the strip method.

Comments with URLs

As URLs are not helpful in doing sentiment analysis, they should be removed from the comments. However, on checking using a regular expression, we found that there were no URLs present in the data.

Newline and Tab Characters

Newline and tab characters were replaced with a whitespace in all the comments.
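
The following sketch covers these basic preprocessing steps in one place (column name assumed as before):

# Lowercase and strip leading/trailing whitespace.
df["clean_comment"] = df["clean_comment"].str.lower().str.strip()

# Check for URLs; in this data the count turned out to be zero.
url_pattern = r"https?://\S+|www\.\S+"
print(df["clean_comment"].str.contains(url_pattern, regex=True).sum())

# Replace newline and tab characters with a whitespace.
df["clean_comment"] = df["clean_comment"].str.replace(r"[\n\t]", " ", regex=True)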

Class Imbalance

The column corresponding to the sentiment of the comment is called category, and it takes the following three values:

  • -1: For negative comments,
  • 0: For neutral comments,
  • +1: For positive comments.

The class imbalance was visualized using a count plot, which is shown in Figure 1.

Figure 1: Class imbalance.

We found that 43% of the comments had positive sentiment, 35% had neutral sentiment, and 22% had negative sentiment.

Creating a Column for Word Count

We created a column called word_count containing the number of words in each comment. The aim was to investigate whether the length of a comment has any predictive power for its sentiment.
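
A sketch of how this column can be computed, along with the per-sentiment medians used in the plots below:

# Number of words per comment.
df["word_count"] = df["clean_comment"].str.split().str.len()

# Median word count per sentiment class.
print(df.groupby("category")["word_count"].median())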

Distribution of Word Count

Figure 2 shows the distribution of the word count.

Figure 2: Distribution of word count.

Clearly, it has a huge right skew. The overwhelming majority of comments have very few words, but a small proportion of comments have a very large number of words.

Distribution of Word Count By Sentiment

Figure 3 shows the distribution of word count for each sentiment.

Figure 3: Distribution of word count by sentiment.

We can see the following:

  • Neutral comments: These comments tend to have very few words; their distribution is concentrated around shorter comments.
  • Positive comments: The word count of these comments has a wider spread, indicating that longer comments are more common among positive comments.
  • Negative comments: These comments have a distribution similar to positive comments.

Box Plot of Word Count By Sentiment

Figure 4 shows the box plot of word count for each sentiment.

Figure 4: Box plot of word count by sentiment.

We can see that positive and negative comments have more outliers than neutral comments, which was also visible in Figure 3. Some further observations:

  • Neutral comments: The median word_count for these comments is the lowest, with a tighter IQR. This suggests that neutral comments are generally shorter.
  • Positive comments: The median word_count for these comments is relatively high, and there are several outliers with longer comments. This indicates that positive comments tend to be more verbose.
  • Negative comments: The word_count distribution of these comments is similar to the positive comments, but with a slightly lower median and fewer extreme outliers.

Bar Plot of Median Word Count By Sentiment

Figure 5 shows a bar plot of the median word count for each sentiment.

Figure 5: Bar plot of median word count by sentiment.

Again, the bar plot corroborates that the distribution of neutral comments differs from that of positive and negative comments.

Creating a Column for Stop Words

Using the nltk library, a new column called num_of_stop_words was created, containing the number of stop words in each comment.
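
A sketch of this step using nltk's English stop word list:

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# Number of stop words per comment.
df["num_of_stop_words"] = df["clean_comment"].apply(
    lambda comment: sum(1 for word in comment.split() if word in stop_words)
)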

Distribution of Stop Words

Figure 6 shows the distribution of stop words in the data.

Figure 6: Distribution of stopwords.

This distribution also has a huge right skew, just like the column word_count.

Distribution of Stop Words By Sentiment

Figure 7 shows the distribution of stop words by sentiment.

Figure 7: Distribution of stopwords by sentiment.

Again, the behavior is very similar to that of the word_count column.

Bar Plot of Median of Stop Words By Sentiment

Figure 8 shows a bar plot of the median of stop words for each sentiment.

Figure 8: Bar plot of median stop words by sentiment.

Again, the behavior is very similar to that of the word_count column.

Top-25 Most Common Stop Words

Figure 9 shows the top-25 most common stop words in the data.

Figure 9: Top-25 most common stop words.

We can see that the word "not" is used quite a lot in the comments. This word can completely change the sentiment of a sentence, and words like "but", "however", "no", and "yet" have the same effect; these may also be present in the comments.

Removing Stop Words

As mentioned in the previous subsection, there are stop words like "not" in the data that can completely reverse the sentiment of a sentence, so retaining such stop words is crucial, and that is exactly what was done: stop words were removed using the set of all English stop words available in the nltk library, except for "not", "but", "however", "no", and "yet".
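
A sketch of this step, keeping the five sentiment-reversing stop words:

from nltk.corpus import stopwords

retain = {"not", "but", "however", "no", "yet"}
stop_words = set(stopwords.words("english")) - retain

# Remove all stop words except the retained ones.
df["clean_comment"] = df["clean_comment"].apply(
    lambda comment: " ".join(w for w in comment.split() if w not in stop_words)
)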

Creating a Column for Number of Characters

A new column called num_chars was created, containing the number of characters in each comment. Here, it was seen that the data contained many non-English special characters, which were removed using a regular expression pattern.
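
A sketch of both steps; the exact regular expression is an assumption (this one keeps English letters, digits, whitespace, and a few punctuation characters):

# Number of characters per comment.
df["num_chars"] = df["clean_comment"].str.len()

# Remove non-English special characters.
df["clean_comment"] = df["clean_comment"].str.replace(
    r"[^A-Za-z0-9\s!?.,]", "", regex=True
)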

Checking Punctuations

It turned out that the data already had all the punctuation characters removed.

Most Common Bigrams and Trigrams

Figure 10 and Figure 11 show the top-25 most common bigrams and trigrams in the data, respectively.

Figure 10: Most common bigrams.
Figure 11: Most common trigrams.

Lemmatization

All the words were converted to their root form using the WordNetLemmatizer class in the stem module of the package nltk.
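
A sketch of this step:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

# Reduce each word to its root form.
df["clean_comment"] = df["clean_comment"].apply(
    lambda comment: " ".join(lemmatizer.lemmatize(w) for w in comment.split())
)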

Word Clouds

Word clouds are helpful tools for visualizing the prevalence of words in the data.
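
The plots below can be generated with the wordcloud library; a minimal sketch for the whole corpus:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = " ".join(df["clean_comment"])
wc = WordCloud(width=800, height=400, background_color="white").generate(text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

For the per-sentiment word clouds, the same code is applied to the subset df[df["category"] == c] for each class c.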

Word Cloud for the Entire Data

Figure 12 shows the word cloud plot for the entire data.

Figure 12: Word cloud for the entire data.

Word Cloud for Positive Comments

Figure 13 shows the word cloud plot for the positive comments.

Figure 13: Word cloud for positive comments.

Word Cloud for Neutral Comments

Figure 14 shows the word cloud plot for the neutral comments.

Figure 14: Word cloud for neutral comments.

Word Cloud for Negative Comments

Figure 15 shows the word cloud plot for the negative comments.

Figure 15: Word cloud for negative comments.

From the word cloud plots, it looks like the types of words are more or less the same irrespective of the sentiment of the comment.

Most Frequent Words

Figure 16 shows the top-20 most frequent words in the data.

Figure 16: Top-20 most frequent words in the data.

Most Frequent Words By Sentiment

Figure 17 shows the top-20 most frequent words in the data for each sentiment.

Figure 17: Top-20 most frequent words in the data by sentiment.

It looks like the distribution of the most frequent words is more or less the same irrespective of the sentiment of the comment, which was also apparent from the word cloud plots. Further, there seem to be a lot of political comments, so the model may perform better on political videos than on others.

Jupyter Notebook

This is the Jupyter notebook in which EDA was performed.

Training a Baseline Model

Introduction

We will now create a baseline model for sentiment analysis. Creating such a model is helpful because it is the simplest possible model and gives us a benchmark against which more complex future models can be compared. We will also set up an MLflow tracking server on DagsHub to track all the experiments and evaluate various models. It will also help improve the baseline model's performance using techniques like hyperparameter tuning, and will further help plan future experiments.

Creating a GitHub Repository

The local directory was created using the Cookiecutter Data Science template. A new Python environment was created inside this directory using the following command:

conda create -p venv python=3.11 -y

After activating this environment, all the required libraries were listed in the requirements.txt file and installed using the following command:

pip install -r requirements.txt

Git was initialized in this directory, and it was pushed to a new GitHub repository.

Creating a DagsHub Repository

A DagsHub repository was created by simply connecting the GitHub repository.

Preprocessing

The same preprocessing that was performed during EDA was performed on the data.

Feature Engineering

As this is just a baseline model, the simple feature engineering technique of bag of words was applied using the CountVectorizer class from scikit-learn.
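
A sketch of this step (the max_features value is illustrative; it was varied during experiment tracking):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=10000)  # illustrative value
X = vectorizer.fit_transform(df["clean_comment"])
y = df["category"]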

Baseline Model

Random forest was used as the baseline model as it is known for giving good results out of the box.

Experiment Tracking

Experiments were tracked on the MLflow server. Metrics like accuracy, precision, recall, etc., on the test data were logged for combinations of the following hyperparameters: max_features of CountVectorizer, and n_estimators and max_depth of RandomForestClassifier. Further, a heat map of the confusion matrix, the model, and the data were also logged to MLflow.
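
A sketch of one such tracked run, continuing from the vectorized features above (the tracking URI, experiment name, and hyperparameter values are placeholders):

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

mlflow.set_tracking_uri("https://dagshub.com/<user>/<repo>.mlflow")  # placeholder URI
mlflow.set_experiment("baseline-model")  # placeholder name

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 15}  # illustrative values
    mlflow.log_params(params)
    model = RandomForestClassifier(**params, random_state=42)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    mlflow.log_metric("test_accuracy", accuracy)
    mlflow.sklearn.log_model(model, "model")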

Results

The best test accuracy obtained was around 65%. This needs to be improved. Table 1 shows the classification report of this model.

Class         Precision   Recall   F1-score   Support
-1            1.00        0.02     0.03       1650
0             0.66        0.85     0.74       2555
1             0.64        0.82     0.72       3154
Accuracy                           0.65       7359
Macro avg     0.77        0.56     0.49       7359
Weighted avg  0.73        0.65     0.57       7359

Table 1: Classification report.

Further, Figure 18 shows the heat map of the confusion matrix.

Figure 18: Confusion matrix of the baseline model. Note that here, "0" corresponds to negative comments, "1" corresponds to neutral comments, and "2" corresponds to positive comments.

The precision and recall scores of neutral and positive comments are decent, which has also translated to good corresponding F1-scores. However, the recall for negative comments is terrible. We know that recall is given by

\[\begin{equation*} \text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{false negatives}} \end{equation*}\]

So, a recall score of \(\approx 0\) implies that the model was hardly able to classify negative comments as negative. This is also clear from the confusion matrix (Figure 18): the model correctly classified only \(28\) out of a total of \(1650\) negative comments. And,

\[\begin{equation*} \text{Recall of negative comments} = \frac{28}{28+571+1051} = \frac{28}{1650} \approx 0.02 \end{equation*}\]

On the other hand, precision is given by

\[\begin{equation*} \text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{false positives}} \end{equation*}\]

So, we get

\[\begin{equation*} \text{Precision of negative comments} = \frac{28}{28+0+0} = 1 \end{equation*}\]

So, the model is performing extremely poorly on the negative comments. We need to improve this.

One likely cause is class imbalance: the data has fewer negative comments than neutral and positive ones, as the "Support" column in Table 1 shows. So, this is the direction we need to head in.

Jupyter Notebook

Baseline model training was implemented in this Jupyter notebook.

Improving the Baseline Model

Discussion

Handling Class Imbalance

To reiterate, the class proportions (normalized value counts) identified during EDA are the following:

  • Positive comments: 43%
  • Neutral comments: 35%
  • Negative comments: 22%

The reason for getting poor performance, especially for the negative comments, can be attributed to this class imbalance.

There are various methods to address this imbalance, e.g., over-sampling, under-sampling, SMOTE, etc. We will try all of these. If a model exposes a parameter for class weights, we will also try tweaking it. A sketch of two of these options follows.
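
A sketch of two of these options, assuming the imbalanced-learn library for SMOTE and the earlier train/test split (re-sampling is applied to the training split only):

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Option 1: synthesize new minority-class samples with SMOTE.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Option 2: re-weight classes inside the model itself.
model = RandomForestClassifier(class_weight="balanced")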

Using a More Complex Model

Using complex models like XGBoost, LightGBM, deep learning, etc., can improve the results, especially if applied after handling class imbalance.

Hyperparameter Tuning

Hyperparameter tuning using a Bayesian approach (by using Optuna) of more complex models can improve results significantly.

Use of Ensembling

We can use VotingClassifier or StackingClassifier to combine multiple models and improve performance, as in the sketch below.
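
A minimal sketch of soft voting over three of the models considered in this document (the choice of estimators and their settings is illustrative):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
        ("nb", MultinomialNB()),
    ],
    voting="soft",  # average the predicted class probabilities
)
ensemble.fit(X_train, y_train)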

Feature Engineering

We used a very basic bag of words technique for the baseline model. We can use n-grams in conjunction with bag of words to improve results. We can also try embeddings like Word2vec, as well as custom features such as the number of stop words or the ratio of vowels to consonants, to see if they improve performance.

Data Preprocessing

Investigating the data deeper may give us more insights about it, which in turn may help improve the model’s performance.

Implementation

  • Experiment 1:
    • Here, we will compare two feature engineering techniques: bag of words and TF-IDF.
    • Further, we will also compare unigram, bigram, and trigram results, keeping the value of the hyperparameter max_features fixed. (A rough sketch follows this list.)
  • Experiment 2:
    • Once the best feature engineering technique is found, we will find the best value of the hyperparameter max_features.
  • Experiment 3:
    • Here, we will apply various techniques to handle the data imbalance.
    • We will try under-sampling, ADASYN, SMOTE, SMOTE with ENN, and using class weights.
  • Experiment 4:
    • Here, we will try out many models on the data: XGBoost, LightGBM, random forest, SVM, logistic regression, k-nearest neighbors, and naive Bayes.
    • Further, we will also perform some coarse hyperparameter tuning on all these models.
  • Experiment 5:
    • Once we find the best-performing algorithm from the previous experiment, we will perform significantly finer hyperparameter tuning on this model.
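
A rough sketch of the Experiment 1 loop (the fixed max_features value is illustrative):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

for vec_name, Vectorizer in [("bow", CountVectorizer), ("tfidf", TfidfVectorizer)]:
    for ngram_range in [(1, 1), (1, 2), (1, 3)]:  # unigrams; up to bigrams; up to trigrams
        vectorizer = Vectorizer(ngram_range=ngram_range, max_features=5000)
        X = vectorizer.fit_transform(df["clean_comment"])
        # ...train a model on X and log metrics to MLflow for this configuration...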

Experiment 1

(Writing in Progress…)
