Resume/CV

Contact


Summary

Data Scientist and Machine Learning Engineer with strong experience building production-grade, end-to-end data and ML systems spanning data ingestion, feature engineering, modeling, MLOps, and scalable deployment. Hands-on expertise across Retrieval-Augmented Generation (RAG), time-series forecasting, signal processing, NLP, recommender systems, spatiotemporal modeling, and cloud-based analytics engineering, with multiple systems deployed using AWS and modern DevOps workflows.

Proven track record of designing reproducible ML pipelines using DVC, MLflow (DagsHub), Optuna, Docker, and CI/CD, and delivering inference services through FastAPI, Streamlit, Gradio, and containerized AWS infrastructure. Experienced in RAG system design (hybrid retrieval, reranking, citation enforcement, automated evaluation with Ragas), serverless ETL architectures (S3, Lambda, Glue, Athena, Redshift), experiment-driven model development, and translating research-driven methods into reliable production systems.

Strong foundation in statistical modeling and deep learning, with experience implementing neural architectures using TensorFlow and PyTorch for sequence modeling and spatiotemporal forecasting, and building LLM-powered applications using LangChain, OpenAI, Google Gemini, and Anthropic Claude APIs.


Work Experience

Machinery Monitoring Systems LLC — Remote, USA

Data Science / Machine Learning Intern
August 2025 – December 2025

Built a real-time end-to-end system for automated fault detection in rotating machinery using multivariate accelerometer time-series. Designed the full workflow covering data collection, preprocessing, feature engineering, modeling, real-time inference, and UI integration, with iterative experimentation using Agile practices.

Key contributions:

  • Built a full data collection pipeline for a 3-axis accelerometer connected to a rotor kit, capturing 1024-sample waveform frames at high sampling rates.
  • Developed a threaded serial-communication system (Python) to parse raw hex sensor streams, reconstruct waveforms, remove DC components, compute rFFT features, and store synchronized X–Y–Z triplets.
  • Collected and curated labeled datasets for normal, bearing fault, unbalance, misalignment, and looseness conditions; maintained clean, reproducible dataset organization.
  • Engineered time-domain and frequency-domain representations enabling dual-branch modeling (waveform + FFT).
  • Trained and evaluated advanced models including MiniROCKET + Ridge, 1D CNNs, InceptionTime, and fused waveform–FFT architectures.
  • Achieved ≈99% accuracy, F1, precision, recall (sensitivity), and specificity on multi-class fault prediction (reported internally).
  • Built a Streamlit UI for real-time inference, waveform + FFT visualizations, confusion matrices, and model comparisons.
  • Conducted representation analysis using PCA and t-SNE to visualize fault separability in low-dimensional embeddings.
  • Followed software engineering best practices: modular codebase, OOP design, version control, reproducible pipelines, logging, and iterative experimentation.

Keywords: time-series classification, vibration analysis, signal processing, rFFT, streaming inference, MiniROCKET, CNN, InceptionTime, Streamlit, Python, MLOps, reproducibility, PCA, t-SNE, Agile
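The preprocessing steps above (DC removal, rFFT feature computation) can be sketched in a few lines. This is an illustrative example only: the frame length matches the 1024-sample frames described, but the sampling rate and function name are hypothetical.

```python
import numpy as np

def fft_features(frame, fs=1024.0):
    """Turn a raw accelerometer frame into frequency-domain features.

    Sketch under assumptions: fs (sampling rate) is a placeholder,
    not the rate used in the actual system.
    """
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()           # remove the DC component
    spectrum = np.abs(np.fft.rfft(frame))  # one-sided magnitude spectrum
    freqs = np.fft.rfftfreq(frame.size, d=1.0 / fs)
    return freqs, spectrum
```

A 1024-sample frame yields 513 frequency bins, with the DC bin near zero after mean removal.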


Indian Institute of Tropical Meteorology (IITM) — Pune, India

MS Student Data Science Intern / Research Intern
January 2022 – January 2023

  • Developed spatiotemporal ConvLSTM deep learning models on large-scale satellite imagery to forecast environmental risk events (stubble burning and forest fires), achieving a 0.80 one-day-ahead temporal correlation across geospatial grids.
  • Engineered end-to-end data pipelines for ingestion, preprocessing, transformation, and validation of multi-source environmental and satellite datasets, improving reproducibility and downstream modeling reliability.
  • Harmonized heterogeneous spatial datasets by resolving mismatched spatial resolutions using interpolation techniques (e.g., linear interpolation) to ensure geospatial alignment across inputs.
  • Trained deep spatiotemporal models on multi-GPU high-performance computing (HPC) clusters, enabling scalable experimentation on terabyte-scale environmental data.
  • Conducted exploratory data analysis and advanced feature engineering to uncover spatiotemporal patterns and validate modeling assumptions in collaboration with climate scientists.
  • Built predictive models for stubble burning events in the Delhi–Punjab–Haryana region to support proactive air quality mitigation strategies.
  • Designed forest fire forecasting models for northeast and central India, contributing to data-driven environmental risk monitoring and resource planning.
  • Applied classical and deep learning time-series models (ARIMA, LSTM) to air pollution data from New Delhi, improving understanding of pollutant dynamics and forecast accuracy.
  • Communicated technical findings to non-technical environmental and climate researchers through clear visualizations, storytelling, and recurring research presentations.
  • Co-authored a peer-reviewed research publication, contributing to advancements in environmental forecasting methodologies.

Keywords: ConvLSTM, spatiotemporal forecasting, satellite imagery, environmental ML, geospatial grids, data pipelines, research collaboration
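The resolution-harmonization step mentioned above can be illustrated with a one-dimensional linear-interpolation regrid; real satellite grids are 2-D, and the function name and coordinates here are hypothetical.

```python
import numpy as np

def regrid_linear(values, src_coords, dst_coords):
    """Resample a 1-D field from its source grid onto a target grid
    via linear interpolation (a simplified sketch; production regridding
    would handle 2-D lat/lon grids and missing data)."""
    return np.interp(dst_coords, src_coords, values)
```

Applying the same interpolation along each spatial axis aligns datasets recorded at mismatched resolutions onto a common grid.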


Projects

Production RAG System | Hybrid Retrieval, Citation Enforcement & CI-Gated Evaluation

Detailed Report

Check out the detailed report here.

Tech Stack

Python, LangChain, ChromaDB, OpenAI Embeddings (text-embedding-3-small), rank-bm25, sentence-transformers (cross-encoder/ms-marco-MiniLM-L-6-v2), Google Gemini 2.5 Flash, OpenAI GPT-5.4, Anthropic Claude Opus 4.6, Ragas, Gradio, Pydantic, PyYAML, loguru, pytest, GitHub Actions, uv

Key Highlights

  • Built a production-grade RAG pipeline on 102 LangChain and LangGraph documentation files, featuring hybrid retrieval, cross-encoder reranking, enforced source citations, and automated evaluation with a CI quality gate.
  • Implemented hybrid retrieval combining BM25 keyword search and vector similarity search (OpenAI text-embedding-3-small + ChromaDB), fusing scores with configurable weights (0.6 vector + 0.4 BM25) to capture both semantic and exact lexical matches.
  • Added cross-encoder reranking using ms-marco-MiniLM-L-6-v2 to rescore 20+ hybrid candidates by joint query-passage relevance, selecting the top 5 for answer generation.
  • Designed a citation-enforced prompt system (versioned as YAML configs) requiring inline [Source N] references for every claim, with a programmatic decline-to-answer mechanism (INSUFFICIENT_CONTEXT prefix detection) when retrieved context is insufficient.
  • Generated a golden evaluation dataset of 102 QA pairs across 6 question types (factual, conceptual, procedural, comparative, multi-hop, edge-case) using OpenAI GPT-5.4 with high reasoning effort, with quality filtering and targeted top-up for underrepresented categories.
  • Enforced a three-model, three-vendor separation to eliminate self-evaluation bias: OpenAI GPT-5.4 generates the dataset, Google Gemini 2.5 Flash generates RAG answers, and Anthropic Claude Opus 4.6 independently evaluates faithfulness via Ragas.
  • Achieved strong evaluation scores across all 102 questions: Faithfulness: 0.9561, Answer Relevancy: 0.8572, Context Precision: 0.8336, Context Recall: 0.9220, all passing the 0.7 CI threshold.
  • Built a Gradio chatbot with streaming typing animation, inline citations, source file listings, and decline behavior for out-of-scope queries.
  • Wrote 62 unit tests (all offline with mocked dependencies) covering ingestion, chunking, vector store, retrieval, reranking, generation, evaluation, and CI modules.
  • Configured a GitHub Actions CI pipeline running the full test suite on push and PR, with a commented Ragas evaluation gate ready for activation with API secrets.
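The weighted score fusion described above (0.6 vector + 0.4 BM25) can be sketched as follows. This is a simplified illustration: the normalization scheme and function names are assumptions, not the pipeline's exact implementation.

```python
def fuse_scores(vector_scores, bm25_scores, w_vec=0.6, w_bm25=0.4):
    """Fuse per-document retrieval scores with configurable weights.

    Sketch: min-max normalizes each score dict to [0, 1] before the
    weighted sum; the real system's normalization may differ.
    """
    def norm(scores):
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    v, b = norm(vector_scores), norm(bm25_scores)
    docs = set(v) | set(b)
    fused = {d: w_vec * v.get(d, 0.0) + w_bm25 * b.get(d, 0.0) for d in docs}
    return sorted(docs, key=fused.get, reverse=True)  # best doc first
```

The fused ranking feeds the cross-encoder reranker, which rescores the top candidates by joint query-passage relevance.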

YouTube Comment Intelligence | End-to-End NLP & MLOps System

Detailed Report

Check out the detailed report here.

Tech Stack

Python, LightGBM, XGBoost, scikit-learn, NLTK, spaCy, Optuna, MLflow, DVC, FastAPI, Docker, GitHub Actions, AWS (EC2, ECR, S3, CodeDeploy, IAM, Auto Scaling, CloudWatch), Chrome Extension APIs, JavaScript, HTML, CSS

Key Highlights & Impact

  • Built a production-ready Chrome extension + FastAPI backend that analyzes up to 500 YouTube comments per video in real time, classifying them into positive, neutral, and negative sentiments.
  • Trained and optimized a multi-class sentiment model on 37,000+ labeled comments, improving performance from a weak 65%-accuracy baseline to a tuned LightGBM model reaching ~80–87% accuracy with significantly improved recall for negative comments.
  • Improved negative class recall from 2% (baseline Random Forest) to a balanced and production-acceptable level using:
    • Class imbalance handling (under-sampling, class weights, SMOTE variants),
    • Feature engineering (bigrams, custom linguistic features),
    • Extensive Bayesian hyperparameter tuning with Optuna.
  • Designed and executed 5 structured experimentation phases, comparing:
    • Bag-of-Words vs TF-IDF,
    • 7 ML models (XGBoost, LightGBM, SVM, Logistic Regression, KNN, Naive Bayes, Random Forest),
    • Multiple imbalance strategies and ensemble methods, resulting in selection of an optimized LightGBM production model.
  • Engineered custom NLP features (lexical diversity, POS proportions, avg word length, etc.), increasing macro F1-score and improving minority class detection without increasing latency.
  • Implemented full MLOps pipeline using DVC + MLflow, enabling:
    • Reproducible training,
    • Experiment tracking,
    • Model registry (Staging → Production promotion),
    • Automated performance gating (minimum 0.75 threshold for accuracy, precision, recall, F1).
  • Developed robust CI/CD workflow using GitHub Actions, automatically:
    • Running DVC pipelines,
    • Testing model loading & signature,
    • Validating performance metrics,
    • Promoting models to production,
    • Building and pushing Docker images to AWS ECR,
    • Deploying via CodeDeploy to EC2 Auto Scaling group.
  • Containerized the FastAPI service with Docker and deployed to AWS EC2 behind an Application Load Balancer, ensuring:
    • Scalable inference,
    • Rolling deployments,
    • High availability with auto-scaling (2–3 instances).
  • Built interactive frontend features including:
    • Sentiment percentage breakdown,
    • Pie charts,
    • Word clouds,
    • Monthly sentiment trend graphs,
    • Engagement metrics (unique commenters, avg words/comment, sentiment score out of 10).
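The custom linguistic features mentioned above (lexical diversity, average word length) can be sketched in pure Python; the function and feature names here are illustrative, not the project's exact feature set.

```python
def linguistic_features(text):
    """Handcrafted comment features of the kind described above.

    Sketch: whitespace tokenization stands in for the project's real
    NLTK/spaCy preprocessing.
    """
    words = text.lower().split()
    if not words:
        return {"lexical_diversity": 0.0, "avg_word_length": 0.0}
    return {
        # unique-token ratio: higher means richer vocabulary
        "lexical_diversity": len(set(words)) / len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }
```

Features like these are appended to the TF-IDF matrix, letting the classifier use stylistic signals without extra inference latency.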

Model Performance Summary

  • Baseline (Random Forest + BoW):
    • Accuracy: 65%
    • Negative Recall: 2% (severe imbalance issue)
  • Final Production Model (LightGBM + Bigrams + Class Weights):
    • Accuracy: ~80–87%
    • Balanced precision/recall across all three classes
    • Negative class recall significantly improved
    • Passed automated CI performance thresholds (≥ 0.75 across key metrics)

Food Delivery Time Prediction | End-to-End MLOps Regression System

Detailed Report

Check out the detailed report here.

Tech Stack

Python, Pandas, NumPy, Scikit-learn, LightGBM, XGBoost, Optuna, MLflow, DVC, FastAPI, Docker, GitHub Actions, AWS S3, DagsHub

Key Contributions & Impact

  • Built an end-to-end machine learning system to predict food delivery ETA (in minutes) using 45,000+ real-world records with rider, traffic, weather, and geospatial features.
  • Engineered advanced features including Haversine distance, distance bins, time-of-day segmentation, pickup delay, and festival/traffic interactions, improving predictive signal (distance–target correlation ≈ 0.32).
  • Designed a robust data cleaning pipeline to handle 9.27% non-random missing data, inconsistent geospatial coordinates, outliers (invalid ratings, minors), and categorical noise.
  • Developed modular preprocessing pipelines using ColumnTransformer, implementing scaling, one-hot encoding, ordinal encoding, Yeo-Johnson target transformation, and hybrid imputation strategies (KNN + mode + missing indicators).
  • Benchmarked multiple models using Optuna (Bayesian optimization) across Random Forest, Gradient Boosting, XGBoost, LightGBM, SVM, and KNN.
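The Haversine-distance feature above follows the standard great-circle formula; this sketch uses the mean Earth radius and a hypothetical function name.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))
```

The resulting distance (and its binned version) is among the strongest predictors of delivery time in the feature set.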

Model Performance

  • Baseline Linear Regression (NaNs Dropped):
    • Test MAE: 4.73 minutes
    • Test R²: 0.60
  • Random Forest (Tuned):
    • Test MAE: 3.12 minutes
    • Test R²: 0.83
  • LightGBM (Best Standalone Model):
    • Test MAE: 3.05 minutes
    • Test R²: 0.84
  • Final Stacking Regressor (LightGBM + RF + Linear Meta Model):
    • Cross-validated MAE ≈ 3 minutes
    • Strong generalization with controlled overfitting compared to standalone Random Forest

MLOps & Productionization

  • Tracked experiments and hyperparameter searches using MLflow (DagsHub).
  • Built a fully modular DVC pipeline with staged reproducibility (data cleaning → preprocessing → training → evaluation → model registry).
  • Implemented automated CI/CD via GitHub Actions, including:
    • Model loading validation
    • Performance threshold testing (MAE < 5 mins)
    • Automated promotion from Staging → Production.
  • Containerized FastAPI prediction service with Docker and exposed real-time /predict endpoint.
  • Designed API testing and stress-validation workflow prior to production deployment.

Spotify Recommendation System | End-to-End Recommender System with MLOps & AWS Deployment

Detailed Report

Check out the detailed report here.

Tech Stack

Python, Pandas, NumPy, Scikit-learn, Dask, SciPy (Sparse Matrices), Streamlit, DVC, Docker, GitHub Actions (CI/CD), AWS S3, AWS ECR, AWS EC2, AWS CodeDeploy, AWS Auto Scaling, Cosine Similarity, TF-IDF, One-Hot Encoding

Key Highlights

  • Built a hybrid recommender system combining Content-Based Filtering and Item-Based Collaborative Filtering on 50,683 songs and 9.7M+ user–song interactions (≈962K users, 30K+ active tracks).
  • Engineered high-dimensional song embeddings using TF-IDF (85 features), One-Hot Encoding, Frequency Encoding, Standard Scaling, and Min-Max Scaling via Scikit-learn’s ColumnTransformer.
  • Constructed an item–user interaction matrix (30,459 × 962,037 ≈ 29B possible entries) and optimized memory usage by converting it into a SciPy sparse matrix, avoiding ~60GB dense allocation.
  • Leveraged Dask for parallelized chunk processing to handle large-scale interaction data that could not fit into memory.
  • Implemented cosine similarity–based ranking to generate top-k real-time recommendations.
  • Designed a weighted hybrid model with normalized similarity vectors to resolve scale mismatch between sparse collaborative signals and dense content embeddings.
  • Addressed cold-start problem by dynamically switching to content-based recommendations for ~20K unseen songs.
  • Modularized the pipeline using DVC stages for reproducible data cleaning, transformation, interaction matrix creation, and hybrid recommendation.
  • Built an interactive Streamlit application with dynamic weight selection and session-level caching for improved inference speed.
  • Productionized the system with Docker containerization, GitHub Actions CI/CD, AWS ECR image versioning, EC2 deployment, and blue/green deployments using AWS CodeDeploy + Auto Scaling.
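The cosine-similarity ranking above operates on sparse vectors; a dict-of-index sketch (a stand-in for the project's SciPy sparse matrices) shows why only the shared non-zero entries contribute to the dot product.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two sparse vectors stored as
    {index: value} dicts (illustrative stand-in for SciPy CSR rows)."""
    dot = sum(val * v.get(i, 0.0) for i, val in u.items())
    norm_u = math.sqrt(sum(val * val for val in u.values()))
    norm_v = math.sqrt(sum(val * val for val in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Because most of the 30,459 × 962,037 interaction matrix is zero, skipping absent entries is what makes similarity computation tractable at this scale.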

NYC Taxi Demand Prediction | End-to-End Time Series Forecasting & Application

Detailed Report

Check out the detailed report here.

Tech Stack

Python, pandas, Dask, NumPy, scikit-learn (StandardScaler, MiniBatchKMeans, LinearRegression), Optuna, MLflow, DagsHub, DVC, Amazon S3, Streamlit, pytest, GitHub Actions, Docker, AWS (ECR, EC2 Auto Scaling, Application Load Balancer, CodeDeploy)

Key Highlights

  • Built an end-to-end taxi demand forecasting system that predicts next 15-minute pickup demand from (latitude, longitude) + time, enabling driver decision-making for higher-earning regions.
  • Processed 3 months of NYC TLC Yellow Taxi data (Jan–Mar 2016) and engineered a forecasting-ready dataset by resampling pickups into 15-min bins per region and generating 4 lag features (t−1 … t−4) plus day-of-week signals.
  • Scaled geospatial modeling by dividing NYC into 30 demand-aware regions using MiniBatch K-Means, selecting k = 30 based on an average ~1–1.5 mile distance to the 8 nearest neighboring region centroids (driver-reachable within 15 minutes).
  • Implemented robust preprocessing by removing extreme outliers across geo coordinates, trip distance, fare, and related fields, improving clustering quality and downstream demand estimates.
  • Reduced time-series noise via Exponentially Weighted Moving Average (EWMA) smoothing with α = 0.4, aligned with the 4-lag feature window to avoid leakage.
  • Trained and validated a time-series baseline using a non-random split (train: Jan–Feb 2016, test/val: Mar 2016) and achieved MAPE = 8.78% (train) and MAPE = 7.93% (test) with Linear Regression.
  • Performed Bayesian hyperparameter optimization with Optuna across candidate models; the tuned Linear Regression remained the best configuration (test MAPE ≈ 7.93%).
  • Built a reproducible DVC pipeline covering ingestion → outlier removal → clustering → feature engineering → training → evaluation, with artifact versioning (datasets, scaler, clustering model, encoder, regressor) and S3 as the DVC remote.
  • Tracked experiments and registered models using MLflow (remote on DagsHub), logging parameters, metrics, and artifacts; promoted models through staging → production gated by automated tests.
  • Delivered an interactive Streamlit app that visualizes predicted demand as a color-coded NYC region map, with an option to show predictions for the current region + 8 nearest neighbors (9 regions total) to reflect realistic driver mobility.
  • Implemented CI/CD with GitHub Actions + pytest, including automated checks for model registry loading and performance thresholds (MAPE ≤ 10% on train & test); containerized the app with Docker and prepared deployment via AWS ECR + EC2 Auto Scaling + ALB + CodeDeploy (desired capacity: 2, min: 1, max: 2, CPU target: 70%).
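The EWMA smoothing (α = 0.4) and 4-lag feature construction above can be sketched directly; function names and list-based types here are illustrative, the real pipeline uses pandas.

```python
def ewma(series, alpha=0.4):
    """Exponentially weighted moving average, matching the α = 0.4 setting."""
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

def lag_features(series, n_lags=4):
    """Build (t−1 … t−4) lag features for each step with full history,
    most recent lag first."""
    return [series[t - n_lags:t][::-1] for t in range(n_lags, len(series))]
```

Aligning the smoothing window with the 4-lag feature window keeps future information out of each training row, avoiding leakage in the time-ordered split.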

Olist E-Commerce Database | PostgreSQL → SQLite + Streamlit App

Detailed Report

Check out the detailed report here.

Tech Stack

PostgreSQL, SQLite, SQL, Python, pandas, sqlite3, Streamlit, EXPLAIN/query plans, indexing (B-tree / hash in Postgres; B-tree in SQLite), E/R modeling (Lucidchart), Git/GitHub, Kaggle (Olist dataset)

Key Highlights

  • Designed and implemented a normalized relational database for the Olist Brazilian e-commerce dataset (~100,000 orders; 2016–2018) across 9 core tables, enforcing PK/FK constraints and referential integrity for marketplace analytics.
  • Performed global BCNF verification by enumerating candidate keys + functional dependencies; resolved non-BCNF edge cases by introducing surrogate primary keys (geolocation_id, review_record_id) to handle duplicate-heavy relations without lossy decompositions.
  • Built a reproducible CSV → PostgreSQL ingestion pipeline using COPY (including duplicate-safe loads via ON CONFLICT DO NOTHING for reviews), enabling fast bulk loads and consistent schema initialization.
  • Authored 10+ operational SQL queries (INSERT/UPDATE/DELETE/SELECT) and advanced analytics queries using WITH CTEs, window functions (PERCENT_RANK), nested subqueries, and conditional aggregation to derive seller tiers, customer value, and delivery KPIs.
  • Conducted query performance profiling with EXPLAIN / query plans and identified bottlenecks in join-heavy + aggregation workloads (e.g., a multi-table JOIN query measured ~50s pre-indexing).
  • Implemented an indexing strategy (join keys + filter columns) and demonstrated measurable improvements on representative workloads (e.g., JOIN query reduced from ~50s → ~42s, ~16% speedup), while documenting why gains were bounded by optimizer behavior and workload characteristics.
  • Diagnosed and documented expensive analytical queries (e.g., state-level seller ranking query at ~14.5 minutes, customer payment aggregation at ~11 minutes) and proposed targeted indexing / access-pattern changes to reduce full scans and sort costs.
  • Migrated the full database from PostgreSQL to SQLite for lightweight deployment, recreating schema + constraints in Python and loading data via pandas.to_sql, enabling a portable, serverless artifact (olist.db) for demos and sharing.
  • Developed and deployed a Streamlit SQL Explorer that lets users run interactive SQL queries against the SQLite database for self-serve analytics and rapid experimentation.
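The seller-tier pattern above (CTE + PERCENT_RANK window function) can be demonstrated against an in-memory SQLite database; the table, columns, and tier cutoff here are hypothetical, not the project's schema.

```python
import sqlite3

# Minimal sketch of a CTE with PERCENT_RANK() over seller revenue.
# Table name, columns, and the 0.5 tier threshold are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE seller_sales (seller_id TEXT, revenue REAL);
    INSERT INTO seller_sales VALUES ('a', 100), ('b', 300), ('c', 200);
""")
rows = conn.execute("""
    WITH ranked AS (
        SELECT seller_id,
               PERCENT_RANK() OVER (ORDER BY revenue) AS pr
        FROM seller_sales
    )
    SELECT seller_id,
           CASE WHEN pr >= 0.5 THEN 'top' ELSE 'standard' END AS tier
    FROM ranked
""").fetchall()
```

The same query shape runs unchanged on PostgreSQL and SQLite (both support window functions), which is what made the Postgres-to-SQLite migration straightforward for the analytics layer.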

Customer Care ETL Pipeline (End-to-End Serverless Data Engineering on AWS)

Detailed Report

Check out the detailed report here.

Tech Stack

Python, pandas, SQLAlchemy, MySQL, Amazon S3 (raw/processed zones), AWS Lambda (S3 event triggers + layers: AWSSDKPandas, psycopg2), AWS Glue (Visual ETL + Script ETL + Crawlers), Amazon Athena, Amazon Redshift Serverless, Parquet/CSV, Loguru, Power BI

Key Highlights

  • Designed and implemented a production-style, event-driven ETL pipeline to unify scattered customer-care data into a single source of truth, enabling consistent analytics and dashboarding.
  • Built incremental ingestion for MySQL support tickets using a date-tracker state file with exactly-once semantics (state updated only after successful S3 write) and graceful handling of empty daily extracts.
  • Ingested unstructured daily support logs and implemented fault-tolerant, incremental file ingestion (one-day-per-run) with structured, rotated logs via Loguru for auditability and debugging.
  • Converted raw log text into an analytics-ready table by parsing 13+ fields (timestamp, log_level, service component, ticket/session IDs, IP, response time, CPU%, event type, error flag, user agent, message/debug) using regex-based extraction.
  • Enforced data quality checks and cleaning: filtered invalid records (e.g., negative response times), normalized malformed log levels (INF0/DEBG/warnING/EROR → canonical levels), dropped non-informative columns (e.g., trace_id), and deduplicated events.
  • Implemented serverless transformations for logs in AWS Lambda, writing curated outputs to S3 processed zone in Parquet using PyArrow for efficient downstream querying.
  • Built a Glue Visual ETL job for ticket cleansing (schema fixes, null-field drops, renames, invalid-value filters, SQL-based category standardization) and productionized it via Glue Script ETL triggered by S3 Put events.
  • Enabled ad hoc analytics directly on the lake using Athena + Glue Crawlers, creating queryable tables over Parquet datasets and supporting operational KPI exploration (ticket volume trends, agent workload, log-level distributions, CPU utilization by user agent).
  • Loaded curated data into Amazon Redshift Serverless using automated incremental COPY jobs triggered on new processed Parquet arrivals (Lambda + psycopg2), supporting scalable OLAP queries.
  • Delivered near-real-time operational dashboards in Power BI connected to Redshift, tracking KPIs such as total/open tickets, ticket distribution by channel/status/agent, total logs, log-level mix, and average CPU utilization.
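The date-tracker pattern above, where state advances only after a successful write, can be sketched as follows. The file name, callbacks, and return values are hypothetical; the point is the ordering that yields exactly-once semantics per day.

```python
import json
import pathlib

STATE_FILE = pathlib.Path("last_loaded.json")  # hypothetical tracker path

def incremental_load(extract, write_to_s3):
    """Sketch of the date-tracker pattern: the state file is updated
    only AFTER the S3 write succeeds, so a failed run is simply
    retried from the same date (exactly-once per day)."""
    state = (json.loads(STATE_FILE.read_text())
             if STATE_FILE.exists() else {"last_date": None})
    rows, next_date = extract(state["last_date"])
    if not rows:                      # gracefully skip empty daily extracts
        return state["last_date"]
    write_to_s3(rows, next_date)      # raises on failure -> state untouched
    state["last_date"] = next_date
    STATE_FILE.write_text(json.dumps(state))
    return next_date
```

If `write_to_s3` raises, the tracker file is never touched, so the next run re-extracts the same day rather than silently skipping it.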

Technical Skills

Programming & Software Engineering

  • Python (advanced): OOP design, modular architectures, threading & multiprocessing, structured logging
  • R: statistical modeling, regression, time series analysis
  • SQL: complex joins, subqueries, window functions, indexing, query optimization (EXPLAIN)
  • MATLAB (signal processing, academic use)
  • LaTeX, Quarto, Markdown (technical documentation & reproducible reporting)

Machine Learning & Statistical Modeling

  • Regression & Classification: Linear/Ridge, Logistic Regression, Random Forest, Gradient Boosting, LightGBM
  • Model evaluation: cross-validation, calibration, bias–variance analysis, diagnostics
  • Statistical testing: ANOVA, chi-squared tests, Jarque–Bera, correlation analysis
  • Hyperparameter optimization with Optuna
  • Dimensionality reduction & embedding analysis: PCA, t-SNE

Deep Learning

  • Frameworks: TensorFlow, PyTorch
  • Sequence modeling with 1D CNNs, InceptionTime, MiniROCKET
  • ConvLSTM for spatiotemporal forecasting
  • Multi-branch neural architectures (waveform + frequency fusion)
  • Model training, evaluation, and architectural comparison

Time Series & Signal Processing

  • Time-series forecasting & classification
  • Signal preprocessing: DC removal, normalization, smoothing (EWMA)
  • Frequency-domain analysis: rFFT, spectral feature extraction
  • Lag features, rolling statistics, temporal resampling

Natural Language Processing

  • End-to-end sentiment analysis pipelines
  • Text preprocessing: normalization, lemmatization, negation-aware stopword handling
  • Feature extraction: Bag-of-Words, TF-IDF, n-grams
  • Class imbalance handling (under-sampling, SMOTE variants)

Retrieval-Augmented Generation (RAG) & LLM Applications

  • End-to-end RAG pipelines: document ingestion, chunking, embedding, retrieval, generation, evaluation
  • Hybrid retrieval: BM25 keyword search + vector similarity search with weighted score fusion
  • Cross-encoder reranking with sentence-transformers (ms-marco-MiniLM-L-6-v2)
  • Citation enforcement and programmatic decline-to-answer mechanisms
  • Versioned prompt engineering (YAML-based prompt configs)
  • Vector stores: ChromaDB with OpenAI embeddings (text-embedding-3-small)
  • LLM orchestration with LangChain
  • Multi-vendor LLM integration: OpenAI (GPT-5.4), Google (Gemini 2.5 Flash), Anthropic (Claude Opus 4.6)
  • Automated RAG evaluation using Ragas (faithfulness, answer relevancy, context precision, context recall)
  • Interactive LLM applications with Gradio

Recommender Systems

  • Content-Based Filtering (TF-IDF, encoding, similarity ranking)
  • Item-Based Collaborative Filtering (sparse matrices)
  • Hybrid recommendation systems with weighted score fusion
  • Scalable similarity computation using Dask

Data Engineering & ETL

  • End-to-end data pipelines (batch + event-driven ingestion)
  • Dataset versioning and lineage tracking with DVC
  • Log parsing and structured transformation pipelines
  • Schema enforcement, data validation, deduplication, partitioning
  • Parquet-based data lake design (raw/processed zones)

Cloud & MLOps

  • AWS: S3, Lambda, Glue (Visual + Script ETL), Glue Crawlers, Athena, Redshift Serverless, EC2, ECR, CodeDeploy
  • Multi-vendor LLM API management (OpenAI, Google, Anthropic)
  • Experiment tracking with MLflow (DagsHub)
  • CI/CD with GitHub Actions
  • Docker containerization
  • Blue/green deployments, auto-scaling architectures
  • Event-driven incremental data loading (Lambda + psycopg2)

APIs & Applications

  • FastAPI for production inference services
  • Streamlit for real-time ML applications and dashboards
  • Gradio for LLM-powered chatbot interfaces with streaming
  • REST APIs with model validation & performance safeguards
  • Chrome extension backend integration

Databases & Analytics Engineering

  • Relational schema design & ER modeling
  • Normalization up to BCNF, functional dependency analysis
  • PostgreSQL, SQLite
  • Indexing strategies & query performance analysis

Visualization & Reporting

  • Matplotlib, Seaborn
  • Confusion matrices, ROC/PR curves
  • Time-series visualizations
  • Power BI dashboards (Redshift-backed)
  • Technical reporting for research & production systems

Education

University at Buffalo, SUNY

MS in Engineering Science (Data Science) | GPA: 4.0/4.0
August 2024 – December 2025

Courses

  • Introduction to Numerical Mathematics for Computing and Data Scientists
  • Introduction to Probability Theory for Data Scientists
  • Programming and Database Fundamentals for Data Scientists
  • Statistical Learning and Data Mining - I
  • Statistical Learning and Data Mining - II
  • Predictive Analytics
  • Data Intensive Computing
  • Introduction to Machine Learning
  • Data Models Query Language
  • Experiential Projects in Artificial Intelligence and Data Science

Indian Statistical Institute (ISI) Kolkata

Post-Graduate Diploma in Applied Statistics (Data Analytics) | GPA: 9/10
September 2023 – August 2024

Courses

  • Basic Statistics
  • Basic Probability
  • Statistical Methods
  • Survey Sampling
  • Introduction to Official Statistical System
  • Statistics and Economy
  • Introduction to R and Python
  • Multiple Regression
  • Advanced Regression
  • Time Series Analysis and Forecasting
  • Multivariate Statistics
  • Statistical Machine Learning

Indian Institute of Science Education and Research (IISER) Pune

Integrated BS–MS Dual Degree (Science)
August 2017 – January 2023


Publications