Olist E-Commerce Database

Databases

SQL

Designed and optimized a normalized e-commerce relational database (PostgreSQL → SQLite) using the Olist dataset, with BCNF validation, indexing, query performance analysis, and a deployed Streamlit SQL explorer.

Published

April 25, 2025

Abstract

The rapid expansion of e-commerce, along with the increase in online web and mobile applications has led to generation of huge amounts of data. This has skyrocketed the need for efficient data storage, retrieval, and management. We aim to create a robust database application for the Olist store, which is a Brazilian e-commerce store, to manage customer information, product catalogues, order processing, and customer reviews.

Tech Stack

Database Systems: PostgreSQL (data loading, schema design, constraints), SQLite (lightweight deployment)
Querying & Optimization: SQL, indexing, EXPLAIN / query plan analysis
Data Ingestion & ETL: CSV → DB pipelines (COPY in PostgreSQL, pandas + sqlite3 for SQLite)
Backend / Scripting: Python (sqlite3, pandas)
App / UI: Streamlit (interactive SQL query runner)
Modeling & Design: E/R modeling (Lucidchart), normalization (BCNF, functional dependencies)
Tooling: Git/GitHub (version control), Kaggle (dataset source)

Introduction

This project analyzes the Brazilian e-commerce public dataset by Olist, available on Kaggle. The dataset contains information on about 100,000 orders from 2016 to 2018, encompassing multiple dimensions of e-commerce operations including order status, pricing, payments, customer locations, and product reviews. We obtained this dataset to explore marketplace dynamics and optimize e-commerce strategies. Our analysis focuses on solving marketplace problems by examining key metrics such as order volumes, popular product categories, and customer behavior. We will use SQL queries to investigate sales trends, seller performance, and customer satisfaction. This real-world dataset provides an interesting use case for understanding the complexities of managing a multi-seller e-commerce platform and deriving actionable insights for business improvement.

We downloaded this dataset in csv format on the local machine and then imported it into a PostgreSQL database. We created a database schema called RAW to store the data in a structured manner using the INSERT statements provided in the following insert_stmt.sql file.

-- Insert data into customers table
COPY "RAW".customers(customer_id, customer_unique_id, customer_zip_code_prefix, customer_city, customer_state)
FROM 'C:\Users\sushrut\Desktop\Others\4_DMQL\olist\olist_customers_dataset.csv'
DELIMITER ',' CSV HEADER;

-- Insert data into products table
COPY "RAW".products(product_id, product_category_name, product_name_length, product_description_length, product_photos_qty, product_weight_g, product_length_cm, product_height_cm, product_width_cm)
FROM 'C:\Users\sushrut\Desktop\Others\4_DMQL\olist\olist_products_dataset.csv'
DELIMITER ',' CSV HEADER;

-- Insert data into sellers table
COPY "RAW".sellers (seller_id, seller_zip_code_prefix, seller_city, seller_state)
FROM 'C:\Users\sushrut\Desktop\Others\4_DMQL\olist\olist_sellers_dataset.csv'
WITH (FORMAT CSV, HEADER true, DELIMITER ',');

-- Insert data into orders table
COPY "RAW".orders (order_id, customer_id, order_status, order_purchase_timestamp, order_approved_at, order_delivered_carrier_date, order_delivered_customer_date, order_estimated_delivery_date)
FROM 'C:\Users\sushrut\Desktop\Others\4_DMQL\olist\olist_orders_dataset.csv'
WITH (FORMAT CSV, HEADER true, DELIMITER ',');

-- Insert data into order_items table
COPY "RAW".order_items (order_id, order_item_id, product_id, seller_id, shipping_limit_date, price, freight_value)
FROM 'C:\Users\sushrut\Desktop\Others\4_DMQL\olist\olist_order_items_dataset.csv'
WITH (FORMAT CSV, HEADER true, DELIMITER ',');

-- Insert data into order_payments table
COPY "RAW".order_payments (order_id, payment_sequential, payment_type, payment_installments, payment_value)
FROM 'C:\Users\sushrut\Desktop\Others\4_DMQL\olist\olist_order_payments_dataset.csv'
WITH (FORMAT CSV, HEADER true, DELIMITER ',');

-- Insert data into order_reviews table (there are duplicates and hence, few extra steps)
CREATE TEMP TABLE temp_order_reviews (LIKE "RAW".order_reviews);

-- Copy data into the temporary table
COPY temp_order_reviews (review_id, order_id, review_score, review_comment_title, review_comment_message, review_creation_date, review_answer_timestamp)
FROM 'C:\Users\sushrut\Desktop\Others\4_DMQL\olist\olist_order_reviews_dataset.csv'
WITH (FORMAT CSV, HEADER true, DELIMITER ',');

-- Insert data from temporary table to the main table, ignoring duplicates
INSERT INTO "RAW".order_reviews
SELECT * FROM temp_order_reviews
ON CONFLICT (review_id) DO NOTHING;

DROP TABLE temp_order_reviews;

COPY "RAW".geolocation (geolocation_zip_code_prefix, geolocation_lat, geolocation_lng, geolocation_city, geolocation_state)
FROM 'C:\Users\sushrut\Desktop\Others\4_DMQL\olist\olist_geolocation_dataset.csv'
WITH (FORMAT CSV, HEADER true, DELIMITER ',');

COPY "NYC_TAXI_DB"."RAW".product_categories 
(product_category_name, product_category_name_english)
FROM 'C:\Users\sushrut\Desktop\Others\4_DMQL\olist\product_category_name_translation.csv'
DELIMITER ',' CSV HEADER;

Problem Statement

The e-commerce industry faces critical challenges in managing marketplace dynamics:

Monitoring Seller Performance
- Marketplaces must track and evaluate seller metrics to maintain quality standards, build customer trust, and uphold platform reputation.
Strategic Market Expansion
- Data-driven insights are crucial for identifying new market opportunities, understanding regional preferences, and guiding expansion strategies.
Customer Complaint and Refund Management
- Efficient handling of issues is vital for customer retention, requiring streamlined processes for complaints and refunds.
Real-Time Reporting for Decision-Making
- Instant access to sales trends, inventory levels, and customer behavior data is essential for informed, timely decisions in the fast-paced e-commerce environment.

Our database solution addresses these challenges by providing a comprehensive framework to capture, analyze, and report on crucial marketplace data, enabling effective management of seller relationships, strategic growth, and operational excellence. We plan to deliver a structured database that supports complex queries, data visualization, and business intelligence tools to empower Olist with actionable insights for marketplace optimization.

Why Not Excel?

There are several reasons as to why we cannot use a simple Excel file for this data. Some of them are:

Any real life e-commerce dataset is constantly updating. So, the number of rows in the file can grow indefinitely. Whereas, an Excel sheet has a row limit of 1,048,576.
Databases can handle millions of rows efficiently. However, if the Excel file gets huge, the program becomes slow and is more likely to crash.
SQL queries that work on a database are optimized for retrieving, filtering, and aggregating large datasets quickly. Doing the same on an Excel file is inefficient.
Normalization in a database eliminates redundancy and inconsistency in the data. This is not possible in an Excel file.
Multiple users can read and write to a database at the same time with no conflicts or overwrites. Doing this in an Excel file will lead to overwrites and data corruption.

Target Users

There are several target users that can benefit from this database. Some of them are:

Business Users & Decision Makers: They can use the database for business performance analysis like tracking sales, revenue, etc., over time; work on market expansion strategies by identifying the high-demand locations; improve inventory management by identifying high-demand products, and more.
Data Scientists & Analysts: They can carry out advanced data analytics by running complex SQL queries. For example, they can analyse customer behavior using segmentation, forecast sales and predict future demand. They can also use the data to make interactive dashbords using PowerBI, Tableau, etc.
Customer Support Teams: They can track orders to find order status and delivery updates for prompt response to customer inquiries; analyse customer reviews using sentiment analysis to improve customer satisfaction, and more.

Administrator

A database administrator (DBA) with domain expertise in e-commerce can administer the database. They will have key responsibilities like:

Installing, configuring, and optimizing the database.
Creating schemas, tables, and relationships.
Defining roles and permissions for different types of users, and restricting access of sensitive data.
Monitoring the database performance and optimizing queries.
Scheduling automated backups and replication.
Securing the database from SQL injections.
Scale the database as the data grows.

E/R Diagram

The E/R diagram for the database is shown in Figure 1.

This is the link to this E/R diagram on LucidChart.

Relationships in the E/R Diagram

`customers` to `orders`

One-to-Many relationship.
A customer can have many orders, but each order belongs to one customer.
Relationship: customer_id in orders references customer_id in customers.

`orders` to `order_items`

One-to-Many relationship.
An order can have many order items, but each order item belongs to one order.
Relationship: order_id in order_items references order_id in orders.

`products` to `order_items`

One-to-Many relationship.
A product can be in many order items, but each order item refers to one product.
Relationship: product_id in order_items references product_id in products.

`sellers` to `order_items`

One-to-Many relationship.
A seller can have many order items, but each order item is from one seller.
Relationship: seller_id in order_items references seller_id in sellers.

`orders` to `order_payments`

One-to-Many relationship.
An order can have multiple payments, but each payment belongs to one order.
Relationship: order_id in order_payments references order_id in orders.

`orders` to `order_review`

One-to-Many relationship.
An order can have multiple reviews, but each review belongs to one order.
Relationship: order_id in order_reviews references order_id in orders.

`products` to `product_categories`

Many-to-One relationship.
Many products can belong to one category.
Relationship: product_category_namein products references product_category_namein product_categories.

`geolocation`

This table doesn’t have direct relationships with other tables through foreign keys.
However, it can be conceptually related to customers and sellers through zip code prefixes.

Database Details

The name of the database is OLIST_ECOMMERCE_DB. It has one schema called RAW, and 9 tables namely customers, products, sellers, orders, order_items, order_payments, order_reviews, geolocation, and product_categories. We have uploaded the following files as a part of our submission which contains the SQL queries we used to create the database schema and insert data into the tables:

table_ddl.sql: This file contains the SQL queries to create the tables in the database.
insert_stmt.sql: This file contains the SQL queries to insert data into the tables.
alter.sql: This file contains the SQL queries to alter tables with different conditions.
update.sql: This file contains the SQL queries to update the tables with different conditions.
delete.sql: This file contains the SQL queries to delete tables.
Four_SQL_queries.sql: This file contains four advanced SQL queries that we have designed to extract useful information from the database.

Let us discuss how each of the tables will be updated and the actions to be taken on any foreign key when the primary key is deleted.

Table `customers`

The detailed description of each attribute of this table is shown in Table 1.

Table 1: The customers table.

Attribute	Key Type	Description	Null / Default	FK Action
`customer_id`	PK	Unique identifier for each customer	Not null, no default	N/A
`customer_unique_id`	–	Another unique identifier for the customer	Nullable, no default	N/A
`customer_zip_code_prefix`	–	First few digits of the customer’s zip code	Nullable, no default	N/A
`customer_city`	–	City where the customer resides	Nullable, no default	N/A
`customer_state`	–	State where the customer resides	Nullable, no default	N/A

This table has one foreign key called customer_id, which is unique for every row in this table. As there is no foreign key in this table, there is no foreign key action as well.

The following are update scenarios for this table:

New Entries: Added when new customers register on the platform.
Updates: Customer information may be updated by customers themselves through their account settings or by customer service representatives.
Frequency: Continuous, as new customers join and existing customers update their information.

Table `products`

The detailed description of each attribute of this table is shown in Table 2.

Table 2: The products table.

Attribute	Key Type	Description	Null / Default	FK Action
`product_id`	PK	Unique identifier for each product	Not null, no default	N/A
`product_category_name`	FK	Name of the product category	Nullable, no default	`SET NULL`
`product_name_length`	–	Length of the product name	Nullable, no default	N/A
`product_description_length`	–	Length of the product description	Nullable, no default	N/A
`product_photos_qty`	–	Number of photos for the product	Nullable, no default	N/A
`product_weight_g`	–	Weight of the product in grams	Nullable, no default	N/A
`product_length_cm`	–	Length of the product in centimeters	Nullable, no default	N/A
`product_height_cm`	–	Height of the product in centimeters	Nullable, no default	N/A
`product_width_cm`	–	Width of the product in centimeters	Nullable, no default	N/A

This table has one foreign key called product_category_name, which is a reference to the product_categories table. The foreign key action is SET NULL, which means that if the primary key in the product_categories table is deleted, the corresponding foreign key in this table will be set to NULL.

The following are the update scenarios for this table:

New Entries: Added when new products are introduced to the catalog.
Updates: Product details may be updated by sellers or administrators to reflect changes in price, description, or availability.
Frequency: Regular updates, potentially daily or even more frequently in a large e-commerce system.

Table `sellers`

The detailed description of each attribute of this table is shown in Table 3.

Table 3: The sellers table.

Attribute	Key Type	Description	Null / Default	FK Action
`seller_id`	PK	Unique identifier for each seller	Not null, no default	N/A
`seller_zip_code_prefix`	–	First few digits of the seller’s zip code	Nullable, no default	N/A
`seller_city`	–	City where the seller is located	Nullable, no default	N/A
`seller_state`	–	State where the seller is located	Nullable, no default	N/A

This table has no foreign keys, so there are no foreign key actions.

The following are the update scenarios for this table:

New Entries: Added when new sellers register on the platform
Updates: Seller information may be updated by sellers themselves or by platform administrators.
Frequency: Less frequent than customers or products, but still ongoing as new sellers join.

Table `orders`

The detailed description of each attribute of this table is shown in Table 4.

Table 4: The orders table.

Attribute	Key Type	Description	Null / Default	FK Action
`order_id`	PK	Unique identifier for each order	Not null, no default	N/A
`customer_id`	FK	ID of the customer who placed the order	Nullable, no default	`SET NULL`
`order_status`	–	Current status of the order	Nullable, no default	N/A
`order_purchase_timestamp`	–	When the order was placed	Nullable, no default	N/A
`order_approved_at`	–	When the order was approved	Nullable, no default	N/A
`order_delivered_carrier_date`	–	When the order was handed to the carrier	Nullable, no default	N/A
`order_delivered_customer_date`	–	When the order was delivered to the customer	Nullable, no default	N/A
`order_estimated_delivery_date`	–	Estimated delivery date	Nullable, no default	N/A

This table has one foreign key called customer_id, which is a reference to the customers table. The foreign key action is SET NULL, which means that if the primary key in the customers table is deleted, the corresponding foreign key in this table will be set to NULL.

The following are the update scenarios for this table:

New Entries: Created when customers place new orders.
Updates: Order status is updated as the order progresses (e.g., from processing' toshipped’ to `delivered’).
Frequency: Continuous, with new orders being placed and existing orders being updated regularly.

Table `order_items`

The detailed description of each attribute of this table is shown in Table 5.

Table 5: The order_items table.

Attribute	Key Type	Description	Null / Default	FK Action
`order_id`	PK, FK	ID of the order	Not null, no default	`SET NULL`
`order_item_id`	PK	ID of the item within the order	Not null, no default	N/A
`product_id`	FK	ID of the product	Nullable, no default	`SET NULL`
`seller_id`	FK	ID of the seller	Nullable, no default	`SET NULL`
`shipping_limit_date`	–	Shipping deadline	Nullable, no default	N/A
`price`	–	Price of the item	Nullable, no default	N/A
`freight_value`	–	Freight cost for the item	Nullable, no default	N/A

This table has two foreign keys called product_id and seller_id, which are references to the products and sellers tables respectively. The foreign key action is SET NULL, which means that if the primary key in the products or sellers table is deleted, the corresponding foreign key in this table will be set to NULL.

The following are the update scenarios for this table:

New Entries: Added simultaneously with new orders, detailing each item in the order.
Updates: Rarely updated after creation, except perhaps in case of order modifications or corrections.
Frequency: As frequent as new orders are placed.

Table `order_payments`

The detailed description of each attribute of this table is shown in Table 6.

Table 6: The order_payments table.

Attribute	Key Type	Description	Null / Default	FK Action
`order_id`	PK, FK	ID of the order	Not null, no default	`SET NULL`
`payment_sequential`	PK	Sequential number of the payment	Not null, no default	N/A
`payment_type`	–	Type of payment	Nullable, no default	N/A
`payment_installments`	–	Number of installments	Nullable, no default	N/A
`payment_value`	–	Value of the payment	Nullable, no default	N/A

This table has one foreign key called order_id, which is a reference to the orders table. The foreign key action is SET NULL, which means that if the primary key in the orders table is deleted, the corresponding foreign key in this table will be set to NULL.

The following are the update scenarios for this table:

New Entries: Created when payments are made for orders.
Updates: Might be updated if payment status changes (e.g., payment declined, refunded).
Frequency: Typically created once per order, with potential updates as needed.

Table `order_reviews`

The detailed description of each attribute of this table is shown in Table 7.

Table 7: The order_reviews table.

Attribute	Key Type	Description	Null / Default	FK Action
`review_id`	PK	Unique identifier for each review	Not null, no default	N/A
`order_id`	FK	ID of the order being reviewed	Nullable, no default	`SET NULL`
`review_score`	–	Score given in the review	Nullable, no default	N/A
`review_comment_title`	–	Title of the review comment	Nullable, no default	N/A
`review_comment_message`	–	Full text of the review comment	Nullable, no default	N/A
`review_creation_date`	–	When the review was created	Nullable, no default	N/A
`review_answer_timestamp`	–	When the review was answered	Nullable, no default	N/A

The following are the update scenarios for this table:

New Entries: Added when customers submit reviews for their orders.
Updates: Might be updated if customers edit their reviews or if sellers respond.
Frequency: Less frequent than orders, as not all customers leave reviews.

Table `geolocation`

The detailed description of each attribute of this table is shown in Table 8.

Table 8: The geolocation table.

Attribute	Key Type	Description	Null / Default	FK Action
`geolocation_zip_code_prefix`	–	Zip code prefix	Nullable, no default	N/A
`geolocation_lat`	–	Latitude	Nullable, no default	N/A
`geolocation_lng`	–	Longitude	Nullable, no default	N/A
`geolocation_city`	–	City name	Nullable, no default	N/A
`geolocation_state`	–	State in two-letter code format	Nullable, no default	N/A

This table has no foreign keys, so there are no foreign key actions.

The following are the update scenarios for this table:

New Entries: Typically bulk-loaded from geographical data providers.
Updates: Periodically updated to reflect changes in geographical information.
Frequency: Less frequent, perhaps quarterly or annually, depending on the data provider.

Table `product_categories`

The detailed description of each attribute of this table is shown in Table 9.

Table 9: The product_categories table.

Attribute	Key Type	Description	Null / Default	FK Action
`product_category_name`	PK	Name of the product category	Not null, no default	N/A
`product_category_name_english`	–	English translation of the category name	Nullable, no default	N/A

This table has no foreign keys, so there are no foreign key actions.

The following are the update scenarios for this table:

New Entries: Added when new product categories are created.
Updates: Infrequent updates to category names or translations.
Frequency: Relatively static, with occasional updates as the product catalogue evolves.

One peculiar thing that is evident from the E/R diagram (Figure 1) and also from the geolocation table (Table 8) is that this table neither has a primary key nor any foreign keys. This is because the geolocation table serves as a reference table. It contains geographical information that can be linked to other tables conceptually, rather than through strict database relationships. By using the zip code prefix as a common field, the geolocation data can be flexibly linked to both customers and sellers without enforcing strict referential integrity. This allows for more dynamic querying and analysis. The geolocation table contains a comprehensive list of locations, including those that may not currently be associated with any customer or seller. This also ensures that the data is available for future use or expansion.

Four Basic SQL Queries

Using `GROUP BY` and `ORDER BY`

This query finds the top-5 product categories by total sales amount.

SELECT
    pc.product_category_name_english,
    COUNT(oi.order_id) AS total_orders,
    SUM(oi.price) AS total_sales
FROM
    "RAW".order_items oi
JOIN
    "RAW".products p ON oi.product_id = p.product_id
JOIN
    "RAW".product_categories pc ON p.product_category_name = pc.product_category_name
GROUP BY
    pc.product_category_name_english
ORDER BY
    total_sales DESC
LIMIT 5;

The output of this query is shown in Figure 2.

Figure 2: Top-5 product categories by total sales amount.

Using `GROUP BY` and `HAVING`

This query finds customers who have made purchases above the average order value, i.e., valued customers.

SELECT
    c.customer_id,
    c.customer_city,
    c.customer_state,
    COUNT(o.order_id) AS order_count,
    AVG(op.payment_value) AS avg_order_value
FROM
    "RAW".customers c
JOIN
    "RAW".orders o ON c.customer_id = o.customer_id
JOIN
    "RAW".order_payments op ON o.order_id = op.order_id
GROUP BY
    c.customer_id, c.customer_city, c.customer_state
HAVING
    AVG(op.payment_value) > (
        SELECT AVG(payment_value)
        FROM "RAW".order_payments
    )
ORDER BY
    avg_order_value DESC;

The output of this query is shown in Figure 3.

Figure 3: Customers who have made purchases above the average order value.

Using `CASE` and `AVG`

This query analyzes order delivery performance by calculating the percentage of orders delivered on time and the average delivery time.

SELECT
    EXTRACT(YEAR FROM o.order_purchase_timestamp) AS year,
    EXTRACT(MONTH FROM o.order_purchase_timestamp) AS month,
    COUNT(o.order_id) AS total_orders,
    AVG(
        CASE
            WHEN o.order_delivered_customer_date <= o.order_estimated_delivery_date THEN 1
            ELSE 0
        END
    ) * 100 AS on_time_delivery_percentage,
    AVG(
        EXTRACT(DAY FROM (o.order_delivered_customer_date - o.order_purchase_timestamp))
    ) AS avg_delivery_days
FROM
    "RAW".orders o
WHERE
    o.order_status = 'delivered'
GROUP BY
    year, month
ORDER BY
    year, month;

The output of this query is shown in Figure 4.

Using `WITH` and `PERCENT_RANK`

This query categorizes sellers based on their total sales volume and calculates the average sales percentile for each category.

WITH seller_sales AS (
    SELECT 
        s.seller_id,
        s.seller_city,
        s.seller_state,
        SUM(oi.price) AS total_sales
    FROM 
        "RAW".sellers s
    JOIN 
        "RAW".order_items oi ON s.seller_id = oi.seller_id
    GROUP BY 
        s.seller_id, s.seller_city, s.seller_state
),
sales_categories AS (
    SELECT 
        seller_id,
        seller_city,
        seller_state,
        total_sales,
        CASE 
            WHEN total_sales >= 10000 THEN 'High Volume'
            WHEN total_sales >= 5000 THEN 'Medium Volume'
            ELSE 'Low Volume'
        END AS sales_category,
        PERCENT_RANK() OVER (ORDER BY total_sales) AS sales_percentile
    FROM 
        seller_sales
)
SELECT 
    sc.sales_category,
    COUNT(sc.seller_id) AS seller_count,
    AVG(sc.total_sales) AS avg_sales,
    MIN(sc.total_sales) AS min_sales,
    MAX(sc.total_sales) AS max_sales,
    AVG(sc.sales_percentile) AS avg_percentile
FROM 
    sales_categories sc
GROUP BY 
    sc.sales_category
ORDER BY 
    avg_sales DESC;

The output of this query is shown in Figure 5.

Figure 5: Seller categories based on total sales volume.

Global BCNF Verification

We analyzed all nine datasets. For each dataset, we identified the candidate keys and enumerated the functional dependencies using both the dataset documentation and the observed schema semantics.

Initially, we discovered that two relations, namely geolocation and order_reviews, did not satisfy the Boyce-Codd Normal Form (BCNF) because they lacked natural candidate keys due to the presence of duplicate records. To bring these relations into BCNF, we introduced surrogate primary keys: geolocation_id and review_record_id, respectively.

We verified that after the addition of these surrogate keys, every non-trivial functional dependency has a left-hand side that is a superkey. This satisfies the definition of Boyce-Codd Normal Form (BCNF), which states that a relation is in BCNF if, for every non-trivial functional dependency \(X \to Y\), \(X\) is a superkey.

After a thorough review of each relation, we found that:

All attributes in every relation are fully functionally dependent on candidate keys.
No partial dependencies or transitive dependencies exist.
All relations satisfy the BCNF condition after introducing surrogate keys where necessary.

Conclusion

All nine relations are in BCNF after introducing surrogate primary keys for geolocation and order_reviews. No decomposition was required. We now proceed to formally list the functional dependencies, candidate keys, and justifications for each dataset in the following subsections.

Since this milestone specifically requires ensuring that the functional dependencies are satisfied at the schema level, we will issue appropriate ALTER TABLE commands in our SQL scripts. These commands will:

Add NOT NULL constraints on attributes that are functionally dependent on the candidate keys.
Introduce surrogate primary keys (geolocation_id and review_record_id) for the geolocation and order_reviews tables, respectively.
Drop any earlier primary keys if necessary and redefine the correct primary key structure.

Each row in this table maps a Portuguese product category name to its English equivalent. Since each product_category_name has exactly one English translation, and this attribute fully determines the value of product_category_name_english, the relation satisfies the condition for BCNF.

Finalizing the E/R Diagram

As we have added two surrogate keys in the tables geolocation and order_reviews to ensure BCNF applicability, the E/R diagram will reflect this change. Figure 6 shows the updated E/R diagram (LucidChart URL).

Figure 6: The updated E/R diagram for the database.

Constraints

Table 10 shows the relationship constraints in the database.

Table 10: Relationship constraints in the database.

Relationship	Type	Explanation	Validation
`customers` → `orders`	One-to-Many	A customer can place multiple orders.	`customer_id` appears in both `customers` and `orders`.
`orders` → `order_items`	One-to-Many	Each order can have multiple items.	`order_id` is present in both tables, e.g., `00010242fe8...`.
`products` → `order_items`	One-to-Many	A product can appear in many order items.	`product_id` like `4244733e06...` shows up in multiple items.
`sellers` → `order_items`	One-to-Many	A seller can fulfill many order items.	`seller_id` like `48436dade1...` maps to multiple rows.
`orders` → `order_payments`	One-to-Many	One order can have multiple payments.	Confirmed with different `payment_sequential` values for the same `order_id`.
`orders` → `order_reviews`	One-to-Many	One order can have multiple review updates.	The same `order_id` appears in multiple review rows.
`products` → `product_categories`	Many-to-One	Each product belongs to one category.	`product_category_name` maps to `product_category_name_english`.
`geolocation` (reference table)	Conceptual (by zip)	Relates by `zip_code_prefix` to both `customers` and `sellers`.	Zip codes like `01046` appear in both `geolocation` and `customers`.

The E/R diagram is accurately aligned with the dataset structure. All entity relationships and keys (primary keys, foreign keys) reflect how the data is organized. We have also chosen to make geolocation a loosely connected reference table using zip prefixes rather than enforcing foreign key constraints.

Validating Constraints Using Real Data

`customers`

customer_id: "06b8999e2fba1a1fbc88172c00ba8bc7"
customer_unique_id: "861eff4711a542e4b93843c6dd7febb0"
customer_zip_code_prefix: "14409"
customer_city: franca
customer_state: SP

We have proper normalization of this table. The attribute customer_unique_id exists separately. Also, this table is related to the geolocation table via zip_code_prefix.

`orders`

order_id: "e481f51cbdc54678b7cc49136f2d6af7"
customer_id: "9ef432eb6251297304e76186b10a928d"
order_status: delivered

This links customers via customer_id. Other fields like order_status and the timestamp related fields align with the project schema.

`order_items`

order_id: "00010242fe8c5a6d1ba2dd792cb16214"
product_id: "4244733e06e7ecb4970a6e2683c13e61"
seller_id: "48436dade18ac8b2bce089ec2a041202"

We have handled the composite primary key (order_id, order_item_id) correctly. We confirmed that multiple order_items can be present for one order.

Database Design Theory for Redundancies

We will check for redundancies and violations of normal forms (1NF, 2NF, 3NF) to apply database design principles. Table 11 summarizes this.

Table 11: Check for redundancies and violations of normal forms.

Table	PK	Attributes	Normalization	Redundancy Risk	Suggestion
`customers`	`customer_id`	`customer_unique_id`, `zip_code_prefix`, `city`, `state`	1NF, 2NF, 3NF (possible transitive)	Repetition of `city` / `state` for the same `zip_code_prefix`.	Separate `zip_lookup` (`zip_code_prefix`, `city`, `state`).
`orders`	`order_id`	`customer_id`, `order_status`, timestamps	1NF, 2NF, 3NF	None	None
`order_items`	(`order_id`, `order_item_id`)	`product_id`, `seller_id`, `price`, `freight_value`, `shipping_limit_date`	1NF, 2NF, 3NF	None	None
`products`	`product_id`	`product_category_name`, `name`, `description`, weight, dimensions	1NF, 2NF, 3NF	Would cause redundancy if category names were not separated	Already separated into the `product_categories` table.
`sellers`	`seller_id`	`zip_code_prefix`, `city`, `state`	1NF, 2NF, 3NF (same as `customers`)	Repetition of `city` / `state` for the same `zip_code_prefix`.	Use the shared `zip_lookup` table with customers.
`order_payments`	(`order_id`, `payment_sequential`)	`payment_type`, `installments`, `value`	1NF, 2NF, 3NF	None	None
`order_reviews`	`review_record_id`	`review_id`, `order_id`, `score`, `title`, `message`, `creation_date`, `answer_timestamp`	1NF, 2NF, 3NF	None	None

As you can see, some redundancies are still present in our system. However, for performance optimization and ease of querying, we chose to retain the zip_code_prefix, along with city and state, directly within the customers and sellers tables instead of normalizing them into a separate lookup table.

Problems Handling the Large Data and Using Indexing

Yes, we did run into problems while handling the larger dataset, like slow query performance, especially joins and filters.

Problem

As tables like order_itemsand orders grow into millions of rows, queries using JOIN, GROUP BY, and ORDER BY will become noticeably slower, especially without indexes.

Examples

Finding top product categories,
Aggregating seller sales,
Filtering delayed deliveries.

Slow SQL Queries

Joining Tables

SELECT
    o.order_id,
    o.order_status,
    o.order_purchase_timestamp,
    c.customer_id,
    c.customer_city,
    c.customer_state,
    g.geolocation_lat,
    g.geolocation_lng,
    oi.order_item_id,
    p.product_id,
    p.product_category_name,
    pc.product_category_name_english,
    s.seller_id,
    s.seller_city,
    s.seller_state,
    op.payment_type,
    op.payment_value,
    r.review_score,
    r.review_comment_message
FROM "RAW".orders o
JOIN "RAW".customers c ON o.customer_id = c.customer_id
LEFT JOIN "RAW".geolocation g ON c.customer_zip_code_prefix = g.geolocation_zip_code_prefix
JOIN "RAW".order_items oi ON o.order_id = oi.order_id
JOIN "RAW".products p ON oi.product_id = p.product_id
LEFT JOIN "RAW".product_categories pc ON p.product_category_name = pc.product_category_name
JOIN "RAW".sellers s ON oi.seller_id = s.seller_id
JOIN "RAW".order_payments op ON o.order_id = op.order_id
LEFT JOIN "RAW".order_reviews r ON o.order_id = r.order_id;

The result of this query is shown in Figure 7.

Figure 7: This `JOIN` query takes around 50 seconds to execute. This is before indexing.

As we can see, this query takes 50 seconds to execute completely, which is rather time consuming. To solve this problem, we used indexing.

In database systems, the B-Tree index is the default and most commonly utilized indexing method. It is particularly effective for supporting range queries, sorting, and filtering operations involving conditions such as =, <, >, BETWEEN, and ORDER BY.

In contrast, the Hash index is specifically designed to optimize exact match queries that use the equality (=) operator. Although Hash indexes can provide faster performance than B-Tree indexes for pure equality lookups, they have significant limitations as they do not support range queries or sorting operations. Therefore, the choice between B-Tree and Hash indexing must consider the specific access patterns and query requirements of the application.

Figure 8 shows the output of this query after indexing.

Figure 8: The `JOIN` query takes around 42 seconds to execute after indexing.

There is a marginal improvement. The reason for not a very substantial improvement may be the built in performance optimization of PostgreSQL. It may also be because the data we are using is reasonably simple.

We have uploaded a indexing.sql file in which we have provided the commands that we used for indexing.

-- Fast JOIN on order_id in order_items
CREATE INDEX idx_hash_order_items_order_id
ON "RAW".order_items USING hash(order_id);

-- Fast JOIN on order_id in order_payments
CREATE INDEX idx_hash_order_payments_order_id
ON "RAW".order_payments USING hash(order_id);

-- Filter on order_status = 'delivered'
CREATE INDEX idx_hash_orders_status
ON "RAW".orders USING hash(order_status);

-- B-TREE Indexes: for sorting, grouping, and join keys

-- Join & group on seller_id in order_items
CREATE INDEX idx_btree_order_items_seller_id
ON "RAW".order_items(seller_id);

-- Group on seller_state
CREATE INDEX idx_btree_sellers_state
ON "RAW".sellers(seller_state);

-- Support subquery from order_reviews
CREATE INDEX idx_btree_order_reviews_order_id
ON "RAW".order_reviews(order_id);

-- Support join from orders to order_items
CREATE INDEX idx_btree_orders_order_id
ON "RAW".orders(order_id);

-- Index payment_value for aggregation
CREATE INDEX idx_btree_order_payments_value
ON "RAW".order_payments(payment_value);

Testing the Database Using More Than Eight Queries

We have designed two INSERT queries, two DELETE queries, two UPDATE queries, and four SELECT queries.

`INSERT` Queries

Insert a New Customer

The query is the following.

INSERT INTO "RAW".customers (
    customer_id,
    customer_unique_id,
    customer_zip_code_prefix,
    customer_city,
    customer_state
) VALUES (
    'new_cust_001',
    'unique_cust_999',
    '12345',
    'new city',
    'SP'
);

Figure 9 shows the output of this query.

Insert a New Product Category

The query is the following.

INSERT INTO "RAW".product_categories (
    product_category_name,
    product_category_name_english
) VALUES (
    'eletronicos_domesticos',
    'home_electronics'
);

Figure 10 shows the output of this query.

Figure 10: Inserting a new product category.

`DELETE` Queries

Delete a Specific Customer by their ID

The query is the following.

DELETE FROM "RAW".customers
WHERE customer_id = 'new_cust_001';

Figure 11 shows the output of this query

Figure 11: Deleting a specific customer by their ID.

Delete All Reviews with `NULL` Scores

Such queries can be used for data cleanup. The query is the following.

DELETE FROM "RAW".order_reviews
WHERE review_score IS NULL;

Figure 12 shows the output of this query

Figure 12: Deleting all reviews with `NULL` scores.

`UPDATE` Queries

Update a Customer’s City and State

The query is the following.

UPDATE "RAW".customers
SET customer_city = 'updated city',
    customer_state = 'RJ'
WHERE customer_id = 'new_cust_001';

Figure 13 shows the output of this query.

Figure 13: Updating a customer’s city and state.

Correcting a Product Category Name

The query is the following.

UPDATE "RAW".product_categories
SET product_category_name_english = 'home_electronics_and_appliances'
WHERE product_category_name = 'eletronicos_domesticos';

Figure 14 shows the output of this query.

Figure 14: Correcting a product category name.

`SELECT` Queries

Top-5 Products by Revenue

We will use JOIN and ORDER BY in this query.

SELECT
    p.product_id,
    pc.product_category_name_english,
    SUM(oi.price) AS total_sales
FROM "RAW".order_items oi
JOIN "RAW".products p ON oi.product_id = p.product_id
JOIN "RAW".product_categories pc ON p.product_category_name = pc.product_category_name
GROUP BY p.product_id, pc.product_category_name_english
ORDER BY total_sales DESC
LIMIT 5;

Figure 15 shows the output of this query.

Average Review Score Per Seller

We will use GROUP BY in this query.

SELECT
    s.seller_id,
    s.seller_city,
    AVG(r.review_score) AS avg_review_score
FROM "RAW".sellers s
JOIN "RAW".order_items oi ON s.seller_id = oi.seller_id
JOIN "RAW".orders o ON oi.order_id = o.order_id
JOIN "RAW".order_reviews r ON o.order_id = r.order_id
GROUP BY s.seller_id, s.seller_city
ORDER BY avg_review_score DESC;

Figure 16 shows the output of this query.

Figure 16: Average review score per seller.

Customers Who Spent More Than Average

We will use subquery here.

SELECT
    c.customer_id,
    AVG(op.payment_value) AS avg_spending
FROM "RAW".customers c
JOIN "RAW".orders o ON c.customer_id = o.customer_id
JOIN "RAW".order_payments op ON o.order_id = op.order_id
GROUP BY c.customer_id
HAVING AVG(op.payment_value) > (
    SELECT AVG(payment_value) FROM "RAW".order_payments
);

Figure 17 shows the output of this query.

Figure 17: Customers who spent more than average.

Delivered Orders with Low Ratings

We will use nested JOIN along with FILTER in this query.

SELECT
    o.order_id,
    r.review_score,
    o.order_status,
    r.review_comment_message
FROM "RAW".orders o
JOIN "RAW".order_reviews r ON o.order_id = r.order_id
WHERE r.review_score <= 2 AND o.order_status = 'delivered'
ORDER BY r.review_score;

Figure 18 shows the output of this query.

Figure 18: Delivered orders with low ratings.

Query Execution Analysis

Here, we will identify three problematic queries by showing their cost, and mention where the performance can be improved. However, note that our dataset is reasonably straightforward with more or less simple relations. Hence, not many complex queries take a lot of time to execute due to internal performance optimization in PostgreSQL. However, we will still mention some queries that are taking some time to finish.

Identifying High-Value Sellers by State

This query can help in detecting underperformers.

SELECT
    s.seller_id,
    s.seller_state,
    COUNT(DISTINCT o.order_id) AS total_orders,
    SUM(oi.price + oi.freight_value) AS total_revenue,
    AVG(op.payment_value) AS avg_payment,
    PERCENT_RANK() OVER (PARTITION BY s.seller_state ORDER BY SUM(oi.price + oi.freight_value)) AS state_sales_percentile,
    (
        SELECT AVG(LENGTH(r.review_comment_message))
        FROM "RAW".order_reviews r
        WHERE r.order_id = o.order_id
    ) AS avg_review_length
FROM "RAW".sellers s
JOIN "RAW".order_items oi ON s.seller_id = oi.seller_id
JOIN "RAW".orders o ON o.order_id = oi.order_id
JOIN "RAW".order_payments op ON op.order_id = o.order_id
WHERE o.order_status = 'delivered'
GROUP BY s.seller_id, s.seller_state, o.order_id
ORDER BY total_revenue DESC;

Figure 19 shows the output of this query.

Figure 19: Identifying high-value sellers by state.

We can see that this query takes about 14.5 minutes to execute. The graphical and statistical output of the explain tool is shown in Figure 20 and Figure 21 respectively.

Figure 20: Identifying high-value sellers by state.

Figure 21: Identifying high-value sellers by state.

The query uses SUM(), AVG(), or COUNT() on grouped data. PostgreSQL is using parallel workers and merging their results. Hash tables were built to support joins or groupings, efficient on medium to large datasets. PostgreSQL did full table scans on 5 tables. It had to sort rows—often happens with ORDER BY. We used a window function, like PERCENT_RANK() or ROW_NUMBER().

Overall Customer Satisfaction Per Seller

SELECT 
    s.seller_id,
    s.seller_city,
    (SELECT AVG(r.review_score) 
    FROM "RAW".order_items oi 
    JOIN "RAW".orders o ON oi.order_id = o.order_id
    JOIN "RAW".order_reviews r ON o.order_id = r.order_id
    WHERE oi.seller_id = s.seller_id
    ) AS avg_review_score
FROM "RAW".sellers s;

Figure 22 shows the output of this query.

Figure 22: Overall customer satisfaction per seller.

We can see that this query takes about 34 seconds to execute. The graphical and statistical output of the explain tool is shown in Figure 23 and Figure 24 respectively.

Figure 23: Overall customer satisfaction per seller.

Figure 24: Overall customer satisfaction per seller.

Each Customer’s Average Payment Per Order

SELECT 
    c.customer_id,
    c.customer_city,
    (SELECT AVG(op.payment_value)
    FROM "RAW".orders o
    JOIN "RAW".order_payments op ON o.order_id = op.order_id
    WHERE o.customer_id = c.customer_id
    ) AS avg_order_payment
FROM "RAW".customers c;

Figure 25 shows the output of this query.

Figure 25: Each customer’s average payment per order.

We can see that this query takes about 11 minutes to execute. The graphical and statistical output of the explain tool is shown in Figure 26 and Figure 27 respectively.

Figure 26: Each customer’s average payment per order.

Figure 27: Each customer’s average payment per order.

The entire customers table was scanned row by row. Same for the orders table—no filtering/indexing, so full scan happens. The order_payments table uses an index for fast lookup on a filtered/joined column. We plan to resolve these queries using indexing. We are using B-tree and Hashing for this use case.

Database Migration and Application Deployment

Initially, the Olist E-Commerce dataset was loaded and processed in a PostgreSQL database environment. Appropriate schema definitions, data insertions, and index creations were performed using PostgreSQL-specific SQL commands. This setup enabled efficient querying and local experimentation.

For the purpose of building a publicly accessible application, the database was migrated to SQLite. SQLite was chosen due to its lightweight, serverless architecture, making it ideal for simple web deployment without the overhead of managing a PostgreSQL server.

A Python script, create_db.py, was used to recreate the database schema and populate the tables directly from the provided CSV files.

import os
import sqlite3
import pandas as pd

CSV_FOLDER_PATH = r"data"
DB_FILE = "olist.db"

# Connect to SQLite
conn = sqlite3.connect(DB_FILE)
cursor = conn.cursor()

# Enable foreign keys (optional)
cursor.execute("PRAGMA foreign_keys = ON;")

# -----------------------------
# Table creation DDLs for SQLite that incorporate the FDs
# -----------------------------
ddl = {
    "customers": """
        CREATE TABLE IF NOT EXISTS customers (
            customer_id TEXT PRIMARY KEY,
            customer_unique_id TEXT NOT NULL,
            customer_zip_code_prefix TEXT NOT NULL,
            customer_city TEXT NOT NULL,
            customer_state TEXT NOT NULL
        );
    """,
    "geolocation": """
        CREATE TABLE IF NOT EXISTS geolocation (
            geolocation_id INTEGER PRIMARY KEY AUTOINCREMENT,
            geolocation_zip_code_prefix TEXT NOT NULL,
            geolocation_lat REAL NOT NULL,
            geolocation_lng REAL NOT NULL,
            geolocation_city TEXT NOT NULL,
            geolocation_state TEXT NOT NULL
        );
    """,
    "orders": """
        CREATE TABLE IF NOT EXISTS orders (
            order_id TEXT PRIMARY KEY,
            customer_id TEXT NOT NULL,
            order_status TEXT NOT NULL,
            order_purchase_timestamp TEXT NOT NULL,
            order_approved_at TEXT,
            order_delivered_carrier_date TEXT,
            order_delivered_customer_date TEXT,
            order_estimated_delivery_date TEXT NOT NULL,
            FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
        );
    """,
    "order_items": """
        CREATE TABLE IF NOT EXISTS order_items (
            order_id TEXT NOT NULL,
            order_item_id INTEGER NOT NULL,
            product_id TEXT NOT NULL,
            seller_id TEXT NOT NULL,
            shipping_limit_date TEXT NOT NULL,
            price REAL NOT NULL,
            freight_value REAL NOT NULL,
            PRIMARY KEY (order_id, order_item_id),
            FOREIGN KEY (order_id) REFERENCES orders(order_id),
            FOREIGN KEY (product_id) REFERENCES products(product_id),
            FOREIGN KEY (seller_id) REFERENCES sellers(seller_id)
        );
    """,
    "order_payments": """
        CREATE TABLE IF NOT EXISTS order_payments (
            order_id TEXT NOT NULL,
            payment_sequential INTEGER NOT NULL,
            payment_type TEXT NOT NULL,
            payment_installments INTEGER NOT NULL,
            payment_value REAL NOT NULL,
            PRIMARY KEY (order_id, payment_sequential),
            FOREIGN KEY (order_id) REFERENCES orders(order_id)
        );
    """,
    "order_reviews": """
        CREATE TABLE IF NOT EXISTS order_reviews (
            review_record_id INTEGER PRIMARY KEY AUTOINCREMENT,
            review_id TEXT,
            order_id TEXT NOT NULL,
            review_score INTEGER NOT NULL,
            review_comment_title TEXT,
            review_comment_message TEXT,
            review_creation_date TEXT NOT NULL,
            review_answer_timestamp TEXT NOT NULL,
            FOREIGN KEY (order_id) REFERENCES orders(order_id)
        );
    """,
    "products": """
        CREATE TABLE IF NOT EXISTS products (
            product_id TEXT PRIMARY KEY,
            product_category_name TEXT,
            product_name_lenght INTEGER,
            product_description_lenght INTEGER,
            product_photos_qty INTEGER,
            product_weight_g INTEGER,
            product_length_cm INTEGER,
            product_height_cm INTEGER,
            product_width_cm INTEGER
        );
    """,
    "sellers": """
        CREATE TABLE IF NOT EXISTS sellers (
            seller_id TEXT PRIMARY KEY,
            seller_zip_code_prefix TEXT NOT NULL,
            seller_city TEXT NOT NULL,
            seller_state TEXT NOT NULL
        );
    """,
    "product_categories": """
        CREATE TABLE IF NOT EXISTS product_categories (
            product_category_name TEXT PRIMARY KEY,
            product_category_name_english TEXT NOT NULL
        );
    """
}

# -----------------------------
# Create all tables
# -----------------------------
for table, statement in ddl.items():
    print(f"Creating table: {table}")
    cursor.execute(statement)
conn.commit()

# -----------------------------
# CSV to table mapping
# -----------------------------
csv_to_table = {
    "olist_customers_dataset.csv": "customers",
    "olist_products_dataset.csv": "products",
    "olist_sellers_dataset.csv": "sellers",
    "olist_orders_dataset.csv": "orders",
    "olist_order_items_dataset.csv": "order_items",
    "olist_order_payments_dataset.csv": "order_payments",
    "olist_order_reviews_dataset.csv": "order_reviews",
    "olist_geolocation_dataset.csv": "geolocation",
    "product_category_name_translation.csv": "product_categories"
}

# -----------------------------
# Simple direct loading of CSVs into tables
# -----------------------------
for csv_file, table_name in csv_to_table.items():
    path = os.path.join(CSV_FOLDER, csv_file)
    print(f"Inserting data into '{table_name}' from '{csv_file}'...")
    df = pd.read_csv(path)
    df.to_sql(table_name, conn, if_exists='append', index=False)

conn.close()

The indexing strategy originally implemented in PostgreSQL was appropriately adapted for SQLite using a second script, index_db.py.

import os
import sqlite3
import pandas as pd

# Connect to your existing SQLite database
conn = sqlite3.connect("olist.db")
cursor = conn.cursor()

# Enable foreign keys
cursor.execute("PRAGMA foreign_keys = ON;")

# Index creation commands (translated from PostgreSQL to SQLite)
indexes = [
    # order_items.order_id
    """
    CREATE INDEX IF NOT EXISTS idx_order_items_order_id
    ON order_items(order_id);
    """,

    # order_payments.order_id
    """
    CREATE INDEX IF NOT EXISTS idx_order_payments_order_id
    ON order_payments(order_id);
    """,

    # orders.order_status
    """
    CREATE INDEX IF NOT EXISTS idx_orders_order_status
    ON orders(order_status);
    """,

    # order_items.seller_id
    """
    CREATE INDEX IF NOT EXISTS idx_order_items_seller_id
    ON order_items(seller_id);
    """,

    # sellers.seller_state
    """
    CREATE INDEX IF NOT EXISTS idx_sellers_seller_state
    ON sellers(seller_state);
    """,

    # order_reviews.order_id
    """
    CREATE INDEX IF NOT EXISTS idx_order_reviews_order_id
    ON order_reviews(order_id);
    """,

    # orders.order_id (already a primary key — no need, but harmless)
    """
    CREATE INDEX IF NOT EXISTS idx_orders_order_id
    ON orders(order_id);
    """,

    # order_payments.payment_value
    """
    CREATE INDEX IF NOT EXISTS idx_order_payments_payment_value
    ON order_payments(payment_value);
    """
]

# Execute all the index creation statements
for idx_stmt in indexes:
    cursor.execute(idx_stmt)

# Commit and close
conn.commit()
conn.close()

Specifically, while PostgreSQL allowed different indexing methods such as hash and btree, SQLite internally supports only B-tree indexing, which was uniformly applied.

Following database creation and indexing, a Streamlit application was developed to allow users to execute custom SQL queries on the database through an interactive web interface. The final application was deployed and made publicly accessible.

Abstract

Tech Stack

Introduction

Problem Statement

Why Not Excel?

Target Users

Administrator

E/R Diagram

Relationships in the E/R Diagram

customers to orders

orders to order_items

products to order_items

sellers to order_items

orders to order_payments

orders to order_review

products to product_categories

geolocation

Database Details

Table customers

Table products

Table sellers

Table orders

Table order_items

Table order_payments

Table order_reviews

Table geolocation

Table product_categories

Four Basic SQL Queries

Using GROUP BY and ORDER BY

Using GROUP BY and HAVING

Using CASE and AVG

Using WITH and PERCENT_RANK

Global BCNF Verification

Conclusion

customers

Functional Dependencies

Candidate Keys

Justification

geolocation

Functional Dependencies

Candidate Keys

Justification

order_items

Functional Dependencies

Candidate Keys

Justification

order_payments

Functional Dependencies

Candidate Keys

Justification

order_reviews

Functional Dependencies

Candidate Keys

Justification

orders

Functional Dependencies

Candidate Keys

Justification

products

Functional Dependencies

Candidate Keys

Justification

sellers

Functional Dependencies

Candidate Keys

Justification

product_category_name_translation

Functional Dependencies

Candidate Keys

Justification

Finalizing the E/R Diagram

Constraints

Validating Constraints Using Real Data

customers

orders

order_items

Database Design Theory for Redundancies

Problems Handling the Large Data and Using Indexing

Problem

Examples

`customers` to `orders`

`orders` to `order_items`

`products` to `order_items`

`sellers` to `order_items`

`orders` to `order_payments`

`orders` to `order_review`

`products` to `product_categories`

`geolocation`

Table `customers`

Table `products`

Table `sellers`

Table `orders`

Table `order_items`

Table `order_payments`

Table `order_reviews`

Table `geolocation`

Table `product_categories`

Using `GROUP BY` and `ORDER BY`

Using `GROUP BY` and `HAVING`

Using `CASE` and `AVG`

Using `WITH` and `PERCENT_RANK`

`customers`

`geolocation`

`order_items`

`order_payments`

`order_reviews`

`orders`

`products`

`sellers`

`product_category_name_translation`

`customers`

`orders`

`order_items`

`INSERT` Queries

`DELETE` Queries

Delete All Reviews with `NULL` Scores

`UPDATE` Queries

`SELECT` Queries