## Data Pipeline Overview
The project follows a 4-step pipeline: Scrape → Generate → Dirty → Clean. Each step is a separate Python script and is fully reproducible.
| Step | Script | Input | Output |
|---|---|---|---|
| 1 | 01_scrape_blibli_categories.py | Blibli search pages | 113 products (JSON) |
| 2 | 02_generate_sales_data.py | Product catalog | 12,172 transactions (CSV) |
| 3 | (inject dirty data) | Clean CSV | 12,188 dirty rows (CSV) |
| 4 | 03_clean_sales_data.py | Dirty CSV | Clean CSV (validated) |
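The four steps can be chained by a small driver. The sketch below uses the script names from the table; `inject_dirty_data.py` is a hypothetical placeholder, since the table lists no script for step 3:

```python
import subprocess

# Pipeline steps in order, per the table above. The third entry is a
# hypothetical placeholder name: the dirty-data injection step has no
# script listed in the source table.
PIPELINE = [
    "01_scrape_blibli_categories.py",  # Blibli search pages -> products (JSON)
    "02_generate_sales_data.py",       # product catalog -> clean transactions (CSV)
    "inject_dirty_data.py",            # hypothetical: clean CSV -> dirty CSV
    "03_clean_sales_data.py",          # dirty CSV -> validated clean CSV
]

def run_pipeline(scripts: list[str]) -> None:
    """Run each step in order, stopping on the first failure."""
    for script in scripts:
        subprocess.run(["python", script], check=True)

# run_pipeline(PIPELINE)  # uncomment to execute the full pipeline
```

`check=True` makes a failed step raise immediately, so a broken scrape never feeds garbage into the generation step.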
## Step 1: Data Collection – The Scraping Journey
The goal was to get a product catalog with Gondowangi products alongside competitors, with real prices, ratings, and store info.
### ❌ Attempt 1: Tokopedia
Tokopedia uses client-side rendering (it is a single-page application), so product data only loads after JavaScript executes. Programmatic requests return empty HTML shells, and an anti-bot CAPTCHA is triggered after a few requests.
### ❌ Attempt 2: Shopee
Same SPA architecture as Tokopedia, plus even more aggressive CAPTCHA checks: requests are blocked before the page even renders.
### ❌ Attempt 3: Gondowangi Website
Company website focuses on branding, not e-commerce. No product catalog with prices. No competitor data.
### ✅ Attempt 4: Blibli
Blibli uses server-side rendering: product data is embedded in the initial HTML response, so no JavaScript execution is needed. This approach successfully extracted 113 unique products across 8 categories with prices, ratings, sold counts, and store info.
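A minimal sketch of that extraction approach, assuming the product data sits in an embedded JSON variable. The `window.__PRODUCTS__` name, the sample markup, and the field names are illustrative assumptions, not Blibli's actual page structure:

```python
import json
import re

# Server-side rendering means product data ships inside the initial HTML.
# This sample page and the __PRODUCTS__ variable name are illustrative
# stand-ins for whatever the real page embeds.
SAMPLE_HTML = """
<html><body>
<script>window.__PRODUCTS__ = [{"name": "Sampo A", "price": 25000,
 "rating": 4.7, "sold": 312}];</script>
</body></html>
"""

def extract_products(html: str) -> list[dict]:
    """Pull the embedded JSON array out of the raw page source."""
    match = re.search(r"window\.__PRODUCTS__\s*=\s*(\[.*?\]);", html, re.S)
    if not match:
        return []
    return json.loads(match.group(1))

products = extract_products(SAMPLE_HTML)
```

Because the data is in the initial response, a plain HTTP fetch plus a regex/JSON parse is enough; no headless browser is required.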
## Step 2: Sales Data Generation
Since internal Gondowangi data isn't publicly available, realistic sales transactions were generated using:
- Real product catalog from Blibli as SKU master (113 products, real prices)
- 10 Indonesian regions with weighted distribution (Jabodetabek 30%, Jawa Barat 12%, etc.)
- 6 sales channels with realistic shares (e.g. Modern Trade 30%, General Trade 25%, E-Commerce 35%)
- 12-month period (Mar 2025 – Feb 2026) with seasonality (Ramadan boost, year-end spike)
- Brand-weighted demand: multinational brands (Dove, Pantene) get higher base volume than local brands
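The weighting scheme above can be sketched with the stdlib `random` module. The region and channel weights are taken from the list; the region/channel subsets, seasonality multipliers, and base volume are illustrative placeholders, not the generator's actual parameters:

```python
import random

random.seed(42)  # same seed as the real pipeline, for reproducibility

# Illustrative subsets: the full generator uses 10 regions, 6 channels,
# and a 12-month seasonality curve.
REGIONS = ["Jabodetabek", "Jawa Barat", "Jawa Timur"]
REGION_WEIGHTS = [0.30, 0.12, 0.10]

CHANNELS = ["Modern Trade (Supermarket)", "General Trade", "E-Commerce"]
CHANNEL_WEIGHTS = [0.30, 0.25, 0.35]

# Hypothetical multipliers: Ramadan boost in March, year-end spike.
SEASONALITY = {"2025-03": 1.4, "2025-12": 1.3}  # other months default to 1.0

def make_transaction(month: str, base_qty: int = 100) -> dict:
    """Draw one transaction with weighted region/channel and seasonal volume."""
    qty = round(base_qty * SEASONALITY.get(month, 1.0) * random.uniform(0.7, 1.3))
    return {
        "month": month,
        "region": random.choices(REGIONS, weights=REGION_WEIGHTS)[0],
        "channel": random.choices(CHANNELS, weights=CHANNEL_WEIGHTS)[0],
        "qty": qty,
    }

rows = [make_transaction("2025-03") for _ in range(5)]
```

`random.choices` with per-item weights gives the skewed regional and channel mix directly, without manual cumulative-probability bookkeeping.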
## Step 3: Dirty Data Injection
9 types of realistic data quality issues were injected:
| Issue | Rate | Example |
|---|---|---|
| Duplicate rows | 15 rows | Exact row duplicates |
| Region typos | 5% | "jabodetabek" instead of "Jabodetabek" |
| Channel typos | 5% | "modern trade" instead of "Modern Trade (Supermarket)" |
| Category typos | 3% | "shampo" instead of "Shampoo" |
| Missing Brand | 3% | Empty brand field |
| Missing Region | 2% | Empty region field |
| Negative Qty | 1% | -274 instead of 274 |
| Zero Price | 1% | Harga_Satuan = 0 |
| Date format mix | 3% | DD/MM/YYYY vs YYYY-MM-DD |
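A few of these injections can be sketched as follows. This is a simplified illustration, not the project's actual injection code: only three of the nine issue types are shown, and the sample rows are made up:

```python
import random

random.seed(42)

# A toy clean dataset standing in for the 12,172-row transaction CSV.
clean_rows = [
    {"region": "Jabodetabek", "channel": "Modern Trade (Supermarket)", "qty": 274}
    for _ in range(100)
]

def inject_dirty(rows: list[dict]) -> list[dict]:
    """Inject a subset of the issue types from the table (illustrative rates)."""
    dirty = [dict(r) for r in rows]  # copy so the clean data stays intact
    for row in dirty:
        if random.random() < 0.05:   # region typos: lowercase the value
            row["region"] = row["region"].lower()
        if random.random() < 0.02:   # missing region: blank the field
            row["region"] = ""
        if random.random() < 0.01:   # negative quantity: flip the sign
            row["qty"] = -row["qty"]
    dirty += random.sample(dirty, 15)  # append 15 exact duplicate rows
    return dirty

dirty_rows = inject_dirty(clean_rows)
```

Copying the rows first keeps the clean dataset untouched, so the cleaning step in the next script can be validated against known ground truth.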
## Tools & Stack
| Tool | Purpose |
|---|---|
| Python | Data collection, generation, cleaning (pandas, numpy) |
| SQL | Alternative cleaning pipeline (03_clean_sales_data.sql) |
| Tableau | Dashboard visualization (interactive) |
| HTML/CSS/JS | Portfolio presentation (this website) |
## Reproducibility
All scripts use `random.seed(42)` for reproducible output. Running the pipeline from scratch produces identical datasets every time.
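A minimal illustration of why fixed seeding yields identical output. The helper below is illustrative, not from the project scripts; it uses a local `random.Random` instance instead of the global `random.seed(42)` call the scripts use, which keeps the example self-contained:

```python
import random

def generate(seed: int, n: int = 5) -> list[int]:
    """Draw n pseudo-random quantities under a fixed seed."""
    rng = random.Random(seed)  # local generator: no shared global state
    return [rng.randint(1, 1000) for _ in range(n)]

# The same seed always reproduces the exact same sequence.
assert generate(42) == generate(42)
```

A local `random.Random` instance is generally safer than seeding the global module state, since unrelated code drawing random numbers in between cannot perturb the sequence.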