## Data Pipeline Overview
The project follows a 4-step pipeline: Scrape → Generate → Dirty → Clean. Each step is a separate Python script and is fully reproducible.
| Step | Script | Input | Output |
|---|---|---|---|
| 1 | 01_scrape_blibli_categories.py | Blibli search pages | 113 products (JSON) |
| 2 | 02_generate_sales_data.py | Product catalog | 12,172 transactions (CSV) |
| 3 | (inject dirty data) | Clean CSV | 12,188 dirty rows (CSV) |
| 4 | 03_clean_sales_data.py | Dirty CSV | Clean CSV (validated) |
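The four steps can be chained by a small driver. The sketch below uses the script names from the table; `inject_dirty_data.py` is a hypothetical placeholder, since the table lists no script for step 3:

```python
import subprocess

# Pipeline steps in order, per the table above. The third entry is a
# hypothetical placeholder name: the dirty-data injection step has no
# script listed in the source table.
PIPELINE = [
    "01_scrape_blibli_categories.py",  # Blibli search pages -> products (JSON)
    "02_generate_sales_data.py",       # product catalog -> clean transactions (CSV)
    "inject_dirty_data.py",            # hypothetical: clean CSV -> dirty CSV
    "03_clean_sales_data.py",          # dirty CSV -> validated clean CSV
]

def run_pipeline(scripts: list[str]) -> None:
    """Run each step in order, stopping on the first failure."""
    for script in scripts:
        subprocess.run(["python", script], check=True)

# run_pipeline(PIPELINE)  # uncomment to execute the full pipeline
```

`check=True` makes a failed step raise immediately, so a broken scrape never feeds garbage into the generation step.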
## Step 1: Data Collection – The Scraping Journey
The goal was to get a product catalog with Gondowangi products alongside competitors, with real prices, ratings, and store info.
### ❌ Attempt 1: Tokopedia
Tokopedia uses client-side rendering (it is a single-page application), so product data only loads after JavaScript executes. Programmatic requests return empty HTML shells, and an anti-bot CAPTCHA is triggered after a few requests.
### ❌ Attempt 2: Shopee
Same SPA architecture as Tokopedia, plus even more aggressive CAPTCHA checks: requests are blocked before the page even renders.
### ❌ Attempt 3: Gondowangi Website
Company website focuses on branding, not e-commerce. No product catalog with prices. No competitor data.
### ✅ Attempt 4: Blibli
Blibli uses server-side rendering: product data is embedded in the initial HTML response, so no JavaScript execution is needed. This approach successfully extracted 113 unique products across 8 categories with prices, ratings, sold counts, and store info.
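A minimal sketch of that extraction approach, assuming the product data sits in an embedded JSON variable. The `window.__PRODUCTS__` name, the sample markup, and the field names are illustrative assumptions, not Blibli's actual page structure:

```python
import json
import re

# Server-side rendering means product data ships inside the initial HTML.
# This sample page and the __PRODUCTS__ variable name are illustrative
# stand-ins for whatever the real page embeds.
SAMPLE_HTML = """
<html><body>
<script>window.__PRODUCTS__ = [{"name": "Sampo A", "price": 25000,
 "rating": 4.7, "sold": 312}];</script>
</body></html>
"""

def extract_products(html: str) -> list[dict]:
    """Pull the embedded JSON array out of the raw page source."""
    match = re.search(r"window\.__PRODUCTS__\s*=\s*(\[.*?\]);", html, re.S)
    if not match:
        return []
    return json.loads(match.group(1))

products = extract_products(SAMPLE_HTML)
```

Because the data is in the initial response, a plain HTTP fetch plus a regex/JSON parse is enough; no headless browser is required.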
## Step 2: Sales Data Generation
Since internal Gondowangi data isn't publicly available, realistic sales transactions were generated using:
- Real product catalog from Blibli as SKU master (113 products, real prices)
- 10 Indonesian regions with weighted distribution (Jabodetabek 30%, Jawa Barat 12%, etc.)
- 6 sales channels with realistic shares (e.g. Modern Trade 30%, General Trade 25%, E-Commerce 35%)
- 12-month period (Mar 2025 – Feb 2026) with seasonality (Ramadan boost, year-end spike)
- Brand-weighted demand: multinational brands (Dove, Pantene) get higher base volume than local brands
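The weighting scheme above can be sketched with the stdlib `random` module. The region and channel weights are taken from the list; the region/channel subsets, seasonality multipliers, and base volume are illustrative placeholders, not the generator's actual parameters:

```python
import random

random.seed(42)  # same seed as the real pipeline, for reproducibility

# Illustrative subsets: the full generator uses 10 regions, 6 channels,
# and a 12-month seasonality curve.
REGIONS = ["Jabodetabek", "Jawa Barat", "Jawa Timur"]
REGION_WEIGHTS = [0.30, 0.12, 0.10]

CHANNELS = ["Modern Trade (Supermarket)", "General Trade", "E-Commerce"]
CHANNEL_WEIGHTS = [0.30, 0.25, 0.35]

# Hypothetical multipliers: Ramadan boost in March, year-end spike.
SEASONALITY = {"2025-03": 1.4, "2025-12": 1.3}  # other months default to 1.0

def make_transaction(month: str, base_qty: int = 100) -> dict:
    """Draw one transaction with weighted region/channel and seasonal volume."""
    qty = round(base_qty * SEASONALITY.get(month, 1.0) * random.uniform(0.7, 1.3))
    return {
        "month": month,
        "region": random.choices(REGIONS, weights=REGION_WEIGHTS)[0],
        "channel": random.choices(CHANNELS, weights=CHANNEL_WEIGHTS)[0],
        "qty": qty,
    }

rows = [make_transaction("2025-03") for _ in range(5)]
```

`random.choices` with per-item weights gives the skewed regional and channel mix directly, without manual cumulative-probability bookkeeping.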
## Step 3: Dirty Data Injection
9 types of realistic data quality issues were injected:
| Issue | Rate | Example |
|---|---|---|
| Duplicate rows | 15 rows | Exact row duplicates |
| Region typos | 5% | "jabodetabek" instead of "Jabodetabek" |
| Channel typos | 5% | "modern trade" instead of "Modern Trade (Supermarket)" |
| Category typos | 3% | "shampo" instead of "Shampoo" |
| Missing Brand | 3% | Empty brand field |
| Missing Region | 2% | Empty region field |
| Negative Qty | 1% | -274 instead of 274 |
| Zero Price | 1% | Harga_Satuan = 0 |
| Date format mix | 3% | DD/MM/YYYY vs YYYY-MM-DD |
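A few of these injections can be sketched as follows. This is a simplified illustration, not the project's actual injection code: only three of the nine issue types are shown, and the sample rows are made up:

```python
import random

random.seed(42)

# A toy clean dataset standing in for the 12,172-row transaction CSV.
clean_rows = [
    {"region": "Jabodetabek", "channel": "Modern Trade (Supermarket)", "qty": 274}
    for _ in range(100)
]

def inject_dirty(rows: list[dict]) -> list[dict]:
    """Inject a subset of the issue types from the table (illustrative rates)."""
    dirty = [dict(r) for r in rows]  # copy so the clean data stays intact
    for row in dirty:
        if random.random() < 0.05:   # region typos: lowercase the value
            row["region"] = row["region"].lower()
        if random.random() < 0.02:   # missing region: blank the field
            row["region"] = ""
        if random.random() < 0.01:   # negative quantity: flip the sign
            row["qty"] = -row["qty"]
    dirty += random.sample(dirty, 15)  # append 15 exact duplicate rows
    return dirty

dirty_rows = inject_dirty(clean_rows)
```

Copying the rows first keeps the clean dataset untouched, so the cleaning step in the next script can be validated against known ground truth.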
## Tools & Stack
| Tool | Purpose |
|---|---|
| Python | Data collection, generation, cleaning (pandas, numpy) |
| SQL | Alternative cleaning pipeline (03_clean_sales_data.sql) |
| Tableau | Dashboard visualization (interactive) |
| HTML/CSS/JS | Portfolio presentation (this website) |
## Reproducibility
All scripts use `random.seed(42)` for reproducible output. Running the pipeline from scratch produces identical datasets every time.
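A minimal illustration of why fixed seeding yields identical output. The helper below is illustrative, not from the project scripts; it uses a local `random.Random` instance instead of the global `random.seed(42)` call the scripts use, which keeps the example self-contained:

```python
import random

def generate(seed: int, n: int = 5) -> list[int]:
    """Draw n pseudo-random quantities under a fixed seed."""
    rng = random.Random(seed)  # local generator: no shared global state
    return [rng.randint(1, 1000) for _ in range(n)]

# The same seed always reproduces the exact same sequence.
assert generate(42) == generate(42)
```

A local `random.Random` instance is generally safer than seeding the global module state, since unrelated code drawing random numbers in between cannot perturb the sequence.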