Methodology

Appendix · A1

Data Pipeline & Technical Approach

From marketplace scraping to clean dataset: every step documented.

Data Pipeline Overview

The project follows a 4-step pipeline: Scrape → Generate → Dirty → Clean. Each step is a separate Python script, fully reproducible.

| Step | Script | Input | Output |
| --- | --- | --- | --- |
| 1 | 01_scrape_blibli_categories.py | Blibli search pages | 113 products (JSON) |
| 2 | 02_generate_sales_data.py | Product catalog | 12,172 transactions (CSV) |
| 3 | (inject dirty data) | Clean CSV | 12,188 dirty rows (CSV) |
| 4 | 03_clean_sales_data.py | Dirty CSV | Clean CSV (validated) |

Step 1: Data Collection (The Scraping Journey)

The goal was to build a product catalog covering Gondowangi products alongside competitors, with real prices, ratings, and store info.

โŒ Attempt 1: Tokopedia

Tokopedia uses client-side rendering (SPA). Product data only loads after JavaScript executes, so programmatic access returns empty HTML shells. An anti-bot CAPTCHA was also triggered after a few requests.

โŒ Attempt 2: Shopee

Same SPA architecture as Tokopedia, plus an even more aggressive CAPTCHA. Requests were blocked before the page even rendered.

โŒ Attempt 3: Gondowangi Website

The company website focuses on branding, not e-commerce. It has no product catalog with prices and no competitor data.

✅ Attempt 4: Blibli

Blibli uses server-side rendering: product data is embedded in the initial HTML response, so no JavaScript execution is needed. Successfully extracted 113 unique products across 8 categories with prices, ratings, sold counts, and store info.
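To illustrate why server-side rendering makes the difference, here is a minimal sketch using only the standard library. The HTML snippets, the class names (`product`, `name`, `price`), and the product names are invented for illustration and are not Blibli's actual markup; no network request is made.

```python
from html.parser import HTMLParser

# Hypothetical server-rendered snippet; real marketplace markup will differ.
SSR_HTML = """
<div class="product"><span class="name">Shampoo A</span><span class="price">25000</span></div>
<div class="product"><span class="name">Shampoo B</span><span class="price">31500</span></div>
"""

# What a client-side-rendered page typically returns before JavaScript runs.
SPA_HTML = '<div id="root"></div><script src="/app.js"></script>'


class ProductParser(HTMLParser):
    """Collects name/price pairs from spans inside div.product elements."""

    def __init__(self):
        super().__init__()
        self.products = []
        self._field = None  # which field the current <span> holds, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "div" and cls == "product":
            self.products.append({})
        elif tag == "span" and cls in ("name", "price") and self.products:
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self.products[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_products(html):
    """Return complete products only, with the price converted to an int."""
    parser = ProductParser()
    parser.feed(html)
    return [
        {"name": p["name"], "price": int(p["price"])}
        for p in parser.products
        if "name" in p and "price" in p
    ]


print(len(extract_products(SSR_HTML)))  # products found in the SSR response
print(len(extract_products(SPA_HTML)))  # nothing to find in the empty shell
```

The same parser finds data in the server-rendered response and nothing in the SPA shell, which is the behavior described for Blibli versus Tokopedia and Shopee.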

Step 2: Sales Data Generation

Since internal Gondowangi sales data isn't publicly available, 12,172 realistic sales transactions were generated synthetically from the scraped product catalog.
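The generation script is not reproduced on this page. As a rough sketch only, seeded synthetic transactions could be drawn from a catalog along these lines. The two-item catalog, the placeholder region and channel values, the date range, the quantity range, and every column name except `Harga_Satuan` are assumptions made for illustration.

```python
import random
from datetime import date, timedelta

# Hypothetical two-item catalog; the real one holds 113 scraped products.
CATALOG = [
    {"Product": "Product A", "Category": "Shampoo", "Harga_Satuan": 25000},
    {"Product": "Product B", "Category": "Shampoo", "Harga_Satuan": 31500},
]
REGIONS = ["Jabodetabek", "Region B"]  # second value is a placeholder
CHANNELS = ["Modern Trade (Supermarket)", "Channel B"]  # placeholder too


def generate_transactions(n, seed=42, start=date(2024, 1, 1), days=365):
    """Build n synthetic sales rows; start date and ranges are assumed."""
    rng = random.Random(seed)  # local RNG so the seed does not leak globally
    rows = []
    for i in range(n):
        item = rng.choice(CATALOG)
        qty = rng.randint(1, 500)
        rows.append({
            "Transaction_ID": i + 1,
            "Date": (start + timedelta(days=rng.randrange(days))).isoformat(),
            "Region": rng.choice(REGIONS),
            "Channel": rng.choice(CHANNELS),
            "Product": item["Product"],
            "Category": item["Category"],
            "Qty": qty,
            "Harga_Satuan": item["Harga_Satuan"],
            "Total": qty * item["Harga_Satuan"],
        })
    return rows


rows = generate_transactions(5)
print(rows[0]["Date"], rows[0]["Product"], rows[0]["Total"])
```

Using a local `random.Random(seed)` instance is a design choice of this sketch; it keeps the output repeatable without touching the global generator.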

Step 3: Dirty Data Injection

Nine types of realistic data quality issues were injected:

| Issue | Rate | Example |
| --- | --- | --- |
| Duplicate rows | 15 rows | Exact row duplicates |
| Region typos | 5% | "jabodetabek" instead of "Jabodetabek" |
| Channel typos | 5% | "modern trade" instead of "Modern Trade (Supermarket)" |
| Category typos | 3% | "shampo" instead of "Shampoo" |
| Missing Brand | 3% | Empty brand field |
| Missing Region | 2% | Empty region field |
| Negative Qty | 1% | -274 instead of 274 |
| Zero Price | 1% | Harga_Satuan = 0 |
| Date format mix | 3% | DD/MM/YYYY vs YYYY-MM-DD |
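As a hedged sketch of how such injection might look, the function below applies a subset of the issues from the table (region casing, missing brand, negative quantity, zero price, mixed date formats, and duplicate rows) at the listed rates. It is not the project's actual injection code; the column names `Region`, `Brand`, `Qty`, and `Date` are assumptions, and only `Harga_Satuan` appears in the source.

```python
import random


def inject_dirty(rows, seed=42, n_duplicates=15):
    """Return a dirtied copy of rows; the input list is left untouched."""
    rng = random.Random(seed)
    dirty = [dict(r) for r in rows]
    for row in dirty:
        if rng.random() < 0.05:
            row["Region"] = row["Region"].lower()  # casing typo
        if rng.random() < 0.03:
            row["Brand"] = ""  # missing brand
        if rng.random() < 0.01:
            row["Qty"] = -abs(row["Qty"])  # sign flip
        if rng.random() < 0.01:
            row["Harga_Satuan"] = 0  # zero price
        if rng.random() < 0.03:
            y, m, d = row["Date"].split("-")
            row["Date"] = f"{d}/{m}/{y}"  # YYYY-MM-DD -> DD/MM/YYYY
    # Append exact copies of randomly chosen rows.
    duplicates = [dict(rng.choice(dirty)) for _ in range(n_duplicates)]
    return dirty + duplicates


sample = [
    {"Region": "Jabodetabek", "Brand": "Brand A", "Qty": 274,
     "Harga_Satuan": 25000, "Date": "2024-03-09"}
    for _ in range(1000)
]
print(len(inject_dirty(sample)))  # 1000 input rows plus 15 duplicates
```

Because each issue is rolled independently per row, a single row can carry more than one defect, which is common in real exports.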

Tools & Stack

| Tool | Purpose |
| --- | --- |
| Python | Data collection, generation, cleaning (pandas, numpy) |
| SQL | Alternative cleaning pipeline (03_clean_sales_data.sql) |
| Tableau | Dashboard visualization (interactive) |
| HTML/CSS/JS | Portfolio presentation (this website) |
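The cleaning logic is not spelled out on this page. As one possible sketch in plain Python, several of the issues listed under Step 3 could be reversed as follows. The canonical values come from the examples in that table; the column names and the decision to drop exact duplicates by full-row comparison are assumptions, and the handling of zero prices and missing fields is deliberately left out because the source does not say how they are treated.

```python
from datetime import datetime

# Canonical spellings keyed by the lowercase typo seen in the Step 3 examples.
CANONICAL = {
    "jabodetabek": "Jabodetabek",
    "modern trade": "Modern Trade (Supermarket)",
    "shampo": "Shampoo",
}


def normalize_date(text):
    """Accept DD/MM/YYYY or YYYY-MM-DD and return YYYY-MM-DD."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def clean_rows(rows):
    """Fix known typos, dates, and negative quantities; drop exact duplicates."""
    seen, cleaned = set(), []
    for row in rows:
        row = dict(row)  # work on a copy
        for col in ("Region", "Channel", "Category"):
            value = row.get(col, "")
            row[col] = CANONICAL.get(value.strip().lower(), value)
        row["Date"] = normalize_date(row["Date"])
        row["Qty"] = abs(row["Qty"])
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate of a row already kept
        seen.add(key)
        cleaned.append(row)
    return cleaned
```

Full-row deduplication is only safe when rows carry a unique identifier; without one, two genuinely identical transactions would be merged.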

📋 Reproducibility

All scripts use random.seed(42) for reproducible output. Running the pipeline from scratch produces identical datasets every time.
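A small sketch of what seeding buys: resetting the generator to the same seed before each run yields the same sequence. The function name and the quantity range are illustrative only.

```python
import random


def sample_quantities(n, seed=42):
    random.seed(seed)  # reset the global generator to a known state
    return [random.randint(1, 500) for _ in range(n)]


first = sample_quantities(5)
second = sample_quantities(5)
print(first == second)  # True: same seed, same sequence
```

Note that `random.seed` only affects Python's built-in `random` module. Any code that draws from NumPy's random generator needs its own seed to be repeatable.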