Data Pipeline Overview
The project follows a 5-step pipeline: Scrape Locations → Scrape Details → Generate Legal Data → Inject Dirty Data → Clean. Each step is documented and reproducible.
| Step | Description | Input | Output |
|---|---|---|---|
| 1 | Google Maps scraping (locations) | 25+ regional search queries | 557 outlet locations (JSON) |
| 2 | Detail scraping (per outlet) | 557 Google Maps URLs | Address, rating, reviews, hours (JSON) |
| 3 | Legal data generation | Outlet master data | IMB, lease, halal, disputes, rent (CSV) |
| 4 | Dirty data injection | Clean CSV | 562 dirty rows (5 duplicates + anomalies) |
| 5 | Data cleaning & validation | Dirty CSV | 557 clean rows (validated) |
Step 1: Google Maps Scraping
The goal was to collect all Mie Gacoan outlet locations across Indonesia from Google Maps, a publicly available data source.
Approach: Playwright + StealthyFetcher
Playwright was used in stealth mode to search Google Maps for "Mie Gacoan" across 25+ regional queries (e.g., "Mie Gacoan Surabaya", "Mie Gacoan Jakarta", "Mie Gacoan Bandung"). Each query returns a scrollable list of locations, which the scraper scrolls through and parses. Merging all query results yielded 557 unique outlet locations with coordinates, names, and place IDs.
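The scraping itself requires a live browser session, but the merge step, combining 25+ regional result lists into one set keyed by place ID, can be sketched as follows (function names, the `REGIONS` list, and the record fields are illustrative, not the actual script):

```python
# Merge results from multiple regional Google Maps queries,
# deduplicating by place ID. Each scraped record is assumed to
# look like: {"place_id": "...", "name": "...", "lat": ..., "lng": ...}

REGIONS = ["Surabaya", "Jakarta", "Bandung"]  # 25+ regions in the real run

def build_queries(brand="Mie Gacoan", regions=REGIONS):
    """One Google Maps search query per region."""
    return [f"{brand} {region}" for region in regions]

def dedupe_by_place_id(result_lists):
    """Flatten per-query result lists, keeping the first record per place ID."""
    seen = {}
    for results in result_lists:
        for outlet in results:
            seen.setdefault(outlet["place_id"], outlet)
    return list(seen.values())
```

Deduplicating on place ID rather than name matters because adjacent regional queries (e.g., Jakarta and Bandung) can return overlapping outlets.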
Step 2: Detail Scraping
For each of the 557 locations, additional details were scraped from individual Google Maps pages:
- Full address: street, city, province, postal code
- Rating & reviews: average rating and review count
- Operating hours: daily schedule
- Accessibility info: wheelchair access, parking availability
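Each detail page is flattened into one record per outlet. A minimal shape for that record might look like the following (field names are assumptions for illustration, not the actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class OutletDetails:
    """One scraped Google Maps detail page, flattened to a record."""
    place_id: str
    street: str
    city: str
    province: str
    postal_code: str
    rating: float                               # average star rating, e.g. 4.6
    review_count: int
    hours: dict = field(default_factory=dict)   # day -> "HH:MM-HH:MM"
    wheelchair_access: bool = False
    parking: bool = False
```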
Step 3: Legal Data Generation
Since internal legal compliance data isn't publicly available, realistic legal attributes were generated for each outlet:
- IMB status: Approved (80%), In Progress (15%), Bermasalah ("problematic", 5%); weighted by province and outlet age
- Lease dates: start/end dates with realistic 2-5 year terms, rent amounts weighted by province (Jakarta highest)
- Halal certification: Certified (85%), In Process (10%), Pending Audit (5%)
- Dispute levels: Low (70%), Medium (20%), High (10%); weighted by urban density
- Annual rent: Rp 150M-800M range, correlated with province and location type
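The weighted draws above can be sketched with the standard library's `random.choices`. This is a simplified illustration of the generation logic (the real scripts also weight by province and outlet age; the function names and the Jakarta rent floor here are assumptions):

```python
import random

random.seed(42)  # same seed convention as the generation scripts

IMB_STATUSES = ["Approved", "In Progress", "Bermasalah"]
IMB_WEIGHTS = [0.80, 0.15, 0.05]

def generate_imb_status(rng=random):
    """Draw one IMB status with the documented 80/15/5 split."""
    return rng.choices(IMB_STATUSES, weights=IMB_WEIGHTS, k=1)[0]

def generate_annual_rent(province, rng=random):
    """Rent in rupiah within Rp 150M-800M, with a higher floor for Jakarta."""
    low = 400_000_000 if province == "DKI Jakarta" else 150_000_000
    return rng.randint(low, 800_000_000)
```

Passing an explicit `random.Random` instance keeps each attribute's draws independent of generation order, which helps reproducibility when new attributes are added.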
Step 4: Dirty Data Injection
Realistic data quality issues were injected to simulate real-world data problems:
| Issue | Count | Example |
|---|---|---|
| Duplicate rows | 5 | Exact row duplicates |
| Province typos | ~3% | "jawa timur" instead of "Jawa Timur" |
| Missing IMB status | ~2% | Empty IMB field |
| Inconsistent date formats | ~3% | DD/MM/YYYY vs YYYY-MM-DD |
| Missing rent values | ~2% | Empty rent field |
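The injection step can be sketched with pandas: append exact duplicates, then corrupt a small random sample of fields. Column names (`province`, `imb_status`) are assumptions, and only two of the five issue types are shown:

```python
import random
import pandas as pd

def inject_dirty_data(clean: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Append 5 exact duplicate rows, then corrupt a few fields."""
    rng = random.Random(seed)
    dirty = pd.concat([clean, clean.sample(5, random_state=seed)],
                      ignore_index=True)
    n = len(dirty)
    # ~3% province typos: lowercase the value ("Jawa Timur" -> "jawa timur")
    for i in rng.sample(range(n), max(1, n * 3 // 100)):
        dirty.loc[i, "province"] = str(dirty.loc[i, "province"]).lower()
    # ~2% missing IMB status
    for i in rng.sample(range(n), max(1, n * 2 // 100)):
        dirty.loc[i, "imb_status"] = None
    return dirty
```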
Step 5: Data Cleaning
The cleaning pipeline handles all injected anomalies:
- Deduplication: remove 5 exact duplicate rows
- Standardize province names: title case, fix typos
- Standardize date formats: convert all to YYYY-MM-DD
- Handle missing values: impute or flag for review
- Recalculate derived fields: days to expiry, compliance score
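The Pandas variant of this pipeline can be sketched as below. Column names (`province`, `lease_end`, `imb_status`, `annual_rent`) and the flag-instead-of-impute policy are illustrative assumptions; derived-field recalculation is omitted:

```python
import pandas as pd

def to_iso(value):
    """Parse one date string (DD/MM/YYYY or YYYY-MM-DD) to YYYY-MM-DD."""
    return pd.to_datetime(value, dayfirst=True).strftime("%Y-%m-%d")

def clean_outlets(dirty: pd.DataFrame) -> pd.DataFrame:
    """Dedupe, standardize provinces and dates, flag missing values."""
    df = dirty.drop_duplicates().copy()
    # Province names: trim whitespace, title-case ("jawa timur" -> "Jawa Timur")
    df["province"] = df["province"].str.strip().str.title()
    # Dates: parse each value independently so mixed formats coexist
    df["lease_end"] = df["lease_end"].map(to_iso)
    # Missing values: flag rows for review rather than silently imputing
    df["needs_review"] = df["imb_status"].isna() | df["annual_rent"].isna()
    return df
```

Parsing dates value-by-value with `dayfirst=True` sidesteps pandas' whole-column format inference, which can misparse a column that mixes DD/MM/YYYY and ISO strings.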
Tools & Stack
| Tool | Purpose |
|---|---|
| Python | Scraping (Playwright), data generation, cleaning (Pandas) |
| SQL | Alternative cleaning pipeline (ClickHouse-compatible) |
| Looker Studio | Dashboard visualization (interactive) |
| HTML/CSS/JS | Portfolio presentation (this website) |
Reproducibility
All generation scripts use `random.seed(42)` for reproducible output. The scraping scripts capture a snapshot of Google Maps data at a specific point in time; outlet counts may change as Mie Gacoan continues expanding.