Methodology

Appendix · A1

Data Pipeline & Technical Approach

From Google Maps scraping to a clean legal compliance dataset, with every step documented.

Data Pipeline Overview

The project follows a 5-step pipeline: Scrape Locations → Scrape Details → Generate Legal Data → Inject Dirty Data → Clean. Each step is documented and reproducible.

| Step | Description | Input | Output |
| --- | --- | --- | --- |
| 1 | Google Maps scraping (locations) | 25+ regional search queries | 557 outlet locations (JSON) |
| 2 | Detail scraping (per outlet) | 557 Google Maps URLs | Address, rating, reviews, hours (JSON) |
| 3 | Legal data generation | Outlet master data | IMB, lease, halal, disputes, rent (CSV) |
| 4 | Dirty data injection | Clean CSV | 562 dirty rows (5 duplicates + anomalies) |
| 5 | Data cleaning & validation | Dirty CSV | 557 clean rows (validated) |

Step 1: Google Maps Scraping

The goal was to collect all Mie Gacoan outlet locations across Indonesia from Google Maps, a publicly available data source.

✅ Approach: Playwright + StealthyFetcher

Playwright in stealth mode was used to search Google Maps for "Mie Gacoan" across 25+ regional queries (e.g., "Mie Gacoan Surabaya", "Mie Gacoan Jakarta", "Mie Gacoan Bandung"). Each query returns a scrollable list of locations. This yielded 557 unique outlet locations with coordinates, names, and place IDs.
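Regional queries may return overlapping results, so arriving at a count of unique outlets implies a merge step keyed on the place ID. The following is a minimal sketch of that step, not the project's actual script: it assumes each scraped record is a dict with a `place_id` key, and the function names, field names, and sample records are all hypothetical.

```python
from typing import Iterable


def build_queries(regions: Iterable[str], brand: str = "Mie Gacoan") -> list[str]:
    """Build one search query per region, e.g. 'Mie Gacoan Surabaya'."""
    return [f"{brand} {region}" for region in regions]


def merge_unique(result_batches: Iterable[list[dict]]) -> list[dict]:
    """Merge per-query result lists, keeping the first record seen per place ID."""
    seen: dict[str, dict] = {}
    for batch in result_batches:
        for record in batch:
            place_id = record.get("place_id")
            if place_id and place_id not in seen:
                seen[place_id] = record
    return list(seen.values())


# Hypothetical sample data: two queries that share one outlet (place_id "A2").
batch_one = [
    {"place_id": "A1", "name": "Mie Gacoan Outlet 1", "lat": -7.27, "lng": 112.75},
    {"place_id": "A2", "name": "Mie Gacoan Outlet 2", "lat": -7.30, "lng": 112.73},
]
batch_two = [
    {"place_id": "A2", "name": "Mie Gacoan Outlet 2", "lat": -7.30, "lng": 112.73},
    {"place_id": "B1", "name": "Mie Gacoan Outlet 3", "lat": -7.45, "lng": 112.72},
]

queries = build_queries(["Surabaya", "Jakarta", "Bandung"])
outlets = merge_unique([batch_one, batch_two])
print(queries[0], len(outlets))
```

Keying on the place ID rather than the outlet name avoids merging distinct outlets that happen to share a display name.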

Step 2: Detail Scraping

For each of the 557 locations, additional details (address, rating, reviews, and opening hours) were scraped from its individual Google Maps page.

Step 3: Legal Data Generation

Since internal legal compliance data isn't publicly available, realistic legal attributes (IMB status, lease, halal certification, disputes, and rent) were generated for each outlet.
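A seeded generator for these attributes might look roughly like the sketch below. The value sets, ranges, column names, and outlet IDs are hypothetical placeholders, since the project's real categories are not listed here; only the fixed seed of 42 comes from the project itself.

```python
import random

# Hypothetical value sets; the project's real categories and ranges may differ.
IMB_STATUSES = ["Valid", "In Process", "Expired"]
HALAL_STATUSES = ["Certified", "Pending"]


def generate_legal_record(outlet_id: str, rng: random.Random) -> dict:
    """Generate one synthetic legal-compliance row for an outlet."""
    lease_year = rng.randint(2025, 2030)
    lease_month = rng.randint(1, 12)
    return {
        "outlet_id": outlet_id,
        "imb_status": rng.choice(IMB_STATUSES),
        "lease_end": f"{lease_year}-{lease_month:02d}-01",
        "halal_status": rng.choice(HALAL_STATUSES),
        "open_disputes": rng.randint(0, 3),
        # Rent in IDR, in steps of 5 million (assumed range).
        "monthly_rent_idr": rng.randrange(20_000_000, 100_000_001, 5_000_000),
    }


def generate_legal_data(outlet_ids: list[str], seed: int = 42) -> list[dict]:
    """Generate rows for all outlets from a fixed seed so reruns match."""
    # Local generator: same idea as random.seed(42), without touching global state.
    rng = random.Random(seed)
    return [generate_legal_record(outlet_id, rng) for outlet_id in outlet_ids]


rows = generate_legal_data(["OUT-001", "OUT-002", "OUT-003"])
print(rows[0])
```

Because every value is drawn from the same seeded generator, rerunning the script over the same outlet list in the same order reproduces the same rows.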

Step 4: Dirty Data Injection

Realistic data quality issues were injected to simulate real-world data problems:

| Issue | Count | Example |
| --- | --- | --- |
| Duplicate rows | 5 | Exact row duplicates |
| Province typos | ~3% | "jawa timur" instead of "Jawa Timur" |
| Missing IMB status | ~2% | Empty IMB field |
| Inconsistent date formats | ~3% | DD/MM/YYYY vs YYYY-MM-DD |
| Missing rent values | ~2% | Empty rent field |
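The injection rates above can be applied row by row with a seeded generator. The sketch below is illustrative only: the column names (`province`, `imb_status`, `lease_end`, `monthly_rent_idr`) and the uniform sample rows are assumptions, and it uses only the standard library rather than the project's own scripts.

```python
import copy
import random


def inject_dirty_data(rows: list[dict], seed: int = 42, n_duplicates: int = 5) -> list[dict]:
    """Return a dirtied copy of rows: typos, blanks, mixed dates, then duplicates."""
    rng = random.Random(seed)
    dirty = copy.deepcopy(rows)  # never mutate the clean input
    for row in dirty:
        if rng.random() < 0.03:  # province typo: lowercase the name
            row["province"] = row["province"].lower()
        if rng.random() < 0.02:  # missing IMB status
            row["imb_status"] = ""
        if rng.random() < 0.03:  # YYYY-MM-DD -> DD/MM/YYYY
            year, month, day = row["lease_end"].split("-")
            row["lease_end"] = f"{day}/{month}/{year}"
        if rng.random() < 0.02:  # missing rent value
            row["monthly_rent_idr"] = ""
    # Append exact copies of randomly chosen rows as duplicates.
    duplicates = [copy.deepcopy(row) for row in rng.sample(dirty, n_duplicates)]
    return dirty + duplicates


# Hypothetical clean input: 557 rows with identical attribute values.
clean = [
    {"outlet_id": f"OUT-{i:03d}", "province": "Jawa Timur", "imb_status": "Valid",
     "lease_end": "2027-06-30", "monthly_rent_idr": 50_000_000}
    for i in range(1, 558)
]
dirty = inject_dirty_data(clean)
print(len(clean), len(dirty))
```

With 557 clean rows and 5 appended duplicates, the dirty set has 562 rows, matching the pipeline overview. The percentage-based anomalies are probabilistic, so their exact counts depend on the seed.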

Step 5: Data Cleaning

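The cleaning pipeline handles all injected anomalies: duplicate rows, province typos, missing IMB status values, inconsistent date formats, and missing rent values.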
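One way such a cleaning pass could look is sketched below. The project's Python pipeline uses Pandas; this version deliberately sticks to the standard library so it runs without dependencies. The column names, the province lookup, and the choice to flag missing values instead of imputing them are assumptions, not the project's documented rules.

```python
from datetime import datetime

# Hypothetical canonical list; a real run would cover every province in the data.
KNOWN_PROVINCES = {"jawa timur": "Jawa Timur", "jawa barat": "Jawa Barat"}


def normalize_date(value: str) -> str:
    """Convert DD/MM/YYYY to ISO YYYY-MM-DD; pass ISO dates through unchanged."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")


def clean_rows(rows: list[dict]) -> list[dict]:
    """Normalize fields, flag missing values, and drop exact duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        fixed = dict(row)
        province = fixed["province"].strip()
        fixed["province"] = KNOWN_PROVINCES.get(province.lower(), province)
        fixed["lease_end"] = normalize_date(fixed["lease_end"])
        # Assumption: missing values are flagged rather than imputed.
        fixed["imb_status"] = fixed["imb_status"] or "Unknown"
        if fixed["monthly_rent_idr"] == "":
            fixed["monthly_rent_idr"] = None
        # Rows are compared after normalization, sorted by column name.
        key = tuple(sorted(fixed.items(), key=lambda kv: kv[0]))
        if key not in seen:
            seen.add(key)
            cleaned.append(fixed)
    return cleaned


# Hypothetical dirty sample: one typo and one mixed-format date, plus a duplicate.
sample = [
    {"outlet_id": "OUT-001", "province": "jawa timur", "imb_status": "Valid",
     "lease_end": "30/06/2027", "monthly_rent_idr": 50_000_000},
    {"outlet_id": "OUT-002", "province": "Jawa Barat", "imb_status": "",
     "lease_end": "2028-01-15", "monthly_rent_idr": ""},
    {"outlet_id": "OUT-002", "province": "Jawa Barat", "imb_status": "",
     "lease_end": "2028-01-15", "monthly_rent_idr": ""},
]
cleaned = clean_rows(sample)
print(len(sample), len(cleaned))
```

Deduplicating after normalization means two rows that differ only in formatting (for example, the same date written both ways) also collapse into one.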

Tools & Stack

| Tool | Purpose |
| --- | --- |
| Python | Scraping (Playwright), data generation, cleaning (Pandas) |
| SQL | Alternative cleaning pipeline (ClickHouse-compatible) |
| Looker Studio | Dashboard visualization (interactive) |
| HTML/CSS/JS | Portfolio presentation (this website) |

📋 Reproducibility

All generation scripts use random.seed(42) for reproducible output. The scraping scripts capture a snapshot of Google Maps data at a specific point in time, so outlet counts may change as Mie Gacoan continues expanding.