Data Pipeline Overview
The project follows a 5-step pipeline: Scrape Locations → Scrape Details → Generate Legal Data → Inject Dirty Data → Clean. Each step is documented and reproducible.
| Step | Description | Input | Output |
|---|---|---|---|
| 1 | Google Maps scraping (locations) | 25+ regional search queries | 557 outlet locations (JSON) |
| 2 | Detail scraping (per outlet) | 557 Google Maps URLs | Address, rating, reviews, hours (JSON) |
| 3 | Legal data generation | Outlet master data | IMB, lease, halal, disputes, rent (CSV) |
| 4 | Dirty data injection | Clean CSV | 562 dirty rows (5 duplicates + anomalies) |
| 5 | Data cleaning & validation | Dirty CSV | 557 clean rows (validated) |
Step 1: Google Maps Scraping
The goal was to collect all Mie Gacoan outlet locations across Indonesia from Google Maps, a publicly available data source.
Approach: Playwright + StealthyFetcher
Playwright was used in stealth mode to search Google Maps for "Mie Gacoan" across 25+ regional queries (e.g., "Mie Gacoan Surabaya", "Mie Gacoan Jakarta", "Mie Gacoan Bandung"). Each query returns a scrollable list of locations, which the scraper scrolls through and parses. Merging all query results yielded 557 unique outlet locations with coordinates, names, and place IDs.
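The scraping itself requires a live browser session, but the merge step, combining 25+ regional result lists into one set keyed by place ID, can be sketched as follows (function names, the `REGIONS` list, and the record fields are illustrative, not the actual script):

```python
# Merge results from multiple regional Google Maps queries,
# deduplicating by place ID. Each scraped record is assumed to
# look like: {"place_id": "...", "name": "...", "lat": ..., "lng": ...}

REGIONS = ["Surabaya", "Jakarta", "Bandung"]  # 25+ regions in the real run

def build_queries(brand="Mie Gacoan", regions=REGIONS):
    """One Google Maps search query per region."""
    return [f"{brand} {region}" for region in regions]

def dedupe_by_place_id(result_lists):
    """Flatten per-query result lists, keeping the first record per place ID."""
    seen = {}
    for results in result_lists:
        for outlet in results:
            seen.setdefault(outlet["place_id"], outlet)
    return list(seen.values())
```

Deduplicating on place ID rather than name matters because adjacent regional queries (e.g., Jakarta and Bandung) can return overlapping outlets.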
Step 2: Detail Scraping
For each of the 557 locations, additional details were scraped from individual Google Maps pages:
- Full address: street, city, province, postal code
- Rating & reviews: average rating and review count
- Operating hours: daily schedule
- Accessibility info: wheelchair access, parking availability
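Each detail page is flattened into one record per outlet. A minimal shape for that record might look like the following (field names are assumptions for illustration, not the actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class OutletDetails:
    """One scraped Google Maps detail page, flattened to a record."""
    place_id: str
    street: str
    city: str
    province: str
    postal_code: str
    rating: float                               # average star rating, e.g. 4.6
    review_count: int
    hours: dict = field(default_factory=dict)   # day -> "HH:MM-HH:MM"
    wheelchair_access: bool = False
    parking: bool = False
```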
Step 3: Legal Data Generation
Since internal legal compliance data isn't publicly available, realistic legal attributes were generated for each outlet:
- IMB status: Approved (80%), In Progress (15%), Bermasalah ("problematic", 5%); weighted by province and outlet age
- Lease dates: start/end dates with realistic 2-5 year terms, rent amounts weighted by province (Jakarta highest)
- Halal certification: Certified (85%), In Process (10%), Pending Audit (5%)
- Dispute levels: Low (70%), Medium (20%), High (10%); weighted by urban density
- Annual rent: Rp 150M-800M range, correlated with province and location type
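The weighted draws above can be sketched with the standard library's `random.choices`. This is a simplified illustration of the generation logic (the real scripts also weight by province and outlet age; the function names and the Jakarta rent floor here are assumptions):

```python
import random

random.seed(42)  # same seed convention as the generation scripts

IMB_STATUSES = ["Approved", "In Progress", "Bermasalah"]
IMB_WEIGHTS = [0.80, 0.15, 0.05]

def generate_imb_status(rng=random):
    """Draw one IMB status with the documented 80/15/5 split."""
    return rng.choices(IMB_STATUSES, weights=IMB_WEIGHTS, k=1)[0]

def generate_annual_rent(province, rng=random):
    """Rent in rupiah within Rp 150M-800M, with a higher floor for Jakarta."""
    low = 400_000_000 if province == "DKI Jakarta" else 150_000_000
    return rng.randint(low, 800_000_000)
```

Passing an explicit `random.Random` instance keeps each attribute's draws independent of generation order, which helps reproducibility when new attributes are added.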
Step 4: Dirty Data Injection
Realistic data quality issues were injected to simulate real-world data problems:
| Issue | Count | Example |
|---|---|---|
| Duplicate rows | 5 | Exact row duplicates |
| Province typos | ~3% | "jawa timur" instead of "Jawa Timur" |
| Missing IMB status | ~2% | Empty IMB field |
| Inconsistent date formats | ~3% | DD/MM/YYYY vs YYYY-MM-DD |
| Missing rent values | ~2% | Empty rent field |
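The injection step can be sketched with pandas: append exact duplicates, then corrupt a small random sample of fields. Column names (`province`, `imb_status`) are assumptions, and only two of the five issue types are shown:

```python
import random
import pandas as pd

def inject_dirty_data(clean: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Append 5 exact duplicate rows, then corrupt a few fields."""
    rng = random.Random(seed)
    dirty = pd.concat([clean, clean.sample(5, random_state=seed)],
                      ignore_index=True)
    n = len(dirty)
    # ~3% province typos: lowercase the value ("Jawa Timur" -> "jawa timur")
    for i in rng.sample(range(n), max(1, n * 3 // 100)):
        dirty.loc[i, "province"] = str(dirty.loc[i, "province"]).lower()
    # ~2% missing IMB status
    for i in rng.sample(range(n), max(1, n * 2 // 100)):
        dirty.loc[i, "imb_status"] = None
    return dirty
```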
Step 5: Data Cleaning
The cleaning pipeline handles all injected anomalies:
- Deduplication: remove 5 exact duplicate rows
- Standardize province names: title case, fix typos
- Standardize date formats: convert all to YYYY-MM-DD
- Handle missing values: impute or flag for review
- Recalculate derived fields: days to expiry, compliance score
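The Pandas variant of this pipeline can be sketched as below. Column names (`province`, `lease_end`, `imb_status`, `annual_rent`) and the flag-instead-of-impute policy are illustrative assumptions; derived-field recalculation is omitted:

```python
import pandas as pd

def to_iso(value):
    """Parse one date string (DD/MM/YYYY or YYYY-MM-DD) to YYYY-MM-DD."""
    return pd.to_datetime(value, dayfirst=True).strftime("%Y-%m-%d")

def clean_outlets(dirty: pd.DataFrame) -> pd.DataFrame:
    """Dedupe, standardize provinces and dates, flag missing values."""
    df = dirty.drop_duplicates().copy()
    # Province names: trim whitespace, title-case ("jawa timur" -> "Jawa Timur")
    df["province"] = df["province"].str.strip().str.title()
    # Dates: parse each value independently so mixed formats coexist
    df["lease_end"] = df["lease_end"].map(to_iso)
    # Missing values: flag rows for review rather than silently imputing
    df["needs_review"] = df["imb_status"].isna() | df["annual_rent"].isna()
    return df
```

Parsing dates value-by-value with `dayfirst=True` sidesteps pandas' whole-column format inference, which can misparse a column that mixes DD/MM/YYYY and ISO strings.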
Tools & Stack
| Tool | Purpose |
|---|---|
| Python | Scraping (Playwright), data generation, cleaning (Pandas) |
| SQL | Alternative cleaning pipeline (ClickHouse-compatible) |
| Looker Studio | Dashboard visualization (interactive) |
| HTML/CSS/JS | Portfolio presentation (this website) |
Reproducibility
All generation scripts use `random.seed(42)` for reproducible output. The scraping scripts capture a snapshot of Google Maps data at a specific point in time; outlet counts may change as Mie Gacoan continues expanding.