# TASK: Run Field Audit on Doug's GDrive Historical Data
**Priority:** MEDIUM (great test dataset, not customer-blocking)
**Agent:** data_pipeline
**Filed:** 2026-02-16
**Filed by:** Doug (via Claude Code session)
---
## The Goal
Doug has 20+ years of farming data in Google Drive, already mounted on the Pi via rclone at `/data/clients/`. Run the field naming audit engine against this data to:
1. Catalog every boundary file across 20+ years
2. Spatially cluster fields and track name evolution over time
3. Find naming inconsistencies, merges, splits
4. Produce the most comprehensive field audit report we've ever generated
This is the ultimate stress test for the audit engine AND will produce real insights Doug can use.
---
## What to Run
```bash
python FieldNamingAudit/field_naming_audit.py \
  --scan "/data/clients/Dan Weist" "/data/clients/Doug Weist" \
  --output FieldNamingAudit/output/weist_historical_audit.json \
  --batch
```
**Note:** The rclone mount is at `/data/clients/`. Both "Dan Weist" and "Doug Weist" directories contain historical data (Dan is Doug's father — same operation).
---
## Expected Challenges
1. **File format variety:** Over 20 years there will be shapefiles, GeoJSON, KML, ISOXML ZIPs, maybe even old MapInfo .TAB files. The audit engine should handle what it can and skip what it can't.
2. **Volume:** Potentially thousands of files. The scan might take a while. Run with `--batch` to suppress any GUI prompts.
3. **Name evolution:** The same physical field will have been called different things across decades. This is exactly what the audit engine is designed to find.
4. **Multiple operations:** Dan Weist (father) and Doug Weist (son) may have overlapping but different field sets. The audit should handle this — cluster by geography, not by folder.
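The "cluster by geography, not by folder" idea in challenge 4 can be sketched roughly like this. This is a minimal illustration, not the audit engine's actual implementation; the 150 m threshold, function names, and coordinates are all assumptions:

```python
import math

def centroid_distance_m(a, b):
    """Approximate distance in meters between two (lat, lon) centroids."""
    lat = math.radians((a[0] + b[0]) / 2)
    dy = (a[0] - b[0]) * 111_320                  # ~meters per degree latitude
    dx = (a[1] - b[1]) * 111_320 * math.cos(lat)  # longitude shrinks with latitude
    return math.hypot(dx, dy)

def cluster_fields(records, threshold_m=150):
    """Greedy clustering: a record joins the first cluster whose first
    centroid is within threshold_m, regardless of which folder
    (Dan's or Doug's) the boundary file came from."""
    clusters = []
    for rec in records:
        for cl in clusters:
            if centroid_distance_m(rec["centroid"], cl[0]["centroid"]) < threshold_m:
                cl.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

# Same physical field named differently across folders/years (hypothetical coords):
records = [
    {"name": "North 80", "centroid": (41.5001, -98.2002)},
    {"name": "N80",      "centroid": (41.5003, -98.2001)},
    {"name": "Home Qtr", "centroid": (41.6200, -98.3500)},
]
clusters = cluster_fields(records)
# "North 80" and "N80" land in one cluster; "Home Qtr" stands alone
```

The real engine presumably clusters on full boundary geometry rather than centroids, but the folder-agnostic grouping is the point.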
---
## After the Audit
1. Commit the output files to GitHub:
- `FieldNamingAudit/output/weist_historical_audit.json`
- `FieldNamingAudit/output/weist_historical_audit.md` (if the engine generates a markdown summary)
2. Create a summary of findings:
- How many unique physical fields were found?
- How many naming variants per field on average?
- Any obvious merges/splits detected?
- Which fields have the most complete data coverage (most years)?
3. Post a task summary to the dashboard
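The summary questions above can be answered mechanically once the audit JSON exists. A sketch, assuming a hypothetical schema that maps cluster IDs to lists of `{"name", "year"}` sightings; adjust to the engine's real output format:

```python
import json

def summarize(audit):
    """Compute the headline numbers for the findings summary."""
    variants = {cid: len({s["name"] for s in sightings})
                for cid, sightings in audit.items()}
    coverage = {cid: len({s["year"] for s in sightings})
                for cid, sightings in audit.items()}
    n = len(audit)
    return {
        "unique_fields": n,
        "avg_variants_per_field": sum(variants.values()) / n if n else 0.0,
        "best_coverage_cluster": max(coverage, key=coverage.get) if n else None,
    }

# Hypothetical data shaped like the assumed schema:
audit = {
    "cluster_001": [{"name": "North 80", "year": 2004},
                    {"name": "N80", "year": 2012},
                    {"name": "N80", "year": 2020}],
    "cluster_002": [{"name": "Home Qtr", "year": 2015}],
}
print(json.dumps(summarize(audit), indent=2))
```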
---
## Prerequisites
- rclone mount must be active at `/data/clients/`
- If mount is down, remount: `rclone mount gdrive:Clients /data/clients/ --daemon --vfs-cache-mode writes`
- The `gdrive_scan_daemon.py` daemon should already be cataloging files; check its output first to see what's available
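Before launching the scan, it's worth verifying the mount is actually serving files, since a stale FUSE mount can look present but hang or return nothing. A minimal pre-flight check (the expected directory names come from the scan command above; everything else is a sketch):

```python
import os
import sys

MOUNT = "/data/clients"
EXPECTED = ["Dan Weist", "Doug Weist"]  # directories the scan targets

def mount_ready(mount=MOUNT, expected=EXPECTED):
    """True if the path is a live mount and the expected client
    directories are visible through it."""
    if not os.path.ismount(mount):
        return False
    try:
        entries = set(os.listdir(mount))
    except OSError:  # a stale or hung rclone mount raises on listdir
        return False
    return all(d in entries for d in expected)

if __name__ == "__main__":
    if not mount_ready():
        sys.exit("rclone mount not ready -- remount before running the audit")
```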
---
## DO NOT
- Do NOT modify any files in `/data/clients/` — read only
- Do NOT run this during other heavy processing — it could take significant time
- Do NOT upload results to the web UI — this is internal analysis, not customer-facing