Field Lineage Architecture

# TASK: Design Field Lineage / Merge-Split Architecture

**Priority:** HIGH (architectural foundation — gets harder to retrofit later)
**Agent:** engineer
**Filed:** 2026-02-16
**Filed by:** Doug (via Claude Code session)

---

## The Problem

Fields change over time. Not just names — their physical boundaries merge, split, and recombine across years. The system needs to handle this WITHOUT destroying or "combining" historical data.

### Real-World Scenarios

**Merge:** Fields 1, 2, 3 existed separately for years. In 2026, the farmer tears out fence lines and combines them into "Field A." Now:
- Field 1 has nitrogen maps, OM maps, yield data, spray records from 2018-2025
- Field 2 has the same from 2019-2025
- Field 3 has data from 2020-2025
- Field A needs to show ALL of that history, but also needs to show that pre-2026 it was 3 separate management zones

**Partial cropping:** Fields 1 and 3 were cropped in 2020, but Field 2 was fallow. In 2021 all three were cropped. In 2022 only Field 1 was cropped. Then in 2026 they're all combined into Field A. The data is discontinuous and inconsistent across the fields being merged.

**Split:** Field A (a big quarter section) gets split into Field B (irrigated pivot) and Field C (dryland corners) in 2024. Historical data for Field A needs to be accessible from both B and C, but each should see only the portion that falls inside its own boundary.

**Rename across years:** "T09" was called "Turner 09" in 2018, "T09 East" in 2020, and "Turner East 09" in 2022. Same physical field, different names. The audit tool already handles this with cluster naming, but the lineage model needs to carry those name variants forward.

**Re-split after merge:** Fields 1+2+3 merged into A in 2023, then A split into B+C in 2025. Full chain: 1,2,3 → A → B,C.

---

## Core Principle: NEVER Move Data, Move the Boundary

All data in FarmIQ has GPS coordinates:
- Yield points: lat/lon from combine GPS
- Soil samples: lat/lon from sampling grid
- Spray coverage: GPS-traced polygons
- Seeding data: GPS-traced lines/polygons
- Satellite imagery: raster tiles with known extent

The field tag on raw data is just a convenience label from the original source (combine console, spray controller, etc.). It should NEVER be the primary key for querying.

**The right approach:** To get "Field A's nitrogen map," do a spatial query: find all nitrogen sample points that fall within Field A's boundary polygon, and ignore whatever field tag the lab put on the sample.

This means a merge is just: "here's a new, bigger boundary. Query against it."
A split is just: "here are two new, smaller boundaries. Query against each."
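
As a rough illustration of querying by geometry rather than by field tag, the sketch below filters sample points with a plain ray-casting point-in-polygon test. The function names, the `lon`/`lat`/`field_tag` keys, and the toy coordinates are all assumptions for illustration. A boundary is modeled as a list of rings so that a multi-part field is covered by testing each part. Polygon holes are not handled, and a real implementation would more likely use a geometry library with a spatial index.

```python
def point_in_ring(pt, ring):
    """Ray-casting test: count edge crossings of a ray going east from pt."""
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through pt?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def points_in_boundary(points, boundary):
    """Keep points inside any ring of a (possibly multi-part) boundary.

    The source field tag on each point is preserved but never consulted.
    """
    return [
        p for p in points
        if any(point_in_ring((p["lon"], p["lat"]), ring) for ring in boundary)
    ]


# Hypothetical data: three lab samples and two boundaries.
samples = [
    {"lon": 0.5, "lat": 0.5, "field_tag": "Field 1", "n_ppm": 12.0},
    {"lon": 1.5, "lat": 0.5, "field_tag": "Field 2", "n_ppm": 9.5},
    {"lon": 5.0, "lat": 5.0, "field_tag": "Field 9", "n_ppm": 7.1},
]
field_1 = [[(0, 0), (1, 0), (1, 1), (0, 1)]]
field_a = [[(0, 0), (3, 0), (3, 1), (0, 1)]]  # merged, bigger boundary

print(len(points_in_boundary(samples, field_1)))  # 1
print(len(points_in_boundary(samples, field_a)))  # 2
```

The same samples answer both queries; only the boundary passed in changes.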

---

## What Needs to Be Designed

### 1. Field Lineage Data Model

The field audit tool already saves this in `progress.json`:
```json
{
  "field_lineage": {
    "12": {
      "supersedes": [3, 5, 8],
      "type": "merge",
      "effective_date": "2026-03-01",
      "note": "Combined north + south sections"
    }
  }
}
```

This needs to be formalized into a standalone lineage model that:
- Tracks merge, split, and rename events
- Has effective dates (when did this change take effect?)
- Is chainable (A supersedes 1,2,3 — then B,C supersede A)
- Stores the boundary polygon that was active during each period
- Lives in `processed_data/{customer}/field_lineage.json` (alongside field data)

**Proposed schema:**
```json
{
  "version": 1,
  "customer": "doug",
  "fields": {
    "field_a": {
      "canonical_name": "Field A",
      "current_boundary": "boundaries/field_a_2026.geojson",
      "active_from": "2026-03-01",
      "active_to": null,
      "lineage": {
        "type": "merge",
        "predecessors": ["field_1", "field_2", "field_3"],
        "effective_date": "2026-03-01"
      }
    },
    "field_1": {
      "canonical_name": "Field 1",
      "current_boundary": "boundaries/field_1_2018.geojson",
      "active_from": "2018-01-01",
      "active_to": "2026-02-28",
      "lineage": {
        "type": "superseded",
        "successor": "field_a",
        "effective_date": "2026-03-01"
      }
    }
  },
  "events": [
    {
      "date": "2026-03-01",
      "type": "merge",
      "sources": ["field_1", "field_2", "field_3"],
      "result": "field_a",
      "note": "Fences removed, combined into single management unit"
    }
  ]
}
```
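
One way to check that the `events` list is chainable is to walk it backwards from a field to everything it descends from. This is a minimal sketch, assuming a split event stores its outputs as a list in `result` (the schema above only shows a merge with a single string). The `ancestors` helper and the event dates for the 1,2,3 → A → B,C scenario are hypothetical.

```python
def ancestors(events, field_id):
    """Return every field that field_id descends from, nearest first."""
    # Map each field to the fields it directly replaced.
    parents = {}
    for ev in events:
        # Assumption: "result" is a string for a merge, a list for a split.
        results = ev["result"] if isinstance(ev["result"], list) else [ev["result"]]
        for r in results:
            parents.setdefault(r, []).extend(ev["sources"])

    found = []
    queue = list(parents.get(field_id, []))
    while queue:
        fid = queue.pop(0)
        if fid not in found:  # also guards against accidental cycles
            found.append(fid)
            queue.extend(parents.get(fid, []))
    return found


# Hypothetical events for the re-split-after-merge scenario.
events = [
    {"date": "2023-03-01", "type": "merge",
     "sources": ["field_1", "field_2", "field_3"], "result": "field_a"},
    {"date": "2025-03-01", "type": "split",
     "sources": ["field_a"], "result": ["field_b", "field_c"]},
]

print(ancestors(events, "field_b"))
# ['field_a', 'field_1', 'field_2', 'field_3']
```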

### 2. Time-Aware Boundary Resolution

Given a field ID and a date, return the correct boundary:
```python
def get_boundary(field_id, as_of_date=None):
    """Return the boundary polygon active for this field on this date.

    If field_a supersedes fields 1,2,3 as of 2026-03-01:
    - get_boundary("field_a", "2025-06-01") → None (didn't exist yet)
    - get_boundary("field_a", "2026-06-01") → field_a boundary
    - get_boundary("field_1", "2025-06-01") → field_1 boundary
    - get_boundary("field_1", "2026-06-01") → None (superseded)
    """
```
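
The date check itself can be sketched against the proposed schema. The version below is an assumption-laden illustration, not the intended implementation: it takes the parsed lineage document as an explicit argument, returns the boundary file path rather than a loaded polygon, and uses the hypothetical name `get_boundary_path`.

```python
from datetime import date
from typing import Optional


def get_boundary_path(lineage: dict, field_id: str,
                      as_of_date: Optional[str] = None) -> Optional[str]:
    """Return the boundary file for field_id if it was active on as_of_date."""
    field = lineage["fields"].get(field_id)
    if field is None:
        return None
    day = date.fromisoformat(as_of_date) if as_of_date else date.today()
    start = date.fromisoformat(field["active_from"])
    # An active_to of null/None means the field is still current.
    end = date.fromisoformat(field["active_to"]) if field["active_to"] else date.max
    return field["current_boundary"] if start <= day <= end else None


# Trimmed copy of the proposed schema, for illustration only.
lineage = {
    "fields": {
        "field_a": {
            "current_boundary": "boundaries/field_a_2026.geojson",
            "active_from": "2026-03-01",
            "active_to": None,
        },
        "field_1": {
            "current_boundary": "boundaries/field_1_2018.geojson",
            "active_from": "2018-01-01",
            "active_to": "2026-02-28",
        },
    }
}

print(get_boundary_path(lineage, "field_a", "2025-06-01"))  # None
print(get_boundary_path(lineage, "field_1", "2025-06-01"))  # boundaries/field_1_2018.geojson
```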

### 3. Spatial Data Query with Lineage Awareness

Given a field and a date range, return all relevant data:
```python
def query_field_data(field_id, data_type, start_date=None, end_date=None):
    """Query all data for a field, including predecessor data.

    For field_a (merged from 1,2,3 in 2026):
    - query_field_data("field_a", "yield", "2024", "2024")
      → Returns yield from fields 1,2,3 boundaries (they were separate then)
      → Each data point tagged with its original field for context

    - query_field_data("field_a", "yield", "2026", "2026")
      → Returns yield from field_a boundary (it's one field now)
    """
```
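
A possible shape for the lineage-aware part of this query is sketched below. For each data point it finds which management unit (the field itself or one of its predecessors) was active on the point's date, then tags the point with that unit. Several things here are assumptions for illustration: the name `query_with_lineage`, the `bbox` key (a bounding-box stand-in for a real polygon test, not part of the proposed schema), the `unit` tag, and the sample points. Requiring every point to also fall inside the queried field's own boundary is what would keep a split field from seeing its parent's unrelated area.

```python
from datetime import date


def active_units(lineage, field_id, on_day):
    """Field IDs (field_id or its predecessors) whose boundary was active on on_day."""
    field = lineage["fields"][field_id]
    start = date.fromisoformat(field["active_from"])
    end = date.fromisoformat(field["active_to"]) if field["active_to"] else date.max
    if start <= on_day <= end:
        return [field_id]
    units = []
    for pred in field.get("lineage", {}).get("predecessors", []):
        units.extend(active_units(lineage, pred, on_day))
    return units


def in_bbox(pt, bbox):
    """Stand-in for a real point-in-polygon test; bbox = (min_x, min_y, max_x, max_y)."""
    min_x, min_y, max_x, max_y = bbox
    return min_x <= pt["lon"] <= max_x and min_y <= pt["lat"] <= max_y


def query_with_lineage(lineage, points, field_id, start, end):
    """Return points in the date range that fall inside field_id's boundary,
    each tagged with the unit that was active when it was recorded."""
    own_bbox = lineage["fields"][field_id]["bbox"]
    hits = []
    for p in points:
        day = date.fromisoformat(p["date"])
        if not (start <= day <= end) or not in_bbox(p, own_bbox):
            continue
        for unit in active_units(lineage, field_id, day):
            if in_bbox(p, lineage["fields"][unit]["bbox"]):
                hits.append({**p, "unit": unit})
                break
    return hits


# Hypothetical two-field merge with toy coordinates.
lineage = {"fields": {
    "field_a": {"active_from": "2026-03-01", "active_to": None, "bbox": (0, 0, 2, 1),
                "lineage": {"type": "merge", "predecessors": ["field_1", "field_2"]}},
    "field_1": {"active_from": "2018-01-01", "active_to": "2026-02-28", "bbox": (0, 0, 1, 1),
                "lineage": {"type": "superseded", "successor": "field_a"}},
    "field_2": {"active_from": "2019-01-01", "active_to": "2026-02-28", "bbox": (1, 0, 2, 1),
                "lineage": {"type": "superseded", "successor": "field_a"}},
}}
points = [
    {"lon": 0.5, "lat": 0.5, "date": "2024-09-15", "yield": 62.0},
    {"lon": 1.5, "lat": 0.5, "date": "2024-09-16", "yield": 55.0},
    {"lon": 1.5, "lat": 0.5, "date": "2026-09-20", "yield": 60.0},
    {"lon": 9.0, "lat": 9.0, "date": "2024-09-15", "yield": 40.0},
]

hits = query_with_lineage(lineage, points, "field_a", date(2024, 1, 1), date(2024, 12, 31))
print([h["unit"] for h in hits])  # ['field_1', 'field_2']
```

Because the source points are never rewritten, the original field tags and files stay untouched; the `unit` tag exists only on the query result.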

### 4. Report Generation Awareness

Reports need to handle the discontinuity:
- "Field A Yield History" should show:
  - 2026+: single yield average for combined area
  - Pre-2026: three separate bars/lines showing fields 1, 2, 3 individually
  - Clear visual indicator of the merge event
- Nitrogen/OM maps spanning the merge: show the data but note management zone boundaries
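
To make the yield-history shape concrete, here is one hedged sketch of the data a report could be built from: averages grouped by year and management unit, plus the merge year so the chart can mark the event. The `yield_history` function, the row keys (`year`, `unit`, `yield`), and the numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def yield_history(rows, merge_year):
    """Average yield per (year, unit) and carry the merge year for the chart marker."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["year"], r["unit"])].append(r["yield"])

    series = defaultdict(dict)
    for (year, unit), values in sorted(groups.items()):
        series[year][unit] = round(mean(values), 1)
    return {"merge_year": merge_year, "series": dict(series)}


# Hypothetical rows: separate units before the merge, one unit after it.
rows = [
    {"year": 2024, "unit": "field_1", "yield": 60.0},
    {"year": 2024, "unit": "field_1", "yield": 64.0},
    {"year": 2024, "unit": "field_2", "yield": 55.0},
    {"year": 2026, "unit": "field_a", "yield": 58.0},
    {"year": 2026, "unit": "field_a", "yield": 62.0},
]

report = yield_history(rows, merge_year=2026)
print(report["series"][2024])  # {'field_1': 62.0, 'field_2': 55.0}
print(report["series"][2026])  # {'field_a': 60.0}
```

Years with more than one unit render as separate bars or lines; years with a single unit render as one combined value.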

### 5. Integration with Existing Processors

The existing processors (`isoxml_processor.py`, `yield_cleaner.py`, `process_spray_data.py`) currently tag output with field names from the source data. They don't need to change — they should keep tagging with whatever the source says. The lineage resolution happens at query time, not processing time.

---

## Research Tasks for AutoClaude

1. **Scan Doug's actual data** in `processed_data/doug/` and catalog which fields have data across which years. Find real examples of:
   - Fields that changed names across years (the audit tool already found these)
   - Fields where boundaries shifted significantly between years
   - Fields with gaps (cropped some years, not others)

2. **Review the existing boundary reconciler** (`BoundaryReconciler/boundary_reconciler.py`) — it already does IoU matching and temporal tracking. How much of its logic can feed the lineage model?

3. **Review the field naming audit** (`FieldNamingAudit/audit_engine.py`) — it already clusters boundaries spatially and tracks name variants. The cluster = the lineage chain.

4. **Prototype the lineage JSON** — generate `field_lineage.json` for Doug's fields using existing audit results + processed_data.

5. **Prototype spatial query** — write a function that, given a field name + date + data type, finds the right boundary and returns matching data points. Test with Doug's yield data.

---

## DO NOT

- Do NOT modify existing processors to tag data differently — lineage resolution is at query time
- Do NOT build a full UI — this is backend architecture first
- Do NOT try to physically merge/combine data files — the whole point is spatial re-query
- Do NOT assume fields are always simple polygons — some are MultiPolygon (pivot corners, waterway splits)
- Do NOT delete any original data or original field tags — always preserve provenance

## Output Expected

1. `FieldLineage/field_lineage.py` — core lineage model + resolution functions
2. `FieldLineage/spatial_query.py` — boundary-aware data query
3. `FieldLineage/lineage_builder.py` — build lineage from audit results
4. `processed_data/doug/field_lineage.json` — Doug's actual field lineage
5. Notes on edge cases found in real data
