Yield Cleaner Hardening

# TASK: CNH Yield Cleaner Hardening — Moderate & Minor Fixes

**Priority:** HIGH
**Agent:** engineer
**Filed:** 2026-02-16
**Filed by:** Claude Code session (processor audit)

---

## Goal

Harden the CNH yield cleaner (`CNH/yield_cleaner.py`) for reliable batch processing across all crop types and combine configurations. Critical fixes (column validation, per-crop conversion, timestamp safety) are already done. This task covers moderate and minor improvements.

## ALREADY FIXED (do not redo)
- CSV pre-flight column validation (`load_data()` checks for required columns)
- Per-crop conversion factor support (`Config.CROP_CONVERSIONS` dict + `--crop` CLI flag)
- Timestamp parsing uses `errors='coerce'` followed by `dropna()`
- Warning for all-zero speed data

## Moderate Issues to Fix

### 1. Flow delay NaN tracking
When `_apply_flow_delay()` shifts points, some edge points end up with NaN coordinates. Tag these rows in the `removal_reason` column as `'flow_delay_edge'` instead of dropping them silently. Also verify that rows with NaN lat/lon after the flow delay are counted in the removal stats.
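
A minimal sketch of the tagging step, assuming the points live in a pandas DataFrame with `lat` and `lon` columns and that a missing `removal_reason` means the row is still kept. The helper name `tag_flow_delay_edges` and the sample values are hypothetical, not taken from `yield_cleaner.py`.

```python
import pandas as pd


def tag_flow_delay_edges(df: pd.DataFrame) -> pd.DataFrame:
    """Mark rows whose coordinates became NaN after the flow-delay shift.

    Hypothetical helper: assumes 'lat'/'lon' columns and a 'removal_reason'
    column in which a missing value means "still kept".
    """
    out = df.copy()
    if "removal_reason" not in out.columns:
        out["removal_reason"] = pd.Series(
            [None] * len(out), index=out.index, dtype="object"
        )
    edge = out["lat"].isna() | out["lon"].isna()
    # Only tag rows that no earlier step has already flagged.
    untagged = out["removal_reason"].isna()
    out.loc[edge & untagged, "removal_reason"] = "flow_delay_edge"
    return out


# Made-up sample: two of the three rows lost a coordinate.
points = pd.DataFrame(
    {
        "lat": [41.10, 41.11, float("nan")],
        "lon": [-93.50, float("nan"), -93.52],
        "dry_yield": [180.0, 175.0, 190.0],
    }
)
tagged = tag_flow_delay_edges(points)
removal_counts = tagged["removal_reason"].value_counts().to_dict()
```

Because the rows are tagged rather than dropped, they show up in `removal_counts` alongside the other removal reasons.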

### 2. Multi-combine normalization robustness
The `_normalize_combines()` function assumes that a `task` column exists and contains non-null values. Some CSVs (single-combine fields) may not have this column. Add a guard: if the `task` column is missing, entirely null, or holds a single value, skip normalization.
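
The guard could look roughly like this. It is a sketch that assumes the data is a pandas DataFrame; `should_normalize_combines` is a hypothetical name, and the real `_normalize_combines()` may be organized differently.

```python
import pandas as pd


def should_normalize_combines(df: pd.DataFrame, column: str = "task") -> bool:
    """Return True only when the column identifies at least two combines."""
    if column not in df.columns:
        return False
    # nunique() ignores missing values by default, so an all-null column
    # counts as zero distinct combines.
    return bool(df[column].nunique() >= 2)


no_task = pd.DataFrame({"dry_yield": [180.0, 175.0]})
one_task = pd.DataFrame({"task": ["A", "A"], "dry_yield": [180.0, 175.0]})
null_task = pd.DataFrame({"task": [None, None], "dry_yield": [180.0, 175.0]})
two_tasks = pd.DataFrame({"task": ["A", "B"], "dry_yield": [180.0, 175.0]})
```

Calling the check at the top of the normalization step and returning the input unchanged when it is `False` keeps single-combine files on the same code path as everything else.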

### 3. Pass detection with GPS gaps
`pass3_combine_artifacts()` detects passes from time gaps. If GPS drops out briefly (common near tree lines and buildings), the gap creates a false pass boundary. Consider also checking the distance between consecutive points: a 10 s gap with less than 5 m of movement is not a real pass break.
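
One way to combine the two conditions is sketched below in plain Python. The function names, the `(timestamp_s, lat, lon)` tuple layout, and the use of haversine distance are assumptions for illustration. The 10 s and 5 m thresholds come from the example above and are not calibrated values.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def find_pass_breaks(points, max_gap_s=10.0, min_move_m=5.0):
    """Return the indices at which a new pass starts.

    points: list of (timestamp_s, lat, lon) tuples sorted by time.
    A break requires BOTH a long time gap and real movement, so a short
    GPS dropout while the combine barely moves is not counted.
    """
    breaks = []
    for i in range(1, len(points)):
        t0, lat0, lon0 = points[i - 1]
        t1, lat1, lon1 = points[i]
        gap_s = t1 - t0
        moved_m = haversine_m(lat0, lon0, lat1, lon1)
        if gap_s >= max_gap_s and moved_m >= min_move_m:
            breaks.append(i)
    return breaks


# Made-up track: an 11 s dropout with about 1 m of movement, then a
# 27 s gap with about 108 m of movement.
track = [
    (0, 41.00000, -93.0),
    (1, 41.00001, -93.0),
    (12, 41.00002, -93.0),
    (13, 41.00003, -93.0),
    (40, 41.00100, -93.0),
]
pass_breaks = find_pass_breaks(track)
```

Only the second gap is reported as a pass boundary; the dropout at index 2 stays inside the first pass.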

### 4. Boundary polygon validation
`load_boundary()` doesn't validate the polygon. If the GeoJSON contains an invalid geometry (self-intersecting or empty), the boundary clip in pass1 will crash. Add a `polygon.is_valid` check and attempt a `polygon.buffer(0)` repair for invalid geometries.
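
A minimal sketch of that check, assuming the boundary is a Shapely geometry. `repair_boundary` is a hypothetical helper name, and the bowtie coordinates are a made-up test shape.

```python
from shapely.geometry import Polygon


def repair_boundary(polygon):
    """Return a usable boundary geometry or raise ValueError.

    Hypothetical helper: the real load_boundary() may be organized differently.
    """
    if polygon.is_empty:
        raise ValueError("Boundary polygon is empty")
    if polygon.is_valid:
        return polygon
    # buffer(0) is a common repair for self-intersecting rings.
    fixed = polygon.buffer(0)
    if fixed.is_empty or not fixed.is_valid:
        raise ValueError("Boundary polygon is invalid and could not be repaired")
    return fixed


# A bowtie ring that touches itself at (1, 1) is invalid as a single polygon.
bowtie = Polygon([(0, 0), (0, 2), (1, 1), (2, 2), (2, 0), (1, 1), (0, 0)])
repaired = repair_boundary(bowtie)
```

Note that `buffer(0)` can return a `MultiPolygon` and, for some self-intersections, a smaller area than the original outline, so it is probably worth logging a warning whenever the repair path is taken.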

### 5. Report PDF generation resilience
If matplotlib fails during PDF generation (missing fonts, display issues), the whole run crashes. Wrap PDF generation in try-except and still save the cleaned CSV/TIF even if the PDF fails.
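
The ordering and the try-except could be sketched as follows. The three callables are hypothetical stand-ins for whatever the cleaner uses to write each artifact, and the logger name is an assumption.

```python
import logging

logger = logging.getLogger("yield_cleaner")


def write_outputs(save_csv, save_tif, save_pdf):
    """Write the data outputs first, then attempt the PDF report.

    The arguments are hypothetical zero-argument callables.
    Returns True if the PDF was written, False if it was skipped.
    """
    save_csv()
    save_tif()
    try:
        save_pdf()
    except Exception:
        # Log the traceback but keep the run alive: CSV/TIF are already saved.
        logger.exception("PDF report failed; cleaned CSV/TIF were still saved")
        return False
    return True


written = []


def failing_pdf():
    raise RuntimeError("font not found")


pdf_ok = write_outputs(
    lambda: written.append("csv"),
    lambda: written.append("tif"),
    failing_pdf,
)
```

Writing the CSV and TIF before touching matplotlib means a PDF failure cannot lose data. For display-related failures on a headless machine, selecting the non-interactive backend with `matplotlib.use("Agg")` before importing `pyplot` may also help.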

## Minor Issues

### 6. Column name normalization
Different CNH firmware versions may use different column names (e.g., `Dry Yield` vs `dry_yield`, `Latitude` vs `lat`). Add a column name normalization step in `load_data()`.
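
A sketch of one possible normalization step. The alias table is illustrative only and would need to be confirmed against real exports from each firmware version. The function names are hypothetical.

```python
import re

# Illustrative aliases only; confirm against actual firmware exports.
COLUMN_ALIASES = {
    "latitude": "lat",
    "longitude": "lon",
}


def normalize_column_name(name: str) -> str:
    """Lowercase, collapse non-alphanumerics to '_', then apply aliases."""
    key = re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")
    return COLUMN_ALIASES.get(key, key)


def normalize_columns(names):
    """Normalize a list of column names, refusing ambiguous results."""
    normalized = [normalize_column_name(n) for n in names]
    dupes = {n for n in normalized if normalized.count(n) > 1}
    if dupes:
        raise ValueError(f"Columns collide after normalization: {sorted(dupes)}")
    return normalized


columns = normalize_columns(["Dry Yield", "Latitude", " Longitude ", "Speed (mph)"])
```

With pandas, the result can be applied through `df.rename(columns=dict(zip(df.columns, normalize_columns(df.columns))))`. Running this before the existing pre-flight column validation lets that validation check the normalized names.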

### 7. Memory usage on large fields
For fields with 500K+ points, the IQR + local spatial outlier detection builds a full cKDTree. This can use significant memory on the Pi. Consider chunk-based processing for very large datasets.
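
One possible chunking scheme is to split the field into square tiles with an overlap margin, build a tree per tile, and keep results only for the points that belong to that tile. The sketch below shows just the tiling, in plain Python. It assumes coordinates are already in projected metres, and `iter_tiles`, `tile_size`, and `halo` are hypothetical names with illustrative values.

```python
from collections import defaultdict


def iter_tiles(points, tile_size, halo):
    """Yield (core_indices, context_indices) for each occupied square tile.

    points: list of (x, y) in projected metres.
    core: points inside the tile. context: core plus points within `halo`
    of the tile, so neighbour searches with radius <= halo stay complete
    near tile borders.
    """
    if halo > tile_size:
        raise ValueError("halo must not exceed tile_size")
    cells = defaultdict(list)
    for i, (x, y) in enumerate(points):
        cells[(int(x // tile_size), int(y // tile_size))].append(i)

    for (cx, cy), core in sorted(cells.items()):
        x0, y0 = cx * tile_size - halo, cy * tile_size - halo
        x1, y1 = (cx + 1) * tile_size + halo, (cy + 1) * tile_size + halo
        # Only the 3x3 block of neighbouring tiles can fall inside the halo.
        context = [
            j
            for nx in (cx - 1, cx, cx + 1)
            for ny in (cy - 1, cy, cy + 1)
            for j in cells.get((nx, ny), [])
            if x0 <= points[j][0] < x1 and y0 <= points[j][1] < y1
        ]
        yield core, sorted(context)


sample = [(10.0, 10.0), (95.0, 10.0), (105.0, 10.0), (250.0, 10.0)]
tiles = list(iter_tiles(sample, tile_size=100.0, halo=20.0))
```

Each point is a core point in exactly one tile, so per-tile outlier flags can be written back without double counting. Peak memory then scales with the largest tile's context rather than the whole field, at the cost of rebuilding a tree per tile.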

### 8. Grain cart calibration factor output
When cleaning finishes, output the implied calibration factor: `total_cleaned_volume / grain_cart_volume`. This helps Doug calibrate future runs. Just print it, and only if grain cart data is available nearby.
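
As a small worked sketch of that ratio: the function name is hypothetical, the volumes are made-up numbers, and both values are assumed to be in the same unit.

```python
def implied_calibration_factor(total_cleaned_volume, grain_cart_volume):
    """Return total_cleaned_volume / grain_cart_volume, or None if unavailable.

    Both volumes must be in the same unit (for example, bushels).
    """
    if grain_cart_volume is None or grain_cart_volume <= 0:
        return None
    return total_cleaned_volume / grain_cart_volume


# Made-up example values, not real field data.
factor = implied_calibration_factor(10_500.0, 10_000.0)
if factor is not None:
    print(f"Implied calibration factor: {factor:.3f}")
```

Returning `None` when no grain cart volume is available lets the caller skip the print without special-casing a missing file.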

### 9. CSV output encoding
Ensure CSV output uses UTF-8 encoding consistently. Some field names may have special characters.
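
A minimal round-trip sketch using the standard library `csv` module. The field names and values are made-up examples, and the output path is a temporary file rather than the cleaner's real output location.

```python
import csv
import os
import tempfile

# Made-up rows with non-ASCII field names.
rows = [
    {"field": "Nordström 40", "dry_yield": 182.5},
    {"field": "Peña East", "dry_yield": 176.0},
]

path = os.path.join(tempfile.mkdtemp(), "cleaned.csv")

# An explicit encoding avoids depending on the platform default;
# newline="" is what the csv module expects for files it writes.
with open(path, "w", encoding="utf-8", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=["field", "dry_yield"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back with the same encoding to confirm nothing was mangled.
with open(path, encoding="utf-8", newline="") as fh:
    round_trip = list(csv.DictReader(fh))
```

If the cleaner writes through pandas instead, passing `encoding="utf-8"` to `DataFrame.to_csv` makes the same choice explicit at every call site.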

## DO NOT
- Do NOT change the 5-pass cleaning algorithm structure
- Do NOT modify the already-fixed critical issues
- Do NOT change the Config class defaults without calibration data
- Do NOT add new output formats — hardening only
