On-Device Vision
A crab walks into the pot. A camera takes its picture. A tiny computer figures out what species it is, draws a bounding box around it, and measures how big it is — all underwater, in the dark, with no internet connection, in a fraction of a second.
That’s the entire system. Everything else on this page is about why it works that way and what trade-offs were made to get there.
Why On-Device?
The obvious question: why not just send the photo somewhere smarter? A cloud server with a big GPU could detect species with higher accuracy and more sophistication. Three problems kill that idea:
- There’s no internet at the bottom of the Chesapeake Bay. The submerged unit sits in 3–30 meters of saltwater. Radio at Wi-Fi and LoRa frequencies barely penetrates saltwater, so Wi-Fi doesn’t exist down there. The only communication path is a wired tether to the surface buoy, which talks to the base station over LoRa, a low-bandwidth radio protocol designed for tiny packets, not photos.
- Latency matters. A crab doesn’t wait around. From the moment it triggers the motion sensor to the moment the door lock decides “keeper” or “release,” the system has a narrow window. A round-trip to the cloud, even if you could get the photo there, adds seconds to minutes of delay. The decision needs to happen now.
- Power budget. Transmitting a full image over LoRa would take minutes and drain the battery significantly. Running inference locally on the ESP32-P4 draws about 400 mA for a fraction of a second. Transmitting the same image costs far more energy spread over far more time.
| Approach | Latency | Power per detection | Connectivity required |
|---|---|---|---|
| Cloud inference | 2–30 seconds | High (radio TX dominates) | Continuous uplink |
| Edge inference (ESP32-P4) | <200ms (estimated) | ~60 mJ | None |
The math is clear: the brain has to live in the trap.
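The comparison can be checked with back-of-envelope arithmetic. The sketch below is only an illustration: the supply voltage, compressed image size, LoRa throughput, and transmit current are assumed placeholder values, and only the ~60 mJ inference figure comes from the table above.

```python
# Back-of-envelope comparison of local inference vs. sending the image by radio.
# Every constant marked "assumed" is an illustrative placeholder, not a measurement.

def energy_mj(current_ma: float, voltage_v: float, seconds: float) -> float:
    """Energy in millijoules: mA * V gives mW, and mW * s gives mJ."""
    return current_ma * voltage_v * seconds


def airtime_s(payload_bytes: int, bitrate_bps: float) -> float:
    """Ideal time on air, ignoring packet headers, retries, and duty-cycle limits."""
    return payload_bytes * 8 / bitrate_bps


if __name__ == "__main__":
    SUPPLY_V = 3.3         # assumed supply rail
    IMAGE_BYTES = 20_000   # assumed size of one compressed frame
    LORA_BPS = 1_000       # assumed effective LoRa throughput
    TX_CURRENT_MA = 120    # assumed radio transmit current
    INFERENCE_MJ = 60      # per-detection figure from the comparison table

    tx_seconds = airtime_s(IMAGE_BYTES, LORA_BPS)
    tx_mj = energy_mj(TX_CURRENT_MA, SUPPLY_V, tx_seconds)
    print(f"radio: {tx_seconds:.0f} s on air, {tx_mj / 1000:.1f} J")
    print(f"local: {INFERENCE_MJ} mJ, roughly {tx_mj / INFERENCE_MJ:.0f}x less energy")
```

Swapping in measured values for the assumed constants changes the ratio, but under any plausible LoRa throughput the radio path stays far more expensive than local inference.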
What the Camera Sees
Human eyes are useless at depth. Below a few meters, especially in turbid estuary water, visible light is scattered and absorbed. The pot interior is pitch black.
The submerged unit uses 850nm infrared LEDs for illumination. Why IR?
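- Penetrates turbidity better than visible light. Suspended sediment scatters shorter wavelengths (blue, green) more than longer ones, so near-IR cuts through murk that would blind a visible-light camera. Water itself absorbs near-IR strongly, so this advantage only holds over short ranges such as the inside of a pot.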
- Doesn’t attract organisms. Visible light underwater draws phytoplankton, baitfish, and other organisms to the lens, fouling the image. Most marine invertebrates can’t see 850nm.
- The OV5647 NoIR sensor responds to it natively. The “NoIR” variant ships without an IR-cut filter, so 850nm light reaches the sensor instead of being blocked.
The camera captures frames via MIPI CSI at up to 5MP, but inference runs on a 320×320 center crop. The ESP32-P4’s on-chip hardware ISP handles debayering, white balance, and noise reduction before the frame reaches the detection model — these operations run in dedicated hardware, not on the CPU.
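As a rough illustration of the center crop, the sketch below computes the crop rectangle for a full-resolution OV5647 frame (2592×1944) and applies it to an image stored as a list of rows. The function names are hypothetical, and whether the firmware crops at full resolution or downsamples first is an assumption the source does not settle.

```python
def center_crop_box(frame_w, frame_h, crop_w=320, crop_h=320):
    """Return (left, top, right, bottom) of a centered crop; right/bottom are exclusive."""
    if crop_w > frame_w or crop_h > frame_h:
        raise ValueError("crop is larger than the frame")
    left = (frame_w - crop_w) // 2
    top = (frame_h - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


def crop(frame, box):
    """Crop a frame stored as a list of pixel rows."""
    left, top, right, bottom = box
    return [row[left:right] for row in frame[top:bottom]]


if __name__ == "__main__":
    # Full-resolution OV5647 frame size.
    print(center_crop_box(2592, 1944))
```

Any bounding box the detector reports is relative to the crop, so the `left` and `top` offsets need to be added back before mapping a detection onto the full frame.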
How Detection Works
The model is a YOLO-class object detector. Unlike a classifier (which answers “what species?”), a detector answers “where is it, what is it, and how big is it?” in a single pass.
Bounding Boxes Enable Size Measurement
This is the key capability. A classifier outputs a single label for the whole image. A detector outputs bounding boxes: pixel coordinates of rectangles around each detected animal. Those pixel coordinates, combined with known physical geometry, give you actual size.
The pinhole camera model makes this work:
`real_size = (pixel_size × distance_to_subject) / focal_length`

Here pixel_size is the object’s extent in the image in pixels and focal_length is the lens focal length expressed in pixels, so real_size comes out in the same unit as distance_to_subject. In a crab pot, the distance-to-subject variable is constrained: the crab is standing on the bottom panel at a known distance from the camera lens. A one-time calibration procedure (imaging a reference target at known distances) establishes the focal length and distortion parameters. After that, every bounding box translates directly to a physical measurement.
This eliminates the need for size-split training classes entirely. Instead of training separate “keeper” and “undersized” classes (which requires enormous labeled datasets and doesn’t generalize across crab postures), the system detects “blue crab” and measures it directly. Compare the measured size against the regulatory minimum for your jurisdiction — done.
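A minimal sketch of that measurement step follows. The function names, the use of the longer bounding-box side as the crab's width, and every number in the example (camera distance, focal length in pixels, legal minimum) are assumptions chosen for illustration; real values come from calibration and from local regulations.

```python
def real_size_mm(pixel_size_px, distance_mm, focal_length_px):
    """Pinhole model: real size = pixel size * distance / focal length (in pixels)."""
    return pixel_size_px * distance_mm / focal_length_px


def is_keeper(bbox, distance_mm, focal_length_px, min_size_mm):
    """Compare the measured size of one detection against a legal minimum.

    bbox is (x_min, y_min, x_max, y_max) in pixels. Using the longer side as the
    crab's width is a simplifying assumption; a diagonally oriented crab would
    need a rotated box or keypoints.
    """
    x_min, y_min, x_max, y_max = bbox
    extent_px = max(x_max - x_min, y_max - y_min)
    return real_size_mm(extent_px, distance_mm, focal_length_px) >= min_size_mm


if __name__ == "__main__":
    bbox = (100, 120, 220, 200)  # hypothetical detection, 120 px wide
    size = real_size_mm(120, distance_mm=300, focal_length_px=280)
    print(f"measured width: {size:.1f} mm")
    print("keeper" if is_keeper(bbox, 300, 280, min_size_mm=127) else "release")
```

The focal length must be expressed in the pixel units of the image the detector actually sees, so any resizing between calibration and inference has to be applied to it as well.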
Transfer Learning
Training a neural network from scratch requires millions of images and significant compute. We don’t have millions of underwater crab photos. Instead, we use transfer learning: start with a model that already knows how to recognize shapes, textures, and edges, then teach it the specific differences between crab species.
Think of it like hiring someone who already knows how to draw and teaching them to draw crabs specifically. You don’t start from “what is a line.”
Framework: ESP-DL
The inference engine is Espressif’s ESP-DL, purpose-built for the ESP32-P4’s memory architecture.
The key advantage over TFLite Micro (used on the legacy ESP32-CAM) is tiered memory awareness. The ESP32-P4 has two tiers of memory: 768KB of fast SRAM and 32MB of OPI PSRAM at 200MHz. ESP-DL places latency-sensitive activation tensors in SRAM and bulk weight storage in PSRAM, scheduling data movement to minimize stalls. On the dual-core RISC-V CPU, it pipelines inference across both cores — one computing layer N while the other prefetches weights for layer N+1.
| Aspect | TFLite Micro (ESP32-CAM) | ESP-DL (ESP32-P4) |
|---|---|---|
| Memory allocation | Flat arena (one tier) | Tiered SRAM/PSRAM placement |
| Multi-core | Single-core only | Dual-core pipeline scheduling |
| Hardware accel | Generic C kernels | P4 vector instructions + PPA |
| Model format | .tflite (FlatBuffer) | ESP-DL format (via ESP-PPQ quantizer) |
The training pipeline stays the same — data collection, labeling, and PyTorch training are identical. Only the export stage changes:
`PyTorch → ONNX → ESP-PPQ (quantized) → ESP-DL format → SD card`

The Quantization Trade-off
Standard neural networks use 32-bit floating-point weights (FP32). Quantized networks use 8-bit integers (INT8).
| Property | FP32 | INT8 |
|---|---|---|
| Model size | ~8–20 MB | ~2–5 MB |
| Memory per weight | 4 bytes | 1 byte |
| Inference speed | Baseline | 2–4× faster |
| Accuracy | Baseline | 1–3% lower (typical) |
INT8 quantization maps continuous floating-point values onto 256 discrete integer levels. ESP-PPQ (Espressif’s quantization tool, built on the open-source PPQ project) calibrates these mappings using representative images from the target environment. The output format is optimized for the P4’s memory controller.
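To make the mapping concrete, here is a generic textbook sketch of symmetric per-tensor INT8 quantization. It is not ESP-PPQ's actual algorithm, and the function names are hypothetical; it only shows where the rounding error behind the accuracy loss comes from.

```python
def quantize_int8(values):
    """Map a non-empty list of floats to integers in [-128, 127] with one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, -0.42, 0.0, 0.25, 1.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(q, f"scale={scale:.6f}", f"worst error={worst:.6f}")
```

Each restored value differs from the original by at most half a quantization step. Calibration on representative images matters because it determines the value range the scale is derived from, and a poorly chosen range wastes some of the 256 levels.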
From Training to Trap
The full pipeline from raw photos to a working model on the ESP32-P4:

`Photos → Label → Train (PyTorch) → Export (ONNX/ESP-PPQ) → SD Card → ESP32-P4`

Each step produces a specific artifact:
- Photos: Raw images from underwater cameras, IR-illuminated, inside crab pots. Augmented with rotation, flipping, brightness variation.
- Label: Each image annotated with bounding boxes and species labels using CVAT or LabelImg.
- Train: PyTorch trains a YOLO-class detection model via transfer learning. Output: `best_model.pth`.
- Export: The model is converted to ONNX, then quantized via ESP-PPQ for the P4’s memory layout. Output: `species_detect.espdl`.
- SD Card: The quantized model and `calibration.bin` are copied to the MicroSD card.
- ESP32-P4: On boot, the firmware loads the model into PSRAM and calibration data into SRAM. Detection runs whenever the LP-Core detects motion.
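The augmentation mentioned in the Photos step has one detail that is easy to get wrong with detection data: geometric transforms must be applied to the bounding boxes as well as the pixels. The sketch below is a hypothetical, dependency-free illustration using grayscale images stored as lists of rows. A real pipeline would more likely use an augmentation library.

```python
def flip_horizontal(image, boxes):
    """Mirror an image left-to-right and move its bounding boxes to match.

    Boxes are (x_min, y_min, x_max, y_max) in pixel-edge coordinates,
    with x_max and y_max exclusive.
    """
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    flipped_boxes = [
        (width - x_max, y_min, width - x_min, y_max)
        for (x_min, y_min, x_max, y_max) in boxes
    ]
    return flipped, flipped_boxes


def scale_brightness(image, factor):
    """Scale 8-bit pixel values and clamp to 0..255; boxes are unaffected."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


if __name__ == "__main__":
    image = [[0, 10, 20, 30], [40, 50, 60, 70]]
    boxes = [(0, 0, 1, 2)]  # one box covering the leftmost column
    print(flip_horizontal(image, boxes))
    print(scale_brightness(image, 1.5))
```

Brightness changes leave the labels untouched, which makes them a cheap way to mimic varying illumination without re-annotating anything.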
What Can Go Wrong
Honest accounting of current limitations:
- Calibration drift. If deployment conditions change significantly from training data (different water clarity, different bait, which changes background color, or different pot geometry), accuracy drops. The system needs periodic retraining with fresh field data.
- IR illumination variation. LED aging, biofouling on the lens window, and battery voltage sag all change the illumination intensity over a deployment. The model is trained with brightness augmentation to tolerate some variation, but extreme cases (heavy fouling) will degrade detection.
- Single-frame detection. The current system runs detection on one frame per motion event. A crab partially occluded by mesh wire or another crab may be missed or mis-measured. Future versions may aggregate multiple frames for higher confidence.
- Training data scarcity. The model is only as good as its training data. Early prototypes will have small datasets and correspondingly lower accuracy. Accuracy improves as field deployments feed real catch images back into the training pipeline.
- MIPI CSI flex cable reliability. The OV5647 NoIR connects via a flex cable, not a rigid PCB like the ESP32-CAM’s integrated sensor. In a waterproof enclosure on a fishing boat, flex cable connections need careful strain relief and potting at the connector junction. See Waterproof Enclosure for protection guidelines.
- ESP-DL maturity. ESP-DL is actively developed but younger than TFLite Micro. Expect occasional undocumented behavior and the need to file Espressif GitHub issues during early development.
- Datasheet v0.5 unknowns. The ESP32-P4 datasheet is preliminary. Deep sleep current, some peripheral specifications, and LP-Core power draw are not finalized. These must be measured on production hardware.
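The multi-frame aggregation mentioned under single-frame detection is not part of the current system. One possible shape for it, sketched here purely as an assumption, is to take the median of per-frame size measurements so that a single occluded or mis-measured frame cannot swing the decision. The function name and the `min_frames` threshold are hypothetical.

```python
from statistics import median


def aggregate_measurements(sizes_mm, min_frames=3):
    """Combine per-frame size measurements into one robust estimate.

    sizes_mm holds one entry per frame, with None where nothing was detected.
    Returns None when too few frames produced a detection to trust the result.
    """
    valid = [s for s in sizes_mm if s is not None]
    if len(valid) < min_frames:
        return None
    return median(valid)


if __name__ == "__main__":
    # Hypothetical burst: one missed frame and one partly occluded frame (90 mm).
    print(aggregate_measurements([128.0, None, 131.0, 90.0, 129.5]))
    print(aggregate_measurements([128.0, None]))
```

A median tolerates outliers better than a mean, at the cost of capturing and processing several frames per motion event, which feeds back into the latency and power budget discussed earlier.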
Further Reading
- Legacy vision system: On-Device Vision — ESP32-CAM (archived) — the original classification-only architecture
- Hands-on training guide: Train the Species Model — step-by-step pipeline from dataset preparation through deployment
- Hardware specs: Submerged Unit Reference — ESP32-P4 module, OV5647 NoIR sensor, IR LED array specifications
- Platform comparison: Platform Decision — why we chose the ESP32-P4 over the ESP32-CAM
- System context: System Overview — where the vision system fits in the full SmartPot architecture