On-Device Vision — ESP32-CAM (Legacy)
A crab walks into the pot. A camera takes its picture. A tiny computer — smaller than a credit card — figures out what species it is and whether it’s big enough to keep. All of this happens underwater, in the dark, with no internet connection, in about a quarter of a second.
That’s the entire system. Everything else on this page is about why it works that way and what trade-offs were made to get there.
Why On-Device?
The obvious question: why not just send the photo somewhere smarter? A cloud server with a big GPU could classify species with higher accuracy and more sophistication. Three problems kill that idea:
- There’s no internet at the bottom of the Chesapeake Bay. The submerged unit sits in 3–30 meters of saltwater. Radio at Wi-Fi and LoRa frequencies doesn’t penetrate saltwater, so Wi-Fi doesn’t exist down there. The only communication path is a wired tether to the surface buoy, which talks to the base station over LoRa, a low-bandwidth radio protocol designed for tiny packets, not photos.
- Latency matters. A crab doesn’t wait around. From the moment it triggers the motion sensor to the moment the door lock decides “keeper” or “release,” the system has a narrow window. A round-trip to the cloud, even if you could get the photo there, adds seconds to minutes of delay. The decision needs to happen now.
- Power budget. Transmitting a 96×96 image over LoRa would take ~30 seconds of radio airtime. The submerged unit runs on a small LiPo recharged via the tether, so every milliamp counts. Running inference locally draws about 150mA for 300ms, roughly 12.5 µAh of charge per classification. Transmitting the same data keeps the radio active about 100× longer, so it costs far more energy spread over far more time.
| Approach | Latency | Battery draw per classification | Connectivity required |
|---|---|---|---|
| Cloud inference | 2–30 seconds | High (radio TX dominates) | Continuous uplink |
| Edge inference (ESP32-CAM) | ~300ms | ~12.5 µAh (150mA × 300ms) | None |
The math is clear: the brain has to live in the trap.
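As a rough check on those numbers, the arithmetic can be sketched in a few lines. The 150mA, 300ms, and 96×96 figures come from the text above. The 3.3 V supply rail, the 2 bytes per pixel (raw RGB565), and the LoRa settings (SF7, 125 kHz bandwidth, coding rate 4/5) are assumptions chosen for illustration, and the airtime ignores packet overhead and duty-cycle limits.

```python
def inference_cost(current_a, duration_s, voltage_v=3.3):
    """Return (charge in microamp-hours, energy in millijoules) for one inference.

    voltage_v is an assumed supply rail, not a measured value.
    """
    charge_uah = current_a * duration_s / 3600 * 1e6
    energy_mj = current_a * voltage_v * duration_s * 1e3
    return charge_uah, energy_mj


def lora_bit_rate(spreading_factor=7, bandwidth_hz=125_000, coding_rate=4 / 5):
    """Nominal LoRa bit rate in bits/s: SF * (BW / 2**SF) * CR."""
    return spreading_factor * bandwidth_hz / 2 ** spreading_factor * coding_rate


def image_airtime_s(width=96, height=96, bytes_per_pixel=2):
    """Lower bound on the time to send one raw frame (no protocol overhead)."""
    return width * height * bytes_per_pixel * 8 / lora_bit_rate()


charge_uah, energy_mj = inference_cost(current_a=0.150, duration_s=0.300)
print(f"local inference: {charge_uah:.1f} uAh, {energy_mj:.1f} mJ")
print(f"raw frame over LoRa: at least {image_airtime_s():.1f} s")
```

Under these assumptions the raw frame needs roughly 27 seconds of airtime even at LoRa's fastest common setting, which is in line with the ~30 second estimate. Slower spreading factors would only widen the gap.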
What the Camera Sees
Human eyes are useless at depth. Below a few meters, especially in turbid estuary water, visible light is scattered and absorbed. The pot interior is pitch black.
The submerged unit uses 850nm infrared LEDs for illumination. Why IR?
- Penetrates turbidity better than visible light. Suspended sediment scatters shorter wavelengths (blue, green) more than longer ones. Over the short distances inside a pot, near-IR cuts through murk that would blind a visible-light camera.
- Doesn’t attract organisms. Visible light underwater draws plankton, baitfish, and other organisms to the lens, fouling the image. Most marine invertebrates can’t see 850nm.
- The OV2640 sensor responds to it natively. The ESP32-CAM’s image sensor is sensitive to near-IR without modification — no special filter swap needed.
The camera captures frames in RGB565 format — 16 bits per pixel, packed into 5 bits red, 6 bits green, 5 bits blue. For classification, the firmware crops and scales the frame to 96×96 pixels, unpacks it to standard 8-bit RGB, and normalizes the values using ImageNet statistics (mean and standard deviation per channel). That normalized 96×96×3 tensor is what the model actually sees.
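The unpack-and-normalize step can be sketched as follows. This is an illustrative Python version, not the firmware itself (which would run on the ESP32). It assumes each pixel has already been read into a 16-bit integer with red in the top bits, and it uses the commonly published ImageNet per-channel mean and standard deviation. The function names are hypothetical.

```python
# Commonly published ImageNet statistics for inputs scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def unpack_rgb565(pixel):
    """Split a 16-bit RGB565 value into 8-bit (r, g, b)."""
    r5 = (pixel >> 11) & 0x1F
    g6 = (pixel >> 5) & 0x3F
    b5 = pixel & 0x1F
    # Replicate the high bits into the low bits so full scale maps to 255.
    r8 = (r5 << 3) | (r5 >> 2)
    g8 = (g6 << 2) | (g6 >> 4)
    b8 = (b5 << 3) | (b5 >> 2)
    return r8, g8, b8


def normalize(rgb):
    """Scale 8-bit channels to [0, 1], then apply (x - mean) / std per channel."""
    return tuple(
        (c / 255.0 - m) / s for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


def preprocess(frame):
    """frame: rows of RGB565 integers -> nested list of shape H x W x 3 floats."""
    return [[normalize(unpack_rgb565(p)) for p in row] for row in frame]


tensor = preprocess([[0xFFFF, 0x0000], [0xF800, 0x001F]])
print(len(tensor), len(tensor[0]), len(tensor[0][0]))
```

The byte order of the camera buffer is an assumption here. If the driver delivers the two bytes of each pixel swapped, they need to be reordered before unpacking.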
How Classification Works
At the simplest level, the model is a pattern matcher. Show it thousands of labeled photos (“this is a blue crab keeper,” “this is an empty trap,” “this is a horseshoe crab”) and it learns to distinguish the patterns that separate one class from another.
Transfer Learning
Training a neural network from scratch requires millions of images and significant compute. We don’t have millions of underwater crab photos. Instead, we use transfer learning: start with a model that already knows how to recognize shapes, textures, and edges (trained on ImageNet’s 1.4 million everyday images), then teach it the specific differences between crab species.
Think of it like hiring someone who already knows how to draw and teaching them to draw crabs specifically. You don’t start from “what is a line.”
Why MobileNetV2
The model architecture is MobileNetV2, a neural network designed by Google to run on mobile phones. It is built from depthwise separable convolutions (inherited from MobileNetV1), which factor a standard convolution into a per-channel spatial filter followed by a 1×1 channel-mixing step, arranged in inverted residual blocks. For 3×3 kernels, this factoring reduces computation by roughly 8–9× compared to a standard convolution with the same receptive field.
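The 8–9× figure follows from counting multiply-accumulate operations for a single layer. The sketch below compares a standard 3×3 convolution with its depthwise separable equivalent. The channel counts and the 12×12 feature map are arbitrary example values, not measurements from the deployed model.

```python
def standard_conv_macs(k, c_in, c_out, h, w):
    """Multiply-accumulates for a standard k x k convolution."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_macs(k, c_in, c_out, h, w):
    """Depthwise k x k filter per input channel, then a 1x1 pointwise mix."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise


def reduction_factor(k, c_in, c_out, h=12, w=12):
    """How many times cheaper the separable form is than the standard one."""
    return standard_conv_macs(k, c_in, c_out, h, w) / depthwise_separable_macs(
        k, c_in, c_out, h, w
    )


for channels in (64, 128, 256):
    print(channels, round(reduction_factor(3, channels, channels), 2))
```

The ratio works out to k²·c_out / (k² + c_out), so for 3×3 kernels it approaches 9× as the number of output channels grows.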
Why this matters for a crab trap:
- ~2.2 million parameters, or about 2.2MB once quantized to one byte per weight: small enough to fit in the ESP32’s 4MB of PSRAM
- Low computational cost — inference completes in ~200–300ms on the ESP32’s 240MHz dual-core processor
- Well-studied — extensive literature on quantizing and deploying MobileNetV2 to microcontrollers
The model takes a 96×96×3 input image and outputs a probability distribution across the target classes (e.g., empty, blue crab keeper, blue crab undersized, female sook, finfish, horseshoe crab).
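Turning the model's raw outputs into a decision can be sketched as a softmax followed by an arg-max. The class names below are the examples listed above. The `min_confidence` threshold and the "uncertain" fallback are hypothetical additions for illustration, not documented firmware behavior.

```python
import math

CLASSES = [
    "empty",
    "blue crab keeper",
    "blue crab undersized",
    "female sook",
    "finfish",
    "horseshoe crab",
]


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, min_confidence=0.6):
    """Return (label, probability); fall back to 'uncertain' below the threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return "uncertain", probs[best]
    return CLASSES[best], probs[best]


label, confidence = classify([0.0, 5.0, 0.0, 0.0, 0.0, 0.0])
print(label, round(confidence, 3))
```

With a quantized model the output tensor may already hold integer scores that need to be dequantized first; that step is omitted here.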
The Quantization Trade-off
Here’s the constraint that drives the hardest engineering decision in this system: the ESP32 has no floating-point accelerator for neural network operations. It has an FPU for scalar math, but matrix multiplications, the core operation in every neural network layer, run as software loops over integer or floating-point arrays.
Standard neural networks use 32-bit floating-point weights (FP32). Quantized networks use 8-bit integers (INT8). The differences:
| Property | FP32 | INT8 |
|---|---|---|
| Model size | ~8.8 MB | ~2.2 MB |
| Memory per weight | 4 bytes | 1 byte |
| Inference speed (ESP32) | ~1.2 seconds | ~300ms |
| Accuracy (typical) | Baseline | 1–3% lower |
INT8 quantization maps the continuous range of floating-point weights into 256 discrete integer values. The mapping is calibrated using a representative dataset — the quantizer observes the actual distribution of activations during inference on real images and chooses scale factors that minimize information loss.
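The mapping can be illustrated with a minimal affine quantizer. This is a simplified sketch that assumes a single scale and zero point derived from the observed minimum and maximum. Real toolchains typically use more refined calibration, such as per-channel scales for weights, and the function names here are hypothetical.

```python
QMIN, QMAX = -128, 127  # signed 8-bit range: 256 discrete values


def calibrate(samples):
    """Choose (scale, zero_point) so the observed range maps onto [QMIN, QMAX]."""
    lo = min(min(samples), 0.0)  # keep 0.0 exactly representable
    hi = max(max(samples), 0.0)
    if hi == lo:
        return 1.0, 0
    scale = (hi - lo) / (QMAX - QMIN)
    zero_point = round(QMIN - lo / scale)
    return scale, zero_point


def quantize(x, scale, zero_point):
    """Map a float to the nearest representable INT8 value, clamping outliers."""
    q = round(x / scale) + zero_point
    return max(QMIN, min(QMAX, q))


def dequantize(q, scale, zero_point):
    """Recover the approximate float represented by an INT8 value."""
    return (q - zero_point) * scale


observed = [-1.0, 0.25, 0.5, 2.0]  # stand-in for calibration activations
scale, zero_point = calibrate(observed)
for x in observed:
    q = quantize(x, scale, zero_point)
    print(x, q, round(dequantize(q, scale, zero_point), 4))
```

Each value inside the calibrated range is recovered to within half a quantization step. Values outside it are clamped, which is why the calibration images need to resemble what the trap actually sees.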
The 1–3% accuracy drop is acceptable because the alternative is not viable: FP32 inference at 1.2 seconds is too slow for real-time catch decisions, and it consumes 4× more memory. At ~8.8 MB, the FP32 weights would not fit in a device with only 4MB to spare.
From Training to Trap
The full pipeline from raw photos to a working model on the ESP32:

Photos → Label → Train (PyTorch) → Export (ONNX/TFLite) → SD Card → ESP32-CAM

Each step produces a specific artifact:
- Photos: Raw images from underwater cameras, IR-illuminated, inside crab pots. Augmented with rotation, flipping, and brightness variation to increase effective dataset size.
- Label: Each image is tagged with a species class using annotation tools (LabelImg, CVAT). The labels define the ground truth the model learns from.
- Train: PyTorch with the timm library loads a pre-trained MobileNetV2, replaces the final classification head with one matching our class count, and fine-tunes on the labeled dataset. Output: `best_model.pth` (a PyTorch checkpoint).
- Export: The PyTorch model is converted to ONNX format, then quantized to INT8 using calibration data. If tooling permits, it is further converted to TFLite for direct use with TFLite Micro on the ESP32. Output: `species_model.tflite` or `species_model.onnx` (INT8 quantized).
- SD Card: The quantized model file is copied to the ESP32-CAM’s MicroSD card as `/species_model.tflite`.
- ESP32-CAM: On boot, the firmware loads the model from SD into PSRAM and initializes the TFLite Micro interpreter. The model is ready to classify frames as soon as the motion sensor triggers.
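The augmentations named in the Photos step (rotation, flipping, brightness variation) can be sketched on a tiny image held as nested lists of RGB tuples. This is a dependency-free illustration only. A real training pipeline would more likely use an image library, and the quarter-turn rotations and the 0.7–1.3 brightness range are assumptions.

```python
import random


def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def adjust_brightness(img, factor):
    """Scale every channel by factor, clamped to the 8-bit range."""
    return [
        [tuple(min(255, max(0, round(c * factor))) for c in px) for px in row]
        for row in img
    ]


def augment(img, rng):
    """Apply a random flip, 0-3 quarter turns, and a brightness change."""
    if rng.random() < 0.5:
        img = hflip(img)
    for _ in range(rng.randrange(4)):
        img = rotate90(img)
    return adjust_brightness(img, rng.uniform(0.7, 1.3))


sample = [[(10, 10, 10), (20, 20, 20)], [(30, 30, 30), (40, 40, 40)]]
variants = [augment(sample, random.Random(seed)) for seed in range(4)]
print(len(variants), len(variants[0]), len(variants[0][0]))
```

Because the model input is square (96×96), quarter-turn rotations keep the tensor shape unchanged.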
What Can Go Wrong
Honest accounting of current limitations:
- No size estimation. The model is a classifier: it tells you what species, not how big. Distinguishing legal-size from undersized relies on training separate classes for size ranges, which requires careful labeling. True size measurement would need a bounding box detector or stereo vision, neither of which fits in the current compute budget.
- Calibration drift. If deployment conditions change significantly from training data (different water clarity, different bait that changes the background color, different pot geometry), accuracy drops. The system needs periodic retraining with fresh field data.
- IR illumination variation. LED aging, biofouling on the lens window, and battery voltage sag all change the illumination intensity over a deployment. The model is trained with brightness augmentation to tolerate some variation, but extreme cases (heavy fouling) will degrade classification.
- Single-frame classification. The current system classifies one frame per motion event. A crab partially occluded by mesh wire or another crab may be misclassified. Future versions may aggregate multiple frames for higher confidence.
- Training data scarcity. The model is only as good as its training data. Early prototypes will have small datasets and correspondingly lower accuracy. Accuracy improves as field deployments feed real catch images back into the training pipeline.
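One way the multi-frame aggregation mentioned above could work is to average the per-frame probability vectors before picking a class. This is a hypothetical sketch of that idea, not a description of the current firmware, which classifies a single frame.

```python
def aggregate(frame_probs):
    """Average per-frame class probabilities; return (class index, mean probability)."""
    if not frame_probs:
        raise ValueError("need at least one frame")
    n = len(frame_probs)
    num_classes = len(frame_probs[0])
    mean = [sum(p[i] for p in frame_probs) / n for i in range(num_classes)]
    best = max(range(num_classes), key=mean.__getitem__)
    return best, mean[best]


# Three frames of the same event; the first is partially occluded and
# leans toward the wrong class, the later two disagree with it.
frames = [
    [0.6, 0.4],
    [0.2, 0.8],
    [0.3, 0.7],
]
index, confidence = aggregate(frames)
print(index, round(confidence, 3))
```

Averaging lets two clear frames outvote one occluded frame, at the cost of running inference several times per motion event.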
What the Next Generation Looks Like
Several of the limitations listed above (no size estimation, single-frame classification), along with the ESP32-CAM’s roughly 80mA idle monitoring draw, are architectural constraints of the ESP32-S, not fundamental limits of edge vision. The ESP32-P4, with 768KB SRAM, MIPI CSI camera support, and a low-power monitoring core, enables bounding-box object detection (YOLO-class models), which makes size measurement possible via the pinhole camera model. See Platform Decision: ESP32-CAM to ESP32-P4 for the full hardware comparison and transition plan.
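The pinhole camera model relates a bounding box's width in pixels to a physical width, given the distance to the object and the focal length expressed in pixels. The sketch below assumes the distance is known (for example, fixed by pot geometry) and ignores lens distortion and refraction at the housing window. All parameter values are hypothetical.

```python
def focal_length_px(focal_length_mm, pixel_pitch_mm):
    """Convert a lens focal length to pixel units for a given sensor pixel pitch."""
    return focal_length_mm / pixel_pitch_mm


def estimate_width_mm(bbox_width_px, distance_mm, focal_px):
    """Pinhole model: real width = pixel width * distance / focal length (pixels)."""
    if focal_px <= 0:
        raise ValueError("focal length must be positive")
    return bbox_width_px * distance_mm / focal_px


def is_keeper(width_mm, legal_minimum_mm):
    """Compare the estimate against a caller-supplied legal minimum size."""
    return width_mm >= legal_minimum_mm


# Hypothetical numbers: a 240 px wide box, 300 mm from the lens, 600 px focal length.
width = estimate_width_mm(bbox_width_px=240, distance_mm=300, focal_px=600)
print(width)
```

The estimate is only as good as the distance assumption: a crab sitting closer to the lens than assumed would be measured as larger than it is.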
Further Reading
- Hands-on training guide: Train the Species Model, a step-by-step pipeline from dataset preparation through deployment
- Hardware specs: Submerged Unit Reference — ESP32-CAM module, OV2640 sensor, IR LED array specifications
- System context: System Overview — where the vision system fits in the full SmartPot architecture
- The bigger picture: Why SmartPot? — the industry problems that drive these design decisions