On-Device Vision — ESP32-CAM (Legacy)

A crab walks into the pot. A camera takes its picture. A tiny computer — smaller than a credit card — figures out what species it is and whether it’s big enough to keep. All of this happens underwater, in the dark, with no internet connection, in about a quarter of a second.

That’s the entire system. Everything else on this page is about why it works that way and what trade-offs were made to get there.

Why On-Device?

The obvious question: why not just send the photo somewhere smarter? A cloud server with a big GPU could classify species with higher accuracy and more sophistication. Three problems kill that idea:

There’s no internet at the bottom of the Chesapeake Bay. The submerged unit sits in 3–30 meters of saltwater. Radio at Wi-Fi and LoRa frequencies is absorbed within centimeters of seawater, so nothing wireless reaches the trap. The only communication path is a wired tether to the surface buoy, which talks to the base station over LoRa, a low-bandwidth radio protocol designed for tiny packets, not photos.
Latency matters. A crab doesn’t wait around. From the moment it triggers the motion sensor to the moment the door lock decides “keeper” or “release,” the system has a narrow window. A round-trip to the cloud — even if you could get the photo there — adds seconds to minutes of delay. The decision needs to happen now.
Power budget. Transmitting a 96×96 image over LoRa would take ~30 seconds and drain the battery significantly. The submerged unit runs on a small LiPo recharged via the tether, so every milliamp counts. Running inference locally draws about 150 mA for 300 ms, roughly 12.5 µAh of charge. Transmitting the same data costs far more energy because the radio stays active roughly 100 times longer.

Approach	Latency	Energy cost per classification	Connectivity required
Cloud inference	2–30 seconds	High (radio TX dominates)	Continuous uplink
Edge inference (ESP32-CAM)	~300 ms	~12.5 µAh (150 mA × 300 ms)	None

The math is clear: the brain has to live in the trap.
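
The figures above can be checked with a few lines of arithmetic. This is a rough sketch, not a measured power profile: the 3.3 V supply rail and the 5 kbps LoRa bit rate are assumptions that do not come from this page, and the airtime estimate ignores packet overhead and duty-cycle limits.

```python
# Back-of-envelope check of the latency and power comparison.
SUPPLY_VOLTAGE_V = 3.3     # ASSUMPTION: ESP32 supply rail, not stated in the text
LORA_BITRATE_BPS = 5_000   # ASSUMPTION: order of magnitude for a fast LoRa setting
BYTES_PER_PIXEL = 2        # RGB565 is 16 bits per pixel


def inference_charge_uah(current_ma, duration_s):
    """Charge drawn by one inference, in microamp-hours."""
    return current_ma * duration_s / 3600.0 * 1000.0


def inference_energy_mj(current_ma, duration_s, voltage_v=SUPPLY_VOLTAGE_V):
    """Energy of one inference in millijoules (mA * V * s = mJ)."""
    return current_ma * voltage_v * duration_s


def image_airtime_s(width, height, bitrate_bps=LORA_BITRATE_BPS):
    """Raw airtime for one uncompressed frame, ignoring protocol overhead."""
    return width * height * BYTES_PER_PIXEL * 8 / bitrate_bps


print(f"charge per inference: {inference_charge_uah(150, 0.3):.1f} uAh")
print(f"energy per inference: {inference_energy_mj(150, 0.3):.1f} mJ")
print(f"airtime for 96x96 frame: {image_airtime_s(96, 96):.1f} s")
```

Under these assumptions the uncompressed 96×96 frame needs close to 30 seconds of airtime, which matches the estimate given earlier.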

What the Camera Sees

Human eyes are useless at depth. Below a few meters, especially in turbid estuary water, visible light is scattered and absorbed. The pot interior is pitch black.

The submerged unit uses 850nm infrared LEDs for illumination. Why IR?

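Scatters less in turbid water than visible light. Suspended sediment scatters shorter wavelengths (blue, green) more than longer ones, so near-IR is less affected by murk that would blind a visible-light camera. Water itself absorbs near-IR strongly, which limits illumination to short distances, but the inside of a pot is well within that range.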
Doesn’t attract organisms. Visible light underwater draws zooplankton, baitfish, and other organisms to the lens, fouling the image. Most marine invertebrates can’t see 850nm.
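The OV2640 sensor responds to it natively. The ESP32-CAM’s silicon image sensor is sensitive to near-IR without modification. One caveat: many stock OV2640 lens assemblies include an IR-cut filter, so the specific module should be checked and, if necessary, fitted with a filterless lens.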

The camera captures frames in RGB565 format — 16 bits per pixel, packed into 5 bits red, 6 bits green, 5 bits blue. For classification, the firmware crops and scales the frame to 96×96 pixels, unpacks it to standard 8-bit RGB, and normalizes the values using ImageNet statistics (mean and standard deviation per channel). That normalized 96×96×3 tensor is what the model actually sees.
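
The unpack-and-normalize step can be sketched per pixel as follows. This is an illustration in Python rather than the firmware code; the function names are hypothetical, the bit-replication method for expanding 5 and 6 bits to 8 is an assumption about how the firmware widens channels, and the mean and standard deviation values are the commonly published ImageNet statistics.

```python
# Per-pixel sketch: RGB565 -> 8-bit RGB -> ImageNet-normalized floats.
IMAGENET_MEAN = (0.485, 0.456, 0.406)  # standard published per-channel means
IMAGENET_STD = (0.229, 0.224, 0.225)   # standard published per-channel stds


def rgb565_to_rgb888(pixel):
    """Unpack one 16-bit RGB565 value into an (r, g, b) tuple of 8-bit ints."""
    r5 = (pixel >> 11) & 0x1F
    g6 = (pixel >> 5) & 0x3F
    b5 = pixel & 0x1F
    # ASSUMPTION: replicate the high bits into the low bits so that full
    # scale (0x1F or 0x3F) maps to 255 rather than 248 or 252.
    r8 = (r5 << 3) | (r5 >> 2)
    g8 = (g6 << 2) | (g6 >> 4)
    b8 = (b5 << 3) | (b5 >> 2)
    return r8, g8, b8


def normalize(rgb):
    """Scale 8-bit channels to [0, 1], then subtract mean and divide by std."""
    return tuple(
        (c / 255.0 - m) / s
        for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


white = rgb565_to_rgb888(0xFFFF)
print(white, normalize(white))
```

The sketch works on 16-bit integer values, so it sidesteps the byte order of the camera's frame buffer, which would need to be confirmed for the actual driver.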

How Classification Works

At the simplest level, the model is a pattern matcher. Show it thousands of labeled photos — “this is a blue crab keeper,” “this is an empty trap,” “this is a horseshoe crab” — and it learns to distinguish the patterns that separate one class from another.

Transfer Learning

Training a neural network from scratch requires millions of images and significant compute. We don’t have millions of underwater crab photos. Instead, we use transfer learning: start with a model that already knows how to recognize shapes, textures, and edges (trained on ImageNet’s 1.4 million everyday images), then teach it the specific differences between crab species.

Think of it like hiring someone who already knows how to draw and teaching them to draw crabs specifically. You don’t start from “what is a line.”

Why MobileNetV2

The model architecture is MobileNetV2, a neural network designed by Google to run on mobile phones. It is built from depthwise separable convolutions (introduced in MobileNetV1), which factor a standard convolution into a per-channel spatial filter followed by a 1×1 pointwise convolution, and it adds inverted residual blocks with linear bottlenecks. For 3×3 kernels, the factorization reduces computation by roughly 8–9× compared to a standard convolution with the same receptive field.
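
The 8–9× figure follows from counting multiply-accumulate operations. The sketch below compares the two layer types for a 3×3 kernel; the channel counts and the 12×12 feature map are arbitrary illustrative values, not the dimensions of any particular layer in the deployed model.

```python
# Multiply-accumulate (MAC) counts: standard vs. depthwise separable conv.
def standard_conv_macs(k, c_in, c_out, h, w):
    """Every output channel applies a k x k filter across all input channels."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_macs(k, c_in, c_out, h, w):
    """One k x k filter per input channel, then a 1 x 1 conv to mix channels."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise


def savings_ratio(k, c_in, c_out, h=12, w=12):
    """How many times cheaper the separable form is; equals 1/(1/c_out + 1/k^2)."""
    return (standard_conv_macs(k, c_in, c_out, h, w)
            / depthwise_separable_macs(k, c_in, c_out, h, w))


for c_out in (64, 128, 256):
    print(c_out, round(savings_ratio(3, 32, c_out), 2))
```

The ratio approaches 9 (the kernel area) as the number of output channels grows, which is why the savings are quoted as a range.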

Why this matters for a crab trap:

~2.2 million parameters — small enough to fit in the ESP32’s 4MB of PSRAM
Low computational cost — inference completes in ~200–300ms on the ESP32’s 240MHz dual-core processor
Well-studied — extensive literature on quantizing and deploying MobileNetV2 to microcontrollers

The model takes a 96×96×3 input image and outputs a probability distribution across the target classes (e.g., empty, blue crab keeper, blue crab undersized, female sook, finfish, horseshoe crab).
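
Turning the model's raw output into a label can be sketched as a softmax followed by an argmax. The class list is taken from the examples above, while the order of the classes and the sample scores are hypothetical. On the device, the INT8 output would first have to be converted back to floating point.

```python
import math

# Class names from the text; the index order here is an ASSUMPTION.
CLASSES = (
    "empty",
    "blue crab keeper",
    "blue crab undersized",
    "female sook",
    "finfish",
    "horseshoe crab",
)


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]


# Hypothetical raw scores for one frame.
label, confidence = classify([0.1, 3.0, 1.2, 0.3, -1.0, -0.5])
print(label, round(confidence, 3))
```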

The Quantization Trade-off

Here’s the constraint that drives the hardest engineering decision in this system: the ESP32 has no floating-point accelerator for neural network operations. It has an FPU for scalar math, but matrix multiplications — the core operation in every neural network layer — run as software loops over integer or floating-point arrays.

Standard neural networks use 32-bit floating-point weights (FP32). Quantized networks use 8-bit integers (INT8). The differences:

Property	FP32	INT8
Model size	~8.8 MB	~2.2 MB
Memory per weight	4 bytes	1 byte
Inference speed (ESP32)	~1.2 seconds	~300ms
Accuracy (typical)	Baseline	1–3% lower

INT8 quantization maps the continuous range of floating-point weights into 256 discrete integer values. The mapping is calibrated using a representative dataset — the quantizer observes the actual distribution of activations during inference on real images and chooses scale factors that minimize information loss.
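
The mapping can be illustrated with the affine (scale and zero-point) scheme widely used for INT8 models. This is a simplified sketch that derives the parameters from a plain min/max range; real quantizers choose ranges from calibration statistics and often work per channel, and the function names here are hypothetical.

```python
# Affine INT8 quantization: real_value ~= (q - zero_point) * scale.
QMIN, QMAX = -128, 127


def quant_params(lo, hi):
    """Choose scale and zero point so that [lo, hi] maps onto [QMIN, QMAX]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain zero
    scale = (hi - lo) / (QMAX - QMIN)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero range
    zero_point = int(round(QMIN - lo / scale))
    zero_point = max(QMIN, min(QMAX, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point):
    """Map a float to the nearest representable 8-bit integer."""
    q = int(round(x / scale)) + zero_point
    return max(QMIN, min(QMAX, q))


def dequantize(q, scale, zero_point):
    """Recover the approximate float value."""
    return (q - zero_point) * scale


scale, zp = quant_params(-1.0, 3.0)
for x in (-1.0, 0.0, 0.5, 3.0):
    q = quantize(x, scale, zp)
    print(x, q, round(dequantize(q, scale, zp), 4))
```

Forcing the range to include zero means that 0.0 is represented exactly, which matters for zero padding and ReLU outputs.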

The 1–3% accuracy drop is acceptable because the alternative is not viable: FP32 inference takes 1.2 seconds, too slow for real-time catch decisions, and its ~8.8 MB of weights would not fit in the 4MB of PSRAM at all.
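
The memory side of the trade-off is simple arithmetic on the figures already given. The sketch below counts weight storage only; activation buffers and the interpreter's working memory also need space, so the real margin is smaller than shown.

```python
# Weight storage for ~2.2 million parameters against 4 MB of PSRAM.
PARAMS = 2_200_000
PSRAM_BYTES = 4 * 1024 * 1024


def weight_bytes(params, bytes_per_weight):
    """Storage needed for the weights alone."""
    return params * bytes_per_weight


def fits_in_psram(params, bytes_per_weight, budget=PSRAM_BYTES):
    """True if the weights alone fit within the memory budget."""
    return weight_bytes(params, bytes_per_weight) <= budget


for name, size in (("FP32", 4), ("INT8", 1)):
    mb = weight_bytes(PARAMS, size) / 1e6
    print(name, f"{mb:.1f} MB", "fits" if fits_in_psram(PARAMS, size) else "does not fit")
```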

From Training to Trap

The full pipeline from raw photos to a working model on the ESP32:

Photos → Label → Train (PyTorch) → Export (ONNX/TFLite) → SD Card → ESP32-CAM

Each step produces a specific artifact:

Photos — Raw images from underwater cameras, IR-illuminated, inside crab pots. Augmented with rotation, flipping, brightness variation to increase effective dataset size.
Label — Each image tagged with a species class using annotation tools (LabelImg, CVAT). The labels define the ground truth the model learns from.
Train — PyTorch with the timm library loads a pre-trained MobileNetV2, replaces the final classification head with one matching our class count, and fine-tunes on the labeled dataset. Output: best_model.pth (a PyTorch checkpoint).
Export — The PyTorch model is converted to ONNX format, then quantized to INT8 using calibration data. If tooling permits, further converted to TFLite for direct use with TFLite Micro on the ESP32. Output: species_model.tflite or species_model.onnx (INT8 quantized).
SD Card — The quantized model file is copied to the ESP32-CAM’s MicroSD card as /species_model.tflite.
ESP32-CAM — On boot, the firmware loads the model from SD into PSRAM and initializes the TFLite Micro interpreter. The model is ready to classify frames as soon as the motion sensor triggers.
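
The augmentations named in the Photos step (rotation, flipping, brightness variation) can be illustrated on a tiny grayscale image stored as nested lists. This is a dependency-free sketch for illustration only; a real training pipeline would most likely use library transforms, and the helper names and the specific brightness factors are hypothetical.

```python
# Toy augmentations on a grayscale image represented as a list of rows.
def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def adjust_brightness(img, factor):
    """Scale pixel values by factor, clamped to the 0-255 range."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in img]


def augment(img):
    """Return the original plus a few variants (illustrative choices)."""
    return [
        img,
        hflip(img),
        rotate90(img),
        adjust_brightness(img, 0.7),  # ASSUMPTION: dimmer LEDs or fouling
        adjust_brightness(img, 1.3),  # ASSUMPTION: brighter illumination
    ]


sample = [[10, 20], [30, 40]]
for variant in augment(sample):
    print(variant)
```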

What Can Go Wrong

Honest accounting of current limitations:

No size estimation. The model is a classifier — it tells you what species, not how big. Distinguishing legal-size from undersized relies on training separate classes for size ranges, which requires careful labeling. True size measurement would need a bounding box detector or stereo vision, neither of which fits in the current compute budget.
Calibration drift. If deployment conditions change significantly from training data — different water clarity, different bait (which changes background color), different pot geometry — accuracy drops. The system needs periodic retraining with fresh field data.
IR illumination variation. LED aging, biofouling on the lens window, and battery voltage sag all change the illumination intensity over a deployment. The model is trained with brightness augmentation to tolerate some variation, but extreme cases (heavy fouling) will degrade classification.
Single-frame classification. The current system classifies one frame per motion event. A crab partially occluded by mesh wire or another crab may be misclassified. Future versions may aggregate multiple frames for higher confidence.
Training data scarcity. The model is only as good as its training data. Early prototypes will have small datasets and correspondingly lower accuracy. Accuracy improves as field deployments feed real catch images back into the training pipeline.
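
One way the multi-frame aggregation mentioned above could work is to average the per-class probabilities over several frames of the same motion event and pick the winner from the average. This is a sketch of one possible approach, not the planned implementation; the function name and the sample probabilities are hypothetical.

```python
# Average per-class probabilities across frames, then take the best class.
def aggregate_frames(frame_probs):
    """Return (class_index, mean_probability) over a list of probability vectors."""
    if not frame_probs:
        raise ValueError("need at least one frame")
    n_frames = len(frame_probs)
    n_classes = len(frame_probs[0])
    mean = [
        sum(frame[i] for frame in frame_probs) / n_frames
        for i in range(n_classes)
    ]
    best = max(range(n_classes), key=mean.__getitem__)
    return best, mean[best]


# Hypothetical two-class example: the first frame is partly occluded and
# votes for class 0, but the other two frames outweigh it.
frames = [[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
print(aggregate_frames(frames))
```

Averaging lets a single occluded frame be outvoted, at the cost of more inference time and energy per motion event.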

What the Next Generation Looks Like

Several of the limitations listed above (no size estimation, single-frame classification), along with the board's 80mA current draw while idle and monitoring for motion, are architectural constraints of the ESP32-S, not fundamental limits of edge vision. The ESP32-P4, with 768KB SRAM, MIPI CSI camera support, and a low-power monitoring core, enables bounding-box object detection (YOLO-class models) that directly solves size measurement via the pinhole camera model. See Platform Decision: ESP32-CAM to ESP32-P4 for the full hardware comparison and transition plan.
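
The pinhole camera model relates an object's size in pixels to its real size through the focal length and the distance to the object. A minimal sketch follows, assuming the distance is known from the pot geometry and the focal length has been calibrated in pixels. The numbers are hypothetical, and refraction at an underwater housing window would change the effective focal length, so calibration would need to be done in water.

```python
# Pinhole model: real_size = pixel_size * distance / focal_length_in_pixels.
def object_size_m(bbox_px, distance_m, focal_px):
    """Estimate real-world size in meters from a bounding-box dimension."""
    if focal_px <= 0:
        raise ValueError("focal length must be positive")
    return bbox_px * distance_m / focal_px


# Hypothetical values: a 200 px wide box, 0.3 m away, 400 px focal length.
width_m = object_size_m(bbox_px=200, distance_m=0.3, focal_px=400)
print(f"estimated width: {width_m * 100:.1f} cm")
```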