Train the Species Model

This guide walks you through training, quantizing, and deploying a species detection model to the ESP32-P4 in the submerged unit. By the end, you’ll have a working model on an SD card that the trap firmware can load and run.

Overview

Here’s what you’ll do in this guide:

Collect underwater training images under IR illumination
Label images with bounding boxes by species
Train a YOLO-class detection model via transfer learning (PyTorch)
Export to INT8-quantized ESP-DL format for the ESP32-P4
Deploy the model to the ESP32-P4’s SD card

All tooling lives in ml-pipeline/ — a Python project managed with uv that wraps PyTorch and timm.

Data Collection Strategy

Before training begins, you need a dataset that actually represents what the camera will see at depth. Stock photos of crabs on a white background will not cut it — the model needs to learn what a crab looks like through wire mesh, in murky water, under IR light.

Underwater Photography Setup

The training images should match deployment conditions as closely as possible:

Primary illumination: 850nm IR LED array (same as the submerged unit). This is what the camera will actually see in the field.
Backup illumination: White LED panel for reference shots and ground truth labeling. Useful during data collection, but don’t rely on it for training — the deployed system runs IR only.
Camera housing: Any waterproof housing that can mount inside or adjacent to a crab pot. Match the camera angle and field of view to the OV5647 NoIR lens as closely as you can.

Image Sources

You don’t need to collect every image yourself. Combine sources to build a diverse dataset faster:

Personal trap deployments — the best source. Mount a camera in your test pots and pull the SD card after soak periods. This gives you real-world conditions that no other source can match.
State fisheries departments — many publish survey photos from trawl and pot surveys. Contact your state’s marine resources division directly; they’re often willing to share data for conservation-aligned projects.
iNaturalist marine observations — searchable by species and location. Quality varies, but useful for building out species diversity. Filter for “research grade” observations.
NOAA fisheries survey databases — the NOAA Fisheries Photo Gallery and regional science center publications contain species reference images.

Dataset Size

Stage	Images per class	Expected accuracy
Initial prototype	300-500	~80-85% (enough to validate the pipeline works)
Field-ready	1,000+	~90-95% (production threshold)
Mature system	3,000+	95%+ (with continued field data feedback)

Start small and iterate. A prototype model trained on 300 images per class is enough to prove the hardware and firmware work end-to-end. Accuracy improves as you feed real field captures back into the training set.

Image Diversity

Each class needs images across these axes of variation:

Angle: Top-down, side profile, angled — crabs don’t pose for the camera
Lighting: Full IR, partial shadow, backlit by ambient light leaking into the pot
Occlusion: Partially hidden behind mesh wires, stacked on top of other crabs, half-buried in bait
Size range: Cover the full span from clearly undersized to clearly legal, including the sizes in between
Water clarity: Clear, silty, algae-tinged — turbidity changes the apparent contrast and color cast

Negative Examples

These are just as important as positives. The model needs to learn what is not a catch event:

Empty trap interior (different lighting conditions, angles)
Debris: seaweed, shells, rope fragments, bait remnants
Non-target species you expect to encounter in your fishery
Murky water with no visible objects (the camera will see this often)

Labeling

Use CVAT for a browser-based collaborative workflow or LabelImg for a lightweight desktop tool. Both support bounding box annotation.

Label consistently: draw tight bounding boxes around each animal. Species label only — size is measured from the box, not from the label.
Have a second person verify a random sample of labels. Labeling errors propagate directly into model errors.
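One way to pick the verification sample is to draw it with a fixed seed so both reviewers see the same files. The sketch below is only an illustration: the labels/ directory, the .xml extension, and the 10% fraction are assumptions, not part of ml-pipeline/.

```python
import random
from pathlib import Path


def pick_review_sample(label_files, fraction=0.1, seed=0):
    """Return a reproducible random subset of label files for a second reviewer."""
    # Sort first so the same seed always yields the same sample,
    # regardless of the order the filesystem returns the files in.
    files = sorted(str(p) for p in label_files)
    if not files:
        return []
    count = min(len(files), max(1, round(len(files) * fraction)))
    return sorted(random.Random(seed).sample(files, count))


if __name__ == "__main__":
    # Hypothetical layout: one annotation file per image under labels/.
    for path in pick_review_sample(Path("labels").glob("*.xml"), fraction=0.1):
        print(path)
```

If the reviewer disagrees with more than a handful of boxes in the sample, it is probably worth re-checking the whole batch before training on it.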

Data Augmentation

Beyond the automatic augmentations applied during training (listed under Dataset Preparation), consider generating augmented copies of your raw dataset:

Rotation: 0-360 degrees (crabs have no preferred orientation in a pot)
Horizontal and vertical flip
Brightness and contrast variation: +/- 30% to simulate changing ambient light
Synthetic blur: Gaussian blur at varying radii to simulate water turbidity and camera focus drift
Color channel shifts: Simulate the spectral differences between IR illumination sources

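As a rough rule of thumb, a 500-image raw dataset with aggressive augmentation can behave more like a 2,000-3,000 image dataset during training. Augmented copies add variation, not new information, so keep feeding real field captures back into the set.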
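The ranges listed above can be turned into a reproducible augmentation plan before any pixels are touched. This is a minimal sketch that only samples parameters; applying them is left to whichever image library you prefer. The blur radius range and the per-channel gain range are assumptions, since the guide does not give numbers for them, and the function names are hypothetical.

```python
import random


def sample_augmentation(rng):
    """Draw one set of augmentation parameters from the ranges in this section."""
    return {
        "rotation_deg": rng.uniform(0.0, 360.0),  # no preferred orientation in a pot
        "hflip": rng.random() < 0.5,
        "vflip": rng.random() < 0.5,
        "brightness": rng.uniform(0.7, 1.3),      # +/- 30%
        "contrast": rng.uniform(0.7, 1.3),        # +/- 30%
        "blur_radius_px": rng.uniform(0.0, 2.0),  # assumed range
        "channel_gain": tuple(rng.uniform(0.9, 1.1) for _ in range(3)),  # assumed range
    }


def plan_augmented_copies(image_names, copies_per_image=4, seed=0):
    """Return (image name, copy index, parameters) for every augmented copy."""
    rng = random.Random(seed)
    return [
        (name, index, sample_augmentation(rng))
        for name in image_names
        for index in range(copies_per_image)
    ]


if __name__ == "__main__":
    for name, index, params in plan_augmented_copies(["crab_0001.jpg"], copies_per_image=2):
        print(name, index, params)
```

With 500 raw images and four copies each, the plan yields 2,000 augmented images on top of the originals. Keeping augmented copies of an image in the same split as the original avoids leaking near-duplicates into validation.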

Training Environment

The ml-pipeline/ directory contains the complete training pipeline. Install with uv:

cd ml-pipeline
uv sync --extra dev --extra export-p4

This installs PyTorch, timm, and ESP-DL export dependencies. GPU acceleration is optional — the model is small enough to train on CPU in minutes for prototype datasets.

Dataset Preparation

Image Requirements

Parameter	Value
Resolution	320x320 px (model input size)
Format	RGB JPEG
Illumination	850nm IR (matches deployment)
Background	Inside a crab pot (wire mesh, sediment)
Minimum per class	200 images (500+ recommended)
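Source photos rarely arrive at 320x320. The guide does not say how the pipeline resizes them, so the following is only one plausible approach: letterboxing, which scales the image to fit and pads the remainder so the aspect ratio is preserved. The function names are hypothetical.

```python
def letterbox_params(width, height, target=320):
    """Scale and padding that fit a width x height image inside a target square."""
    scale = target / max(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x, pad_y = (target - new_w) // 2, (target - new_h) // 2
    return scale, new_w, new_h, pad_x, pad_y


def map_box(box, scale, pad_x, pad_y):
    """Map an (x_min, y_min, x_max, y_max) box from source pixels to letterboxed pixels."""
    x0, y0, x1, y1 = box
    return (
        x0 * scale + pad_x,
        y0 * scale + pad_y,
        x1 * scale + pad_x,
        y1 * scale + pad_y,
    )


if __name__ == "__main__":
    scale, new_w, new_h, pad_x, pad_y = letterbox_params(640, 480)
    print(scale, new_w, new_h, pad_x, pad_y)  # 0.5 320 240 0 40
    print(map_box((100, 100, 300, 200), scale, pad_x, pad_y))  # (50.0, 90.0, 150.0, 140.0)
```

Whatever resize method is used, bounding boxes have to go through the same transform as the pixels, otherwise size measurements taken from the boxes will be off.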

Target Classes

The detection model uses simplified classes — no size-split variants needed since size is measured directly from bounding boxes:

dataset/
├── blue_crab/             # All blue crabs regardless of size (index 0)
├── finfish/               # Any non-crab species (index 1)
└── horseshoe_crab/        # (index 2)
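Before training, it can help to confirm that every class folder meets the per-class minimum. This sketch assumes the three folders shown above and JPEG files directly inside each one; the helper names are hypothetical and not part of ml-pipeline/.

```python
from pathlib import Path

CLASSES = ["blue_crab", "finfish", "horseshoe_crab"]  # index order from the layout above
IMAGE_SUFFIXES = {".jpg", ".jpeg"}


def count_images(dataset_dir):
    """Count JPEG files in each class folder under dataset_dir."""
    root = Path(dataset_dir)
    counts = {}
    for name in CLASSES:
        class_dir = root / name
        if not class_dir.is_dir():
            counts[name] = 0  # a missing folder counts as zero images
            continue
        counts[name] = sum(
            1 for p in class_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


def classes_below_minimum(counts, minimum=200):
    """Return the class names that have fewer than `minimum` images."""
    return [name for name, n in counts.items() if n < minimum]


if __name__ == "__main__":
    counts = count_images("dataset")
    print(counts)
    print("Below minimum:", classes_below_minimum(counts))
```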

Toy Dataset (Pipeline Validation)

Before collecting real images, validate the pipeline end-to-end with synthetic data:

uv run python scripts/generate_toy_dataset.py --output dataset --per-class 10

This creates synthetic images with distinct color palettes. The toy model should reach ~100% validation accuracy — it just proves the plumbing works.

Data Augmentation

Training automatically applies:

Random rotation (±15 degrees)
Random brightness adjustment (±20%)
Random horizontal flip
Mild Gaussian blur (simulates water turbidity)

Training

uv run smartpot-train \
    --dataset dataset \
    --epochs 20 \
    --batch-size 32 \
    --lr 1e-3 \
    --output models/checkpoints \
    --device auto

The training loop:

Uses Adam optimizer with CrossEntropyLoss
Checkpoints the best model by validation accuracy
Prints per-epoch train/val loss and accuracy
Outputs a confusion matrix and classification report at the end

The best checkpoint is saved to models/checkpoints/best_model.pth.
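To make the end-of-training report easier to read, here is what a confusion matrix and per-class precision and recall amount to in plain Python. It is a minimal illustration, not the code smartpot-train runs, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts samples with true class t that were predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_report(matrix, class_names):
    """Precision and recall for each class, derived from the confusion matrix."""
    report = {}
    for i, name in enumerate(class_names):
        true_positive = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum: everything predicted as class i
        actual = sum(matrix[i])                    # row sum: everything that really is class i
        report[name] = {
            "precision": true_positive / predicted if predicted else 0.0,
            "recall": true_positive / actual if actual else 0.0,
        }
    return report


if __name__ == "__main__":
    names = ["blue_crab", "finfish", "horseshoe_crab"]
    matrix = confusion_matrix([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], len(names))
    print(matrix)  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
    print(per_class_report(matrix, names))
```

Off-diagonal cells show which species get mistaken for which, which is usually the fastest way to decide where more training images are needed.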

Export

ESP32-P4 (ESP-DL)

Export the trained detection model for ESP32-P4 deployment:

uv run smartpot-export-p4 \
    --checkpoint models/checkpoints/best_model.pth \
    --output models/species_detect

The export pipeline:

PyTorch → ONNX export with detection head
ONNX → ESP-PPQ quantization (INT8, calibrated for P4 memory layout)
ESP-PPQ → ESP-DL format (.espdl)
Output validation

If esp-ppq is not installed, the tool stops after the ONNX export and prints manual conversion instructions.
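For intuition about what the INT8 step does, the sketch below shows generic symmetric per-tensor quantization: a calibration pass finds the largest magnitude, and every value is then rounded to an 8-bit integer times a scale. This is an illustration of the general idea only. How ESP-PPQ actually chooses and constrains its scales may differ, and none of these function names come from ESP-PPQ.

```python
def calibrate_scale(samples):
    """Pick a scale so the largest observed magnitude maps to 127."""
    peak = max(abs(v) for v in samples)
    return peak / 127.0 if peak > 0 else 1.0


def quantize(values, scale):
    """Round each value to the nearest INT8 step, clipping to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Recover approximate float values from INT8 codes."""
    return [q * scale for q in q_values]


if __name__ == "__main__":
    calibration = [-2.0, 0.5, 1.0]           # stand-in for activations seen during calibration
    scale = calibrate_scale(calibration)
    codes = quantize([-2.0, 0.0, 0.3, 10.0], scale)
    print(codes)                              # 10.0 is outside the calibrated range and clips to 127
    print(dequantize(codes, scale))
```

The clipping of out-of-range values is why calibration images should look like deployment images: ranges measured on unrepresentative data lead to larger quantization error in the field.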

ESP32-CAM (TFLite — Legacy)

The original classification export path is preserved:

uv run smartpot-export \
    --checkpoint models/checkpoints/best_model.pth \
    --output models/species_model

See the archived training guide for full ESP32-CAM deployment details.

Deployment

ESP32-P4

Generate calibration data for your pot geometry (focal length in pixels, camera-to-target distance in millimetres):

python ml-pipeline/scripts/generate_calibration.py \
    --focal-length 500.0 \
    --pot-distance 350.0 \
    --output calibration.bin

Copy species_detect.espdl and calibration.bin to the MicroSD card root

Flash the firmware:

cd firmware
idf.py set-target esp32p4
idf.py build
idf.py flash monitor

Verify with the serial monitor:

SmartPot Trap Unit v0.2-p4
Camera initialized: OV5647 NoIR (MIPI CSI)
Calibration loaded: focal=500.0px dist=350mm img=320x320
ESP-DL model ready
Door default: UNLOCKED
Tether UART initialized
Waiting for tether connection...
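The calibration values in the log above suggest how a bounding box becomes a size in millimetres. The sketch below assumes a simple pinhole camera model with the animal at the calibrated distance; the firmware's actual computation is not shown in this guide and may include corrections this ignores. The function names are hypothetical.

```python
def pixels_to_mm(length_px, focal_px=500.0, distance_mm=350.0):
    """Pinhole estimate of real-world length for an object at distance_mm."""
    return length_px * distance_mm / focal_px


def mm_to_pixels(length_mm, focal_px=500.0, distance_mm=350.0):
    """Inverse mapping: how many pixels a real-world length spans at distance_mm."""
    return length_mm * focal_px / distance_mm


if __name__ == "__main__":
    print(pixels_to_mm(200))            # 140.0
    print(round(mm_to_pixels(10), 1))   # 14.3
```

With focal=500.0px and dist=350mm, a 200 px wide box corresponds to 140 mm, and the ±10mm size target works out to roughly ±14 px of box error. Under this model, an animal closer to or farther from the camera than the calibrated distance will be over- or under-measured in proportion.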

Testing

Run the full test suite:

cd ml-pipeline
uv run pytest -v

Tests cover:

Dataset loading and augmentation (test_dataset.py)
Model architecture and output shapes (test_model.py)
ONNX export, validation, and C header generation (test_export.py)

Lint with ruff:

uv run ruff check src/ tests/ scripts/

Accuracy Targets

Metric	Target	Notes
Detection (mAP@50)	>85%	Bounding box accuracy
Size measurement	±10mm	At typical pot dimensions
Species classification	>90%	Important for data logging
False release rate	<2%	Keepers incorrectly released
False retention rate	<5%	Bycatch incorrectly kept
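The two error rates in the table can be computed from counts logged during field trials. The table does not define the denominators, so this sketch assumes that false release is measured against all keepers seen and false retention against all bycatch seen. The function names are hypothetical.

```python
def error_rates(keepers_released, keepers_total, bycatch_kept, bycatch_total):
    """Return (false release rate, false retention rate) as fractions."""
    false_release = keepers_released / keepers_total if keepers_total else 0.0
    false_retention = bycatch_kept / bycatch_total if bycatch_total else 0.0
    return false_release, false_retention


def meets_targets(false_release, false_retention):
    """Check the rates against the <2% and <5% targets from the table."""
    return false_release < 0.02 and false_retention < 0.05


if __name__ == "__main__":
    release, retention = error_rates(
        keepers_released=3, keepers_total=200, bycatch_kept=4, bycatch_total=100
    )
    print(f"false release {release:.1%}, false retention {retention:.1%}")
    print("meets targets:", meets_targets(release, retention))
```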