Platform Decision: ESP32-CAM to ESP32-P4

The ESP32-CAM works. It classifies species at 300ms per frame, draws a few hundred milliamps at most, and fits inside a crab pot. But it has a hard ceiling: with 520KB of SRAM and no hardware AI acceleration, it can run a classifier (what species?) but not a detector (where is it, and how big?). Size estimation, the single most valuable missing capability, requires bounding box detection, and bounding box detection requires more compute than the ESP32-S has to give.

The ESP32-P4 removes that ceiling. This page documents the evaluation and the decision to adopt it.

Current Constraints

The On-Device Vision page documents the full architecture and its limitations. The short version:

Classification only. MobileNetV2 on the ESP32-CAM tells you what species, not how big. Distinguishing legal-size from undersized relies on training separate size-range classes — a blunt instrument that doesn’t generalize well across camera angles and crab postures.
No object detection. YOLO-class models that output bounding boxes need 2–8MB of working memory for intermediate activations. The ESP32-S has 520KB of SRAM. Even with PSRAM offload, the memory bus bandwidth (8-bit SPI at 80MHz) creates a bottleneck that makes real-time detection impractical.
Wasted radio silicon. The ESP32-CAM includes Wi-Fi and Bluetooth radios that are permanently disabled in the submerged unit. The buoy handles all communications. That radio hardware still draws leakage current and occupies die area that could be compute.
80mA idle monitoring. The ESP32-S must run at full clock to do 1 fps frame differencing for motion detection. There’s no low-power core that could handle this while the main CPU sleeps.
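
To give a feel for why activation memory is the limiting factor, here is a rough sketch that sizes the working set of a single convolution layer at INT8. The layer shapes are hypothetical and chosen only to show the scale; real models, and the share of the 520KB that firmware actually leaves free, will differ.

```python
# Rough activation-memory estimate for one convolution layer at INT8.
# Layer shapes below are hypothetical, chosen only to show the scale.

SRAM_BYTES = 520 * 1024  # ESP32-S on-chip SRAM, from the text above


def feature_map_bytes(height, width, channels, bytes_per_value=1):
    """Size of one activation tensor (INT8 = 1 byte per value)."""
    return height * width * channels * bytes_per_value


def layer_working_set(in_shape, out_shape):
    """A layer needs its input and output tensors live at the same time."""
    return feature_map_bytes(*in_shape) + feature_map_bytes(*out_shape)


# Hypothetical early detector layer: 320x320 RGB in, 160x160x16 out.
detector_layer = layer_working_set((320, 320, 3), (160, 160, 16))
# Hypothetical classifier-style layer: 96x96 RGB in, 48x48x16 out.
classifier_layer = layer_working_set((96, 96, 3), (48, 48, 16))

print(f"detector layer:   {detector_layer / 1024:.0f} KB")
print(f"classifier layer: {classifier_layer / 1024:.0f} KB")
print("detector fits in SRAM:  ", detector_layer <= SRAM_BYTES)
print("classifier fits in SRAM:", classifier_layer <= SRAM_BYTES)
```

Under these assumed shapes, a single detector-style layer already exceeds the on-chip SRAM before weights, stack, or camera buffers are counted, while the classifier-style layer fits comfortably.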

For the full picture, see On-Device Vision: What Can Go Wrong and the Submerged Unit Reference.

What the P4 Changes

The ESP32-P4 is Espressif’s first SoC designed purely for edge compute — no wireless radio at all. All silicon area goes to CPU, memory, and hardware accelerators.

Spec	ESP32-CAM (ESP32-S)	ESP32-P4	SmartPot Impact
CPU	Dual Xtensa LX6, 240MHz	Dual RISC-V HP, 400MHz	~3× inference throughput (wider pipeline + higher clock)
SRAM	520KB	768KB	Keeps the hottest activation tensors on-chip; larger buffers go to fast PSRAM
PSRAM	4MB (8-bit SPI, 80MHz)	32MB (OPI, 200MHz)	20× bandwidth — PSRAM stops being a bottleneck
Camera interface	8-bit DVP (parallel)	MIPI CSI (2-lane)	Higher resolution, lower power, standardized connector
AI acceleration	None	PPA (pixel processing accelerator), hardware vector instructions	Accelerates pre/post-processing; inference benefits from RISC-V vector ops
Video encoding	None	H.264 hardware encoder	On-demand video clips for remote inspection (future)
Wireless	Wi-Fi 802.11 b/g/n + BT 4.2	None	Eliminates ~15mA leakage from unused radio; buoy handles comms
Low-power core	None	LP RISC-V core, 40MHz	Motion detection at single-digit mA instead of 80mA
Price (module)	~$7 (Ai-Thinker)	~$19 (Waveshare P4-NANO)	2.7× cost increase, but enables size measurement

Three Capabilities That Matter

Object Detection with Bounding Boxes

This is the big one.

The ESP32-P4’s memory architecture — 768KB SRAM + 32MB OPI PSRAM at 200MHz — can run YOLO-class object detection models that output bounding boxes, not just class labels. Espressif’s own esp-detection framework demonstrates this: real-time person/object detection at >18 FPS on the P4.

Why bounding boxes change everything for SmartPot: if you know the pixel coordinates of a crab’s bounding box and you know the physical geometry of the pot interior (fixed, known dimensions), you can compute the crab’s actual size using the pinhole camera model:

real_size = (pixel_size × distance_to_subject) / focal_length

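The pot geometry constrains the distance-to-subject variable: the crab is standing on the bottom panel at a known distance from the lens. With pixel_size and focal_length both expressed in pixels, real_size comes out in the same unit as distance_to_subject. That turns a 2D bounding box into a physical size measurement, accurate to within ~5–10mm at typical pot dimensions.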
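
The pinhole relation can be sketched in a few lines of Python. This is an illustration, not project code: the function names, the 3.6mm lens, the 400mm lens-to-floor distance, and the 450-pixel box width are all assumptions. The sketch also assumes the bounding box is expressed in full-resolution sensor pixels; a box from a downscaled model input would need rescaling first.

```python
def focal_length_px(focal_length_mm, pixel_pitch_um):
    """Convert a lens focal length in mm to pixels using the sensor pixel pitch."""
    return focal_length_mm * 1000.0 / pixel_pitch_um


def real_size_mm(pixel_size, distance_mm, focal_px):
    """Pinhole model: real_size = pixel_size * distance / focal_length."""
    return pixel_size * distance_mm / focal_px


# Assumed values for illustration only.
LENS_FOCAL_MM = 3.6       # hypothetical lens
PIXEL_PITCH_UM = 1.4      # pixel size of a 1.4 um sensor
DISTANCE_MM = 400.0       # hypothetical lens-to-pot-floor distance
BBOX_WIDTH_PX = 450       # hypothetical bounding box width

f_px = focal_length_px(LENS_FOCAL_MM, PIXEL_PITCH_UM)
width_mm = real_size_mm(BBOX_WIDTH_PX, DISTANCE_MM, f_px)
print(f"focal length: {f_px:.0f} px, estimated width: {width_mm:.1f} mm")
```

Because the result scales linearly with distance, any error in the assumed lens-to-floor distance carries straight through to the size estimate, which is why the fixed pot geometry matters.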

This eliminates the “no size estimation” limitation entirely. Instead of training separate classes for “keeper” and “undersized” (which requires enormous labeled datasets and doesn’t generalize), the system measures the crab directly and compares against regulatory minimums per species.
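
A minimal sketch of the comparison step follows, assuming a measured size in millimetres is already available. The species keys and minimum sizes are placeholders rather than real regulations, and the `tolerance_mm` band is one possible way to account for the ~5–10mm measurement accuracy: measurements too close to the limit are flagged rather than decided.

```python
from enum import Enum


class Verdict(Enum):
    KEEPER = "keeper"
    UNDERSIZED = "undersized"
    UNCERTAIN = "uncertain"


# Placeholder minimum sizes in mm; real values come from local regulations.
MIN_SIZE_MM = {"species_a": 127.0, "species_b": 159.0}


def size_verdict(species, measured_mm, tolerance_mm=10.0):
    """Compare a measurement against the species minimum, with an error band."""
    minimum = MIN_SIZE_MM[species]
    if measured_mm >= minimum + tolerance_mm:
        return Verdict.KEEPER
    if measured_mm <= minimum - tolerance_mm:
        return Verdict.UNDERSIZED
    # Within the measurement error of the limit: do not guess.
    return Verdict.UNCERTAIN


for size in (140.0, 130.0, 110.0):
    print(size, size_verdict("species_a", size).value)
```

Adding a new species or a changed regulation is then a table update, not a retraining job.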

LP-Core for Motion Detection

The ESP32-P4 includes a secondary low-power RISC-V core running at 40MHz. This core stays awake during deep sleep while the main dual-core CPU is powered down.

Currently, the ESP32-CAM must run its full 240MHz CPU to do 1 fps frame differencing for motion detection — 80mA of current draw just to watch an empty pot. The LP-Core can handle the same frame differencing at a fraction of the power, waking the main CPU only when something actually enters the trap.
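
Frame differencing itself is simple, which is what makes it a plausible job for a small core. The sketch below shows the logic in Python for readability; firmware on the LP-Core would be written in C, and the frame size, pixel values, and threshold here are placeholders.

```python
def mean_abs_diff(prev_frame, curr_frame):
    """Mean absolute difference between two grayscale frames (flat lists, 0-255)."""
    if len(prev_frame) != len(curr_frame) or not curr_frame:
        raise ValueError("frames must be non-empty and the same size")
    total = sum(abs(a - b) for a, b in zip(prev_frame, curr_frame))
    return total / len(curr_frame)


def motion_detected(prev_frame, curr_frame, threshold=8.0):
    """Signal a wake-up when the average pixel change exceeds the threshold."""
    return mean_abs_diff(prev_frame, curr_frame) > threshold


# Tiny 8x8 placeholder frames: an empty pot, then something entering one corner.
empty_pot = [50] * 64
with_crab = [200] * 16 + [50] * 48

print(motion_detected(empty_pot, empty_pot))  # no change between frames
print(motion_detected(empty_pot, with_crab))  # a quarter of the frame changed
```

In practice the threshold would need tuning against sensor noise and drifting sediment so that the main CPU is not woken for nothing.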

Mode	ESP32-CAM	ESP32-P4 (estimated)
Deep sleep	10µA	TBD (datasheet v0.5 — not finalized)
Idle monitoring (1 fps)	80mA (main CPU)	~5–10mA (LP-Core)
Active inference	350mA	~400–500mA (higher clock, more cores)

The idle monitoring phase dominates total energy consumption — pots sit empty for hours between catch events. An 8–16× reduction in idle current directly extends tether power budget or enables smaller buoy solar panels.
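
The effect on the overall budget can be estimated with a duty-cycle average. The sketch below uses the current figures from the table above, taking the pessimistic end of the P4 estimates, and assumes the unit spends 1% of its time in active inference. That fraction is a guess, not a measurement, and deep sleep is ignored.

```python
def average_current_ma(idle_ma, active_ma, active_fraction):
    """Time-weighted average current for a two-state duty cycle."""
    return idle_ma * (1.0 - active_fraction) + active_ma * active_fraction


ACTIVE_FRACTION = 0.01  # assumed share of time spent running inference

esp32_cam = average_current_ma(idle_ma=80.0, active_ma=350.0,
                               active_fraction=ACTIVE_FRACTION)
# Pessimistic end of the P4 estimates: 10 mA idle, 500 mA active.
esp32_p4 = average_current_ma(idle_ma=10.0, active_ma=500.0,
                              active_fraction=ACTIVE_FRACTION)

print(f"ESP32-CAM average: {esp32_cam:.1f} mA")
print(f"ESP32-P4 average:  {esp32_p4:.1f} mA")
print(f"ratio: {esp32_cam / esp32_p4:.1f}x")
```

Under these assumptions the average saving is smaller than the idle-only ratio because active inference on the P4 costs more, but the idle term still dominates the total.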

No Wireless Is a Feature

The ESP32-CAM includes Wi-Fi 802.11 b/g/n and Bluetooth 4.2. In the submerged unit, both are permanently disabled — radio doesn’t penetrate saltwater, and the buoy handles all communications over the wired tether.

But “disabled” doesn’t mean “free.” The radio frontend still draws leakage current, and RF calibration data occupies flash. The ESP32-P4 simply doesn’t have a radio. That silicon area is reallocated to compute and memory controllers.

This matches SmartPot’s architecture perfectly: the submerged unit is a sensor/actuator node that talks to the Smart Buoy over a wired tether. It has no need for wireless, and the P4 doesn’t pretend otherwise.

Framework Shift: ESP-DL over TFLite Micro

The current inference stack is TensorFlow Lite Micro (TFLite Micro). On the ESP32-P4, the optimal framework is Espressif’s ESP-DL — and the reasons are architectural, not political.

TFLite Micro treats all memory as a single flat arena. On the ESP32-CAM with its 4MB PSRAM over an 8-bit SPI bus, this works fine because everything ends up in PSRAM anyway. On the P4, with 768KB of fast SRAM and 32MB of fast-but-not-as-fast OPI PSRAM, a flat arena leaves performance on the table.

ESP-DL is aware of the memory hierarchy. It places latency-sensitive activation tensors in SRAM and bulk weight storage in PSRAM, scheduling data movement to minimize stalls. On dual-core RISC-V, it can also pipeline inference across both cores — one core computing layer N while the other prefetches weights for layer N+1.
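
The idea of tiered placement can be illustrated with a simple greedy planner. This is not ESP-DL's actual algorithm or API; `place_tensors`, the tensor names, sizes, and touch counts are all made up for the example, and it optimistically treats the full 768KB as available.

```python
KB = 1024


def place_tensors(tensors, sram_budget):
    """Greedy placement: most frequently touched tensors get SRAM first.

    tensors: list of (name, size_bytes, touches_per_inference).
    Returns a dict mapping each tensor name to "SRAM" or "PSRAM".
    """
    placement = {}
    remaining = sram_budget
    # Sort by touches (descending), then by size (ascending) to pack tightly.
    for name, size, _touches in sorted(tensors, key=lambda t: (-t[2], t[1])):
        if size <= remaining:
            placement[name] = "SRAM"
            remaining -= size
        else:
            placement[name] = "PSRAM"
    return placement


# Hypothetical model: activations are written then read, weights read once.
model_tensors = [
    ("weights_backbone", 3 * 1024 * KB, 1),
    ("activation_a", 300 * KB, 2),
    ("activation_b", 300 * KB, 2),
    ("weights_head", 200 * KB, 1),
]

plan = place_tensors(model_tensors, sram_budget=768 * KB)
for name, tier in plan.items():
    print(f"{name:18s} -> {tier}")
```

With these made-up sizes the two activation buffers land in SRAM and both weight blocks stay in PSRAM, which is the split described above. A flat arena makes no such distinction.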

Aspect	TFLite Micro	ESP-DL
Memory allocation	Flat arena (one tier)	Tiered SRAM/PSRAM placement
Multi-core	Single-core only	Dual-core pipeline scheduling
Hardware accel	Generic C kernels	Uses P4 vector instructions + PPA
Model format	`.tflite` (FlatBuffer)	ESP-DL format (via ESP-PPQ quantizer)
Maintained by	Google (community)	Espressif (first-party for P4)

The training pipeline doesn’t change. Data collection, labeling, and PyTorch training remain identical. What changes is the export stage:

Current:  PyTorch → ONNX → TFLite (quantized) → SD card
P4:       PyTorch → ONNX → ESP-PPQ (quantized) → ESP-DL format → SD card

ESP-PPQ is Espressif’s quantization toolkit, a fork of PPQ (the PPL Quantization Tool) tailored for ESP-DL’s memory layout. The quantization math is the same (INT8 with calibration data), but the output format is optimized for the P4’s memory controller.
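
For readers unfamiliar with "INT8 with calibration data", the generic math looks roughly like the sketch below: symmetric per-tensor quantization with a scale derived from the largest value seen during calibration. It does not use the ESP-PPQ API, and the toolkit's actual calibration algorithm and scale representation may differ. The function names and sample values are illustrative.

```python
def calibrate_scale(calibration_values):
    """Pick a scale so the largest calibration magnitude maps to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Map floats to INT8, clamping anything outside the calibrated range."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize(quantized, scale):
    """Recover approximate float values from INT8."""
    return [q * scale for q in quantized]


# Illustrative calibration samples and a tensor to quantize.
scale = calibrate_scale([-2.54, 0.5, 1.0])
q = quantize([1.0, -2.54, 5.0], scale)
print("scale:", scale)
print("quantized:", q)
print("restored:", dequantize(q, scale))
```

The value 5.0 lies outside the calibrated range and saturates at 127, which is why calibration images should cover the lighting and turbidity conditions the deployed camera will actually see.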

For the current training pipeline, see Train the Species Model.

Camera Upgrade Path

The ESP32-CAM uses a DVP (parallel) camera interface limited to the OV2640 (2MP, 1600×1200). The ESP32-P4 uses MIPI CSI (2-lane), opening the door to higher-resolution sensors.

The leading candidate is the OV5647 NoIR — the same 5MP sensor used in Raspberry Pi NoIR Camera Module v1, but without an IR-cut filter:

Spec	OV2640 (current)	OV5647 NoIR (P4 candidate)
Resolution	2MP (1600×1200)	5MP (2592×1944)
Pixel size	2.2µm	1.4µm
IR sensitivity	Native (no filter needed)	Native (NoIR variant ships without IR-cut)
Interface	DVP (8-bit parallel)	MIPI CSI-2 (2-lane)
Frame rate (full res)	15 fps	15 fps
Frame rate (VGA)	30 fps	60 fps
Availability	Ubiquitous, $2–3	Ubiquitous, $5–8

The OV5647 has about 2.6× the pixel count of the OV2640, or roughly 1.6× the linear resolution. With the same field of view, that gives the bounding box detector more spatial resolution to work with: a crab that spans 20 pixels on the OV2640 spans about 32 on the OV5647 at the same physical distance. More pixels per crab means more accurate size measurement.
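
A quick way to compare the two sensors for measurement purposes is millimetres of pot floor per pixel. The sketch assumes both cameras see the same 600mm-wide field of view at the pot floor, which is a placeholder value, and that measurement uses the full sensor width. Running detection on a downscaled image would coarsen both figures.

```python
def mm_per_pixel(field_of_view_mm, sensor_width_px):
    """Physical width covered by one pixel at the subject distance."""
    return field_of_view_mm / sensor_width_px


FIELD_OF_VIEW_MM = 600.0  # assumed width of pot floor visible to the camera

ov2640 = mm_per_pixel(FIELD_OF_VIEW_MM, 1600)  # OV2640 horizontal resolution
ov5647 = mm_per_pixel(FIELD_OF_VIEW_MM, 2592)  # OV5647 horizontal resolution

print(f"OV2640: {ov2640:.3f} mm/px")
print(f"OV5647: {ov5647:.3f} mm/px")
```

A smaller mm-per-pixel figure means a one-pixel error in the bounding box edge translates into a smaller error in the measured size.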

The P4 also includes an on-chip ISP (Image Signal Processor) that handles debayering, white balance, noise reduction, and lens distortion correction in hardware. On the ESP32-CAM, these operations are done in software (or skipped), consuming CPU cycles that could be running inference.

Board Candidate: Waveshare ESP32-P4-NANO

Off-the-shelf ESP32-P4 development boards are emerging. The leading candidate for SmartPot prototyping is the Waveshare ESP32-P4-NANO:

Form factor: 50×50mm — fits inside standard crab pot wire mesh panels
Price: ~$19 USD
MIPI CSI connector: Direct OV5647 attachment
PoE header: 802.3af/at header — voltage compatibility with our 24V tether topology needs validation, but the electrical interface is there
ESP32-C6 companion: On-board wireless chip for bench testing and diagnostics (Wi-Fi 6 + BLE 5.0 + 802.15.4). Not used in deployment, but valuable during firmware development
TF card slot: MicroSD for model storage, same as current workflow
RTC battery header: Real-time clock backup for timestamped catch logging across power cycles
USB-C: Programming and debug interface

At $19, this is roughly 2.7× the cost of an ESP32-CAM Ai-Thinker module (~$7). The cost delta buys bounding box detection, size measurement, LP-Core idle savings, MIPI CSI camera support, and 20× the memory bandwidth. For a production SmartPot unit targeting $80 total BOM, a $12 increase in the compute module is acceptable if it enables the feature that makes every other feature more valuable — knowing not just what you caught, but how big it is.

Open Questions

This evaluation is based on published specifications and Espressif’s esp-detection demos. Several items need hands-on validation before committing to a P4 transition:

Deep sleep current. The ESP32-P4 datasheet is v0.5 (preliminary). Deep sleep figures aren’t finalized. If deep sleep current is significantly higher than the ESP32-S’s 10µA, it changes the power budget math. This must be measured on real hardware.
Firmware maturity. ESP-IDF support for the P4 is solid and actively developed. Arduino-ESP32 support is catching up but not feature-complete. Our firmware is currently Arduino-based — a port to ESP-IDF may be required. This is engineering effort, not a technical blocker.
No drop-in camera module. The ESP32-CAM is a single board: MCU + camera + SD slot. The P4-NANO requires a separate MIPI CSI camera module, connected via flex cable. In a waterproof enclosure that gets banged around on a fishing boat, flex cable reliability is a concern that needs mechanical testing.
Export pipeline rework. Moving from TFLite to ESP-DL means the model export and quantization toolchain changes. The training pipeline (PyTorch + timm) stays the same, but CI/CD scripts, model validation tests, and the SD card deployment workflow all need updates.
PoE header voltage compatibility. The P4-NANO’s PoE header is designed for 802.3af/at (48V nominal). Our tether runs at 24V. The header may work with a custom injector, or we may need to bypass it and wire directly to a buck converter input. Needs electrical testing.
IR illumination compatibility. The OV5647 NoIR’s IR sensitivity profile may differ from the OV2640. The 850nm LED array should still work, but optimal exposure settings and calibration data will need to be re-established.

Decision Outcome

The evaluation confirmed that the ESP32-P4 enables the single most valuable missing capability: bounding-box size measurement via YOLO detection + pinhole camera model. The decision is made — ESP32-P4 is the primary submerged unit platform.

Current submerged unit specs: Submerged Unit Reference
Vision architecture: On-Device Vision
Training and export pipeline: Train the Species Model
Flash guide: Flash the ESP32-P4

The ESP32-CAM remains functional for classification-only deployments. Legacy documentation is preserved in the Archive.

Sources

ESP32-P4 Datasheet v0.5 — Espressif preliminary specifications
esp-detection — Espressif’s object detection framework for ESP32-P4
ESP-DL — Espressif Deep Learning library (inference engine for P4)
ESP-PPQ — Espressif’s quantization toolkit for ESP-DL model export
Waveshare ESP32-P4-NANO — Board specifications and pinout
CNX Software ESP32-P4 coverage — Independent hardware reviews and benchmarks
ESP-IDF Programming Guide: ESP32-P4 — Official firmware development documentation