Platform Decision: ESP32-CAM to ESP32-P4
The ESP32-CAM works. It classifies species at 300ms per frame, runs on milliwatts, and fits inside a crab pot. But it has a hard ceiling: the 520KB of SRAM and lack of hardware AI acceleration mean it can run a classifier (what species?) but not a detector (where is it, and how big?). Size estimation — the single most valuable missing capability — requires bounding box detection, and bounding box detection requires more compute than the ESP32-S has to give.
The ESP32-P4 removes that ceiling. This page documents the evaluation and the decision to adopt it.
Current Constraints
The On-Device Vision page documents the full architecture and its limitations. The short version:
- Classification only. MobileNetV2 on the ESP32-CAM tells you what species, not how big. Distinguishing legal-size from undersized relies on training separate size-range classes — a blunt instrument that doesn’t generalize well across camera angles and crab postures.
- No object detection. YOLO-class models that output bounding boxes need 2–8MB of SRAM for intermediate activations. The ESP32-S has 520KB. Even with PSRAM offload, the memory bus bandwidth (8-bit SPI at 80MHz) creates a bottleneck that makes real-time detection impractical.
- Wasted radio silicon. The ESP32-CAM includes Wi-Fi and Bluetooth radios that are permanently disabled in the submerged unit. The buoy handles all communications. That radio hardware still draws leakage current and occupies die area that could be compute.
- 80mA idle monitoring. The ESP32-S must run at full clock to do 1 fps frame differencing for motion detection. There’s no low-power core that could handle this while the main CPU sleeps.
For the full picture, see On-Device Vision: What Can Go Wrong and the Submerged Unit Reference.
What the P4 Changes
The ESP32-P4 is Espressif’s first SoC designed purely for edge compute — no wireless radio at all. All silicon area goes to CPU, memory, and hardware accelerators.
| Spec | ESP32-CAM (ESP32-S) | ESP32-P4 | SmartPot Impact |
|---|---|---|---|
| CPU | Dual Xtensa LX6, 240MHz | Dual RISC-V HP, 400MHz | ~3× inference throughput (wider pipeline + higher clock) |
| SRAM | 520KB | 768KB | More activation tensors stay in fast SRAM; the remainder spills to much faster PSRAM |
| PSRAM | 4MB (8-bit SPI, 80MHz) | 32MB (OPI, 200MHz) | 20× bandwidth — PSRAM stops being a bottleneck |
| Camera interface | 8-bit DVP (parallel) | MIPI CSI (2-lane) | Higher resolution, lower power, standardized connector |
| AI acceleration | None | PPA (pixel processing accelerator), hardware vector instructions | Accelerates pre/post-processing; inference benefits from RISC-V vector ops |
| Video encoding | None | H.264 hardware encoder | On-demand video clips for remote inspection (future) |
| Wireless | Wi-Fi 802.11 b/g/n + BT 4.2 | None | Eliminates ~15mA leakage from unused radio; buoy handles comms |
| Low-power core | None | LP RISC-V core, 40MHz | Motion detection at single-digit mA instead of 80mA |
| Price (module) | ~$7 (Ai-Thinker) | ~$19 (Waveshare P4-NANO) | 2.7× cost increase, but enables size measurement |
Three Capabilities That Matter
Object Detection with Bounding Boxes
This is the big one.
The ESP32-P4’s memory architecture — 768KB SRAM + 32MB OPI PSRAM at 200MHz — can run YOLO-class object detection models that output bounding boxes, not just class labels. Espressif’s own esp-detection framework demonstrates this: real-time person/object detection at >18 FPS on the P4.
Why bounding boxes change everything for SmartPot: if you know the pixel coordinates of a crab’s bounding box and you know the physical geometry of the pot interior (fixed, known dimensions), you can compute the crab’s actual size using the pinhole camera model:
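real_size = (pixel_size × distance_to_subject) / focal_length
Here pixel_size is the bounding box extent in pixels and focal_length is the lens focal length expressed in pixels, so real_size comes out in the same units as distance_to_subject. The pot geometry constrains the distance-to-subject variable: the crab is standing on the bottom panel at a known distance from the lens. That turns a 2D bounding box into a real-world size measurement, accurate to within ~5–10mm at typical pot dimensions.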
This eliminates the “no size estimation” limitation entirely. Instead of training separate classes for “keeper” and “undersized” (which requires enormous labeled datasets and doesn’t generalize), the system measures the crab directly and compares against regulatory minimums per species.
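The pinhole calculation above can be sketched in a few lines of Python. This is a minimal illustration, not SmartPot firmware: the `BoundingBox` type, the 2.8mm lens, the 400mm lens-to-floor distance, and the 159mm legal minimum are all assumed placeholder values. Only the 1.4µm pixel pitch comes from the OV5647 figures on this page.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """Axis-aligned detection box in pixel coordinates (hypothetical type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def longest_side_px(self) -> float:
        """Longer of the box's width and height, in pixels."""
        return max(self.x_max - self.x_min, self.y_max - self.y_min)


def focal_length_px(focal_length_mm: float, pixel_pitch_um: float) -> float:
    """Express a lens focal length in pixels, given the sensor's pixel pitch."""
    return focal_length_mm / (pixel_pitch_um / 1000.0)


def real_size_mm(pixel_size: float, distance_mm: float, focal_px: float) -> float:
    """Pinhole model: real_size = (pixel_size * distance_to_subject) / focal_length."""
    return pixel_size * distance_mm / focal_px


# Assumed values for illustration only; none are measured SmartPot figures.
FOCAL_PX = focal_length_px(focal_length_mm=2.8, pixel_pitch_um=1.4)  # about 2000 px
LENS_TO_FLOOR_MM = 400.0   # fixed by the pot geometry
LEGAL_MINIMUM_MM = 159.0   # placeholder regulatory minimum

box = BoundingBox(x_min=900, y_min=700, x_max=1700, y_max=1300)
size_mm = real_size_mm(box.longest_side_px, LENS_TO_FLOOR_MM, FOCAL_PX)
print(f"estimated size: {size_mm:.0f} mm, keeper: {size_mm >= LEGAL_MINIMUM_MM}")
```

The focal length in pixels only holds for full-resolution frames. If the detector runs on a downscaled or binned image, the value would need to be scaled by the same factor. Using the longest side of the box as the size proxy is also an assumption; a real system would need a rule for crab orientation.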
LP-Core for Motion Detection
The ESP32-P4 includes a secondary low-power RISC-V core running at 40MHz. This core stays awake during deep sleep while the main dual-core CPU is powered down.
Currently, the ESP32-CAM must run its full 240MHz CPU to do 1 fps frame differencing for motion detection — 80mA of current draw just to watch an empty pot. The LP-Core can handle the same frame differencing at a fraction of the power, waking the main CPU only when something actually enters the trap.
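For readers unfamiliar with frame differencing, the logic is small enough to sketch. The example below is an illustration in Python rather than LP-Core firmware (which would be written in C), and the 8×8 frame size and threshold of 8.0 are assumed values, not tuned SmartPot parameters.

```python
from typing import Sequence


def mean_abs_diff(prev: Sequence[int], curr: Sequence[int]) -> float:
    """Mean absolute per-pixel difference between two equal-sized grayscale frames."""
    if len(prev) != len(curr) or len(prev) == 0:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(a - b) for a, b in zip(prev, curr)) / len(prev)


def motion_detected(prev: Sequence[int], curr: Sequence[int],
                    threshold: float = 8.0) -> bool:
    """Report motion when the average pixel change exceeds the threshold."""
    return mean_abs_diff(prev, curr) > threshold


# Two tiny 8x8 grayscale frames (values 0-255), flattened row by row.
empty_pot = bytes([20] * 64)
crab_enters = bytes([20] * 48 + [200] * 16)  # a bright object covers a quarter of the frame

print(motion_detected(empty_pot, empty_pot))    # identical frames
print(motion_detected(empty_pot, crab_enters))  # object appears
```

Because each comparison is one subtraction and one accumulate per pixel, the work scales with frame size, which is why a heavily downsampled frame is a plausible fit for a 40MHz core.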
| Mode | ESP32-CAM | ESP32-P4 (estimated) |
|---|---|---|
| Deep sleep | 10µA | TBD (datasheet v0.5 — not finalized) |
| Idle monitoring (1 fps) | 80mA (main CPU) | ~5–10mA (LP-Core) |
| Active inference | 350mA | ~400–500mA (higher clock, more cores) |
The idle monitoring phase dominates total energy consumption — pots sit empty for hours between catch events. An 8–16× reduction in idle current directly extends tether power budget or enables smaller buoy solar panels.
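To see why idle current dominates, it helps to compute a duty-cycle-weighted average. The sketch below uses the current figures from the table above; the 1% active fraction is an assumed duty cycle for illustration, and deep sleep is ignored because the P4 figure is not yet published.

```python
def average_current_ma(idle_ma: float, active_ma: float, active_fraction: float) -> float:
    """Time-weighted average current for a node that is either idle or running inference."""
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    return idle_ma * (1.0 - active_fraction) + active_ma * active_fraction


ACTIVE_FRACTION = 0.01  # assumption: inference runs 1% of the time

esp32_cam = average_current_ma(idle_ma=80.0, active_ma=350.0, active_fraction=ACTIVE_FRACTION)
p4_best = average_current_ma(idle_ma=5.0, active_ma=400.0, active_fraction=ACTIVE_FRACTION)
p4_worst = average_current_ma(idle_ma=10.0, active_ma=500.0, active_fraction=ACTIVE_FRACTION)

print(f"ESP32-CAM average: {esp32_cam:.1f} mA")
print(f"ESP32-P4 average:  {p4_best:.1f} to {p4_worst:.1f} mA (estimated)")
```

Under this assumed duty cycle, the higher active current of the P4 barely moves the average, while the idle current sets almost all of it. The conclusion shifts if catch events are far more frequent than assumed.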
No Wireless Is a Feature
The ESP32-CAM includes Wi-Fi 802.11 b/g/n and Bluetooth 4.2. In the submerged unit, both are permanently disabled — radio doesn’t penetrate saltwater, and the buoy handles all communications over the wired tether.
But “disabled” doesn’t mean “free.” The radio frontend still draws leakage current, and RF calibration data occupies flash. The ESP32-P4 simply doesn’t have a radio. That silicon area is reallocated to compute and memory controllers.
This matches SmartPot’s architecture perfectly: the submerged unit is a sensor/actuator node that talks to the Smart Buoy over a wired tether. It has no need for wireless, and the P4 doesn’t pretend otherwise.
Framework Shift: ESP-DL over TFLite Micro
The current inference stack is TensorFlow Lite Micro (TFLite Micro). On the ESP32-P4, the optimal framework is Espressif’s ESP-DL — and the reasons are architectural, not political.
TFLite Micro treats all memory as a single flat arena. On the ESP32-CAM with its 4MB PSRAM over an 8-bit SPI bus, this works fine because everything ends up in PSRAM anyway. On the P4, with 768KB of fast SRAM and 32MB of fast-but-not-as-fast OPI PSRAM, a flat arena leaves performance on the table.
ESP-DL is aware of the memory hierarchy. It places latency-sensitive activation tensors in SRAM and bulk weight storage in PSRAM, scheduling data movement to minimize stalls. On dual-core RISC-V, it can also pipeline inference across both cores — one core computing layer N while the other prefetches weights for layer N+1.
| Aspect | TFLite Micro | ESP-DL |
|---|---|---|
| Memory allocation | Flat arena (one tier) | Tiered SRAM/PSRAM placement |
| Multi-core | Single-core only | Dual-core pipeline scheduling |
| Hardware accel | Generic C kernels | Uses P4 vector instructions + PPA |
| Model format | .tflite (FlatBuffer) | ESP-DL format (via ESP-PPQ quantizer) |
| Maintained by | Google (community) | Espressif (first-party for P4) |
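The idea of tiered placement can be illustrated with a toy allocator. This is not ESP-DL's actual algorithm or API. It is a simple greedy heuristic, with made-up tensor names, sizes, and access counts, and an assumed 512KB usable SRAM budget, that shows why a hierarchy-aware planner can beat a flat arena.

```python
from typing import Dict, List, NamedTuple


class Tensor(NamedTuple):
    name: str
    size_bytes: int  # must be greater than zero
    accesses: int    # reads and writes during one inference


def place_tensors(tensors: List[Tensor], sram_budget_bytes: int) -> Dict[str, str]:
    """Greedy placement: tensors with the most accesses per byte get SRAM first."""
    placement: Dict[str, str] = {}
    remaining = sram_budget_bytes
    by_density = sorted(tensors, key=lambda t: t.accesses / t.size_bytes, reverse=True)
    for tensor in by_density:
        if tensor.size_bytes <= remaining:
            placement[tensor.name] = "SRAM"
            remaining -= tensor.size_bytes
        else:
            placement[tensor.name] = "PSRAM"  # too large for what is left in SRAM
    return placement


# Hypothetical workload: two activation buffers and one block of weights.
workload = [
    Tensor("activations_a", 300 * 1024, accesses=8),
    Tensor("activations_b", 150 * 1024, accesses=8),
    Tensor("conv_weights", 2 * 1024 * 1024, accesses=1),
]

plan = place_tensors(workload, sram_budget_bytes=512 * 1024)
for name, tier in plan.items():
    print(f"{name}: {tier}")
```

In this toy case both activation buffers land in SRAM and the weights go to PSRAM, which matches the split described above. A real planner also has to account for tensor lifetimes and buffer reuse, which this sketch leaves out.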
The training pipeline doesn’t change. Data collection, labeling, and PyTorch training remain identical. What changes is the export stage:
Current: PyTorch → ONNX → TFLite (quantized) → SD card
P4: PyTorch → ONNX → ESP-PPQ (quantized) → ESP-DL format → SD card
ESP-PPQ is Espressif’s quantization toolkit, a fork of PPQ (PPL Quantization Tool) tailored for ESP-DL’s memory layout. The quantization math is the same (INT8 with calibration data), but the output format is optimized for the P4’s memory controller.
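As a reminder of what "INT8 with calibration data" means, here is a minimal sketch of generic symmetric per-tensor quantization. It does not use ESP-PPQ or TFLite and may differ from the exact scheme either tool applies; the function names and sample values are illustrative assumptions.

```python
from typing import List, Sequence


def calibrate_scale(calibration_values: Sequence[float]) -> float:
    """Pick a scale so the largest magnitude seen during calibration maps to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values: Sequence[float], scale: float) -> List[int]:
    """Round to the nearest INT8 step and clamp to the representable range."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize(quantized: Sequence[int], scale: float) -> List[float]:
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


# Calibration data stands in for activations observed on representative images.
scale = calibrate_scale([-1.0, -0.3, 0.0, 0.1, 1.0])

samples = [1.0, 0.0, -1.0, 0.1, 3.0, -3.0]  # the last two fall outside the calibrated range
codes = quantize(samples, scale)
print(codes)
print([round(v, 4) for v in dequantize(codes, scale)])
```

Values outside the calibrated range are clamped, which is why calibration data should cover the conditions the deployed camera actually sees.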
For the current training pipeline, see Train the Species Model.
Camera Upgrade Path
The ESP32-CAM uses a DVP (parallel) camera interface limited to the OV2640 (2MP, 1600×1200). The ESP32-P4 uses MIPI CSI (2-lane), opening the door to higher-resolution sensors.
The leading candidate is the OV5647 NoIR, the 5MP sensor from the Raspberry Pi Camera Module v1 in its NoIR variant, which ships without an IR-cut filter:
| Spec | OV2640 (current) | OV5647 NoIR (P4 candidate) |
|---|---|---|
| Resolution | 2MP (1600×1200) | 5MP (2592×1944) |
| Pixel size | 2.2µm | 1.4µm |
| IR sensitivity | Native (no filter needed) | Native (NoIR variant ships without IR-cut) |
| Interface | DVP (8-bit parallel) | MIPI CSI-2 (2-lane) |
| Frame rate (full res) | 15 fps | 15 fps |
| Frame rate (VGA) | 30 fps | 60 fps |
| Availability | Ubiquitous, $2–3 | Ubiquitous, $5–8 |
About 2.6× the pixel count (1.6× along each axis) means the bounding box detector has more spatial resolution to work with: at the same field of view and physical distance, a crab that spans 20 pixels on the OV2640 spans roughly 32 pixels on the OV5647. More pixels per crab means more accurate size measurement.
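The resolution gain can be checked directly from the sensor dimensions in the table above. This short calculation assumes both sensors cover the same field of view and that full-resolution frames reach the detector; the 20-pixel starting width is just an example value.

```python
OV2640 = (1600, 1200)  # width, height in pixels
OV5647 = (2592, 1944)


def pixel_count(resolution: tuple) -> int:
    """Total number of pixels in a frame."""
    width, height = resolution
    return width * height


def scaled_width_px(width_px: float, old: tuple, new: tuple) -> float:
    """Width of the same object on the new sensor, assuming an identical field of view."""
    return width_px * new[0] / old[0]


count_ratio = pixel_count(OV5647) / pixel_count(OV2640)
linear_ratio = OV5647[0] / OV2640[0]
crab_px = scaled_width_px(20, OV2640, OV5647)

print(f"pixel count ratio: {count_ratio:.2f}x")
print(f"linear ratio:      {linear_ratio:.2f}x")
print(f"20 px crab becomes about {crab_px:.0f} px")
```

Size accuracy follows the linear ratio rather than the pixel-count ratio, because a bounding box edge is measured along one axis.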
The P4 also includes an on-chip ISP (Image Signal Processor) that handles debayering, white balance, noise reduction, and lens distortion correction in hardware. On the ESP32-CAM, these operations are done in software (or skipped), consuming CPU cycles that could be running inference.
Board Candidate: Waveshare ESP32-P4-NANO
Off-the-shelf ESP32-P4 development boards are emerging. The leading candidate for SmartPot prototyping is the Waveshare ESP32-P4-NANO:
- Form factor: 50×50mm — fits inside standard crab pot wire mesh panels
- Price: ~$19 USD
- MIPI CSI connector: Direct OV5647 attachment
- PoE header: 802.3af/at header — voltage compatibility with our 24V tether topology needs validation, but the electrical interface is there
- ESP32-C6 companion: On-board wireless chip for bench testing and diagnostics (Wi-Fi 6 + BLE 5.0 + 802.15.4). Not used in deployment, but valuable during firmware development
- TF card slot: MicroSD for model storage, same as current workflow
- RTC battery header: Real-time clock backup for timestamped catch logging across power cycles
- USB-C: Programming and debug interface
At $19, this is roughly 2.7× the cost of an ESP32-CAM Ai-Thinker module (~$7). The cost delta buys bounding box detection, size measurement, LP-Core idle savings, MIPI CSI camera support, and 20× the memory bandwidth. For a production SmartPot unit targeting $80 total BOM, a $12 increase in the compute module is acceptable if it enables the feature that makes every other feature more valuable — knowing not just what you caught, but how big it is.
Open Questions
This evaluation is based on published specifications and Espressif’s esp-detection demos. Several items need hands-on validation before committing to a P4 transition:
- Deep sleep current. The ESP32-P4 datasheet is v0.5 (preliminary). Deep sleep figures aren’t finalized. If deep sleep current is significantly higher than the ESP32-S’s 10µA, it changes the power budget math. This must be measured on real hardware.
- Firmware maturity. ESP-IDF support for the P4 is solid and actively developed. Arduino-ESP32 support is catching up but not feature-complete. Our firmware is currently Arduino-based, so a port to ESP-IDF may be required. This is engineering effort, not a technical blocker.
- No drop-in camera module. The ESP32-CAM is a single board: MCU + camera + SD slot. The P4-NANO requires a separate MIPI CSI camera module, connected via flex cable. In a waterproof enclosure that gets banged around on a fishing boat, flex cable reliability is a concern that needs mechanical testing.
- Export pipeline rework. Moving from TFLite to ESP-DL means the model export and quantization toolchain changes. The training pipeline (PyTorch + timm) stays the same, but CI/CD scripts, model validation tests, and the SD card deployment workflow all need updates.
- PoE header voltage compatibility. The P4-NANO’s PoE header is designed for 802.3af/at (48V nominal). Our tether runs at 24V. The header may work with a custom injector, or we may need to bypass it and wire directly to a buck converter input. Needs electrical testing.
- IR illumination compatibility. The OV5647 NoIR’s IR sensitivity profile may differ from the OV2640. The 850nm LED array should still work, but optimal exposure settings and calibration data will need to be re-established.
Decision Outcome
The evaluation confirmed that the ESP32-P4 enables the single most valuable missing capability: bounding-box size measurement via YOLO detection + pinhole camera model. The decision is made — ESP32-P4 is the primary submerged unit platform.
- Current submerged unit specs: Submerged Unit Reference
- Vision architecture: On-Device Vision
- Training and export pipeline: Train the Species Model
- Flash guide: Flash the ESP32-P4
The ESP32-CAM remains functional for classification-only deployments. Legacy documentation is preserved in the Archive.
Sources
- ESP32-P4 Datasheet v0.5 — Espressif preliminary specifications
- esp-detection — Espressif’s object detection framework for ESP32-P4
- ESP-DL — Espressif Deep Learning library (inference engine for P4)
- ESP-PPQ — Espressif’s quantization toolkit for ESP-DL model export
- Waveshare ESP32-P4-NANO — Board specifications and pinout
- CNX Software ESP32-P4 coverage — Independent hardware reviews and benchmarks
- ESP-IDF Programming Guide: ESP32-P4 — Official firmware development documentation
See Also
- On-Device Vision — Current ESP32-CAM classification architecture and constraints
- Submerged Unit Reference — Hardware specifications for the current submerged unit
- Train the Species Model — Training pipeline (shared between ESP32-CAM and P4 paths)
- Why SmartPot? — The industry problems driving these design decisions