OpenMV N6 - firmware mpool limits NPU to 16 MB of 64 MB PSRAM

Hi team, long time no talk :wink:

I’m trying to deploy a custom DeepLabV3+ MobileNetV2 semantic segmentation model (512x512, 5 classes, INT8 quantized, 4.8 MiB TFLite) on the OpenMV N6.
The model uses the same ops as ST’s own deeplab_v3_mobilenetv2_05_16_512_asppv1_int8.tflite in the IDE model zoo.

Offline compilation works with the scripts mpool:
Using the bundled stedgeai v3.0.0 with the n6-allmems-O3 profile (which has 32 MB hyperRAM), the model compiles successfully: 134 NPU epochs, 5.1 MiB weights, ~24 MiB activations. I also generated a valid network_rel.bin via npu_driver.py.

Compilation fails with the firmware mpool:
When I compile with the firmware’s own mpool (firmware/OPENMV_N6/stm32n6.mpool), which is what the ROMFS editor uses via neuralart.json, atonn fails:

Warning: Oauto did not find valid compile options: aborting
total bytes left unallocated=9674752

The firmware mpool exposes only 16 MB of hyperRAM to the NPU. The model needs ~24 MiB of activations, 9.7 MiB more than available.

The hardware has the memory:

I probed the PSRAM directly from the N6’s MicroPython REPL using uctypes.bytearray_at(), writing unique values at 0 MB, 16 MB, 32 MB, and 48 MB offsets, then reading them all back:

+0MB  0x90000000: wrote 0xAAAA0000 read 0xAAAA0000 OK
+16MB 0x91000000: wrote 0xBBBB1111 read 0xBBBB1111 OK
+32MB 0x92000000: wrote 0xCCCC2222 read 0xCCCC2222 OK
+48MB 0x93000000: wrote 0xDDDD3333 read 0xDDDD3333 OK
+64MB 0x94000000: FAULT (boundary)

The N6 has 64 MB of real, writable PSRAM. The fault at +64 MB confirms the boundary.
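A minimal sketch of the probe described above (assumptions: MicroPython with uctypes on the device; `window_at` and `probe` are illustrative helper names, not the exact script used, with the memory mapping factored out so the marker logic stands alone):

```python
# Sketch of the PSRAM probe. On the device, window_at maps a 4-byte view of
# raw PSRAM via uctypes.bytearray_at(); here it is an injected helper.
import struct

PSRAM_BASE = 0x90000000
STEP = 16 * 1024 * 1024  # probe every 16 MB
MARKERS = [0xAAAA0000, 0xBBBB1111, 0xCCCC2222, 0xDDDD3333]

def probe(window_at):
    """Write a unique marker at each 16 MB offset, then read them all back."""
    for i, m in enumerate(MARKERS):
        struct.pack_into("<I", window_at(i * STEP), 0, m)
    results = []
    for i, m in enumerate(MARKERS):
        (val,) = struct.unpack_from("<I", window_at(i * STEP))
        results.append((i * 16, m, val, val == m))
    return results

# On the N6 itself (device-only code, shown as comments):
#   import uctypes
#   window_at = lambda off: uctypes.bytearray_at(PSRAM_BASE + off, 4)
#   for mb, wrote, read, ok in probe(window_at):
#       print("+%dMB: wrote 0x%08X read 0x%08X %s"
#             % (mb, wrote, read, "OK" if ok else "FAIL"))
```

Writing all markers before reading any of them back matters: it catches address aliasing, since a smaller mirrored part would still pass a write-then-immediately-read check at each offset.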

The question:

The firmware mpool at firmware/OPENMV_N6/stm32n6.mpool defines hyperRAM as 16 MB at 0x92000000 with USEMODE_RELATIVE. Could this be expanded to 32 MB (or more) so that larger models like the 512x512 segmentation models in ST’s own model zoo can be compiled and deployed via the ROMFS editor?

The model compiles fine with 32 MB. It does not fit in 16 MB. The hardware has 64 MB.

Setup: OpenMV N6, firmware v4.8.1, MicroPython v1.26.0-77, stedgeai v3.0.0

Thanks!

Hi, we only expose 16MB for models to use on the N6. The memory is actually relocatable, so we put the alloc on the heap. The heap on the N6 is 24MB. We had this at 32MB before, but that caused issues with MicroPython’s heap allocator.

So, to fix this issue, we’d want to change the MP heap to be larger and update the memory pool to allow this.

A custom firmware will let you do this easily. Can you set up your system to compile the firmware? I can then tell you what you need to modify to make a custom firmware image that supports a model of this size.

We may increase the model heap size in the future; however, please understand that the NPU performance is very bad once it needs external RAM. You really want to avoid models that need a lot of external memory (i.e. need more than the 1.75MB onboard on the N6).

Hi @kwagyeman,

Build system is up and running. I can compile the N6 firmware successfully using Docker:

cd openmv/docker
SDK_DIR=~/openmv-sdk-1.1.0 make TARGET=OPENMV_N6

Output:

firmware.bin    1.8 MB
romfs0.img      9.4 MB (38.9% of 24 MB partition)
bootloader.bin  47 KB

Note: I had to add libglib2.0-0 to the Dockerfile for STM32_SigningTool_CLI, and the native build (without Docker) fails with “dangerous relocation” linker errors on GCC 14.3; the Docker build works fine though.

Key files I’ve identified so far:

  • boards/OPENMV_N6/omv_boardconfig.h : memory layout (DRAM=64M, FB=20M, GC=24M, etc.)
  • boards/OPENMV_N6/romfs.json : model manifest with "profile": "default"
  • share/qtcreator/firmware/OPENMV_N6/stm32n6.mpool : NPU memory pools (hyperRAM=16MB at 0x92000000)
  • share/qtcreator/firmware/OPENMV_N6/neuralart.json : atonn compiler profile

Ready for your instructions on what to modify for the larger memory pool. My model needs ~24 MiB of activations, so the current 16 MB hyperRAM window is 9.7 MiB short.

Thanks!

What OS are you on? The IDE caches the local files it uses for model conversion under:

C:\ProgramData\OpenMV\openmvide\firmware\OPENMV_N6\stm32n6.mpool

on Windows, and under ~/.config/OpenMV/openmvide/firmware/OPENMV_N6/stm32n6.mpool on Linux/Mac.

You just need to edit the pool size there and then increase openmv/boards/OPENMV_N6/omv_boardconfig.h at master · openmv/openmv · GitHub to 28MB or so and rebuild the firmware.

Note that the MP heap is not stable at 32MB. Do not set it to that.

Hi @kwagyeman,

Progress update – the build system works and the model compiles into romfs successfully. But I’m hitting a hard fault during model load that crashes the camera.

What works:

  • Custom firmware builds via Docker (GCC 14.3.1, SDK 1.1.0)
  • mpool expanded to 28 MB hyperRAM in ~/.config/OpenMV/openmvide/firmware/OPENMV_N6/stm32n6.mpool
  • Model added via IDE ROMFS menu – compiles with “Maximum optimization”, success
  • Model shows in /rom/model_seg_n6.tflite (5.55 MiB)
  • ml.Model('/rom/model_seg_n6.tflite') loads successfully when run alone:
model_size: 5,550,936
model_addr: 0x7113F880
ram_size:   18,496,896 (17.6 MiB)
ram_addr:   0x3410AD80
input:  (1,512,512,3) uint8
output: (1,512,512,5) uint8

What crashes:

  • When I also initialize the CSI camera (csi.CSI(), VGA, 512x512 window), the camera hard-faults during model.predict([img]) – not a Python exception, a system reboot.
  • If main.py on SD card runs first and uses memory, model load fails with MemoryError (needs 18.3 MiB contiguous).

The issue seems to be that ram_addr: 0x3410AD80 (SRAM2) with 18.3 MiB of activations extends into DRAM space that overlaps with the framebuffer. My boardconfig changes:

OMV_FB_SIZE = 14M (was 20M)
OMV_FB_ALLOC_SIZE = 5M (was 11M)
OMV_GC_BLOCK1_SIZE = 22M (was 24M)

Total: 14+5+1+22 = 42M of 64M DRAM. Free heap shows 23.5 MiB on clean boot.

Questions:

  1. Is the NPU activation buffer (ram_addr 0x3410AD80, 18.3 MiB) allocated from the Python heap or from a separate DRAM region?
  2. Do the FB/FB_ALLOC and the NPU activations overlap in DRAM?
  3. What’s the correct way to partition DRAM between FB, GC heap, and NPU for a model this size?

Thanks!

Hi, as I mentioned, the model is relocatable. So, the address that’s used by the MPOOL doesn’t matter. It’s actually allocated in the OMV_GC_BLOCK1 area.

Note that you’ll need to have 512x512x5 * float32 bytes of output space without a post-processor. So, 5,242,880 bytes to handle the translated output that’s generated. Add a post-processor, and this goes away as you just get a raw ndarray of the uint8 values versus that being converted to floats for you.
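The arithmetic behind that figure: without a post-processor the (1, 512, 512, 5) uint8 output is dequantized to float32, quadrupling the buffer.

```python
# Output buffer size when the backend dequantizes uint8 -> float32.
out_elems = 1 * 512 * 512 * 5   # output tensor shape (1, 512, 512, 5)
out_bytes = out_elems * 4       # 4 bytes per float32 element
# out_bytes == 5,242,880, matching the figure above; with a post-processor
# attached, the raw uint8 output would be 4x smaller.
```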

You should have gotten an out-of-memory error, not a hard fault. It’s hard to say what is wrong.

Do you have a jtagger?

Hi @kwagyeman,

Quick update with test results. I found the root cause of the hard fault.

Finding: NPU models that overflow activations into hyperRAM crash on predict().

I tested on custom firmware (FB=20M, FB_ALLOC=2M, SB=1M, GC=28M):

Model                          Activations  npuRAM %          Result
yolov8n_192 (stock ROM)        ~200 KB      <15%              WORKS, 22 ms
yolov8n_256_seg (model zoo)    236 KB       47%               WORKS, 46 ms
yolov8n_320_seg (model zoo)    242 KB       79%               WORKS, 64 ms
Our DeepLabV3 512x512          23.7 MB      overflow to DRAM  HARD FAULT
ST deeplab_v3_512 (model zoo)  27.5 MB      overflow to DRAM  MemoryError / crash

Models that fit entirely in npuRAM (1.75 MiB) work perfectly. Any model whose activations overflow into hyperRAM (xSPI1 DRAM) causes a hard fault during LL_ATON_RT_RunEpochBlock() – not a Python exception, a system crash. USB CDC dies instantly so I can’t capture fault registers without JTAG.

The stedgeai compilation report for my model shows 21.2 MiB of activations placed in hyperRAM. The mpool USEMODE_RELATIVE relocation doesn’t seem to be the issue – I tested with matching and non-matching xSPI1 offsets, same crash.

Plan: retrain the segmentation model at 320x320 with a lighter architecture to keep activations under 1.5 MiB (within npuRAM). The yolov8n_320_seg benchmark at 64ms/15.6 FPS is a great target.

Is the hyperRAM activation overflow a known limitation of the current N6 firmware, or a bug? The DK board (32 MB PSRAM) presumably handles this – is there a firmware difference?

Thanks!

Note: if you pull the latest changes and run make sdk, it will install an SDK that includes all the dependencies needed for building any firmware (gcc, llvm, stedgeai, vela, etc.) on Linux (x86_64) and macOS (arm64). This is how the CI builds the firmware for all boards.

Hi, we have a few customers using larger models that work using hyperRAM. There’s no issue there.

I think I just need to do the work to get DeepLab working on the N6 OpenMV Cam, and then I’ll be able to answer your questions. It appears the model really stresses the N6 camera’s memory.

I’ve created this ticket to track this: Get Deep Lab working on N6. · Issue #3026 · openmv/openmv · GitHub

I have a large queue of issues to resolve. But I can tackle this by the end of next weekend if not sooner.


@patrickpoirier51 Hi, I tried getting this working with DeepLab and didn’t get good results.

deeplabv3_person_256.zip (2.4 MB)

So, I took the models from stm32ai-modelzoo/semantic_segmentation/deeplabv3 at main · STMicroelectronics/stm32ai-modelzoo · GitHub

Then, I converted them from ONNX to TFLite. The models run at acceptable speeds, but they don’t really seem to see anything. This might be the reason:

{ model_size: 1066696, model_addr: 0x713fee20, ram_size: 130880, ram_addr: 0x34109a80, input_shape: ((1, 256, 256, 3),), input_scale: (0.01865845,), input_zero_point: (-14,), input_dtype: ('b',), output_shape: ((1, 256, 256, 2),), output_scale: (2956686.0,), output_zero_point: (-7,), output_dtype: ('b',) }

The output scale is insane. So, probably the conversion didn’t go well.
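For context, int8 dequantization maps a code q to scale * (q - zero_point), so with the metadata above a single quantization step spans roughly 2.96 million in output space. A quick sanity check (the “plausible” scale below is purely illustrative, not taken from any real model):

```python
# Dequantization sanity check for the metadata quoted above.
def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)

# Step size between adjacent int8 codes with the reported output params:
broken_step = dequantize(1, 2956686.0, -7) - dequantize(0, 2956686.0, -7)

# For comparison, a softmax-like output quantized over [0, 1] would have a
# step on the order of 1/255 (illustrative value):
plausible_step = dequantize(1, 1 / 255, 0) - dequantize(0, 1 / 255, 0)
```

Any class-probability output whose adjacent quantization levels are millions apart cannot carry useful information, which matches the “doesn’t see anything” symptom.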

I ran into this same issue with ST’s depth_estimation models. The output seems to be quite useless. These didn’t have to be converted, though.

I then had Claude look for other models, like DeepLab 512x512 from Google’s TPU repo. The models there are too big. After trying all sorts of things, it became clear we could not get the RAM on these models down from 5MB or so at peak from the external memory. Running these models onboard, though, results in an immediate crash of the N6. Claude spent a lot of time messing with the model to get it to pass the N6 model compiler.

deeplabv3_mnv2_dm05_pascal_n6_int8.zip (721.3 KB)

{ model_size: 881120, model_addr: 0x71774ac0, ram_size: 9032960, ram_addr: 0x34109a80, input_shape: ((1, 513, 513, 3),), input_scale: (0.00784314,), input_zero_point: (-1,), input_dtype: ('b',), output_shape: ((1, 65, 65, 21),), output_scale: (0.10376116,), output_zero_point: (-101,), output_dtype: ('b',) }

While the output of this model looks more sane, I get a crash when running it. This happens inside the ST library.

I think it makes more sense to focus on yolov8n segmentation. We have a path for this to be trainable in the cloud through Roboflow. I just tested one of these models and it runs fine. I need to make a post-processor to have nice output, but there are no weird conversion issues. We can get Roboflow to support these fairly quickly.

Hi @kwagyeman,
Thanks for testing this yourself — good to have confirmation that the hyperRAM overflow crash is on the ST library side.

Happy to pivot to YOLOv8n segmentation. A few questions:

  1. Resolution: Is 320x320 the sweet spot for seg on N6, or can we push to 352?
  2. Post-processor: You mentioned needing to build one for seg output. Is that something that will land in firmware, or should I handle it in MicroPython?
  3. In the meantime: Is there a working YOLOv8n-seg .tflite I can test to validate the inference path end-to-end while waiting for Roboflow support?

Thanks for your help, N6 is a nice product with great potential, and your support makes it a great combination.

  1. Yes, probably; it’s really the number of classes supported at once that blows up the RAM size.
  2. It will land in firmware once I have time to write it. You are not blocked at all from writing it in Python though. All library features necessary to handle this have been done. So, it’s a pure Python thing you have to write.
  3. qt-creator/share/qtcreator/models/stmicroelectronics/instance_segmentation/yolov8n_seg at 0c8ccd58ead3e4fd78ab5861a344f02c2b1d2bc1 · openmv/qt-creator · GitHub
    1. I trivially tried these and dumped one of the second tensor output’s prototype masks to an image, and it was clearly segmenting me out of the image. It only took me 15 minutes to get this working, in comparison to DeepLab, which is just broken…

Regarding RoboFlow support. I’ll ping them to get started on this now.

Hi kwagyeman,

Quick update — I got the YOLOv8n-seg 320 model working on N6! Thanks for pointing me in the right direction, the seg models from the qt-creator repo work great.

What I (and Claude…) found:

The seg model loads and runs fine (64ms inference, 79% npuRAM), but I initially got zero detections from the postprocessor. Turns out the STAI backend auto-dequantizes outputs to float32, but model.output_dtype still reports ‘b’ (int8) and output_scale/output_zero_point still reflect the int8 quantization parameters.

The ml.utils.quantize() and threshold() functions see dtype=‘b’, compute an int8-scaled threshold (~38.7), and compare it against actual float values (~0.3) — nothing ever passes. The stock YoloV8 detection model isn’t affected because its metadata already says dtype=‘f’ with scale=1.0.

Fix: Threshold directly on float values instead of using quantize()/threshold():
max_scores = np.max(class_scores, axis=0)
score_indices = np.nonzero(max_scores > conf_threshold)[0]

Results:

  • Detection boxes only (no masks): 9.5-11 FPS
  • Detection + mask rendering (Python): ~1 FPS — the np.dot([6400,32],[32]) per detection + draw_rectangle per mask pixel is too heavy in MicroPython
  • Detection + JPEG streaming to PC: 5 FPS
  • 10/10 detection on COCO test images displayed on monitor (N6 camera pointed at screen)
  • Person detected at 81-87% confidence

The prototype masks (output 1) are alive and working — as you showed, they clearly segment. The bottleneck is purely the Python mask rendering loop. Looking forward to the C firmware postprocessor when you get time for it!

I wrote an optimized Python YoloV8Seg postprocessor class matching the ml.utils / ultralytics.py pattern — happy to share if useful.

Is the dtype mismatch (output_dtype=‘b’ but actual float output) intentional or a bug? Should I report it as an issue?

Thanks again for all the help!

Hi, great to hear!

Please post your work for others to use until we add support for this.

Also, RoboFlow just started looking into supporting this.

Finally, when no post-processor is attached, we will output the model data as float32. We do this to make running image classification models easy. Our API just dequantizes for you. This is described in the documentation.

I suppose I can update the model code to report float output when no post-processor is present.

Anyway, if you want any performance, attach a post-processor and then handle things in that. You’ll have access to the RAW INT8 values.

Detection + mask rendering (Python): ~1 FPS — the np.dot([6400,32],[32]) per detection + draw_rectangle per mask pixel is too heavy in MicroPython

The Image() class directly accepts ndArrays, just pass directly to it and it will turn a floating point array with values between 0-255 into a grayscale image.

YOLOv8n-seg Instance Segmentation on N6

Model: yolov8n_320_quant_pc_uf_seg_coco-st.tflite from qt-creator/share/qtcreator/models/stmicroelectronics/instance_segmentation/yolov8n_seg, added via IDE > Tools > ROMFS.

How it works

The postprocessor handles two model outputs:

  • Output 0: [1, 116, 2100] - box coords, 80 class scores, 32 mask coefficients per anchor
  • Output 1: [1, 80, 80, 32] - prototype mask features

Steps: reshape output 0 to [116, N], threshold class scores on float values directly, decode boxes, run NMS, then dot-product mask coefficients against prototype features for per-detection masks.
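The reshape-and-threshold steps can be sketched as follows (a hedged sketch: `threshold_detections` is a hypothetical helper, and the (1, 116, 2100) layout with rows 4-83 holding class scores is assumed from the output description above; on-device you would use ulab’s numpy):

```python
# Sketch of the first two steps: reshape output 0 and threshold class
# scores directly on the dequantized float values (not via ml.utils.quantize()).
import numpy as np  # on-device: from ulab import numpy as np

def threshold_detections(out0, conf_threshold=0.3):
    pred = out0.reshape((116, -1))   # (116, N): 4 box + 80 class + 32 mask rows
    class_scores = pred[4:84, :]     # 80 class score rows
    max_scores = np.max(class_scores, axis=0)
    keep = np.nonzero(max_scores > conf_threshold)[0]
    return keep, max_scores
```

The surviving anchor indices in `keep` then feed the box decode, NMS, and mask dot-product stages.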

Important: When no firmware post-processor is attached, outputs are auto-dequantized to float32 even though model.output_dtype reports 'b' (int8). Do not use ml.utils.quantize()/threshold() - they compute int8-scaled thresholds against float values and nothing passes. Threshold directly:

max_scores = np.max(class_scores, axis=0)
score_indices = np.nonzero(max_scores > 0.3)[0]

Test method

PC displays COCO sample images fullscreen. N6 camera pointed at monitor runs the seg script, streams detections + annotated JPEG frames back over serial.

Results

Mode                                        FPS
NPU inference only                          15.6
Boxes + labels (Python postproc)            9.5-11
Boxes + mask rendering (Python pixel loop)  ~1

10/10 COCO images detected correctly, 81-87% confidence.

Mask rendering bottleneck is the Python pixel loop.

yolov8_seg_n6.py (7.0 KB)

Will test the Image() ndarray approach next.

Full script attached. Enjoy :wink:

Just use Image() to convert to an image and then img.draw_image() using that image object on the frame buffer.

@kwagyeman Tested your suggestion and it works perfectly. The performance jump is huge.

Results

Mode                                    FPS
NPU inference only                      15.6
Boxes + labels (Python postproc)        9-11
Boxes + masks via Image()+draw_image()  12
Boxes + masks (old Python pixel loop)   ~1

Method used, for each detection:

  1. np.dot(proto, coeffs) to get the 80x80 raw mask
  2. Crop to bounding box region in prototype coordinates
  3. np.clip(crop * 1000.0, 0.0, 200.0) to threshold (no Python loop needed)
  4. image.Image(mask_vis) to create grayscale Image from the ndarray
  5. img.draw_image(mask_img, x, y, x_scale=4.0, y_scale=4.0, alpha=128)

Steps 4 and 5 are both C-level firmware functions you already built. The old approach was a nested Python for-loop calling draw_rectangle() per prototype cell; that was the bottleneck.
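For reference, steps 1-3 can be sketched like this (shapes and the helper name are assumptions based on the description above: proto is the (6400, 32) flattened prototype tensor; steps 4-5 use OpenMV firmware APIs, so they appear only as comments):

```python
# Sketch of the per-detection mask math (steps 1-3).
import numpy as np  # on-device: from ulab import numpy as np

def mask_for_detection(proto, coeffs, box):
    """proto: (6400, 32) prototype features; coeffs: (32,) mask coefficients;
    box: (x0, y0, x1, y1) in 80x80 prototype coordinates."""
    raw = np.dot(proto, coeffs).reshape((80, 80))  # step 1: 80x80 raw mask
    x0, y0, x1, y1 = box
    crop = raw[y0:y1, x0:x1]                       # step 2: crop to the bbox
    return np.clip(crop * 1000.0, 0.0, 200.0)      # step 3: vectorized threshold

# Steps 4-5 on the device (OpenMV firmware APIs, shown as comments):
#   mask_img = image.Image(mask_vis)  # grayscale Image from the ndarray
#   img.draw_image(mask_img, x, y, x_scale=4.0, y_scale=4.0, alpha=128)
```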

The float output finding from post #14 still applies. I threshold directly on float values, never use ml.utils.quantize()/threshold() for the seg model.

Masks are now viable at full frame rate. Thanks for the tip!

Full script available if useful for the firmware postprocessor reference.

Can you post the full script in the forums? Instead of attaching, just use the code paste option in the editor.