Heap RAM limitations and tf module

Hi OpenMV team,

I’m currently working with your OpenMV RT1062 MCU and I’m successfully loading YOLO models (such as YOLOv8n, YOLOv5nu, and YOLOv8n-cls) with your new firmware. I can successfully run inference on the image classification model YOLOv8n-cls, which has an output vector of shape [1,1000]. However, when I run the object detection models I run out of heap memory. Because of this, I retrained YOLOv8n to output fewer classes and minimized the output array shape. With two classes the output shape is [1,6,8400]; with all 80 classes it is [1,84,8400], both of type int8.

I’m running into MemoryErrors when calling net.detect on the object detection models and have the following questions:

  1. Why am I running out of heap memory when I call net.detect on the smaller NN model, whose output shape [1,6,8400] = 50400 bytes of int8 is well below gc.mem_free() = 265184 bytes at run time?
  2. Is it possible to use the framebuffer instead of the heap to store inference results, similar to this?
     fb_mem = sensor.alloc_extra_fb(84, 8400, sensor.GRAYSCALE)
     fb_mem_ba = fb_mem.bytearray()
     fb_mem_ba = net.detect(img, thresholds=[(math.ceil(min_confidence * 255), 255)])
  3. I guess the problem with my approach in point 2 is due to how the net.detect source code allocates the output array. Where can I find the source code of the TensorFlow library used in the OpenMV firmware, so that I can modify net.classify and net.detect to use the framebuffer instead of the heap and allow for larger NN model output shapes? I’ve been looking here, but I can’t seem to find the source code. I would greatly appreciate it if you could point me to where I can find it.

Kindly find a simple script attached for reference if needed.
main_dev_object_detection.py (4.3 KB)

Thanks in advance!

Cheers,
Koray

Hi, I’m confused by your post.

net.detect() has been eliminated in our current refactoring of the firmware to support running any TensorFlow model.

We have not yet released support for the new API and features; it’s still in dev mode. What firmware are you using, 4.5.5 or the latest dev release?

Running with the latest dev firmware, your script will look something like this:

import time
import ml
import image
from ml.utils import NMS

model = 'yolo/lpd-yolov5-int8-quantized.tflite'

# Alternatively, models can be loaded from the filesystem storage.
net = ml.Model(model, load_to_fb=True)
labels = ["plate"]
print(net)

colors = [  # Add more colors if you are detecting more than 7 types of classes at once.
    (255, 0, 0),
    (0, 255, 0),
    (255, 255, 0),
    (0, 0, 255),
    (255, 0, 255),
    (0, 255, 255),
    (255, 255, 255),
]

@micropython.native
def yolov5(model, input, output):
    out = output[0]
    ib, ih, iw, ic = model.input_shape[0]
    ob, ow, oc = model.output_shape[0]
    if ob != 1:
        raise ValueError("Expected model output batch to be 1!")
    if oc < 6:
        raise ValueError("Expected model output channels to be >= 6")
    # cx, cy, cw, ch, score, classes[n]
    ol = ob * ow * oc
    nms = NMS(iw, ih, input[0].roi)
    m = time.ticks_ms()
    # You must filter out the low scores with as little code as possible.
    # There are 6300 output heads in this tensorflow model. If you check all
    # of them and do useless work for invalid ones, this will take seconds.
    for i in range(0, ol, oc):
        score = out[i + 4]
        if (score > 0.5):
            # Only do work on a valid box.
            cx = out[i + 0]
            cy = out[i + 1]
            cw = out[i + 2] * 0.5
            ch = out[i + 3] * 0.5
            xmin = (cx - cw) * iw
            ymin = (cy - ch) * ih
            xmax = (cx + cw) * iw
            ymax = (cy + ch) * ih
            class_scores = out[i + 5: i + oc]
            label_index = max(enumerate(class_scores), key=lambda x: x[1])[0]
            nms.add_bounding_box(xmin, ymin, xmax, ymax, score, label_index)
    boxes = nms.get_bounding_boxes()
    print("time", time.ticks_diff(time.ticks_ms(), m))
    return boxes

clock = time.clock()
while True:
    clock.tick()

    img = image.Image("yolo/plate.jpg", copy_to_fb=True)
    img.to_rgb565()

    norm = ml.Normalization(scale=(0, 255))

    for i, detection_list in enumerate(
        net.predict([norm(img)], callback=yolov5)
    ):
        print("********** %s **********" % labels[i])
        for rect, score in detection_list:
            print(rect, "score", score)
            img.draw_rectangle(rect, color=colors[i], thickness=2)

    print(clock.fps(), "fps", end="\n")

Post-processing has been moved to Python. However, we are still finalizing a few things before we release 4.5.6. You also need to update to the latest dev IDE, given some updates in MicroPython.

Documentation and guides on how to use the new features will be released with the new firmware, probably within 2 weeks or so.

Hello @kwagyeman,

Thanks for your reply.

To clarify, I’m using your latest stable release (4.5.5). Initially, I used an older release that had been on the board, and I couldn’t even load YOLO models then. I suspect that you added previously unsupported operators in version 4.5.5.

I’m happy to see the new iteration of your TF implementation; it looks versatile and I can’t wait to try it out! I’ll probably tinker with it on the dev release.

I’d still appreciate input on my concerns, since the answers would transfer to other code I write for your board.

Is there a good way to store variables in the framebuffer by accessing the bytearray() of an Image variable (point 2)? Where can I find the source code of OpenMV’s TensorFlow Lite Micro implementation (point 3)? It would be very useful to see for debugging purposes.

4.5.5 is very outdated compared to the current dev release; everything has changed. I’d suggest you give it a try and let us know how it goes. We don’t have updated docs yet, but the examples in the repo are up to date, so maybe you can figure out the new API from those.
As for memory, the RT1060 now has an 8MByte GC block, in addition to smaller, faster GC blocks. You don’t need the FB alloc hacks anymore, and I don’t think you’ll easily run out of memory. If you do, we can add more GC blocks; there’s lots of unused SDRAM.

#define OMV_GC_BLOCK0_MEMORY            OCRM2   // Main GC block
#define OMV_GC_BLOCK0_SIZE              (26K)
#define OMV_GC_BLOCK1_MEMORY            DTCM    // Extra GC block 0.
#define OMV_GC_BLOCK1_SIZE              (293K)
#define OMV_GC_BLOCK2_MEMORY            DRAM    // Extra GC block 1.
#define OMV_GC_BLOCK2_SIZE              (8M)

All of the boards now have similar configurations. The ones that don’t have SDRAM add extra SRAM GC blocks, but those can’t run bigger models anyway.
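
If you want to sanity-check the heap from a script, a minimal sketch using the standard gc and micropython modules shows what’s available at runtime:

import gc
import micropython

gc.collect()              # Run a collection before measuring.
print(gc.mem_free())      # Free bytes remaining on the GC heap.
micropython.mem_info()    # Prints a more detailed heap report.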

The repo you linked does not contain any code; it only builds the static tflite-micro libraries on a runner and posts them to the repo. It’s added as a submodule to the main repo. The actual TensorFlow code is split between the MicroPython module in modules/py_ml.c, the tflite-micro backend in lib/tflm/tflm_backend.cc, and the frozen extension/utils package in scripts/libraries/ml.

If you get the model working, could you share it here along with the script? I would like to test it too, and if it’s generic enough we can probably add it to the built-in models for some boards.


Hi @iabdalkader,

Sorry for my late response.

Thanks for sharing where to find the source code.

Regarding the updated GC blocks: is that part of fw v4.5.6? In fw v4.5.5 I’m running into RAM limits when doing object detection with YOLOv8n, YOLOv5, and YOLOv3.

I’m happy to share the code; kindly find it attached. Bear in mind that this is for fw v4.5.5. I’ll develop towards v4.5.6 in the future, especially with the goal of doing object detection using e.g. YOLOv8n.

The shared code contains a YOLOv8n-cls model retrained on the HaGRID dataset to classify the following classes: call, dislike, fist, four, like.

Cheers,
Koray

yolov8n-cls_hand_gesture_OpenMV_v4_5_5.zip (990.6 KB)

Yes, this is still in the development firmware (v4.5.6), which will be released very soon: v4.5.6 Milestone · GitHub


@kurreman

If you can share your TensorFlow model, how to post-process the data, a test image, and what you expect as output for the test image, I can verify this week that v4.5.6 will be perfectly able to run your network.

See here for what I mean: modules/py_tf: Add generic CNN processing support. by kwagyeman · Pull Request #2227 · openmv/openmv (github.com)

Hello @kwagyeman,

Again, sorry for my late reply. I’m AFK a lot this month.

This specific model is simply used for experimental purposes.

What would be even more interesting to know is at what fps the OpenMV RT1062 can run YOLOv10n with the new 4.5.6 firmware. The default input image size of 640x640 is probably way too big, so one should probably retrain it for a smaller input size.

The post-processing should be similar to the YOLOv5 example you linked in your previous message.

I’ll try out the new firmware since it’s out now, exciting!


Given how big the heap is now on the RT1062, it should actually work. Give it a shot. But 240x240 already pushes the limits of the system, so while it may fit in RAM, the execution time might be very long. YOLOv5 (with no optimizations, all floats) already takes 10 seconds.

Oh, regarding ulab and NumPy, it’s like 99% of the way there. However, some important operations don’t work yet. I’m working with Zoltan to get these fixed. You may have to fall back to non-vectorized code for select/indexing operations right now.
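
For example (a hypothetical case, assuming boolean-mask indexing is among the ops that don’t work yet), a vectorized filter over the output rows would need a plain loop instead:

from ulab import numpy as np

# Hypothetical (N, 6) output laid out as cx, cy, w, h, score, class;
# a tiny stand-in array is used here so the sketch runs on its own.
out = np.array([[0.5, 0.5, 0.1, 0.1, 0.9, 0.0],
                [0.2, 0.2, 0.1, 0.1, 0.1, 1.0]])

# Vectorized boolean-mask select, which may not work yet:
# boxes = out[out[:, 4] > 0.5]

# Non-vectorized fallback: plain Python loop over the rows.
boxes = [out[i] for i in range(out.shape[0]) if out[i, 4] > 0.5]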

Finally, you should print the network to see its output shape; this will help you figure out how to post-process it.
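
For example, something like this minimal sketch (the model filename is hypothetical):

import ml

net = ml.Model("yolov8n_int8.tflite", load_to_fb=True)  # hypothetical filename
print(net)                # Prints a summary of the network.
print(net.input_shape)    # The (batch, height, width, channels) the model expects.
print(net.output_shape)   # Tells you what you'll need to post-process.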

Also, I heavily recommend doing int8 quantization of the model first and verifying it before going to the MCU. It will save you a lot of time.
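
For reference, a generic TensorFlow post-training full-integer quantization sketch looks something like this (the paths and the calibration data are placeholders; verify the resulting .tflite on the desktop before deploying):

import numpy as np
import tensorflow as tf

def representative_dataset():
    # Calibration samples; in practice, yield real preprocessed images.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_model = converter.convert()

with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)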

Awesome, thanks for the replies!

I look forward to accessing the raw outputs of the NN models with the new firmware!

Yes, quantization is truly important. I’ve found that the choice of calibration dataset matters, especially when using a smaller retraining dataset. You mention int8 specifically; when doing full-integer quantization using the Ultralytics export command (Export - Ultralytics YOLO Docs), the result is a mix of int8 and int32 values. Do you mean that this mix can cause issues and that I should do a manual quantization to make sure all values are int8 only? I have done this before, but I liked the one-liner from Ultralytics. It will be interesting to hear your take on this.
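
For reference, the Ultralytics one-liner I mean is roughly this (arguments per their export docs; the dataset YAML is what gets used for calibration):

from ultralytics import YOLO

model = YOLO("yolov8n.pt")
# int8=True requests full-integer quantization; 'data' points at the
# dataset YAML whose images are used as the calibration set.
model.export(format="tflite", int8=True, data="coco8.yaml")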

Cheers

CMSIS-NN only accelerates int8 ops. All values should work as long as TensorFlow Lite supports them. However, if you have non-int8 values for conv ops you will lose a lot of speed. Non-conv ops don’t really matter if they aren’t int8.
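
If you want to check what the exporter actually produced, a quick desktop-side sketch like this lists each tensor’s dtype (the model path is hypothetical). Note that bias tensors are int32 by design in full-integer quantization, so seeing int32 there is expected:

import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")  # hypothetical path
interpreter.allocate_tensors()

# Conv weights and activations should be int8 for CMSIS-NN acceleration;
# bias tensors will normally show up as int32.
for detail in interpreter.get_tensor_details():
    print(detail["name"], detail["dtype"])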