Heap RAM limitations and tf module

Hi OpenMV team,

I’m currently working with your OpenMV RT1062 MCU and successfully loading YOLO models (such as YOLOv8n, YOLOv5nu, and YOLOv8n-cls) with your new firmware. I can run inference on the image classification model YOLOv8n-cls, which has an output vector of shape [1,1000]. However, when I run the object detection models I run out of heap memory. Because of this I retrained YOLOv8n to output fewer classes and minimized the output array shape. With two classes the output shape is [1,6,8400]; with all 80 classes it is [1,84,8400] - both of type int8.
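For reference, those shapes follow from how the YOLOv8 detection head is laid out: 4 box coordinates plus one score per class, for each of the 8400 anchor points (a sketch with a hypothetical helper name, not library code):

```python
def yolov8_det_output_shape(num_classes, num_anchors=8400):
    # YOLOv8 detection head: 4 box coordinates + one score per class,
    # repeated over every anchor point.
    return (1, 4 + num_classes, num_anchors)

print(yolov8_det_output_shape(2))   # (1, 6, 8400)
print(yolov8_det_output_shape(80))  # (1, 84, 8400)
```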

I’m running into MemoryErrors when calling net.detect on the object detection models and have the following questions:

  1. Why am I running out of heap memory when I call net.detect on the smaller NN model, whose output shape [1,6,8400] is only 50400 bytes, while gc.mem_free() reports 265184 bytes free at runtime?
  2. Is it possible to use the framebuffer instead of the heap to store inference results, similar to this?
     fb_mem = sensor.alloc_extra_fb(84, 8400, sensor.GRAYSCALE)
     fb_mem_ba = fb_mem.bytearray()
     fb_mem_ba = net.detect(img, thresholds=[(math.ceil(min_confidence * 255), 255)])
  3. I guess the problem with my approach in point 2 comes down to how the net.detect source code allocates the output array. Where can I find the source code of the TensorFlow library used in the OpenMV firmware, so that I can modify net.classify and net.detect to use the framebuffer instead of the heap and allow for larger NN output shapes? I’ve been looking here, but I can’t seem to find the source code. I would greatly appreciate it if you could point me to where I can find it.
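(For what it’s worth, I realize the last line in point 2 only rebinds the name rather than filling the framebuffer’s buffer; in plain Python, writing into an existing buffer needs in-place slice assignment, as this generic sketch shows:)

```python
buf = bytearray(4)
alias = buf
alias = b"\x01\x02\x03\x04"   # rebinds the name; buf is untouched
print(buf)                    # bytearray(b'\x00\x00\x00\x00')
buf[:] = b"\x01\x02\x03\x04"  # writes into the existing buffer in place
print(buf)                    # bytearray(b'\x01\x02\x03\x04')
```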

Kindly find a simple script attached for reference if needed.
main_dev_object_detection.py (4.3 KB)

Thanks in advance!

Cheers,
Koray

Hi, I’m confused by your post.

net.detect() has been eliminated in our current refactoring of the firmware to support running any tensorflow model.

We have not yet released support for the new API and features; it’s still in development. What firmware are you using, 4.5.5 or the latest dev release?

Running with the latest dev firmware your script will look something like this:

import time
import ml
import image
from ml.utils import NMS

model = 'yolo/lpd-yolov5-int8-quantized.tflite'

# Alternatively, models can be loaded from the filesystem storage.
net = ml.Model(model, load_to_fb=True)
labels = ["plate"]
print(net)

colors = [  # Add more colors if you are detecting more than 7 types of classes at once.
    (255, 0, 0),
    (0, 255, 0),
    (255, 255, 0),
    (0, 0, 255),
    (255, 0, 255),
    (0, 255, 255),
    (255, 255, 255),
]

@micropython.native
def yolov5(model, input, output):
    out = output[0]
    ib, ih, iw, ic = model.input_shape[0]
    ob, ow, oc = model.output_shape[0]
    if ob != 1:
        raise ValueError("Expected model output batch to be 1!")
    if oc < 6:
        raise ValueError("Expected model output channels to be >= 6")
    # cx, cy, cw, ch, score, classes[n]
    ol = ob * ow * oc
    nms = NMS(iw, ih, input[0].roi)
    m = time.ticks_ms()
    # You must filter out the below-threshold scores with the least amount of code
    # possible. There are 6300 output heads with this tensorflow model. If you check
    # all of them and do useless work for invalid ones, this will take seconds.
    for i in range(0, ol, oc):
        score = out[i + 4]
        if (score > 0.5):
            # Only do work on a valid box.
            cx = out[i + 0]
            cy = out[i + 1]
            cw = out[i + 2] * 0.5
            ch = out[i + 3] * 0.5
            xmin = (cx - cw) * iw
            ymin = (cy - ch) * ih
            xmax = (cx + cw) * iw
            ymax = (cy + ch) * ih
            labels = out[i + 5: i + oc]
            label_index = max(enumerate(labels), key=lambda x: x[1])[0]
            nms.add_bounding_box(xmin, ymin, xmax, ymax, score, label_index)
    boxes = nms.get_bounding_boxes()
    print("time", time.ticks_diff(time.ticks_ms(), m))
    return boxes

clock = time.clock()
while True:
    clock.tick()

    img = image.Image("yolo/plate.jpg", copy_to_fb=True)
    img.to_rgb565()

    norm = ml.Normalization(scale=(0, 255))

    for i, detection_list in enumerate(
        net.predict([norm(img)], callback=yolov5)
    ):
        print("********** %s **********" % labels[i])
        for rect, score in detection_list:
            print(rect, "score", score)
            img.draw_rectangle(rect, color=colors[i], thickness=2)

    print(clock.fps(), "fps", end="\n")
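ml.utils.NMS handles the overlap suppression in the callback above. As a rough illustration of the idea (a plain-Python sketch of greedy non-maximum suppression, not the library’s actual implementation):

```python
def iou(a, b):
    # Boxes are (xmin, ymin, xmax, ymax); intersection-over-union.
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, threshold=0.45):
    # Greedy NMS: keep the highest-scoring box, drop boxes overlapping it.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the two overlapping boxes collapse to one
```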

Post-processing has been moved to Python. However, we are still finalizing a few things before we release 4.5.6. You also need to update to the latest dev IDE, given some updates in MicroPython.

Documentation and how to use the new features will be released with the new firmware probably within 2 weeks or so.

Hello @kwagyeman,

Thanks for your reply.

To clarify, I’m using your latest stable release (4.5.5). Initially, I used an older release that had been on the board and I couldn’t even load YOLO models then. I suspect that you guys added previously unsupported operators in version 4.5.5.

I’m happy to see the new iteration of your TF implementation - it looks versatile and I can’t wait to try it out! I’ll probably tinker with it on the dev release.

I’d still appreciate input on my concerns, since the knowledge can transfer to other code implementations with your board.

Is there a good way to store variables in the framebuffer by accessing the bytearray() of an Image variable (point 2)? Where can I find the source code of OpenMV’s TensorFlow Lite Micro implementation (point 3)? It would be very useful to see it for debugging purposes.

4.5.5 is very outdated compared to the current dev release, everything has changed. I’d suggest you give it a try and let us know how it goes. We don’t have updated docs yet, but the examples in the repo are up to date, maybe you can figure out the new API from that.
As for memory, the RT1060 now has an 8 MB GC block, in addition to smaller, faster GC blocks. You don’t need the FB alloc hacks anymore, and I don’t think you’ll easily run out of memory. If you do, we can add more GC blocks; there’s lots of unused SDRAM.

#define OMV_GC_BLOCK0_MEMORY            OCRM2   // Main GC block
#define OMV_GC_BLOCK0_SIZE              (26K)
#define OMV_GC_BLOCK1_MEMORY            DTCM    // Extra GC block 0.
#define OMV_GC_BLOCK1_SIZE              (293K)
#define OMV_GC_BLOCK2_MEMORY            DRAM    // Extra GC block 1.
#define OMV_GC_BLOCK2_SIZE              (8M)

All of the boards now have similar configurations. The ones that don’t have SDRAM, add extra SRAM GC blocks but those can’t run bigger models anyway.
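In practice this is transparent to Python code: large buffers simply land in the big SDRAM block, so an allocation that would previously have raised MemoryError should just succeed (a sketch; the exact free headroom depends on the board and model loaded):

```python
# With the 8 MB SDRAM GC block, multi-megabyte buffers can live on
# the MicroPython heap directly - no fb_alloc tricks needed.
buf = bytearray(4 * 1024 * 1024)  # 4 MB scratch buffer
print(len(buf))  # 4194304
```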

That repo you linked does not contain any code; it only builds the static tflite-micro libraries on the runner and posts them to the repo. It’s added as a submodule to the main repo. The actual TensorFlow code is split between the MicroPython module in modules/py_ml.c, the tflite-micro backend in lib/tflm/tflm_backend.cc, and the frozen extension/utils package in scripts/libraries/ml.

If you get the model working, could you share it here along with the script? I would like to test it too, and if it’s generic enough we can probably add it to the built-in models for some boards.


Hi @iabdalkader,

Sorry for my late response.

Thanks for sharing where to find the source code.

Regarding the updated GC blocks - is that part of fw v4.5.6? In fw v4.5.5 I’m running into RAM limits when doing object detection with YOLOv8n, YOLOv5, and YOLOv3.

I’m happy to share the code; kindly find it attached. Bear in mind that this is for fw v4.5.5. I’ll develop towards v4.5.6 in the future, especially with the goal of doing object detection using e.g. YOLOv8n.

The shared code contains a YOLOv8n-cls model retrained on the HaGRID dataset to classify the following classes: call, dislike, fist, four, like.
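Reading the result is just an argmax over the five class scores (illustrative sketch; the score values here are made up, not real model output):

```python
labels = ["call", "dislike", "fist", "four", "like"]
scores = [0.02, 0.05, 0.81, 0.07, 0.05]  # example dequantized output
best = max(range(len(scores)), key=lambda i: scores[i])
print(labels[best], scores[best])  # fist 0.81
```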

Cheers,
Koray

yolov8n-cls_hand_gesture_OpenMV_v4_5_5.zip (990.6 KB)

Yes, this is still in the development firmware (v4.5.6), which will be released very soon: v4.5.6 Milestone · GitHub


@kurreman

If you can share your tensorflow model, how to post-process the data, a test image, and what you expect as output for the test image, I can verify this week that v4.5.6 will be able to run your network perfectly.

See here for what I mean: modules/py_tf: Add generic CNN processing support. by kwagyeman · Pull Request #2227 · openmv/openmv (github.com)