OpenMV Firmware v4.5.6 and up TensorFlow Porting Guide

Hi all,

In our latest firmware, 4.5.6, we’ve completely updated how our TensorFlow API works. This is a breaking change, but the added functionality is worth it. Here are the details:

The “tf” module is now gone (although there may be an alias to it for a while). You should start using the new “ml” module instead. This is where we will keep all machine-learning code in the firmware. We renamed the module because, as we support more MCUs with CNN accelerators (NPUs), TensorFlow will not always be the execution engine under the hood. So, we opted for a generic name going forward. It also lets us put other algorithms in this module. That said, the module right now just supports the TensorFlow Lite for Microcontrollers framework.

Moving on, v4.5.6 has a number of substantial improvements that allow you to run all kinds of models.

  1. We enabled pretty much every operator in TensorFlow on every OpenMV Cam model. Only the old OpenMV Cam M4 doesn’t have onboard TensorFlow support.

  2. We updated to the latest version of TensorFlow. We will now track it directly versus using an old branch from Edge Impulse.

  3. Model state is now stored in the heap versus on the frame buffer stack. This means that models now hold their state across inference calls. This is huge, as you can now work with stateful models (models with memory) onboard.

  4. We massively increased the heap on all OpenMV Cam boards. On cameras with SDRAM support, the heap is now a couple of megabytes in size, up from 256KB. On ones without SDRAM we managed to find a few hundred more KB to add. We were able to do this thanks to a new MicroPython feature that lets you allocate heap blocks at different addresses (and thus in different SRAM/SDRAM locations). While this means there are now multiple heap areas on each board, MicroPython manages them all automatically as if they were one. Anyway, we had to increase the heap size so that the tensor arena, which holds the model state, could remain allocated across inference calls. The heap is now 8MB+ on the RT1062, 4MB+ on the H7 Plus, 8MB+ on the Pure Thermal, and 2.5MB+ on the Arduino Giga/Portenta! We may even increase the heap more on some boards too. (A quick way to check how much heap your board has free is shown after this list.)

  5. We enabled 4-dimensional ndarrays (up from 2, which was not that useful) using the ulab numpy module on every OpenMV Cam. This brings the power of numpy and vector processing to every OpenMV Cam along with ML support. As I’ll explain below, our embrace of numpy for data processing is key. At first, I didn’t think going all in on numpy support made sense, since enabling fast 4D ndarray support costs a lot of flash space. But it’s the right decision, and it makes the impossible possible for MCUs programmed in Python. Get excited.
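To check roughly how much heap your particular board has free (point 4 above), here’s a minimal sketch using the standard MicroPython gc module; the exact number you see will depend on your board and firmware:

import gc

gc.collect()                                     # reclaim any garbage first
print(gc.mem_free() / (1024 * 1024), "MB free")  # free heap in megabytes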

Cool, so let’s talk about the new Model() object. We understand that a massive breaking change to the TensorFlow API is a pain. So, when doing this refactoring work, we tried to future-proof it as much as possible so that folks don’t have to change all their code again. The new Model() object supports multi-input and multi-output networks where each input and output tensor can have up to 4 dimensions.

What does this mean? Well, let’s say you want to train a model that accepts images, voice samples, and accelerometer samples at the same time. You can feed these into the new ml module as three separate input tensors. And, if your model produces multiple outputs (YOLO-style bounding boxes, scene descriptions, etc.), you can handle each of those outputs separately too.
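For example, here’s a hedged sketch of what calling a multi-input, multi-output network looks like with the new API. The model file name, input shapes, and number of outputs below are all hypothetical; predict() takes one list entry per input tensor and returns one list entry per output tensor:

import sensor, ml
from ulab import numpy as np

net = ml.Model("multimodal.tflite")               # hypothetical three-input model

img = sensor.snapshot()                           # image input (sensor configured as in the scripts below)
audio = np.zeros((1, 16000, 1), dtype=np.float)   # placeholder voice samples
accel = np.zeros((1, 128, 3), dtype=np.float)     # placeholder accelerometer samples

outputs = net.predict([img, audio, accel])        # one list entry per input tensor
boxes = outputs[0]                                # one ndarray per output tensor
descriptions = outputs[1]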

The key to making this work is that each input and output tensor is a numpy ndarray of up to 4 dimensions. This lets us move the model post-processing code to Python, but with vector processing acceleration under the hood, so that you can process the massive output vectors produced by machine learning models. For example, let’s say you have an array of 6300 output tuples where each tuple is (xmin, ymin, xmax, ymax, score) and you want to select all rows of the output where the score is greater than 0.5. In plain Python, you have to write a for loop to do this, and it can literally take seconds if the code is not written carefully - not kidding :frowning: However, using numpy, you can threshold each score value and get an array of the valid indices by doing np.nonzero(np.asarray(output[:, 4] > 0.5)), which is executed in C under the hood at 50x the speed. This makes composing the pre/post-processing of tensors possible in Python.
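To make that concrete, here’s a small sketch of both approaches, assuming output is a (6300, 5) ndarray of (xmin, ymin, xmax, ymax, score) rows:

from ulab import numpy as np

# Slow, pure-Python approach: iterate over every row one at a time.
keep = []
for i in range(output.shape[0]):
    if output[i, 4] > 0.5:
        keep.append(i)

# Vectorized approach: the comparison and index extraction run in C under the hood.
idx = np.nonzero(np.asarray(output[:, 4] > 0.5))  # indices of rows with score > 0.5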

Anyway, that’s why we made all the changes. You can run any model and pre/post-process the output in Python, which unblocks everyone from running whatever they want on an OpenMV Cam without having to write custom C code to do the work.

For more information about the new API, please read the documentation: ml — Machine Learning — MicroPython 1.23 documentation (openmv.io). In the rest of this forum post I will walk you through how to port your Edge Impulse model to run using the new API.


Previously, we only supported models generated by Edge Impulse; moving forward, bring whatever you want! Since you can pre-/post-process the input/output for your model in Python using numpy, there’s no limit to what you can do.

However, given the API changes, here’s how you need to modify your EdgeImpulse code to work with the new API:

Current scripts generated by EdgeImpulse look like this (as of 7/25/2024):

# Edge Impulse - OpenMV Image Classification Example

import sensor, image, time, os, tf, uos, gc

sensor.reset()                         # Reset and initialize the sensor.
sensor.set_pixformat(sensor.RGB565)    # Set pixel format to RGB565 (or GRAYSCALE)
sensor.set_framesize(sensor.QVGA)      # Set frame size to QVGA (320x240)
sensor.set_windowing((240, 240))       # Set 240x240 window.
sensor.skip_frames(time=2000)          # Let the camera adjust.

net = None
labels = None

try:
    # load the model, alloc the model file on the heap if we have at least 64K free after loading
    net = tf.load("trained.tflite", load_to_fb=uos.stat('trained.tflite')[6] > (gc.mem_free() - (64*1024)))
except Exception as e:
    print(e)
    raise Exception('Failed to load "trained.tflite", did you copy the .tflite and labels.txt file onto the mass-storage device? (' + str(e) + ')')

try:
    labels = [line.rstrip('\n') for line in open("labels.txt")]
except Exception as e:
    raise Exception('Failed to load "labels.txt", did you copy the .tflite and labels.txt file onto the mass-storage device? (' + str(e) + ')')

clock = time.clock()
while(True):
    clock.tick()

    img = sensor.snapshot()

    # default settings just do one detection... change them to search the image...
    for obj in net.classify(img, min_scale=1.0, scale_mul=0.8, x_overlap=0.5, y_overlap=0.5):
        print("**********\nPredictions at [x=%d,y=%d,w=%d,h=%d]" % obj.rect())
        img.draw_rectangle(obj.rect())
        # This combines the labels and confidence values into a list of tuples
        predictions_list = list(zip(labels, obj.output()))

        for i in range(len(predictions_list)):
            print("%s = %f" % (predictions_list[i][0], predictions_list[i][1]))

    print(clock.fps(), "fps")

You should change the script to this:

# Edge Impulse - OpenMV Image Classification Example

import sensor, image, time, os, ml, uos, gc
from ulab import numpy as np

sensor.reset()                         # Reset and initialize the sensor.
sensor.set_pixformat(sensor.RGB565)    # Set pixel format to RGB565 (or GRAYSCALE)
sensor.set_framesize(sensor.QVGA)      # Set frame size to QVGA (320x240)
sensor.set_windowing((240, 240))       # Set 240x240 window.
sensor.skip_frames(time=2000)          # Let the camera adjust.

net = None
labels = None

try:
    # load the model, alloc the model file on the heap if we have at least 64K free after loading
    net = ml.Model("trained.tflite", load_to_fb=uos.stat('trained.tflite')[6] > (gc.mem_free() - (64*1024)))
except Exception as e:
    print(e)
    raise Exception('Failed to load "trained.tflite", did you copy the .tflite and labels.txt file onto the mass-storage device? (' + str(e) + ')')

try:
    labels = [line.rstrip('\n') for line in open("labels.txt")]
except Exception as e:
    raise Exception('Failed to load "labels.txt", did you copy the .tflite and labels.txt file onto the mass-storage device? (' + str(e) + ')')

clock = time.clock()
while(True):
    clock.tick()

    img = sensor.snapshot()

    predictions_list = list(zip(labels, net.predict([img])[0].flatten().tolist()))

    for i in range(len(predictions_list)):
        print("%s = %f" % (predictions_list[i][0], predictions_list[i][1]))

    print(clock.fps(), "fps")

Here are the changes:

  1. The tf module was replaced with the ml module. I also import numpy from the ulab module; this is not strictly needed in this script per se, but it will be necessary to call numpy functions.

  2. To load a model you now directly create a Model() object versus calling load(). We’ve kept the load_to_fb argument so that you can still load models onto the frame buffer stack if you are on an OpenMV Cam without SDRAM. You should not continue to use this argument unless you absolutely need it. Note that Model() objects automatically free their associated memory in the heap (or whatever space they used on the frame buffer stack) when deleted.

  3. classify() has been removed. The sliding window approach it used, while interesting, was unusably slow. There is only one inference method now, predict(), which takes a list of input ndarrays or image objects and returns a list of ndarrays. The list of inputs must be the same size as the number of tensor inputs the model expects.

  4. predict() returns a list of ndarrays equal in length to the number of tensor outputs. This classification model accepts one input (an image) and produces one output tensor (the list of class scores). So, we index with [0] to grab the single output tensor.

  5. The output tensor of the classification model has a shape of (1, x) where x is the number of classes. If you convert this directly into a list you’ll get two levels of lists, [[..., ..., ...]]. So, you have to flatten it first (i.e. remove the extra dimension) before converting the ndarray into a flat list of floats you can zip with the labels, as shown in the short sketch below.
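Here’s a minimal sketch of that last step, assuming the net, img, and labels from the script above and a hypothetical 3-class model (the scores are made up):

out = net.predict([img])[0]         # the single output tensor, shape (1, 3)
print(out.tolist())                 # [[0.1, 0.7, 0.2]]  <- nested list
scores = out.flatten().tolist()     # [0.1, 0.7, 0.2]    <- flat list of floats
predictions_list = list(zip(labels, scores))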

As you can see, the new API is actually less code than the previous API.

Question: I thought the predict() call only accepted ndarrays?

Answer: Yes, it does. However, in Python we automatically detect when image objects are passed to predict() and convert them to ndarrays on the fly for you using the Normalization object. This is a pre-processing class we made for handling images. The new API allows you to add pre-processing classes for whatever you like. Note that Normalization leverages fast C code under the hood to convert the image to an ndarray, so you should not notice any loss in speed.
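You can also skip the automatic conversion and hand predict() an ndarray you built yourself. A hedged sketch, assuming a hypothetical model with a single 96x96 single-channel float input:

from ulab import numpy as np

buf = np.zeros((1, 96, 96, 1), dtype=np.float)  # fill this with your own pre-processed data
out = net.predict([buf])[0]                     # predict() accepts ndarray inputs directly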

Moving on, for object detection the current EdgeImpulse code looks like this (as of 7/25/2024):

# Edge Impulse - OpenMV Object Detection Example

import sensor, image, time, os, tf, math, uos, gc

sensor.reset()                         # Reset and initialize the sensor.
sensor.set_pixformat(sensor.RGB565)    # Set pixel format to RGB565 (or GRAYSCALE)
sensor.set_framesize(sensor.QVGA)      # Set frame size to QVGA (320x240)
sensor.set_windowing((240, 240))       # Set 240x240 window.
sensor.skip_frames(time=2000)          # Let the camera adjust.

net = None
labels = None
min_confidence = 0.5

try:
    # load the model, alloc the model file on the heap if we have at least 64K free after loading
    net = tf.load("trained.tflite", load_to_fb=uos.stat('trained.tflite')[6] > (gc.mem_free() - (64*1024)))
except Exception as e:
    raise Exception('Failed to load "trained.tflite", did you copy the .tflite and labels.txt file onto the mass-storage device? (' + str(e) + ')')

try:
    labels = [line.rstrip('\n') for line in open("labels.txt")]
except Exception as e:
    raise Exception('Failed to load "labels.txt", did you copy the .tflite and labels.txt file onto the mass-storage device? (' + str(e) + ')')

colors = [ # Add more colors if you are detecting more than 7 types of classes at once.
    (255,   0,   0),
    (  0, 255,   0),
    (255, 255,   0),
    (  0,   0, 255),
    (255,   0, 255),
    (  0, 255, 255),
    (255, 255, 255),
]

clock = time.clock()
while(True):
    clock.tick()

    img = sensor.snapshot()

    # detect() returns all objects found in the image (splitted out per class already)
    # we skip class index 0, as that is the background, and then draw circles of the center
    # of our objects

    for i, detection_list in enumerate(net.detect(img, thresholds=[(math.ceil(min_confidence * 255), 255)])):
        if (i == 0): continue # background class
        if (len(detection_list) == 0): continue # no detections for this class?

        print("********** %s **********" % labels[i])
        for d in detection_list:
            [x, y, w, h] = d.rect()
            center_x = math.floor(x + (w / 2))
            center_y = math.floor(y + (h / 2))
            print('x %d\ty %d' % (center_x, center_y))
            img.draw_circle((center_x, center_y, 12), color=colors[i], thickness=2)

    print(clock.fps(), "fps", end="\n\n")

For object detection there are a lot more changes as this requires custom post-processing in Python:

# Edge Impulse - OpenMV Object Detection Example

import sensor, image, time, os, ml, math, uos, gc
from ulab import numpy as np

sensor.reset()                         # Reset and initialize the sensor.
sensor.set_pixformat(sensor.RGB565)    # Set pixel format to RGB565 (or GRAYSCALE)
sensor.set_framesize(sensor.QVGA)      # Set frame size to QVGA (320x240)
sensor.set_windowing((240, 240))       # Set 240x240 window.
sensor.skip_frames(time=2000)          # Let the camera adjust.

net = None
labels = None
min_confidence = 0.5

try:
    # load the model, alloc the model file on the heap if we have at least 64K free after loading
    net = ml.Model("trained.tflite", load_to_fb=uos.stat('trained.tflite')[6] > (gc.mem_free() - (64*1024)))
except Exception as e:
    raise Exception('Failed to load "trained.tflite", did you copy the .tflite and labels.txt file onto the mass-storage device? (' + str(e) + ')')

try:
    labels = [line.rstrip('\n') for line in open("labels.txt")]
except Exception as e:
    raise Exception('Failed to load "labels.txt", did you copy the .tflite and labels.txt file onto the mass-storage device? (' + str(e) + ')')

colors = [ # Add more colors if you are detecting more than 7 types of classes at once.
    (255,   0,   0),
    (  0, 255,   0),
    (255, 255,   0),
    (  0,   0, 255),
    (255,   0, 255),
    (  0, 255, 255),
    (255, 255, 255),
]

threshold_list = [(math.ceil(min_confidence * 255), 255)]

def fomo_post_process(model, inputs, outputs):
    ob, oh, ow, oc = model.output_shape[0]

    x_scale = inputs[0].roi[2] / ow
    y_scale = inputs[0].roi[3] / oh

    scale = min(x_scale, y_scale)

    x_offset = ((inputs[0].roi[2] - (ow * scale)) / 2) + inputs[0].roi[0]
    y_offset = ((inputs[0].roi[3] - (oh * scale)) / 2) + inputs[0].roi[1]

    l = [[] for i in range(oc)]

    for i in range(oc):
        img = image.Image(outputs[0][0, :, :, i] * 255)
        blobs = img.find_blobs(
            threshold_list, x_stride=1, y_stride=1, area_threshold=1, pixels_threshold=1
        )
        for b in blobs:
            rect = b.rect()
            x, y, w, h = rect
            score = (
                img.get_statistics(thresholds=threshold_list, roi=rect).l_mean() / 255.0
            )
            x = int((x * scale) + x_offset)
            y = int((y * scale) + y_offset)
            w = int(w * scale)
            h = int(h * scale)
            l[i].append((x, y, w, h, score))
    return l

clock = time.clock()
while(True):
    clock.tick()

    img = sensor.snapshot()

    for i, detection_list in enumerate(net.predict([img], callback=fomo_post_process)):
        if i == 0: continue  # background class
        if len(detection_list) == 0: continue  # no detections for this class?
    
        print("********** %s **********" % labels[i])
        for x, y, w, h, score in detection_list:
            center_x = math.floor(x + (w / 2))
            center_y = math.floor(y + (h / 2))
            print(f"x {center_x}\ty {center_y}\tscore {score}")
            img.draw_circle((center_x, center_y, 12), color=colors[i])

    print(clock.fps(), "fps", end="\n\n")

Okay, here’s what’s going on:

  1. As with the classification code before, you need to change the tf module to the ml module, import numpy from ulab if you plan to use numpy stuff, and change tf.load to ml.Model().

  2. After that, we add a custom post-processing function to handle the output of the FOMO model. Previously, detect() ran all this logic in C. While that was sweet, it meant precious firmware space was used on all OpenMV Cams for a baked-in detect() method that may not be what you want to use. Now the post-processing is in Python, via the callback argument built into predict(). Let’s walk through the code:

# The post-processing callback receives the model object, the input list, and the output list.
# We designed it this way so that this callback function could be included in a library in the
# future that you load as a module, like "from ei import fomo_post_processing".
def fomo_post_process(model, inputs, outputs):
    # This unpacks the shape of FOMO's single output tensor.
    # It has batch (1), height, width, and channels (one channel per object class).
    ob, oh, ow, oc = model.output_shape[0]

    # For image arguments to predict() the roi of the image being processed is available.
    # We get it by grabbing the single input at [0] and getting the ROI object there.

    x_scale = inputs[0].roi[2] / ow
    y_scale = inputs[0].roi[3] / oh

    # This computes the x/y scale difference between the ROI and the output
    # tensor's width/height. We can map back to the input image using this.
    scale = min(x_scale, y_scale)

    # In case the input image gets cropped when given to the model input,
    # we need to compute the x/y offset (plus the ROI offset) to map it back.
    x_offset = ((inputs[0].roi[2] - (ow * scale)) / 2) + inputs[0].roi[0]
    y_offset = ((inputs[0].roi[3] - (oh * scale)) / 2) + inputs[0].roi[1]

    # Create a list of lists for each class output.
    l = [[] for i in range(oc)]

    # FOMO outputs an activation map for each class.
    for i in range(oc):
        # The image object now supports creating images from ndarrays. The code below is
        # like magic. We select the output tensor [0] (FOMO only has one output tensor),
        # then grab batch 0 (the only one), every pixel of the height and width dimensions,
        # and the target class channel we are looking for. This shows the power of ndarrays:
        # the code below does a very complex operation in one line. Note that the array is
        # sliced using numpy, so no copy is made to create the slice. Finally, the output
        # array holds floats (0 to 1). We need to make it (0 to 255) to create a GRAYSCALE
        # image, so we multiply all pixels by 255 and then cast it to an image. The image
        # lib will automatically interpret (h, w) ndarrays as GRAYSCALE images.
        img = image.Image(outputs[0][0, :, :, i] * 255)
        # Next, find blobs above the threshold value.
        blobs = img.find_blobs(
            threshold_list, x_stride=1, y_stride=1, area_threshold=1, pixels_threshold=1
        )
        # Then for all the blobs found...
        for b in blobs:
            rect = b.rect()
            x, y, w, h = rect
            # Extract the brightness of the pixels in the blob to create a score.
            score = (
                img.get_statistics(thresholds=threshold_list, roi=rect).l_mean() / 255.0
            )
            # And then map the blobs back to the input image.
            x = int((x * scale) + x_offset)
            y = int((y * scale) + y_offset)
            w = int(w * scale)
            h = int(h * scale)
            # And add them back to their score list.
            l[i].append((x, y, w, h, score))
    # Return a list of classes which each have a list of (x, y, w, h, score).
    return l

Wow! That was a lot that detect() used to do. But now, with the code in Python, if you need to run a modified model with a slightly different output than detect() expected, you aren’t out of luck anymore: you can change what happens during post-processing yourself.
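For example, here’s a tiny hypothetical callback that skips blob detection entirely and just reports the peak activation per class - predict() simply returns whatever your callback returns:

def peak_activation_post_process(model, inputs, outputs):
    ob, oh, ow, oc = model.output_shape[0]
    # For each class channel, return the strongest activation anywhere in the heatmap.
    return [float(np.max(outputs[0][0, :, :, i])) for i in range(oc)]

scores = net.predict([img], callback=peak_activation_post_process)
print(list(zip(labels, scores)))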

  3. Finally, the rest of the code looks very similar to how we processed the output of detect().
# Predict() just returns whatever the post-processing method wants to return.
for i, detection_list in enumerate(net.predict([img], callback=fomo_post_process)):
    if i == 0: continue  # background class
    if len(detection_list) == 0: continue  # no detections for this class?

    print("********** %s **********" % labels[i])
    # This is the output we added to each list back in the post-processing method.
    for x, y, w, h, score in detection_list:
        center_x = math.floor(x + (w / 2))
        center_y = math.floor(y + (h / 2))
        print(f"x {center_x}\ty {center_y}\tscore {score}")
        img.draw_circle((center_x, center_y, 12), color=colors[i])

But now it’s more Pythonic, without the weird output class that was really just a named tuple.

Note: you may see this alternative version of fomo_post_process() in our examples:

def fomo_post_process(model, inputs, outputs):
    n, oh, ow, oc = model.output_shape[0]
    nms = NMS(ow, oh, inputs[0].roi)
    for i in range(oc):
        img = image.Image(outputs[0][0, :, :, i] * 255)
        blobs = img.find_blobs(
            threshold_list, x_stride=1, area_threshold=1, pixels_threshold=1
        )
        for b in blobs:
            rect = b.rect()
            x, y, w, h = rect
            score = (
                img.get_statistics(thresholds=threshold_list, roi=rect).l_mean() / 255.0
            )
            nms.add_bounding_box(x, y, x + w, y + h, score, i)
    return nms.get_bounding_boxes()

We designed an NMS object to take care of the details of dealing with overlapping bounding boxes and mapping detections from the output tensor back to the input image. However, this class was designed before we made the decision to switch to ndarrays for everything, so its API is not suitable going forward and will be refactored: the current API can’t easily leverage numpy vectorization for larger models that output thousands of detections. You should avoid using it if you don’t want your code breaking again.


Thank you for reading this massive forum post. There will no doubt be bugs that folks uncover once they start testing things, and we’ll get those fixed quickly. However, the API should be stable now.

Please ask questions. :slight_smile:


With all the changes like net.detect() to net.predict(), is it any faster/slower? It’s seemingly the slowest part about object detection.

These changes are mainly meant to improve usability and maintainability. However, inference did still get faster, mostly due to the updated tflite-micro library. It’s probably as fast as it can get on the current cameras; that’s why we’re working on next-gen cams with NPUs.

FOMO outputs a 16x16 pixel image. Post-processing this in Python using numpy and find_blobs() should finish in about 1ms or less. Running the network itself is the bulk of the work. So, as Ibrahim mentions, you should only see speed gains (not losses) on our current cameras.

For next-gen systems, the NPUs accelerate things so much that the inverse actually happens… pre/post-processing takes more time than running the network :upside_down_face: - this is why we are leaning in on leveraging SIMD and new hardware units like GPUs on future MCUs.

I understand that with this update we should be able to load other models like YOLO.
I tried to load a YOLOv8 model of size ~11MB but it failed with a memory allocation error. What’s the maximum model size supported on the OpenMV H7 Plus, so I can try to reduce the size?

The OpenMV-4 Plus (H7 Plus) has a 4MB heap right now. The model will need to be smaller than that. We can probably increase this to at least 8MB in the next release.
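If you want to sanity-check whether a model might fit before loading it, here’s a rough sketch (assuming the model file is on the flash filesystem; note the model also needs extra heap for its tensor arena on top of the file size):

import uos, gc

model_size = uos.stat("trained.tflite")[6]  # file size in bytes
gc.collect()
print("model:", model_size, "bytes, heap free:", gc.mem_free(), "bytes")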

@iabdalkader - We could just raise the heap to 16 MB on the H7 Plus and RT1062 and 32 MB on the Pure Thermal OpenMV. Any reason not to?

A bigger heap might be slower, I’m not sure. That said, you can increase it to whatever you like; I don’t like the FB/FB alloc wasting all that space anyway.

That would be great… For my use case I have been struggling to reduce the model size.

Hi @iabdalkader ,
I’m trying to load a quantized object detection tflite model. Its size is ~3.5MB, if that matters.
I was not expecting a "Failed to allocate tensors" error with the latest update. Do we still have a few unsupported tensors?

Hi, there should be some more error messages in the terminal. If there’s nothing else printed about the tflm backend, then it’s a memory issue.

I’ll post a new firmware for the H7 Plus and RT1062 with more SDRAM enabled tomorrow.


No, all ops are enabled. It’s not a memory issue either; that would fail to load the model like before. It’s likely a quantization issue.

That’s the error I get in the exception. I’ll check again though to see if there are any other error messages on the terminal.

The arena and everything is preallocated. If it’s a memory issue you’ll get an exception that says it failed to alloc x, just like before. If you get "failed to allocate tensors" it’s something internal in TensorFlow. The error can only be one of two things: unsupported quantization or an unsupported op. Don’t bother looking for an exact error message, as the library is stripped of logs to save space. If you can’t figure out the issue, please post the model and I’ll test it for you with a library with logs enabled.

Yeah, you’re right @iabdalkader , nothing much on the terminal.
Here’s the model, could you please help figure out what’s wrong with it?
detect_quant.tflite.zip (2.5 MB)

Thanks kwagyeman for the updated firmware. I tried it but it didn’t resolve my issue.

This is the error log I’m getting…

Could not copy directory imadest!!!
Trying to load model : deteFailed to allocate tensors
ent call last):
File “”, line 37, in
to load model file, did you d labels.txt file onto the m (Failed to allocate tensors
OpenMV v4.5.8-56.g8a47918b v1.22-omv.r17.434.g8f6a976dTM32H743
Type “help()” for >>>

I tried some other tflite model and it loaded successfully despite being larger (7.4MB) than the current one (3.8MB). Maybe it is something related to quantization, as @iabdalkader suspected?