Object Detection Accuracy

Hello all, I'm extremely new to the OpenMV IDE and ecosystem. I feel I've done pretty well so far using example scripts and AI-generated bits of code, but I have a technical question that I'm unable to find an answer to. Whether that's from lack of familiarity or not enough experience, I'm not sure, and apologies if this is a simple fix.

My Question:
I have used Edge Impulse to create a trained object recognition model to identify a tennis ball, and it works well for my first try. I have put the trained.tflite and labels.txt files onto my H7 camera, and I run a script that uses the trained model to find tennis balls in the frame. This works great, but my concern is with the accuracy. When a tennis ball is detected, the script draws a cross on its centre, but when I move the camera the cross seems to step across the screen in large increments, and it is often off-centre from the ball because the true centre sits between two of these steps. Is there something I can adjust or implement so that the centre of the detection doesn't jump by values of 10 and can instead move 1 value at a time for accuracy?

Watching a serial print of the X and Y values from the net.detect image function, the values move in increments of 5-10.

Here is the code from my script:

import sensor, image, time, os, tf, math, uos, gc
from pyb import UART

# Initialize the camera sensor
sensor.reset()
sensor.set_pixformat(sensor.RGB565)
sensor.set_framesize(sensor.QVGA)
sensor.set_windowing((180, 140))
sensor.skip_frames(time=2000)

# Initialize UART communication
uart = UART(3, 9600)

# Load the TensorFlow Lite model
# Make sure to handle exceptions if the model fails to load
try:
    net = tf.load("trained.tflite")
except Exception as e:
    print('Failed to load TensorFlow Lite model:', e)

# Read labels from the labels file
try:
    labels = [line.rstrip('\n') for line in open("labels.txt")]
except Exception as e:
    print('Failed to load labels:', e)

# Define color palette for detection annotations
colors = [
    (255, 0, 0),   # Red
    (0, 255, 0),   # Green
    (0, 0, 255),   # Blue
]
# Main loop
while True:
    img = sensor.snapshot().lens_corr(strength=1.5, zoom=1.0)
    found_circle_in_frame = False

    # Perform object detection
    for i, detection_list in enumerate(net.detect(img, thresholds=[(160, 255)])):
        if i == 0 or not detection_list:
            continue  # Skip the background class and empty lists
        for d in detection_list:
            x, y, w, h = d.rect()
            center_x = x + (w // 2)
            center_y = y + (h // 2)
            img.draw_cross(center_x, center_y, color=colors[2])
            # Refine with a circle search inside the detection rect
            found_circles = img.find_circles(roi=(x, y, w, h),
                                             threshold=2000)
            if found_circles:  # Check if at least one circle is found
                ball_r = found_circles[0].r()  # Use the first found circle
                print("Ball Drawn\n")
                img.draw_circle(center_x, center_y, ball_r, color=(255, 0, 0))
                found_circle_in_frame = True
                # Simple diameter estimate based on the circle's radius
                size_estimate = ball_r * 2

    if not found_circle_in_frame:
        # Handle frames where no circles are found
        print("No circle detected in this frame.")

    #_END CODE_ :)

Any help is appreciated :slight_smile:

This has to do with the network itself seeing a 96x96 input image of the world, with the result scaled back up to the original image resolution. I think FOMO has a 16x16 output grid, though, which is even lower resolution than its input.
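To see where the step size comes from, here is the rough arithmetic. The 16x16 grid figure is from my "I think" above, and the 180x140 window is the one set in your script, so treat the exact numbers as an estimate:

```python
# Back-of-envelope: how coarse is a 16x16 output grid mapped onto
# a 180x140 sensor window? (Both numbers are assumptions from the
# discussion above, not measured values.)
grid_w, grid_h = 16, 16
win_w, win_h = 180, 140

step_x = win_w / grid_w  # horizontal pixels per grid cell
step_y = win_h / grid_h  # vertical pixels per grid cell

print(step_x, step_y)  # roughly 11.25 and 8.75 pixels per cell
```

That matches the 5-10 pixel jumps you are seeing: the centroid can only land on grid-cell centres, so it quantizes to roughly one cell width at a time.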

Unfortunately, FOMO just gives a strong centroid output for whatever it detects. You can't tell whether an object sits toward the left or right of a grid cell; a detection lands in exactly one grid spot rather than spanning several. Given this, the behavior you see is expected. You can fix it by making the FOMO model output a denser grid of sections, but this will cost you speed.

An alternative approach that doesn’t cost speed is to use the object position as an estimate and then input that into an object tracker. An object tracker is an algorithm where you match detections across frames and then represent the position of an object based on what the tracker says versus what the detection says. Detections are then just estimates of the actual position.

In practice, you keep an array of tracked objects, each with a position. When a detection update arrives, you blend the newly detected position into the matching previous track, where "matching" means it is in about the same area per a distance threshold. Better yet, if an object moves quickly, you can estimate its direction vector and use a smaller distance threshold check along that direction.
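For a single object like one tennis ball, the matching-and-blending idea above can be sketched in plain Python. The distance threshold and blend factor here are made-up starting values you would tune; this is a minimal sketch, not a full multi-object tracker:

```python
# Minimal single-object centroid tracker: blends quantized FOMO
# detections into a smoothed position. max_dist and alpha are
# assumed tuning values, not measured ones.

class CentroidTracker:
    def __init__(self, max_dist=20, alpha=0.3):
        self.max_dist = max_dist  # match radius in pixels
        self.alpha = alpha        # blend weight for new detections
        self.pos = None           # smoothed (x, y), or None

    def update(self, detection):
        # detection is an (x, y) centroid, or None if nothing was
        # detected this frame. Returns the smoothed position.
        if detection is None:
            return self.pos
        if self.pos is None:
            self.pos = detection  # first sighting starts the track
            return self.pos
        dx = detection[0] - self.pos[0]
        dy = detection[1] - self.pos[1]
        if dx * dx + dy * dy <= self.max_dist ** 2:
            # Close enough: blend the new detection into the track
            self.pos = (self.pos[0] + self.alpha * dx,
                        self.pos[1] + self.alpha * dy)
        else:
            # Too far away: treat it as a new object and restart
            self.pos = detection
        return self.pos
```

In your script you would create one `CentroidTracker` before the main loop, call `tracker.update((center_x, center_y))` for each detection, and draw the cross at the returned position instead of the raw centroid. Because each update only moves the track a fraction of the way toward the new detection, the 10-pixel jumps get spread over several frames.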

Anyway, I hope you get the idea. It's quite a bit more code to write, though, and involves data structures and so on. However, this is generally how real systems do it.

Thank you for that, I now understand. :slight_smile: