models supported by CMSIS-NN?


in this blog: Low Power Deep Learning on the OpenMV Cam powered by the Cortex-M7 Pro | OpenMV

Note that the CMSIS-NN library and conversion scripts have limitations on the number and types of supported layers, so your model should be simple.

I know that LeNet is small, but can I use GoogLeNet, VGGNet, AlexNet, or ResNet?

Maybe they are too big to run on the OpenMV.

Hi, LeNet6 is about 100 KB after our conversion script. All these other networks you mentioned are hundreds of megabytes in size. Part of the reason for that is the input resolution they start at. LeNet works on about 24x24 pixel images; CIFAR-10 works on 32x32 images. So, expect CNNs on the OpenMV Cam to have to be small like this. The thing is still a microcontroller.
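To get a feel for the size gap, here's some back-of-envelope parameter arithmetic. The layer shapes below are illustrative (a generic LeNet-style stack and the well-known first fully connected layer of VGG-16), not the exact layers of any converted model:

```python
# Rough parameter/memory math (illustrative layer shapes, not exact
# specs for any particular converted network).

def conv_params(in_ch, out_ch, k):
    """Weights + biases for a k x k convolution layer."""
    return in_ch * out_ch * k * k + out_ch

def fc_params(n_in, n_out):
    """Weights + biases for a fully connected layer."""
    return n_in * n_out + n_out

# A LeNet-style stack: two small conv layers + a small FC head.
lenet_like = (conv_params(1, 6, 5) + conv_params(6, 16, 5)
              + fc_params(16 * 5 * 5, 120) + fc_params(120, 10))

# Just the FIRST fully connected layer of VGG-16 (7*7*512 -> 4096).
vgg_fc1 = fc_params(7 * 7 * 512, 4096)

print(lenet_like)            # ~52 thousand parameters total
print(vgg_fc1)               # ~103 million parameters in one layer
print(vgg_fc1 * 4 / 2**20)   # ~392 MB as 32-bit floats
```

One VGG fully connected layer alone is thousands of times bigger than the entire LeNet-style network, which is why only tiny models fit in microcontroller flash/RAM.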


I have a question relating to the models.

I’ve created my own dataset and built the files as explained. When it comes to the models, since they need to be small in size, should I just use CIFAR10_fast since it works on the M7? Or should I create my own, similar to the Smiles model?

Also, when you say that the model works on a 32x32 pixel image, does that mean that I need to resize my dataset to a 32x32 image?


Use the Smile network as a starting point. As for the image size, yes, they should be 32x32. Here’s a rather comprehensive guide yet to be merged into the main github: openmv/ml/cmsisnn at more_nn_nets · kwagyeman/openmv · GitHub
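On resizing the dataset to 32x32: in practice you'd use PIL/OpenCV or the OpenMV IDE's dataset tools, but the idea is just nearest-neighbor (or better) downsampling. A minimal plain-Python sketch of nearest-neighbor resize on a grayscale pixel grid:

```python
# Nearest-neighbor resize sketch for prepping images to 32x32.
# Plain nested lists stand in for a grayscale image; a real pipeline
# would use PIL's Image.resize or OpenCV's cv2.resize instead.

def resize_nearest(img, new_w, new_h):
    h, w = len(img), len(img[0])
    return [[img[y * h // new_h][x * w // new_w]
             for x in range(new_w)]
            for y in range(new_h)]

# Fake 4x4 "image" with distinct pixel values.
src = [[x + y * 4 for x in range(4)] for y in range(4)]
out = resize_nearest(src, 32, 32)
print(len(out), len(out[0]))  # 32 32
```

Every image in the training set (and every frame you classify later) should go through the same 32x32 preprocessing so the network sees consistent input.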

If you look at the later networks I built, I basically just took the Smile network and used that. It’s rather easy to get a very high score. The trick is that you need to set up the real-life system in the same way as you do for testing, which can be hard.

Great! I’m trying my data with the smile model and I’ll take a look at the link!

Thank you,

Hi again,

I have another question regarding using the CNN. I managed to successfully train, test, quantize, and convert my network, with an accuracy of 91%.

I stored the .network file on the M7 and now I’m not 100% sure which example I should use to get the best results. Right now I’m using the “” example and it’s working, but I noticed that I need to place the object in a certain position within the frame for it to be recognized, and the lighting also seems to affect my results.

I couldn’t try the “” example because I don’t have a Haar Cascade file for my project.

The network was made using the Smile model and it’s 24 KB in size.


Yeah, that’s about the level of results you can expect from the neural network.

The reason for it not being super awesome is a lack of training examples. Most of the hype you see about deep learning being awesome requires tons of training examples. The network is very good at doing the most minimal amount of learning needed to score well on the dataset. So, it will not learn anything beyond what you show it during training. If you want it to generalize across tilts, rotations, and scale changes, then it needs to see images manipulated in that way too. That said, the network will also learn to recognize whatever borders appear in images when you rotate them and/or zoom in and out.

Given this… your best results will only be achieved when using the network in an environment where you control all the variables, and you’re simply using the network to classify well-lit objects or scenes from a static viewing angle. If you have access to more data, however, you can make it better. As for the CIFAR-10 network: keep in mind it was trained on a bunch of small images at fixed distance and lighting, so it expects the world to be that way too.

Thank you for your response!

I assumed the content of my dataset would play a role in the outcome. My plan now is to increase my dataset and, in fact, use the OpenMV to save the images of the objects detected.

Luckily, my project requires the OpenMV to be in a fixed position (static viewing angle). After increasing my dataset, I’ll see how the results turn out and will decide if I want to control the lighting in my system too!

Thanks again for your help! It really clarified a few points.

Yeah, the best way for this to work is to collect images from the camera’s view of what you need it to work on. Then the networks will work great. This eliminates a lot of the variables.