Because of un-manageable amounts of spam despite our use of CAPTCHAs, email authorization, and other tools, we have discontinued this forum (see the 700k+ registered users with validated email addresses at right?). Please email us any questions or post bug reports and feature requests on GitHub at https://github.com/jevois -- The content below remains available for future reference.

Can TensorFlow Lite use the GPU?

+1 vote
I am running a custom TensorFlow model using TensorFlow Lite. From the results it seems the GPU is not really used. Is the TensorFlow Easy model using the GPU or not, and if not, how can I activate it?


Follow-up question: Is there any framework that allows running custom deep nets on the JeVois using the GPU?

Thanks already:)
asked Jun 6, 2018 in Programmer Questions by phildue (420 points)
edited Jun 7, 2018 by phildue

1 Answer

0 votes
Best answer
Great question! There is a lot of work these days on hybrid CPU+GPU deep networks on embedded systems, but I don't think TensorFlow implements this yet. On the other hand, the GPU in JeVois is much slower than its CPU: the CPU is 4x1344 MHz but the GPU is only 2x408 MHz (of course, this is hard to compare since those are very different kinds of processing units). So it is not clear that much would be gained for the effort. This is very different from desktops with ~10 CPU cores vs ~1000 GPU cores; there, a big gain is obtained by using the GPU.
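As a back-of-the-envelope illustration of the clock figures above, the sketch below multiplies core count by clock rate for each unit. This "aggregate clock" is only a crude proxy chosen here for illustration (it ignores NEON width, GPU shader parallelism, memory bandwidth, and so on), so treat the ratio as a rough indication and not a benchmark.

```python
# Core counts and clock rates quoted in the answer above.
cpu_cores, cpu_mhz = 4, 1344
gpu_cores, gpu_mhz = 2, 408

# "Aggregate clock" = cores x MHz. This is an illustrative proxy only;
# CPU cores and GPU cores do very different amounts of work per cycle.
cpu_aggregate = cpu_cores * cpu_mhz
gpu_aggregate = gpu_cores * gpu_mhz
ratio = cpu_aggregate / gpu_aggregate

print(f"CPU aggregate clock: {cpu_aggregate} MHz")
print(f"GPU aggregate clock: {gpu_aggregate} MHz")
print(f"CPU/GPU ratio: {ratio:.1f}x")
```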
answered Jun 7, 2018 by JeVois (46,580 points)
selected Nov 13, 2018 by phildue
Thank you a lot for the answer! This leaves me with another question though: I run a network smaller than TinyYolo using TensorFlow Lite and I get an average inference time of 3.5s, although I use 4 processing threads. The darknet-yolo version, in contrast, achieves ~1.5s on average and is, as far as I understand, running on the GPU. Do you know what leads to this big difference in inference time?

It also seems that there is no difference between choosing 1 and 4 threads in TensorFlow Lite.
I am certainly with Joseph Redmon (of Darknet YOLO) on that one: what matters is the complexity (number of multiply-accumulate operations) more than just network size.

Have a look at his tests here: https://pjreddie.com/darknet/tiny-darknet/

His Darknet reference model is much bigger than SqueezeNet but runs faster (fewer operations) and is more accurate.
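To make the "size vs. operations" distinction concrete, here is a minimal sketch that counts parameters and multiply-accumulates (MACs) for a plain convolution layer. The `Conv` class and the three layer shapes are hypothetical examples made up for illustration; they are not taken from Darknet, SqueezeNet, or any JeVois model, and the formula assumes a standard dense convolution with one bias per output channel.

```python
from dataclasses import dataclass


@dataclass
class Conv:
    """Hypothetical description of one dense 2D convolution layer."""
    in_ch: int    # input channels
    out_ch: int   # output channels (number of filters)
    kernel: int   # square kernel size, e.g. 3 for 3x3
    out_h: int    # output feature map height
    out_w: int    # output feature map width

    def params(self) -> int:
        # One kernel x kernel x in_ch weight block per filter, plus one bias each.
        return self.kernel * self.kernel * self.in_ch * self.out_ch + self.out_ch

    def macs(self) -> int:
        # Every output pixel of every filter needs kernel x kernel x in_ch MACs.
        return (self.kernel * self.kernel * self.in_ch * self.out_ch
                * self.out_h * self.out_w)


# Early layer: few filters, but applied to a large feature map.
early = Conv(in_ch=16, out_ch=32, kernel=3, out_h=208, out_w=208)
# Late layer: many filters, but applied to a tiny feature map.
late = Conv(in_ch=256, out_ch=512, kernel=3, out_h=7, out_w=7)
# Same early layer with half the filters on both its input and output.
early_half = Conv(in_ch=8, out_ch=16, kernel=3, out_h=208, out_w=208)

for name, layer in [("early", early), ("late", late), ("early_half", early_half)]:
    print(f"{name:10s} params={layer.params():>9,d}  MACs={layer.macs():>12,d}")
```

In this made-up example the late layer holds far more parameters than the early one, yet needs fewer MACs because its feature map is so small. Halving the filter counts on both sides of a layer cuts its MACs by about a factor of four, which is why a network with the same architecture but fewer filters per layer does less work.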

For the threads, we have not played with that too much, but we do see near 400% CPU usage when running TensorFlow mobilenets, so parallelism is happening for sure.

Note that on embedded systems there is another variable, which is ARM NEON acceleration (the equivalent of SSE on Intel processors). Darknet uses the NNPACK package, which uses NEON copiously. I am not sure how far along the TensorFlow people are on this front. You may also be interested in the ARM Compute Library and ARM NN SDK, which have plenty of acceleration with NEON, GPU, etc., but those will require some work to transfer a trained Caffe or TensorFlow model to these frameworks.
Totally agree that it is the number of operations rather than parameters. When I was talking about "a smaller network than TinyYolo" I actually referred to a model that has the same architecture as TinyYolo but uses fewer filters on each layer and was trained with a slightly different loss function. So it would definitely have fewer operations than TinyYolo.

That's why I think the main difference is the fact that darknet uses NEON acceleration already.

Thanks a lot for the support!