The main advantage of this checkout system is its ability to recognize and classify produce, which we decided early on would be done by a machine learning system. Initially, an object detection system was proposed, which consists of both image classification and object localization, as shown in the image below.
However, we realized that as long as customers place only one type of produce on the scale at a time (which they would not expect to do anyway), we only needed to perform the simpler task of image classification: recognizing which produce item, if any, appears in the image returned by the webcam.
With this decision made, the task was now to train an image classifier for the desired classes of produce with the specified accuracy of >90%. We decided to use transfer learning to train on top of MobileNet V2, a convolutional neural network that achieved 91% top-1 accuracy on the 1000-class ImageNet challenge. There is a large amount of documentation on how to do this with TensorFlow, and specifically on then running inference on a Raspberry Pi, so that is the library we decided to use.
As a preliminary network, a 2D convolutional layer with dropout, a pooling layer, and a dense layer were added on top of the extensive pre-existing MobileNet architecture.
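In Keras this kind of head is only a few lines. The sketch below is illustrative, not our exact hyperparameters: the filter count, dropout rate, and pooling choice are assumptions, and `weights=None` is used only so the sketch builds offline (in practice you would pass `weights="imagenet"` to download the pretrained backbone).

```python
import tensorflow as tf

NUM_CLASSES = 75      # number of produce classes, as in the post
IMG_SIZE = (224, 224)  # MobileNet V2's standard input resolution

# MobileNetV2 backbone without its original 1000-class ImageNet head.
# Assumption: weights=None here to avoid a download; use "imagenet" for
# actual transfer learning.
base = tf.keras.applications.MobileNetV2(
    input_shape=IMG_SIZE + (3,), include_top=False, weights=None)
base.trainable = False  # freeze the pretrained layers; train only the head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Conv2D(32, 3, activation="relu"),  # illustrative sizes
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Freezing the backbone means only the small head is updated during training, which is what makes training on a modest dataset feasible at all.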
Initially, Erin trained a few layers on top of MobileNet on her MacBook for 20 epochs, with each epoch taking ~40 minutes, so about 14 hours in total. With 60,000 images used in training, this timeline was long but not at all out of the realm of normal training times. However, we were able to set up a remote desktop on Mark's Windows/NVIDIA GPU machine. Initial testing showed the training running at ~11 minutes per epoch, a significant improvement that cut training time from 14 hours to just under 4. We then realized that the training had not been using the GPU at all; Mark's computer simply had a significantly better CPU! After a late night with a few silly errors (namely renaming a few CUDA files to "10" instead of "11"), the network finally started training on the GPU, with each epoch taking a grand total of 2 minutes. This means a new image classifier with 75 output classes can be trained on top of the entire MobileNet V2 network, with 60,000 images, for 20 epochs, in under an hour. This significantly shortens the loop between having an idea for improving the network and seeing how well it works.
Moral of the story: if you plan to train convolutional neural networks in any capacity, the hardware component cannot be overlooked!