Intel Launches Movidius Neural Compute Stick: Deep Learning and AI on a $79 USB Stick
by Nate Oh on July 20, 2017 11:00 AM EST- Posted in
- Intel
- SoCs
- Machine Learning
- Movidius
- Neural Networks
Today Intel subsidiary Movidius is launching their Neural Compute Stick (NCS), a version of which was showcased earlier this year at CES 2017. The Movidius NCS adds to Intel’s deep learning and AI development portfolio, building off of Movidius’ April 2016 launch of the Fathom NCS and Intel’s later acquisition of Movidius itself in September 2016. As Intel states, the Movidius NCS is “the world’s first self-contained AI accelerator in a USB format,” and is designed to allow host devices to process deep neural networks natively – or in other words, at the edge. In turn, this provides developers and researchers with a low power and low cost method to develop and optimize various offline AI applications.
Movidius's NCS is powered by their Myriad 2 vision processing unit (VPU), and, according to the company, can reach over 100 GFLOPs of performance within an nominal 1W of power consumption. Under the hood, the Movidius NCS works by translating a standard, trained Caffe-based convolutional neural network (CNN) into an embedded neural network that then runs on the VPU. In production workloads, the NCS can be used as a discrete accelerator for speeding up or offloading neural network tasks. Otherwise for development workloads, the company offers several developer-centric features, including layer-by-layer neural networks metrics to allow developers to analyze and optimize performance and power, and validation scripts to allow developers to compare the output of the NCS against the original PC model in order to ensure the accuracy of the NCS's model.
The 2017 Movidius NCS vs. 2016 Fathom NCS
According to Gary Brown, VP of Marketing at Movidius, this ‘Acceleration mode’ is one of several features that differentiate the Movidius NCS from the Fathom NCS. The Movidius NCS also comes with a new "Multi-Stick mode" that allows multiple sticks in one host to work in conjunction in offloading work from the CPU. For multiple stick configurations, Movidius claims that they have confirmed linear performance increases up to 4 sticks in lab tests, and are currently validating 6 and 8 stick configurations. Importantly, the company believes that there is no theoretical maximum, and they expect that they can achieve similar linear behavior for more devices. Though ultimately scalability will depend at least somewhat with the neural network itself, and developers trying to use the feature will want to play around with it to determine how well they can reasonably scale.
Meanwhile, the on-chip memory has increased from 1 GB on the Fathom NCS to 4 GB LPDDR3 on the Movidius NCS, in order to facilitate larger and denser neural networks. And to cap it all off, Movidius has been able to reduce the MSRP to $79 – citing Intel’s "manufacturing and design expertise” – lowering the cost of entry even more.
Like other players in the edge inference market, Movidius is looking to promote and capitalize on the need for low-power but capable inference processors for stand-alone devices. That means targeting use cases where the latency of going to a server would be too great, a high-performance CPU too power hungry, or where privacy is a greater concern. In which case, the NCS and the underlying Myriad 2 VPU are Intel's primary products for device manufacturers and software developers.
Movidius Neural Compute Stick Products | ||
Movidius Neural Compute Stick | Fathom Neural Compute Stick | |
Interface | USB 3.0 Type A | USB 3 |
On-chip Memory | 4Gb LPDDR3 | 1Gb/512Mb LPDDR3 |
Deep Learning Framework Support | Caffe TensorFlow (as of NC SDK v1.09.00) |
Caffe TensorFlow |
Native Precision Support | FP16 | FP16, 8bit |
Features | Acceleration mode Multi-Stick mode |
N/A |
Nominal Power Envelope | 1W | 1W |
SoC | Myriad 2 VPU | Myriad 2 VPU (MA2450) |
Launch Date | 7/20/2017 | 4/28/2016 |
MSRP | $79 | $99 |
As for the older Fathom NCS, the company notes that the Fathom NCS was only ever released in a private beta (which was free of charge). So the Movidius NCS is the de facto production version. For customers who did grab a Fathom NCS, Movidius says that Fathom developers will be able to retain their current hardware and software builds, but the company will be encouraging developers to switch over to the production-ready Movidius NCS.
Stepping back, it’s clear that the Movidius NCS offers stronger and more versatile features beyond the functions described in the original Fathom announcement. As it stands, the Movidius NCS offers native FP16 precision, with over 10 inferences per second at FP16 precision on GoogleNet in single-inference mode, putting it in the same range as the 15 nominal inferences per second of the Fathom. While the Fathom NCS was backwards compatible with USB 1.1 and USB 2, it was noted that the decreased bandwidth reduced performance; presumably, this applies for the Movidius NCS as well.
SoC-wise, while the older Fathom NCS had a Myriad 2 MA2450 variant, a specific Myriad 2 model was not described for the Movidius NCS. A pre-acquisition 2016 VPU product brief outlines 4 Myriad 2 family SoCs to be built on a 28nm HPC process, with the MA2450 supporting 4Gb LPDDR3 while the MA2455 supports 4Gb LPDDR3 and secure boot. Intel’s own Myriad 2 VPU Fact Sheet confirms the 28nm HPC process, implying that the VPU remains fabbed with TSMC. Given that the 2014 Myriad 2 platform specified a TSMC 28nm HPM process, as well as a smaller 5mm x 5mm package configuration, it’s possible that a different, more refined 28nm VPU powers the Movidius NCS. In any case, it was mentioned that the 1W power envelope applies to the Myriad 2 VPU, and that in certain complex cases, the NCS may operate within a 2.5W power envelope.
Ecosystem Transition: From Google’s Project Tango to Movidius, an Intel Company
Close followers of Movidius and the Myriad SoC family may recall Movidius’ previous close ties with Google, having announced a partnership with Myriad 1 in 2014, culminating in the Myriad 1’s appearance in Project Tango. Further agreements in January 2016 saw Google sourcing Myriad processors and Movidius’ entire software development environment in return for Google contributions to Movidius’ neural network technology roadmap. In the same vein, the original Fathom NCS also supported Google’s TensorFlow, in contrast to the Movidius NCS, which is only launching with Caffe support.
Update (10/11/17): Movidius has announced and released a new version of their Neural Compute Software Development Kit (v1.09.00) that brings TensorFlow 1.3 support to the NCS.
As an Intel subsidiary, Movidius has unsurprisingly shifted into Intel’s greater deep learning and AI ecosystem. On that matter, Intel’s acquisition announcement explicitly linked Movidius with Intel RealSense (which also found its way into Project Tango) and computer vision endeavors; though explicit Movidius integration with RealSense is yet to be seen – or if in the works, made public. In the official Movidius NCS news brief, Intel does describe Movidius fitting into Intel’s portfolio as an inference device, while training and optimizing neural networks falls to the Nervana cloud and Intel's new Xeon Scalable processors respectively. To be clear, this doesn’t preclude Movidius NCS compatibility with other devices, and to that effect Mr. Brown commented: “If the network has been described in Caffe with the supported layer types, then we expect compatibility, but we also want to make clear that NCS is agnostic to how and where the network was trained.”
On a more concrete note, Movidius has a working demonstration of a Xeon/Nervana/Caffe/NCS workflow, where an end-to-end workflow of a Xeon-based training scheme generates a Caffe network optimized by Nervana’s Intel Caffe format, which is then deployed via NCS. Movidius plans to debut this demo at Computer Vision and Pattern Recognition (CVPR) conference in Honolulu, Hawaii later this week. In general, Movidius and Intel promise to have plenty to talk about in the future, where Mr. Brown comments: “We will have more to share about technical integrations later on, but we are actively pursuing the best end-to-end experience for training through to deployment of deep neural networks.”
Upcoming News and NCS Demos at CVPR
Alongside the Xeon/Caffe/Nervana/NCS workflow demo, Movidius has a slew of other things to showcase at CVPR 2017. Interestingly, Intel has described their presentations and demos as two separate Movidius and RealSense affairs, implying that the aforementioned Movidius/RealSense unification is still in the works.
For Movidius, Intel describes three demonstrations: “SDK Tools in Action,” “Multi-Stick Neural Network Scaling,” and “Multi-Stage Multi-Task Convolutional Neural Network (MTCNN).” The first revolves around the Movidius Neural Compute SDK and the platform API. The multi-stick demo showcases 4 Movidius NCS’ in accelerating object recognition. Finally, the third demo showcases Movidius NCS support for MTCNN, “a complex multi-stage neural network for facial recognition.” Meanwhile, Intel is introducing the RealSense D400 series, a depth-sensing camera family
The multi-stick demo is presumably what the company mentioned as a multi-stick demo that has been validated on three different host platforms: desktop CPU, laptop CPU, and a low-end SoC. The company also has a separate acceleration demo, where the Movidius NCS accelerates a Euclid developer module and offloads the CPUs, “freeing up the CPU for other tasks such as route planning or running application-level tasks.” The result is around double the framerate and a two-thirds power reduction.
All-in-all, Intel sees and outright states that they consider the Movidius NCS to be a means towards democratizing deep learning application development. As recent as this week, we’ve seen a similar approach as Intel’s recent 15.46 integrated graphics driver brought support for CV and AI workload acceleration on Intel integrated GPUs, tying in with Intel’s open source Compute Library for Deep Neural Networks (clDNN) and associated Computer Vision SDK and Deep Learning Deployment Toolkits. On a wider scale, Intel has already publicly positioned itself for deep learning in edge devices by way of their ubiquitous iGPUs, and Intel’s ambitions are highlighted by its recent history of machine learning and autonomous automotive oriented acquisitions: MobilEye, Movidius, Nervana, Yogitech, and Saffron.
As Intel pushes forward with machine learning development by way of edge devices, it will be very interesting to see how their burgeoning ecosystem coalesces. Like the original Fathom, the Movidius NCS is aimed at lowering the barriers to entry, and as the Fathom launch video supposes, a future where drones, surveillance cameras, robots, and any device can be made smart by “adding a visual cortex” that is the NCS.
With that said, however, technology is only half the challenge for Intel. Neural network inference at the edge is a popular subject for a number of tech companies, all of whom are jockeying for the lead position in what they consider a rapidly growing market. So while Intel has a strong hand with their technology, success here will mean that they need to be able to break into this new market in a convincing way, which is something they've struggled with in past SoC/mobile efforts. The fact that they already have a product stack via acquisitions may very well be the key factor here, since being late to the market has frequently been Intel's Achilles' heel in the past.
Wrapping things up, the Movidius NCS is now available for purchase for a MSRP of $79 through select distributors, as well as at CVPR.
Source: Movidius
38 Comments
View All Comments
ddriver - Friday, July 21, 2017 - link
"1watt $80 stick is about 33% of the performance of a $5500 GPU and consumes 1/250th the power. "Just to illustrate how silly this statement is, and how poor this product's value is, let's compare it to an arm single board computer.
The ODROID-XU4 comes with 2 gigs of ram and a GPU that is capable of 102 gflops compute. Aside from having the performance, it can also run applications, graphics and whatnot. All this for 60 bucks.
Funny thing that intel needed to make a 400 million $ acquisition to essentially get a purpose specific 16 bit CPU. Maybe it has been so long since intel made 16 bit chips they no longer remember how to do it?
tipoo - Friday, July 21, 2017 - link
How is 100Gflops 33% of the power of a 5500 dollar GPU?Kvaern1 - Thursday, July 20, 2017 - link
"That means targeting use cases where the latency of going to a server would be too great, a high-performance CPU too power hungry, or where privacy is a greater concern."tuxRoller - Thursday, July 20, 2017 - link
It's an asic. A gpgpu is a general purpose processing unit. Nvidia realizes this which is why they've started to offer NN specific compute areas with their latest offerings.It's difficult to say if this is slower than an $80 gpu, but it's far faster for inference and can be plugged into any laptop.
Too bad it looks to be only useful for inference and only supports caffe.
ddriver - Friday, July 21, 2017 - link
It is slower and less capable than a 30$ arm soc chip. Basically all of the high end arm socs released the last two years can do in excess of 100 gflops. And can do a bunch of other stuff on top of computing deep learning.tuxRoller - Saturday, July 22, 2017 - link
From the link to Tom's:"Movidius claimed that the Myriad 2 can hit up to 1,000 GFLOPS per watt[...]"
I await your demonstration proving your claims.
tuxRoller - Saturday, July 22, 2017 - link
Btw, this is an asic. It isn't intended to play Battlefield or let you watch Netflix. It's designed for one thing, and that's why it's efficient.tipoo - Thursday, July 20, 2017 - link
Look at what happened to Bitcoin mining on GPUs vs ASICs, for an idea of how this works. Silicon built *specifically to one application* will always be faster/cheaper/more efficient than something that can do a lot of things, like a CPU or GPU.This is a fixed function neural network, it does neural networks and that's it, no other functions like a GPU.
Yojimbo - Thursday, July 20, 2017 - link
Yes, most edge inferencing will be done on either ARM or on an ASIC. It is far from clear, however, that other aspects of neural network computing, such as data center inferencing and training, will be done on ASICs. That is because there will probably be a lot of mixed workload usage of neural networks, e.g., neural networks used in conjunction with simulation or analytics. An ASIC for inferencing or training/inferencing a neural network will not be able to carry out the other workloads, unlike GPUs. Whatever power saving and performance are gained by having the ASIC run the neural network will be given back when transferring the data to some other processor to complete the workload.Yojimbo - Thursday, July 20, 2017 - link
What it comes down to is that the GPU is already very efficient at the number crunching needed in neural networks. The inefficiency comes from memory access, or more precisely, data flow. However, to optimize the data flow for a CNN it seems one gives up doing much else other than training or inferencing CNNs.