Cortex-M7 Launches: Embedded, IoT and Wearables
by Stephen Barrett on September 23, 2014 7:01 PM EST
The Cortex-M7 CPU
The primary focus of the Cortex-M7 is improved performance. ARM's goal was to elevate M series performance to a level previously unseen, while maintaining the M series' signature small die size and tiny power consumption. There are at least two reasons ARM focused on performance for the M7 processor. First, ARM wants to drive a further wedge between the M series and traditional 8- and 16-bit microcontrollers, giving itself a more differentiated market position; second, the M7 will help support the IoT (Internet of Things) and wearable device markets. With its focus on enhanced DSP capabilities, the M7 is better suited to audio and visual sensor hub processing than any previous M series design.
Digging into the details, the Cortex-M7 features a six-stage, in-order, dual-issue superscalar pipeline with single- and double-precision floating point units, instruction and data caches, branch prediction, SIMD support, and tightly coupled memory.
[Figure: high-level view of the Cortex-M7 pipeline]
The presence of instruction and data caches, branch prediction, and tightly coupled memory differentiates the M7 from previous M series processors. Microcontrollers often forgo caches and sometimes even operate with flash as the only memory interface. By providing high-performance instruction and data caches, the M7 approaches more typical high-performance processor design.
Tightly coupled memory (TCM) is a technology ARM's partners can use to extend the effective caching of a single M7 processor; it has previously been seen only in A and R series designs. In use, it can have the performance of a cache but, unlike a cache, its contents are directly controlled by the developer. That is, TCM is part of the physical memory map of the microcontroller. Developers can place critical code and data inside TCM, where it can be accessed deterministically and with high performance in routines such as interrupt service routines. The M7 supports up to 16 MB of tightly coupled memory.
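As a rough illustration of what "directly controlled by the developer" can look like in practice, here is a minimal C sketch of pinning an interrupt-time routine and its buffer to TCM. Everything named here is an assumption for illustration: the section names `.itcm_text` and `.dtcm_data`, the macros `IN_ITCM` and `IN_DTCM`, and the handler `adc_sample_handler` are hypothetical, and the placement only takes effect if a project's linker script maps those sections onto the TCM address ranges of a particular MCU. On a non-ARM host the macros expand to nothing, so the code still builds as ordinary C.

```c
#include <stdint.h>

/* Hypothetical section names (not from ARM or any vendor SDK). They only
 * place anything in TCM if the linker script maps them to the TCM region.
 * On non-ARM hosts the macros are empty so this compiles as plain C. */
#if defined(__GNUC__) && defined(__arm__)
#define IN_ITCM __attribute__((section(".itcm_text")))
#define IN_DTCM __attribute__((section(".dtcm_data")))
#else
#define IN_ITCM
#define IN_DTCM
#endif

#define RING_SIZE 8u

/* Data touched at interrupt time lives in (assumed) data TCM. */
IN_DTCM static volatile uint16_t sample_ring[RING_SIZE];
IN_DTCM static volatile uint32_t ring_head;

/* Hypothetical helper meant to be called from an ADC interrupt handler.
 * Placing it in (assumed) instruction TCM avoids cache misses and flash
 * wait states, so its timing stays deterministic. */
IN_ITCM void adc_sample_handler(uint16_t raw_sample)
{
    sample_ring[ring_head % RING_SIZE] = raw_sample;
    ring_head++;
}

/* Number of samples stored so far. */
uint32_t samples_received(void)
{
    return ring_head;
}

/* Most recently stored sample; only meaningful after at least one sample. */
uint16_t latest_sample(void)
{
    return sample_ring[(ring_head - 1u) % RING_SIZE];
}
```

The point of the sketch is the contrast with a cache: nothing here is evicted or refilled behind the developer's back, because the buffer and the routine sit at fixed addresses chosen at link time.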
Adding branch prediction allows ARM to target dedicated DSP devices with its Cortex-M7 microcontroller. DSP code often consists of filters over analog data streams, for applications such as audio input keyword detection, audio output equalization, and frequency domain amplitude peak searching. When running on an always-on microcontroller, these tasks are almost always looped. Without a branch predictor, the code must continually evaluate a loop condition that 99.9% of the time results in the same outcome. Branch predictors cost extra die space, but when DSP is your target, they are an obvious design benefit.
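To make the looping argument concrete, here is a minimal sketch of the kind of inner loop such filters spend their time in: a direct-form FIR filter over Q15 fixed-point samples. The function name `fir_q15` and the Q15 format are illustrative assumptions, not something taken from ARM's material or from any DSP library.

```c
#include <stddef.h>
#include <stdint.h>

/* Direct-form FIR filter: one output sample from `taps` coefficients and
 * the same number of past input samples, all in Q15 fixed point.
 * Hypothetical helper for illustration only. */
int32_t fir_q15(const int16_t *coeffs, const int16_t *history, size_t taps)
{
    int64_t acc = 0;

    /* The loop condition holds for every tap and fails only once, on
     * exit. That highly repetitive branch is exactly the pattern a
     * branch predictor learns almost immediately. */
    for (size_t i = 0; i < taps; i++) {
        acc += (int32_t)coeffs[i] * (int32_t)history[i];
    }

    /* Q15 x Q15 products are Q30; divide by 2^15 to return to Q15. */
    return (int32_t)(acc / 32768);
}
```

With a 1,000-tap filter the loop condition is evaluated 1,001 times and goes the same way 1,000 of them, which is where a figure like 99.9% comes from.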
The M series cores can be summarized both by instruction features and by die size and performance. Unfortunately ARM, which provides HDL (Hardware Description Language) that can be synthesized into physical chips, is not willing to provide die size numbers until its partners make their Cortex-M7 announcements, since the processor does not become physical until a partner gets involved. Until a partner releases data, we can only assume the M7 is somewhat larger than its predecessors.
ARM Cortex-M Instruction Sets
Feature | M0 | M0+ | M3 | M4 | M7
Thumb | Most | Most | Entire | Entire | Entire
Thumb-2 | Subset | Subset | Entire | Entire | Entire
Hardware multiply | 1 or 32 cycles | 1 or 32 cycles | 1 cycle | 1 cycle | 1 cycle
Hardware divide | No | No | Yes | Yes | Yes
Saturated math | No | No | Yes | Yes | Yes
DSP extensions | No | No | No | Yes | Yes, enhanced
Floating-point | No | No | No | Optional single precision | Yes
Tightly coupled memory | No | No | No | No | Yes
Architecture | ARMv6-M | ARMv6-M | ARMv7-M | ARMv7-M | ARMv7-M
Cache architecture | Von Neumann | Von Neumann | Harvard | Harvard | Harvard
ARM Cortex-M Area, Power, Performance
Metric | M0 | M0+ | M3 | M4 | M7
90nm LP dynamic power (µW/MHz) | 16 | 9.8 | 32 | 33 | n/a
90nm LP area (mm²) | 0.04 | 0.035 | 0.12 | 0.17 | n/a
40nm G dynamic power (µW/MHz) | 4 | 3 | 7 | 8 | n/a
40nm G area (mm²) | 0.01 | 0.009 | 0.03 | 0.04 | n/a
Dhrystone (official) DMIPS/MHz | 0.84 | 0.94 | 1.25 | 1.25 | 2.14
Dhrystone (max options) DMIPS/MHz | 1.21 | 1.31 | 1.89 | 1.95 | 3.23
CoreMark/MHz | 2.33 | 2.42 | 3.32 | 3.40 | 5.04
ARM did state that the M7's performance per mW is roughly in line with previous cores. Given the per-MHz performance gains in the table above, we can therefore estimate that the M7 draws roughly 50% to 75% more power per MHz than its predecessors. Area is anyone's guess at the moment.
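The arithmetic behind that estimate can be sketched in a few lines of C. This is only a back-of-the-envelope model under the stated assumption that performance per mW stays constant, so power per MHz scales with performance per MHz; the function names `percent_gain` and `scaled_power` are made up for illustration, and the result is an estimate, not a figure published by ARM.

```c
/* Relative gain of a new per-MHz score over a baseline, in percent. */
double percent_gain(double new_score, double baseline_score)
{
    return (new_score / baseline_score - 1.0) * 100.0;
}

/* Assumption: performance per mW is unchanged, so power per MHz grows by
 * the same factor as performance per MHz. Returns the scaled power in the
 * same unit as `baseline_power` (for example, uW/MHz). */
double scaled_power(double baseline_power, double new_score,
                    double baseline_score)
{
    return baseline_power * (new_score / baseline_score);
}
```

Plugging in the M4 and M7 rows from the table, `percent_gain` gives about 71% for official Dhrystone (2.14 vs. 1.25), about 66% for max-options Dhrystone (3.23 vs. 1.95), and about 48% for CoreMark (5.04 vs. 3.40), which lines up with the 50% to 75% range. Under the same assumption, `scaled_power(8.0, 2.14, 1.25)` would put a 40nm G M7 at roughly 13.7 µW/MHz, though that number is purely an extrapolation.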
43 Comments
Stephen Barrett - Tuesday, September 23, 2014
ARM specifically advised against comparing performance numbers across architectures, saying it was an apples and oranges comparison.
Despite similar numbers in these very synthetic benchmarks, when running actual application code, the M7 will never compete with the A series. The numbers are only useful comparing within the family.
Wilco1 - Wednesday, September 24, 2014
Sure an M7 will never run Android, but that's not the point. v7-M and v7-A share the same Thumb-2 ISA and use the same compiler backend, so it's not apples/oranges. You can run the same binary on an M7 and A7 if you wish.
Due to its shorter pipeline, an M7 should beat an equivalently clocked A7. Of course an A7 can clock 2 times as high in 28nm so it wins in absolute performance. However that still means M7 is a huge leap from an M3/M4.
tuxRoller - Wednesday, September 24, 2014
The chart says 2.14-3.23 DMIPS/MHz, or 5.04 CoreMark/MHz, so Atom D525 or Core i5-2400 for CoreMark, or between Intel Pentium/Pentium Pro and Pentium 3.
http://www.eembc.org/coremark/index.php
https://en.wikipedia.org/wiki/Instructions_per_sec...
Flunk - Thursday, September 25, 2014
Intel Quark.
KlausWalter - Wednesday, September 24, 2014
For those who understand German, here is the best review so far, including the first implementation of the Cortex-M7 in a real MCU, the so-called STM32F7 family from STMicroelectronics: http://www.elektroniknet.de/halbleiter/mikrocontro... Google Translate may help.
uningenieromas - Wednesday, September 24, 2014
The three big guys in the MCU world (Atmel, Freescale and ST) are already working on Cortex-M7 processors, and I'm sure they already have internal implementations of the MCUs:
http://www.arm.com/about/newsroom/arm-supercharges...
Freescale already has it in its roadmap, and it's called Kinetis X.
jdesbonnet - Wednesday, September 24, 2014
Microchip is also a 'big guy' but is conspicuously absent from the ARM party. Which is a pity: I like their stuff, but I've now standardized on the ARM Cortex-M family of MCUs.
ah06 - Wednesday, September 24, 2014
So we expect to see this as the core of an SoC in wearables, and for high-end wearables ARM recommends a gimped A7. Well, what about the R series that falls right in between?
I'm still very fuzzy on the relative comparisons between the wearable processors. Maybe do an article comparing Aster platform (ARM7ESJ), Cortex-M4/M7, Cortex-Rx, Cortex-A7/A53
Thanks
FunBunny2 - Wednesday, September 24, 2014
-- the M series processors are considered microcontrollers and not application processors, mainly because they lack a memory management unit (MMU).
Well, then, the X86 wasn't a real CPU, because the MMU was off chip for rather a while. Not until the 386 was it really implemented.
hammer256 - Wednesday, September 24, 2014
So what is the target market of the M series and R series? Is the R series higher performance? Both lack an MMU, right? The distinction between the two lines is kinda blurry to me.