Answered by the Experts: ARM's Cortex A53 Lead Architect, Peter Greenhalgh

Author: Anand Lal Shimpi

by Anand Lal Shimpi on December 17, 2013 11:56 PM EST

Posted in
Ask the Experts

Question from mrdude

A few questions:

When can we expect an end to software-based DVFS scaling? It seems to me to be the biggest hurdle in the ARM-sphere towards higher single-threaded performance.

The current takes on your big.LITTLE architecture have been somewhat suboptimal (the Exynos cache flush as an example), so what can we expect from ARM themselves to skirt/address these issues? It seems to me to be a solid approach given the absolutely minuscule power and die budget that your LITTLE cores occupy, but there are still the issues of software and hardware implementation before it becomes widely accepted.

Though this question might be better posed to the GPU division, are we going to be seeing unified memory across the GPU and CPU cores in the near future? ARM joining HSA seems to point to a more coherent hardware architecture and programming emphasis.

Pardon the grammatical errors as I'm typing this on my phone. Big thanks to Anand and Peter.

Answer

Hi Mrdude,

While there are platforms that use hardware event monitors to influence DVFS policy, this usually sits underneath a Software DVFS framework. Software DVFS is powerful in that it has a global view of all activity across a platform over time, whereas Hardware DVFS relies on building up a picture from lots of individual events which have little to no relationship with one another. As an analogy, Software DVFS is like directing traffic from a helicopter with a very clear view of what is happening on all roads in a city (but with greater latency when forcing a change), whereas Hardware DVFS is like trying to pool information from hundreds of traffic cops, each feeding in traffic information from their own street corner. A traffic cop might be able to change traffic flow on their street corner, but it may not be the best policy for the traffic in the city!

Like all things in life, there are trade-offs, with neither approach being perfect in all situations, and hardware DVFS solutions rely on the Software DVFS helicopter too.
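
To make the helicopter analogy a little more concrete, here is a minimal Python sketch of a software DVFS policy that decides from a global view of per-core utilisation. It is loosely modelled on a utilisation-threshold governor; the frequency table, thresholds and function name are assumptions made up for illustration and do not describe any ARM or partner implementation.

```python
# Hypothetical frequency steps in MHz; real platforms define their own operating points.
FREQ_STEPS_MHZ = [200, 400, 800, 1200, 1600]


def next_frequency(current_mhz, utilisations, up_threshold=0.80, down_threshold=0.30):
    """Pick the next frequency from per-core utilisation values (0.0 to 1.0).

    The policy looks at the busiest core: it jumps straight to the top step
    when that core is above up_threshold, steps down one level when it is
    below down_threshold, and otherwise holds the current frequency.
    """
    busiest = max(utilisations)
    index = FREQ_STEPS_MHZ.index(current_mhz)
    if busiest > up_threshold:
        return FREQ_STEPS_MHZ[-1]
    if busiest < down_threshold and index > 0:
        return FREQ_STEPS_MHZ[index - 1]
    return current_mhz


# One busy core is enough to raise the shared frequency.
print(next_frequency(800, [0.10, 0.95, 0.20, 0.05]))  # 1600
# A mostly idle system steps down one level at a time.
print(next_frequency(800, [0.10, 0.20, 0.05, 0.05]))  # 400
```

The point of the sketch is only that the decision is taken once, centrally, from all the utilisation figures together, rather than by each core reacting to its own local events.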

Question from Try-Catch-Me

What do you have to do to get into chip design? Is it really difficult to get into companies like ARM?

Answer

An Engineering degree in electronics and/or software for a start. Passion for micro-architecture & architecture certainly helps! :)

Question from mercury555

Peter:

What emotion comes to mind on the fact that ARM wishes to forget the big.LITTLE with a 64 bit equivalent of A12 limited to a Quad-Core configuration for consumer electronics?

Thanks.

Answer

Hi Mercury,

ARM continues to believe in big.LITTLE which is why we improved on interoperability in the Cortex-A53 and Cortex-A57 generation of processors. In future processor announcements I’m sure you’ll see our continued focus on big.LITTLE as a key technology that enables best possible energy efficiency.

Question from mrtanner70

1. We don't seem to have quite seen the promised power savings for big.LITTLE yet (thinking of the Exynos 5420 in particular since it has HMP working, though I'm not sure any devices have the correct Linux kernel yet). Are you still as bullish on this aspect of the big.LITTLE story?

2. Are there particular synergies to using Mali with the CCI vs. other brands of GPU?

3. What is your general response to the criticism of big.LITTLE that has come out of Intel and Qualcomm? Intel, in particular, tends to argue dynamic frequency scaling is a better approach.

Cheers

Answer

Hi MrTanner,

In answer to (3), DVFS is complementary to big.LITTLE, not a replacement for it.

A partner building a big.LITTLE platform can use DVFS across the same voltage and frequency range as another vendor on the same process with a single processor. The difference is that once the voltage floor of the process is reached with the 'big' processor, the thread can be rapidly switched to the 'LITTLE' processor, further increasing energy efficiency.

Mobile workloads have an extremely large working range, from gaming and web browsing through to screen-off updates. The challenge with a single processor is that it must compromise between absolute performance at the high end and energy efficiency at the low end. However, a big.LITTLE solution allows the big processor to be implemented for maximum performance since it will only be operating when needed. Conversely, the LITTLE processor can be implemented for best energy efficiency.
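
As a rough illustration of how DVFS and big.LITTLE combine, the Python sketch below picks the lowest-power operating point that still meets a performance target, using the common C * V^2 * f approximation for dynamic power. Every number in it (frequencies, voltages, relative performance and capacitance) is hypothetical and chosen only to show the shape of the trade-off; none of it is real silicon data.

```python
# Hypothetical operating points as (frequency in MHz, voltage in V). Not real silicon data.
CORES = {
    "LITTLE": {"perf_per_mhz": 1.0, "capacitance": 1.0,
               "points": [(400, 0.90), (800, 0.95), (1200, 1.05)]},
    "big": {"perf_per_mhz": 2.0, "capacitance": 4.0,
            "points": [(600, 0.90), (1200, 1.00), (1800, 1.15)]},
}


def dynamic_power(capacitance, freq_mhz, volts):
    """Relative dynamic power using the C * V^2 * f approximation."""
    return capacitance * volts ** 2 * freq_mhz


def pick_operating_point(required_perf):
    """Return (core, freq_mhz, volts) with the lowest modelled power meeting required_perf."""
    best = None
    for name, core in CORES.items():
        for freq_mhz, volts in core["points"]:
            if core["perf_per_mhz"] * freq_mhz < required_perf:
                continue  # this point is too slow for the requested work
            power = dynamic_power(core["capacitance"], freq_mhz, volts)
            if best is None or power < best[0]:
                best = (power, name, freq_mhz, volts)
    if best is None:
        raise ValueError("no operating point meets the requested performance")
    return best[1:]


print(pick_operating_point(300))   # light load runs on the LITTLE core
print(pick_operating_point(1500))  # heavier load needs the big core
```

In this toy model the big core cannot go below its lowest voltage point, so once demand drops far enough the cheapest choice is always a LITTLE operating point, which is the behaviour described above.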

Question from sverre_j

I have just been through the extensive ARMv8 instruction set (and there must be several hundred instructions in total), so my question is whether ARM believes that compilers, such as GCC, can be set up to take advantage of most of the instruction set, or whether one will still depend on assembly coding for a lot of the advanced stuff.

Answer

Hi Sverre,

The AArch64 instruction set in the ARMv8 architecture is simpler than the ARMv7 instruction set. The AArch32 instruction set in the ARMv8 architecture is an evolution of ARMv7 with a few extra instructions. From this perspective, just as compilers such as GCC can produce optimised code for the billions of ARMv7 devices on the market, I don’t see any new challenge for ARMv8 compilers.

Question from iwod

1. MIPS - Opinions On it against ARMv8 ?

2. I Quote

"There is nothing worse than scrambled bytes on a network. All Intel implementations and the vast majority of ARM implementations are little endian. The vast majority of Power Architecture implementations are big endian. Mark says MIPS is split about half and half – network infrastructure implementations are usually big endian, consumer electronics implementations are usually little endian. The inference is: if you have a large pile of big endian networking infrastructure code, you’ll be looking at either MIPS or Power. "

How true is that? And if true, does ARM have any bigger plans to tackle this problem? Obviously there are huge opportunities now that SDN is exploding.

3. Thoughts on the current integration of IP (ARM), implementer (Apple/Qualcomm) and fab (TSMC), especially on the speed of execution? Previously it would take years for any IP maker to get from announcement to something on the market. We are now seeing Apple coming in much sooner, and Qualcomm is also well ahead of ARM's projected schedule for 64-bit SoCs in terms of shipment date.

4. Thoughts on Apple's implementation of ARMv8?

5. Thoughts on Economy of Scale in Fab and Nodes. Post 16/14nm and 450mm wafers. Development Cost etc. How would that impact ARM?

6. By having a Pure ARMv8 implementation and Not supporting the older ARMv7. How much, in terms of % transistor does it save?

7. What technical hurdles do you see for ARM in the near future?

Answer

Hi iwod,

Addressing question-2, all ARM architecture and processor implementations support big and little endian data. There is an operating system visible bit that can be changed dynamically during execution.
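
For readers less familiar with the "scrambled bytes" problem in the quote, this small Python sketch shows the same 32-bit value laid out in little-endian and big-endian byte order, and what happens when big-endian bytes are read as little-endian. It uses only the standard `struct` module and says nothing about how the ARM endianness control itself is programmed.

```python
import struct
import sys

value = 0x12345678

little = struct.pack("<I", value)  # least significant byte first
big = struct.pack(">I", value)     # most significant byte first (network order)

print(little.hex())   # 78563412
print(big.hex())      # 12345678
print(sys.byteorder)  # byte order of the machine running this script

# Reading big-endian bytes as if they were little-endian gives the scrambled value.
scrambled = struct.unpack("<I", big)[0]
print(hex(scrambled))  # 0x78563412
```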

On question-6, certainly an AArch64-only implementation would save a few transistors compared to an ARMv8 implementation supporting both AArch32 and AArch64. However, it is probably not as much as you think, and it is very dependent on the micro-architecture, since the proportion of decode (or AArch32-specific gates) will be less in a wide OOO design than in an in-order design. For now, code compatibility with the huge number of applications written for Cortex-A5, Cortex-A7, Cortex-A9, etc. is more important.

Question from ciplogic

* What are the latencies in CPU cycles for the CPU caches? Is it possible in future to create a design that uses a shared L3 cache?

* How many general purpose CPU registers are in Cortex-A53 compared with its predecessors?

* Can we expect Cortex-A53 to be part of netbooks in the years to come? What about micro-servers?

Answer

Hi Ciplogic,

While not yet in mobile, ARM already produces solutions with L3 caches, such as our CCN-504 and CCN-508 products, which Cortex-A53 (and Cortex-A57) can be connected to.

Since Cortex-A53 is an in-order, non-renamed processor, the number of integer general purpose registers in AArch64 is 31, the same as specified by the architecture.

Question from secretmanofagent

Getting away from the technical questions, I'm interested in these two.

ARM has been used in many different devices, what do you consider the most innovative use of what you designed, possibly something that was outside of how you envisioned it originally being used?

As a creator, which devices made you look at what you created and feel the most pride?

Answer

Hi,

I'd suggest all of us who work for ARM are proud that the vast majority of mobile devices use ARM technology!

Some of the biggest innovations with ARM devices are coming in the Internet of Things (IoT) space, which isn't as technically complex from a processor perspective as high-end mobile or server, but is a space that will certainly affect our everyday lives.

Question from wrkingclass_hero

What is ARM's most power efficient processing core? I don't mean using the least power, I mean work per watt. How does that compare to Intel and IBM? Also, I know that ARM is trying to grow in the server market, given the rise of the GPGPU market, do you foresee ARM leveraging their MALI GPUs for this in the future? Finally, does ARM have any interest or ambition in scaling up to the desktop market?

Answer

Hi wrkingclass_hero,

In the traditional applications class, Cortex-A5, Cortex-A7 and Cortex-A53 have very similar energy efficiency. Once a micro-architecture moves to Out-of-Order and increases the ILP/MLP speculation window and frequency there is a trade-off of power against performance which reduces energy efficiency. There’s no real way around this as higher performance requires more speculative transistors. This is why we believe in big.LITTLE as we have simple (relatively) in-order processors that minimise wasted energy through speculation and higher-performance out-of-order cores which push single-thread performance.

Across the entire portfolio of ARM processors, a good case could be made for Cortex-M0+ being the most energy efficient processor, depending on the workload and the power in the system around the Cortex-M0+ processor.

Question from Xajel

When running 32-bit apps on a 64-bit OS, is there any performance hit compared to 64-bit apps on a 64-bit OS?

And from an IPC/Watt perspective, how is A53/A57 doing compared to A7/A15? I mean, how much more performance will we get at the same power usage compared to A7/A15, talking about the whole platform (memory included)?

Answer

The performance per watt (energy efficiency) of Cortex-A53 is very similar to Cortex-A7. Certainly within the variation you would expect with different implementations. Largely this is down to learning from Cortex-A7 which was applied to Cortex-A53 both in performance and power.

Question from /pigafetta

Is ARM thinking of adding hardware transactional memory instructions, similar to Intel's TSX-NI?

And would it be possible to design a CPU with an on-chip FPGA, where a program could define its own instructions?

Answer

Hi pigafetta,

ARM has an active architecture research team which, as I'm sure you would expect, looks at all new architectural developments.

It would be possible to design a CPU with on-chip FPGA (after all, most things in design are possible), but the key to a processor architecture is code compatibility so that any application can run on any device. If a specific instruction can only run on one device it is unlikely to be taken advantage of by software since the code is no longer portable. If you look at the history of the ARM architecture it's constantly evolved with new instructions added to support changes in software models. These instructions are only introduced after consultation with the ARM silicon and software partners.

You may also be interested in recent announcements concerning Cortex-A53 implemented on an FPGA. This allows standard software to run on the processor, but provides flexibility around the other blocks in the system.

Question from kenyee

How low a speed can the ARM chips be underclocked?

i.e., what limits the lowest speed?

Answer

Hi Kenyee,

If you wished to clock an ARM processor at a few kHz you could. Going slower is always possible!

Question from Alpha21264

Can you talk a bit about your personal philosophy regarding pipeline lengths, as the A53 and A57 diverge significantly on the subject? Too short and it's difficult to implement goodness like a scheduler, but as you increase the length you also contribute to design bloat: you need large branch target arrays with both global and local history to avoid stalls, more complicated redirects in the decoder and execution units to avoid bubbles, and generally just more difficult loops to converge in your design. Are you pleased with the pipeline in the A53, and what do you see happening with the pipeline in both the big cores and the little ones going forward (I anticipate a vague answer on this one, but that's not going to stop me from asking)?

Answer

Hi Alpha,

I'd expect my view of pipeline lengths to be similar to most other micro-architects. The design team have to balance the shortest possible pipeline to minimise branch mis-prediction penalty and wasted pipelining of control/data against the gates-per-cycle needed to hit the frequency target. Balance being the operative word as the aim is to have a similar amount of timing pressure on each pipeline stage since there's no point in having stages which are near empty (unless necessary due to driving long wires across the floorplan) and others which are full to bursting.

Typically a pipeline is built around certain structures taking a specific amount of time. For example you don't want an ALU to be pipelined across two cycles due to the IPC impact. Another example would be the instruction scheduler where you want the pick->update path to have a single-cycle turnaround. And L1 data cache access latency is important, particularly in pointer chasing code, so getting a good trade-off against frequency & the micro-architecture is required (a 4-cycle latency may be tolerable on a wide OOO micro-architecture which can scavenge IPC from across a large window, but an in-order pipeline wants 1-cycle/2-cycle).

We're pretty happy with the 8-stage (integer) Cortex-A53 pipeline and it has served us well across the Cortex-A53, Cortex-A7 and Cortex-A5 family. So far it's scaled nicely from 65nm to 16nm and frequencies approaching 2GHz so there's no reason to think this won't hold true in the future.
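
The balance between branch mis-prediction penalty and frequency described above can be sketched with a very simple CPI model. This is a textbook-style approximation, not ARM's methodology, and the frequencies, penalties and branch statistics below are hypothetical values chosen only to show how the answer can flip.

```python
def time_per_instruction_ns(freq_ghz, base_cpi, branch_fraction, mispredict_rate, penalty_cycles):
    """Average time per instruction in ns with a simple branch-penalty CPI model."""
    cpi = base_cpi + branch_fraction * mispredict_rate * penalty_cycles
    return cpi / freq_ghz


def faster_design(mispredict_rate):
    """Compare a hypothetical short pipeline against a hypothetical deeper one."""
    # Short pipeline: lower clock, 8-cycle mispredict penalty.
    short_pipe = time_per_instruction_ns(1.5, 1.0, 0.20, mispredict_rate, 8)
    # Deeper pipeline: higher clock, 15-cycle mispredict penalty.
    deep_pipe = time_per_instruction_ns(1.7, 1.0, 0.20, mispredict_rate, 15)
    return "short" if short_pipe < deep_pipe else "deep"


print(faster_design(0.05))  # well-predicted branches favour the deeper, faster-clocked pipeline
print(faster_design(0.20))  # poorly predicted branches favour the shorter pipeline
```

With well-predicted branches the extra frequency wins; as the mis-prediction rate rises, the longer refill penalty eats the gain, which is why the pipeline depth has to be balanced against the frequency target.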

20 Comments

Exophase - Wednesday, December 18, 2013 - link
Wonderful article. Thank you very much for your time and information.
syxbit - Wednesday, December 18, 2013 - link
Great answers!
It's too bad none of the questions about A15 losing so badly to Krait and Cyclone were brought up.
ciplogic - Wednesday, December 18, 2013 - link
It was about Cortex A53. Also, the answers were politically neutral (as they should be), as politics and other companies' future development are not the engineer's talk. Maybe an engineer from Qualcomm could answer accurately.
lmcd - Wednesday, December 18, 2013 - link
Krait 200 is way worse than A15. If A15 revisions come in then A15 could easily keep pace with Krait. But idk if ARM does those.

Cyclone is an ARM Cortex-A57 competitor.
Wilco1 - Wednesday, December 18, 2013 - link
A15 has much better IPC than Krait (which is why in the S4 Krait needs 2.3GHz to get similar performance to A15 at just 1.6GHz). The only reason Krait can keep up at all is because it uses 28nm HPM, which allows for much higher frequencies.
ddriver - Wednesday, December 18, 2013 - link
Really? The Note3 with Krait is pretty much neck and neck with the Exynos octa version at 1.9, which was an A15 design last time I checked.
Wilco1 - Wednesday, December 18, 2013 - link
Sorry, it was 1.6 vs 1.9GHz in the S4 and 1.9 vs 2.3GHz in the Note3. Both are pretty much matched on performance, so Krait needs ~20% higher clock.
cmikeh2 - Wednesday, December 18, 2013 - link
I don't know if those are normalized for actual power consumption, although we have to deal with different process technologies as well. Good IPC is pretty much meaningless in this segment if it requires ridiculous voltages to hit the frequencies it needs to.
Wilco1 - Thursday, December 19, 2013 - link
It does seem Samsung had some issues with its process indeed. NVidia was able to reach higher frequencies easily at low power. I haven't seen detailed power consumption comparisons between the 2 variants of S4 and N3 at load, but there certainly is a power cost to pushing your CPU to high frequencies (high voltages indeed!), so having better IPC helps.
twotwotwo - Wednesday, December 18, 2013 - link
Not even sure I'd count A15 out yet. I have a vague impression that power draw is part of why it didn't get more wins; if so, the next process gen might help with that. Folks on AT will have moved on to thinking about 64-bit chips by then, but as Peter put it there will be plenty of lower-end sockets left to fill.

Also, there was an A15 in the most popular ARM laptop yet (the Exynos in the Chromebook) so at least it got one really neat win. :)
