Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence

Name: Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence
Item: Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence
Author: Andrei Frumusanu

by Andrei Frumusanu on May 26, 2020 9:00 AM EST

192 Comments | Add A Comment

192 Comments

2019 was a great year for Arm. On the mobile side of things one could say it was business as usual, as the company continued to see successes with its Cortex cores, particularly the new Cortex-A77 which we’ve now seen employed in flagship chipsets such as the Snapdragon 865. The bigger news for the company over the past year however hasn’t been in the mobile space, but rather in the server space, where one can today rent Neoverse-N1 CPUs such as Amazon’s impressive Graviton2 chip, with more vendors such as Ampere expected to release their server products soon.

While the Arm server space is truly taking off as we speak, aiming to compete against AMD and Intel, Arm hasn't reached the pinnacle of the mobile market – at least, not yet. Arm’s mobile Cortex cores have lived in the shadow of Apple’s custom CPU microarchitectures over the past several years, as Apple has seemingly always managed to beat Cortex designs by significant amounts. While there’s certainly technical reasons to the differences – it was also a lot due to business rationale on Arm’s side.

Today for Arm’s 2020 TechDay announcements, the company is not just releasing a single new CPU microarchitecture, but two. The long-expected Cortex-A78 is indeed finally making an appearance, but Arm is also introducing its new Cortex-X1 CPU as the company’s new flagship performance design. The move is not only surprising, but marks an extremely important divergence in Arm’s business model and design methodology, finally addressing some of the company’s years-long product line compromises.

The New Cortex-A78: Doubling Down on Efficiency

The new Cortex-A78 isn’t exactly a big surprise – Arm had first publicly divulged the Hercules codename over two years ago when they had presented the company’s performance roadmap through 2020. Two years later, and here we are, with the Cortex-A78 representing the third iteration of Arm’s new Austin-family CPU microarchitecture, which had started from scratch with the Cortex-A76.

The new Cortex-A78 pretty much continues Arm’s traditional design philosophy, that being that it’s built with a stringent focus on a balance between performance, power, and area (PPA). PPA is the name of the game for the wider industry, and here Arm is pretty much the leading player on the scene, having been able to provide extremely competitive performance at with low power consumption and small die areas. These design targets are the bread & butter of Arm as the company has an incredible range of customers who aim for very different product use-cases – some favoring performance while some other have cost as their top priority.

All in all (we’ll get into the details later), the Cortex-A78 promises a 20% improvement in sustained performance under an identical power envelope. This figure is meant to be a product performance projection, combining the microarchitecture’s improvements as well as the upcoming 5nm node advancements. The IP should represent a pretty straightforward successor to the already big jump that were the A76 and A77.

The New Cortex-X1: Breaking the Design Constraint Chains

Arm’s existing business model was aimed at trying to create a CPU IP that covers the widest range of customer needs. This creates the problem that you cannot hyper-focus on any one area of the PPA triangle without making compromises in the other two. I mentioned that Arm’s CPU cores have for years lived in the shadow of Apple’s CPU cores, and whilst for sure, the Apple's cores were technical superior, one very large contributing factor in Arm's disadvantage was that the business side of Arm just couldn’t justify building a bigger microarchitecture.

As the company is gaining more customers, and is ramping up R&D resources for designing higher performance cores (with the server space being a big driver), it seems that Arm has finally managed to get to a cross-over point in their design abilities. The company is now able to build and deliver more than a single microarchitecture per year. In a sense, we sort of saw the start of this last year with the introduction of the Neoverse-N1 CPU, already having some more notable microarchitectural changes over its Cortex-A76 mobile sibling.

Taking a quick look at the new Cortex-X1, we find the X1 higher up in Arm’s Greek pantheon family tree of CPU microarchitectures. Codenamed Hera, the design at least is named similarly to its Hercules sibling, denominating their close design relationship. The X1 is much alike the A78 in its fundamental design – in fact both CPUs were created by the same Austin CPU design team in tandem, but with the big difference that the X1 breaks the chains on its power and area constraints, focusing to get the very best performance with very little regard to the other two metrics of the PPA triangle.

The Cortex-X1 was designed within the frame of a new program at Arm, which the company calls the “Cortex-X Custom Program”. The program is an evolution of what the company had previously already done with the “Built on Arm Cortex Technology” program released a few years ago. As a reminder, that license allowed customers to collaborate early in the design phase of a new microarchitecture, and request customizations to the configurations, such as a larger re-order buffer (ROB), differently tuned prefetchers, or interface customizations for better integrations into the SoC designs. Qualcomm was the predominant benefactor of this license, fully taking advantage of the core re-branding options.

The new Cortex-X program is an evolution of the BoACT license, this time around making much more significant microarchitectural changes to the “base” design that is listed on Arm’s product roadmap. Here, Arm proclaims that it allows customers to customize and differentiate their products more; but the real gist of it is that the company now has the resources to finally do what some of its lead customers have been requesting for years.

One thing to note, is that while Arm names the program the “Cortex-X Custom Program”, it’s not to be confused with actual custom microarchitectures by vendors with an architectural license. The custom refers to Arm’s customization of their roadmap CPU cores – the design is still very much built by Arm themselves and they deliver the IP. For now, the X1 IP will also be identical between all licensees, but the company doesn’t rule out vendor-specific changes the future iterations – if there’s interest.

This time around Arm also maintains the marketing and branding over the core, meaning we’ll not be seeing the CPU under different names. All in all, the whole marketing disclosure around the design program is maybe a bit confusing – the simple matter of fact is that the X1 is simply another separate CPU IP offering by Arm, aimed at its leading partners, who are likely willing to pay more for more performance.

At the end of the day, what we're getting are two different microarchitectures – both designed by the same team, and both sharing the same fundamental design blocks – but with the A78 focusing on maximizing the PPA metric and having a big focus on efficiency, while the new Cortex-X1 is able to maximize performance, even if that means compromising on higher power usage or a larger die area.

It’s an incredible design philosophy change for Arm, as the company is no longer handicapped in the ultra-high-tier performance ring with the big players such as Apple, AMD, or Intel – all whilst still retaining their bread & butter design advantages for the more cost-oriented vendors out there who deliver hundreds of millions of devices.

Let’s start by dissecting the microarchitecture changes of the new CPUs, starting off with the Cortex-A78…

The Cortex-A78 Micro-architecture: PPA Focused

PRINT THIS ARTICLE

Post Your Comment
Please log in or sign up to comment.

Comments Locked

192 Comments

View All Comments

Wilco1 - Tuesday, May 26, 2020 - link
Disappointed in what way? Flagship phones have been more than fast enough in the last few years. There is a balance between power consumption and performance - and I think the improved efficiency of Cortex-A78 will be more useful in typical use-cases. It won't win benchmarks, but if you believe iPhone performance is measurably better in real-life use (rather than benchmarks), why not just buy one?
syxbit - Tuesday, May 26, 2020 - link
Put it in context. You pay $1500 for a Galaxy S20 ultra that's slower than a $400 iphone.
If you do a lot of web browsing on javascript heavy pages, nothing beats single threaded perf. You can't improve it by just throwing slower cores at it.
Discourse did a good writeup that's still valid today.
https://meta.discourse.org/t/the-state-of-javascri...
Wilco1 - Tuesday, May 26, 2020 - link
You could also get the $699 OnePlus 8 and beat the S20 ultra on both performance and cost. Where is the difference?

Javascript and browsers depend heavily on software optimization, and that's the real issue.
armchair_architect - Tuesday, May 26, 2020 - link
syxbit is right. Javascript and browsers are not just software. They stress CPU in different ways than the usual Spec/Geekbench and X1 will not be just a benchmark core.
If you look at DVFS curve of A77 vs A78, X1 will probably be even lower power than A78 in the region of perf in which they overlap.
For the simple reason that to achieve same performance as A77/A78, X1 will need much lower frequency and voltage. This will greatly offset the intrinsic growth in iso-frequency power that X1 will for sure have.
My point would be: going wider helps you be more efficient iso-perf vs narrower cores.
The power efficiency hit only comes when you go over the peak perf offered by the narrower core.
So you could argue that something like X1 is taking the A78 DVFS curve and pushing it down (lower power) and of top of that it extends it to new performance point not even reachable on A78.
Obviously you pay in area for this :)
But Apple has clearly showed over the years that this is the winning formula
ZolaIII - Wednesday, May 27, 2020 - link
You are completely wrong. It's much more about caching than wider core's. X1 is not 50% faster than A78 but it is 50% bigger. Best approach would be wider ISA with same execution units multiplied in numbers like RISC V did lay out already foundations for 256 bit ISA (still a scratch) and is finalising 128 bit one. But there's a catch in tool's and compilers support.
soresu - Wednesday, May 27, 2020 - link
X1 does have wider NEON SIMD, twice as wide in fact - so for content that favors SIMD (like dav1d AV1 decoding) you will get a serious jump in performance.

Unfortunately the benchmarks do not really give us much of an idea of real world improvement for something like this, so we'll have to wait for products to get a better idea.
dotjaz - Thursday, May 28, 2020 - link
ARM specifically said A78 was designed to INCREASE EFFICIENCY vs A77, a lot of the decisions concur with that.
X1 was designed to MAXIMIZE PERFORMANCE sacrificing efficiency and area in the process. When you factor in the leakage caused by larger die. X1 would almost certainly be less efficient than A78 when you drop it to below 2GHz.
Wilco1 - Thursday, May 28, 2020 - link
"Javascript and browsers are not just software."

They are just software. Fun fact: your Android browser is built with -Oz. Yes, all optimizations are turned off in order to reduce binary size. That's an insanely stupid software decision which means Android phones appear to be behind iOS when in fact they are not.
name99 - Saturday, May 30, 2020 - link
It's not an "insanely stupid software decision"...
Fun fact: Apple ALSO builds pretty much all their software at either -Oz or -Os! Both Apple and Google (and probably MS) are well aware that the "overall system experience" matters more than picking up a few percentage points in particular benchmarks, and that large app footprints hurt that overall system experience. Apple's recommendation for MOST developer code (and followed internally) has been to optimize for size for yikes, at least 20 years, and hasn't changed in all that time.

Look at the (ongoing) work in LLVM to reduce code size ( "outliner" is one of the relevant keywords); the people involved in that span a range of companies. I've seen a lot of work by Apple people, a lot by Google people, some even by Facebook people.
Wilco1 - Saturday, May 30, 2020 - link
There is a world of difference between optimizing performance without regard for codesize and optimizing for smallest possible codesize without any regard for performance. -Ofast is the former, -Oz is the latter. Most software, including Linux distros, uses -O2 as the best tradeoff between these extremes. Non essential applications use -Os (or even -Oz if performance is irrelevant). However a browser is extremely performance sensitive. Saving a few bytes with -Oz loses 10-20% performance and that means you lose the equivalent of a full CPU generation. I call that insanely stupid, there are no other words to describe it.

Arm's New Cortex-A78 and Cortex-X1 Microarchitectures: An Efficiency and Performance Divergence

The New Cortex-A78: Doubling Down on Efficiency

The New Cortex-X1: Breaking the Design Constraint Chains

Post Your Comment

192 Comments

View All Comments

Wilco1 - Tuesday, May 26, 2020 - link

syxbit - Tuesday, May 26, 2020 - link

Wilco1 - Tuesday, May 26, 2020 - link

armchair_architect - Tuesday, May 26, 2020 - link

ZolaIII - Wednesday, May 27, 2020 - link

soresu - Wednesday, May 27, 2020 - link

dotjaz - Thursday, May 28, 2020 - link

Wilco1 - Thursday, May 28, 2020 - link

name99 - Saturday, May 30, 2020 - link

Wilco1 - Saturday, May 30, 2020 - link

Log in

Don't have an account? Sign up now