Many cloud service providers now design their own silicon, but Amazon Web Services (AWS) started doing so ahead of its rivals, and by now its Annapurna Labs subsidiary develops processors that compete well with those from AMD and Intel. This week AWS introduced its Graviton4 SoC, a 96-core ARM-based chip that promises to challenge renowned CPU designers and offer unprecedented performance to AWS clients.

"By focusing our chip designs on real workloads that matter to customers, we are able to deliver the most advanced cloud infrastructure to them," said David Brown, vice president of Compute and Networking at AWS. "Graviton4 marks the fourth generation we have delivered in just five years, and is the most powerful and energy efficient chip we have ever built for a broad range of workloads."

The AWS Graviton4 processor packs 96 cores and, according to Amazon, offers on average 30% higher compute performance than Graviton3, as well as 40% higher performance in database applications and 45% higher in Java applications. Since Amazon did not reveal many details about Graviton4, it is hard to attribute these gains to any particular characteristic of the CPU.

Yet, NextPlatform believes that the processor uses Arm Neoverse V2 cores, which are more capable than the V1 cores used in previous-generation AWS processors when it comes to instructions per clock (IPC). Furthermore, the new CPU is expected to be fabricated on one of TSMC's N4 process technologies (4nm-class), which offers a higher clock-speed potential than TSMC's N5 nodes.

"AWS Graviton4 instances are the fastest EC2 instances we have ever tested, and they are delivering outstanding performance across our most competitive and latency sensitive workloads," said Roman Visintine, lead cloud engineer at Epic. "We look forward to using Graviton4 to improve player experience and expand what is possible within Fortnite.”

In addition, the new processor features a revamped memory subsystem with 536.7 GB/s of peak bandwidth, 75% higher than the previous-generation AWS CPU. Higher memory bandwidth improves CPU performance in memory-intensive applications, such as databases.

Meanwhile, such a major memory bandwidth improvement suggests that the new processor employs more memory channels than Graviton3, though AWS has not formally confirmed this.
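A quick back-of-the-envelope calculation shows why more channels is the likely explanation. The sketch below assumes (AWS has not confirmed these figures) a configuration of 12 channels of DDR5-5600 for Graviton4 versus Graviton3's 8 channels of DDR5-4800, which lands almost exactly on the quoted 536.7 GB/s figure and the 75% uplift:

```python
def peak_bandwidth_gbs(channels: int, data_rate_mt: int, bus_width_bytes: int = 8) -> float:
    """Peak DDR memory bandwidth in GB/s (1 GB = 1e9 bytes).

    channels: number of 64-bit (8-byte) memory channels
    data_rate_mt: transfer rate in megatransfers per second (e.g. 5600 for DDR5-5600)
    """
    return channels * data_rate_mt * 1e6 * bus_width_bytes / 1e9

# Assumed, not confirmed: 12 x DDR5-5600 for Graviton4, 8 x DDR5-4800 for Graviton3
graviton4 = peak_bandwidth_gbs(12, 5600)  # 537.6 GB/s, close to the quoted 536.7 GB/s
graviton3 = peak_bandwidth_gbs(8, 4800)   # 307.2 GB/s
print(f"Graviton4: {graviton4:.1f} GB/s, uplift: {graviton4 / graviton3 - 1:.0%}")
```

Note that a pure clock-speed bump within 8 channels could not deliver a 75% jump; DDR5-8400 does not exist as a mainstream server part, which is why a wider channel count is the plausible reading.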

Graviton4 will debut in memory-optimized Amazon EC2 R8g instances, which are particularly useful for boosting performance of high-end databases and analytics. Furthermore, these R8g instances provide up to three times more vCPUs and memory than Graviton3-based R7g instances, enabling higher throughput for data processing, better scalability, faster results, and reduced costs. To ensure the security of EC2 instances, Amazon fully encrypted all high-speed physical hardware interfaces of Graviton4 CPUs.

Graviton4-based R8g instances are currently in preview and will become widely available in the coming months.

Sources: AWS, NextPlatform

View All Comments

  • mode_13h - Monday, December 4, 2023 - link

    I've never heard of Neoverse V2 supporting SMT. You'd think ARM would've mentioned that, when they announced it.

    > among the first ARM-based server CPUs to use SMT was/is actually
    > the one from Huawei/HiSilicon.

    There were other ARM cores with SMT. The Cortex-A65 was used in a couple self-driving SoCs and supports 2-way SMT.

    In terms of server CPUs, the Neoverse E1 supposedly has it. Cavium/Marvell's ThunderX2 & ThunderX3 have 4-way SMT.
  • 29a - Monday, December 4, 2023 - link

    "I don't know why the writer is acting like the uarch and memory are some kind of mystery worthy of speculation."

    Because he's terrible, AI could write much better articles.
  • Wadiest - Wednesday, November 29, 2023 - link

Neat, that's nearly 70% of the memory bandwidth of a 2021 Mac Studio (M1 Ultra).

I know, it's a facetious comparison, but I find it curious either way you look at it - whether server CPU makers just can't seem to beat Apple's barely-more-than-a-laptop-CPU, or Apple so massively over-designed their M-series processors in this regard.
  • bubblyboo - Wednesday, November 29, 2023 - link

    Because M# processor memory bandwidth is shared between the CPU and GPU, whereas here it's only the CPU bandwidth.
  • mode_13h - Thursday, November 30, 2023 - link

    Yup. Comparing CPU + GPU vs. CPU-only. GPUs are notoriously bandwidth-hungry.
  • name99 - Friday, December 1, 2023 - link

    Yes, and no.
    Of course it is true that GPUs are bandwidth-hungry. But it's also true that data centers are notoriously underprovisioned with bandwidth. Dick Sites (one of the Google performance engineers) has frequently complained about this.

    I suspect it costs money to provide bandwidth (not to mention designing your machine differently, eg without using standard motherboards and DIMMs), and that this is part of what you are getting when you pay for Apple (in spite of the loud voices who claim that there are no advantages to on-SoC RAM...)
  • mode_13h - Saturday, December 2, 2023 - link

    The GPU comment was made to explain why Apple's client SoC has so much bandwidth. The reason why server CPUs can't easily do the same is due to their memory scalability requirements.

    Of course, there are exceptions. Intel's Xeon Max has HBM, which can be used as a "cache", to avoid compromising on scalability. Nvidia's Grace uses on-package LPDDR5X, I guess with a memory scaling strategy of simply adding more nodes. Maybe, in the future, they plan on pools of additional memory being accessed over CXL. Long-term, we seem to be headed for memory tiers, where one form or another of on-package memory comprises the fast tier.

    Anyway, if you're curious how a server CPU would perform with 1 TB/s of bandwidth:
  • lemurbutton - Wednesday, November 29, 2023 - link

    Why are you comparing a server CPU to a consumer SoC?
  • erinadreno - Wednesday, November 29, 2023 - link

    M-series chips use LPDDR memory, which aims for high bandwidth at the cost of latency. IIRC the latency of M1 is ~100ns, while Zen 3 chips were capable of ~60ns. Consider that you need to do a fetch and a decode before any of that bandwidth is usable for numerical calculation.

    I'd say LPDDR is more like GDDR than conventional DDR.
  • mode_13h - Thursday, November 30, 2023 - link

    LPDDR is about low-power. You can hit the same bandwidth numbers using DDR5, but at considerably higher power.
