The userbase for Fortran on M1 is probably super small anyway. Although... I can see the HPC cluster made up entirely of MacBook Airs before my eyes. Just like the PS3 cluster the Air Force used to have ;)
No, the hardware would be cheaper but the maintenance would be much more time-intensive. That's why companies that need a lot of processing hardware buy enterprise level hardware. The cost of maintaining the system quickly eclipses the hardware costs. And if you're using a computer to make money, quite often the hardware cost is only a small amount of your costs.
I dunno about the "quite often the hardware cost is only a small amount of your costs." part. as modern production methods have been ever more automated, (I'm talkin to you, bitcoin mining), there's almost no other cost. now, some may argue, in the extreme case of mining for instance, that power is the largest component; but isn't that 'hardware' cost? it certainly isn't labor or interest or land or even CxOs' cut. fewer and fewer automation efforts are conducted in assembler or even naked C or java or FORTRAN, but in frameworks, often with bespoke syntax and with headcounts way lower than their native languages. so, yeah, now into the foreseeable future, hardware is the biggest byte.
True, but I did enjoy the holistic response. Just think of the potential: batteries are a built-in UPS, and you don't need to mess about with any sort of KVM arrangement - if a node drops out, you can go right to it and poke it to find out what's up!
I guess the results showing power draw below TDP despite 100% load mean that the cores are sometimes idling for part of their clock cycles. It means the CPU is lacking buffers and isn't fully optimized.
Buffers and even cache can't completely avoid memory bottlenecks.
Also, you can run a core 100% on code with very little parallelism and not draw much power. Code with lots of ILP and especially vector arithmetic burns a lot more power, which is why AVX2 and especially AVX-512 trigger significant clock-throttling on Intel.
Interesting article thanks. One thing I missed, what process is this on? 7nm?
It's also interesting that the M1 has demonstrated that with the right sizings, a very wide backend can give you significant single threaded performance. Not really that useful for a server processor where you're likely to be running many threads and want to trade for more cores though.
Yes, 7nm and monolithic, which seems fairly incredible as this thing is huge. Don't have the die size numbers though. Wonder what the yield is on these...
Each Neoverse N1 core with 1MB L2 is just 1.4mm^2, so 80 of them add up to 112mm^2. The die size is estimated at about 350mm^2, so tiny compared to the total ~1100mm^2 in EPYC 7742.
So performance/area is >3x that of EPYC. Now that is efficiency!
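The arithmetic behind the two comments above can be checked with a short sketch. All inputs are the thread's own estimates (1.4mm^2 per core, ~350mm^2 for the Altra die, ~1100mm^2 of total silicon in EPYC 7742); the `relative_perf` parameter is an assumption added for illustration, with 1.0 meaning "roughly equal performance".

```python
# Figures quoted in the thread; all are estimates, not measured die sizes.
N1_CORE_MM2 = 1.4                 # Neoverse N1 core with 1MB L2 on 7nm
ALTRA_CORES = 80
ALTRA_DIE_MM2 = 350.0             # estimated Altra die size
EPYC_7742_SILICON_MM2 = 1100.0    # approximate total silicon in EPYC 7742

# Area taken by the cores alone.
core_area_mm2 = N1_CORE_MM2 * ALTRA_CORES

# How much more silicon EPYC uses than the estimated Altra die.
area_ratio = EPYC_7742_SILICON_MM2 / ALTRA_DIE_MM2


def perf_per_area_advantage(relative_perf: float = 1.0) -> float:
    """Altra perf/area relative to EPYC, given Altra perf relative to EPYC.

    relative_perf is an illustrative assumption, not a measured value.
    """
    return relative_perf * area_ratio


print(f"core area: {core_area_mm2:.0f} mm^2")
print(f"silicon ratio EPYC/Altra: {area_ratio:.2f}x")
print(f"perf/area advantage at equal perf: {perf_per_area_advantage():.2f}x")
```

The ">3x" claim only holds if the 350mm^2 estimate does, which is exactly what the replies further down dispute.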
Timing of this article is awkward. We're comparing against the 18-month-old 7742 rather than the soon-to-be-released Zen 3 Milan parts, which, based on the already launched Zen 3 desktop parts (and Milan leaks), will be 9-27% faster in the same power envelope.
Cache is a big part of the die size for the AMD chip and the N1 has much less of it which makes the die size smaller. AMD's Desktop IGP parts with way less cache perform very similarly in many workloads to those with the extra cache and the same has been true for intel parts over the years. Some workloads don't benefit much at all from the extra cache and some do which makes choosing the benchmarks more important.
That's not to say the N1 isn't more efficient, but rather that it's hard to make a fair comparison, particularly around die size. They may have similar core counts but have made very different design decisions around cache.
I don't see how it matters, but Altra is about 9 months old and Neoverse N1 is a sibling of Cortex-A76 which has been available in phones for 2 years. As for Milan, I expect the gain on SPECrate to be about 10-15%. And Milan will be competing with the Altra Max which has 60% more cores which should give ~40% speedup.
Yes the design decisions are quite different, and it is interesting that they end up with similar performance despite the disparity in L3 cache. I suspect that 8 memory channels is becoming a limit, and a future generation with DDR5 will enable more throughput (and even more cores).
I am sorry, but looking carefully at the heatsink and the application of the thermal paste, we are facing the reticle limit on 7nm. We are looking at a 700-800mm^2 die. On 7nm this means very few units sold and nearly zero market penetration. Same thing on 5nm given the higher core counts.
In practice we have nothing in our hands. Another failure in the server market.
No it is not anywhere near the reticle limit. You can't estimate the die size from the heatsink, but you estimate it based on similar designs for which we do have numbers. Graviton 2 is a similar design at 30B transistors. This has another 16 cores which adds another 16X1.4 = 22.4mm^2. So around 350mm^2 in N7.
This is just a ridiculous statement. 350mm^2 ... no way. Firstly, the die size of Graviton 2 is not known. A realistic comparison would be AMD's Zen2 chiplet, which has 3.9B transistors and is 72mm^2. One would deduce from that, that Graviton 2 is > 550mm^2. Also your napkin calculation to add 22mm^2 is flawed. Firstly, you don't know if an N1 core is actually taking 1.4mm^2 in this CPU. Secondly, you're forgetting to add 64 PCI-E lanes. Let's say, 25mm^2 for the CPU and 25mm^2 for the lanes. That would bring the total to 600mm^2. Quite a bit bigger than your 350mm^2.
Using Zen 2 is not correct since it uses much larger transistors. Using Kirin 990 5G density gives an estimate of 330mm^2 for Graviton 2. The size of N1 cores has been published for 7nm, so we know it is 1.4mm^2. You're right that PCIe lanes would add to it as well - assuming the PHYs have the same size as DDR PHYs at the same speed, 64 lanes would be about 12-15mm^2. That would increase it to about 365mm^2.
Kirin 990 5G uses N7+. Altra uses N7. Not only is the process different but they're also totally different categories of products concerning transistor density. A mobile SOC can be very dense. It barely has any IO (which is not transistor dense). Also GPU, DPU, IMG, ... all are extremely dense. Kirin 990 5G is 90MTr/mm^2. No way a server class SOC is going to be more than 60MTr/mm2. Renoir = 62, Navi 21 = 52, Zen2 = 54, Vega 20 = 40, Navi 10 = 41. Ampere isn't going to magically break 60.
"The size of N1 cores has been published for 7nm, so we know it is 1.4mm^2" Those are ARM numbers and that is only if you use high density libraries.
Arm servers don't need high performance libraries - even mobile phones clock over 3 GHz using high density libraries. See https://images.anandtech.com/doci/13959/03_Infra%2... (note 3.1GHz and 1.4mm^2 with 1MB L2 on 7nm is ~100MT/mm^2)
Using ~90MT/mm^2 for 7nm is reasonable since that is the reported density of recent 7nm chips (Kirin 990 5G is 91, 4G is 88 - the older 980 gets 93). Mobile SoCs already have a large amount of IO and analog logic and we are multiplying that amount by 3x.
This shows how stupid it is to use high performance libraries in server chips - they don't need to run at 5GHz!
We have different opinions but there's only one true fact: the die size is not disclosed. So anything anyone says is just a pure guess. You can't throw it around as fact.
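To make the disagreement above concrete, here is a minimal sketch of how the die-size guess swings with the assumed transistor density. The 30B transistor count, the Zen2 chiplet figures (3.9B transistors, 72mm^2) and the ~90 and ~60 MTr/mm^2 densities are the numbers quoted in the thread; none of them is a disclosed Altra or Graviton 2 die size, and the function names are made up for this example.

```python
def density_mtr_per_mm2(transistors: float, area_mm2: float) -> float:
    """Transistor density in millions of transistors per mm^2."""
    return transistors / 1e6 / area_mm2


def die_area_mm2(transistors: float, density_mtr: float) -> float:
    """Die area implied by a transistor count and an assumed density."""
    return transistors / 1e6 / density_mtr


GRAVITON2_TRANSISTORS = 30e9  # figure quoted in the thread

# Density of the Zen2 chiplet, from the numbers given above.
zen2_density = density_mtr_per_mm2(3.9e9, 72.0)

# Each entry is one commenter's assumption about achievable density.
assumptions = [
    ("Zen2 chiplet density", zen2_density),
    ("server-class ceiling claimed above", 60.0),
    ("Kirin 990 class mobile density", 90.0),
]

for label, density in assumptions:
    area = die_area_mm2(GRAVITON2_TRANSISTORS, density)
    print(f"{label:36s} {density:5.1f} MTr/mm^2 -> {area:5.0f} mm^2")
```

The same transistor count lands anywhere from roughly 330mm^2 to roughly 550mm^2 depending on which density you believe, which is why neither side can settle it without a disclosed figure.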
The package is only 16% larger than EPYC. Do you see any opportunity to reduce the huge number of pins? There are 8 memory channels plus full 128 PCIe lanes.
Yes, the problem Altra Max will likely face is more memory bottlenecks. Also, I wonder if they'll have to dial clocks down, a little, to keep the power-efficiency numbers attractive.
Obviously it is a niche CPU, not high volume like Intel or AMD. With such a large die we will not see many of these around. As usual, only volume matters in the server world, so no worries for x86.
Actually, those are a bigger threat to x86 than ARM chips like the M1 in personal computers. Server x86/x64 CPUs are where AMD and Intel make a lot of their money. The key question for this and similar Neoverse chips is software support. If you can run your database or whatever natively on an ARM-native OS like Linux, these are tempting. Now, if MS would release Exchange Server natively for ARM, the threat would be even bigger.
Agreed about software, but I don't see problems for x86 dominance. The major sin of this design is die size, around 800mm^2 judging by the photos in the article. On 7nm that means very low CPU output, and the issue will become even worse on 5nm. So it is not a matter of how good a SKU is, but of who has the real volume in the server world. In past decades we have seen a lot of CPUs better than the x86 puppies, but in spite of this they all lost their way. The winning scheme is "volume". This is the only parameter that gives one solution dominance over another, especially today with many millions of server SKUs absorbed by the market each year. Altra is not born to beat x86, at least not in this crazy, old-style incarnation. They need to follow AMD's (and shortly Intel's) path, or they will never be relevant. Current and upcoming advanced processes are not made for these massive things.
Apple's move to Arm does hit Intel's bottom line by many billions. A large percentage of AWS is already Graviton as more big customers are moving to it (latest is Twitter). Oracle is going to use Ampere Altra, and Microsoft is claimed to develop their own Arm servers.
As Goldalf said, volume matters in the server world, and they are moving to Arm.
That was my question also! Who fabs it, and what is their yield. This thing is quite big. Does anyone know if they overprovision cores so they can use those with small, very partial defects? At that size and those numbers of transistors, even a tiny probability of a defect can mean that the great majority of chips ends up in the circular bin (garbage).
How exactly is it big? It's tiny for a server chip - 80 cores at about half the die size of a typical 28-core Xeon (~700mm^2). And TSMC 7nm yield is extremely good even for much larger chips like GPUs.
Note I wrote "quite big", and by transistor count, it's a larger CPU, expected for a server chip. As for the Xeon, how high is Intel's yield for the 28 core Xeon, and that after how many years on 14 nm+++ (etc)? So, if you have a yield number for this 80 core Ampere chip, please share it.
It's larger than a mobile SoC, but small for a server chip thanks to Arm's tiny cores and the high density of TSMC 7nm. See https://www.anandtech.com/show/16028/better-yield-... for the defect rate, and from that a simple yield calculator gives 71% for a 350mm^2 die. That's before you fix SRAM defects or harvest dies for the lower-end SKUs. So we conclude yield is very good.
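For anyone who wants to reproduce the "simple yield calculator" figure, here is a rough sketch using two textbook die-yield models (Poisson and Murphy). The defect density of 0.1 defects/cm^2 is an assumption picked for illustration, since the comment above does not state the value it used, and the 350mm^2 and 700mm^2 areas are the thread's estimates.

```python
import math


def poisson_yield(area_mm2: float, d0_per_cm2: float) -> float:
    """Poisson model: fraction of dies with zero defects."""
    defects_per_die = (area_mm2 / 100.0) * d0_per_cm2  # mm^2 -> cm^2
    return math.exp(-defects_per_die)


def murphy_yield(area_mm2: float, d0_per_cm2: float) -> float:
    """Murphy model: ((1 - e^(-AD)) / (AD))^2."""
    ad = (area_mm2 / 100.0) * d0_per_cm2
    if ad == 0:
        return 1.0
    return ((1.0 - math.exp(-ad)) / ad) ** 2


D0 = 0.1  # defects per cm^2; ASSUMED value, not taken from the thread

for area in (350.0, 700.0):
    print(
        f"{area:.0f} mm^2: Poisson {poisson_yield(area, D0):.1%}, "
        f"Murphy {murphy_yield(area, D0):.1%}"
    )
```

Both models count only fully defect-free dies, so, as the comment says, the result is a floor before any SRAM repair or core harvesting.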
Glad to read that you've proven Andrei wrong, so maybe you should write these reviews. Here is a direct quote from the first page of the review (also, take a look at the pictures): "The chip itself is absolutely humongous and amongst the current publicly available processors is the biggest in the industry, out-sizing AMD’s SP3 form-factor packaging, coming in at around 77 x 66.8mm – about the same length but considerably wider than AMD’s counterparts."
How ignorant can you be? Obviously the chip and silicon die have different sizes. The chip is large because it has many pins. The silicon die is a tiny fraction of the chip. We're discussing the size of the silicon die here, not the size of the chip. Completely different things.
It's hard to imagine how anyone sane can believe that a "chip measuring 77 x 66.8mm" (6 times the reticle limit!) is referring to the die size rather than the package. Andrei's quote even uses the word package. So maybe you're right and eastcoast_pete was just trolling.
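The "6 times the reticle limit" remark is easy to verify. The sketch below assumes the commonly cited maximum lithography field of 26mm x 33mm, which is not stated in the thread; the package dimensions are from the quoted review and the 350mm^2 die size is the thread's estimate.

```python
# Package dimensions quoted from the review.
PACKAGE_W_MM = 77.0
PACKAGE_H_MM = 66.8

# Commonly cited maximum exposure field (assumption, not from the thread).
RETICLE_W_MM = 26.0
RETICLE_H_MM = 33.0

ESTIMATED_DIE_MM2 = 350.0  # the thread's estimate, not a disclosed figure

package_area = PACKAGE_W_MM * PACKAGE_H_MM
reticle_area = RETICLE_W_MM * RETICLE_H_MM

print(f"package area: {package_area:.1f} mm^2")
print(f"reticle limit: {reticle_area:.0f} mm^2")
print(f"package / reticle: {package_area / reticle_area:.2f}x")
print(f"estimated die / reticle: {ESTIMATED_DIE_MM2 / reticle_area:.2f}x")
```

So the package cannot possibly be a single die, while the estimated die would use well under half of one exposure field.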
"This thing is quite big." Package size is not die size.
If Nvidia can pump out dies more than twice the area on an inferior process and still get some perfect dies, I suspect they'll have no issues whatsoever with yield on TSMC 7nm at this stage - especially with the ability to sell lower-core-count variants.
Die-harvested models with fewer cores sell for only 5-10% less. So I'm not sure if that means yields are really good, or really bad. Apparently they seem to be pushing the 80-core models pretty hard since so many are being offered.
Then again, it depends what we define as yield quality? Defects seem to be low, but binning could be another issue as only two models seem to hit 3.3GHz and at incredibly high power budgets.
3.3GHz is about where that architecture tops out - I'm not sure that tells us much about yield. To me, the pricing seems to indicate that they aren't expecting to have to shift a ton of the lower-core-count die-harvested models.
Assuming that Intel just wants to milk customers forever, just like nVidia/phone oems do, they should quickly bridge the performance gap. They'll just have to stop being lazy and actually provide us with more than a drip fed speed increase.
An Apple guy I see. Remember that up until a couple of months ago Apple was charging you $1000 for a laptop with a dual core 1.1GHz chip.
The "phone OEMs" finally have a core that can somewhat compete with Apple's Firestorm. It will take them a couple of iterations to perfect it but they are on the right path. As for Intel, another story entirely. The latest word is that their 10nm process isn't going too well and they have hit yet another delay for 7nm. They may hit up Samsung's foundries just to get a product out (due to TSMC not having any capacity until December 21). So while their issues are far more significant than those for Android phones, it isn't due to their lazily milking customers. They have real tech issues to deal with, issues of the sort that Apple and AMD don't have to worry about because they lack the capability and expertise required to make their own chips.
The answer to the question of "how powerful it is" is clear - more than good enough. The real question in fact is: "How much can they produce?" AMD has the crown in x86 processor performance, but this doesn't really matter very much as long as they can build enough processors only for a part of the market.
64kB pages might significantly enhance performance on workloads with large memory footprints, as each TLB entry covers 16x more memory, so up to 16x fewer entries are needed. On the other hand, memory usage of the Linux file system cache will also increase a lot.
Would you be able to test the effect of 64kB vs 4kB page size on at least some workloads?
It's something that I wanted to test but it requires an OS reinstall / kernel recompile - I didn't want to get into that rabbit hole of a time sink, as I had already spent a lot of time verifying a lot of data across the three platforms over a few weeks.
I'd love to see that as well. For workloads that use transparent huge pages, there should not be much difference since both would use 2MB huge pages (512*4KB or 32*64KB), plus one or more even larger page sizes, but it needs to be measured to be
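The page-size arithmetic in the comments above (16x fewer TLB entries, 2MB = 512*4KB = 32*64KB) can be sketched as follows. The TLB entry count and the working-set size are hypothetical values chosen only to make the ratios visible; they do not describe the N1's actual TLB.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

PAGE_4K = 4 * KIB
PAGE_64K = 64 * KIB
HUGE_2M = 2 * MIB


def pages_needed(working_set_bytes: int, page_bytes: int) -> int:
    """Number of pages needed to map a working set (rounded up)."""
    return -(-working_set_bytes // page_bytes)


def tlb_reach(entries: int, page_bytes: int) -> int:
    """Bytes of memory covered when every TLB entry maps one page."""
    return entries * page_bytes


TLB_ENTRIES = 1024       # hypothetical entry count, for illustration only
WORKING_SET = 8 * GIB    # hypothetical working set

print("64K/4K ratio:", PAGE_64K // PAGE_4K)
print("4K pages per 2M huge page:", HUGE_2M // PAGE_4K)
print("64K pages per 2M huge page:", HUGE_2M // PAGE_64K)
for name, size in (("4K", PAGE_4K), ("64K", PAGE_64K)):
    print(
        f"{name}: {pages_needed(WORKING_SET, size)} pages, "
        f"TLB reach {tlb_reach(TLB_ENTRIES, size) // MIB} MiB"
    )
```

This only shows the best-case ratio; whether it turns into measurable speedup depends on the workload's access pattern, which is why a real benchmark is still needed.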
The downsides of 64KB requiring larger disk I/O and more RAM are often harder to quantify, as most benchmarks try to avoid the interesting cases.
I've tried benchmarking kernel compiles on Graviton2 both ways and found 64kB pages to be a few percent faster when there is enough RAM, but forcing the system to swap by limiting the total RAM made the 64kB kernel easily 10x to 1000x slower than the 4kB one, depending on how the available memory lined up with the working set.
Thank you for the incredible amount of information and the work you put into this: Anandtech's best!
Yet I wonder who would deploy this and where. The purchasing price of the CPU would seem to become a rather minuscule part of the total system cost, especially once you go into big RAM territory. And I wonder if it's not similar with the energy budget: I see my larger systems requiring more $ and Watts on RAM than on the CPUs. Are they doing, can they do anything there to reduce DRAM energy consumption vs. Intel/AMD?
The cost of the ecosystem change to ARM may be less relevant once you have the scale to pay for it, but where exactly would those scale benefits come from? And what scales are we talking about? Would you need 100k or 1m servers to break even?
And what sort of system load would you have to reach/maintain to have significant energy advantages vs. x86 iron?
Do they support special tricks like powering down quadrants and RAM banks for load management? Do they enable quick standby/activation modes so that servers can be taken offline and brought back for load management?
And how long would the benefits last? AMD has demonstrated rather well that the ability to execute over at least three generations of hardware is required to shift attention even from the big guys, and they still have all the scaling benefits the x86 installed base provides.
These guys are on a 2nd generation product and promise a 3rd, but essentially this would seem to have the same level of confidence as the 1st EPYC.
Does the price of the CPU matter at this point? It's not as if you are going to build your own system, is it? Isn't buying the whole computer the only option? Do they sell motherboards as well?
People like to talk about how RAM costs rapidly sink the cost of the CPU *as a proportion of total cost*, which is true, but saving $2000 per socket still adds up when you're buying a fair few of them. Saving hundreds of thousands of dollars on a datacentre build-out (and with no significant performance downsides) is good financial sense, even if you're still spending millions overall.
I really want to see what would happen if an M1-based design got scaled down to something suitable for a workstation.
I think something like that is needed in order to solve problems like what you were running into with Blender - until we have decent high performance aarch64 workstations on developer's desks, x86-64 is gonna be ahead on software support.
*technically* Nvidia did and does sell Jetson boards in that range (they even delivered the first consumer accessible PCIe 4.0 root, in the AGX Xavier), but Nvidia's insistence on locking everything down to an ancient kernel basically kills any hope of it being used outside of specific fields. Even something as simple as their teams keeping up with LTS releases would be somewhat viable, but nope. Even Raspberry Pi foundation is far ahead of them, now.
I think they are still trying to upgrade their bootloader to allow booting from NVMe cards? At least their AGX Xavier is finally picking up beta UEFI support (which should help somewhat).
I guess they would rather Apple eat their lunch. Oh, well.
These are interesting times indeed... The question is what having AArch64 available on very nice desktops/laptops from Apple will do in terms of incentivising devs to work with ARM. Servers with compute power per socket that equals or outperforms x86 at lower prices must appeal to a lot of Java and cloud-vendor workloads.
'Where Ampere and the Altra definitely is beating AMD in is TCO, or total cost of ownership. Taking the flagship models as comparison points – the Q80-33 costs only $4050 which generally matching the performance of AMD’s EPYC 7742 which still comes in at $6950, essentially 42% cheaper.'
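The "42% cheaper" figure in the quote follows directly from the two list prices. A small sketch, where the socket count passed to `fleet_saving` is a made-up illustration and everything other than CPU price (RAM, boards, power, licensing) is ignored:

```python
# List prices quoted from the article.
Q80_33_PRICE = 4050
EPYC_7742_PRICE = 6950


def relative_saving(cheaper: float, dearer: float) -> float:
    """Fraction saved by buying the cheaper part instead of the dearer one."""
    return (dearer - cheaper) / dearer


def fleet_saving(sockets: int) -> int:
    """CPU list-price saving across a fleet; ignores every other cost."""
    return sockets * (EPYC_7742_PRICE - Q80_33_PRICE)


print(f"per-socket saving: ${EPYC_7742_PRICE - Q80_33_PRICE}")
print(f"relative saving: {relative_saving(Q80_33_PRICE, EPYC_7742_PRICE):.1%}")
print(f"saving over 100 sockets (hypothetical fleet): ${fleet_saving(100)}")
```

This is purchase price only, so it is a narrower number than TCO proper.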
Does per-core licensing cut into that advantage at all?
Soon you should be able to buy an AWS Outpost with Graviton2 inside which kinda sorta straddles the line between "owning" and "accessed via Amazon cloud services".
Very excited to receive my Apple M1 MacBook Pro. I hope it gives me some perspective on how performance can be applied to scientific local workstation computing.
25% more cores for Zen2 7742 class. If paired with multi socket and then Milan drop in this is not going to be any major breakthrough.
"The Arm server dream is no longer a dream, it’s here today, and it’s real." lol so until today all the articles on the ARM are not real I guess.
Anyways I will wait for market penetration of this with server share and then see how great ARM is and how bad x86 is going to be as from AT's narrative recently.
Are you this mopey every time there's a paradigm-shift in the tech industry? Feel free to keep looking for metrics that "prove" you right, but eventually it's going to be a very hard search.
Thanks Andrei! Maybe I am barking up the wrong tree here, but I find the "baby" server chip in that lineup particularly interesting. Nowhere near as fast as this, of course, but for $800, it might make for a nice CPU for a basic server setup; nothing fancy, but low TDP, and would probably get the job done. The question here is how expensive the MB for those would be. Lastly, if Ampere sends you one of those $800 ones, could/would you test it?
They will likely sell desktops using these just like the previous generation, but they are not cheap as it is high-end server gear using expensive ECC memory (and lots of it since there are 8 channels). If you don't need the fastest then there is eg. NVIDIA Xavier or LX2160A (16x A72) boards for around $500.
I think those are probably most useful for workloads that are pathologically memory and/or I/O limited - 4TB per socket, save ~$3000 over the faster CPU, benefit from power savings over the life of the server.
Ironically, AMD's opportunity to win might turn into an ultimate loss - Intel's manufacturing advantage kept x86 relevant, and with access to the x86 instruction set limited by ownership of the IP, AMD lived alongside Intel in that walled garden.
With the manufacturing advantage gone however, Apple has left the garden, and maybe other personal computers won't be far behind - software compatibility I think is actually less of an issue in the era of SaaS and continuous updates. Ie. you were going to have to download new versions of the software you use as time went on anyway.
This is all great but when all licencing is per core it limits the usage scenarios or benefits of these developments as they can really only be used with open source type licences. For the rest of us on Windows, Oracle, Java, Apple, IBM, etc licencing it doesn't bring anything to the table.
Meh. Nothing special. It has been benchmarked on Phoronix and it performed more or less on par with Rome. 80 of the newest ARM cores against 64 mature x86 cores within a constrained power envelope. Milan is just about to come out and I suspect some time after that AMD will have something like really wide new RISC-V cores.
It won most benchmarks on Phoronix while using significantly less power. Yes Milan is about to be released, and it will have to compete with the 128-core Altra Max. Which do you believe is going to win - 64 SMT cores or 128 real cores?
It actually won less than half of the benchmarks on Phoronix, since a number of those graphs just re-state the results in score/W. There are also questions over some of the compiler options used on those benchmarks, since many of the tests are compiled with options that won't enable AVX on benchmarks where it should be beneficial (yet, not having SVE, the N1 cores are at no such disadvantage).
"should be beneficial" -> "might help in a few limited cases". AVX/AVX512 isn't that useful for general C/C++ code. You typically only see large gains when people optimize using intrinsics.
Intrinsics don't compile if they're for a CPU arch beyond what the compiler is being instructed to target. So, even packages where people take the time to optimize with intrinsics need to guard them with compile-time checks to ensure the CPU target is capable of executing those instructions.
Compilers do generate vectorized code. I don't know how well GCC is doing on that front, lately, but the TNN tests should be a good way to see that. Too bad those tests don't use -march=native.
What's interesting about TNN is I'm looking at the exact source revision Phoronix is using, and it seems they've completely dropped their backend for x86. The source/tnn/device/x86/ is simply missing. So, I wonder if they decided the compiler was good enough that they didn't need to bother with their own hand-optimized code for it, or if they just decided they don't care how fast their stuff runs on it.
TNN does not benefit from -march=native. Phoronix uses the generic C++ version which doesn't benefit from vectorization. Try it yourself.
Optimized versions using intrinsics typically use runtime checks so you automatically get the fastest version that works on your CPU. The makefile selects the right ISA variant for any files using intrinsics. But none of this is used in the TNN test.
> TNN does not benefit from -march=native. Phoronix uses the generic C++ version which doesn't benefit from vectorization. Try it yourself.
At this point, I probably will.
> Optimized versions using intrinsics typically use runtime checks so you automatically get the fastest version that works on your CPU.
That's a whole additional level of effort for the developers. For them to bother compiling and conditionally calling different versions only makes sense if they think their main userbase aren't going to bother recompiling specifically for their hardware. In the case of specialized packages, it's reasonable to expect your users to take a little trouble for the best performance. It's really things like very low-level libs or multimedia code where you tend to see the sort of elaborate runtime detection and dynamic codepath selection that you're describing.
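For readers unfamiliar with the runtime detection and dynamic codepath selection being argued about, here is a language-neutral sketch of the idea written in Python. Every function name is hypothetical, and the "optimized" variants are stand-ins that simply call the generic code; real libraries do this in C or C++ with separately compiled files per ISA and a CPU feature query (for example `__builtin_cpu_supports` in GCC or `getauxval(AT_HWCAP)` on Linux).

```python
from typing import Callable, Iterable, Sequence

Kernel = Callable[[Sequence[float], Sequence[float]], float]


def dot_generic(a: Sequence[float], b: Sequence[float]) -> float:
    """Portable fallback that runs on any CPU."""
    return sum(x * y for x, y in zip(a, b))


def dot_avx2(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in for a hand-written AVX2 variant (hypothetical)."""
    return dot_generic(a, b)


def dot_avx512(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in for a hand-written AVX-512 variant (hypothetical)."""
    return dot_generic(a, b)


# Ordered from most to least capable; the first supported variant wins.
_VARIANTS: list[tuple[str, Kernel]] = [
    ("avx512f", dot_avx512),
    ("avx2", dot_avx2),
]


def select_dot(cpu_flags: Iterable[str]) -> Kernel:
    """Pick the best kernel for the feature flags reported by the CPU."""
    flags = set(cpu_flags)
    for required_flag, kernel in _VARIANTS:
        if required_flag in flags:
            return kernel
    return dot_generic


# Chosen once at startup, then called through the selected function.
dot = select_dot({"sse2", "avx2"})
print(dot.__name__, dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

The extra effort mentioned above is visible even in this toy: each variant has to be written, built and tested separately, and the dispatcher has to be kept in sync with them.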
I think Basis Universal and High Performance Conjugate Gradient are some other cases where the wider SIMD of Zen2 and Skylake-SP should confer significant benefit.
"should give significant benefit" -> "might give some benefit". I suggest you try out. Autovectorization is not nearly as good as you seem to believe, and the overall speedup is often disappointing even if some loops are 10-20x faster.
SMT gives very little benefit (only 15% faster on SPECINT and 3.5% *slower* on SPECFP), adds a lot of area and complexity, and results in very bad per-thread performance.
It's always better to use 2 real cores instead of 1 SMT core. So if you have small cores like Neoverse N1, adding SMT makes no sense at all.
Whether SMT makes sense depends on the workload. Many tasks have greater branch-density and less computational density, in which case SMT is a massive win. I'm compiling code all the time and see huge speedups from SMT.
That said, of course I'll take two real cores instead of 2-way SMT, all else being equal, but that always costs more. If SMT really made as little sense as you say, then it wouldn't be nearly so widespread.
There are certainly cases where SMT helps, but having some wins doesn't mean it is worth adding SMT. All too often people talk up the upsides and ignore the downsides. Let's see how Altra Max does vs Milan next year, that should answer which is best.
Note almost none of the billions of CPUs sold every year have SMT (even if we exclude embedded). Adding another core is simpler, cheaper and gives more performance.
> Let's see how Altra Max does vs Milan next year, that should answer which is best.
I disagree. That's a bit apples and oranges. The differences between Zen2/Skylake and N1 are too big. You need to look at the overhead of SMT for a particular core vs. the benefits for that core.
These x86 cores are large not just because of SMT, but also the x86 tax, their wide vector units, and other things. It could be that SMT adds just 5% overhead, and dropping it wouldn't free up enough area to increase your core count much at all.
It's never going to be a perfect comparison - nobody will design SMT and non-SMT variants of the same core! There are many differences because they use different design principles. However it will clearly show which of these designs works out best for top-end server performance.
SMT is a clear win where you already have large cores with a lot of execution resources, whereby the extra resources required aren't a large proportion of overall die area. It also helps if your tasks are focused on overall performance for a given number of threads, rather than performance-per-thread.
Where the cores are this small, though, simply adding more of them seems to be the better option.
Plus, per-thread performance is only an issue if that's how you're paying for CPU time, and then what you actually care about is thread performance per dollar, which would compensate for any cost differences due to SMT, as well.
That comparison is only valid for Amazon customers and in the short term. It can't be used to support a broader conclusion about SMT, because we lack transparency into the cost structure of Amazon's hosting, like whether they're subsidizing Graviton2 servers or even just charging enough for them to simply break even on the hardware.
Why would they introduce Graviton if it would run at a loss??? A significant percentage of AWS is already Graviton (probably 20% by now). If anything Graviton increases profitability due to vertical integration and other cost reduction.
First, there's a fundamental disparity between an in-house CPU and a 3rd Party one, where Amazon can cut out some overheads by building their own. So, that already skews the price-comparison.
The other question is whether Amazon is partially-subsidizing the price of their Graviton2 instances as an incentive to get more people to switch. For a business, the least risky thing is to stay on x86, so Amazon needs to present an immediate and significant cost savings to get people to switch. After they've switched and ARM server cores have had more time to mature, Amazon can charge more and make back a good return on investment.
I obviously don't know if that's what they're doing, but we don't know that it's not. So, you really can't read much into their current pricing. That's all I'm saying.
Finally, I guess you missed this part, in the discussion of SPECjbb:
> One thing that did come to mind immediately when I saw the results was SMT. Due to this being a transactional data-plane resident type of workload, SMT will undoubtedly help a lot in terms of performance, so I tested out the EPYC chip figures with SMT disabled, and indeed max-jOPS went down to 209.5k for the 2S THP enabled results, meaning that SMT accounts for a 29.7% performance benefit in this benchmark.
...
> It’s generally these kinds of workloads that SMT works best on, and that’s why IBM can deploy SMT4 or SMT8 processors, and the type of workloads Marvell’s ThunderX was trying to carve a niche for itself with SMT4.
As the article mentions, Marvell’s ThunderX did support SMT on ARMv8-A.
Were SMT's reputation not bruised by all the recent side-channel exploits, perhaps it would be showing up in some of ARM's own cores. Maybe their V-series will get it, since that's a much larger core.
ThunderX2/X3 and Neoverse E1 have SMT, but neither has been hugely successful. SMT doesn't provide a significant benefit across a wide range of workloads, so adding another core remains simpler and cheaper. And yes, security is another nail in the coffin.
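As a quick sanity check on the SPECjbb quote above, this sketch back-calculates the SMT-enabled score implied by the two numbers given (209.5k max-jOPS with SMT off, 29.7% benefit). The implied score is derived here, not quoted from the article, and the 80 vs 64 core comparison is simple core-count arithmetic that assumes nothing about how well extra cores actually scale.

```python
def smt_uplift(score_on: float, score_off: float) -> float:
    """Fractional gain from enabling SMT."""
    return score_on / score_off - 1.0


def implied_score_on(score_off: float, uplift: float) -> float:
    """SMT-enabled score implied by an SMT-off score and a quoted uplift."""
    return score_off * (1.0 + uplift)


def extra_core_fraction(cores: int, baseline_cores: int) -> float:
    """Fraction of additional cores relative to a baseline part."""
    return cores / baseline_cores - 1.0


SMT_OFF_KJOPS = 209.5   # quoted: 2S EPYC, THP enabled, SMT disabled
QUOTED_UPLIFT = 0.297   # quoted: benefit attributed to SMT

implied_on = implied_score_on(SMT_OFF_KJOPS, QUOTED_UPLIFT)

print(f"implied SMT-on score: {implied_on:.1f}k max-jOPS")
print(f"round-trip uplift: {smt_uplift(implied_on, SMT_OFF_KJOPS):.1%}")
print(f"Altra extra cores vs 64-core EPYC: {extra_core_fraction(80, 64):.0%}")
```

On this one benchmark the SMT gain and the extra-core fraction are in the same range, which fits the point that the answer depends heavily on the workload.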
The performance of Graviton2 meets our expectation for Neoverse N1 (or Cortex A76) better. How can Q80 manage to deliver so much higher IPC with the same architecture? Incredible.
For the love of god, please keep the axis scaling identical!
Same applies to every single metric always. If you provide separate graphs for different products, please make sure that axis-scaling is the same in all images!
Next step would be to see how ARM performs with 256MB (or more) of cache. The early models didn't suit many workloads compared to the general purpose kings (x86-64). Each generation of ARM-based server chips has seen an increase in the number of suitable workloads, making them more general purpose. Adding more specialised instructions to x86 has diminishing returns that increase the complexity of decoding & execution. It's always good to be challenged: "can we make it simpler & faster?"
ARM has been doing some of the same, though. Look at the evolution of ARMv8-A as it has aged, and you'll see several bolt-ons to target additional markets:
* new atomics
* signed, saturating multiplies
* CRC instructions
* half-precision floating point
* SVE (their answer to AVX/AVX-512)
* complex number support
* an instruction specifically for floating-point conversion in Javascript
* integer dot products
* random number generation
* matrix multiply & manipulation
* BFloat16 support
* a smattering of other virtualization and security-oriented additions
That's not a small list, and definitely not a less-is-more approach.
Imagine the response if someone made a prediction of this statement just five years ago:
"Intel’s current Xeon offering simply isn’t competitive in any way or form at this moment in time. Cascade Lake is twice as slow and half as efficient – so unless Intel is giving away the chips at a fraction of a price, they really make no sense."
It would've been more appropriate to compare Q64-33 to AMD to assess the merits of each architectural design. Could you repeat some of these tests, limiting the Altra to only 64 cores/threads?
It might be an interesting comparison with SMT disabled since the extra 25% of cores are the alternative to adding SMT. However should we also limit EPYC to 32MB L3 to make things more equal?
This is cool and all but what could an arm processor server be used for? Aside from some linux compilations there isn't that much support for arm processors.
This is looking a lot like the early days of computing where the BeBox was powered by a PowerPC CPU, a system looking to find a niche that doesn't exist (yet). It was cool for its time, but there was very little demand for it.
Pretty much everything needed to run modern web services has been ported to ARM. I believe a number of high-profile sites are now using Amazon Graviton2, due to its pricing advantage.
As for client computing, the Pi has emerged as a basic, usable desktop platform.
Also, a couple years ago, Nvidia announced support for their entire software stack on ARM host CPUs. So, I'd say it should also be a decent platform for machine learning, now.
You should have tested some CPU mining benchmarks like Monero's randomX, Wownero's randomwow, Turtle's argon2id Chukwa v2, and DERO's AstroBWT. That would have been interesting as shieeet.
I guess the speccpu2017 test result is wrong. Maybe there were some problems such as the BIOS version or the speccpu software configuration.
Reference links:
https://www.spec.org/cpu2017/results/rfp2017.html
https://www.spec.org/cpu2017/results/rint2017.html
AMD 7742 speccpu2017 intrate official value: 353 per socket
AMD 7742 speccpu2017 fprate official value: 270 per socket
INTEL 8280 speccpu2017 intrate official value: 172 per socket
INTEL 8280 speccpu2017 fprate official value: 141 per socket
Ampere Q80-33 official speccpu2017 intrate value: 300 per socket
If you have any questions or updates, please contact 799517515@qq.com.
148 Comments
realbabilu - Friday, December 18, 2020 - link
Well, it also supports Fortran via the Arm Fortran Compiler, unlike the M1.
realbabilu - Friday, December 18, 2020 - link
My bad. Numerical Algorithms Group (NAG) has Fortran for the M1. Let the battle begin: x86 vs Arm.
GruenSein - Friday, December 18, 2020 - link
The userbase for Fortran on M1 is probably super small anyway. Although... I can see the HPC cluster made up entirely of MacBook Airs before my eyes. Just like the PS3 cluster the Air Force used to have ;)
davidorti - Friday, December 18, 2020 - link
Wouldn't a cluster of Minis be way cheaper?
Flunk - Friday, December 18, 2020 - link
No, the hardware would be cheaper but the maintenance would be much more time-intensive. That's why companies that need a lot of processing hardware buy enterprise-level hardware. The cost of maintaining the system quickly eclipses the hardware costs. And if you're using a computer to make money, quite often the hardware cost is only a small amount of your costs.
FunBunny2 - Friday, December 18, 2020 - link
I dunno about the "quite often the hardware cost is only a small amount of your costs." part. As modern production methods have become ever more automated (I'm talkin' to you, bitcoin mining), there's almost no other cost. Now, some may argue, in the extreme case of mining for instance, that power is the largest component; but isn't that a 'hardware' cost? It certainly isn't labor or interest or land or even CxOs' cut. Fewer and fewer automation efforts are conducted in assembler or even naked C or Java or FORTRAN, but in frameworks, often with bespoke syntax and with headcounts way lower than their native languages. So, yeah, now and into the foreseeable future, hardware is the biggest byte.
at_clucks - Friday, December 18, 2020 - link
The point was a cluster of Minis would probably be cheaper than a cluster of Airs because why pay for screen, battery, keyboard and all that.
Spunjji - Monday, December 21, 2020 - link
True, but I did enjoy the holistic response. Just think of the potential: batteries are a built-in UPS, and you don't need to mess about with any sort of KVM arrangement - if a node drops out, you can go right to it and poke it to find out what's up!
ProDigit - Saturday, December 19, 2020 - link
I guess the results showing lower TDP despite 100% load mean that the cores are sometimes idling for a part of their clock frequency.
It means the CPU is lacking buffers, and isn't fully optimized.
mode_13h - Sunday, December 20, 2020 - link
Buffers and even cache can't completely avoid memory bottlenecks.
Also, you can run a core at 100% on code with very little parallelism and not draw much power. Code with lots of ILP and especially vector arithmetic burns a lot more power, which is why AVX2 and especially AVX-512 trigger significant clock-throttling on Intel.
mostlyfishy - Friday, December 18, 2020 - link
Interesting article, thanks. One thing I missed: what process is this on? 7nm?
It's also interesting that the M1 has demonstrated that with the right sizings, a very wide backend can give you significant single-threaded performance. Not really that useful for a server processor, where you're likely to be running many threads and want to trade for more cores, though.
Josh128 - Friday, December 18, 2020 - link
Yes, 7nm and monolithic, which seems fairly incredible as this thing is huge. Don't have the die size numbers though. Wonder what the yield is on these...
Calin - Friday, December 18, 2020 - link
Maybe there are quite a few more than 80 cores on this beast - in which case you can "eat" some die errors by deactivating cores/complexes/...
Wilco1 - Friday, December 18, 2020 - link
Each Neoverse N1 core with 1MB L2 is just 1.4mm^2, so 80 of them add up to 112mm^2. The die size is estimated at about 350mm^2, so tiny compared to the total ~1100mm^2 in EPYC 7742.
So performance/area is >3x that of EPYC. Now that is efficiency!
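For anyone who wants to check the napkin math, here is a minimal Python sketch. It uses only the figures quoted above (1.4mm^2 per core, a ~350mm^2 die estimate, ~1100mm^2 of total silicon in EPYC 7742). The die size is an estimate, not a disclosed number, and the >3x performance/area claim additionally assumes the two parts perform about the same.

```python
# Figures quoted in the comment above. The Altra die size is an estimate,
# not a disclosed number.
CORE_AREA_MM2 = 1.4            # Neoverse N1 core incl. 1MB L2, as quoted
ALTRA_CORES = 80
ALTRA_DIE_MM2 = 350.0          # estimated die size
EPYC_7742_TOTAL_MM2 = 1100.0   # approximate total silicon, as quoted


def core_area(cores, per_core_mm2=CORE_AREA_MM2):
    """Total area taken by the CPU cores alone, in mm^2."""
    return cores * per_core_mm2


def area_ratio(bigger_mm2, smaller_mm2):
    """How many times more silicon the bigger part uses."""
    return bigger_mm2 / smaller_mm2


if __name__ == "__main__":
    cores_mm2 = core_area(ALTRA_CORES)
    share = cores_mm2 / ALTRA_DIE_MM2
    print(f"core area: {cores_mm2:.0f} mm^2 ({share:.0%} of the estimated die)")
    ratio = area_ratio(EPYC_7742_TOTAL_MM2, ALTRA_DIE_MM2)
    print(f"EPYC 7742 / Altra silicon ratio: {ratio:.2f}x")
```

If performance is roughly equal, the silicon ratio is also the performance/area ratio, which is where the ">3x" comes from.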
andrewaggb - Friday, December 18, 2020 - link
Timing of this article is awkward. We're comparing against the 18-month-old 7742 rather than the soon-to-be-released Zen 3 Milan parts, which, based on the already launched Zen 3 desktop parts (and Milan leaks), will be 9-27% faster in the same power envelope.
Cache is a big part of the die size for the AMD chip, and the N1 has much less of it, which makes the die size smaller. AMD's desktop IGP parts with way less cache perform very similarly in many workloads to those with the extra cache, and the same has been true for Intel parts over the years. Some workloads don't benefit much at all from the extra cache and some do, which makes choosing the benchmarks more important.
That's not to say the N1 isn't more efficient, but rather that it's hard to make a fair comparison, particularly around die size. They may have similar core counts but have made very different design decisions around cache.
Wilco1 - Friday, December 18, 2020 - link
I don't see how it matters, but Altra is about 9 months old and Neoverse N1 is a sibling of Cortex-A76, which has been available in phones for 2 years. As for Milan, I expect the gain on SPECrate to be about 10-15%. And Milan will be competing with the Altra Max, which has 60% more cores, which should give ~40% speedup.
Yes, the design decisions are quite different, and it is interesting that they end up with similar performance despite the disparity in L3 cache. I suspect that 8 memory channels is becoming a limit, and a future generation with DDR5 will enable more throughput (and even more cores).
Gondalf - Friday, December 18, 2020 - link
I am sorry, but looking carefully at the heatsink and the application of the thermal paste, we are facing a reticle-limit thing on 7nm.
We are looking at a 700/800 mm2 thing. On 7nm this means very few units sold and nearly zero market penetration. Same thing on 5nm given the higher core numbers.
In practice we have nothing in our hands. Another failure in the server market.
Andrei Frumusanu - Friday, December 18, 2020 - link
Ampere is doing Altra Max with 128 cores still on 7nm, so this one certainly isn't near hitting reticle limits.
Wilco1 - Friday, December 18, 2020 - link
No, it is not anywhere near the reticle limit. You can't estimate the die size from the heatsink, but you can estimate it based on similar designs for which we do have numbers. Graviton 2 is a similar design at 30B transistors. This has another 16 cores, which adds another 16 x 1.4 = 22.4mm^2. So around 350mm^2 in N7.
milli - Monday, December 21, 2020 - link
This is just a ridiculous statement. 350mm^2 ... no way.
Firstly, the die size of Graviton 2 is not known.
A realistic comparison would be AMD's Zen2 chiplet, which has 3.9b transistors and is 72mm^2.
One would deduce from that, that Graviton 2 is > 550mm^2. Also your napkin calculation to add 22mm2 is flawed. Firstly, you don't know if an N1 core is actually taking 1.4mm^2 in this CPU. Secondly, you're forgetting to add 64 PCI-E lanes.
Let's say 25mm2 for the CPU and 25mm2 for the lanes. That would bring the total to 600mm^2. Quite a bit bigger than your 350mm^2.
Wilco1 - Monday, December 21, 2020 - link
Using Zen 2 is not correct since it uses much larger transistors. Using Kirin 990 5G density gives an estimate of 330mm^2 for Graviton 2. The size of N1 cores has been published for 7nm, so we know it is 1.4mm^2. You're right that PCIe lanes would add to it as well - assuming the PHYs have the same size as DDR PHYs at the same speed, 64 lanes would be about 12-15mm^2. That would increase it to about 365mm^2.
milli - Monday, December 21, 2020 - link
Kirin 990 5G uses N7+. Altra uses N7.
Not only is the process different but they're also totally different categories of products concerning transistor density. A mobile SOC can be very dense. It barely has any IO (which is not transistor dense). Also GPU, DPU, IMG, ... all are extremely dense.
Kirin 990 5G is 90MTr/mm^2.
No way a server class SOC is going to be more than 60MTr/mm2.
Renoir = 62, Navi 21 = 52, Zen2 = 54, Vega 20 = 40, Navi 10 = 41.
Ampere isn't going to magically break 60.
"The size of N1 cores has been published for 7nm, so we know it is 1.4mm^2"
Those are ARM numbers and that is only if you use high density libraries.
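The two die-size estimates in this exchange differ only in the transistor density each side assumes. A minimal Python sketch of that arithmetic, using the 30B transistor count quoted for Graviton 2 and the densities mentioned in the thread (none of these are disclosed die sizes, so treat the output as a range of guesses rather than a measurement):

```python
def die_area_mm2(transistors, density_mtr_per_mm2):
    """Die area implied by a transistor count at a given density (MTr/mm^2)."""
    return transistors / (density_mtr_per_mm2 * 1e6)


GRAVITON2_TRANSISTORS = 30e9  # as quoted in the thread

# Zen 2 chiplet: 3.9B transistors in 72mm^2, as quoted (~54 MTr/mm^2).
ZEN2_DENSITY = 3.9e9 / 72 / 1e6

# Densities argued over in the thread, in MTr/mm^2.
SCENARIOS = {
    "Zen 2 chiplet density": ZEN2_DENSITY,
    "claimed server-class ceiling": 60.0,
    "Kirin 990 5G density": 90.0,
}

if __name__ == "__main__":
    for name, density in SCENARIOS.items():
        area = die_area_mm2(GRAVITON2_TRANSISTORS, density)
        print(f"{name} ({density:.0f} MTr/mm^2): {area:.0f} mm^2")
```

The spread between the low and high density assumptions is the whole disagreement: the transistor count is common to both sides.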
Wilco1 - Monday, December 21, 2020 - link
Arm servers don't need high performance libraries - even mobile phones clock over 3 GHz using high density libraries. See https://images.anandtech.com/doci/13959/03_Infra%2... (note 3.1GHz and 1.4mm^2 with 1MB L2 on 7nm is ~100MT/mm^2)
Using ~90MT/mm^2 for 7nm is reasonable since that is the reported density of recent 7nm chips (Kirin 990 5G is 91, 4G is 88 - the older 980 gets 93). Mobile SoCs already have a large amount of IO and analog logic and we are multiplying that amount by 3x.
This shows how stupid it is to use high performance libraries in server chips - they don't need to run at 5GHz!
milli - Monday, December 21, 2020 - link
We have different opinions but there's only one true fact: the die size is not disclosed. So anything anyone says is just a pure guess. You can't throw it around as fact.
milli - Monday, December 21, 2020 - link
Navi 10/20 chips run at < 2GHz and are 40MTr/mm^2. Just because Altra runs at 3.3GHz doesn't mean that it doesn't use HPL.
Josh128 - Friday, December 18, 2020 - link
Exactly-- no way in hell this thing is just 350mm^2. The package is huge. Why would a 350mm^2 die need such a giant package?
Wilco1 - Friday, December 18, 2020 - link
The package is only 16% larger than EPYC. Do you see any opportunity to reduce the huge number of pins? There are 8 memory channels plus full 128 PCIe lanes.
mode_13h - Sunday, December 20, 2020 - link
Yes, the problem Altra Max will likely face is more memory bottlenecks. Also, I wonder if they'll have to dial clocks down, a little, to keep the power-efficiency numbers attractive.
Wilco1 - Monday, December 21, 2020 - link
Altra Max drops max frequency to 3GHz, but it's not clear whether the TDP stays the same.
Gondalf - Friday, December 18, 2020 - link
Are you sure :). Come on
Josh128 - Friday, December 18, 2020 - link
Did you see the chip package? It's the size of an EPYC package. I'm extremely doubtful it's only 350mm^2.
mode_13h - Sunday, December 20, 2020 - link
Look at where they show the bottom of the heatsink and its small contact area. That shows the actual die is much smaller.
Spunjji - Monday, December 21, 2020 - link
Doubt all you want - they have to put the pins for the interfaces somewhere, and that doesn't change much regardless of die size.
Gondalf - Friday, December 18, 2020 - link
Obviously it is a niche CPU, not high volume like Intel or AMD. With so large a die we will not see many of these around. As usual, only volume matters in the server world.
So no worries for x86.
eastcoast_pete - Friday, December 18, 2020 - link
Actually, those are a bigger threat to x86 than ARM chips like the M1 in personal computers. Server x86/x64 CPUs are where AMD and Intel make a lot of their money. The key question for this and similar Neoverse chips is software support. If you can run your database or whatever natively on an ARM-native OS like Linux, these are tempting. Now, if MS would release Exchange Server natively for ARM, the threat would be even bigger.
Gondalf - Friday, December 18, 2020 - link
Agreed about software, but I don't see problems for x86 dominance.
The major sin of this design is die size, around 800mm2 judging by the photos in the article. On 7nm it means a very low CPU output; this issue will become even worse on 5nm.
So it is not a matter of how good a SKU is but of who has the real volume in the server world. In past decades we have seen a lot of better CPUs than x86 puppies, but in spite of this they all have lost their way.
The winning scheme is "volume". This is the only parameter that gives one solution dominance over the others, especially today with several millions/year of server SKUs absorbed by the market.
Altra is not born to beat x86, at least not in this crazy, old-style incarnation. They need to follow AMD's (and shortly Intel's) path, otherwise they will never be relevant.
Current and upcoming advanced processes are not made for these massive things.
scineram - Saturday, December 19, 2020 - link
It's less than half that, you absolute retard moron.Wilco1 - Friday, December 18, 2020 - link
Apple's move to Arm does hit Intel's bottom line by many billions. A large percentage of AWS is already Graviton as more big customers are moving to it (latest is Twitter). Oracle is going to use Ampere Altra, and Microsoft is reportedly developing their own Arm servers.
As Gondalf said, volume matters in the server world, and they are moving to Arm.
Spunjji - Monday, December 21, 2020 - link
I love Gondalf posts. Minimum-effort confirmation bias ramblings.
eastcoast_pete - Friday, December 18, 2020 - link
That was my question also! Who fabs it, and what is their yield? This thing is quite big. Does anyone know if they overprovision cores so they can use those with small, very partial defects? At that size and those numbers of transistors, even a tiny probability of a defect can mean that the great majority of chips ends up in the circular bin (garbage).
Wilco1 - Friday, December 18, 2020 - link
How exactly is it big? It's tiny for a server chip - 80 cores at about half the die size of a typical 28-core Xeon (~700mm^2). And TSMC 7nm yield is extremely good even for much larger chips like GPUs.
Ithaqua - Friday, December 18, 2020 - link
Plus as with all chips, there may be 64 / 48 / 32 core versions which are just standard chips with the defective core block turned off.
eastcoast_pete - Saturday, December 19, 2020 - link
Note I wrote "quite big", and by transistor count, it's a larger CPU, expected for a server chip. As for the Xeon, how high is Intel's yield for the 28 core Xeon, and that after how many years on 14 nm+++ (etc)? So, if you have a yield number for this 80 core Ampere chip, please share it.
Wilco1 - Saturday, December 19, 2020 - link
It's larger than a mobile SoC, but small for a server chip thanks to Arm's tiny cores and the high density of TSMC 7nm. See https://www.anandtech.com/show/16028/better-yield-... for the defect rate, and from that a simple yield calculator gives 71% for a 350mm^2 die. That's before you fix SRAM defects or harvest dies for the lower-end SKUs. So we conclude yield is very good.
eastcoast_pete - Saturday, December 19, 2020 - link
Glad to read that you've proven Andrei wrong, so maybe you should write these reviews. Here is a direct quote from the first page of the review (also, take a look at the pictures): "The chip itself is absolutely humongous and amongst the current publicly available processors is the biggest in the industry, out-sizing AMD’s SP3 form-factor packaging, coming in at around 77 x 66.8mm – about the same length but considerably wider than AMD’s counterparts."
Wilco1 - Saturday, December 19, 2020 - link
How ignorant can you be? Obviously the chip and silicon die have different sizes. The chip is large because it has many pins. The silicon die is a tiny fraction of the chip. We're discussing the size of the silicon die here, not the size of the chip. Completely different things.
mode_13h - Sunday, December 20, 2020 - link
It'd be less confusing if you'd talk about the "package" dimensions.
I think die and chip are traditionally synonymous. For instance, a package with multiple dies is traditionally called an MCM (Multi-Chip Module).
Wilco1 - Monday, December 21, 2020 - link
Look at Andrei's quote above, there isn't a well-defined term - people use chip/CPU/package etc as synonyms.
mode_13h - Monday, December 21, 2020 - link
But it's not hard to see how "chip" can cause confusion. So, why not avoid it entirely, and just say either "die" or "package".
Only a troll or someone with an agenda could be against clear communication.
Wilco1 - Monday, December 21, 2020 - link
It's hard to imagine how anyone sane can believe that a "chip measuring 77 x 66.8mm" (6 times the reticle limit!) is referring to the die size rather than the package. Andrei's quote even uses the word package. So maybe you're right and eastcoast_pete was just trolling.
mode_13h - Monday, December 21, 2020 - link
I agree that people should do a sanity-check on their numbers.
Spunjji - Monday, December 21, 2020 - link
"This thing is quite big."
Package size is not die size.
If Nvidia can pump out dies more than twice the area on an inferior process and still get some perfect dies, I suspect they'll have no issues whatsoever with yield on TSMC 7nm at this stage - especially with the ability to sell lower-core-count variants.
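To see why die size matters so much for yield, here is a minimal Python sketch of the textbook Poisson yield model. The model choice and the 0.1 defects/cm^2 density are assumptions for illustration only; the calculator and exact defect rate behind the 71% figure quoted earlier in the thread aren't given here.

```python
import math


def poisson_yield(die_area_mm2, defects_per_cm2):
    """Fraction of dies with zero defects under a simple Poisson model."""
    area_cm2 = die_area_mm2 / 100.0  # 100 mm^2 per cm^2
    return math.exp(-area_cm2 * defects_per_cm2)


ASSUMED_D0 = 0.1  # defects per cm^2, hypothetical value for illustration

if __name__ == "__main__":
    # 350mm^2 is the low estimate argued in the thread, 700-800mm^2 the high one.
    for area in (350, 700, 800):
        print(f"{area} mm^2 die: {poisson_yield(area, ASSUMED_D0):.0%} defect-free")
```

This counts only fully defect-free dies, so it understates sellable output whenever cores can be fused off and the die sold as a lower-core-count part.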
Samus - Sunday, December 20, 2020 - link
Die-harvested models with fewer cores sell for only 5-10% less. So I'm not sure if that means yields are really good, or really bad. Apparently they seem to be pushing the 80 core models pretty hard since so many are being offered.
Then again, it depends on what we define as yield quality. Defects seem to be low, but binning could be another issue, as only two models seem to hit 3.3GHz, and at incredibly high power budgets.
Spunjji - Monday, December 21, 2020 - link
3.3GHz is about where that architecture tops out - I'm not sure that tells us much about yield. To me, the pricing seems to indicate that they aren't expecting to have to shift a ton of the lower-core-count die-harvested models.
damianrobertjones - Friday, December 18, 2020 - link
Assuming that Intel just wants to milk customers forever, just like nVidia/phone OEMs do, they should quickly bridge the performance gap. They'll just have to stop being lazy and actually provide us with more than a drip-fed speed increase.
fishingbait - Friday, December 18, 2020 - link
An Apple guy I see. Remember that up until a couple of months ago Apple was charging you $1000 for a laptop with a dual core 1.1GHz chip.
The "phone OEMs" finally have a core that can somewhat compete with Apple's Firestorm. It will take them a couple iterations to perfect it but they are on the right path. As for Intel, another story entirely. The latest word is that their 10nm process isn't going too well and they have hit yet another delay for 7nm. They may hit up Samsung's foundries just to get a product out (due to TSMC not having any capacity until December 21). So while their issues are far more significant than those for Android phones, it isn't due to their lazily milking customers. They have real tech issues to deal with, issues of the sort that Apple and AMD don't have to worry about because they lack the capability and expertise required to make their own chips.
mode_13h - Sunday, December 20, 2020 - link
> because they lack the hubris to think they should try to make their own chips.
Fixed that for you.
Spunjji - Monday, December 21, 2020 - link
"So while their issues are far more significant than those for Android phones, it isn't due to their lazily milking customers."
Correct, their technical issues are entirely separate from their strategy of lazily milking customers.
Ridlo - Friday, December 18, 2020 - link
While no Blender test is indeed a bummer, did you guys try testing with other ray tracing applications (LuxMark, C-Ray, Povray, etc.)?
Andrei Frumusanu - Friday, December 18, 2020 - link
I didn't have a standalone test, but Povray is part of SPEC.
mode_13h - Thursday, December 31, 2020 - link
Isn't Blender included in SPECfp2017 as 526.blender_r? Or is that something different?
Teckk - Friday, December 18, 2020 - link
Whoever decided on naming these products - fantastic job. Simple, clear and effective.
Maybe you can offer some free advice to Intel and Sony.
Calin - Friday, December 18, 2020 - link
The answer to the question of "how powerful it is" is clear - more than good enough.
The real question in fact is:
"How much can they produce?"
AMD has the crown in x86 processor performance, but this doesn't really matter very much as long as they can build enough processors only for a part of the market.
jwittich - Friday, December 18, 2020 - link
How many do you need? :)
Bigos - Friday, December 18, 2020 - link
64kB pages might significantly enhance performance on workloads with large memory sets, as the TLB will be up to 16x less used. On the other hand, memory usage of the Linux file system cache will also increase a lot.
Would you be able to test the effect of 64kB vs 4kB page size on at least some workloads?
Andrei Frumusanu - Friday, December 18, 2020 - link
It's something that I wanted to test but it requires an OS reinstall / kernel recompile - I didn't want to get into that rabbit hole of a time sink as I had already spent a lot of time verifying a lot of data across the three platforms over a few weeks.
arnd - Friday, December 18, 2020 - link
I'd love to see that as well. For workloads that use transparent huge pages, there should not be much difference since both would use 2MB huge pages (512*4KB or 32*64KB), plus one or more even larger page sizes, but it needs to be measured to be
The downsides of 64KB requiring larger disk I/O and more RAM are often harder to quantify, as most benchmarks try to avoid the interesting cases.
I've tried benchmarking kernel compiles on Graviton2 both ways and found 64kB pages to be a few percent faster when there is enough RAM, but forcing the system to swap by limiting the total RAM made the 64kB kernel easily 10x to 1000x slower than the 4kB one, depending on how the available memory lined up with the working set.
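The page-size arithmetic in this sub-thread can be sketched in a few lines of Python. The TLB entry count below is a hypothetical round number chosen for illustration, not the Neoverse N1's actual TLB size; only the 4KB/64KB/2MB figures come from the comments above.

```python
KB = 1024
MB = 1024 * KB


def pages_per_huge_page(base_page, huge_page=2 * MB):
    """How many base pages make up one huge page."""
    return huge_page // base_page


def tlb_reach(entries, page_size):
    """Bytes of memory covered when every TLB entry maps one page."""
    return entries * page_size


TLB_ENTRIES = 1024  # hypothetical entry count, for illustration only

if __name__ == "__main__":
    for page in (4 * KB, 64 * KB):
        count = pages_per_huge_page(page)
        reach_mb = tlb_reach(TLB_ENTRIES, page) // MB
        print(f"{page // KB}KB pages: {count} per 2MB huge page, "
              f"TLB reach {reach_mb}MB with {TLB_ENTRIES} entries")
```

The 16x difference in reach is the "up to 16x" mentioned above. Once both kernels map the working set with 2MB huge pages, the reach is the same, which matches the expectation that transparent huge pages narrow the gap.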
abufrejoval - Friday, December 18, 2020 - link
Thank you for the incredible amount of information and the work you put into this: Anandtech's best!
Yet I wonder who would deploy this and where. The purchasing price of the CPU would seem to become a rather minuscule part of the total system cost, especially once you go into big RAM territory. And I wonder if it's not similar with the energy budget: I see my larger systems requiring more $ and Watts on RAM than on the CPUs. Are they doing, can they do anything there to reduce DRAM energy consumption vs. Intel/AMD?
The cost of the ecosystem change to ARM may be less relevant once you have the scale to pay for it, but where exactly would those scale benefits come from? And what scales are we talking about? Would you need 100k or 1m servers to break even?
And what sort of system load would you have to reach/maintain to have significant energy advantages vs. x86 iron?
Do they support special tricks like powering down quadrants and RAM banks for load management? Do they enable quick standby/activation modes so that servers can be taken offline and brought back for load management?
And how long would the benefits last? AMD has demonstrated rather well that the ability to execute over at least three generations of hardware is required to shift attention even from the big guys, and they still have all the scaling benefits the x86 installed base provides.
These guys are on a 2nd generation product and promise a 3rd, but essentially this would seem to have the same level of confidence as the 1st EPYC.
askar - Friday, December 18, 2020 - link
Would you mind testing ML performance, i.e. Python's SKLearn library classes that can be multithreaded (random forest for example)?
mode_13h - Sunday, December 20, 2020 - link
MLPerf?
SarahKerrigan - Friday, December 18, 2020 - link
Looks like the LLVM compile time graph has "Quadrant" twice when one of the graph labels should be "Monolithic."
Andrei Frumusanu - Friday, December 18, 2020 - link
No, that's correct. Both 1S and 2S figures were run in quadrant modes.
SarahKerrigan - Friday, December 18, 2020 - link
You're absolutely right - I'm apparently illiterate and ignored the socket count.
cbm80 - Friday, December 18, 2020 - link
Says "KOREA"...so made by Samsung?
jeremyshaw - Friday, December 18, 2020 - link
Probably just packaging.
danbob999 - Friday, December 18, 2020 - link
Does the price of the CPU matter at this point? It's not as if you are going to build your own system, is it? Isn't buying the whole computer the only option? Do they sell motherboards as well?
Spunjji - Monday, December 21, 2020 - link
People like to talk about how RAM costs rapidly sink the cost of the CPU *as a proportion of total cost*, which is true, but saving $2000 per socket still adds up when you're buying a fair few of them. Saving hundreds of thousands of dollars on a datacentre build-out (and with no significant performance downsides) is good financial sense, even if you're still spending millions overall.
kepstin - Friday, December 18, 2020 - link
I really want to see what would happen if an M1-based design got scaled down to something suitable for a workstation.
I think something like that is needed in order to solve problems like what you were running into with Blender - until we have decent high performance aarch64 workstations on developers' desks, x86-64 is gonna be ahead on software support.
jeremyshaw - Friday, December 18, 2020 - link
Yeah, I guess we all have to just wait for Apple.
*technically* Nvidia did and does sell Jetson boards in that range (they even delivered the first consumer accessible PCIe 4.0 root, in the AGX Xavier), but Nvidia's insistence on locking everything down to an ancient kernel basically kills any hope of it being used outside of specific fields. Even something as simple as their teams keeping up with LTS releases would be somewhat viable, but nope. Even Raspberry Pi foundation is far ahead of them, now.
I think they are still trying to upgrade their bootloader to allow booting from NVMe cards? At least their AGX Xavier is finally picking up beta UEFI support (which should help somewhat).
I guess they would rather Apple eat their lunch. Oh, well.
mode_13h - Sunday, December 20, 2020 - link
Nvidia is selling those for robotics and stuff. Apple is not trying to eat that lunch.
mode_13h - Sunday, December 20, 2020 - link
There's a YouTube video of some guy running Blender on a Pi v4.
Der Keyser - Friday, December 18, 2020 - link
These are interesting times indeed... The question is what having AArch64 available on very nice desktops/laptops from Apple will do in terms of incentivising devs to work with ARM. Knowing that there are servers with compute power per socket that equals or outperforms x86 at lower prices must appeal to a lot of Java and cloud-vendor workloads.
Oxford Guy - Friday, December 18, 2020 - link
Since Ampere should be able to run Crysis very well I expected to see that in the benchmarks.
mode_13h - Sunday, December 20, 2020 - link
Try it under x86 emulation on the M1 and let us know how it goes.
akmittal - Friday, December 18, 2020 - link
Year of ARM on server
mode_13h - Sunday, December 20, 2020 - link
Starting with Graviton2 and ending with this, it's certainly the year they arrived on the scene.
Oxford Guy - Friday, December 18, 2020 - link
'Where Ampere and the Altra definitely is beating AMD in is TCO, or total cost of ownership. Taking the flagship models as comparison points – the Q80-33 costs only $4050 which generally matching the performance of AMD’s EPYC 7742 which still comes in at $6950, essentially 42% cheaper.'
Does per-core licensing cut into that advantage at all?
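To put rough numbers on the licensing question, here is a Python sketch that assumes a hypothetical flat per-core fee charged on physical cores at the same rate on both platforms (real licensing terms vary a lot and often don't work this way). The prices and the Q80-33's 80 cores are from the article quote; the EPYC 7742 has 64 cores.

```python
ALTRA_PRICE, ALTRA_CORES = 4050, 80   # Q80-33 list price as quoted
EPYC_PRICE, EPYC_CORES = 6950, 64     # EPYC 7742 list price as quoted


def discount(cheaper, pricier):
    """Fractional saving of the cheaper part relative to the pricier one."""
    return (pricier - cheaper) / pricier


def total_cost(cpu_price, cores, license_per_core):
    """CPU price plus a hypothetical flat per-core license fee."""
    return cpu_price + cores * license_per_core


def break_even_license_per_core():
    """Per-core fee at which the extra 16 cores eat the whole CPU saving."""
    return (EPYC_PRICE - ALTRA_PRICE) / (ALTRA_CORES - EPYC_CORES)


if __name__ == "__main__":
    print(f"CPU price saving: {discount(ALTRA_PRICE, EPYC_PRICE):.0%}")
    fee = break_even_license_per_core()
    print(f"break-even license fee: ${fee:.2f} per core")
    for per_core in (0, 100, 500):  # hypothetical fees in dollars
        altra = total_cost(ALTRA_PRICE, ALTRA_CORES, per_core)
        epyc = total_cost(EPYC_PRICE, EPYC_CORES, per_core)
        print(f"${per_core}/core: Altra ${altra}, EPYC ${epyc}")
```

Under these assumptions, any per-core fee above the break-even value makes the 80-core part the more expensive one per socket, which is the concern behind the question.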
Spunjji - Monday, December 21, 2020 - link
Where it comes into play, for sure
name99 - Friday, December 18, 2020 - link
"it’s also a piece of hardware that the general public cannot access outside of Amazon’s own cloud services"
This is no longer *technically* true:
https://aws.amazon.com/blogs/aws/reinvent-2020-pre...
Soon you should be able to buy an AWS Outpost with Graviton2 inside which kinda sorta straddles the line between "owning" and "accessed via Amazon cloud services".
watersb - Friday, December 18, 2020 - link
Wow. There's a lot to learn here.
Very excited to receive my Apple M1 MacBook Pro. I hope it gives me some perspective on how performance can be applied to scientific local workstation computing.
Silver5urfer - Friday, December 18, 2020 - link
25% more cores for Zen2 7742 class. If paired with multi socket and then Milan drop in this is not going to be any major breakthrough.
"The Arm server dream is no longer a dream, it’s here today, and it’s real." lol so until today all the articles on the ARM are not real I guess.
Anyways I will wait for market penetration of this with server share and then see how great ARM is and how bad x86 is going to be as from AT's narrative recently.
Spunjji - Monday, December 21, 2020 - link
Are you this mopey every time there's a paradigm-shift in the tech industry? Feel free to keep looking for metrics that "prove" you right, but eventually it's going to be a very hard search.
eastcoast_pete - Friday, December 18, 2020 - link
Thanks Andrei! Maybe I am barking up the wrong tree here, but I find the "baby" server chip in that lineup particularly interesting. Nowhere near as fast as this, of course, but for $800, it might make for a nice CPU for a basic server setup; nothing fancy, but low TDP, and would probably get the job done. The question here is how expensive the MB for those would be.
Lastly, if Ampere sends you one of those $800 ones, could/would you test it?
Wilco1 - Friday, December 18, 2020 - link
They will likely sell desktops using these just like the previous generation, but they are not cheap as it is high-end server gear using expensive ECC memory (and lots of it since there are 8 channels). If you don't need the fastest then there is e.g. NVIDIA Xavier or LX2160A (16x A72) boards for around $500.
Spunjji - Monday, December 21, 2020 - link
I think those are probably most useful for workloads that are pathologically memory and/or I/O limited - 4TB per socket, save ~$3000 over the faster CPU, benefit from power savings over the life of the server.
twtech - Friday, December 18, 2020 - link
Ironically, AMD's opportunity to win might turn into an ultimate loss - Intel's manufacturing advantage kept x86 relevant, and with access to the x86 instruction set limited by ownership of the IP, AMD lived alongside Intel in that walled garden.
With the manufacturing advantage gone however, Apple has left the garden, and maybe other personal computers won't be far behind - software compatibility I think is actually less of an issue in the era of SaaS and continuous updates. I.e. you were going to have to download new versions of the software you use as time went on anyway.
FunBunny2 - Friday, December 18, 2020 - link
"you were going to have to download new versions of the software you use as time went on anyway."
Solar Wind? :)
lorribot - Friday, December 18, 2020 - link
This is all great but when all licencing is per core it limits the usage scenarios or benefits of these developments as they can really only be used with open source type licences.
For the rest of us on Windows, Oracle, Java, Apple, IBM, etc licencing it doesn't bring anything to the table.
The_Assimilator - Friday, December 18, 2020 - link
Just in time to be obsoleted by Milan.
Spunjji - Monday, December 21, 2020 - link
For a given definition of "obsoleted", where it means "still more than competitive in performance per dollar at a lower price of entry".
Brane2 - Saturday, December 19, 2020 - link
Meh. Nothing special. it has been benchmarked on Phoronix and it performed more or less on par with Rome. 80 newest ARM cores against 64 mature x86 cores within constrained power envelope.
Naples is just about to come out and I suspect some time after that AMD will have something like really wide new RISC-V cores.
Wilco1 - Saturday, December 19, 2020 - link
It won most benchmarks on Phoronix while using significantly less power. Yes Milan is about to be released, and it will have to compete with the 128-core Altra Max. Which do you believe is going to win - 64 SMT cores or 128 real cores?
mode_13h - Sunday, December 20, 2020 - link
It actually won less than half of the benchmarks on phoronix, since a number of those graphs just re-state the results in score/W. There are also questions over some of the compiler options used on those benchmarks, since many of the tests are compiled with options that won't enable AVX on benchmarks where it should be beneficial (yet, not having SVE, the N1 cores are at no such disadvantage).
Wilco1 - Monday, December 21, 2020 - link
"should be beneficial" -> "might help in a few limited cases". AVX/AVX512 isn't that useful for general C/C++ code. You typically only see large gains when people optimize using intrinsics.
mode_13h - Monday, December 21, 2020 - link
Intrinsics don't compile if they're for a CPU arch beyond what the compiler is being instructed to target. So, even packages where people take the time to optimize with intrinsics need to guard them with compile-time checks to ensure the CPU target is capable of executing those instructions.
Compilers do generate vectorized code. I don't know how well GCC is doing on that front, lately, but the TNN tests should be a good way to see that. Too bad those tests don't use -march=native.
What's interesting about TNN is I'm looking at the exact source revision Phoronix is using, and it seems they've completely dropped their backend for x86. The source/tnn/device/x86/ is simply missing. So, I wonder if they decided the compiler was good enough that they didn't need to bother with their own hand-optimized code for it, or if they just decided they don't care how fast their stuff runs on it.
See:
* https://openbenchmarking.org/innhold/83a730ed41d4e...
* https://github.com/Tencent/TNN/tree/v0.2.3
Wilco1 - Monday, December 21, 2020 - link
TNN does not benefit from -march=native. Phoronix uses the generic C++ version which doesn't benefit from vectorization. Try it yourself.
Optimized versions using intrinsics typically use runtime checks so you automatically get the fastest version that works on your CPU. The makefile selects the right ISA variant for any files using intrinsics. But none of this is used in the TNN test.
mode_13h - Monday, December 21, 2020 - link
> TNN does not benefit from -march=native. Phoronix uses the generic C++ version which doesn't benefit from vectorization. Try it yourself.
At this point, I probably will.
> Optimized versions using intrinsics typically use runtime checks so you automatically get the fastest version that works on your CPU.
That's a whole additional level of effort for the developers. For them to bother compiling and conditionally calling different versions only makes sense if they think their main userbase aren't going to bother recompiling specifically for their hardware. In the case of specialized packages, it's reasonable to expect your users to take a little trouble for the best performance. It's really things like very low-level libs or multimedia code where you tend to see the sort of elaborate runtime detection and dynamic codepath selection that you're describing.
mode_13h - Monday, December 21, 2020 - link
I think Basis Universal and High Performance Conjugate Gradient are some other cases where the wider SIMD of Zen2 and Skylake-SP should confer significant benefit.
Wilco1 - Monday, December 21, 2020 - link
"should give significant benefit" -> "might give some benefit". I suggest you try out. Autovectorization is not nearly as good as you seem to believe, and the overall speedup is often disappointing even if some loops are 10-20x faster.
vinayshivakumar - Saturday, December 19, 2020 - link
I am a bit puzzled why none of these processors support SMT... Can someone shed light on why this is the case?
Wilco1 - Saturday, December 19, 2020 - link
SMT gives very little benefit (only 15% faster on SPECINT and 3.5% *slower* on SPECFP), adds a lot of area and complexity, and results in very bad per-thread performance.
It's always better to use 2 real cores instead of 1 SMT core. So if you have small cores like Neoverse N1, adding SMT makes no sense at all.
mode_13h - Sunday, December 20, 2020 - link
Whether SMT makes sense depends on the workload. Many tasks have greater branch-density and less computational density, in which case SMT is a massive win. I'm compiling code all the time and see huge speedups from SMT.
That said, of course I'll take two real cores instead of 2-way SMT, all else being equal, but that always costs more. If SMT really made as little sense as you say, then it wouldn't be nearly so widespread.
Wilco1 - Monday, December 21, 2020 - link
There are certainly cases where SMT helps, but having some wins doesn't mean it is worth adding SMT. All too often people talk up the upsides and ignore the downsides. Let's see how Altra Max does vs Milan next year, that should answer which is best.
Note almost none of the billions of CPUs sold every year have SMT (even if we exclude embedded). Adding another core is simpler, cheaper and gives more performance.
mode_13h - Monday, December 21, 2020 - link
> Let's see how Altra Max does vs Milan next year, that should answer which is best.
I disagree. That's a bit apples and oranges. The differences between Zen2/Skylake and N1 are too big. You need to look at the overhead of SMT for a particular core vs. the benefits for that core.
These x86 cores are large not just because of SMT, but also the x86 tax, their wide vector units, and other things. It could be that SMT adds just 5% overhead, and that's not enough to increase your core count hardly at all, if you drop it.
Wilco1 - Monday, December 21, 2020 - link
It's never going to be a perfect comparison - nobody will design SMT and non-SMT variants of the same core! There are many differences because they use different design principles. However it will clearly show which of these designs works out best for top-end server performance.
Spunjji - Monday, December 21, 2020 - link
SMT is a clear win where you already have large cores with a lot of execution resources, whereby the extra resources required aren't a large proportion of overall die area. It also helps if your tasks are focused on overall performance for a given number of threads, rather than performance-per-thread.
Where the cores are this small, though, simply adding more of them seems to be the better option.
mode_13h - Monday, December 21, 2020 - link
Well said.
mode_13h - Sunday, December 20, 2020 - link
Plus, per-thread performance is only an issue if that's how you're paying for CPU time, and then what you actually care about is thread performance per dollar, which would compensate for any cost differences due to SMT, as well.
Wilco1 - Monday, December 21, 2020 - link
And Graviton 2 shows which is cheaper and faster.
mode_13h - Monday, December 21, 2020 - link
That comparison is only valid for Amazon customers and in the short term. It can't be used to support a broader conclusion about SMT, because we lack transparency into the cost structure of Amazon's hosting, like whether they're subsidizing Graviton2 servers or even just charging enough for them to simply break even on the hardware.
Wilco1 - Monday, December 21, 2020 - link
Why would they introduce Graviton if it would run at a loss??? A significant percentage of AWS is already Graviton (probably 20% by now). If anything Graviton increases profitability due to vertical integration and other cost reduction.
mode_13h - Monday, December 21, 2020 - link
First, there's a fundamental disparity between an in-house CPU and a 3rd Party one, where Amazon can cut out some overheads by building their own. So, that already skews the price-comparison.
The other question is whether Amazon is partially-subsidizing the price of their Graviton2 instances as an incentive to get more people to switch. For a business, the least risky thing is to stay on x86, so Amazon needs to present an immediate and significant cost savings to get people to switch. After they've switched and ARM server cores have had more time to mature, Amazon can charge more and make back a good return on investment.
I obviously don't know if that's what they're doing, but we don't know that it's not. So, you really can't read much into their current pricing. That's all I'm saying.
mode_13h - Sunday, December 20, 2020 - link
Finally, I guess you missed this part, in the discussion of SPECjbb:
> One thing that did come to mind immediately when I saw the results was SMT.
> Due to this being a transactional data-plane resident type of workload,
> SMT will undoubtedly help a lot in terms of performance,
> so I tested out the EPYC chip figures with SMT disabled,
> and indeed max-jOPS went down to 209.5k for the 2S THP enabled results,
> meaning that SMT accounts for a 29.7% performance benefit in this benchmark.
...
> It’s generally these kinds of workloads that SMT works best on,
> and that’s why IBM can deploy SMT4 or SMT8 processors,
> and the type of workloads Marvell’s ThunderX was trying to carve a niche for itself with SMT4.
mode_13h - Sunday, December 20, 2020 - link
As the article mentions, Marvell’s ThunderX did support SMT on ARMv8-A.
Were SMT's reputation not bruised by all the recent side-channel exploits, perhaps it would be showing up in some of ARM's own cores. Maybe their V-series will get it, since that's a much larger core.
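The SMT figures quoted from the article just above can be sanity-checked with a little arithmetic. This is only a sketch: the function names are made up for illustration, and the SMT-on score is back-computed from the quoted 209.5k and 29.7%, not taken from the article.

```python
def smt_uplift(score_smt_on: float, score_smt_off: float) -> float:
    """Fractional gain from enabling SMT, relative to the SMT-off score."""
    return score_smt_on / score_smt_off - 1.0


def implied_smt_on(score_smt_off: float, uplift: float) -> float:
    """SMT-on score implied by an SMT-off score and a fractional uplift."""
    return score_smt_off * (1.0 + uplift)


# Figures quoted from the article: 2S EPYC, THP enabled, SMT disabled.
smt_off_max_jops = 209_500.0
quoted_uplift = 0.297

# Derived value (an assumption of this sketch, not a number from the article).
smt_on_max_jops = implied_smt_on(smt_off_max_jops, quoted_uplift)
print(f"implied SMT-on max-jOPS: {smt_on_max_jops / 1000:.1f}k")
```

Note the uplift is measured against the SMT-off score, so going the other way (turning SMT off) costs less than 29.7% of the SMT-on figure.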
Wilco1 - Monday, December 21, 2020 - link
ThunderX2/X3 and Neoverse E1 have SMT, but neither has been hugely successful. SMT doesn't provide a significant benefit across a wide range of workloads, so adding another core remains simpler and cheaper. And yes, security is another nail in the coffin.
EthiaW - Saturday, December 19, 2020 - link
The performance of Graviton2 meets our expectation for Neoverse N1 (or Cortex A76) better. How can Q80 manage to deliver so much higher IPC with the same architecture? Incredible.
Brutalizer - Saturday, December 19, 2020 - link
One old Oracle SPARC T8 cpu does 153.500 Java max-JOPS SPECjbb2015. And the crit-JOPS value is 90.000. Easily smashing all cpus here.
https://blogs.oracle.com/bestperf/specjbb2015:-spa...
satai - Saturday, December 19, 2020 - link
Benchmarked by Oracle... Definitely trustworthy.
zepi - Saturday, December 19, 2020 - link
SPECJBB graphs kill me.
For the love of god, please keep the axis scaling identical!
Same applies to every single metric always. If you provide separate graphs for different products, please make sure that axis-scaling is the same in all images!
Andrei Frumusanu - Sunday, December 20, 2020 - link
The graphs are generated by the benchmark itself.
tygrus - Saturday, December 19, 2020 - link
Next step would be to see how ARM performed with 256MB (or more) of cache. The early models didn't suit many workloads compared to the general purpose kings (x86-64). Each generation of ARM based server chips has seen an increase in the number of suitable workloads, and has thus become more general purpose. Adding more specialised instructions to x86 has diminishing returns that increase the complexity of decoding & execution. It's always good to be challenged: "can we make it simpler & faster?"
mode_13h - Sunday, December 20, 2020 - link
ARM has been doing some of the same, though. Look at the evolution of ARMv8-A as it has aged, and you'll see several bolt-ons to target additional markets:
* new atomics
* signed, saturating multiplies
* CRC instructions
* half-precision floating point
* SVE (their answer to AVX/AVX-512)
* complex number support
* an instruction specifically for floating-point conversion in Javascript
* integer dot products
* random number generation
* matrix multiply & manipulation
* BFloat16 support
* a smattering of other virtualization and security-oriented additions
That's not a small list, and definitely not a less-is-more approach.
Leeea - Sunday, December 20, 2020 - link
Very interesting.
Great article.
Sivar - Sunday, December 20, 2020 - link
Typo report, conclusion: "The se*ver landscape is changing very quickly."
Sivar - Sunday, December 20, 2020 - link
Imagine the response if someone made a prediction of this statement just five years ago:
"Intel’s current Xeon offering simply isn’t competitive in any way or form at this moment in time. Cascade Lake is twice as slow and half as efficient – so unless Intel is giving away the chips at a fraction of a price, they really make no sense."
mode_13h - Sunday, December 20, 2020 - link
Heh, good call!
Makste - Monday, December 21, 2020 - link
Truly exciting times. Thanks for the review.
There's going to be a massive restructuring in "computeverse".
I hope there'll be a merger at one point.
hyc - Monday, December 21, 2020 - link
It would've been more appropriate to compare Q64-33 to AMD to assess the merits of each architectural design. Could you repeat some of these tests, limiting the Altra to only 64 cores/threads?
Wilco1 - Tuesday, December 22, 2020 - link
It might be an interesting comparison with SMT disabled since the extra 25% of cores are the alternative to adding SMT. However should we also limit EPYC to 32MB L3 to make things more equal?
hyc - Friday, December 25, 2020 - link
Are you saying that turning off 16 cores in the Q80-33 is not the equivalent of running a Q64-33?
Turning off SMT may have some merits, depending on workload.
Andrei Frumusanu - Sunday, January 3, 2021 - link
I actually did test those setups:
2S Q64-33 setup: 433.7 SPECint2017 - 341.7 SPECfp2017
Pyxar - Wednesday, December 23, 2020 - link
This is cool and all but what could an arm processor server be used for? Aside from some linux compilations there isn't that much support for arm processors.
This is looking a lot like the early days of computing where BeBox was powered by PowerPC cpu, a system looking to find a niche that doesn't exist (yet). It was cool for its time but very little demand for it.
mode_13h - Wednesday, December 23, 2020 - link
Pretty much everything needed to run modern web services has been ported to ARM. I believe a number of high-profile sites are now using Amazon Graviton2, due to its pricing advantage.
As for client computing, the Pi has emerged as a basic, usable desktop platform.
mode_13h - Thursday, December 24, 2020 - link
Also, a couple years ago, Nvidia announced support for their entire software stack on ARM host CPUs. So, I'd say it should also be a decent platform for machine learning, now.
Bytales - Wednesday, December 23, 2020 - link
You should have tested some CPU mining benchmarks like Monero's randomX, Wownero's randomwow, Turtle's argon2id Chukwa v2, and DERO's AstroBWT. That would have been interesting as shieeet.
Miguelxataka - Sunday, January 3, 2021 - link
7742 (15 months ago release) 225w 67.000 point passmark??? why?? 7702 200w 71.000 points passmark. With 7702 the comparison!!
7702 (200w) Zen 2 vs Ampere (250w). 20% less power consumption + with Zen 3 20% extra power consumption =36%?
Where is the power efficiency of the RISC/ARM processors here?
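To make the efficiency argument above concrete, here is a minimal sketch that turns the two Passmark/TDP pairs from the comment into points per watt. The numbers are the commenter's own and are not verified here; `chips` and `points_per_watt` are illustrative names, and TDP is a nominal rating, not measured power draw.

```python
# Figures as given in the comment (unverified): Passmark points and TDP in watts.
chips = {
    "EPYC 7742": {"passmark": 67_000, "tdp_w": 225},
    "EPYC 7702": {"passmark": 71_000, "tdp_w": 200},
}


def points_per_watt(passmark: float, tdp_w: float) -> float:
    """Benchmark points per watt of rated TDP (a crude efficiency proxy)."""
    return passmark / tdp_w


efficiency = {name: points_per_watt(**spec) for name, spec in chips.items()}

for name, ppw in efficiency.items():
    print(f"{name}: {ppw:.1f} points/W")
```

The same helper could be applied to the Altra once a comparable score and a measured power figure are available; the comment itself supplies only a 250 W number for it.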
Rance - Friday, July 16, 2021 - link
For the specjbb 2015 data, what was the baseline pagesize set to?
4KB/64KB/?
Thanks!
sunwins - Tuesday, October 12, 2021 - link
I guess the speccpu2017 test result is wrong.
Maybe there were some problems such as bios version or speccpu software configuration.
Reference link:
https://www.spec.org/cpu2017/results/rfp2017.html
https://www.spec.org/cpu2017/results/rint2017.html
AMD 7742 speccpu2017 intrate official value: 353 per socket
AMD 7742 speccpu2017 fprate official value: 270 per socket
INTEL 8280 speccpu2017 intrate official value: 172 per socket
INTEL 8280 speccpu2017 fprate official value: 141 per socket
Ampere Q80-33 official speccpu2017 intrate value: 300 per socket
If you have any questions or updates, please contact 799517515@qq.com.
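The per-socket integer-rate figures listed in this comment can be put on a common scale with a few lines of arithmetic. This is only a sketch using the commenter's numbers as given (not checked against the SPEC result pages); `relative_to` is a hypothetical helper name.

```python
# Per-socket SPECrate2017 integer values as listed in the comment (unverified).
int_rate_per_socket = {
    "AMD 7742": 353,
    "Intel 8280": 172,
    "Ampere Q80-33": 300,
}


def relative_to(baseline: str, scores: dict) -> dict:
    """Express every score as a multiple of the chosen baseline's score."""
    base = scores[baseline]
    return {name: value / base for name, value in scores.items()}


ratios = relative_to("AMD 7742", int_rate_per_socket)

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x of AMD 7742")
```

On these figures the Q80-33 lands at roughly 0.85x of the 7742 per socket, which is the gap the commenter is pointing at.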