The Apple A13 SoC: Lightning & Thunder

Apple’s A13 SoC is the newest iteration in the company’s silicon design efforts. The new silicon piece is manufactured on what Apple calls a “second generation 7nm manufacturing process”. The wording is a bit ambiguous, but it’s been repeatedly pointed out that this means TSMC’s N7P node, a performance-tuned variant of last year’s N7 node, and not the N7+ node, which is based on EUV production.

Update October 27th: TechInsights has now officially released a die shot of the new Apple A13, and we can confirm a few assumptions on our side.

The new die is 98.48mm² which is 18.3% larger than the A12 of last year. Given that this year’s manufacturing node hasn’t seen any major changes in terms of process density, it’s natural for the die size to increase a bit as Apple adds more functionality to the SoC.


AnandTech modified TechInsights Apple A13 Die Shot

Die Block Comparison (mm²)

SoC (Process Node)            Apple A13 (TSMC N7P)    Apple A12 (TSMC N7)
Total Die                     98.48                   83.27
Big Core                      2.61                    2.07
Small Core                    0.58                    0.43
CPU Complex (Cores & L2)      13.47 (9.06 + 4.41)     11.16 (8.06 + 3.10)
GPU Total                     15.28                   14.88
GPU Core                      3.25                    3.23
NPU                           4.64                    5.79
SLC Slice (SRAM+Tag Logic)    2.09                    1.23
SLC SRAM (All 4 Slices)       6.36                    3.20

When breaking down the block sizes of the different IP on the SoC, we can see some notable changes: The big new Lightning cores have increased in size by ~26% compared to last year, a large increase, as we expect the new cores to have new functional units. The small Thunder cores have also increased in size by a massive 34% compared to last year’s Tempest cores, pointing to the large microarchitecture changes we'll discuss in a bit.

The L2 on the big cores looks relatively similar to that of the A12, pointing to a maintained 8MB size. What’s interesting is that the L2 of the small cores has now seen significant changes, and the two slices that this cluster now embeds look nearly identical to the slices of the large cores' L2. It’s thus very likely that we’re looking at an increased 4MB of total L2 for the small Thunder cores.

The GPU footprint has increased by a more marginal 3.8%. The biggest change seems to have been a rearrangement of the ALU blocks and texture unit layout of the GPU back-end, as the front-end blocks of the new IP look largely similar to those of the A12.

The NPU has seen a large reduction in size and is now 20% smaller than that of the A12. As the A12’s NPU was Apple’s first in-house IP it seems natural for the company to quickly iterate and optimise on the second-generation design. It’s still a notably large block coming in at 4.64mm².

By far the biggest change on the SoC level has been the new system level cache (SLC). Already last year Apple had made huge changes to this block as it had adopted a new microarchitecture and increased the size from 4MB to 8MB. This year, Apple is doubling down on the SLC and it’s very evidently using a new 16MB configuration across the four slices. A single SLC slice without the central arbitration block increases by 69% - and the actual SRAM macros seen on the die shot essentially double from a total of 3.20mm² to 6.36mm².

The amount of SRAM that Apple puts on the A13 is staggering, especially on the CPU side: We’re seeing 8MB on the big cores, 4MB on the small cores, and 16MB on the SLC which can serve all IP blocks on the chip.
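The block-size changes discussed above can be sanity-checked directly from the die block comparison table. The following is a minimal Python sketch; the names `AREAS_MM2`, `percent_change` and `area_changes` are our own and purely illustrative. Since the table values are rounded to two decimals, the results need not match the percentages quoted in the text exactly.

```python
# Block areas in mm², copied from the die block comparison table.
# Each entry is (Apple A13, Apple A12).
AREAS_MM2 = {
    "Total Die": (98.48, 83.27),
    "Big Core": (2.61, 2.07),
    "Small Core": (0.58, 0.43),
    "GPU Total": (15.28, 14.88),
    "NPU": (4.64, 5.79),
    "SLC Slice": (2.09, 1.23),
    "SLC SRAM": (6.36, 3.20),
}


def percent_change(new: float, old: float) -> float:
    """Relative change from old to new, in percent."""
    return (new / old - 1.0) * 100.0


def area_changes(areas=AREAS_MM2) -> dict:
    """Percent area change per block, A12 -> A13."""
    return {name: percent_change(a13, a12) for name, (a13, a12) in areas.items()}


if __name__ == "__main__":
    for name, change in area_changes().items():
        print(f"{name:<11} {change:+6.1f}%")
```

A negative value means the block shrank, which is what we see for the NPU.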

CPU Frequencies

The CPU complex remains a 2+4 architecture, supporting two large performance cores and four smaller efficiency cores. In terms of the frequencies of the various cores, we can unveil the following behavior changes to the A13:

Maximum Frequency vs Loaded Threads (Per-Core Maximum MHz)

Apple A12        1      2      3      4      5      6
Performance 1    2514   2380   2380   2380   2380   2380
Performance 2    -      2380   2380   2380   2380   2380
Efficiency 1     -      -      1587   1562   1562   1538
Efficiency 2     -      -      -      1562   1562   1538
Efficiency 3     -      -      -      -      1562   1538
Efficiency 4     -      -      -      -      -      1538

Apple A13        1      2      3      4      5      6
Performance 1    2666   2590   2590   2590   2590   2590
Performance 2    -      2590   2590   2590   2590   2590
Efficiency 1     -      -      1728   1728   1728   1728
Efficiency 2     -      -      -      1728   1728   1728
Efficiency 3     -      -      -      -      1728   1728
Efficiency 4     -      -      -      -      -      1728

The large performance cores this year see a roughly 6% increase in clock speeds, bringing them up to around 2666MHz. Last year we estimated the A12 large cores to clock in at around 2500MHz, but the more exact figure as measured by performance counters seems to be 2514MHz. Similarly, the A13’s big core clock should be a few MHz above our estimated 2666MHz clock. Apple continues to quickly ramp down in frequency depending on how many large cores are active: with both large cores loaded, each maxes out at 2590MHz, even on the lightest threads. I also noted that frequency will quickly ramp up and down depending on instruction mix and the load complexity on the core.

The small efficiency cores have seen a larger 8.8 – 12.3% clock boost, bringing them up to ~1728MHz. This is a good boost, but what’s also important is that the small cores no longer clock down when more of them are active.
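To make the clock comparison easier to reproduce, here is a small Python sketch that derives the A12-to-A13 uplift per loaded-thread count from the frequency table. The structure `MAX_MHZ` and the function `clock_uplift` are illustrative names of our own. Each cluster is represented by a single list, on the assumption (taken from the table) that all active cores of a cluster share the same maximum clock.

```python
# Per-core maximum MHz indexed by number of loaded threads (1..6),
# copied from the frequency table. None marks "cluster not yet active".
MAX_MHZ = {
    "A12": {
        "performance": [2514, 2380, 2380, 2380, 2380, 2380],
        "efficiency": [None, None, 1587, 1562, 1562, 1538],
    },
    "A13": {
        "performance": [2666, 2590, 2590, 2590, 2590, 2590],
        "efficiency": [None, None, 1728, 1728, 1728, 1728],
    },
}


def clock_uplift(cluster: str) -> dict:
    """Percent clock gain of the A13 over the A12, keyed by loaded threads."""
    gains = {}
    pairs = zip(MAX_MHZ["A12"][cluster], MAX_MHZ["A13"][cluster])
    for threads, (old, new) in enumerate(pairs, start=1):
        if old is None or new is None:
            continue  # cluster idle at this thread count
        gains[threads] = (new / old - 1.0) * 100.0
    return gains


if __name__ == "__main__":
    for cluster in ("performance", "efficiency"):
        for threads, gain in clock_uplift(cluster).items():
            print(f"{cluster:<12} {threads} threads: {gain:+.1f}%")
```

The efficiency-core uplift grows with the thread count because the A12 clocked its small cores down as more of them became active, while the A13 holds them at the same frequency.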

The Lightning Performance CPU Cores: Minor Upgrades, Mystery of AMX

The large cores for this generation are called “Lightning” and are direct successors to last year’s Vortex microarchitecture. In terms of the core design, at least in regards to the usual execution units, we don’t see too much divergence from last year’s core. The microarchitecture at its heart is still a 7-wide decode front-end, paired with a very wide execution back-end that features 6 ALUs and three FP/vector pipelines.

Apple hasn’t made any substantial changes to the execution back-end, as both Lightning and Vortex are largely similar to each other. The notable exception to this is the complex integer pipelines, where we do see improvements. Here the two multiplier units are able to shave off one cycle of latency, dropping from 4 cycles to 3. Integer division has also seen a large upgrade as the throughput has now been doubled and latency/minimum number of cycles has been reduced from 8 to 7 cycles.

Another change in the integer units has been a 50% increase in the number of ALU units which can set condition flags; now 3 of the ALUs can do this, which is up from 2 in A12's Vortex.

As for the floating point and vector/SIMD pipelines, we haven't noticed any changes there.

In terms of caches, Apple seems to have kept the cache structures as they were in the Vortex cores of the A12. This means we have 8-way associative 128KB L1 instruction and data caches. The data cache remains very fast with a 3-cycle load-to-use latency. The shared L2 cache between the cores continues to be 8MB in size, however Apple has reduced the latency from 16 to 14 cycles, something we’ll be looking at in more detail on the next page when looking at the memory subsystem changes.
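As cycle counts only tell half the story when clocks change between generations, the latencies above can be translated into wall-clock time. This is a rough Python sketch under one loud assumption: that the quoted latencies are counted in big-core cycles at the single-thread peak clocks from the frequency table (2514MHz for the A12, 2666MHz for the A13). The helper `cycles_to_ns` and the constant names are our own.

```python
def cycles_to_ns(cycles: float, freq_mhz: float) -> float:
    """Convert a latency in core cycles to nanoseconds at a given clock."""
    return cycles * 1000.0 / freq_mhz


# Assumption: latencies are in big-core cycles at the single-thread
# peak clocks listed in the frequency table.
A12_PEAK_MHZ = 2514
A13_PEAK_MHZ = 2666

# Shared L2: 16 cycles on the A12, 14 cycles on the A13.
L2_LATENCY_NS = {
    "A12": cycles_to_ns(16, A12_PEAK_MHZ),
    "A13": cycles_to_ns(14, A13_PEAK_MHZ),
}

# L1 data cache: 3-cycle load-to-use latency.
L1D_LATENCY_NS_A13 = cycles_to_ns(3, A13_PEAK_MHZ)

if __name__ == "__main__":
    for soc, ns in L2_LATENCY_NS.items():
        print(f"{soc} L2 latency: {ns:.2f} ns")
    print(f"A13 L1D load-to-use: {L1D_LATENCY_NS_A13:.2f} ns")
```

Under that assumption the two-cycle reduction and the higher clock compound, so the L2 latency in nanoseconds drops by more than the cycle count alone would suggest.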

A big change to the CPU cores which we don’t have very much information on is Apple’s integration of “machine learning accelerators” into the microarchitecture. At heart these seem to be matrix-multiply units with DSP-like instructions, and Apple puts their performance at up to 1 trillion operations per second (1 TOPS), claiming an up-to 6x increase over the regular vector pipelines. This AMX instruction set is seemingly a superset of the ARM ISA that is running on the CPU cores.

There’s been a lot of confusion as to what this means, as until now it hadn’t been widely known that Arm architecture licensees were allowed to extend their ISA with custom instructions. We weren’t able to get any confirmation from either Apple or Arm on the matter, but one thing that is clear is that Apple isn’t publicly exposing these new instructions to developers, and they’re not included in Apple’s public compilers. We do know, however, that Apple internally does have compilers available for it, and libraries such as Accelerate.framework seem to be able to take advantage of AMX. Unfortunately, I haven't had the time or experience to investigate this further for this article.

Arm’s recent reveal of making custom instructions available for vendors to implement and integrate into Arm’s cores certainly seems evidence enough that architecture licensees would be free to do what they’d like – Apple’s choice of hiding away AMX instructions at least resolves the concern about possible ISA fragmentation on the software side.


Apple's iPhone 11 Pro Max Motherboard with the A13 SoC (Image Courtesy iFixit)

The Thunder Efficiency CPU Cores: Major Upgrades

Apple’s small efficiency cores are extremely interesting because they’re not all that small when compared to the typical little cores from Arm, such as the Cortex-A55. Last year’s Tempest efficiency cores in the A12 were based on a 3-wide out-of-order microarchitecture with two main execution pipelines, working alongside L/S units and what we assume is a dedicated division unit.

This year’s Thunder microarchitecture seems to have made major changes to the efficiency CPU core, as we’re seeing substantial upgrades in the execution capabilities of the new cores. In terms of the integer ALUs we’re seemingly still looking at two units here, however Apple has doubled the number of units capable of flag set operations from 1 to 2. MUL throughput remains at 1 instruction per cycle, while the division unit is also seemingly unchanged.

What’s actually more impressive is that the floating point and vector pipelines were essentially doubled: FP addition throughput has gone from 1 to 2 instructions per cycle, while the latency has been reduced from 4 to 3 cycles. This is mirrored by vector addition capabilities, with a throughput of 2 and a latency of 2. This doubling of throughput extends to almost all instructions executed in the FP/SIMD pipelines, with the exceptions being some operations such as multiplications and division.

The FP division unit has seen a massive overhaul, as it’s seemingly now a totally new unit that’s optimized for 64-bit operations, no longer halving its throughput when operating on double-precision numbers. DP latencies have been reduced from 19 to 10 cycles, while SP latency has gone down from 12 to 9 cycles. Vector DP division operations have even seen silly improvements such as a 4x increase in throughput and 1/3rd the latency.
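The scalar FP latency figures quoted above can be put side by side, both as cycle ratios and as approximate wall-clock latencies. This is a hedged Python sketch: the clocks used (1587MHz for Tempest, 1728MHz for Thunder) are the peak efficiency-core frequencies from the frequency table, and applying them to these latencies is our assumption. `LATENCY_CYCLES` and `latency_summary` are illustrative names.

```python
# (Tempest, Thunder) latencies in cycles, as quoted in the text.
LATENCY_CYCLES = {
    "FP add": (4, 3),
    "FP divide, single precision": (12, 9),
    "FP divide, double precision": (19, 10),
}

# Assumption: peak efficiency-core clocks from the frequency table.
TEMPEST_MHZ = 1587
THUNDER_MHZ = 1728


def to_ns(cycles: float, freq_mhz: float) -> float:
    """Convert cycles to nanoseconds at the given clock."""
    return cycles * 1000.0 / freq_mhz


def latency_summary() -> dict:
    """Cycle ratio and per-generation latency in ns for each operation."""
    summary = {}
    for op, (old, new) in LATENCY_CYCLES.items():
        summary[op] = {
            "cycle_ratio": old / new,
            "tempest_ns": to_ns(old, TEMPEST_MHZ),
            "thunder_ns": to_ns(new, THUNDER_MHZ),
        }
    return summary


if __name__ == "__main__":
    for op, row in latency_summary().items():
        print(f"{op:<28} {row['cycle_ratio']:.2f}x fewer cycles, "
              f"{row['tempest_ns']:.2f} ns -> {row['thunder_ns']:.2f} ns")
```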

The Thunder cores are now served by a 48KB L1 data cache, which is an increase over the 32KB we’ve seen in previous generations of Apple’s efficiency cores. We haven’t been able to confirm the size of the L1 instruction cache. There also seem to have been changes to the L2 cache of the efficiency cores, which we'll discuss on the following page.

Looking at the performance of the new A13 Thunder cores, we’re seeing that the new microarchitecture has increased its IPC significantly, with gains ranging from 19% in 403.gcc to 38% in 400.perlbench in SPECint, while floating point performance has also improved by an equally impressive 34-38% in non-memory bound SPECfp workloads.

In other areas we're seeing some performance regressions, and this is because Apple has changed the DVFS policies of the memory subsystem, leading to the efficiency cores being unable to trigger some of the memory controller's higher frequency performance states. This results in some of the odd results we are seeing, such as 470.lbm.

This causes a bit of an issue for our dedicated measurements of the cores in isolation: given a more realistic workload such as a 3D game, where the GPU would have the memory run at a faster speed, the performance of the Thunder cores should be higher than what we see showcased here. I’ll attempt to measure the peak performance of the cores when they’re not limited by memory in a future update, as I think it should be very interesting.

The power efficiency of the new cores is also significantly better. Granted, some of these improvements will be due to the system memory not running as fast, but given that the cores still deliver 10-23% higher average performance in the SPEC suites, it’s still massively impressive that energy consumption has gone down by 25% on average as well – pointing to major efficiency gains.
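The relationship between the performance and energy figures above can be worked through with a few lines of Python. This is a sketch under stated assumptions: that the 25% figure refers to the total energy needed to complete the same fixed workload, and that runtime scales inversely with the 10-23% performance gain. The function names are our own.

```python
# Figures quoted in the text, expressed as ratios of Thunder over Tempest.
ENERGY_RATIO = 0.75        # energy consumption down by 25% on average
PERF_GAINS = (1.10, 1.23)  # 10-23% higher average performance


def relative_average_power(perf_gain: float, energy_ratio: float) -> float:
    """Average power = energy / time, and time scales with 1 / performance."""
    return energy_ratio * perf_gain


def relative_work_per_joule(energy_ratio: float) -> float:
    """Same workload with less energy means more work done per joule."""
    return 1.0 / energy_ratio


if __name__ == "__main__":
    for gain in PERF_GAINS:
        power = relative_average_power(gain, ENERGY_RATIO)
        print(f"perf x{gain:.2f}: average power x{power:.3f}")
    print(f"work per joule x{relative_work_per_joule(ENERGY_RATIO):.2f}")
```

Under these assumptions the cores finish sooner while also drawing less average power, which is why the energy reduction is larger than what the shorter runtime alone would explain.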

In the face of the relatively conservative changes of the Lightning cores (other than AMX), the new Thunder cores seem like an outright massive change for the A13 and a major divergence from Apple’s past efficiency core microarchitectures. In the face-off against a Cortex-A55 implementation such as on the Snapdragon 855, the new Thunder cores represent a 2.5-3x performance lead while at the same time using less than half the energy.


242 Comments


  • BradleyTwo - Monday, October 21, 2019

    My apologies if this has been disclosed already, but would it be possible to ask if Apple supplied these phones for testing?

    The reason I ask is that there is quite a long thread over at a popular Mac rumors forum where a number of us are concerned at the variable screen quality on the iPhone 11 Pro and Pro Max.

    Many of us, myself included, have received a suboptimal screen, in that it was a dim, murky yellow display (the less polite of us have called them p-stained), while others have received screens which are not uniformly lit.

    We have generally exchanged them to receive marginally better units, a few of which have been perfect, but a disappointing majority of the exchanges are often still below the apparently impressive characteristics of the displays discussed in the review.

    As this is not mentioned in the various iPhone 11 Pro reviews, a number of us have formed suspicions that Apple has cherry picked the best screens to supply to reviewers.

    A clarification whether Apple did indeed supply the units, or if they were bought off the shelf, would be much appreciated.
  • techsorz - Monday, October 21, 2019

    Apple calibrates each device, this is what XDR essentially is. However, this should create better uniformity across displays rather than make them as different as you say.

    Dim, murky yellow is probably caused by you not disabling true-tone and auto-brightness. Otherwise you have a very faulty unit, as this display should be bright enough to nearly burn out your retina. (Exaggeration)

    Not uniformly lit could be an error, just return it in this case. Clearly Apple wouldn't supply faulty hardware to anyone on purpose, not testers or consumers.
  • Andrei Frumusanu - Monday, October 21, 2019

    These are Apple review samples, but in our experience and testing they don't differ from commercial models.
  • BradleyTwo - Monday, October 21, 2019

    Thank you for the clarification. It would of course be negligent for Apple PR not to ensure reviewers receive fully tested units.

    I can assure you, however, that when it comes to the screen, the number of less than optimal units being sold at retail is probably higher than you might think. While these are most likely all within manufacturing tolerances for QC purposes, some of them I highly doubt Apple would send to reviewers.

    Oh well, at least the 14 day return period provides the opportunity to exchange. The "screen lottery" we call it.
  • Andrei Frumusanu - Tuesday, October 22, 2019

    Apple would have to be very misleading in providing fully sealed units. It's possible that some retail units perform worse but over the years we've never really encountered such a unit.
  • s.yu - Tuesday, October 22, 2019

    "It would of course be negligent for Apple PR not to ensure reviewers receive fully tested units."
    lol! Samsung Fold.
  • joms_us - Tuesday, October 22, 2019

    It is pity though the so-called fastest SoC is not even close to these Android phones which are typically half the speed of fastest desktops. How do you expect people to believe A13 is faster than i-9900K or Ryzen 3950X? Where GeekBiased and jurassic SP2006? LOL

    https://youtu.be/ay9V5Ec8eiY?t=514

    https://youtu.be/DtSgdrKztGk?t=423

    https://youtu.be/PkVW5eSXKfw?t=115

    I'd say cut the crap and show us real-world results not cherry-picked worthless numbers from benchmarking tools.
  • Quantumz0d - Tuesday, October 22, 2019

    The fanboys man they are so blinded by reality, Apple was able to set a perfect world utopian dream for them. Can't fix stupid.

    I used to run Sultanxda kernel on my OP3 with SD820 processor the SD821 had higher clock speed over 820 but guess what OP screwed it up or Qualcomm didn't provide fix there was Clockspeed crashing at high freq so he disabled it entirely on both big and small. Guess what ? Benches took a massive hit. But UX ? Nope. Infact it improved a lot how is that possible ? I guess Spec and GB only matters right.

    Pixel 3 lagged badly due to the RAM issue no one mentions all say its beautiful wonderful amazing. Guess what ? 1080P 60FPS doesn't exist as an option and its auto as Google deems. 4XL no 4K60FPS because less storage. No press mentions.

    Coming to this garbage phone. iOS 13 whatever. Same icons, same springboard since 1.1.4 (I used it and JBed it) no desktop no customization to OS. All iPhones on the planet look same just like the brainwash here of ridiculous comparison of a GB (bullshit bench) and Spec score. Masterpiece of corporate koolaid.

    Why don't they mention how the Audio format which records is not in Lossless but in AAC crap unlike my V30 does with the 192KHz 24Bit option in FLAC with Limiter and Gain switch or the Video mode which had full manual Pro controls or even the camera having any Manual options. All ASUS, Samsung, LG, Sony, OnePlus, Huawei offer Pro camera forget Pro Video which only LG and Sony do. But No one cares, simpletons only care about A series marketing BS.

    The worst of all no Filesystem. $1000 device which doesn't even have a Filesystem usable by end user or has an option to install the apps off the AppStore. Nor any SD expansion slot to be prepared for emergency. But people are riling up and getting worked over the ARM masterrace LMAO with BGA MacBook Pro with 1 USB C port. Bonus is, to develop iOS app you must pay $99 yearly fee AND own that BGA Soldered KB/Touchpad/Battery/SSD macbook because XCode !!

    The abomination design. Display mutilation for 3 years while heralding best colors best display LOL. Very funny.

    And no 3.5mm jack. Because Apple wanted to make $5bn off revenue from AirPods (Higher than AMDs entire profit) guess what ? Less than 320kbps data rate LMAO. My LG V30 absolutely destroys this phone to oblivion with its ESS9218P DAC processor only found in Top motherboards from ASUS/GB/MSI. Even Vivo Nex decimates this garbage audio iPhone.

    Very very funny how even Qualcomm who spent billions of dollars in R&D for their Centriq ARM server processor by even relegating the teams which worked on post 810, full custom Kryo 820 series and dumping all it out because of the Broadcom M&A (Major beneficiary was Apple due to the Hock Tan connection with Apple, he would sell out all LTE patents) impact and no profit in the ARM server market, forget logistics, capex, ROI, x86 emulation AND 64Bit x86 Emulation legacy code with a massive scale of Linux community around.

    But we want ARM A series BGA processor which has world class Spec and GB score and beats Mainstream and HEDT LGA processors.

    Claps !
  • Anand2019 - Tuesday, October 22, 2019

    Why are you so angry?
  • Quantumz0d - Tuesday, October 22, 2019

    Fed up of the unending talk of x86 vs ARM is one hell even AT forums cpu and oc subforum. Whole thread dedicated to worship this talk.

    Two Apple ruined smartphones by this policy of removing jack and features while raising the price to moon, Other companies also want greed by forcing people to buy BT earphones which sound garbage, horrid longevity (Need to charge everyday) pushing people to buy trash (Beats) thus making whole market saturated with Apple agenda. Look at Google Pixel 4 they also removed offering Dongle, Samsung, OnePlus. Same thing like Apple very greedy.

    Three destroyed the laptops with thin and light obsession. And soldered junk with less and less I/O.

    Finally 4th - this corporation is built on American values but is a stooge to cash from China thus enabling more totalitarianship while claiming Liberty on US land. Spineless position.
