Memory Subsystem & Latency

Usually, the first concern of an SoC design is its data fabric: the IP blocks need access to the system’s caches and DRAM within good latency metrics, as latency, especially on the CPU side, directly affects end-result performance under many workloads.

The Google Tensor is both similar to and different from the Exynos chips in this regard. Google does, however, fundamentally change how the internal fabric of the chip is set up in terms of its various buses and interconnects, so we do expect some differences.


First off, we have to mention that many of the latency patterns here are still quite broken due to the new Arm temporal prefetchers that were introduced with the Cortex-X1 and A78 series CPUs; please just pay attention to the orange “Full Random RT” curve, which bypasses these.
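To illustrate why a fully random pattern sidesteps prefetching, here is a minimal Python sketch of a pointer-chasing access pattern. It is not the test tool used for the graphs; the function names `build_random_cycle` and `chase` are made up for this example, and Python timings would not reflect hardware latency. The idea is only that each load address depends on the previous load and the order is a random single cycle, so there is no stride or repeating short pattern to predict.

```python
import random

def build_random_cycle(n, seed=0):
    """Return a list nxt where nxt[i] is the slot visited after slot i.

    Uses Sattolo's algorithm, which yields a permutation made of exactly
    one cycle, so a chase started anywhere visits all n slots.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def chase(nxt, steps, start=0):
    """Follow the chain for `steps` dependent loads and return the final slot."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]  # the next address is only known once this load completes
    return idx

if __name__ == "__main__":
    chain = build_random_cycle(1 << 16)
    # After exactly n steps a single-cycle chain is back where it started.
    print(chase(chain, len(chain)))
```

A real latency test would size the chain to the cache level of interest and divide the elapsed time by the number of steps.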

There are a couple of things to see here. Let’s start on the CPU side, where we see the X1 cores of the Tensor chip configured with 1MB of L2, in contrast with the smaller 512KB of the Exynos 2100 but in line with what we see on the Snapdragon 888.

The second thing to note is that the Tensor’s DRAM latency doesn’t look good, showing a considerable regression compared to the Exynos 2100, which in turn was already quite a bit worse off than the Snapdragon 888. While the measurements are correct in what they’re measuring, the picture is more complex because of the way Google operates the memory controllers on the Tensor. For the CPUs, Google scales the MC and DRAM speed based on the CPUs’ performance counters: the actual workload IPC as well as the memory stall percentage of the cores. This is different from the way Samsung runs things, which is based more on the transactional utilisation rate of the memory controllers. I’m not sure if the high memory latency figures of the CPUs are caused by this, or simply by a higher-latency fabric within the SoC, as I wasn’t able to confirm the runtime operational frequencies of the memory during the tests on this unrooted device. However, it’s a topic we’ll see brought up a few more times in the next few pages, especially in the CPU performance evaluation.
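As a rough illustration of how the two approaches look at different signals, here is a hypothetical Python sketch. Nothing in it comes from Google’s or Samsung’s actual governors: the frequency levels, the thresholds and the function names are all assumptions made up for this example. It only shows that a policy driven by CPU stall counters and a policy driven by memory-controller utilisation can pick different DRAM levels for the same workload.

```python
# Hypothetical DRAM frequency levels in MHz, lowest to highest (assumed values).
LEVELS = [800, 1600, 3200]

def stall_based_level(ipc, stall_fraction, levels=LEVELS,
                      stall_hi=0.30, ipc_lo=1.0):
    """CPU-counter-driven policy (assumed thresholds).

    Raises the DRAM level when the cores spend a large share of cycles
    stalled on memory while retiring few instructions per cycle.
    """
    if stall_fraction >= stall_hi and ipc <= ipc_lo:
        return levels[-1]
    if stall_fraction >= stall_hi / 2:
        return levels[len(levels) // 2]
    return levels[0]

def utilisation_based_level(mc_utilisation, levels=LEVELS):
    """Utilisation-driven policy: map controller utilisation (0..1) onto a level."""
    index = min(int(mc_utilisation * len(levels)), len(levels) - 1)
    return levels[index]

if __name__ == "__main__":
    # A latency-bound pointer chase: few transactions, but the core mostly waits.
    print(stall_based_level(ipc=0.3, stall_fraction=0.6))
    print(utilisation_based_level(mc_utilisation=0.05))
```

Which of the two behaves better for a given test depends on the real thresholds and sampling windows, which are not known here.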

The Cortex-A76 view of things looks more normal in terms of latencies, as things don’t get impacted by the temporal prefetchers. Still, the latencies here are significantly higher than on competitor SoCs, on all patterns.

What I found weird was that the L3 latencies of the Tensor SoC also look to be quite high, above those of the Exynos 2100 and Snapdragon 888 by a noticeable margin. One odd thing about the Tensor SoC is that Google didn’t give the DSU and the L3 cache of the CPU cluster a dedicated clock plane, instead tying it to the frequency of the Cortex-A55 cores. As a result, even if the X1 or A76 cores are under full load, the A55 cores as well as the L3 are still running at lower frequencies. The same scenario on the Exynos or Snapdragon chips would raise the frequency of the L3. This behaviour can be confirmed by running a dummy load on the Cortex-A55 cores in order to drive the L3 clock higher, which improves the figures on both the X1 and A76 cores.
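The effect of the shared clock plane can be sketched with simple arithmetic: a cache access that costs a fixed number of cycles takes longer in nanoseconds when the cache clock is lower. The cycle count and frequencies below are placeholder values for illustration, not measurements of the Tensor, and a real access also crosses other clock domains that this ignores.

```python
def latency_ns(cycles, freq_ghz):
    """Convert a cycle count at a given clock (GHz) into nanoseconds."""
    return cycles / freq_ghz

if __name__ == "__main__":
    l3_cycles = 40  # assumed L3 access cost in L3-clock cycles
    for freq_ghz in (1.0, 2.0):  # assumed low and high L3 clocks
        print(freq_ghz, "GHz:", latency_ns(l3_cycles, freq_ghz), "ns")
```

Under these assumptions, halving the L3 clock doubles the time spent in the L3 for the same access, which is consistent with the figures improving once a dummy A55 load raises the clock.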

The system level cache is visible in the latency hump starting at around 11-13MB (1MB L2 + 4MB L3 + 8MB SLC). I’m not showing it in the graphs here, but memory bandwidth on normal accesses on the Google chip is also lower than on the Exynos. I do think I see more fabric bandwidth when doing things such as modifying individual cache lines, which is one of the reasons I think the SLC architecture is different from what’s on the Exynos 2100.
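The 13MB upper bound is just the running total of the cache sizes quoted above. A small Python sketch makes the expected step points explicit; it assumes the capacities simply add up across levels, which is the same simplification the sum in the text makes, and the helper name is made up for this example.

```python
from itertools import accumulate

# Cache sizes in MB as quoted in the text for the X1 cores' view of the hierarchy.
LEVELS_MB = [("L2", 1), ("L3", 4), ("SLC", 8)]

def cumulative_capacity_mb(levels):
    """Return (name, running total in MB) for each cache level in order."""
    totals = accumulate(size for _, size in levels)
    return [(name, total) for (name, _), total in zip(levels, totals)]

if __name__ == "__main__":
    print(cumulative_capacity_mb(LEVELS_MB))
    # [('L2', 1), ('L3', 5), ('SLC', 13)]
```

A test depth beyond the last running total can no longer be served from any cache, so that is where the curve would be expected to climb towards DRAM latency.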

The A55 cores on the Google Tensor have 128KB of L2 cache. What’s interesting here is that, because the L3 is on the same clock plane as the Cortex-A55 cores and runs at the same higher frequencies, the Tensor’s A55s have the lowest L3 latencies of all the SoCs, as they do without an asynchronous clock bridge between the blocks. Like on the Exynos, there’s some sort of increase at 2MB, something we don’t see on the Snapdragon 888, which I think is related to how the L3 is implemented on the chips.

Overall, the Tensor SoC is quite different here in how it’s operated, and there are some key behaviours that we’ll have to keep in mind for the performance evaluation part.


108 Comments


  • Silver5urfer - Tuesday, November 2, 2021 - link

    You said you do Kernels but "As someone who was rather hopeful that google would take control and bring us android users a true apple chip equivalent some day, this is definitely not the case with google silicon."

    What is Android lacking from needing that so called A series processor onto the platform ? I already see Android modding has been drained a lot now. It's there on XDA but less than 1% of user base uses mods, maybe root but it's still niche.

    Android has been on a downhill since a long time. With Android v9 Pie to be specific. Google started to mimic iOS on superficial level with starting from OS level information density loss now on 12 it's insane, you get 4 QS toggles. It's worst. People love it somehow because new coat of trash paint is good.

    On HW side, except for OnePlus phones no phones have proper mod ecosystem. Pixels had but due to the crappy policies they implemented on the HW side like AB system, Read only filesystem copied from Huawei horrible fusing of filesystems and then enforcing all these at CTS level, they added the worst of all - Scoped Storage which ruined all the user use cases of having a pocket computer to a silly iOS like trash device. Now on Android any photo you download goes into that Application specific folder and you cannot change it, due to API level block on Playstore for targeting Android v11 which comes with Scoped Storage by default. Next year big thing is coming, all 32bit applications will be obsoleted because ARM is going to remove the 32bit IP from the Silicon designs. That makes 888 the last 32bit capable CPU.

    Again what do you expect ? Apple A series shines in these Anandtech SPEC scores but when It comes to real life Application work done performance, they do not show the same level of difference. Which is basically Application launch speed and performance of the said application now Android 12 adds a splash screen BS to all apps globally. Making it even worse.

    There's nothing that Google is going to provide you or anyone to have something that doesn't exist, Android needs freedom and that is being eroded away every year with more and more Apple inspired crap. The only reason Google did this is to experiment on those billions of dollars and millions for their R&D, Pixel division has been in loss since 2016, less than 3% North American marketshare. Only became 3 from 2 due to A series budget Pixels. And they do not even sell overseas on many markets. In fact they imitate Apple so much that now they want the stupid HW exclusive joke processors for their lineup imitating Apple for no reason. Qcomm provides all the blobs and baseband packages, If Google can make them deliver support for 6 years they can do it, but they won't because sales. All that no charger because environment, no 3.5mm jack because no space, no SD slot is all a big fat LIE.

    Their GS101 is a joke, a shame to CPU engineering, trash thermal design, useless A7x cores and the bloated X1 x2 cores for nothing, except for their ISP nothing is useful and even the Pixel camera can be ported to other phones, Magic Eraser for eg works on old Pixels, soon other phones due to Camera API2 and Modding.

    Google's vision of Android was dead since v9 and since the death of Nexus series. Now it's more of a former shell with trash people running for their agenda of yearly consumerism and a social media tool rather than the old era of computer in your pocket, to make it worse the PR of Pixel is horrible and more political screaming than anything else.
  • Zoolook - Saturday, November 6, 2021 - link

    Apple silicon shines in part due to being on a superior process, and a much better memory subsystem, Samsung process is far behind TSMC in regards to efficiency unfortunately.
  • Zoolook - Saturday, November 6, 2021 - link

    Small nitpick, A8X GPU was a PowerVR licence, A11 had the first Apple inhouse GPU.
  • iphonebestgamephone - Sunday, November 14, 2021 - link

    "cut power consumption in half WITHOUT increasing performance"

    Make a custom kernel and uc/uv it and there you go. Should be easy for a pro kernel dev like you.
  • tipoo - Tuesday, November 2, 2021 - link

    Thanks for this analysis, it's great.

    I'm still left wondering what the point of Tensor is after all this. It doesn't seem better than what was on market even for Android. I guess the extra security updates are nice but still not extra OS updates even though it's theirs. And the NPU doesn't seem to outperform either despite them talking about that the most.

    And boy do these charts just make A15 look even more above and beyond their efforts, but even A4 started with Cortex cores, maybe in 2-3 spins Google will go more custom.
  • Blastdoor - Tuesday, November 2, 2021 - link

    I wonder if we will now see a similar pattern play out in the laptop space, with Macs moving well beyond the competition in CPU and GPU performance/watt, and landing at similar marketshare (it would be a big deal for the Mac to achieve the same share of the laptop market that the iPhone has of the smartphone market).
  • tipoo - Tuesday, November 2, 2021 - link

    Well I'm definitely going to hold my Apple stocks for years and that's one part of the reason. M1 Pro and Max are absolute slam dunks on the industry, and their chipmaking was part of what won me over on their phones.
  • TheinsanegamerN - Tuesday, November 2, 2021 - link

    When did apple manage that? I can easily recall the M1 pulling notably more power then the 4700u in order to beat it in benchmarks despite having 5nm to play with. The M1X max pulls close to 100W at full tilt, and is completely unsustainable.
  • Spleter - Tuesday, November 2, 2021 - link

    I think you are confusing temperature in degrees and not the amount of watts.
  • Alistair - Wednesday, November 3, 2021 - link

    when it is drawing 100 watts it is competing against windows laptops that are drawing 200 watts, i'm not sure what the problem is
