Today's piece is a bit of an unusual review; NVIDIA's new Jetson AGX embedded system kit isn't really a device platform we expect the average reader to think about, much less buy. NVIDIA's shift over the last few years from consumer-grade Tegra chipsets to more specialised silicon applications is nowhere more evident than in the new Tegra Xavier which powers the Jetson AGX. While the board's capabilities certainly fall outside the use-cases of most consumers, it still represents a very interesting platform with a lot of functionality and silicon IP that we don't find in any other device to this day. So when NVIDIA reached out to offer us a sample, we decided to have a go at assembling a high-level overview of what the board and the new Xavier chip can do.

First of all, we have to describe what this actually is. The Jetson AGX is a full-fledged small form-factor computer / embedded system, with the whole unit measuring no more than 105x105mm. The AGX module itself is designed to be a complete commercial off-the-shelf (COTS) system for use in finished products, with NVIDIA aiming it at AI (read: neural networking) centric use-cases such as robotics and industrial automation. Jetson boards typically occupy the small-to-mid volume end of the market, showing up in one-off products and items with limited production runs, where it doesn't make sense for a manufacturer to develop and deploy their own custom hardware.

But of course the bare module is only half the story. You can't do development against a bare module, and this is where NVIDIA's complete Jetson AGX development kit comes in. The AGX dev kit comes with everything needed to run a single module, including a power supply, a heatsink, and most important of all, a breakout board. The breakout board offers various I/O headers and ports, ranging from your standard dual USB-C 3.1 ports, HDMI and Gigabit Ethernet, to more specialised connectivity such as MIPI CSI-2 connectors for cameras and a range of typical development board headers such as a 40-pin GPIO connector.

The more unusual connectivity options of the Jetson AGX are the PCIe Gen4 x16 slot as well as an M.2 PCIe x1 extension slot that is meant to be used for connectivity add-ons such as WiFi or cellular modules, both features that aren't common among Arm development boards, as most SoCs don't have the spare PCIe controllers.

The board comes with many other connectors as well, and in that regard the new Jetson AGX doesn't lack at all in flexibility. Power is supplied by an external generic 19V power supply – the stock one supplied by NVIDIA is a 65W LiteOn unit that seems no different from most laptop charger bricks.

Underneath the quite heavy and solid aluminium heatsink we find what actually powers the Jetson AGX board: the AGX Xavier module. This is a system module that sits on top of the Jetson motherboard – the module has no I/O ports of its own and merely serves as the brains of the system, integrating the core components surrounding the Xavier chip: 16GB of LPDDR4X memory, a small 32GB eMMC storage chip, and all the power delivery circuitry for the various power rails of the DRAM and the IP blocks of the Xavier SoC.

The Xavier chip, as said, is the brains of the platform and represents NVIDIA's biggest and most complex SoC to date. With 9 billion transistors on a die size of 350mm², it's one of the heavyweights of the Arm ecosystem, although between its initial announcement and today Apple has one-upped NVIDIA in terms of transistor count, as the new A12X is a 10-billion transistor chip – in a much smaller manufacturing node.

Coming from the traditional PC industry, NVIDIA doesn't shy away from showing die shots of their products, something that is quite rare these days among the Arm SoC vendors. The Xavier SoC is dominated by two big IP blocks which make up the majority of the die area: the 8-core "Carmel" CPU complex and a four-cluster Volta GPU.

At the high level, the CPU complex contains 8 Carmel CPU cores arranged in four clusters of two cores each. Each cluster runs on an independent clock plane and shares a 2MB L2 cache between its two cores, while at the CPU complex level a 4MB L3 cache serves all clusters. We don't know too much about the microarchitecture of the new Carmel cores – this looks to be a successor to NVIDIA's Denver µarch, a design that was characterised by its dynamic code optimisation capability. The only thing NVIDIA does advertise is that this is a 10-wide superscalar machine (10 execution ports in this case, not 10-wide decode) with support for the ARMv8.2+RAS instruction set. We'll come back to the CPU core later in the article.
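
For readers who end up tinkering with a Jetson board, the cluster and cache arrangement described above can be cross-checked against what Linux reports through sysfs. Below is a minimal sketch, assuming a stock L4T (Linux for Tegra) image and the standard /sys/devices/system/cpu layout – the exact topology file names can vary between kernel versions:

    # topology_check.py - print per-CPU cluster and cache info from sysfs
    import glob, os

    def read(path):
        with open(path) as f:
            return f.read().strip()

    for cpu in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*")):
        name = os.path.basename(cpu)
        # Cores in the same cluster report the same cluster/package id
        topo = os.path.join(cpu, "topology", "cluster_id")
        if not os.path.exists(topo):
            topo = os.path.join(cpu, "topology", "physical_package_id")
        cluster = read(topo)
        caches = []
        for idx in sorted(glob.glob(os.path.join(cpu, "cache", "index[0-9]*"))):
            level = read(os.path.join(idx, "level"))
            size = read(os.path.join(idx, "size"))
            shared = read(os.path.join(idx, "shared_cpu_list"))
            caches.append("L" + level + "=" + size + " (shared with CPUs " + shared + ")")
        print(name + ": cluster " + cluster + "; " + "; ".join(caches))

On an 8-core Carmel configuration this should come back as four two-core clusters, a 2MB L2 shared within each pair, and a 4MB L3 visible to all cores.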

The GPU in Xavier has its roots in the Volta architecture. Here we find the GPU configured into four TPCs (texture processing clusters), each with two SMs (streaming multiprocessors), for a total of 8 SMs or 512 ALU lanes/CUDA cores. A most interesting aspect of the GPU is that, because it is based on Volta, it also inherits the tensor processing units from its bigger brethren. The tensor cores augment the total processing power of the GPU by up to 22.6 INT8 TOPS or 11.3 FP16 TOPS, on top of the 2.8 and 1.4 TFLOPS of respectively FP16 and FP32 CUDA operations provided by the SMs.
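
Those throughput figures follow directly from the SM count and clock rate. As a quick sanity check, here's a back-of-the-envelope sketch; the ~1.37GHz maximum GPU clock and the 8-tensor-cores-per-SM layout are assumptions carried over from the desktop Volta parts rather than figures taken from this board:

    # peak_throughput.py - rough peak-rate arithmetic for the Xavier GPU
    GPU_CLOCK_GHZ = 1.377        # assumed maximum clock; verify on hardware (e.g. via tegrastats)
    SMS = 4 * 2                  # four TPCs, two SMs each
    CUDA_CORES = 64 * SMS        # 64 FP32 lanes per Volta SM -> 512 total
    TENSOR_CORES_PER_SM = 8      # assumption based on the desktop Volta SM layout

    fp32_tflops = CUDA_CORES * 2 * GPU_CLOCK_GHZ / 1000   # 2 FLOPs per FMA; /1000 for GFLOPS -> TFLOPS
    fp16_tflops = fp32_tflops * 2                          # packed FP16 runs at twice the FP32 rate
    tensor_fp16 = SMS * TENSOR_CORES_PER_SM * 128 * GPU_CLOCK_GHZ / 1000  # 64 FMAs = 128 FLOPs/clock per tensor core
    tensor_int8 = tensor_fp16 * 2                          # INT8 at double the FP16 tensor rate

    print("FP32 CUDA:   %.1f TFLOPS" % fp32_tflops)   # ~1.4
    print("FP16 CUDA:   %.1f TFLOPS" % fp16_tflops)   # ~2.8
    print("Tensor FP16: %.1f TFLOPS" % tensor_fp16)   # ~11.3
    print("Tensor INT8: %.1f TOPS"   % tensor_int8)   # ~22.6

The arithmetic lands on the same 1.4/2.8 TFLOPS and 11.3/22.6 TOPS figures NVIDIA quotes, which suggests the marketing numbers are simply peak rates at maximum clock.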

Alongside the CPU and GPU there are many other important blocks, many of which NVIDIA had already covered at its Hot Chips 2018 presentation last summer. The one block that really augments the Xavier SoC is the new DLA (deep learning accelerator): this is very much a new type of block that follows the trend we've seen in the mobile SoC space – a dedicated machine inferencing acceleration unit not unlike those we've seen from the likes of HiSilicon or Apple. NVIDIA's DLA promises performance of up to 11.4 INT8 TOPS, and is also capable of FP16 operation at half speed, or 5.7 TOPS. On the SoC, the unit is implemented as a dual-core instance.
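
For developers, the DLA is reached through NVIDIA's TensorRT stack, which can place supported layers on a DLA core and fall back to the GPU for anything the accelerator can't handle. The following is a minimal configuration sketch using the TensorRT Python API – the exact builder calls differ between TensorRT releases, and "model.onnx" is just a placeholder file name:

    # dla_build.py - sketch of building a TensorRT engine that prefers a DLA core
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open("model.onnx", "rb") as f:                # placeholder model file
        parser.parse(f.read())

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)              # the DLA runs INT8/FP16, not FP32
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)      # unsupported layers fall back to the GPU
    config.default_device_type = trt.DeviceType.DLA    # prefer the DLA for supported layers
    config.DLA_core = 0                                # Xavier exposes two DLA cores: 0 and 1

    engine_bytes = builder.build_serialized_network(network, config)

Whether a given layer actually lands on the DLA depends on the layer type and precision, which is exactly why the GPU fallback flag matters in practice.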

Alongside the DLA, the programmable vision accelerator (PVA) is again a key component of the Xavier system that allows it to focus on vision, and in particular robotics, embedded AI and automotive use-cases. The PVA is a more traditional vision IP block that handles rudimentary tasks such as object detection far more efficiently than the GPU or machine inferencing algorithms could. Here the PVA is the first IP block after the ISP in the vision pipeline, serving to segment parts of an image into objects that are then forwarded to further algorithms running on the GPU or DLA.
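
To make the division of labour concrete, the flow NVIDIA describes maps onto a simple staged pipeline. The sketch below is purely illustrative pseudostructure – every function name is a hypothetical placeholder standing in for work that would run on the ISP, PVA and DLA/GPU respectively, not a real NVIDIA API:

    # vision_pipeline.py - conceptual staging of a Xavier vision pipeline (hypothetical helpers)
    def isp_process(raw_frame):
        """Debayer/denoise the raw sensor data (the ISP's job)."""
        raise NotImplementedError

    def pva_propose_regions(frame):
        """Cheap, fixed-function detection/segmentation into candidate objects (the PVA's job)."""
        raise NotImplementedError

    def dla_classify(region):
        """Heavier CNN inference on each candidate region (the DLA's or GPU's job)."""
        raise NotImplementedError

    def process_frame(raw_frame):
        frame = isp_process(raw_frame)
        regions = pva_propose_regions(frame)
        return [dla_classify(region) for region in regions]

The point of the arrangement is that the expensive inferencing hardware only ever sees the small regions the PVA has already singled out, rather than every pixel of every frame.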

Comments

  • linuxgeex - Friday, November 8, 2019

    Add this line to the following files (linux/bsd or windows)

    /etc/hosts or c:\windows\system32\drivers\etc\hosts

    127.0.0.1 ads.servebom.com

    job done.
  • TheinsanegamerN - Friday, January 4, 2019

    auto video ads are hell incarnate.
  • Yojimbo - Friday, January 4, 2019

    Regarding NVIDIA's future CPU core development, I think it's important to note that NVIDIA has developed all major IP blocks on the SoC. That probably allows them to work on integration sooner than if they relied on externally developed IP blocks. Also, they have the ability to tune their cores and fabric to their intended application, which is a narrow subset of what ARM is developing for. I'm guessing NVIDIA doesn't tune the performance of their CPU cores using specint or specfp. They probably look at much more specific and realistic benchmarks.

    And by the time the Cortex A76AE is available for NVIDIA to use they will probably have a next iteration of their CPU which perhaps will show up in Orin in early 2021 or even late 2020. It's not clear to me what delayed Xavier from NVIDIA's original schedule. It's possible they'll be able to get the next one out with less time between the launch of the underlying GPU architecture and the availability of the SoC. There was a lot of new stuff that went into Xavier other than the GPU architecture, such as the increased safety features, the DLA, and the PVA.
  • DeepLearner - Friday, January 4, 2019

    I hope they'll send you a T4 soon! I'm dying for numbers on those.
  • eastcoast_pete - Friday, January 4, 2019

    @Andrei: thanks for this review. I wonder if the recent loss of a larger client in the automotive sector (Audi/Volkswagen) to Samsung played a role in Nvidia's willingness to make samples available to you for review. As of model year 2021, Audi will stop using Tegra-based units and move to Samsung's Exynos Auto V9 SoC, which actually features eight A76 cores based on Arm's Cortex-A76AE design for automotive/vehicular use.
    While that specialized SoC is still awaiting mass production, I also wonder if Samsung's choice to use straight-up ARM A76 cores (yes, they are AE, so not standard A76) portends a sea change for the mainstream Exynos lines also? As you pointed out, Mongoose turned out to be quite disappointing, so is there a change coming? Would appreciate your insights and comment!
  • webdoctors - Friday, January 4, 2019

    I was also confused by the news of Audi using Samsung chips. I don't think this changes the Audi/Nvidia relationship from googling: http://fortune.com/2017/01/05/audi-nvidia-2020/

    I think in the infotainment sector there's just a lot of competition for cheap chips and a low bar for entry. Any Mediatek or run of the mill cellphone chip should do. I doubt you'd care about ECC or safety in the HW playing your music or watching movies. My current car has an aftermarket unit that's 10 years old that can play DVD movies, has GPS maps and integrates a backup camera.

    I'm not sure how you'd program a beast of a chip here, or even what the right benchmarks are, since you wouldn't need it just to play movies, show maps or run CPU benchmarks. With all the inferencing and visual processing it'd be a waste of resources and money to use it for the traditional tasks done today in cars.

    I'm really curious how Anandtech evaluates these specialized products that aren't your run of the mill CPU/GPU/HDD.
  • unrulycow - Saturday, January 5, 2019

    This is obviously overkill for the entertainment system. Its main purpose is for semi-autonomous driving systems like Cadillac's SuperCruise or Tesla's Autopilot.
  • Andrei Frumusanu - Friday, January 4, 2019

    As far as I know their mobile roadmap still uses custom cores. There are probably different requirements for automotive, or they could have simply said that 8 A76s make a lot more sense than 8 custom cores.
  • eastcoast_pete - Saturday, January 5, 2019

    Thanks Andrei! Yes, design requirements for automotive/vehicle-embedded are different in key areas (safety/security). However, I was/am struck by Samsung not adapting their own Mongoose design for AE use. Maybe their client (Audi) preferred the stock A76 AE design, and it wasn't economical to adapt Mongoose. However, this now means that the most powerful Samsung SoC design (A76 octacore) might be found in - Audi cars.
  • unrulycow - Saturday, January 5, 2019

    They are also losing Tesla as a client. Tesla decided to create their own chip which will theoretically start going into cars in Q2. I would love to see a comparison between the two chips.
