It’s an Out-of-Order Atom

Ever since the Pentium Pro (P6), we have been blessed with out-of-order microprocessor architectures: designs that can execute instructions out of program order to improve performance. Out-of-order architectures let you schedule independent instructions ahead of others that are either waiting for data from main memory or waiting for specific execution resources to free up. The resulting performance boost comes at the expense of power and die size: all of the tracking logic needed to ensure that instructions executed out of order still retire in order consumes die area and burns additional power.
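
To make that concrete, here's a minimal C sketch (a hypothetical illustration of mine, not anything from AMD or Intel): the linked-list walk is a serialized chain of loads that can miss in cache, while the array math next to it is independent of the walk.

```c
#include <stddef.h>

struct node { struct node *next; int payload; };

/* The list walk forms a dependent chain of loads: each one needs the
 * previous result and may miss in cache. The array math does not depend
 * on the walk. An in-order core stalls everything behind each miss; an
 * out-of-order core can issue the independent adds while a miss is
 * still outstanding. */
int walk_and_sum(const struct node *head, const int *data, size_t n)
{
    int list_sum = 0;
    int array_sum = 0;
    for (size_t i = 0; i < n && head != NULL; i++) {
        list_sum += head->payload;   /* latency-bound, dependent chain */
        head = head->next;
        array_sum += data[i] * 3;    /* independent; can execute early */
    }
    return list_sum + array_sum;
}
```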

When Intel designed the Atom processor it went back to an in-order design as a way of reducing power. Intel has committed to using in-order architectures in Atom for four to five years post-introduction (that commitment would end sometime in the 2012 to 2013 timeframe).

For smartphones, Intel’s commitment to in-order makes sense. Average power consumption under load needs to remain below 1W, and you simply can’t hit that with an out-of-order Atom at 45nm.

For netbooks and notebooks, however, the tradeoff makes less sense. Jarred has often argued that a CULV notebook is a far better performer than a netbook at very similar price and battery life. No one is pleased with Atom’s performance in a netbook, but there’s clearly demand for the form factor and price point. Where there’s an architectural opportunity like this, AMD is usually there to act.

Over the past decade AMD has refrained from simply copying Intel's designs; instead, it usually looks to leapfrog Intel by implementing forward-looking technologies earlier than its competitor. We saw this with the 64-bit K8 and the cache hierarchy of the original Phenom and Phenom II processors. Both featured design decisions that Intel would later adopt; they were simply ahead of their time.

With Atom stuck in an in-order world for the near future, AMD’s opportunity to innovate is clear.

The Architecture

Admittedly, I was caught off guard by Bobcat’s architecture: it’s a dual-issue design, the first AMD has introduced since the K6, and the same issue width Intel chose for Atom. Where AMD and Intel diverge, however, is on the execution side: Bobcat is a fully out-of-order architecture.

The move to out-of-order execution should provide a healthy single-threaded performance boost over Atom, assuming AMD can ramp up clock speeds. Bobcat has a 15-stage integer pipeline, very close to Atom's 16-stage pipe. The two pipeline diagrams are below:


[Diagram: Bobcat pipeline]

[Diagram: Intel's Atom pipeline]

You’ll note that there are technically six fetch stages, although only the first three are included in the 15-stage count I mentioned above. AMD says the remaining three stages are used for branch prediction, in a manner it is unwilling to disclose at this time due to competitive concerns.

Bobcat has two independent, dual-ported integer schedulers. One feeds two ALUs (one of which can perform integer multiplies), while the other feeds two AGUs (one for loads and one for stores).

The FPU has a single dual-ported scheduler that feeds two independent FP units. As on Atom, only one of the ports can handle floating-point multiplies. The FP multiply and add units can each perform two single-precision (32-bit) operations per cycle. Like the integer side, the FPU uses a physical register file to reduce power.

Bobcat supports SSE1-3, with future versions adding more instructions as necessary.
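
As an illustration of the kind of code these FP units chew through, here's a hedged SSE1 sketch (my example, not AMD's; the function name `saxpy_sse` is hypothetical). Each 128-bit packed instruction operates on four floats, so on a unit that produces two single-precision results per cycle, each packed op would plausibly occupy the pipe for two cycles.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical kernel: y[i] = a*x[i] + y[i], four floats per iteration.
 * _mm_mul_ps and _mm_add_ps are packed single-precision SSE1 ops; on an
 * FP unit that retires two 32-bit results per cycle, each 128-bit op
 * would plausibly take two cycles to complete. */
void saxpy_sse(float a, const float *x, float *y, size_t n)
{
    __m128 va = _mm_set1_ps(a);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);            /* unaligned loads */
        __m128 vy = _mm_loadu_ps(y + i);
        vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);    /* one mul, one add */
        _mm_storeu_ps(y + i, vy);
    }
    for (; i < n; i++)                              /* scalar remainder */
        y[i] = a * x[i] + y[i];
}
```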

Bobcat also supports out-of-order loads and stores, similar to Intel’s Core architecture.

The Bobcat core has a 3-cycle, 64KB L1 cache (32KB instruction + 32KB data) that’s 8-way set associative. The L2 is a 17-cycle, 512KB, 16-way set associative cache. I originally measured Atom’s L1 and L2 latencies at 3 and 18 cycles respectively (I’ve heard numbers as low as 15 for Atom’s L2), so AMD is definitely in the right ballpark here.
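
For the curious, latencies like these are typically measured with a pointer-chasing microbenchmark. Below is a minimal sketch of that technique (an illustration, not the actual tool used here): every load depends on the previous one, so the average time per iteration approximates load-to-use latency for whatever footprint the chain touches.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    /* Footprint selects the cache level being probed: ~32KB stays in
     * Bobcat's L1 data cache, ~256-512KB spills into the L2. */
    const size_t footprint = 32 * 1024;
    const size_t count = footprint / sizeof(void *);
    const size_t iters = 100000000;

    void **ring = malloc(count * sizeof(void *));
    if (ring == NULL)
        return 1;
    for (size_t i = 0; i < count; i++)   /* build a cyclic pointer chain */
        ring[i] = &ring[(i + 1) % count];

    void **p = ring;
    clock_t start = clock();
    for (size_t i = 0; i < iters; i++)
        p = (void **)*p;                 /* each load depends on the last */
    double ns = (double)(clock() - start) / CLOCKS_PER_SEC * 1e9 / iters;

    /* Printing p keeps the compiler from optimizing the chase away. A
     * real tool would randomize the chain to defeat hardware prefetchers
     * and convert ns to cycles using the measured clock speed. */
    printf("~%.2f ns per dependent load (%p)\n", ns, (void *)p);
    free(ring);
    return 0;
}
```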


[Diagram: Intel's Atom microarchitecture]

Unlike the original Atom, Bobcat will never ship as a standalone microprocessor. Instead it will be integrated with other cores and a GPU and sold as a single SoC. The first incarnation of Bobcat will be Ontario, a processor due out in early 2011 for netbooks and thin-and-light notebooks. Ontario will integrate two Bobcat cores with an AMD GPU, manufactured on TSMC’s 40nm process (Bobcat will be the first x86 core made at TSMC). This will be the first Fusion product to hit the market.

Note that there's an on-die memory controller, but it's actually housed between the CPU and GPU in order to serve both masters equally.

Comments

  • Dustin Sklavos - Tuesday, August 24, 2010 - link

    If you're encoding using Adobe software, ditch AMD until Bulldozer. Adobe's software makes heavy use of SSE 4.1 instructions, which current AMD chips lack, and the extra two cores don't pick up the slack compared to a fast i7.
  • flyck - Tuesday, August 24, 2010 - link

    "From the design of Bulldozer's FPU it is clear that AMD wants the multithreaded FPU to run OpenCL."

    Not sure what you mean by that? (It's true they want to exploit that in the future with Fusion.) But at the moment I see: Sandy Bridge, two threads -> one FPU; Bulldozer, two threads -> one FPU.
  • BitJunkie - Tuesday, August 24, 2010 - link

    I think he's picking up on the point that this general-purpose design is going to favour integer operations over floating point. Looking at this architecture from the perspective of someone wanting to perform a lot of floating-point matrix calculus, the performance improvement of each "core" is going to be proportionally smaller than for integer calcs.

    So what he's saying is that quite clearly AMD believes general-purpose CPUs are just that, and has designed for a well-defined balance of FP and integer operations, i.e. if you want more FLOPS, go talk to the GPU?
  • stalker27 - Tuesday, August 24, 2010 - link

    "And if Bulldozer comes any later, it will be up against the die shrink of SandyBridge, Ivy Bridge. Things dont look so good in here."

    Basically, you've contradicted yourself right here:

    "Most of us dont need SUPER FAST computer."

    True, and true... Ivy will probably be faster than Bulldozer (speculatively), as Nehalem is to Stars, but most people, i.e. the "cash cows," won't buy these expensive products. Instead they will focus on mid- to low-end computers whose performance is more than enough for their needs.

    So things might not look good in reviews and on test benches, but in the stores and on people's bank balances they will look pretty good.
  • jabber - Tuesday, August 24, 2010 - link

    Hooray!

    I'm glad at last some folks are waking up to the fact that having the fastest or most expensive CPU means absolutely jack!

    All the latest fastest CPU stuff just means a little bit more internet traffic for tech review sites.

    The rest of the world doesn't give a damn.

    All the real world is interested in is the best CPU for the buck in a $400 PC box to run W7 and Office on. AMD needs to get a proper marketing dept to start telling folks that.

    All AMD has to do is produce good-performing chips at a good price. It doesn't need a CPU that beats Intel's best.

    The real world lost interest in CPU performance the minute dual cores arrived and they could finally run IE/Office and a couple of mainframe sessions without it grinding to a halt.

    I bet Intel gives out more review samples of its top CPU than it sells.
  • JPForums - Tuesday, August 24, 2010 - link

    "All the real world is interested in is the best CPU for the buck in a $400 PC box to run W7 and Office on. AMD needs to get a proper marketing dept to start telling folks that."

    "The real world lost interest in CPU performance the minute dual cores arrived and they could finally run IE/Office and a couple of mainframe sessions without it grinding to a halt."

    Apparently we engineers aren't part of "the rest of the world".
    Try running products from the likes of Mentor Graphics, Cadence, and Synopsys on reasonably large designs. Check out what a difference each new CPU makes in Pro/E (assuming sufficient GPU horsepower). Run some large MATLAB simulations, Visual Studio compilations, and Xilinx builds. You don't even have to get out of college before you run into many of these scenarios.

    Trust me when I say that we care about the next greatest thing.
    An extra $1000 on a CPU is easily justified when companies are billing $100+ per engineering hour (not to be confused with take-home pay).
  • BitJunkie - Tuesday, August 24, 2010 - link

    Exactly so: an example would be a 24-hour calculation to perform a detailed 3D finite element analysis. This is not unusual using highly spec'd Xeon workstations from your vendor of choice.

    It might take 5 to 10 days to set up a model, including testing of different aspects: mesh density, discretisation errors, boundary effects, parametric studies. The setup time, with its numerous supporting pre-analysis runs, is what really costs. Anything we can do to reduce this is worthwhile.

    The above would be the typical process BEFORE considering a batch job on an HPC cluster if we wanted to look at a series of load cases, etc.

    Time is money.
  • mapesdhs - Tuesday, August 24, 2010 - link


    I know a number of movie studios who love every extra bit of CPU muscle they can get their hands on. Rendering really hammers current hardware. One place has more than 7,000 Xeon cores, but it's never enough. Short of writing specialised software to exploit shared-memory machines that use i7-class Xeons (which has its own costs), the demand for ever higher processing speed will always persist. Visual effects complexity constantly increases as artists push the boundaries of what is possible. And this is just one example market segment. As BitJunkie suggests, these issues surface everywhere.

    Another good example: the new Cosmos machine in the UK, which contains 128 x 6-core i7 Xeons (Nehalem-EX) with 2TB RAM (i.e. 768 cores total). This is a _single system_, not a cluster (SGI Altix UV). Nothing less is good enough for running modern cosmological simulations. There will be much effort by those using the system on achieving good efficiency with 512+ cores; atm many HPC tasks don't scale well beyond 32 to 64 cores. The point being, improving the performance of a single core is just as important as general core scaling for such complex tasks. SGI's goal is to produce a next-gen UV system which will scale to 262,144 cores in a single shared-memory system (32,768 x 8-core CPUs).

    You can never have enough computing power. :D

    Ian.
  • stalker27 - Wednesday, August 25, 2010 - link

    You're 1% of the market... for you, Intel and AMD have reserved cherry-picked chips that they can charge you $1K for but that at the same time offer you that needed speed. How's that?

    BTW, he said "real world," not "rest of the world." That makes you somewhat of an illusion. But don't take it the wrong way... more like most of us can only dream of working in an environment full of hot setups, big projects, and big bucks, unlike the real world, where you have to mop the floor after debugging for eight hours straight... if they don't force you to work an extra two hours without pay, never mind that before you start the workday you have to go to various bureaucratic public clerks' offices to deal with stuff that was supposed to be taken care of by secretaries... who got fired for no apparent reason some time ago.

    So stop moaning... you have it good, even as the 1%.
  • Makaveli - Tuesday, August 24, 2010 - link

    lol, if AMD and Intel followed your logic we would all still be running Pentium IIs and Socket A Athlons, silly boy.

    You make yourself look like an ass when you make a generalized statement like that, as if you are speaking for the rest of the world.

    As that other guy pointed out, some of us do more than just office work on our PCs!
