GeForce 8800 Technology Preview

 

Massively General

 

by Josh Walrath

 

The G80 Architecture

            Initial work on the G80 started in 2002, and the overall plan for the design was finalized in 2003.  Only now, in 2006, has the finished product come to fruition.  Considering that NV40 design work commenced at the same time, it is amazing to think of the foresight needed for a design as ambitious as the G80.  While we in the tech press have a pretty good idea of what is going on and what future trends we can expect to see, it is nothing compared to what comes out of the internal meetings of these GPU architects (mostly ex-SGI guys who know rendering backwards and forwards).  Once we dig into the details, we will get a small glimpse of the technical expertise and architectural aggressiveness these engineers possess.  Note also the timing of this architecture: 3dfx was bought up by NVIDIA in late 2000, and many of those engineers were successfully integrated into NVIDIA by 2001.  Old 3dfx concepts such as the “texturing computer” now seem to have finally been fully developed and exposed.

This diagram will be making the rounds.  Note the 16 Stream Processors per "mega-unit" and the decoupled texture addressing and filtering units.  Also of note are the large L1 and L2 caches present throughout the design.

            Over the years graphics chips have become more and more parallel.  The G71 had an impressive 24 pixel pipelines, each made up of two ALUs, one of which could handle texturing duties.  The G80 takes parallelism to the next level.  There are 128 individual “Stream Processors” (SPs) that run at an astounding 1.35 GHz.  These are full FP32 units that also conform to the IEEE 754 standard.  The SPs are scalar, as compared to the vector based units in previous chips.  A scalar unit is not as potentially powerful per clock as a vector based unit, but it has the advantage of being 100% efficient.  So while these 128 scalar units have essentially the same per-clock throughput as 32 Vec4 units, that equivalence holds only when the vector units are producing full four-component results (Vec4, or Vec3 + scalar).  When narrower results such as Vec2 or scalar values are output, the scalar units retain 100% efficiency while the traditional vector units are underutilized.
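The utilization argument above can be put into numbers with a small sketch.  This is an idealized model and not a description of NVIDIA's scheduler: it assumes a hypothetical chip with 32 four-wide vector units that issue one result per clock with no co-issue, and it assumes the 128 scalar units can always be kept busy.  The function names are made up for illustration.

```python
SCALAR_UNITS = 128   # G80 Stream Processors, per the article
VECTOR_UNITS = 32    # hypothetical Vec4 chip with the same peak lane count
VECTOR_WIDTH = 4     # components per vector unit


def vector_efficiency(components: int) -> float:
    """Fraction of a Vec4 unit's lanes doing useful work for one result."""
    if not 1 <= components <= VECTOR_WIDTH:
        raise ValueError("components must be between 1 and 4")
    return components / VECTOR_WIDTH


def useful_components_per_clock_vector(components: int) -> int:
    """Useful components per clock when every vector unit emits this width."""
    vector_efficiency(components)  # validates the argument
    return VECTOR_UNITS * components


def useful_components_per_clock_scalar() -> int:
    """Assumption: scalar units are always fully occupied (idealized)."""
    return SCALAR_UNITS


if __name__ == "__main__":
    for width in (4, 3, 2, 1):
        print(
            f"{width}-component results: "
            f"vector {useful_components_per_clock_vector(width)}/clock "
            f"({vector_efficiency(width):.0%} efficient), "
            f"scalar {useful_components_per_clock_scalar()}/clock"
        )
```

Under these assumptions the two designs tie only at four components per result; at two components the vector chip does half the useful work per clock, and at one component a quarter.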

            The SPs also feature a Stream Out function, which allows them to write full or partial results to cache, from where the data can be fed through the SPs again.  In more traditional architectures these results are passed down the pixel pipeline.  With this new architecture the results can be looped back through the SPs as many times as needed.  The SPs seem significantly simplified compared to older pixel shader units, but this simplification has allowed a lot more flexibility.  With 128 units running at 1.35 GHz, the overall result is far more performance than traditional shaders offer.  The SPs are full custom designs, which explains why they are able to run at 1.35 GHz.  Graphics chips are typically designed with a lot of automation and standard cells, but when a unit is fully custom its electrical properties are well known, and it can run at faster speeds with lower power consumption than a functionally identical unit produced through standard cells and automation.  While the SPs can probably run at faster clockspeeds, the surrounding architecture would have to run faster as well to support them, and that would lead to a lot more heat and power consumption.  As it is, NVIDIA has done its best to balance the performance of the SPs against the surrounding architecture.  This leaves some performance on the table for future products based on these SPs.

Closer detail of the G80 SPs and their functionality.

            NVIDIA implemented a wider crossbar memory controller instead of going with a ringbus architecture.  The 8800 GTX features 6 x 64 bit crossbar memory controllers, which give a 384 bit path to main memory.  768 MB of GDDR3 running at 900 MHz gives an overall bandwidth of 86.4 GB/sec.  The 8800 GTS has 5 x 64 bit crossbar memory controllers, giving a 320 bit pathway to main memory.  640 MB of GDDR3 running at 800 MHz gives an overall bandwidth of 64 GB/sec.  NVIDIA feels this is efficient enough for its needs, and it has lower latency than a ringbus architecture.
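The bandwidth figures above follow from the bus width and memory clock, as this short sketch shows.  It assumes GDDR3 moves data twice per memory clock (double data rate) and uses decimal gigabytes; the function name and parameters are illustrative only.

```python
def memory_bandwidth_gb_s(
    bus_width_bits: int,
    memory_clock_mhz: float,
    transfers_per_clock: int = 2,  # assumption: double data rate GDDR3
) -> float:
    """Peak theoretical memory bandwidth in GB/sec (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = memory_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


if __name__ == "__main__":
    # 8800 GTX: 6 x 64 bit controllers, 900 MHz memory clock
    print(f"8800 GTX: {memory_bandwidth_gb_s(6 * 64, 900):.1f} GB/sec")
    # 8800 GTS: 5 x 64 bit controllers, 800 MHz memory clock
    print(f"8800 GTS: {memory_bandwidth_gb_s(5 * 64, 800):.1f} GB/sec")
```

These are peak theoretical numbers; sustained bandwidth in real workloads will be lower.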

            NVIDIA has implemented a thread dispatch engine, but is not terribly keen to give away the details.  All the company is willing to say is that there are thousands of threads in flight at any one time, and that branch efficiency is very high compared to current ATI products.

 


 

Copyright 1999-2006 PenStar Systems, LLC.