The Shocking State of 3D Affairs

Cutting Through the Smoke and Hype

by Josh Walrath

Flexibility vs. Speed

            Comparing the NV3x architecture with the R3x0 is almost like comparing an Acura NSX to a nitrous-charged Hemicuda.  The Acura is a well-balanced sports car that has more than adequate performance in a variety of situations.  The Hemicuda is a quarter mile beast, but it has problems cornering in complex road races.  Unfortunately for NVIDIA, all current PS/VS 2.0 applications are quarter mile races.  The NV3x parts get their tailpipes handed to them by the R3x0 series.

            A quick comparison tells the underlying story of the two architectures.  The R300 and R350/360 chips feature 8 pixel pipelines, each of which acts as a separate pixel shader.  In classic pixel processing situations the chip can produce 8 single textured pixels per pass.  Running at 325 MHz or more, it can fill an entire scene at 1600x1200x32 at fast frame rates without a problem.  ATI's pipeline design is very straightforward: each pipeline takes all pixel inputs, whether they are FX12 integer, FP16, FP24, or FP32, and converts them into FP24.  FP24 produces very acceptable results in terms of rendering fidelity, even when FP32 is indicated by an application.  In current applications this is not a problem, as the shaders used for games are not as complex as the shader effects made for film.  In very complex shaders that require many passes, FP24 is not good enough and rendering errors can occur; this is when FP32 is needed.  However, we are not at the point with either software or hardware performance where this becomes an issue.  Even the complex shaders that Valve is writing for Half-Life 2 are only a dozen instructions long at most.  DX9 PS 2.0 in fact specifies a maximum shader length of 64 instructions, while the R300 can handle 96.  The NV3x series can handle shaders up to 1024 instructions long.
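
To put the fill rate claim in perspective, here is a rough back-of-the-envelope sketch.  It assumes one single textured pixel per pipeline per clock and ignores memory bandwidth, so the result is a theoretical ceiling rather than a benchmark.  The function names and the overdraw parameter are illustrative assumptions, not anything from ATI.

```python
# Theoretical peak fill rate for an 8 pipeline part at 325 MHz.
# Assumption: each pipeline writes one single textured pixel per clock.
PIPELINES = 8
CORE_CLOCK_HZ = 325_000_000
WIDTH, HEIGHT = 1600, 1200


def peak_fill_rate(pipelines: int, clock_hz: int) -> int:
    """Pixels written per second if every pipeline is busy every clock."""
    return pipelines * clock_hz


def peak_frames_per_second(pipelines: int, clock_hz: int,
                           width: int, height: int,
                           overdraw: float = 1.0) -> float:
    """Upper bound on frame rate; overdraw is the average number of
    times each screen pixel is written per frame (hypothetical knob)."""
    pixels_per_frame = width * height * overdraw
    return peak_fill_rate(pipelines, clock_hz) / pixels_per_frame


if __name__ == "__main__":
    rate = peak_fill_rate(PIPELINES, CORE_CLOCK_HZ)
    fps = peak_frames_per_second(PIPELINES, CORE_CLOCK_HZ, WIDTH, HEIGHT)
    print(f"{rate / 1e9:.1f} Gpixels/s, about {fps:.0f} fps at {WIDTH}x{HEIGHT}")
```

Even with a generous overdraw factor the ceiling sits far above playable frame rates, which is why raw single textured fill is not the limiting factor here.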

            The NV3x architecture is built for maximum flexibility, and that flexibility comes at the price of overall speed in standard pixel shading operations.  The NV30 and NV35 appear to have four separate pixel pipelines that handle FX12, FP16, and FP32 natively.  These pipelines can act like an 8 pipeline/1 texture unit design or a 4 pipeline/2 texture unit design, depending on what exactly is called for in software.  Each of the 4 pixel units also acts as a pixel shader, so right off the bat it appears fairly obvious that the Radeon series has an advantage here, with twice as many pixel shaders as the NVIDIA series.  With the NV3x there is no conversion of FX12 into FP16 or FP32 (as the R3x0 series does, though into FP24).  Because of this, NVIDIA does not suffer the speed penalty that ATI does with the conversion, and most applications that only require FX12 based pixels and shaders run faster on the NV3x series than on the R3x0 series.  These include many OpenGL, DX7, and DX8 applications.  The simple reason for this is internal bandwidth.  Native FX12 operations take up significantly less bandwidth inside the chip than FP24 operations.  Only when PS/VS 2.0 applications show up does NVIDIA drop rapidly in performance.
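
The internal bandwidth argument comes down to simple arithmetic on bit widths.  The sketch below assumes each value is a four component (RGBA) vector, which is my assumption for illustration; the per-component widths follow from the format names themselves.

```python
# Bits moved per four component value in each pixel format.
# Assumption: values are RGBA vectors; real hardware may pack differently.
BITS_PER_COMPONENT = {"FX12": 12, "FP16": 16, "FP24": 24, "FP32": 32}
COMPONENTS = 4


def bits_per_value(fmt: str) -> int:
    """Total bits for one RGBA value in the given format."""
    return BITS_PER_COMPONENT[fmt] * COMPONENTS


def relative_cost(fmt: str, baseline: str = "FP24") -> float:
    """Size of one value in fmt relative to the baseline format."""
    return bits_per_value(fmt) / bits_per_value(baseline)


if __name__ == "__main__":
    for fmt in BITS_PER_COMPONENT:
        print(f"{fmt}: {bits_per_value(fmt)} bits, "
              f"{relative_cost(fmt):.2f}x the size of FP24")
```

Under that assumption a native FX12 value is half the size of its FP24 equivalent, which is consistent with FX12 workloads running leaner on hardware that does not convert them.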

            The NV3x architecture also has some very interesting twists that allow me to apply the term “elegant” to it.  First off, it can handle pixel shaders up to 1024 instructions long.  This is a very large number, and one that should make any Digital Content Creation user drool.  The Radeon series can “only” handle 96 instructions (though with the use of the F-buffer ATI extends that limit).  This functionality is not exposed in PS 2.0, but this is not a problem in games because using even the maximum instructions allowed by PS 2.0 would kill performance on any current DX9 card.  What makes this interesting is that future cards will eventually use these instructions, so NVIDIA is providing a solid foundation on which to base future technology.  The other exciting aspect is for professional rendering applications, which do not need 30 fps in shader ops.  These complex shaders can take hours on renderfarms to process a single frame, and the promise of writing these shaders for NVIDIA hardware, and having them rendered relatively quickly on the desktop, is something many companies are looking at.  Shaders that would take hours to render on high end machines may only take seconds on a single workstation.  Even the most complex Pixar shaders are only a couple hundred instructions in length and, if ported correctly, might possibly run inside NVIDIA hardware.

            The ability to handle both FP16 and FP32 does give a very significant degree of flexibility to NVIDIA hardware.  I do agree with NVIDIA that partial precision is a valid function.  Why waste bandwidth and space internally in a chip if a shader will not benefit from higher precision?  From a developer's point of view this is a bit different, as they have to code their shaders to use either FP16 or FP32, depending on the needs.  ATI and DX9 basically allow developers to code entirely in FP24, which gives a good tradeoff between speed and precision in current generation applications.  In the future, ATI will eventually have to go with FP16 and FP32 pixel formats, but for now FP24 is sufficient for current needs.
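
To get a feel for what the three floating point formats can and cannot represent, here is a minimal sketch that rounds a Python float to the significand width of each one.  It assumes 10, 16, and 23 stored mantissa bits for FP16, FP24, and FP32 respectively, and it ignores exponent range, denormals, and hardware rounding modes, so treat it as an illustration rather than a model of either vendor's chips.  The helper names are my own.

```python
import math

# Stored mantissa bits per format (assumed layouts: s10e5, s16e7, s23e8).
MANTISSA_BITS = {"FP16": 10, "FP24": 16, "FP32": 23}


def relative_step(fmt: str) -> float:
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -MANTISSA_BITS[fmt]


def quantize(x: float, fmt: str) -> float:
    """Round x to the significand width of fmt (exponent range ignored)."""
    if x == 0.0:
        return 0.0
    sig_bits = MANTISSA_BITS[fmt] + 1   # include the implicit leading bit
    mantissa, exponent = math.frexp(x)  # x == mantissa * 2**exponent
    scale = 1 << sig_bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


if __name__ == "__main__":
    for fmt in MANTISSA_BITS:
        print(fmt, relative_step(fmt), quantize(2049.0, fmt))
```

In this sketch FP16 cannot tell 2049 apart from 2048, while FP24 and FP32 both can, which is one way to see why FP16 suits color math better than anything resembling addressing into a large texture.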

            NVIDIA hardware also has the ability to perform two FP16 operations per pixel shader, so it theoretically can do eight FP16 operations per pass, but doing this optimally takes some fairly serious coding and only works in a true FP16 environment.  Once FP32 is thrown into the mix, the efficiency of combining FP16 operations decreases.  This is probably another reason why mixed precision coding takes so long to do.  Optimizing for such a solution is time consuming, but once we see all of the graphics manufacturers go to FP16 and FP32 designs, it will become the status quo.
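
The effect of mixing precisions can be illustrated with a toy model.  This is purely hypothetical and not a description of how NV3x actually schedules work: it assumes a pipeline can issue either one FP32 operation or two adjacent, independent FP16 operations per slot, and it pairs them greedily in program order.

```python
# Toy issue model (hypothetical, not NVIDIA's real scheduler):
# one slot holds a single FP32 op or two adjacent FP16 ops.
PIPELINES = 4


def peak_ops_per_pass(precision: str) -> int:
    """Best case operations per pass across all pipelines."""
    return PIPELINES * (2 if precision == "fp16" else 1)


def issue_slots(ops: list[str]) -> int:
    """Count slots needed when adjacent FP16 ops are paired greedily."""
    slots = 0
    i = 0
    while i < len(ops):
        paired = (ops[i] == "fp16"
                  and i + 1 < len(ops)
                  and ops[i + 1] == "fp16")
        i += 2 if paired else 1
        slots += 1
    return slots


if __name__ == "__main__":
    pure = ["fp16"] * 4
    mixed = ["fp16", "fp32", "fp16", "fp32"]
    print(peak_ops_per_pass("fp16"), issue_slots(pure), issue_slots(mixed))
```

In this model a pure FP16 stream of four operations fits in two slots, while the same number of operations interleaved with FP32 needs four, so the pairing benefit disappears unless the shader is ordered with pairing in mind.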

            Another interesting aspect of the NV3x hardware is the ability to combine several simple shader instructions into one complex instruction.  This ability should provide significant savings in both bandwidth and rendering passes.  This flexibility and programmability are very impressive, and putting them into place right now, while not terribly useful with current games, could yield great results in the future as new hardware and software come out.

 


Copyright 1999-2003 PenStar Systems, LLC.