Architecture Overview
BROADCOM VideoCore® IV 3D Architecture Reference Guide
September 16, 2013 • VideoCoreIV-AG100-R Page 14
®
VideoCore® IV 3D Architecture Guide
Scalability is principally provided by multiple instances of a special purpose floating-point shader processor,
termed a Quad Processor (QPU). The QPU is a 16-way SIMD processor. Each processor has two vector floating-
point ALUs which carry out multiply and non-multiply operations in parallel with single instruction cycle
latency. Internally the QPU is a 4-way SIMD processor multiplexed 4× over four cycles, making it particularly
suited to processing streams of quads of pixels.
QPUs are organized into groups of up to four, termed slices, which share certain common resources. Each slice
shares an instruction cache, a Special Function Unit (for recip/recipsqrt/log/exp functions), one or two Texture
and Memory lookup Units (TMU), and Interpolation units for interpolation of varyings (VRI).
The Vertex Cache Manager and DMA (VCM and VCD) collect together batches of vertex attributes and place
these into the Vertex Pipe Memory (VPM). Each batch of vertices is shaded by one of the QPUs and the results
are stored back into the VPM.
In the tile binning phase, only the vertex coordinate transform part of the vertex shading is performed. The
Primitive Tile Binner (PTB) fetches the transformed vertex coordinates from the VPM, and works out which
tiles, if any, each primitive overlaps. As it goes along the PTB builds a list in memory for each tile, which contains
all the primitives impacting that tile, plus references to any state changes that apply.
The Primitive Setup Engine (PSE) fetches shaded vertex data from the VPM and calculates setup data for
rasterising primitives and the coefficients of the equations for interpolating varyings. The rasteriser setup
parameters and the Z and W interpolation coefficients are fed to the Front End Pipeline (FEP), but the varyings
interpolation coefficients are stored directly to a memory in every QPU slice, where they are used for just-in-
time interpolation.
The Front End Pipeline (FEP) includes the Rasteriser, Z interpolation, Early-Z test, W interpolation and W
reciprocal functions, operating at 4 pixels per clock. The early-z test uses a reduced resolution shadow of the
tile’s Z-buffer, which is maintained by the tile buffer block. Each group of four successive quads of 4 pixels
output by the FEP is stored into dedicated registers mapped into the QPU which is scheduled to carry out
fragment shading for that group of quads.
The QPUs are scheduled automatically by the hardware via the QPU scheduler (QPS). The VCM/VCD
continuously fills input buffers of batches of 16 vertex attributes in the VPM. Whenever there is an input batch
ready, the next available QPU is scheduled for vertex shading for that batch. When there is no vertex shading
batch ready to process, the next available QPU is scheduled to fragment shade the next batch of four quads
that become ready.
There is nominally one Texture and Memory lookup Unit (TMU) per slice, but texturing performance can be
scaled by adding TMUs. Due to the use of multiple slices, the same texture will appear in more than one TMU.
Each texture unit has a small L1 cache. There is an L2 cache (L2C) shared between all texture units.
The TMU in each slice can be used for general-purpose data lookups from memory as well as filtered texture
lookups. Another mechanism for reading and writing main memory is to use the VCD or VDW to DMA data into
or out of the VPM from where it can be accessed by the QPUs. The QPUs can also read program constants, as
in non-indexed shader uniforms, as a stream of data from main memory via the Uniforms Cache. The Uniforms
Cache (and Instruction Cache) for each slice fetch data from main memory via the common L2C.
The PSE stores varying interpolation coefficients simultaneously into memory in every QPU slice. With
simultaneous add and multiply, each varying interpolated only costs a single QPU instruction.
A dedicated Coverage Accumulation Pipeline (CAP) is used for OpenVG coverage rendering.