Graphics cores: trying to compare apples to apples

Marketing of modern graphics processors involves comparing many low-level performance metrics. For example, our PowerVR GPUs are commonly compared based on GFLOPS (a measure of computing throughput), triangles per second (a measure of geometry throughput), pixels per second and texels per second (measures of fill rate).

In addition to these more traditional metrics, it has become commonplace for companies to describe their architectures in terms of the number of cores they include. Although “core” is computing terminology with an established history, its meaning has become distorted by GPU marketing. That said, language is malleable and terms get updated over time to reflect their common use. I’ll come back to that common use part.

What is a core?

It depends. Core count traditionally tracks the number of front-ends in a processor. Without complicating it too much, the front-end is responsible for scheduling and dispatch of threads of execution. In almost all modern GPUs, again simplifying things a lot to make the point, there are multiple schedulers and associated dispatch logic, all sitting in front of their own compute resources to schedule work on.

Each scheduler keeps track of a number of threads that need to execute, running a single instruction for a single program in a single cycle. That notion of a single instruction pointer which runs a program on a set of compute resources, regardless of the number of threads or how the compute resources are shared, is the traditional definition of a core.

However, we’ve also used the term to describe whole instances of our Series5 SGX GPUs. In an SGX544MP3 for example, there are 3 complete instances of the SGX544 IP, duplicating all GPU resources, and we call that MP3 configuration a 3 core GPU.

Creative accounting

So, with the rapid increase in the number of CPU cores in modern mobile designs, GPU vendors want to put out the message that a GPU is a multi-core design too, and many of our competitors gain an extra advantage by counting individual ALU pipelines as cores. Those pipelines can’t be scheduled completely independently of each other, and so they run the same instruction per cycle as their peers in SIMD fashion. There’s no separate front-end or individual instruction pointer as we’ve outlined, but they are nonetheless marketed as cores.

Let’s describe PowerVR Rogue in the same way, from the basic building block of the Unified Shading Cluster down to its individual pipelines, and see what number of cores comes out.

PowerVR Rogue USC

The Rogue architecture organises itself around a block — itself a number of other blocks — called the Unified Shading Cluster, or USC for short. We scale the architecture to meet our customers’ demand for a GPU that fits their system-on-chip and the market segment it addresses, and we do that by connecting a number of USCs together, along with other associated resources, into the full GPU IP.

Lift the lid on a USC and you’ll see a collection of ALU pipelines that chew on the data and spit out results. We arrange those pipelines in parallel, 16 per USC. We do that because graphics is overwhelmingly a parallel endeavour where multiple related things, usually vertices or pixels, can be worked on at the same time. In fact, certain properties of modern pixel shading force parallel execution of related pixels together, so you always want to work on them at the same time.

Scalar SIMD execution and vector inefficiency

A key property of the USC’s execution is that it processes data in a scalar fashion. What that means is that for a given work item, for example a pixel, a USC doesn’t work on a vector of red, green, blue and alpha in the same cycle inside an individual pipeline. Instead, the USC works on the red component in one cycle, then the green component in the next, and so on until all components are processed. In order to achieve the same peak throughput as a vector-based unit, a scalar SIMD unit processes multiple work items in parallel lanes. For example, a 4-wide vector unit that processes one pixel per clock has the same peak throughput as a 4-wide scalar SIMD unit that processes one component each of four pixels per clock.

GPU compute: ALU utilization on scalar vs. vector architectures
On the face of it this makes the two approaches appear to have equivalent throughput. However, modern GPU workloads are typically composed of data that uses many different data widths. For example, colour data typically has a width of 4 (ARGB), whereas texture coordinates might typically have a width of 2 (UV) and there are many examples of scalar (1 component) processing such as parts of typical lighting calculations.

Where data processing doesn’t fill the full width of a vector you waste the vector processor’s precious compute resources. In a scalar architecture, the types you’re working on can take any form and they get worked on a component at a time, in unison with their other buddies that make up the parallel task. For example, a shading program that consists entirely of scalar processing would execute at 25% efficiency on a 4-wide vector architecture but would execute at 100% efficiency on a scalar SIMD architecture.
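The utilization argument is easy to put numbers on. Here’s a minimal sketch — a hypothetical Python model with a made-up instruction mix, not a measurement of real hardware — of how the two designs compare:

```python
# Hypothetical illustration: compare ALU lane utilization for a mix of shader
# operations on a 4-wide vector unit versus a scalar SIMD unit. Each entry in
# `widths` is the component count of one operation (4 = RGBA colour op,
# 2 = UV coordinate op, 1 = scalar lighting term).

def vector_utilization(widths, vector_width=4):
    # A vector unit spends one cycle per operation, but only `width` of its
    # `vector_width` lanes do useful work in that cycle.
    useful = sum(widths)
    return useful / (len(widths) * vector_width)

def scalar_simd_utilization(widths):
    # A scalar SIMD unit issues one component per cycle and fills its lanes
    # with other work items, so every lane is always busy.
    return 1.0

ops = [4, 2, 1, 1]  # one colour op, one UV op, two scalar ops
print(f"vector unit:      {vector_utilization(ops):.0%}")       # 50%
print(f"scalar SIMD unit: {scalar_simd_utilization(ops):.0%}")  # 100%
```

The all-scalar case mentioned above falls out of the same model: a program of width-1 operations gives 25% utilization on the 4-wide vector unit and 100% on the scalar SIMD unit.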

Lots of power-efficient ALUs!

Let’s get back to the individual pipelines working together on the parallel task in the USC. We have 16 of them, remember, but inside each pipeline there are actually a number of ALUs that can do work: 2 FP32 ALUs, 2 FP16 ALUs and 1 special function ALU.

Why dedicated FP16 ALUs? It’s for power efficiency in the main, but it also significantly affects performance. The reduced complexity of the logic in those ALUs lets us execute FP16 instruction groups at lower power than on the FP32 ALUs, all while giving a higher per-cycle throughput because of the extra operations they can perform. You’ll see what I mean there shortly.

Computation at lower precision is possible a lot of the time in modern graphics rendering, and there’s support for mixed precision computation in all of the popular graphics APIs Rogue is aimed at, including Direct3D 11 as well as the much more common OpenGL ES 2.0 and ES 3.0 APIs. Not building a mixed-precision computational pipeline is a mistake in embedded graphics, because of the power efficiency gains it can realise given how common mixed-precision workloads are.

Performance and capability

The ALUs aren’t equal in capability, so let’s cover what each can do and what its performance is:
The FP32 ALUs in all PowerVR Series6, Series6XT and Series6XE cores are capable of up to 2 floating point operations per cycle. Per USC, that’s a peak of 64 FLOPs per cycle.

There can be up to eight Unified Shading Clusters (USCs) inside a PowerVR Series6 GPU

The FP16 ALUs in PowerVR Series6 GPUs are capable of up to 3 floating point operations per cycle, and we’ve improved the FP16 ALUs in Series6XE and Series6XT to perform up to 4 FLOPs per cycle. Per USC, that’s up to 96 FLOPs per cycle on Series6 and up to 128 FLOPs per cycle on Series6XE and Series6XT. The improved design in Series6XE and Series6XT is additionally a bit more flexible, making it easier for the compiler to issue operations to that part of the pipeline.

There can be up to eight Unified Shading Clusters (USCs) inside a PowerVR Series6XT GPU

Lastly, we have the special function ALU, which handles more complex arithmetic and trigonometric operations such as sine, cosine, log, reciprocals and friends, on scalar values. Output precision and per-operation performance vary with the nature of each operation.

Adding it all up into ALU cores

Now that I’ve described the Rogue compute architecture from the basic building block of the USC, down to how it executes using 16 parallel pipelines, each with significant dedicated compute resource, let’s count everything the same way our competitors do: as cores. That gives us 32 FP32 ALU cores, up to 64 FP16 ALU cores, and 16 special function ALU cores per USC.
The ALU core terminology is important when comparing Rogue to competitor products marketed in the same way, and we’d like everyone to stick to it as much as possible.
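Counted that way, the tally is just pipelines multiplied by ALUs per pipeline. A minimal sketch, using the Series6XT figures (the per-pipeline FP16 count of 4 is inferred from the 64 FP16 ALU cores per USC quoted above; Series6 has 2 FP16 ALUs per pipeline):

```python
# "ALU core" accounting per USC, counting each ALU as a core the way
# competitor marketing does. The fp16_per_pipe default of 4 is an assumption
# inferred from the 64 FP16 ALU cores per USC quoted for Series6XT.
PIPELINES_PER_USC = 16

def alu_cores_per_usc(fp32_per_pipe=2, fp16_per_pipe=4, sfu_per_pipe=1):
    return {
        "fp32": PIPELINES_PER_USC * fp32_per_pipe,
        "fp16": PIPELINES_PER_USC * fp16_per_pipe,
        "sfu": PIPELINES_PER_USC * sfu_per_pipe,
    }

print(alu_cores_per_usc())  # {'fp32': 32, 'fp16': 64, 'sfu': 16}
```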

Finally, remember that we have 1-to-many USCs, depending on the product, in Series6, Series6XT and Series6XE. Here are two examples, to help verify how to total everything up:
PowerVR G6230: two Series6 USCs – 64 FP32 ALU cores with up to 128 FLOPs per cycle – 64 FP16 ALU cores with up to 192 FLOPs per cycle. That means up to 115.2 FP16 GFLOPS and up to 76.8 FP32 GFLOPS at 600MHz.
The PowerVR G6230 GPU in the Allwinner A80
PowerVR GX6650: six Series6XT USCs – 192 FP32 ALU cores with up to 384 FLOPs per cycle – 384 FP16 ALU cores with up to 768 FLOPs per cycle. That means up to 460.8 FP16 GFLOPS and up to 230.4 FP32 GFLOPS at 600MHz.
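Those headline figures are just per-USC FLOPs per cycle multiplied by USC count and clock speed. A quick sanity check in Python, assuming the 600MHz clock used above:

```python
# GFLOPS = FLOPs per cycle per USC × number of USCs × clock in GHz
def gflops(flops_per_cycle_per_usc, usc_count, clock_ghz=0.6):
    return flops_per_cycle_per_usc * usc_count * clock_ghz

# PowerVR G6230: two Series6 USCs (64 FP32 / 96 FP16 FLOPs per cycle per USC)
print(f"G6230  FP32: {gflops(64, 2):.1f} GFLOPS")   # 76.8
print(f"G6230  FP16: {gflops(96, 2):.1f} GFLOPS")   # 115.2

# PowerVR GX6650: six Series6XT USCs (64 FP32 / 128 FP16 FLOPs per cycle per USC)
print(f"GX6650 FP32: {gflops(64, 6):.1f} GFLOPS")   # 230.4
print(f"GX6650 FP16: {gflops(128, 6):.1f} GFLOPS")  # 460.8
```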

Happy core counting!

If you have questions or want to know more about our PowerVR GPUs, please use the comment box below. Make sure to follow us on Twitter (@ImaginationTech, @PowerVRInsider, @GPUCompute) for the latest news and updates from Imagination.

Kristof Beets

Kristof Beets is the senior director of product management for PowerVR Graphics at Imagination Technologies, where he drives the product roadmaps to ensure alignment with market requirements. Prior to this, he was part of the business development group, and before that he led the in-house demo development team and the competitive analysis team. His engineering background includes work on SDKs and tools for both PC and mobile products as a member of the PowerVR Developer Relations Team. His work has been published in many guides for game and graphics programmers, such as Shader X2, X5 and X6, ARM IQ Magazine, and online by the Khronos Group and 3Dfx Interactive. Kristof has a background in electrical engineering and received a Master’s degree in artificial intelligence. He has spoken at GDC, SIGGRAPH, Embedded Technology, MWC and many other conferences.

24 thoughts on “Graphics cores: trying to compare apples to apples”

  1. The text says scalar design, but the diagrams make this look like a 2-way superscalar (for FP32 instructions) that extracts ILP when possible. So a XYZW vector, it could process the X and the Y in the same cycle, yes? Or am I reading this wrong?

    • Yep, the main ALU pipe has two independent FMAs and the compiler has to work to drive both of those. So in your 4-component vector example, yes, the hardware could be working on 2 of those components in the same cycle (assuming the sources for each FMA allow that to work out).

  2. So, I assume then that “ALU core” is just your name for a unified shader? And based on the diagram, it looks like you have plenty of those but not very many texture mapping units, at least compared to GPUs from NVIDIA and AMD. Also, when you say your GPU is a quad-core model, you make it sound like it only has four shaders. That made it sound like the GPU was underpowered. So, I would seriously consider stopping marketing your GPUs based on the number of “cores” and instead list the shaders, bus width, frequency and ROPs so we can properly compare them to other GPUs. Terms like “ALU,” and “core” are usually used to describe CPUs instead of GPUs.

    • The article introduces two concepts: the USC (Unified Shading Cluster) and the ALU core. One USC has several ALU cores (depending on whether it’s a Series6, Series6XE, Series6XT, Series7XE or Series7XT GPU). Each pair of USCs is linked to one texture mapping unit – except for Series6XE and Series7XE GPUs.
      The frequency of the GPU is determined by the silicon vendor.
      Read this article from AnandTech for more information:

      • Ok, but in the AnandTech article it is mentioned that the two FP32 ALUs in a pipeline cannot be used at the same time.
        Or would that mean that, at the same time, one ALU can work on the x scalar and the other one on the y scalar of a vector?

  3. I notice that the PowerVR is able to do far more draw calls than the competition. This seems to make sense with your TBDR solution. In a typical TBR, each call is transformed, rasterized, filled and written out to the FB (albeit in small tiles). If sorted front to back, the typical TBR can just ignore filling occluded fragments. But for your TBDR, the driver would have to defer the rasterization/filling to the very last step in order to sort fragments, meaning that each draw call would contend with far less work — it would more-or-less submit information for future processing. When you’re ready for the framebuffer, the bulk of the work would be done in batches: eg. sorting, rasterizing, fragment shading based on materials/assets, framebuffer writing.
    Is this true? If it is, it’s quite a clever implementation as it efficiently batches workloads which should be far more efficient than running a complete render cycle for each draw call.

  4. This is great information, but outside of the revelation of FP16/special ALUs, and the new ‘ALU Core’ nomenclature, most of this information has been previously disclosed. Maybe I’m missing something? From my recollection, PowerVR has historically been quite open with their GPU tech.
    Still it’s quite helpful for public perception to provide a consistent nomenclature upon which to compare GPU architectures. ALU Core feels a bit odd, but it works, and gets the peak-compute point across quite effectively.
    Now, if only there was a simple way to quantify efficiency, an area that is almost wholly overlooked by the public. Perhaps a generic perf/MHz, perf/mm2, and perf/Watt measure needs to be established, where perf is some consistent measure of performance across architectures.

      • Thanks Alex. I guess the problem with this, is that the result of this one benchmark will be buried under the results of a myriad less credible benchmarks also available. Though it may provide credible results it doesn’t necessarily scream as loudly or clearly as a simple measure like “GHz” or “cores” to the average consumer.
        But it’s great that steps are being made, as they are sorely needed in the mobile GPU space.

  5. Hi Rys. Huge thanks for publishing this – it makes my job (as an iOS dev that spends too much time optimising shaders) that bit easier, since now I know better how to structure my shader code to keep the ALU occupied.
    What would be *really* helpful though: an ALU overview for the series 5 parts. Because, as you know, there’s a lot of those out there. And they tend to be the ones that cause the performance headaches 😀
    Has such a thing been published anywhere, or if not any chance that it will be?

      • Hi Alex. A bit more in-depth would be good – what I’d specifically like to see is the pipeline layout, a diagram like the one above would be fine (unless it’s got more funky 3-op/clock type things going on, in which case it’d be great to know what that is 🙂
        Why that would help: I write realtime photo/video processing stuff for iOS. Typically that means 1 quad/1shader, and it’s often limited by shader performance. With an idea of how the pipeline execution actually happens, I can change the shader so it’s hopefully a better fit. It’ll still mean a ton of experimenting and measuring, but a lot less than I have to do now 🙂

        • Okay, I understand. I’ll talk to the guys and see what we can do.
          In the meantime, I suggest you try to register on our developer website and get support from them directly (if you don’t do that already).
          I’m sure they can make some very good performance optimization recommendations.
          Best regards,

          • Great. And yeah, next time I hit some nasty shader optimising I’ll contact dev support (should have done that before really 🙂

  6. So all of a sudden you guys want to tell us how many cores your processors
    have. Why don’t you list the cores of all the released Series6 family?
    Previously you gave numbers at 300 MHz;
    now you are shifting to 600 MHz.
    So the G6230 @ 600 MHz is equivalent to the SGX554MP4 @ 300 MHz, with 76.8 GFLOPS for both.
    230.4/x = 600/950
    x = 364.8 GFLOPS
    The K1 was announced at 365 GFLOPS @ 950 MHz.

    • It’s not sudden, we’ve been talking about our PowerVR Rogue GPUs quite a lot on the blog. If you look back, you will find we have a blog article for every Rogue released.
      We used 300 MHz as a conservative frequency for SGX; with current process node shrinks and DOK optimizations, Rogue GPUs can achieve higher frequencies while still offering the best performance per mW or mm2.

  7. I’m more than likely confused, or I could be reading the diagrams wrong. Your PowerVR Series6 USC diagram shows 3 FLOPs per ALU16, which is consistent with the narrative. However, your Series6XT diagram shows 2 FLOPs per ALU16, whereas the narrative says it’s been enhanced to 4 per ALU16?

    • Series 6 had 2 fp16 cores, each capable of 3 flops. 6xt has 4 fp16 cores, each capable of 2 flops. So it’s really gone from 6 flops up to 8, although what happens in practice I don’t know.
      The new cores presumably do FMA ops (multiply + add, 2 ops), what were the old ones doing that had 3 ops per clock? Again, that would be helpful to know for those of us working on the things 😀
      Presumably the change increases performance in common situations anyway.

      • I would assume the same. But here is written “…improved the FP16 ALUs in Series6XE and Series6XT to perform up to 4 FLOPs per cycle…”
        That would not confirm our theory.

