Heterogeneous compute case study: image convolution filtering


In a previously published article, I offered a quick guide to writing OpenCL kernels for PowerVR Rogue GPUs. That article sets the scene for what follows: a practical case study analyzing image convolution kernels written in OpenCL.

Many image processing tasks such as blurring, sharpening and edge detection can be implemented by means of a convolution between an image and a matrix of numbers (or kernel). The figure below illustrates a 3×3 kernel that implements a smoothing filter, which replaces each pixel value in an image with the mean of its own value and those of its neighbours.

[Figure: Example of image filtering by means of convolution]

Kernel convolution usually requires values from pixels outside of the image boundaries. A variety of methods can be used to handle image edges, for example by extending the nearest border pixels to provide values for the convolutions (as shown above) or cropping pixels in the output image that would require values beyond the edge of an input image, which reduces the output image size.

The listings below show the algorithmic pseudo code for this filter, followed by a C implementation. In the C program, it is assumed that each pixel is represented by a 32-bit integer comprising four 8-bit values for R, G, B and A; the macro MUL4 therefore performs four separate multiplications, one for each of these 8-bit values.

Convolution filter pseudo code:

    for each image row in input image:
      for each pixel in image row:
        set accumulator to zero
        for each kernel row in kernel:
          for each element in kernel row:
            multiply element value with corresponding pixel value
            add result to accumulator
        set output image pixel to accumulator

Convolution filter in C (Gx and Gy are the image width and height):

    void blur(int src[][Gx], int dst[][Gx], char weight[3][3])
    {
      int x, y, i, j, pixel, acc;
      for (y=1; y<Gy-1; y++) {        /* for each image row        */
        for (x=1; x<Gx-1; x++) {      /* for each pixel in the row */
          acc = 0;
          for (j=-1; j<=1; j++) {     /* for each kernel row       */
            for (i=-1; i<=1; i++) {   /* for each kernel element   */
              pixel = MUL4(src[y+j][x+i], weight[j+1][i+1], 16);
              acc += pixel;
            }
          }
          dst[y][x] = acc;
        }
      }
    }

Using the same approach introduced in my previous article to extract the inner compute kernel from the serial control flow (the two outer nested loops), and applying the programming guidelines, the following OpenCL kernel is produced.

__attribute__((reqd_work_group_size(8, 4, 1)))
__kernel void blur(__read_only image2d_t src, __write_only image2d_t dst,
                   sampler_t s, __constant float weight[3][3])
{
  int x = get_global_id(0);
  int y = get_global_id(1);
  float4 pixel = 0.0f;
  for (int j=-1; j<=1; j++) {
    for (int i=-1; i<=1; i++)
      pixel += read_imagef(src, s, (int2)(x+i, y+j)) * weight[j+1][i+1];
  }
  write_imagef(dst, (int2)(x, y), pixel/9.f);
}


The statement

__attribute__((reqd_work_group_size(8, 4, 1)))

sets the workgroup size at compile time to 32 (8×4). This requires the host program to enqueue this kernel as a 2D range with a local work size of exactly 8×4, and in return allows the compiler to generate better code for the kernel.

The function declaration

__kernel void blur(image2d_t src, image2d_t dst, sampler_t s, …

specifies OpenCL image parameters and a sampler for accessing image data from system memory. To implement the border-pixel behaviour in the above example, the host should configure the sampler with the clamp-to-edge addressing mode, which replicates the nearest border pixel (not shown).

The statement

float4 pixel = 0.0f;

defines a vector of four 32-bit floating point values. The threads use floating-point arithmetic to perform the convolution, which offers higher throughput compared to integer or character data types.

The statement

read_imagef(src, s, (int2)(x+i,y+j))

causes the TPU (Texture Processing Unit) to sample a pixel from system memory into private memory, converting the constituent R, G, B and A values into four 32-bit floating-point values and placing them in a four-wide vector. This conversion is performed efficiently by the hardware, requiring the multiprocessor to issue just a single instruction.

The statement

write_imagef(dst, (int2)(x,y), pixel/9.f);

writes a (normalized) output pixel value back to system memory.

Caching frequently-used data in the common store

In the example in the previous section, all work-items operate independently of one another, each work-item independently sampling nine input pixels to calculate one output pixel. Overall the kernel has a fairly low arithmetic intensity (i.e. a low ratio of multiply-and-accumulate operations to memory sampling operations), which can lead to low performance.

For a workgroup size of 32, each workgroup performs a total of 288 (9×32) sampling operations. However, as shown below, adjacent work-items sample six of the same overlapping pixels from the input image.

[Figure: A 3×3 image filter example showing overlap between adjacent sampled values]

The common store is a fast on-chip memory that you can use to optimize access to frequently-used data and to share data between work-items in a workgroup; in this case it enables a reduction in the number of sampling operations each workgroup performs. A typical programming pattern is to stage data from system memory into the common store by having each work-item in a workgroup:

  • Load data from global memory to local memory.
  • Synchronize with all other work-items in the workgroup, ensuring all work-items block until all memory reads have completed.
  • Process the data in local memory.
  • Synchronize with all other work-items in the workgroup, ensuring all work-items finish writing their results to local memory.
  • Write the results back to global memory.

The example program below is a refinement of the previous program, rewritten to use local memory to reduce the number of sampling operations to system memory.

void prefetch_texture_samples_8x4(image2d_t src, sampler_t s,
                                  __local float4 rgb[10][6],
                                  int2 gid, int2 lid);

__attribute__((reqd_work_group_size(8, 4, 1)))
__kernel void blur(__read_only image2d_t src, __write_only image2d_t dst,
                   sampler_t s, __constant float weight[3][3])
{
  int2 gid = (int2)(get_group_id(0)*8, get_group_id(1)*4);
  int2 lid = (int2)(get_local_id(0),   get_local_id(1));
  float4 pixel = 0.0f;

  __local float4 rgb[10][6];
  prefetch_texture_samples_8x4(src, s, rgb, gid, lid);

  for (int j=-1; j<=1; j++) {
    for (int i=-1; i<=1; i++)
      pixel += rgb[lid.x+1+i][lid.y+1+j] * weight[j+1][i+1];
  }
  write_imagef(dst, gid+lid, pixel/9.f);
}

void prefetch_texture_samples_8x4(image2d_t src, sampler_t s,
                                  __local float4 rgb[10][6],
                                  int2 gid, int2 lid)
{
  if (lid.x == 0 && lid.y == 0) {
    // work-item 0 fetches all 60 rgb samples
    for (int i=-1; i<9; i++) {
      for (int j=-1; j<5; j++)
        rgb[i+1][j+1] = read_imagef(src, s, gid + (int2)(i, j));
    }
  }
  barrier(CLK_LOCAL_MEM_FENCE); // all work-items wait for the fetch
}


The statement

__local float4 rgb[10][6];

declares a local array, which is allocated in the common store.

The kernel first calls the function

void prefetch_texture_samples_8x4( …

In this function, all work-items in a workgroup first test their local ID, and work-item 0 alone samples the data from memory into the common store; all work-items then synchronize on a barrier. This synchronization is necessary to prevent the other work-items from attempting to read uninitialized data from the common store. In the main kernel, the calls to read_imagef are replaced by reads from the local memory array rgb.

In this optimized program each work-group performs a total of 60 sample operations, all during initialization, compared to the 288 in-line sampling operations performed in the original program. This reduction in memory bandwidth can significantly improve performance.

The prefetch function can be further improved so that instead of a single work-item fetching 60 samples in sequence, 30 work-items each fetch two samples in sequence. The following example shows one way in which this can be implemented.


inline void prefetch_8x4_optimized(image2d_t src, sampler_t s,
                                   __local float4 rgb[10][6])
{
  // Coordinate of work-item 0 in the NDRange
  int2 wi0Coord = (int2)(get_group_id(0)*8, get_group_id(1)*4);

  // 2D to 1D address (from 8x4 to 32x1)
  int flatLocal = get_local_id(1)*8 + get_local_id(0);

  // Only the first 30 work-items load; each loads 2 values in sequence
  if (flatLocal < 30) {
    /* Convert the 1D id flatLocal back to 2D, 10x3 */
    int i = flatLocal % 10; // Width
    int j = flatLocal / 10; // Height
    /* 30 work-items read 10x3 values:
     * cells 0-9, 10-19, 20-29 of the 10x6 array - top half */
    rgb[i][j] = read_imagef(src, s, (int2)(wi0Coord.x + i - 1,
                                           wi0Coord.y + j - 1));
    /* 30 work-items read 10x3 values:
     * cells 30-39, 40-49, 50-59 of the 10x6 array - bottom half */
    rgb[i][j + 3] = read_imagef(src, s, (int2)(wi0Coord.x + i - 1,
                                               wi0Coord.y + j + 3 - 1));
  }
  barrier(CLK_LOCAL_MEM_FENCE); // all work-items wait for the fetch
}


In the best case, work-items can fetch data from the common store in a single cycle. In practice, however, a number of conditions must be met to achieve this efficiency.

Computer vision is what we’ll be focusing on for the next section of our heterogeneous compute series; stay tuned for an overview of how you build a computer vision platform for mobile and embedded devices.

Please let us know if you have any feedback on the materials published on the blog and leave a comment on what you’d like to see next. Make sure you also follow us on Twitter (@ImaginationTech, @GPUCompute and @PowerVRInsider) for more news and announcements from Imagination.

Alex Voica

Before deciding to pursue his dream of working in technology marketing, Alexandru held various engineering roles at leading semiconductor companies in Europe. His background also includes research in computer graphics and VR at the School of Advanced Studies Sant'Anna in Pisa. You can follow him on Twitter @alexvoica.

7 thoughts on “Heterogeneous compute case study: image convolution filtering”

  1. Do you guys have implementations of sgemm that are publicly available. I’m trying to deploy a NN decoder using an sgemm formulation of the Convolution and I’m getting very poor performance (2 GFlops) on a Power VR Rogue GE8300.

  2. In prefetch_8x4_optimized, why divide into upper half and lower half part?
    BTW, please guide where to download the whole source code for comparing…
    Appreciated with your help.

  3. Excuse me, could you please teach me how to decide this size?
    float4 rgb[10][6]; <— 10×6
    prefetch_texture_samples_8x4 <— 8×4
    I am so confused…

  4. Thank you for your posting. I am currently studying OpenCL. If you don’t mind, can I translate this article into Korean and re-post? Of course I will attach the link of this article and make it clear I translated this origin. THX

  5. I think there might be a couple of minor issues here, but correct me if I’m wrong.
    1. The article says that the statement
    __attribute__((reqd_work_group_size(8, 4, 1)))
    “restricts the host program to enqueuing this kernel as a 2D range with an 8×4 configuration”.
    Why is that? I’d say that on the contrary, it forces the host to launch the kernel with a 2D group size set to (8, 4).
    2. int needs to be replaced by int2 here:
    int gid = (int2)(get_group(id(0)*8, get_group_id(1)*4);
    int lid = (int2)(get_local_id(0), get_local_id(1));
    3. In the first version of prefetch_texture_samples_8x4 the test
    if (lid == 0)
    probably needs to be replaced by
    if (lid.x == 0)

