Instagram-like computational photography with PowerVR SGX GPUs

18 September 2013
Imagination Technologies

Mobile phones have been visually oriented devices since the appearance of the first camera phones, and today user-produced content is a driving force behind everything from network utilisation to app creation. More recently, developers have begun to use sophisticated manipulation of image data to create a wide range of new user experiences, from intelligent vision systems through augmented reality apps to avatar creation and many others. This burst of creativity is limited only by the amount of processing which can be performed on the handset, and therein lies the challenge.

Computational photography: one of the main applications for mobile compute

Imaging workloads have long been the domain of the CPU, which works well enough for still images and small-format video. With HD now standard and 4K arriving very shortly, however, the ability to run interesting apps on the CPU has become severely limited by the thermal envelope of the device: simply adding more cores and running at higher frequencies does no good if the transistors overheat and shut down. This is a problem crying out for a solution.

Fortunately, a solution is here: all modern application processors contain many compute engines, such as a GPU, DSP and ISP (image signal processor), which can efficiently perform the required imaging tasks. As a result, developers are turning to heterogeneous computing, which is the art of combining these blocks to provide high performance within restricted power and thermal budgets.

The PowerVR SGX GPU is running a saturation filter on Full HD (1080p) real-time video at 70 fps

PowerVR GPUs can be found in many of the world’s most iconic mobile products, delivering world-class 3D graphics at low power. Thanks to an extremely flexible architecture, devices incorporating our industry-leading PowerVR GPUs can be deployed as efficient compute engines to accelerate image and video processing tasks. Apps can delegate imaging kernels to the GPU using the OpenCL EP (Embedded Profile) API, which provides the optimal precision for visual computation, while keeping power consumption to a minimum.
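To give a feel for the per-pixel work such a kernel performs, here is a minimal sketch in plain C of a saturation filter like the one mentioned above. It is not the code used in the demo. The `Rgb` type, the `saturate_pixel` name and the Rec. 601 luma weights are all assumptions made for illustration.

```c
#include <assert.h>

/* One RGB pixel with channels in the range [0, 1] (hypothetical type). */
typedef struct { float r, g, b; } Rgb;

static float clamp01(float v)
{
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

/* Scale each channel's distance from the pixel's luma.
 * amount = 0 gives greyscale, 1 leaves the pixel unchanged,
 * values above 1 boost the colour. */
Rgb saturate_pixel(Rgb in, float amount)
{
    assert(amount >= 0.0f);

    /* Rec. 601 luma weights: an assumption, other weightings are common. */
    float luma = 0.299f * in.r + 0.587f * in.g + 0.114f * in.b;

    Rgb out;
    out.r = clamp01(luma + amount * (in.r - luma));
    out.g = clamp01(luma + amount * (in.g - luma));
    out.b = clamp01(luma + amount * (in.b - luma));
    return out;
}
```

On the GPU the same arithmetic would typically be applied to a whole float4 pixel at once rather than channel by channel.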

This leaves the issue of how to share the image data between the various compute engines without redundant operations. The rest of this blog is devoted to describing how that can be done using tools developed by Imagination which were recently released to developers.

The challenge with sharing – ensuring zero copy

Apps designed to take advantage of heterogeneous computing require efficient interoperability between the APIs used to program the different compute engines. In a teleconferencing app that performs real-time airbrushing, for example, input frames captured by the camera might first be inspected by the GPU to determine the position of a face and its individual features (e.g. eyes, lips and nose), with these coordinates passed to the CPU for analysis. The CPU can then determine a set of image filters for the GPU to apply, such as removing blemishes or wrinkles, with the GPU generating a transformed frame for input to a video codec. In this scenario, up to four different system components (camera, GPU, CPU and video codec) each require access to the same image data in memory.

Until now, all OpenCL implementations on the market created a behind-the-scenes copy of the image data when transferring its ownership between the camera and the GPU. This operation unnecessarily increases system memory traffic, reduces performance and consumes power, reducing (and in some cases eliminating) the benefit of offloading a task to the GPU. Imagination has been working with its lead partners over the last year to eliminate this barrier and enable efficient camera-based GPU applications using OpenCL.
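To put a rough number on that cost, the sketch below estimates the extra memory traffic caused by one copy of every frame. The figures are assumptions for illustration only: 1080p frames at 30 fps, each copy reading and writing every byte exactly once, and no caching or compression. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of one frame, given its bits per pixel
 * (NV21 uses 12 bits per pixel, RGBA8888 uses 32). */
uint64_t frame_bytes(uint32_t width, uint32_t height, uint32_t bits_per_pixel)
{
    uint64_t bits = (uint64_t)width * height * bits_per_pixel;
    assert(bits % 8 == 0);  /* whole number of bytes expected */
    return bits / 8;
}

/* Extra traffic per second from copying every frame once:
 * each copy reads the frame once and writes it once. */
uint64_t copy_traffic_bytes_per_sec(uint64_t bytes_per_frame, uint32_t fps)
{
    return 2 * bytes_per_frame * fps;
}
```

Under these assumptions a single NV21 copy per frame adds about 187 MB/s of traffic, and a single RGBA8888 copy about 498 MB/s, before any useful processing has been done.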

How did we solve this challenge?

We developed a set of extensions that allow images to be shared between multiple components which share the same system memory. These extensions are based on Khronos EGL images, which provide an interface between multiple Khronos APIs and the native platform windowing system, and handle issues related to binding and synchronization.

Many cameras generate image data in YUV format, where the Y and UV data are stored in separate planes. The fixed-function texturing hardware of the PowerVR SGX GPU can automatically perform colour space conversion from YUV (NV21) to RGB, and the GPU can then process each pixel as a vector operation, operating efficiently on the R, G and B values in parallel. This leads to OpenCL kernels similar in structure to the example below. The input and output images are represented as OpenCL Image data types, with individual pixels processed as float4 data types that match the native SGX vector width.

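/*
 * Generic 3x3 linear filter kernel.
 * kVals holds the nine filter weights in row-major order (top-left first);
 * the weighted sum is divided by normalizationVal before being written out.
 */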
__kernel void convolve3x3(__read_only  image2d_t  srcImage,
                          __write_only image2d_t  dstImage,
                          sampler_t               sampler,
                          __constant float        *kVals,
                          float                    normalizationVal)
{
        int2 coords = (int2)(get_global_id(0), get_global_id(1));

        float4 colour;
        float4 colours[9];

        colours[0] = read_imagef(srcImage, sampler, coords + (int2)(-1, -1));
        colours[1] = read_imagef(srcImage, sampler, coords + (int2)( 0, -1));
        colours[2] = read_imagef(srcImage, sampler, coords + (int2)( 1, -1));
        colours[3] = read_imagef(srcImage, sampler, coords + (int2)(-1,  0));
        colours[4] = read_imagef(srcImage, sampler, coords + (int2)( 0,  0));
        colours[5] = read_imagef(srcImage, sampler, coords + (int2)( 1,  0));
        colours[6] = read_imagef(srcImage, sampler, coords + (int2)(-1,  1));
        colours[7] = read_imagef(srcImage, sampler, coords + (int2)( 0,  1));
        colours[8] = read_imagef(srcImage, sampler, coords + (int2)( 1,  1));

        colour  = colours[0] * kVals[0] + colours[1] * kVals[1] + colours[2] * kVals[2];
        colour += colours[3] * kVals[3] + colours[4] * kVals[4] + colours[5] * kVals[5];
        colour += colours[6] * kVals[6] + colours[7] * kVals[7] + colours[8] * kVals[8];

        colour /= normalizationVal;
        write_imagef(dstImage, coords, colour);
}
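When bringing up a kernel like this, it can help to have a CPU reference to compare the GPU output against. The following plain C sketch mirrors the kernel above for a single-channel float image. It assumes clamp-to-edge addressing, whereas the real behaviour depends on the sampler the host passes in. The name `convolve3x3_ref` is hypothetical and not part of any SDK.

```c
#include <assert.h>
#include <stddef.h>

/* Clamp an index into [0, size - 1], mimicking clamp-to-edge addressing. */
static int clamp_index(int i, int size)
{
    if (i < 0) return 0;
    if (i >= size) return size - 1;
    return i;
}

/* CPU reference for a 3x3 linear filter on a single-channel float image.
 * k_vals holds the nine weights in row-major order, as in the kernel. */
void convolve3x3_ref(const float *src, float *dst, int width, int height,
                     const float k_vals[9], float normalization_val)
{
    assert(src != NULL && dst != NULL && src != dst);
    assert(width > 0 && height > 0 && normalization_val != 0.0f);

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int sx = clamp_index(x + dx, width);
                    int sy = clamp_index(y + dy, height);
                    sum += src[(size_t)sy * width + sx]
                         * k_vals[(dy + 1) * 3 + (dx + 1)];
                }
            }
            dst[(size_t)y * width + x] = sum / normalization_val;
        }
    }
}
```

For example, a box blur uses nine weights of 1 with a normalization value of 9, while a single centre weight of 1 with a normalization value of 1 leaves the image unchanged.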

An illustration of the complete system solution is shown below. We convert the camera YUV data to RGB using the Khronos extension OES_EGL_image_external, and then use PowerVR’s zero-copy functionality to directly access the camera texture data from OpenCL, without requiring any intermediate copy from OpenGL ES to OpenCL. The filtered output image is then inserted into another zero-copy texture which can be written to the screen in 2D or 3D using OpenGL ES.

Zero copying data between multiple system components
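For readers unfamiliar with NV21, the sketch below shows in plain C roughly what the YUV-to-RGB step amounts to for a single pixel. On the device this conversion is done by the texturing hardware, not by code like this. The full-range BT.601 coefficients used here are an assumption, since the exact matrix and range used by a given camera and driver are not stated in this post. The `Rgb8` type and the function name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t r, g, b; } Rgb8;

/* Round to nearest and clamp into the 8-bit range. */
static uint8_t clamp_u8(float v)
{
    if (v < 0.0f) return 0;
    if (v > 255.0f) return 255;
    return (uint8_t)(v + 0.5f);
}

/* Convert the pixel at (x, y) of an NV21 frame to RGB.
 * NV21 layout: width * height luma bytes, followed by interleaved V,U
 * pairs, one pair per 2x2 block of pixels. Width and height must be even. */
Rgb8 nv21_pixel_to_rgb(const uint8_t *nv21, int width, int height, int x, int y)
{
    assert(nv21 != NULL && width > 0 && height > 0);
    assert(width % 2 == 0 && height % 2 == 0);
    assert(x >= 0 && x < width && y >= 0 && y < height);

    const uint8_t *y_plane  = nv21;
    const uint8_t *vu_plane = nv21 + (size_t)width * height;

    /* Each chroma row is `width` bytes long and serves two luma rows. */
    size_t vu_index = (size_t)(y / 2) * width + (size_t)(x / 2) * 2;

    float luma = (float)y_plane[(size_t)y * width + x];
    float v = (float)vu_plane[vu_index]     - 128.0f;  /* V comes first */
    float u = (float)vu_plane[vu_index + 1] - 128.0f;

    /* Full-range BT.601 coefficients (an assumption, see above). */
    Rgb8 out;
    out.r = clamp_u8(luma + 1.402f * v);
    out.g = clamp_u8(luma - 0.344136f * u - 0.714136f * v);
    out.b = clamp_u8(luma + 1.772f * u);
    return out;
}
```

Because four neighbouring pixels share one V,U pair, the chroma plane is half the size of the luma plane, which is why an NV21 frame occupies 12 bits per pixel.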

OpenCL video processing in action

In a series of recent videos, Imagination demonstrates video-rate computational photography running at Full HD (1080p). This demonstration shows how an ‘Instagram-like’ app written for the Samsung Exynos 5410 (the application processor found in the Samsung Galaxy S4 i9500 and Meizu MX3 smartphones as well as the Hardkernel ODROID-XU development board) can leverage the PowerVR GPU to achieve significantly better performance (30-70 fps) than running the same image processing tasks on the CPU (4-7 fps), while also reducing power consumption.

The PowerVR SGX544MP3 GPU inside the Exynos 5410-based Galaxy S4 is able to process real-time, Full HD (1080p) video

To enable developers to create their own high-performance GPU compute applications, Imagination will soon be releasing a GPU compute dev program in which registered users will have the opportunity to access the PowerVR GPU compute SDK and programming guidelines for PowerVR GPUs. The extensions discussed in this post are fully supported in the OpenCL driver that ships with the low-cost Hardkernel ODROID-XU board. Furthermore, developers looking to take full advantage of the compute capabilities of our PowerVR Series6 GPUs will be able to use tools like PVRTune, which will include enhanced profiling for OpenCL and API timing data, allowing developers to better profile their apps.

What are your first impressions of the demos shown above? Leave us your feedback in the comments box below and follow us on Twitter (@ImaginationTech) for more news and announcements.