COMMON EDGE AI SOFTWARE
For Edge AI, the software ecosystem separates into distinct areas:
- Device delegates: Providing the runtime environment and work-dispatch mechanics for running multiple user models in parallel on the same hardware. This is very much like running graphics applications, where priority is given to foreground or visible tasks. PyTorch and LiteRT are examples of device delegate software.
- Graph runtimes: Providing a more data-centre-like approach, where an entire model is compiled and executed as a single task to meet latency or performance requirements. ONNX Runtime and TVM are examples of graph runtime software.
- Portability software: Allowing customers and users to easily migrate custom applications written for another GPU compute platform, such as CUDA, providing that first-user confidence of getting things running in hours rather than days. SYCL and oneAPI are examples of portability software.
- Low-level software: Compute libraries and DDKs which support the major open hardware-level abstractions for compute and graphics. The core interfaces of OpenCL, Vulkan, OpenGL and DirectX sit on top of a set of compilation and workload-scheduling software that maps from the higher-level applications down into instructions which execute on the underlying compute platforms.
Let’s explore some examples of software solutions in these areas in more detail.
DEVICE DELEGATES: LiteRT
LiteRT delegates are a powerful feature of the LiteRT framework, designed to optimise and accelerate the execution of machine learning models on various hardware platforms. A delegate acts as an intermediary, enabling LiteRT to offload certain operations or entire models from the default CPU execution to specialised hardware accelerators, such as GPUs, DSPs (Digital Signal Processors), and NPUs (Neural Processing Units).
Delegates in LiteRT are utilised to enhance the performance of inference operations by taking advantage of specific hardware accelerators available on the device. They can significantly reduce inference time and increase efficiency by leveraging the hardware’s computational capabilities. LiteRT provides a range of delegates to choose from, allowing developers to select the most suitable hardware acceleration based on the device’s capabilities and the application’s requirements.
LiteRT provides a GPU delegate which accelerates model execution on the device’s GPU, offering significant speedups for operations that are parallelisable. This delegate is especially beneficial for models with high computational demands, such as those used in image processing and computer vision tasks. Beyond the standard delegates provided by LiteRT, developers have the option to create custom delegates for specialised hardware accelerators. This advanced feature allows for further optimisation and customisation for unique hardware platforms not directly supported by the predefined delegates.
Implementing a delegate in a LiteRT application involves minimal changes to the codebase. Developers can instantiate a delegate and apply it to the LiteRT interpreter with simple API calls. This flexibility makes it easy to experiment with different hardware accelerations to find the optimal configuration for a given model and device.
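As a rough illustration, the sketch below assumes the TensorFlow Lite Python API (the LiteRT package exposes an equivalent Interpreter), an illustrative model.tflite file and a GPU delegate library whose name varies by platform; it is a minimal sketch rather than a definitive recipe.

# Minimal sketch: loading a LiteRT (TensorFlow Lite) model and attaching the
# GPU delegate. The model path and delegate library name are illustrative.
import numpy as np
import tensorflow as tf

# Load the GPU delegate shared library (the file name differs per platform).
gpu_delegate = tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")

# Create the interpreter with the delegate attached; supported ops run on the GPU.
interpreter = tf.lite.Interpreter(
    model_path="model.tflite",
    experimental_delegates=[gpu_delegate],
)
interpreter.allocate_tensors()

# Run inference as usual; delegated operations are dispatched to the accelerator.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
interpreter.set_tensor(input_details[0]["index"],
                       np.zeros(input_details[0]["shape"], dtype=np.float32))
interpreter.invoke()
result = interpreter.get_tensor(output_details[0]["index"])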
LiteRT delegates are a crucial component for optimising ML model inference on diverse hardware platforms. They enable developers to leverage device-specific accelerators, enhancing performance and efficiency across a wide range of applications and devices.
DEVICE DELEGATES: PyTorch (ExecuTorch)
ExecuTorch is an end-to-end solution for Edge AI inference across mobile and Edge AI devices, including wearables, embedded devices, and microcontrollers. It enables efficient deployment of PyTorch models to edge devices.
ExecuTorch is compatible with a wide variety of computing platforms, from high-end mobile phones to highly constrained embedded systems and microcontrollers. This means developers can use the same toolchains and SDK from PyTorch model authoring and conversion through to debugging and deployment across a wide variety of platforms. It provides end users with a seamless, high-performance experience thanks to a lightweight runtime that utilises full hardware capabilities such as CPUs, NPUs, and DSPs.
The basic flow for running a PyTorch model with ExecuTorch on an Edge AI device is to export the model, compile it into an executable format and then run this format on the target device. The compilation stage introduces optimisations like model compression and memory planning. ExecuTorch has a standardised interface for delegation to compilers. This allows third-party vendors like Imagination to implement interfaces and API entry points for compilation and execution of (either partial or full) graphs targeting our hardware. This provides greater flexibility in terms of hardware support and performance optimisation, as well as easier integration with the PyTorch open-source ecosystem for Edge AI.
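A hedged sketch of the export-and-compile stage, assuming the executorch Python package (whose module layout has shifted between releases) and a small illustrative model, might look like this:

# Illustrative sketch of the ExecuTorch export flow: capture a PyTorch model,
# lower it to the Edge dialect, and serialise it as a .pte file for the device runtime.
import torch
from executorch.exir import to_edge

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

# 1. Export the model as a graph program.
exported_program = torch.export.export(model, example_inputs)

# 2. Lower to the Edge dialect, where backend delegation and optimisations apply.
edge_program = to_edge(exported_program)

# 3. Compile to the ExecuTorch format and write the .pte file for deployment.
executorch_program = edge_program.to_executorch()
with open("tiny_model.pte", "wb") as f:
    f.write(executorch_program.buffer)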
DEVICE DELEGATES: DirectML
DirectML provides a powerful and flexible platform for integrating machine learning into applications on Windows. It is a low-level, hardware-accelerated DirectX 12 API designed to provide high-performance machine learning (ML) inference. It operates on Windows 10 and newer versions, offering developers consistent ML performance across a wide range of graphics hardware, including integrated GPUs, discrete GPUs, and other DirectX 12 compatible devices. DirectML works well with other DirectX 12 APIs, making it easier for developers to add ML features like post-processing effects into games. It provides a flexible API that supports a broad range of models and operations, including those from popular frameworks like LiteRT and PyTorch outlined above. It can run on any DirectX 12 compatible device, meaning that developers can use DirectML to accelerate their applications without fear of vendor lock-in.
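One practical way to reach DirectML from Python is through the ONNX Runtime DirectML execution provider; the sketch below assumes the onnxruntime-directml package, an illustrative model.onnx file and input shape, and is only one of several routes into the API.

# Minimal sketch: running an ONNX model on DirectML via ONNX Runtime on a
# DirectX 12 capable Windows device.
import numpy as np
import onnxruntime as ort

# Prefer DirectML, falling back to the CPU provider if it is unavailable.
session = ort.InferenceSession(
    "model.onnx",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # illustrative input shape
outputs = session.run(None, {input_name: x})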
DEVICE DELEGATES: WinML
Windows Machine Learning (WinML) democratises access to Edge AI technologies, enabling developers to harness the power of machine learning directly within Windows applications. It is developer-friendly and high-level, abstracting away the complexities of running machine learning models.
WinML is a powerful machine learning framework introduced by Microsoft that leverages the hardware-accelerated performance of modern Windows devices to run machine learning models more efficiently. It supports hardware-accelerated inference and optimises performance across a range of different devices by using the CPU, GPU, or dedicated AI processors. It relies on ONNX (Open Neural Network Exchange) format, an open model format that allows for interoperability across different frameworks and tools. So, developers building models in frameworks such as LiteRT or PyTorch can convert them to ONNX for use with WinML.
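For instance, a PyTorch model can be exported to ONNX with torch.onnx.export (LiteRT models can take a comparable route via converters such as tf2onnx); the sketch below uses an illustrative model and file name, and the resulting .onnx file can then be loaded by WinML in a Windows application.

# Illustrative sketch: exporting a small PyTorch model to the ONNX format.
import torch

class SmallClassifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
        )

    def forward(self, x):
        return self.net(x)

model = SmallClassifier().eval()
dummy_input = torch.randn(1, 32)

# Write classifier.onnx, which any ONNX-compatible runtime (including WinML) can load.
torch.onnx.export(
    model,
    dummy_input,
    "classifier.onnx",
    input_names=["input"],
    output_names=["logits"],
    opset_version=17,
)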
GRAPH RUNTIME: ONNX RUNTIME
Referenced above for WinML, ONNX is gaining traction with the developer community as a low-resistance path for deploying AI models, thanks to its integration with platforms like Hugging Face. ONNX Runtime is a high-performance inference engine for machine learning models, optimised for both CPU and GPU, enabling fast model inference on servers or for Edge AI. It uses the ONNX format; once models trained in frameworks like PyTorch are converted into the ONNX format they can use the ONNX Runtime and be deployed across a wide range of devices and operating systems. It leverages parallel computing and hardware acceleration to achieve the best possible performance.
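A minimal sketch of that deployment step, assuming the onnxruntime Python package and an illustrative classifier.onnx file, is shown below.

# Minimal sketch: running inference with ONNX Runtime, preferring a GPU
# execution provider when the installed build exposes one.
import numpy as np
import onnxruntime as ort

available = ort.get_available_providers()
providers = (["CUDAExecutionProvider"] if "CUDAExecutionProvider" in available else [])
providers.append("CPUExecutionProvider")

session = ort.InferenceSession("classifier.onnx", providers=providers)

x = np.random.rand(1, 32).astype(np.float32)  # illustrative input
outputs = session.run(None, {session.get_inputs()[0].name: x})
print(outputs[0].shape)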
GRAPH RUNTIME: APACHE TVM
Apache TVM is a comprehensive, open-source machine learning compiler framework that enables the efficient deployment of deep learning models on different hardware platforms. It bridges the gap between the rapidly evolving landscape of deep learning models and the diverse ecosystem of computing hardware. It focuses on optimising and compiling models from higher-level frameworks (like PyTorch) into machine-executable code that is optimised for the specific target hardware. It allows for the addition of new hardware backends as the computing landscape evolves to maintain cross-platform support.
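A hedged sketch of that compile-and-run flow, using TVM’s classic Relay path (newer releases also offer the Relax front end) with an illustrative ONNX model, input name and shape, might look like this:

# Illustrative sketch: import an ONNX model into TVM, compile it for a target,
# and run it with the graph executor.
import numpy as np
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

onnx_model = onnx.load("classifier.onnx")
shape_dict = {"input": (1, 32)}          # assumed input name and shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

target = "llvm"                          # e.g. "llvm", "cuda", "opencl"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.device(target, 0)
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("input", np.random.rand(1, 32).astype(np.float32))
module.run()
out = module.get_output(0).numpy()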
PORTABILITY: oneAPI
oneAPI is an initiative started by Intel and now run by the UXL Foundation (a project within the Linux Foundation) aimed at creating a unified and open programming model for developing applications that can run across various computing architectures, including CPUs, GPUs, FPGAs, and other accelerators. The project seeks to address the complexity of programming for diverse hardware by providing a single, coherent framework that abstracts hardware-specific details.
oneAPI consists of the DPC++ programming language, toolkits targeting specific development domains, and a set of performance libraries and APIs. It facilitates cross-platform software development and builds on open standards, notably SYCL (more below), to ensure broad compatibility and encourage adoption by the developer community.
PORTABILITY: SYCL
SYCL is a high-level programming model designed to help developers write code for heterogeneous computing using standard C++. It is developed by the Khronos Group and aims to make parallel programming more accessible and efficient by targeting different platforms such as CPUs, GPUs, DSPs, and FPGAs through a unified framework. It provides abstractions for expressing parallelism and managing memory across different compute devices. SYCL code is portable and can run on any Edge AI device that supports the SYCL runtime.
INTERFACES: OpenCL
OpenCL is widely used in applications that require high performance computing and offers a powerful way to harness the computational power of GPUs and other processors alongside traditional CPU resources. It is an open standard for cross-platform, parallel programming of diverse processors found in Edge AI devices, using kernels that execute across many parallel work units and offering explicit memory management control. It provides a framework for writing programs that execute on a heterogeneous platform. Managed by the Khronos Group, it enables developers to write efficient and portable code for a wide range of Edge AI devices.
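To keep the examples in a single language, the sketch below uses the pyopencl Python bindings; the embedded kernel source is standard OpenCL C and would be identical from the C or C++ host APIs. Device selection, buffer sizes and the kernel itself are illustrative.

# Minimal sketch: an OpenCL kernel launched through pyopencl. The kernel doubles
# each element of a buffer, one work-item per element.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()       # pick an available OpenCL device
queue = cl.CommandQueue(ctx)

a = np.arange(1024, dtype=np.float32)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

program = cl.Program(ctx, """
__kernel void scale(__global const float *a, __global float *out) {
    int gid = get_global_id(0);
    out[gid] = a[gid] * 2.0f;
}
""").build()

# Launch with one work-item per element; the runtime maps work-items onto the device.
program.scale(queue, a.shape, None, a_buf, out_buf)

result = np.empty_like(a)
cl.enqueue_copy(queue, result, out_buf)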
INTERFACES: VULKAN COMPUTE
Vulkan Compute offers a powerful and flexible platform for leveraging GPUs for non-graphics computational tasks. It is part of the Vulkan API that operates across multiple platforms, enabling applications to run on a wide range of devices. It gives developers explicit control over GPU operations, memory management and synchronisation for finely tuned performance optimisations. Its cross-platform nature and explicit control over GPU resources make it a compelling choice for developers looking to optimise the performance of compute-intensive Edge AI applications.