1. NeuroPilot Introduction
1.1 NeuroPilot Software Ecosystem
NeuroPilot is a collection of software tools and APIs which are at the center of MediaTek’s AI ecosystem. These tools are designed to fulfill the goal of “Edge AI”, which means that AI processing is performed locally on the device rather than remotely on a server. With NeuroPilot, users can develop and deploy AI applications on edge devices with extremely high efficiency. This makes a wide variety of AI applications run faster, while also keeping data private.
MediaTek’s hardware platforms, such as mobile System-on-Chip (SoC) and ultra-low power embedded devices, span different levels of compute density. MediaTek is deeply invested in creating an AI ecosystem with efficient yet powerful AI-processors in its devices, ranging from smartphones to smart homes, wearables, Internet-of-Things (IoT), and connected cars.
Open frameworks such as TensorFlow offer out-of-the-box usability, but typically lack optimized support for advanced hardware. NeuroPilot allows users to use all the available hardware resources of a MediaTek AI platform, beyond that offered by an open framework. NeuroPilot provides programming support for specialized device capabilities, which allows for better performance, power, memory consumption, and end-user experience.
1.1.1 Neural Network Framework Support
NeuroPilot’s software tools support common AI frameworks such as TensorFlow, PyTorch, and TensorFlow Lite (TFLite). NeuroPilot provides support for inspecting, loading, and converting models, either to MediaTek-optimized model formats, or open framework standard model formats.
For Android devices, NeuroPilot provides extensions for the Android Neural Network API (NNAPI). This enables developers and device makers to bring their code closer to the hardware, for better performance and power-efficiency on MediaTek devices. NeuroPilot also allows developers to use a ‘write once, apply everywhere’ flow for existing and future MediaTek devices, including smartphones, automotive, smart home, IoT, and more. This streamlines the creation process, saving cost and time to market.
1.1.2 NeuroPilot Software Tools
| Tool | Type | Description |
|---|---|---|
| AISimulator | Web tool | Simulates a neural network workload on MediaTek's AI Processing Unit (APU). |
| Android Run-Time Libraries | Library | Libraries which provide NNAPI delegates for special-purpose hardware cores (GPU, VPU, MDLA), and support for dynamic scheduling. |
| Converter | Command line tool | Converts a pre-trained and optimized PyTorch or TensorFlow model into a TensorFlow Lite model, and performs post-training quantization. |
| Neuron SDK | Command line tools, API, Library | A TFLite model compiler which produces ready-to-run compiled binary models (.dla). |
| Quantization | Command line tool | Optimizes a model for efficient inference on MediaTek devices using quantization-aware training. |
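To make the quantization entries above concrete, the following sketch (illustrative only, not NeuroPilot code) shows the affine quantization arithmetic that post-training quantization tools of this kind apply: a float range is mapped onto uint8 via a scale and zero point, and inference then works on the 8-bit values.

```python
# Illustrative affine (uint8) quantization round trip.
# All function names here are ours, not part of any NeuroPilot tool.

def quant_params(fmin: float, fmax: float, qmin: int = 0, qmax: int = 255):
    """Derive scale and zero point so [fmin, fmax] maps onto [qmin, qmax]."""
    fmin, fmax = min(fmin, 0.0), max(fmax, 0.0)  # range must include 0.0
    scale = (fmax - fmin) / (qmax - qmin)
    zero_point = round(qmin - fmin / scale)
    return scale, int(zero_point)

def quantize(x: float, scale: float, zp: int) -> int:
    """Float -> uint8, clamped to the representable range."""
    return max(0, min(255, round(x / scale) + zp))

def dequantize(q: int, scale: float, zp: int) -> float:
    """uint8 -> float approximation of the original value."""
    return (q - zp) * scale

scale, zp = quant_params(-1.0, 1.0)
q = quantize(0.5, scale, zp)
x = dequantize(q, scale, zp)
print(zp, q, round(x, 3))  # 128 192 0.502
```

The round-trip error is bounded by the scale (about 0.008 here), which is why models with tight accuracy requirements may need quantization-aware training rather than post-training quantization alone.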
1.1.2.1 Neuron SDK
Neuron SDK allows users to convert their custom models to MediaTek-proprietary binaries for deployment on MediaTek platforms. The resulting models are highly efficient, with reduced latency and a smaller memory footprint. Users can also create a runtime environment, parse compiled model files, and perform inference on the edge. Neuron SDK is aimed at users who are performing bare metal C/C++ programming for AI applications, and offers an alternative to the Android Neural Networks API (NNAPI) for deploying Neural Network models on MediaTek-enabled Android devices.
Neuron SDK consists of the following components:
- Neuron Compiler (ncc-tflite): An offline neural network model compiler which produces statically compiled deep learning archive (.dla) files.
- Neuron Runtime (neuronrt): A command line tool which executes a specified .dla file and reports the results.
- Neuron Runtime API: A user-invoked API which supports loading and running compiled .dla files within a user's C++ application.
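The components above chain into a compile-then-run flow. The commands below are an illustrative sketch only: the exact option names and defaults vary by SDK release, so treat the flags and file names as assumptions and confirm them against ncc-tflite --help and the Neuron SDK documentation.

```shell
# Illustrative Neuron SDK flow (flags and file names are examples, not exact):

# 1. Compile a TFLite model into a statically compiled .dla archive,
#    targeting a specific accelerator architecture.
ncc-tflite --arch mdla3.0 model.tflite -o model.dla

# 2. Run the compiled model once with a raw input buffer and capture
#    the output, to sanity-check the deployment on the target.
neuronrt -a model.dla -i input.bin -o output.bin
```

For production use, the same .dla file would instead be loaded from a C++ application through the Neuron Runtime API.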
1.1.2.2 Android Run-Time Libraries
MediaTek provides several run-time libraries for Android devices. These libraries allow for greater control and utilization of MediaTek special-purpose cores. The main library which implements most of this capability is an optimized Android Neural Network API (NNAPI) library, which is part of the Android NDK. The NNAPI library provides NNAPI hardware delegates, which enable the use of the GPU, VPU, and MDLA cores when running neural networks. This means that any NNAPI application can use MediaTek acceleration cores without any special changes to the application code. This accelerator support also covers .tflite models running in the Android TFLite run-time layer.
Note:
MediaTek provides ready-to-run Android libraries for all Android-compatible devices. The developer does not need to interact with these libraries directly, and no special settings are required to use them.
1.2 MediaTek Device Capabilities
Using MediaTek devices gives users extraordinary speed and efficiency for AI applications. MediaTek devices deliver outstanding performance while consuming very little power.
1.2.1 Hardware Support
NeuroPilot tools can use the following target compute devices to run neural network models.
- CPU
- GPU
- VPU (Vision Processing Unit)
- MDLA (MediaTek Deep Learning Accelerator)
Successful use of these cores depends on the following factors, which interact with a user’s model.
- Neural network framework format of the trained model.
- Hardware platform (e.g. part number and device capability).
- Required model accuracy. Models with high accuracy requirements might limit the type and significance of the optimizations that can be applied to the model. This might also limit the target devices that are able to run the model with the required performance and accuracy.
- Neural network model structure. Certain operation (OP) types are not supported on certain target devices. For details, refer to the Supported Operations section of the platform's documentation.
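The last factor can be checked mechanically before deployment: compare the set of OPs a model uses against the target device's supported-op list. The sketch below uses made-up excerpt op sets for illustration; the real per-device lists are in the Supported Operations section.

```python
# Hypothetical pre-deployment check: which of a model's ops would force
# a fallback off a given target device? The op sets here are small
# illustrative excerpts, not the authoritative lists.

MDLA_OPS = {"CONV_2D", "ADD", "RELU", "AVERAGE_POOL_2D", "RESHAPE"}  # excerpt
MVPU_OPS = {"ADD", "MUL", "SOFTMAX", "CAST", "GATHER"}               # excerpt

def unsupported_ops(model_ops, device_ops):
    """Return the model ops missing from the device's supported set."""
    return sorted(set(model_ops) - device_ops)

model = ["CONV_2D", "RELU", "SOFTMAX", "RESHAPE"]
print(unsupported_ops(model, MDLA_OPS))  # ['SOFTMAX']
```

An op in this list does not necessarily make the model undeployable; it typically means that layer runs on a more flexible device such as the CPU, at a performance cost.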
1.2.1.1 Device Parametric Table
| Device | Operator Flexibility | Performance | Power Consumption | Data Types |
|---|---|---|---|---|
| CPU | Very High | Low | High | FP32, FP16, INT16, INT8 |
| GPU | Medium | Medium | Medium | FP32, FP16 |
| VPU | Medium | High | Low | FP32, FP16, INT16, INT8 |
| MDLA | Low | Very High | Low | FP16, INT16, INT8 |
As a general rule, target the most power-efficient device that your neural network and developer constraints allow. As the Device Parametric Table shows, the lowest-power devices (VPU and MDLA) are also the highest performing.
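The selection rule above can be sketched as a small lookup over the Device Parametric Table. The power ranks and tie-breaking below are our own simplification (MDLA is tried before VPU because it has higher performance at the same power tier); only the data-type sets come from the table.

```python
# Sketch of "pick the most power-efficient device that supports the
# model's data types", derived from the Device Parametric Table.
# Power ranks and ordering are illustrative assumptions.

DEVICES = [  # (name, power rank: lower is more efficient, data types)
    ("MDLA", 0, {"FP16", "INT16", "INT8"}),
    ("VPU",  0, {"FP32", "FP16", "INT16", "INT8"}),
    ("GPU",  1, {"FP32", "FP16"}),
    ("CPU",  2, {"FP32", "FP16", "INT16", "INT8"}),
]

def pick_device(required_types):
    """Return the first (most power-efficient) device covering the types."""
    for name, _, types in sorted(DEVICES, key=lambda d: d[1]):
        if set(required_types) <= types:
            return name
    return None

print(pick_device({"INT8"}))  # MDLA
print(pick_device({"FP32"}))  # VPU
```

In practice, operator support (see Supported Operations) constrains the choice as much as data types do.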
1.2.2 Devices
1.2.2.1 CPU
The CPU is capable of running any neural network, and is guaranteed to support all existing and future NN operations. Support is provided by Google's TFLite subsystem on Android devices. For native development, developers can use the TFLite C++ API. The CPU is the most flexible target device, but it is also the least optimized for power and performance.
1.2.2.2 GPU
The GPU provides neural network acceleration for floating point models.
- ARM-based MediaTek platforms support GPU neural network acceleration via Arm NN and the Arm Compute Library.
- Non-ARM MediaTek platforms support GPU neural network acceleration via Google's TensorFlow Lite GPU delegate. This GPU delegate is able to accelerate a wide selection of TFLite operations.
1.2.2.3 MVPU
The MediaTek Vision Processing Unit (MVPU) offers general-purpose Digital Signal Processing (DSP) capabilities, with special hardware for accelerating complex imaging and computer vision algorithms. The MVPU also offers outstanding performance while running AI models.
1.2.2.4 MDLA
The MediaTek Deep Learning Accelerator (MDLA) is a powerful and efficient Convolutional Neural Network (CNN) accelerator. The MDLA is capable of achieving high AI benchmark results with high Multiply-Accumulate (MAC) utilization rates. The design integrates MAC units with dedicated function blocks, which handle activation functions, element-wise operations, and pooling layers.
The MDLA uses a technique called tile-based layer fusion to help achieve high compute efficiency and bandwidth reduction. Tile-based layer fusion identifies and then fuses dependent inter-layer operations, in order to reduce the amount of data the MDLA brings on-chip.
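A back-of-envelope model shows why layer fusion reduces bandwidth: without fusion, the intermediate activation between two dependent layers is written to DRAM and read back; with tile-based fusion, each tile's intermediate stays in on-chip buffers. The buffer sizes below are made-up illustration values, not MDLA figures.

```python
# Illustrative DRAM-traffic comparison for two dependent layers.
# Sizes in bytes are arbitrary examples.

def dram_traffic(in_bytes, mid_bytes, out_bytes, fused):
    """Total DRAM bytes moved for layer1 -> layer2."""
    if fused:
        return in_bytes + out_bytes              # intermediate stays on-chip
    return in_bytes + 2 * mid_bytes + out_bytes  # write + read intermediate

IN, MID, OUT = 1_000_000, 4_000_000, 500_000
print(dram_traffic(IN, MID, OUT, fused=False))  # 9500000
print(dram_traffic(IN, MID, OUT, fused=True))   # 1500000
```

When the intermediate activation is large relative to the inputs and outputs, as is common in CNNs, fusion removes the dominant DRAM term entirely.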
2. Hardware Support Specification
The following MediaTek platforms support NeuroPilot 5:
2.1 Hardware Specifications
2.1.1. Dimensity 9000
| Feature | D9000 |
|---|---|
| Process | T-4nm |
| CPU | 1x Arm Cortex-X2 at 3.05GHz, 1MB L2 |
| GPU | Arm Mali-G710 MC10 |
| Memory | 4x LPDDR5X 7500MHz |
| Camera | 4K30 3-exp Video HDR x 3CAM |
| AI | MediaTek APU 590 |
| Video Decoder | 8K 30fps |
| Video Encoder | 8K 24fps |
| Display | 2480x2200 120Hz |
| Connectivity | Wi-Fi 6E 2x2, 160MHz bandwidth |
| Modem | 5G NR 3CC 300MHz with ET 60MHz |
2.1.2. APU
The MediaTek AI Processing Unit (APU) is a high-performance hardware engine for deep learning, optimized for bandwidth and power efficiency. The APU architecture consists of big, small, and tiny cores. This highly heterogeneous design is suited to a wide variety of modern smartphone tasks, such as AI-camera, AI-assistant, and OS or in-app enhancements.
2.1.2.1. MVPU 2.0
The MediaTek Vision Processing Unit (MVPU) offers general-purpose Digital Signal Processing (DSP) capabilities, with special hardware for accelerating complex imaging and computer vision algorithms. The MVPU also offers outstanding performance while running AI models.
2.1.2.2. MDLA 3.0
The MediaTek Deep Learning Accelerator (MDLA) is a powerful and efficient Convolutional Neural Network (CNN) accelerator. The MDLA is capable of achieving high AI benchmark results with high Multiply-Accumulate (MAC) utilization rates. The design integrates MAC units with dedicated function blocks, which handle activation functions, element-wise operations, and pooling layers.
The MDLA uses a technique called tile-based layer fusion to help achieve high compute efficiency and bandwidth reduction. Tile-based layer fusion identifies and then fuses dependent inter-layer operations, in order to reduce the amount of data the MDLA brings on-chip.
2.2. Supported Operations
This section describes all the neural network operations (OPs) that the D9000 supports through NeuroPilot, and any restrictions placed on their use.
2.2.1. TFLite Operations
Note:
NeuroPilot supports a wide variety of TFLite operations. This allows most neural network models to run on the specialized compute cores available on MediaTek platforms.
If you trained a model using TensorFlow v1, TensorFlow v2, or PyTorch, you can convert the model to TFLite format using NeuroPilot's Converter Tool. For details, see the Converter Tool section in the NeuroPilot SDK documentation.
2.2.1.1. Supported Data Types
The following table lists the supported data types of each D9000 hardware target.
| Device | AsymU8 | AsymI8 | SymI8 | SymI16 | Fp16 | Fp32 | Bool8 | Int32 |
|---|---|---|---|---|---|---|---|---|
| MDLA 3.0 | O | O | O | O | O | | | |
| MVPU 2.0 | O | O | O | O | O | O | O | O |
| GPU | | | | | O | O | | |
| CPU | O | O | O | O | O | O | | |
2.2.1.2. Supported NeuroPilot Operations
The following table lists the TFLite operations supported by the NeuroPilot 5.0 SDK on the Dimensity 9000 platform, for each data type.
Note:
For full details of each operation, see the corresponding hardware guidelines section.
| OP Name | TFLite OP | Ncc-tflite | Neuron Delegate | NNAPI Delegate | AsymU8 | AsymI8 | SymI8 | SymI16 | Fp16 | Fp32 | Bool8 | Int32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abs | kTfLiteBuiltinAbs | O | O | since API Level 29 | O | O | O | O | O | | | |
| ArgMax | kTfLiteBuiltinArgMax | O | O | since API Level 29 | O | O | O | O | | | | |
| ArgMin | kTfLiteBuiltinArgMin | O | O | since API Level 29 | O | O | O | O | | | | |
| AvgPooling | kTfLiteBuiltinAveragePool2d | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Cast | kTfLiteBuiltinCast | O | O | since API Level 29 | O | O | O | | | | | |
| Concat | kTfLiteBuiltinConcatenation | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Conv2D | kTfLiteBuiltinConv2d | O | O | since API Level 27 | O | O | O | O | O | O | | |
| DepthToSpace | kTfLiteBuiltinDepthToSpace | O | O | since API Level 27 | O | O | O | O | O | O | | |
| DepthwiseConv2D | kTfLiteBuiltinDepthwiseConv2d | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Dequantize | kTfLiteBuiltinDequantize | O | O | since API Level 27 | O | O | O | O | O | O | | |
| ElementWiseAdd | kTfLiteBuiltinAdd | O | O | since API Level 27 | O | O | O | O | O | O | | |
| ElementWiseDiv | kTfLiteBuiltinDiv | O | O | since API Level 28 | O | O | O | O | O | O | | |
| ElementWiseMul | kTfLiteBuiltinMul | O | O | since API Level 27 | O | O | O | O | O | O | | |
| ElementWiseSub | kTfLiteBuiltinSub | O | O | since API Level 28 | O | O | O | O | O | O | | |
| Elu | kTfLiteBuiltinElu | O | | since API Level 30 | O | O | O | O | O | O | | |
| Equal | kTfLiteBuiltinEqual | O | O | since API Level 29 | O | O | O | O | O | O | | |
| FullyConnected | kTfLiteBuiltinFullyConnected | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Gather | kTfLiteBuiltinGather | O | O | since API Level 29 | O | O | | | | | | |
| Greater | kTfLiteBuiltinGreater | O | O | since API Level 29 | O | O | O | O | O | O | | |
| GreaterEqual | kTfLiteBuiltinGreaterEqual | O | O | since API Level 29 | O | O | O | O | O | O | | |
| HardSwish | kTfLiteBuiltinHardSwish | O | O | since API Level 30 | O | O | O | O | O | O | | |
| L2Norm | kTfLiteBuiltinL2Normalization | O | O | since API Level 27 | O | O | O | | | | | |
| Less | kTfLiteBuiltinLess | O | O | since API Level 29 | O | O | O | O | O | O | | |
| LessEqual | kTfLiteBuiltinLessEqual | O | O | since API Level 29 | O | O | O | O | O | O | | |
| Maximum | kTfLiteBuiltinMaximum | O | O | since API Level 29 | O | O | O | O | O | O | | |
| MaxPooling | kTfLiteBuiltinMaxPool2d | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Mean | kTfLiteBuiltinMean | O | O | since API Level 28 | O | O | O | O | O | O | | |
| Minimum | kTfLiteBuiltinMinimum | O | O | since API Level 29 | O | O | O | O | O | O | | |
| Neg | kTfLiteBuiltinNeg | O | O | since API Level 29 | O | O | O | O | O | | | |
| NotEqual | kTfLiteBuiltinNotEqual | O | O | since API Level 29 | O | O | O | O | O | O | | |
| Pack | kTfLiteBuiltinPack | O | O | | O | O | O | O | | | | |
| Pad | kTfLiteBuiltinPad | O | O | since API Level 28 | O | O | O | O | O | O | | |
| Pad | kTfLiteBuiltinPadv2 | O | O | since API Level 29 | O | O | O | O | O | O | | |
| Pow | kTfLiteBuiltinPow | O | O | since API Level 29 | O | O | O | O | | | | |
| PRelu | kTfLiteBuiltinPrelu | O | O | since API Level 29 | O | O | O | O | O | O | | |
| PRelu | kTfLiteBuiltinLeakyRelu | O | O | | O | O | O | O | O | O | | |
| Quantize | kTfLiteBuiltinQuantize | O | O | since API Level 29 | O | O | O | O | O | | | |
| ReduceAny | kTfLiteBuiltinReduceAny | O | O | since API Level 29 | O | O | | | | | | |
| ReduceMax | kTfLiteBuiltinReduceMax | O | O | since API Level 29 | O | O | | | | | | |
| ReduceMin | kTfLiteBuiltinReduceMin | O | O | since API Level 29 | O | O | | | | | | |
| ReLU | kTfLiteBuiltinRelu | O | O | since API Level 27 | O | O | O | O | O | O | | |
| ReLU6 | kTfLiteBuiltinRelu6 | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Reshape | kTfLiteBuiltinReshape | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Reshape | kTfLiteBuiltinSqueeze | O | O | since API Level 28 | O | O | O | O | O | O | | |
| Resize::BILINEAR | kTfLiteBuiltinResizeBilinear | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Resize::NEAREST | kTfLiteBuiltinResizeNearestNeighbor | O | O | since API Level 29 | O | O | O | O | O | O | | |
| RSqrt | kTfLiteBuiltinRsqrt | O | O | since API Level 29 | O | O | | | | | | |
| Sigmoid | kTfLiteBuiltinLogistic | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Slice | kTfLiteBuiltinSlice | O | O | since API Level 29 | O | O | O | O | O | | | |
| SoftMax | kTfLiteBuiltinSoftmax | O | O | since API Level 27 | O | O | O | O | | | | |
| SpaceToDepth | kTfLiteBuiltinSpaceToDepth | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Split | kTfLiteBuiltinSplit | O | O | since API Level 29 | O | O | O | O | O | O | | |
| Sqrt | kTfLiteBuiltinSqrt | O | O | since API Level 29 | O | O | | | | | | |
| Square | kTfLiteBuiltinSquare | O | O | | O | O | O | O | | | | |
| SquaredDifference | kTfLiteBuiltinSquaredDifference | O | O | | O | O | O | | | | | |
| StridedSlice | kTfLiteBuiltinStridedSlice | O | O | since API Level 28 | O | O | O | O | O | O | | |
| Tanh | kTfLiteBuiltinTanh | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Tile | kTfLiteBuiltinTile | O | O | since API Level 29 | O | O | | | | | | |
| Transpose | kTfLiteBuiltinTranspose | O | O | since API Level 28 | O | O | O | O | O | O | | |
| TransposeConv2D | kTfLiteBuiltinTransposeConv | O | O | since API Level 29 | O | O | O | O | O | O | | |
2.2.1.3. Supported Hardware Operations
The table below lists the supported TFLite operations (OPs) for hardware targets on different MediaTek platforms.
Note:
Each OP might have both hardware and software constraints.
For details on MDLA constraints, see MDLA 3.0 Guidelines.
To check the supported TFLite OP versions, use the --show-builtin-ops option in ncc-tflite.
| TFLite operations | MDLA 1.0 | MDLA 1.5/1.7 | MDLA 2.0 | MDLA 3.0 | VPU (asym8) | VPU (fp16) | MVPU 2.0 |
|---|---|---|---|---|---|---|---|
| ABS | O | O | O | O | | | |
| ADD | O | O | O | O | O | O | O |
| ARG_MAX | O | O | | | | | |
| ARG_MIN | O | O | | | | | |
| AVERAGE_POOL_2D | O | O | O | O | O | O | |
| BATCH_TO_SPACE_ND | O | O | O | O | O | | |
| CAST | O | O | O | O | O | | |
| CEIL | O | | | | | | |
| CHANNEL_SHUFFLE | O | | | | | | |
| CONCATENATION | O | O | O | O | O | | |
| CONV_2D | O | O | O | O | O | | |
| DEPTH_TO_SPACE | O | O | O | O | O | | |
| DEPTHWISE_CONV_2D | O | O | O | O | O | | |
| DEQUANTIZE | O | O | O | O | O | O | |
| DIV | O | O | O | O | O | O | |
| ELU | O | O | O | O | | | |
| EQUAL | O | O | | | | | |
| EXP | O | | | | | | |
| EXPAND_DIMS | O | O | O | O | O | | |
| FILL | O | | | | | | |
| FLOOR | O | | | | | | |
| FULLY_CONNECTED | O | O | O | O | O | O | |
| GATHER | O | O | | | | | |
| GREATER | O | O | | | | | |
| GREATER_EQUAL | O | O | | | | | |
| HARD_SWISH | O | O | O | O | O | | |
| L2_NORMALIZATION | O | O | | | | | |
| L2_POOL_2D | O | O | O | O | | | |
| LEAKY_RELU | O | | | | | | |
| LESS | O | O | | | | | |
| LESS_EQUAL | O | O | | | | | |
| LOCAL_RESPONSE_NORMALIZATION | | | | | | | |
| LOG | O | | | | | | |
| LOG_SOFTMAX | O | | | | | | |
| LOGICAL_AND | O | | | | | | |
| LOGICAL_NOT | O | | | | | | |
| LOGICAL_OR | O | | | | | | |
| LOGISTIC | O | O | O | O | O | O | |
| LSTM (QLSTM) | O | O | O | O | O | | |
| MAX_POOL_2D | O | O | O | O | O | O | |
| MAXIMUM | O | O | O | O | O | O | |
| MEAN | O | O | O | O | O | O | |
| MINIMUM | O | O | O | O | O | O | |
| MIRRORPAD | O | O | O | | | | |
| MUL | O | O | O | O | O | O | O |
| NEG | O | O | O | | | | |
| NOT_EQUAL | O | O | | | | | |
| PACK | O | O | O | O | | | |
| PAD | O | O | O | O | O | | |
| POW | O | O | O | O (SQRT) | | | |
| PRELU | O | O | O | O | O | O | |
| QUANTIZE | O | O | O | O | O | O | |
| REDUCE_ANY | O | O | | | | | |
| REDUCE_MAX | O | O | O | O | O | | |
| REDUCE_MIN | O | O | O | O | O | | |
| RELU | O | O | O | O | O | O | |
| RELU_N1_TO_1 | O | O | O | O | O | O | |
| RELU6 | O | O | O | O | O | O | |
| RESHAPE | O | O | O | O | O | O | |
| RESIZE_BILINEAR | O | O | O | O | O | | |
| RESIZE_NEAREST | O | O | O | O | | | |
| ROUND | O | | | | | | |
| RSQRT | O | O | O | O | | | |
| SELECT | O | | | | | | |
| SLICE | O | O | O | O | O | | |
| SOFTMAX | O | O | O | O | O | | |
| SPACE_TO_BATCH_ND | O | O | O | O | | | |
| SPACE_TO_DEPTH | O | O | O | O | O | | |
| SPLIT | O | O | O | O | O | | |
| SPLIT_V | O | O | O | O | O | | |
| SQRT | O | O | O | O | | | |
| SQUARE | O | O | O | O | O | | |
| SQUARED_DIFFERENCE | O | O | O | | | | |
| SQUEEZE | O | O | O | O | | | |
| STRIDED_SLICE | O | O | O | O | | | |
| SUB | O | O | O | O | O | O | |
| SUM | O | | | | | | |
| TANH | O | O | O | O | O | O | |
| TILE | O | | | | | | |
| TOPK_V2 | O | O | | | | | |
| TRANSPOSE | O | O | O | O | O | | |
| TRANSPOSE_CONV | O | O | O | O | O | | |
| UNPACK | O | O | O | | | | |
2.2.2. MDLA 3.0 Guidelines
Note:
The following limitations may not match the raw MDLA hardware constraints. Neuron might apply software workarounds for the MDLA hardware, or impose additional limits due to the current software implementation.
2.2.2.1. General Restrictions
| Category | Limitations |
|---|---|
| Tensor Rank | Supported tensor ranks: |
| Batch Size (N) | Valid batch sizes: |
| Height Size (H) | Valid range for input and output activations: [1, 65535] |
| Width Size (W) | Valid range for input and output activations: [1, 65535] |
| Channel Size (C) | Valid range for input and output activations: [1, 65535] |
| Data Type | Supported data types: |
| Per Channel Quantization | Only the following OPs support per channel quantization: |
| MDLA Hardware Buffer | MDLA has different internal buffers for different uses. If there is not a buffer of sufficient size for an operation, then MDLA cannot run the operation and reports "Unsupported". To avoid internal buffer constraints: |
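The dimension limits above can be enforced with a simple pre-check. The helper below is a hypothetical sketch (not a NeuroPilot API) that validates the H, W, and C dimensions of an NHWC activation against the [1, 65535] range from the table.

```python
# Hypothetical validator for the MDLA 3.0 activation-size restrictions:
# H, W, and C must each lie in [1, 65535].

def mdla_activation_ok(shape_nhwc):
    """Check the H, W, C dims of an NHWC activation shape."""
    n, h, w, c = shape_nhwc
    return all(1 <= d <= 65535 for d in (h, w, c))

print(mdla_activation_ok((1, 224, 224, 3)))    # True
print(mdla_activation_ok((1, 70000, 1, 3)))    # False: H exceeds 65535
```

Batch size and data type would need analogous checks, using the valid batch sizes and supported data types listed for the platform.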
2.2.2.2. Supported OPs Specification
| OP Name | TFLite OP | NNAPI | Restrictions |
|---|---|---|---|
| Abs | ABS | ABS | None |
| AvgPooling | AVERAGE_POOL_2D | AVERAGE_POOL_2D | |
| BatchToSpace | BATCH_TO_SPACE_ND | BATCH_TO_SPACE_ND | Only NHWC format is supported. |
| Concat | CONCATENATION | CONCATENATION | None |
| Conv2D | CONV_2D | CONV_2D | |
| DepthwiseConv2D | DEPTHWISE_CONV_2D | DEPTHWISE_CONV_2D | |
| DepthToSpace | DEPTH_TO_SPACE | DEPTH_TO_SPACE | |
| Dequantize | DEQUANTIZE | DEQUANTIZE | Input cannot be per-channel quantized. |
| ElementWiseAdd | ADD | ADD | See Limitations of Broadcasting. |
| ElementWiseDiv | DIV | DIV | |
| ElementWiseMul | MUL | MUL | See Limitations of Broadcasting. |
| ElementWiseSub | SUB | SUB | |
| Elu | ELU | ELU | None |
| FullyConnected | FULLY_CONNECTED | FULLY_CONNECTED | |
| HardSwish | HARD_SWISH | HARD_SWISH | None |
| L2Pooling | L2_POOL_2D | L2_POOL_2D | |
| MaxPooling | MAX_POOL_2D | MAX_POOL_2D | |
| Maximum | MAXIMUM | MAXIMUM | See Limitations of Broadcasting. |
| Mean | MEAN | MEAN | None |
| Minimum | MINIMUM | MINIMUM | See Limitations of Broadcasting. |
| MirrorPad | MIRRORPAD | MIRRORPAD | Supported tensors: 4-D with padding on height or width direction. |
| Neg | NEG | NEG | None |
| Pack | PACK | | Cannot pack at last dimension. |
| Pad | PAD | PAD | None |
| Pow | POW | POW | Exponent must be a constant integer. |
| PRelu | PRELU | PRELU | |
| QLSTM (5 inputs) | LSTM | QUANTIZED_16BIT_LSTM | The last dimension of input + the last dimension of output scratch must be: |
| Quantize | QUANTIZE | QUANTIZE | None |
| ReduceMax | REDUCE_MAX | REDUCE_MAX | The size before the reduced axis must be less than 65536. |
| ReduceMin | REDUCE_MIN | REDUCE_MIN | The size before the reduced axis must be less than 65536. |
| ReLU | RELU | RELU | None |
| Reshape | RESHAPE | RESHAPE | None |
| Resize::BILINEAR | RESIZE_BILINEAR | RESIZE_BILINEAR | |
| Resize::NEAREST | RESIZE_NEAREST_NEIGHBOR | RESIZE_NEAREST_NEIGHBOR | |
| RSqrt | RSQRT | RSQRT | None |
| Sigmoid | LOGISTIC | LOGISTIC | None |
| Slice | SLICE | SLICE | None |
| SoftMax | SOFTMAX | SOFTMAX | |
| SpaceToBatch | SPACE_TO_BATCH_ND | SPACE_TO_BATCH_ND | |
| SpaceToDepth | SPACE_TO_DEPTH | SPACE_TO_DEPTH | |
| Split | SPLIT | SPLIT | None |
| Sqrt | SQRT | SQRT | None |
| Square | SQUARE | | None |
| SquaredDifference | SQUARED_DIFFERENCE | | None |
| StridedSlice | STRIDED_SLICE | STRIDED_SLICE | Stride on the last dimension is unsupported. |
| Sum | SUM | SUM | None |
| Tanh | TANH | TANH | For quantized types, InputScale/OutputScale must be less than 842. |
| Transpose | TRANSPOSE | TRANSPOSE | None |
| TransposeConv2D | TRANSPOSE_CONV | TRANSPOSE_CONV_2D | |
| Unpack | UNPACK | | Cannot unpack at last dimension. |
2.2.2.3. Limitations of Broadcasting
- Only broadcasting from a small tensor to a large tensor with compatible dimensions is supported.
  - Example 1: Input1 broadcasting to Input2 is supported.
  - Example 2: Input2 broadcasting to Input1 is supported.
  - Example 3: Input1 and Input2 broadcasting to each other is unsupported.
- Hardware broadcasting is supported if either of the following conditions is met:
  - The small tensor has one of the following shapes: [], [1], [C], [1, C], [1, 1, C], [1, 1, 1, C]
  - The small tensor is broadcast on the batch or channel dimension.
    - Example 1: The shape of the small tensor is [1,H,W,C], where H, W, and C are not equal to 1.
    - Example 2: The shape of the small tensor is [N,H,W,1], where N, H, and W are not equal to 1.
    - Example 3: The shape of the small tensor is [1,H,W,1], where H and W are not equal to 1.
- If the conditions for hardware broadcasting are not met, broadcasting is processed in software using multiple SPLIT and CONCAT operations.
  - If the small tensor is constant, the broadcasting is done at compile time. Bandwidth requirements might be larger at runtime.
  - If the small tensor is not constant, there are extra runtime DMA overheads.
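The hardware-broadcasting conditions above can be captured in a small predicate. The sketch below is our own illustration of those rules, not a NeuroPilot API: a shape qualifies if it is one of the listed forms ([], [1], [C], [1, C], [1, 1, C], [1, 1, 1, C]), or if a 4-D shape broadcasts only along the batch and/or channel dimension.

```python
# Illustrative check: can this small-tensor shape use MDLA hardware
# broadcasting, or does it fall back to software SPLIT/CONCAT?

def hw_broadcast_ok(small_shape):
    s = tuple(small_shape)
    # Covers [], [1], [C], [1, C], [1, 1, C], [1, 1, 1, C]
    if len(s) <= 4 and all(d == 1 for d in s[:-1]):
        return True
    # Covers 4-D [1, H, W, C], [N, H, W, 1], and [1, H, W, 1]
    if len(s) == 4 and (s[0] == 1 or s[3] == 1):
        return True
    return False  # software broadcasting (extra bandwidth / DMA cost)

print(hw_broadcast_ok([1, 1, 64]))       # True: [1, 1, C]
print(hw_broadcast_ok([2, 56, 56, 1]))   # True: broadcast on channel dim
print(hw_broadcast_ok([2, 64]))          # False: software fallback
```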
2.2.3. MVPU 2.0 Guidelines
2.2.3.1. General Restrictions
The following table lists limitations for all MVPU 2.0 operations.
| Category | Limitations |
|---|---|
| Tensor Rank | Supported tensor ranks: 1-D, 2-D, 3-D, 4-D |
| Batch Size (N) | |
| Height Size (H) | |
| Width Size (W) | |
| Channel Size (C) | |
| Data Type | Supported data types: |
| Data Format | Only NHWC format is supported. |
2.2.3.2. TensorFlow and TFLite Operations
| Operation | Restrictions | Data type |
|---|---|---|
| ADD | | |
| ARG_MAX, ARG_MIN | | |
| AVERAGE_POOL_2D | | |
| CAST | | |
| CEIL | | |
| CROP_AND_RESIZE | | |
| CHANNEL_SHUFFLE | | |
| DEPTH_TO_SPACE | | |
| DEQUANTIZE | | |
| DIV | | |
| EQUAL, NOT_EQUAL, GREATER, GREATER_EQUAL, LESS, LESS_EQUAL | | |
| EXP | | |
| FILL | | |
| FLOOR | | |
| GATHER | | |
| HARD_SWISH | | |
| L2_NORMALIZATION | | |
| LEAKY_RELU | | |
| LOG | | |
| LOG_SOFTMAX | | |
| LOGICAL_AND | | |
| LOGICAL_NOT | | |
| LOGICAL_OR | | |
| LOGISTIC | | |
| MAX_POOL_2D | | |
| MAXIMUM | | |
| MEAN | | |
| MINIMUM | | |
| MUL | | |
| PRELU | | |
| QLSTM | | |
| QUANTIZE | | |
| REDUCE_ANY | | |
| REDUCE_MAX, REDUCE_MIN | | |
| RELU, RELU1, RELU6 | | |
| RESHAPE | | |
| ROUND | | |
| SELECT | | |
| SOFTMAX | | |
| SPACE_TO_DEPTH | | |
| SPLIT | | |
| SPLIT_V | | |
| SUB | | |
| SUM | | |
| TANH | | |
| TOPK_V2 | | |
2.3. NeuroPilot SDK
The D9000 supports the following versions of NeuroPilot and Android.
- NeuroPilot 5 for Android S