1. NeuroPilot Introduction

1.1 NeuroPilot Software Ecosystem

NeuroPilot is a collection of software tools and APIs which are at the center of MediaTek’s AI ecosystem. These tools are designed to fulfill the goal of “Edge AI”, which means that AI processing is performed locally on the device rather than remotely on a server. With NeuroPilot, users can develop and deploy AI applications on edge devices with extremely high efficiency. This makes a wide variety of AI applications run faster, while also keeping data private.

MediaTek’s hardware platforms, such as mobile Systems-on-Chip (SoCs) and ultra-low-power embedded devices, span different levels of compute density. MediaTek is deeply invested in creating an AI ecosystem with efficient yet powerful AI processors in its devices, ranging from smartphones to smart homes, wearables, Internet-of-Things (IoT) devices, and connected cars.

Open frameworks such as TensorFlow offer out-of-the-box usability, but typically lack optimized support for advanced hardware. NeuroPilot lets users exploit all the available hardware resources of a MediaTek AI platform, beyond what an open framework offers. NeuroPilot provides programming support for specialized device capabilities, which allows for better performance, power, memory consumption, and end-user experience.

1.1.1 Neural Network Framework Support

NeuroPilot’s software tools support common AI frameworks such as TensorFlow, PyTorch, and TensorFlow Lite (TFLite). NeuroPilot provides support for inspecting, loading, and converting models, either to MediaTek-optimized model formats or to open-framework standard model formats.

For Android devices, NeuroPilot provides extensions for the Android Neural Network API (NNAPI). This enables developers and device makers to bring their code closer to the hardware, for better performance and power-efficiency on MediaTek devices. NeuroPilot also allows developers to use a ‘write once, apply everywhere’ flow for existing and future MediaTek devices, including smartphones, automotive, smart home, IoT, and more. This streamlines the creation process, saving cost and time to market.

1.1.2 NeuroPilot Software Tools

| Tool | Type | Description |
|---|---|---|
| AISimulator | Web tool | Simulates a neural network workload on MediaTek’s AI Processing Unit (APU). |
| Android Run-Time Libraries | Library | Libraries which provide NNAPI delegates for special-purpose hardware cores (GPU, VPU, MDLA), and support for dynamic scheduling. |
| Converter | Command line tool | Converts a pre-trained and optimized PyTorch or TensorFlow model into a TensorFlow Lite model, and performs post-training quantization. |
| Neuron SDK | Command line tools, API, Library | A TFLite model compiler which produces ready-to-run compiled binary models (.dla). |
| Quantization | Command line tool | Optimizes a model for efficient inference on MediaTek devices using quantization-aware training. |

1.1.2.1 Neuron SDK

Neuron SDK allows users to convert their custom models to MediaTek-proprietary binaries for deployment on MediaTek platforms. The resulting models are highly efficient, with reduced latency and a smaller memory footprint. Users can also create a runtime environment, parse compiled model files, and perform inference on the edge. Neuron SDK is aimed at users who are performing bare-metal C/C++ programming for AI applications, and offers an alternative to the Android Neural Networks API (NNAPI) for deploying neural network models on MediaTek-enabled Android devices.

Neuron SDK consists of the following components:

  • Neuron Compiler (ncc-tflite): An offline neural network model compiler which produces statically compiled deep learning archive (.dla) files.
  • Neuron Runtime (neuronrt): A command line tool which executes a specified .dla file and reports the results.
  • Neuron Runtime API: A user-invoked API which supports loading and running compiled .dla files within a user’s C++ application. A minimal usage sketch follows this list.
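
The sketch below illustrates the flow just described: a model is compiled offline into a .dla file with ncc-tflite, then loaded and run from C++ through the Neuron Runtime API. This is a minimal sketch only. The header path, function names, and signatures shown (NeuronRuntime_create and related calls), as well as the ncc-tflite flags in the comments, are assumptions based on typical Neuron SDK naming; verify them against the headers and tool help shipped with your NeuroPilot release.

```cpp
// Minimal sketch of the Neuron Runtime API flow described above:
// create a runtime, parse a compiled .dla model, bind I/O buffers,
// run inference, and release the runtime.
// NOTE: the header path, function names, and signatures below are
// assumptions for illustration; verify against the Neuron SDK headers.
#include <cstdint>
#include <cstdio>
#include <vector>

#include "RuntimeAPI.h"  // assumed Neuron SDK header name

int main() {
    void* runtime = nullptr;

    // 1. Create a runtime environment (default environment options).
    if (NeuronRuntime_create(nullptr, &runtime) != 0) return 1;

    // 2. Parse a compiled model file produced offline by ncc-tflite,
    //    e.g.: ncc-tflite --arch mdla3.0 model.tflite -o model.dla
    //    (flag names are also assumptions).
    if (NeuronRuntime_loadNetworkFromFile(runtime, "model.dla") != 0) return 1;

    // 3. Bind input and output buffers; sizes must match the model.
    std::vector<uint8_t> input(1 * 224 * 224 * 3);  // placeholder input shape
    std::vector<uint8_t> output(1001);              // placeholder output size
    NeuronRuntime_setInput(runtime, 0, input.data(), input.size());
    NeuronRuntime_setOutput(runtime, 0, output.data(), output.size());

    // 4. Perform one inference on the edge device.
    if (NeuronRuntime_inference(runtime) != 0) return 1;

    // 5. Release the runtime environment.
    NeuronRuntime_release(runtime);
    return 0;
}
```
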
1.1.2.2 Android Run-Time Libraries

MediaTek provides several run-time libraries for Android devices. These libraries allow for greater control and utilization of MediaTek special-purpose cores. The main library which implements most of this capability is an optimized Android Neural Network API (NNAPI) library, which is part of the Android NDK. The NNAPI library provides NNAPI hardware delegates, which enable the use of the GPU, VPU, and MDLA cores when running neural networks. This means that any NNAPI application can use MediaTek acceleration cores without any special changes to the application code. This accelerator support also extends to .tflite models running in the Android TFLite run-time layer.
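
For example, a native Android (NDK) application can route a .tflite model through NNAPI, and therefore through the MediaTek acceleration cores, using the standard TensorFlow Lite C++ API together with the stock TFLite NNAPI delegate. The sketch below shows the general mechanism only; the model path is a placeholder, and error handling is reduced to early returns.

```cpp
// Minimal sketch: run a .tflite model through the TFLite NNAPI delegate,
// letting the device's NNAPI implementation dispatch supported OPs to
// accelerator cores (GPU/VPU/MDLA) and fall back to the CPU otherwise.
#include <memory>

#include "tensorflow/lite/delegates/nnapi/nnapi_delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
  // Load the model (placeholder path).
  auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
  if (!model) return 1;

  // Build an interpreter with the built-in OP resolver.
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  if (!interpreter) return 1;

  // Attach the NNAPI delegate; no MediaTek-specific code is required.
  tflite::StatefulNnApiDelegate nnapi_delegate;
  if (interpreter->ModifyGraphWithDelegate(&nnapi_delegate) != kTfLiteOk) return 1;

  if (interpreter->AllocateTensors() != kTfLiteOk) return 1;

  // ... fill interpreter->typed_input_tensor<float>(0) with input data ...
  if (interpreter->Invoke() != kTfLiteOk) return 1;
  // ... read results from interpreter->typed_output_tensor<float>(0) ...
  return 0;
}
```
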

Note:

MediaTek provides ready-to-run Android libraries for all Android-compatible devices. The developer does not need to interact with these libraries, and no special settings are required to use them.

1.2 MediaTek Device Capabilities

MediaTek devices deliver high speed and efficiency for AI applications, combining strong performance with low power consumption.

1.2.1 Hardware Support

NeuroPilot tools can use the following target compute devices to run neural network models.

  • CPU
  • GPU
  • VPU (Vision Processing Unit)
  • MDLA (MediaTek Deep Learning Accelerator)

Successful use of these cores depends on the following factors, which interact with a user’s model.

  • Neural network framework format of the trained model.
  • Hardware platform (e.g. part number and device capability).
  • Required model accuracy. Models with high accuracy requirements might limit the type and significance of the optimizations that can be applied to the model. This might also limit the target devices that are able to run the model with the required performance and accuracy.
  • Neural network model structure. Certain operation (OP) types are not supported on certain target devices. For details, refer to the Supported Operations section of the platform’s documentation.

Note:

  • NeuroPilot is not compatible with all types of GPU.
  • Some platforms do not have a VPU or MDLA.
  • For information about device hardware and compatibility, refer to the platform’s documentation or contact MediaTek.
1.2.1.1 Device Parametric Table

| Device | Operator Flexibility | Performance | Power Consumption | Data Types |
|---|---|---|---|---|
| CPU | Very High | Low | High | FP32, FP16, INT16, INT8 |
| GPU | Medium | Medium | Medium | FP32, FP16 |
| VPU | Medium | High | Low | FP32, FP16, INT16, INT8 |
| MDLA | Low | Very High | Low | FP16, INT16, INT8 |

As a general rule, you should target the most power-efficient device that your neural network and development constraints allow. On MediaTek platforms, the lowest-power devices are also the highest performing.

1.2.2 Devices

1.2.2.1 CPU

The CPU is capable of running any neural network, and is guaranteed to support all existing and future NN operations. Support is provided by the TFLite subsystem from Google for Android devices. For native development, developers can use the TFLite C++ API. The CPU is the most flexible target device, but it is also the least optimized for power and performance.

1.2.2.2 GPU

The GPU provides neural network acceleration for floating point models.

  • ARM-based MediaTek platforms support GPU neural network acceleration via Arm NN and the Arm Compute Library.

  • Non-ARM MediaTek platforms support GPU neural network acceleration via Google’s TensorFlow Lite GPU delegate. This GPU delegate is able to accelerate a wide selection of TFLite operations.

1.2.2.3 MVPU

The MediaTek Vision Processing Unit (MVPU) offers general-purpose Digital Signal Processing (DSP) capabilities, with special hardware for accelerating complex imaging and computer vision algorithms. The MVPU also offers outstanding performance while running AI models.

1.2.2.4 MDLA

The MediaTek Deep Learning Accelerator (MDLA) is a powerful and efficient Convolutional Neural Network (CNN) accelerator. The MDLA is capable of achieving high AI benchmark results with high Multiply-Accumulate (MAC) utilization rates. The design integrates MAC units with dedicated function blocks, which handle activation functions, element-wise operations, and pooling layers.

The MDLA uses a technique called tile-based layer fusion to help achieve high compute efficiency and bandwidth reduction. Tile-based layer fusion identifies and then fuses dependent inter-layer operations, in order to reduce the amount of data the MDLA brings on-chip.


2. Hardware Support Specification

The following MediaTek platforms support NeuroPilot 5:

  • Dimensity 8000
  • Dimensity 8100
  • Dimensity 9000

2.1 Hardware Specifications

2.1.1. Dimensity 9000

| Feature | D9000 |
|---|---|
| Process | T-4nm |
| CPU | 1x Arm Cortex-X2 at 3.05GHz, 1MB L2; 3x Arm Cortex-A710 up to 2.85GHz, 512KB L2; 4x Arm Cortex-A510 up to 1.8GHz, 256KB L2; SC/MC: 1256/4198 |
| GPU | Arm Mali-G710 MC10; 1W: 119fps, Peak: >220fps |
| Memory | 4x LPDDR5X 7500MHz; UFS 3.1, 2-lane |
| Camera | 4K30 3-exp Video HDR x 3CAM; up to 320MP; 32M+32M+32M @30 ZSD |
| AI | MediaTek APU 590; 4x MDLA 3.0+; 2x MVPU 2.0 |
| Video Decoder | 8K 30fps |
| Video Encoder | 8K 24fps |
| Display | 2480x2200 120Hz; WQHD+ (3680x1600) 144Hz |
| Connectivity | Wi-Fi 6E 2x2, 160MHz bandwidth; DBDC 1x1+1x1; Bluetooth 5.3 |
| Modem | 5G NR 3CC 300MHz with ET 60MHz; 4G Cat-19, DR-DSDA |

2.1.2. APU

The MediaTek AI Processing Unit (APU) is a high-performance hardware engine for deep learning, optimized for bandwidth and power efficiency. The APU architecture consists of big, small, and tiny cores. This highly heterogeneous design is suited for a wide variety of modern smartphone tasks, such as AI-camera, AI-assistant, and OS or in-app enhancements.

2.1.2.1. MVPU 2.0

The MediaTek Vision Processing Unit (MVPU) offers general-purpose Digital Signal Processing (DSP) capabilities, with special hardware for accelerating complex imaging and computer vision algorithms. The MVPU also offers outstanding performance while running AI models.

2.1.2.2. MDLA 3.0

The MediaTek Deep Learning Accelerator (MDLA) is a powerful and efficient Convolutional Neural Network (CNN) accelerator. The MDLA is capable of achieving high AI benchmark results with high Multiply-Accumulate (MAC) utilization rates. The design integrates MAC units with dedicated function blocks, which handle activation functions, element-wise operations, and pooling layers.

The MDLA uses a technique called tile-based layer fusion to help achieve high compute efficiency and bandwidth reduction. Tile-based layer fusion identifies and then fuses dependent inter-layer operations, in order to reduce the amount of data the MDLA brings on-chip.

2.2. Supported Operations

This section describes all the neural network operations (OPs) that the D9000 supports through NeuroPilot, and any restrictions placed on their use.

2.2.1. TFLite Operations

Note:

NeuroPilot supports a wide variety of operations in TFLite. This allows most neural network models to run on the specialized compute cores available on MediaTek platforms.

If you trained a model using TensorFlow v1, TensorFlow v2, or PyTorch, you can convert the model to TFLite format using NeuroPilot’s Converter Tool. For details, see the Converter Tool section in the NeuroPilot SDK documentation.

2.2.1.1. Supported Data Types

The following table lists the supported data types of each D9000 hardware target.

| Device | AsymU8 | AsymI8 | SymI8 | SymI16 | Fp16 | Fp32 | Bool8 | Int32 |
|---|---|---|---|---|---|---|---|---|
| MDLA 3.0 | O | O | O | O | O | | | |
| MVPU 2.0 | O | O | O | O | O | O | O | O |
| GPU | | | | | O | O | | |
| CPU | O | O | | | O | O | O | O |

2.2.1.2. Supported NeuroPilot Operations

The following table lists the TFLite operations supported by the NeuroPilot 5.0 SDK on the Dimensity 9000 (D9000) platform for each data type.

Note:

For full details of each operation, see the corresponding hardware guidelines section.

| OP Name | TFLite OP | ncc-tflite | Neuron Delegate | NNAPI Delegate | AsymU8 | AsymI8 | SymI8 | SymI16 | Fp16 | Fp32 | Bool8 | Int32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abs | kTfLiteBuiltinAbs | O | O | since API Level 29 | O | | O | O | O | O | | |
| ArgMax | kTfLiteBuiltinArgMax | O | O | since API Level 29 | O | O | | | O | O | | |
| ArgMin | kTfLiteBuiltinArgMin | O | O | since API Level 29 | O | O | | | O | O | | |
| AvgPooling | kTfLiteBuiltinAveragePool2d | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Cast | kTfLiteBuiltinCast | O | O | since API Level 29 | O | O | | | | | | O |
| Concat | kTfLiteBuiltinConcatenation | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Conv2D | kTfLiteBuiltinConv2d | O | O | since API Level 27 | O | O | O | O | O | O | | |
| DepthToSpace | kTfLiteBuiltinDepthToSpace | O | O | since API Level 27 | O | O | O | O | O | O | | |
| DepthwiseConv2D | kTfLiteBuiltinDepthwiseConv2d | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Dequantize | kTfLiteBuiltinDequantize | O | O | since API Level 27 | O | O | O | O | O | O | | |
| ElementWiseAdd | kTfLiteBuiltinAdd | O | O | since API Level 27 | O | O | O | O | O | O | | |
| ElementWiseDiv | kTfLiteBuiltinDiv | O | O | since API Level 28 | O | O | O | O | O | O | | |
| ElementWiseMul | kTfLiteBuiltinMul | O | O | since API Level 27 | O | O | O | O | O | O | | |
| ElementWiseSub | kTfLiteBuiltinSub | O | O | since API Level 28 | O | O | O | O | O | O | | |
| Elu | kTfLiteBuiltinElu | O | | since API Level 30 | O | O | O | O | O | O | | |
| Equal | kTfLiteBuiltinEqual | O | O | since API Level 29 | O | O | | | O | O | O | O |
| FullyConnected | kTfLiteBuiltinFullyConnected | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Gather | kTfLiteBuiltinGather | O | O | since API Level 29 | O | O | | | | | | |
| Greater | kTfLiteBuiltinGreater | O | O | since API Level 29 | O | O | | | O | O | O | O |
| GreaterEqual | kTfLiteBuiltinGreaterEqual | O | O | since API Level 29 | O | O | | | O | O | O | O |
| HardSwish | kTfLiteBuiltinHardSwish | O | O | since API Level 30 | O | O | O | O | O | O | | |
| L2Norm | kTfLiteBuiltinL2Normalization | O | O | since API Level 27 | O | O | | | | O | | |
| Less | kTfLiteBuiltinLess | O | O | since API Level 29 | O | O | | | O | O | O | O |
| LessEqual | kTfLiteBuiltinLessEqual | O | O | since API Level 29 | O | O | | | O | O | O | O |
| Maximum | kTfLiteBuiltinMaximum | O | O | since API Level 29 | O | O | O | O | O | O | | |
| MaxPooling | kTfLiteBuiltinMaxPool2d | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Mean | kTfLiteBuiltinMean | O | O | since API Level 28 | O | O | O | O | O | O | | |
| Minimum | kTfLiteBuiltinMinimum | O | O | since API Level 29 | O | O | O | O | O | O | | |
| Neg | kTfLiteBuiltinNeg | O | O | since API Level 29 | O | | O | O | O | O | | |
| NotEqual | kTfLiteBuiltinNotEqual | O | O | since API Level 29 | O | O | | | O | O | O | O |
| Pack | kTfLiteBuiltinPack | O | | | O | O | O | O | O | | | |
| Pad | kTfLiteBuiltinPad | O | O | since API Level 28 | O | O | O | O | O | O | | |
| Pad | kTfLiteBuiltinPadv2 | O | O | since API Level 29 | O | O | O | O | O | O | | |
| Pow | kTfLiteBuiltinPow | O | O | since API Level 29 | O | | O | O | O | | | |
| PRelu | kTfLiteBuiltinPrelu | O | O | since API Level 29 | O | O | O | O | O | O | | |
| PRelu | kTfLiteBuiltinLeakyRelu | O | O | | O | O | O | O | O | O | | |
| Quantize | kTfLiteBuiltinQuantize | O | O | since API Level 29 | O | | O | O | O | O | | |
| ReduceAny | kTfLiteBuiltinReduceAny | O | O | since API Level 29 | O | O | | | | | | |
| ReduceMax | kTfLiteBuiltinReduceMax | O | O | since API Level 29 | O | O | | | | | | |
| ReduceMin | kTfLiteBuiltinReduceMin | O | O | since API Level 29 | O | O | | | | | | |
| ReLU | kTfLiteBuiltinRelu | O | O | since API Level 27 | O | O | O | O | O | O | | |
| ReLU6 | kTfLiteBuiltinRelu6 | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Reshape | kTfLiteBuiltinReshape | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Reshape | kTfLiteBuiltinSqueeze | O | O | since API Level 28 | O | O | O | O | O | O | | |
| Resize::BILINEAR | kTfLiteBuiltinResizeBilinear | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Resize::NEAREST | kTfLiteBuiltinResizeNearestNeighbor | O | O | since API Level 29 | O | O | O | O | O | O | | |
| RSqrt | kTfLiteBuiltinRsqrt | O | O | since API Level 29 | | | | | O | O | | |
| Sigmoid | kTfLiteBuiltinLogistic | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Slice | kTfLiteBuiltinSlice | O | O | since API Level 29 | O | O | O | O | O | | | |
| SoftMax | kTfLiteBuiltinSoftmax | O | O | since API Level 27 | O | O | | | O | O | | |
| SpaceToDepth | kTfLiteBuiltinSpaceToDepth | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Split | kTfLiteBuiltinSplit | O | O | since API Level 29 | O | O | O | O | O | O | | |
| Sqrt | kTfLiteBuiltinSqrt | O | O | since API Level 29 | | | | | O | O | | |
| Square | kTfLiteBuiltinSquare | O | | | O | O | O | O | O | | | |
| SquaredDifference | kTfLiteBuiltinSquaredDifference | O | | | O | | O | O | O | | | |
| StridedSlice | kTfLiteBuiltinStridedSlice | O | O | since API Level 28 | O | O | O | O | O | O | | |
| Tanh | kTfLiteBuiltinTanh | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Tile | kTfLiteBuiltinTile | O | O | since API Level 29 | O | O | | | | | | |
| Transpose | kTfLiteBuiltinTranspose | O | O | since API Level 28 | O | O | O | O | O | O | | |
| TransposeConv2D | kTfLiteBuiltinTransposeConv | O | O | since API Level 29 | O | O | O | O | O | O | | |

2.2.1.3. Supported Hardware Operations

The table below lists the supported TFLite operations (OPs) for hardware targets on different MediaTek platforms.

Note:

  • Each OP might have both hardware and software constraints.

  • For details on MDLA constraints, see MDLA 3.0 Guidelines.

  • To check the supported TFLite OP versions, use the --show-builtin-ops option in ncc-tflite.

| TFLite operation | MDLA 1.0 | MDLA 1.5/1.7 | MDLA 2.0 | MDLA 3.0 | VPU (asym8) | VPU (fp16) | MVPU 2.0 |
|---|---|---|---|---|---|---|---|
| ABS | O | O | O | O | | | |
| ADD | O | O | O | O | O | O | O |
| ARG_MAX | | | | | O | | O |
| ARG_MIN | | | | | O | | O |
| AVERAGE_POOL_2D | O | O | O | O | O | | O |
| BATCH_TO_SPACE_ND | O | O | O | O | O | | |
| CAST | | O | O | O | O | | O |
| CEIL | | | | | | | O |
| CHANNEL_SHUFFLE | | | | | | | O |
| CONCATENATION | O | O | O | O | O | | |
| CONV_2D | O | O | O | O | O | | |
| DEPTH_TO_SPACE | | O | O | O | O | | O |
| DEPTHWISE_CONV_2D | O | O | O | O | O | | |
| DEQUANTIZE | | O | O | O | O | O | O |
| DIV | | O | O | O | O | O | O |
| ELU | O | O | O | O | | | |
| EQUAL | | | | | O | | O |
| EXP | | | | | | | O |
| EXPAND_DIMS | O | O | O | O | O | | |
| FILL | | | | | | | O |
| FLOOR | | | | | | | O |
| FULLY_CONNECTED | O | O | O | O | O | O | |
| GATHER | | | | | O | | O |
| GREATER | | | | | O | | O |
| GREATER_EQUAL | | | | | O | | O |
| HARD_SWISH | | O | O | O | O | | O |
| L2_NORMALIZATION | | | | | O | | O |
| L2_POOL_2D | O | O | O | O | | | |
| LEAKY_RELU | | | | | | | O |
| LESS | | | | | O | | O |
| LESS_EQUAL | | | | | O | | O |
| LOCAL_RESPONSE_NORMALIZATION | | | | | | | |
| LOG | | | | | | | O |
| LOG_SOFTMAX | | | | | | | O |
| LOGICAL_AND | | | | | | | O |
| LOGICAL_NOT | | | | | | | O |
| LOGICAL_OR | | | | | | | O |
| LOGISTIC | O | O | O | O | O | | O |
| LSTM (QSTM) | | O | O | O | O | | O |
| MAX_POOL_2D | O | O | O | O | O | | O |
| MAXIMUM | O | O | O | O | O | | O |
| MEAN | O | O | O | O | O | | O |
| MINIMUM | O | O | O | O | O | | O |
| MIRRORPAD | | O | O | O | | | |
| MUL | O | O | O | O | O | O | O |
| NEG | | O | O | O | | | |
| NOT_EQUAL | | | | | O | | O |
| PACK | | O | O | O | O | | |
| PAD | O | O | O | O | O | | |
| POW | | O | O | O | O (SQRT) | | |
| PRELU | O | O | O | O | O | | O |
| QUANTIZE | | O | O | O | O | O | O |
| REDUCE_ANY | | | | | O | | O |
| REDUCE_MAX | | O | O | O | O | | O |
| REDUCE_MIN | | O | O | O | O | | O |
| RELU | O | O | O | O | O | | O |
| RELU_N1_TO_1 | O | O | O | O | O | | O |
| RELU6 | O | O | O | O | O | | O |
| RESHAPE | O | O | O | O | O | | O |
| RESIZE_BILINEAR | O | O | O | O | O | | |
| RESIZE_NEAREST | | O | O | O | O | | |
| ROUND | | | | | | | O |
| RSQRT | | | O | O | O | O | |
| SELECT | | | | | | | O |
| SLICE | O | O | O | O | O | | |
| SOFTMAX | | | O | O | O | O | O |
| SPACE_TO_BATCH_ND | | O | O | O | O | | |
| SPACE_TO_DEPTH | | O | O | O | O | | O |
| SPLIT | | O | O | O | O | | O |
| SPLIT_V | | O | O | O | O | | O |
| SQRT | | | O | O | O | O | |
| SQUARE | | O | O | O | O | O | |
| SQUARED_DIFFERENCE | | O | O | O | | | |
| SQUEEZE | | O | O | O | O | | |
| STRIDED_SLICE | | O | O | O | O | | |
| SUB | O | O | O | O | O | | O |
| SUM | | | | | | | O |
| TANH | O | O | O | O | O | | O |
| TILE | | | | | O | | |
| TOPK_V2 | | | | | O | | O |
| TRANSPOSE | | O | O | O | O | O | |
| TRANSPOSE_CONV | O | O | O | O | O | | |
| UNPACK | | O | O | O | | | |


2.2.2. MDLA 3.0 Guidelines

Note:

The following limitations may not match the MDLA hardware constraints exactly. Neuron might include software workarounds for MDLA hardware, or additional limitations that come from the current software implementation.

2.2.2.1. General Restrictions

The following limitations apply to all MDLA 3.0 operations, grouped by category.

Tensor Rank

Supported tensor ranks:

  • For operation (OP) types Conv3D, AvgPool3D, L2Pool3D, MinPool3D, and MaxPool3D: 5-D

  • For all other OP types: 0-D, 1-D, 2-D, 3-D, 4-D

Batch Size (N)

Valid batch sizes:

  • FULLY_CONNECTED: {1, 2, 4, 8}. FULLY_CONNECTED with any other batch size is converted to OP CONV_2D.

  • CONV_2D, DEPTHWISE_CONV_2D, TRANSPOSE_CONV: No batch size limit. If the batch size is greater than 65535, the OP is split into multiple OPs.

  • All other OPs: [1, 65535]

Height Size (H)

Valid range for input and output activations: [1, 65535]

Width Size (W)

Valid range for input and output activations: [1, 65535]

Channel Size (C)

Valid range for input and output activations: [1, 65535]

Data Type

Supported data types:

  • Asymmetric unsigned 8-bit

  • Asymmetric signed 8-bit

  • Symmetric signed 8-bit

  • Symmetric signed 16-bit

  • Symmetric signed 16-bit activation + Symmetric signed 8-bit weight

  • 16-bit floating point (FP16)

  • 32-bit floating point

    • Converted to FP16 if relax-FP32 is enabled

Per Channel Quantization

Only the following OPs support per channel quantization:

  • CONV_2D

  • DEPTHWISE_CONV_2D

  • TRANSPOSE_CONV

  • FULLY_CONNECTED

  • MTK_TRANSPOSE_CONV

  • PRELU

MDLA Hardware Buffer

MDLA has different internal buffers for different uses. If there is not a buffer of sufficient size for an operation, then MDLA cannot run the operation and reports “Unsupported”. To avoid internal buffer constraints:

  • Keep the input channel size small.

  • For operations that have stride, such as convolution and pooling, keep the stride values small in both width and height.

  • Keep filter size small in both width and height, especially for convolution-like operations.

2.2.2.2. Supported OPs Specification

Each entry below lists, in order: the OP name, the corresponding TFLite OP, the corresponding NNAPI operation (where one exists), and the restrictions that apply.

Abs

ABS

ABS

None

AvgPooling

AVERAGE_POOL_2D

AVERAGE_POOL_2D

  • Only NHWC format is supported.

  • Filter shape, stride, and paddings attributes must meet the following conditions:

    • If filter size is equal to input size (both H and W dimensions in output are equal to 1):

      • For quantized types: The input_height * input_width must be in the range [1, 2^20].

      • For floating-point types: The input_height and input_width must each satisfy one of the following constraints to avoid accuracy issues (a checker sketch follows this table):

        • input_height (or input_width) must be less than or equal to S, where S = 64.

        • input_height (or input_width) must be factorable in the form of “2^a * 3^b * 5^c * 7^d * N”, where N is 1 or a prime number less than or equal to S.

    • If filter size is not equal to input size:

      • Filter shape height and width must be in the range [1, 8].

      • Stride height must be in the range [1, filter_height].

      • Stride width must be in the range [1, filter_width].

      • Top and bottom paddings must be in the range [0, filter_height-1].

      • Left and right paddings must be in the range [0, filter_width-1].

BatchToSpace

BATCH_TO_SPACE_ND

BATCH_TO_SPACE_ND

Only NHWC format is supported.

Concat

CONCATENATION

CONCATENATION

None

Conv2D

CONV_2D

CONV_2D

  • NHWC and NCHW formats are supported.

  • Filter size

    • If stride is not 1x1, filter height and width must be in the range [1, 25].

    • Otherwise, filter width must be in the range [1, 31].

  • Stride

    • If dilation rate is equal to 1, stride height and width must be in {1, 2, 3, 4, 8}.

  • Padding

    • For 1x1 filter, there must be no padding.

    • Otherwise, padding must be in the range [0, 15].

  • Dilation rate

    • Dilation rate height must be in {1, 2, 4, 8}.

    • Dilation rate width must be in {1, 2, 4, 8}.

    • There are no limitations if the ncc-tflite option "--use-sw-dilated-conv" is enabled. This option applies a software solution for dilated convolution.

DepthwiseConv2D

DEPTHWISE_CONV_2D

DEPTHWISE_CONV_2D

  • Filter size

    • Filter height and width must be in the range [1, 25].

  • Channel multiplier

    • Channel multiplier must be in {1, 2, 4, 8, 16}.

    • Otherwise, channel multiplier must be equal to output channel (i.e., input channel is 1).

  • Other constraints are the same as for CONV_2D.

DepthToSpace

DEPTH_TO_SPACE

DEPTH_TO_SPACE

  • Only NHWC format is supported.

  • Input batch must be 1.

  • Output batch must be 1.

Dequantize

DEQUANTIZE

DEQUANTIZE

Input cannot be per channel quantization.

ElementWiseAdd

ADD

ADD

See Limitations of Broadcasting.

ElementWiseDiv

DIV

DIV

  • We recommend not applying this operation for quantized types because of accuracy issues.

  • See Limitations of Broadcasting.

ElementWiseMul

MUL

MUL

See Limitations of Broadcasting.

ElementWiseSub

SUB

SUB

  • See Limitations of Broadcasting.

  • The scale of input1 (minuend) must be greater than or equal to the scale of input2 (subtrahend).

Elu

ELU

ELU

None

FullyConnected

FULLY_CONNECTED

FULLY_CONNECTED

  • Filter input channel (i.e. the 2nd dimension of filter) must be in the range [1, 1048575].

  • FULLY_CONNECTED with dynamic weight is converted to CONV_2D.

  • Bias must be a constant tensor.

HardSwish

HARD_SWISH

HARD_SWISH

None

L2Pooling

L2_POOL_2D

L2_POOL_2D

  • Same as AVERAGE_POOL_2D, except that if the filter size is equal to the input size (both H and W dimensions in the output are equal to 1), then filter_height * filter_width must be in the range [1, 2^10].

  • Input activation with floating point data type is unsupported.

MaxPooling

MAX_POOL_2D

MAX_POOL_2D

  • Same as AVERAGE_POOL_2D.

  • Additionally supported: input dimension equal to output dimension, with SAME padding and stride 1

Maximum

MAXIMUM

MAXIMUM

See Limitations of Broadcasting.

Mean

MEAN

MEAN

None

Minimum

MINIMUM

MINIMUM

See Limitations of Broadcasting.

MirrorPad

MIRRORPAD

MIRRORPAD

Supported tensors: 4-D with padding on height or width direction.

Neg

NEG

NEG

None

Pack

PACK

(no NNAPI equivalent)

Cannot pack at last dimension.

Pad

PAD
PADV2

PAD
PAD_V2

None

Pow

POW

POW

Exponent must be a constant integer.

PRelu

PRELU

PRELU

  • Alpha must be a constant.

  • Alpha must be a scalar (0-D) or 1-D tensor.

QLSTM (5 inputs)

LSTM

QUANTIZED_16BIT_LSTM

The last dimension of input + the last dimension of output scratch must be:

  • 16-aligned

  • In the range [1, 1048575]

Quantize

QUANTIZE

QUANTIZE

None

ReduceMax

REDUCE_MAX

REDUCE_MAX

The size before reduced axis must be less than 65536.

ReduceMin

REDUCE_MIN

REDUCE_MIN

The size before reduced axis must be less than 65536.

ReLU
ReLU1
ReLU6

RELU
RELU_N1_TO_1
RELU6

RELU
RELU1
RELU6

None

Reshape

RESHAPE

RESHAPE

None

Resize::BILINEAR

RESIZE_BILINEAR

RESIZE_BILINEAR

  • Only NHWC format is supported.

  • Input Height must be in the range [1, 8192].

  • Input Width must be in the range [1, 8192].

Resize::NEAREST

RESIZE_NEAREST_NEIGHBOR

RESIZE_NEAREST_NEIGHBOR

  • Only NHWC format is supported.

  • Input Height must be in the range [1, 8192].

  • Input Width must be in the range [1, 8192].

RSqrt

RSQRT

RSQRT

None

Sigmoid

LOGISTIC

LOGISTIC

None

Slice

SLICE

SLICE

None

SoftMax

SOFTMAX

SOFTMAX

  • Axis must be -1; this means softmax is applied only along the input channel dimension.

  • Quantized types are dequantized to FP16 due to an accuracy issue.

SpaceToBatch

SPACE_TO_BATCH_ND

SPACE_TO_BATCH_ND

  • Only NHWC format is supported.

  • Input batch must be 1.

SpaceToDepth

SPACE_TO_DEPTH

SPACE_TO_DEPTH

  • Only NHWC format is supported.

  • Input batch must be 1.

Split

SPLIT

SPLIT

None

Sqrt

SQRT

SQRT

None

Square

SQUARE

(no NNAPI equivalent)

None

SquaredDifference

SQUARED_DIFFERENCE

(no NNAPI equivalent)

None

StridedSlice

STRIDED_SLICE

STRIDED_SLICE

Stride on the last dimension is unsupported.

Sum

SUM

SUM

None

Tanh

TANH

TANH

For quantized types, InputScale/OutputScale must be less than 842.

Transpose

TRANSPOSE

TRANSPOSE

None

TransposeConv2D

TRANSPOSE_CONV

TRANSPOSE_CONV_2D

  • Weight must be a constant tensor

  • Filter size

    • Filter height and width must be in the range [1, 25].

  • Stride

    • Stride height must be less than or equal to filter height.

    • Stride width must be less than or equal to filter width.

  • Other constraints are the same as for CONV_2D.

Unpack

UNPACK

(no NNAPI equivalent)

Cannot unpack at last dimension.
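
As a concrete reading of the AVERAGE_POOL_2D floating-point restriction above (the “2^a * 3^b * 5^c * 7^d * N” factorability rule), the sketch below checks whether a spatial dimension satisfies the rule with S = 64. This is an illustrative helper only, not part of the Neuron SDK.

```cpp
// Checks the MDLA global-average-pooling rule for floating-point types:
// a dimension is valid if it is <= S, or if stripping all factors of
// 2, 3, 5, and 7 leaves a remainder N that is 1 or a prime <= S.
// Illustrative only; not part of the Neuron SDK.
#include <cstdio>

static bool IsPrime(int n) {
  if (n < 2) return false;
  for (int d = 2; d * d <= n; ++d)
    if (n % d == 0) return false;
  return true;
}

static bool PoolDimSupported(int dim, int s = 64) {
  if (dim < 1) return false;
  if (dim <= s) return true;
  for (int f : {2, 3, 5, 7})
    while (dim % f == 0) dim /= f;  // strip 2^a * 3^b * 5^c * 7^d
  return dim == 1 || (dim <= s && IsPrime(dim));
}

int main() {
  std::printf("224 -> %s\n", PoolDimSupported(224) ? "ok" : "unsupported");  // 224 = 2^5 * 7
  std::printf("149 -> %s\n", PoolDimSupported(149) ? "ok" : "unsupported");  // prime > 64
  return 0;
}
```
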

2.2.2.3. Limitations of Broadcasting

  • Only broadcasting from a small tensor to a large tensor with compatible dimensions is supported.

    • Example 1: Input1 broadcasting to Input2 is supported.

    • Example 2: Input2 broadcasting to Input1 is supported.

    • Example 3: Input1 and Input2 broadcasting to each other is unsupported.

  • Hardware broadcasting is supported if either of the following conditions is met (a shape-check sketch follows this list):

    1. The small tensor has one of the following shapes:

      • []

      • [1]

      • [C]

      • [1, C]

      • [1, 1, C]

      • [1, 1, 1, C]

    2. The small tensor is broadcast on the batch or channel dimension.

      • Example 1: The shape of the small tensor is [1,H,W,C], where H,W,C are not equal to 1.

      • Example 2: The shape of the small tensor is [N,H,W,1], where N,H,W are not equal to 1.

      • Example 3: The shape of the small tensor is [1,H,W,1], where H,W are not equal to 1.

  • If the conditions for hardware broadcasting are not met, broadcasting is processed by software using multiple SPLIT and CONCAT.

    • If the small tensor is constant, the broadcasting is done at compile time. Bandwidth requirements might be larger at runtime.

    • If the small tensor is not constant, there are extra runtime DMA overheads.
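
The hardware-broadcasting conditions above reduce to a simple test on the shape of the smaller tensor. The sketch below is one illustrative reading of those rules for 4-D NHWC shapes (missing leading dimensions treated as 1); it is not SDK code.

```cpp
// Illustrative restatement of the MDLA hardware-broadcasting conditions
// for the "small" tensor in an element-wise OP, as a 4-D NHWC shape.
// Not SDK code; shapes with fewer dimensions use 1 for leading entries.
#include <array>
#include <cstdio>

// Condition 1: [], [1], [C], [1,C], [1,1,C], or [1,1,1,C] -- every
// dimension except the channel dimension is 1.
static bool IsChannelVector(const std::array<int, 4>& s) {
  return s[0] == 1 && s[1] == 1 && s[2] == 1;
}

// Condition 2: broadcast on the batch or channel dimension, e.g.
// [1,H,W,C], [N,H,W,1], or [1,H,W,1].
static bool BroadcastsOnBatchOrChannel(const std::array<int, 4>& s) {
  return s[0] == 1 || s[3] == 1;
}

static bool HardwareBroadcastSupported(const std::array<int, 4>& small) {
  return IsChannelVector(small) || BroadcastsOnBatchOrChannel(small);
}

int main() {
  // [1,1,1,64]: channel vector -> hardware broadcast supported.
  std::printf("%d\n", HardwareBroadcastSupported({1, 1, 1, 64}));
  // [2,56,1,64]: broadcast on W only -> software path (SPLIT/CONCAT).
  std::printf("%d\n", HardwareBroadcastSupported({2, 56, 1, 64}));
  return 0;
}
```
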


2.2.3. MVPU 2.0 Guidelines

2.2.3.1. General Restrictions

The following limitations apply to all MVPU 2.0 operations, grouped by category.

Tensor Rank

Supported tensor ranks: 1-D, 2-D, 3-D, 4-D

Batch Size (N)

  • Dynamic shape is not supported.

  • Valid range for input and output activations: [1, 65535]

Height Size (H)

  • Dynamic shape is not supported.

  • Valid range for input and output activations: [1, 65535]

Width Size (W)

  • Dynamic shape is not supported.

  • Valid range for input and output activations: [1, 65535]

Channel Size (C)

  • Dynamic shape is not supported.

  • Valid range for input and output activations: [1, 65535]

Data Type

Supported data types:

  • Asymmetric unsigned 8-bit (Asym U8)

  • Asymmetric signed 8-bit (Asym I8)

  • Symmetric 8-bit (Sym I8)

  • Symmetric 16-bit (Sym I16)

  • 8-bit boolean for logical operations (Bool 8)

  • 16-bit floating point (FP16)

  • 32-bit floating point (FP32)

  • 32-bit integer (Int32)

Data Format

Only NHWC format is supported.

2.2.3.2. TensorFlow and TFLite Operations

Each entry below lists the operation name(s), followed by the restrictions that apply, including supported quantization and floating-point data types.

ADD

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Support requantization

  • One of input tensors can be constant

  • Only support inputScale / (2 * maxInputScale) < 1

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Input: FP16

    • Output: FP16

ARG_MAX

ARG_MIN

  • Maximum input/output tensor rank: 4D

  • Do not support batch axis

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Int32

  • Floating data type

    • Input: FP16 / FP32

    • Output: Not supported

AVERAGE_POOL_2D

  • Input and output must be 4D

  • Filter W, H = [1:128]

  • Stride W = H

  • Support requantization

  • Support RELU/RELU1/RELU6 fusion

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

CAST

  • Maximum input/output tensor rank: 4D

  • Input must not be a constant

  • Quantization data type

    • Input: Asym U8 / Asym I8 / Int32

    • Output: Asym U8 / Asym I8 / Int32

  • Floating data type

    • Support casting from FP32 to Int32

    • Support casting from Int32 to FP32

    • Support casting from Int32 to FP16

CEIL

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Not supported

  • Floating data type

    • Input Value: FP16 / FP32

    • Output: FP16 / FP32

CROP_AND_RESIZE

  • Input and output must be 4D

  • Input boxes must be 2D

  • I/O must be in the same scale and zero point

  • The number of input box indices must be the same as the output batch size

  • Quantization data type

    • Input: Asym U8

    • Input boxes: FP32

    • Input box indices: Int32

    • Output value: Asym U8

  • Floating data type

    • Not supported

CHANNEL_SHUFFLE

  • Maximum input/output tensor rank: 4D

  • Input and output must be in the same scale and zero point

  • group_size = num_channels / num_groups

  • The number of channels must be divisible by num_groups.

  • The input scalar (the dimension along which to shuffle channels) must be in the range [-n, n-1]

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

DEPTH_TO_SPACE

  • Input and output must be 4D

  • Block size >= 1

  • Input and output must be in the same scale and zero point

  • Quantization data type

    • Not supported

  • Floating data type

    • Input Value: FP16 / FP32

    • Output: FP16 / FP32

DEQUANTIZE

  • Maximum input/output tensor rank: 4D

  • Input and Output must have the same shape

  • Input scale must be > 0

  • Per-channel quantization is not supported

  • Quantization data type

    • Not supported

  • Floating data type

    • Input: Asym U8 / Asym I8

    • Output: FP16

DIV

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Quantization data type

    • Not supported

  • Floating data type

    • Input: FP16

    • Output: FP16

EQUAL

NOT_EQUAL

GREATER

GREATER_EQUAL

LESS

LESS_EQUAL

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Quantization data type

    • Input: Asym U8 / Asym I8 / Bool 8 / Int32

    • Output: Bool 8

  • Floating data type

    • Not supported

EXP

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Not supported

  • Floating data type

    • Input Value: FP16 / FP32

    • Output: FP16 / FP32

FILL

  • The n-D output can be reshaped to 4-D, with each dimension < 16 bits

  • Input(fill value) and output must be in the same scale and zero point

  • Input 0 (describing output shape) must be 1D

  • Quantization data type

    • Input: Asym U8 / Asym I8 / Asym U16 / Asym I16 / UInt32 / Int32

    • Output: Asym U8 / Asym I8 / Asym U16 / Asym I16 / UInt32 / Int32

  • Floating data type

    • Input Value: FP16/ FP32

    • Output: FP16 / FP32

FLOOR

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Not supported

  • Floating data type

    • Input Value: FP16

    • Output: FP16

GATHER

  • Maximum input/output tensor rank: 4D

  • Input and output must be in the same scale and zero point

  • Only support single batch.

  • Do not support gathering in batch axis

  • Axis must be smaller than input rank

  • Input dimension + Indices dimension <= 4

  • Support requantization

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

HARD_SWISH

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Input: Asym U8 / Asym I8 / Sym I8

    • Output: Asym U8 / Asym I8 / Sym I8

  • Floating data type

    • Not supported

L2_NORMALIZATION

  • Maximum input/output tensor rank: 4D

  • Axis must be constant

  • Only support single axis, and axis can not be batch dimension

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

LEAKY_RELU

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

LOG

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Not supported

  • Floating data type

    • Input Value: FP16 / FP32

    • Output: FP16 / FP32

LOG_SOFTMAX

  • Support 2D/4D output

  • Only supported axis is the channel dimension

  • Beta > 0

  • Support RESHAPE fusion in front of OP

  • Quantization data type

    • Not supported

  • Floating data type

    • Input: FP32

    • Output: FP32

LOGICAL_AND

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Quantization data type

    • Input: Bool 8

    • Output: Bool 8

  • Floating data type

    • Not supported

LOGICAL_NOT

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Quantization data type

    • Input: Bool 8

    • Output: Bool 8

  • Floating data type

    • Not supported

LOGICAL_OR

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Quantization data type

    • Input: Bool 8

    • Output: Bool 8

  • Floating data type

    • Not supported

LOGISTIC

  • Maximum input/output tensor rank: 4D

  • Output scale = 1/256, Output zeropoint = 0 for Asym U8

  • Output scale = 1/256, Output zeropoint = -128 for Asym I8

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

MAX_POOL_2D

  • Input and output must be 4D

  • Weight W, H = [1:16]

  • Stride W = H

  • Support requantization

  • Support RELU/RELU1/RELU6 fusion

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Weight: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

MAXIMUM

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Support requantization

  • One input tensor can be constant

  • Only support inputScale / (2 * maxInputScale) < 1

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

MEAN

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

MINIMUM

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Support requantization

  • One input tensor can be constant

  • Only support inputScale / (2 * maxInputScale) < 1

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

MUL

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Support requantization

  • One input tensor can be constant

  • Only support inputProdScale / outputScale < 1

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Input: FP16 / FP32

    • Output: FP16 / FP32

PRELU

  • Maximum input/output tensor rank: 4D

  • Broadcast is only supported if the following conditions are true:

    • Alpha tensor rank must be 0D or 1D

    • Alpha tensor size must be 1 or the same as input channel size

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

QLSTM

  • A version of quantized LSTM, using 16 bit quantization for internal state

  • Supports NNAPI v1.2 behavior

  • Projection is not supported

  • Layer normalization is not supported

  • 16 inputs and 2 outputs

  • Quantization data type

    • Input: Asym U8

    • Output Cell State: Sym I16

    • Output Value: Asym U8

  • Floating data type

    • Not supported

QUANTIZE

  • Maximum input/output tensor rank: 4D

  • Input and Output must have the same shape

  • Output scale must be > 0

  • Per-channel quantization is not supported

  • Quantization data type

    • Not supported

  • Floating data type

    • Input: FP16

    • Output: Asym U8 / Asym I8

REDUCE_ANY

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Input: Bool 8

    • Output: Bool 8

  • Floating data type

    • Not supported

REDUCE_MAX

REDUCE_MIN

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

RELU

RELU1

RELU6

  • Maximum input/output tensor rank: 4D

  • Input and output must be in the same scale and zero point

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

RESHAPE

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Input: Asym U8 / Asym I8 / Asym U16 / Asym I16 / Bool 8 / UInt32 / Int32

    • Output: Asym U8 / Asym I8 / Asym U16 / Asym I16 / Bool 8 / UInt32 / Int32

  • Floating data type

    • Input: FP16 / FP32

    • Output: FP16 / FP32

ROUND

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Not supported

  • Floating data type

    • Input Value: FP16 / FP32

    • Output: FP16 / FP32

SELECT

  • Maximum input/output tensor rank: 4D

  • Input and output must be the same shape

  • One input tensor can be constant

  • Support requantization

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Input condition: Bool 8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

SOFTMAX

  • Support 2D/4D output

  • Only supported axis is the channel dimension

  • Beta > 0

  • Support RESHAPE fusion in front of OP

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Input: FP16 / FP32

    • Output: FP16 / FP32

SPACE_TO_DEPTH

  • Input and output must be 4D

  • Block size >= 1

  • Input and output must be in the same scale and zero point

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

SPLIT

  • mNumSplits = number of outputs

  • Input rank = Output rank

  • Support n-D Input0, but its merged shape rank must be >= 2 and <= 4

  • Input1 (axis) must be a constant scalar

  • Quantization data type

    • Input0: Asym U8 / Asym I8 / Asym U16 / Asym I16 / UInt32 / Int32

    • Input1: Int32

    • Output: Asym U8 / Asym I8 / Asym U16 / Asym I16 / UInt32 / Int32

  • Floating data type

    • Input: FP16 / FP32

    • Output: FP16 / FP32

SPLIT_V

  • mNumSplits = number of outputs

  • Input rank = Output rank

  • Support n-D Input0, but its merged shape rank must be >= 2 and <= 4

  • Input1 (axis) must be a constant scalar

  • Quantization data type

    • Input0: Asym U8 / Asym I8 / Asym U16 / Asym I16 / UInt32 / Int32

    • Input1: Int32

    • Output: Asym U8 / Asym I8 / Asym U16 / Asym I16 / UInt32 / Int32

  • Floating data type

    • Input: FP16 / FP32

    • Output: FP16 / FP32

SUB

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Support requantization

  • One input tensor can be constant

  • Only support inputScale / (2 * maxInputScale) < 1

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

SUM

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

TANH

  • Maximum input/output tensor rank: 4D

  • Output scale = 1/128, Output zeropoint = 128 for Asym U8

  • Output scale = 1/128, Output zeropoint = 0 for Asym I8

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Input: FP16 / FP32

    • Output: FP16 / FP32

TOPK_V2

  • Input and output must be 4D

  • Output values and indices must have the same dimensions

  • Batch size must be the same for both input and output

  • K value must be in the range [1, the size of last input dimension]

  • Quantization data type

    • Input: Asym U8

    • Output value: Asym U8

    • Output indices: Int32

  • Floating data type

    • Not supported

2.3. NeuroPilot SDK

The D9000 supports the following versions of NeuroPilot and Android.

  • NeuroPilot 5 for Android S