1. NeuroPilot Introduction

1.1 NeuroPilot Software Ecosystem

NeuroPilot is a collection of software tools and APIs which are at the center of MediaTek’s AI ecosystem. These tools are designed to fulfill the goal of “Edge AI”, which means that AI processing is performed locally on the device rather than remotely on a server. With NeuroPilot, users can develop and deploy AI applications on edge devices with extremely high efficiency. This makes a wide variety of AI applications run faster, while also keeping data private.

MediaTek’s hardware platforms, such as mobile Systems-on-Chip (SoCs) and ultra-low-power embedded devices, span different levels of compute density. MediaTek is deeply invested in creating an AI ecosystem with efficient yet powerful AI processors in its devices, ranging from smartphones to smart homes, wearables, Internet-of-Things (IoT) devices, and connected cars.

Open frameworks such as TensorFlow offer out-of-the-box usability, but typically lack optimized support for advanced hardware. NeuroPilot lets users exploit all the available hardware resources of a MediaTek AI platform, beyond what an open framework offers. NeuroPilot provides programming support for specialized device capabilities, which allows for better performance, power, memory consumption, and end-user experience.

1.1.1 Neural Network Framework Support

NeuroPilot’s software tools support common AI frameworks such as TensorFlow, PyTorch, and TensorFlow Lite (TFLite). NeuroPilot provides support for inspecting, loading, and converting models, either to MediaTek-optimized model formats or to open-framework standard model formats.

For Android devices, NeuroPilot provides extensions for the Android Neural Network API (NNAPI). This enables developers and device makers to bring their code closer to the hardware, for better performance and power-efficiency on MediaTek devices. NeuroPilot also allows developers to use a ‘write once, apply everywhere’ flow for existing and future MediaTek devices, including smartphones, automotive, smart home, IoT, and more. This streamlines the creation process, saving cost and time to market.

1.1.2 NeuroPilot Software Tools

| Tool | Type | Description |
|---|---|---|
| AISimulator | Web tool | Simulates a neural network workload on MediaTek’s AI Processing Unit (APU). |
| Android Run-Time Libraries | Library | Libraries which provide NNAPI delegates for special-purpose hardware cores (GPU, VPU, MDLA), and support for dynamic scheduling. |
| Converter | Command line tool | Converts a pre-trained and optimized PyTorch or TensorFlow model into a TensorFlow Lite model, and performs post-training quantization. |
| Neuron SDK | Command line tools, API, Library | A TFLite model compiler which produces ready-to-run compiled binary models (.dla). |
| Quantization | Command line tool | Optimizes a model for efficient inference on MediaTek devices using quantization-aware training. |

1.1.2.1 Neuron SDK

Neuron SDK allows users to convert their custom models to MediaTek-proprietary binaries for deployment on MediaTek platforms. The resulting models are highly efficient, with reduced latency and a smaller memory footprint. Users can also create a runtime environment, parse compiled model files, and perform inference on the edge. Neuron SDK is aimed at users who are performing bare-metal C/C++ programming for AI applications, and offers an alternative to the Android Neural Networks API (NNAPI) for deploying neural network models on MediaTek-enabled Android devices.

Neuron SDK consists of the following components:

  • Neuron Compiler (ncc-tflite): An offline neural network model compiler which produces statically compiled deep learning archive (.dla) files.
  • Neuron Runtime (neuronrt): A command line tool which executes a specified .dla file and reports the results.
  • Neuron Runtime API: A user-invoked API which supports loading and running compiled .dla files within a user’s C++ application. A minimal usage sketch follows this list.
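
The sketch below illustrates the flow just described: a model is compiled offline into a .dla file with ncc-tflite, then loaded and run from C++ through the Neuron Runtime API. This is a minimal sketch only. The header path, function names, and signatures shown (NeuronRuntime_create and related calls), as well as the ncc-tflite flags in the comments, are assumptions based on typical Neuron SDK naming; verify them against the headers and tool help shipped with your NeuroPilot release.

```cpp
// Minimal sketch of the Neuron Runtime API flow described above:
// create a runtime, parse a compiled .dla model, bind I/O buffers,
// run inference, and release the runtime.
// NOTE: the header path, function names, and signatures below are
// assumptions for illustration; verify against the Neuron SDK headers.
#include <cstdint>
#include <cstdio>
#include <vector>

#include "RuntimeAPI.h"  // assumed Neuron SDK header name

int main() {
    void* runtime = nullptr;

    // 1. Create a runtime environment (default environment options).
    if (NeuronRuntime_create(nullptr, &runtime) != 0) return 1;

    // 2. Parse a compiled model file produced offline by ncc-tflite,
    //    e.g.: ncc-tflite --arch mdla3.0 model.tflite -o model.dla
    //    (flag names are also assumptions).
    if (NeuronRuntime_loadNetworkFromFile(runtime, "model.dla") != 0) return 1;

    // 3. Bind input and output buffers; sizes must match the model.
    std::vector<uint8_t> input(1 * 224 * 224 * 3);  // placeholder input shape
    std::vector<uint8_t> output(1001);              // placeholder output size
    NeuronRuntime_setInput(runtime, 0, input.data(), input.size());
    NeuronRuntime_setOutput(runtime, 0, output.data(), output.size());

    // 4. Perform one inference on the edge device.
    if (NeuronRuntime_inference(runtime) != 0) return 1;

    // 5. Release the runtime environment.
    NeuronRuntime_release(runtime);
    return 0;
}
```
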
1.1.2.2 Android Run-Time Libraries

MediaTek provides several run-time libraries for Android devices. These libraries allow for greater control and utilization of MediaTek special-purpose cores. The main library which implements most of this capability is an optimized Android Neural Network API (NNAPI) library, which is part of the Android NDK. The NNAPI library provides NNAPI hardware delegates, which enable the use of the GPU, VPU, and MDLA cores when running neural networks. This means that any NNAPI application can use MediaTek acceleration cores without any special changes to the application code. This accelerator support also extends to .tflite models running in the Android TFLite run-time layer.
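
For example, a native Android (NDK) application can route a .tflite model through NNAPI, and therefore through the MediaTek acceleration cores, using the standard TensorFlow Lite C++ API together with the stock TFLite NNAPI delegate. The sketch below shows the general mechanism only; the model path is a placeholder, and error handling is reduced to early returns.

```cpp
// Minimal sketch: run a .tflite model through the TFLite NNAPI delegate,
// letting the device's NNAPI implementation dispatch supported OPs to
// accelerator cores (GPU/VPU/MDLA) and fall back to the CPU otherwise.
#include <memory>

#include "tensorflow/lite/delegates/nnapi/nnapi_delegate.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
  // Load the model (placeholder path).
  auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
  if (!model) return 1;

  // Build an interpreter with the built-in OP resolver.
  tflite::ops::builtin::BuiltinOpResolver resolver;
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(*model, resolver)(&interpreter);
  if (!interpreter) return 1;

  // Attach the NNAPI delegate; no MediaTek-specific code is required.
  tflite::StatefulNnApiDelegate nnapi_delegate;
  if (interpreter->ModifyGraphWithDelegate(&nnapi_delegate) != kTfLiteOk) return 1;

  if (interpreter->AllocateTensors() != kTfLiteOk) return 1;

  // ... fill interpreter->typed_input_tensor<float>(0) with input data ...
  if (interpreter->Invoke() != kTfLiteOk) return 1;
  // ... read results from interpreter->typed_output_tensor<float>(0) ...
  return 0;
}
```
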

Note:

MediaTek provides ready-to-run Android libraries for all Android-compatible devices. The developer does not need to interact with these libraries, and no special settings are required to use them.

1.2 MediaTek Device Capabilities

MediaTek devices deliver high speed and efficiency for AI applications, combining strong performance with low power consumption.

1.2.1 Hardware Support

NeuroPilot tools can use the following target compute devices to run neural network models.

  • CPU
  • GPU
  • VPU (Vision Processing Unit)
  • MDLA (MediaTek Deep Learning Accelerator)

Successful use of these cores depends on the following factors, which interact with a user’s model.

  • Neural network framework format of the trained model.
  • Hardware platform (e.g. part number and device capability).
  • Required model accuracy. Models with high accuracy requirements might limit the type and significance of the optimizations that can be applied to the model. This might also limit the target devices that are able to run the model with the required performance and accuracy.
  • Neural network model structure. Certain operation (OP) types are not supported on certain target devices. For details, refer to the Supported Operations section of the platform’s documentation.

Note:

  • NeuroPilot is not compatible with all types of GPU.
  • Some platforms do not have a VPU or MDLA.
  • For information about device hardware and compatibility, refer to the platform’s documentation or contact MediaTek.
1.2.1.1 Device Parametric Table

| Device | Operator Flexibility | Performance | Power Consumption | Data Types |
|---|---|---|---|---|
| CPU | Very High | Low | High | FP32, FP16, INT16, INT8 |
| GPU | Medium | Medium | Medium | FP32, FP16 |
| VPU | Medium | High | Low | FP32, FP16, INT16, INT8 |
| MDLA | Low | Very High | Low | FP16, INT16, INT8 |

As a general rule, you should target the most power-efficient device that your neural network and development constraints allow. On MediaTek platforms, the lowest-power devices are also the highest performing.

1.2.2 Devices

1.2.2.1 CPU

The CPU is capable of running any neural network, and is guaranteed to support all existing and future NN operations. Support is provided by the TFLite subsystem from Google for Android devices. For native development, developers can use the TFLite C++ API. The CPU is the most flexible target device, but it is also the least optimized for power and performance.

1.2.2.2 GPU

The GPU provides neural network acceleration for floating point models.

  • ARM-based MediaTek platforms support GPU neural network acceleration via Arm NN and the Arm Compute Library.

  • Non-ARM MediaTek platforms support GPU neural network acceleration via Google’s TensorFlow Lite GPU delegate. This GPU delegate is able to accelerate a wide selection of TFLite operations.

1.2.2.3 MVPU

The MediaTek Vision Processing Unit (MVPU) offers general-purpose Digital Signal Processing (DSP) capabilities, with special hardware for accelerating complex imaging and computer vision algorithms. The MVPU also offers outstanding performance while running AI models.

1.2.2.4 MDLA

The MediaTek Deep Learning Accelerator (MDLA) is a powerful and efficient Convolutional Neural Network (CNN) accelerator. The MDLA is capable of achieving high AI benchmark results with high Multiply-Accumulate (MAC) utilization rates. The design integrates MAC units with dedicated function blocks, which handle activation functions, element-wise operations, and pooling layers.

The MDLA uses a technique called tile-based layer fusion to help achieve high compute efficiency and bandwidth reduction. Tile-based layer fusion identifies and then fuses dependent inter-layer operations, in order to reduce the amount of data the MDLA brings on-chip.


2. Hardware Support Specification

The following MediaTek platforms support NeuroPilot 5:

  • Dimensity 8000
  • Dimensity 8100
  • Dimensity 9000

2.1 Hardware Specifications

2.1.1. Dimensity 9000

| Feature | D9000 |
|---|---|
| Process | T-4nm |
| CPU | 1x Arm Cortex-X2 at 3.05GHz, 1MB L2; 3x Arm Cortex-A710 up to 2.85GHz, 512KB L2; 4x Arm Cortex-A510 up to 1.8GHz, 256KB L2; SC/MC: 1256/4198 |
| GPU | Arm Mali-G710 MC10; 1W: 119fps, Peak: >220fps |
| Memory | 4x LPDDR5X 7500MHz; UFS 3.1, 2-lane |
| Camera | 4K30 3-exp Video HDR x 3CAM; up to 320MP; 32M+32M+32M @30 ZSD |
| AI | MediaTek APU 590; 4x MDLA 3.0+; 2x MVPU 2.0 |
| Video Decoder | 8K 30fps |
| Video Encoder | 8K 24fps |
| Display | 2480x2200 120Hz; WQHD+ (3680x1600) 144Hz |
| Connectivity | Wi-Fi 6E 2x2, 160MHz bandwidth; DBDC 1x1+1x1; Bluetooth 5.3 |
| Modem | 5G NR 3CC 300MHz with ET 60MHz; 4G Cat-19, DR-DSDA |

2.1.2. APU

The MediaTek AI Processing Unit (APU) is a high-performance hardware engine for deep learning, optimized for bandwidth and power efficiency. The APU architecture consists of big, small, and tiny cores. This highly heterogeneous design is suited for a wide variety of modern smartphone tasks, such as AI-camera, AI-assistant, and OS or in-app enhancements.

2.1.2.1. MVPU 2.0

The MediaTek Vision Processing Unit (MVPU) offers general-purpose Digital Signal Processing (DSP) capabilities, with special hardware for accelerating complex imaging and computer vision algorithms. The MVPU also offers outstanding performance while running AI models.

2.1.2.2. MDLA 3.0

The MediaTek Deep Learning Accelerator (MDLA) is a powerful and efficient Convolutional Neural Network (CNN) accelerator. The MDLA is capable of achieving high AI benchmark results with high Multiply-Accumulate (MAC) utilization rates. The design integrates MAC units with dedicated function blocks, which handle activation functions, element-wise operations, and pooling layers.

The MDLA uses a technique called tile-based layer fusion to help achieve high compute efficiency and bandwidth reduction. Tile-based layer fusion identifies and then fuses dependent inter-layer operations, in order to reduce the amount of data the MDLA brings on-chip.

2.2. Supported Operations

This section describes all the neural network operations (OPs) that the D9000 supports through NeuroPilot, and any restrictions placed on their use.

2.2.1. TFLite Operations

Note:

NeuroPilot supports a wide variety of operations in TFLite. This allows most neural network models to run on the specialized compute cores available on MediaTek platforms.

If you trained a model using TensorFlow v1, TensorFlow v2, or PyTorch, you can convert the model to TFLite format using NeuroPilot’s Converter Tool. For details, see the Converter Tool section in the NeuroPilot SDK documentation.

2.2.1.1. Supported Data Types

The following table lists the supported data types of each D9000 hardware target.

| Device | AsymU8 | AsymI8 | SymI8 | SymI16 | Fp16 | Fp32 | Bool8 | Int32 |
|---|---|---|---|---|---|---|---|---|
| MDLA 3.0 | O | O | O | O | O | | | |
| MVPU 2.0 | O | O | O | O | O | O | O | O |
| GPU | | | | | O | O | | |
| CPU | O | O | | | O | O | O | O |

2.2.1.2. Supported NeuroPilot Operations

The following table lists the TFLite operations supported by the NeuroPilot 5.0 SDK on the Dimensity 9000 (D9000) platform for each data type.

Note:

For full details of each operation, see the corresponding hardware guidelines section.

| OP Name | TFLite OP | ncc-tflite | Neuron Delegate | NNAPI Delegate | AsymU8 | AsymI8 | SymI8 | SymI16 | Fp16 | Fp32 | Bool8 | Int32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abs | kTfLiteBuiltinAbs | O | O | since API Level 29 | O | | O | O | O | O | | |
| ArgMax | kTfLiteBuiltinArgMax | O | O | since API Level 29 | O | O | | | O | O | | |
| ArgMin | kTfLiteBuiltinArgMin | O | O | since API Level 29 | O | O | | | O | O | | |
| AvgPooling | kTfLiteBuiltinAveragePool2d | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Cast | kTfLiteBuiltinCast | O | O | since API Level 29 | O | O | | | | | | O |
| Concat | kTfLiteBuiltinConcatenation | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Conv2D | kTfLiteBuiltinConv2d | O | O | since API Level 27 | O | O | O | O | O | O | | |
| DepthToSpace | kTfLiteBuiltinDepthToSpace | O | O | since API Level 27 | O | O | O | O | O | O | | |
| DepthwiseConv2D | kTfLiteBuiltinDepthwiseConv2d | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Dequantize | kTfLiteBuiltinDequantize | O | O | since API Level 27 | O | O | O | O | O | O | | |
| ElementWiseAdd | kTfLiteBuiltinAdd | O | O | since API Level 27 | O | O | O | O | O | O | | |
| ElementWiseDiv | kTfLiteBuiltinDiv | O | O | since API Level 28 | O | O | O | O | O | O | | |
| ElementWiseMul | kTfLiteBuiltinMul | O | O | since API Level 27 | O | O | O | O | O | O | | |
| ElementWiseSub | kTfLiteBuiltinSub | O | O | since API Level 28 | O | O | O | O | O | O | | |
| Elu | kTfLiteBuiltinElu | O | | since API Level 30 | O | O | O | O | O | O | | |
| Equal | kTfLiteBuiltinEqual | O | O | since API Level 29 | O | O | | | O | O | O | O |
| FullyConnected | kTfLiteBuiltinFullyConnected | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Gather | kTfLiteBuiltinGather | O | O | since API Level 29 | O | O | | | | | | |
| Greater | kTfLiteBuiltinGreater | O | O | since API Level 29 | O | O | | | O | O | O | O |
| GreaterEqual | kTfLiteBuiltinGreaterEqual | O | O | since API Level 29 | O | O | | | O | O | O | O |
| HardSwish | kTfLiteBuiltinHardSwish | O | O | since API Level 30 | O | O | O | O | O | O | | |
| L2Norm | kTfLiteBuiltinL2Normalization | O | O | since API Level 27 | O | O | | | | O | | |
| Less | kTfLiteBuiltinLess | O | O | since API Level 29 | O | O | | | O | O | O | O |
| LessEqual | kTfLiteBuiltinLessEqual | O | O | since API Level 29 | O | O | | | O | O | O | O |
| Maximum | kTfLiteBuiltinMaximum | O | O | since API Level 29 | O | O | O | O | O | O | | |
| MaxPooling | kTfLiteBuiltinMaxPool2d | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Mean | kTfLiteBuiltinMean | O | O | since API Level 28 | O | O | O | O | O | O | | |
| Minimum | kTfLiteBuiltinMinimum | O | O | since API Level 29 | O | O | O | O | O | O | | |
| Neg | kTfLiteBuiltinNeg | O | O | since API Level 29 | O | | O | O | O | O | | |
| NotEqual | kTfLiteBuiltinNotEqual | O | O | since API Level 29 | O | O | | | O | O | O | O |
| Pack | kTfLiteBuiltinPack | O | | | O | O | O | O | O | | | |
| Pad | kTfLiteBuiltinPad | O | O | since API Level 28 | O | O | O | O | O | O | | |
| Pad | kTfLiteBuiltinPadv2 | O | O | since API Level 29 | O | O | O | O | O | O | | |
| Pow | kTfLiteBuiltinPow | O | O | since API Level 29 | O | | O | O | O | | | |
| PRelu | kTfLiteBuiltinPrelu | O | O | since API Level 29 | O | O | O | O | O | O | | |
| PRelu | kTfLiteBuiltinLeakyRelu | O | O | | O | O | O | O | O | O | | |
| Quantize | kTfLiteBuiltinQuantize | O | O | since API Level 29 | O | | O | O | O | O | | |
| ReduceAny | kTfLiteBuiltinReduceAny | O | O | since API Level 29 | O | O | | | | | | |
| ReduceMax | kTfLiteBuiltinReduceMax | O | O | since API Level 29 | O | O | | | | | | |
| ReduceMin | kTfLiteBuiltinReduceMin | O | O | since API Level 29 | O | O | | | | | | |
| ReLU | kTfLiteBuiltinRelu | O | O | since API Level 27 | O | O | O | O | O | O | | |
| ReLU6 | kTfLiteBuiltinRelu6 | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Reshape | kTfLiteBuiltinReshape | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Reshape | kTfLiteBuiltinSqueeze | O | O | since API Level 28 | O | O | O | O | O | O | | |
| Resize::BILINEAR | kTfLiteBuiltinResizeBilinear | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Resize::NEAREST | kTfLiteBuiltinResizeNearestNeighbor | O | O | since API Level 29 | O | O | O | O | O | O | | |
| RSqrt | kTfLiteBuiltinRsqrt | O | O | since API Level 29 | | | | | O | O | | |
| Sigmoid | kTfLiteBuiltinLogistic | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Slice | kTfLiteBuiltinSlice | O | O | since API Level 29 | O | O | O | O | O | | | |
| SoftMax | kTfLiteBuiltinSoftmax | O | O | since API Level 27 | O | O | | | O | O | | |
| SpaceToDepth | kTfLiteBuiltinSpaceToDepth | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Split | kTfLiteBuiltinSplit | O | O | since API Level 29 | O | O | O | O | O | O | | |
| Sqrt | kTfLiteBuiltinSqrt | O | O | since API Level 29 | | | | | O | O | | |
| Square | kTfLiteBuiltinSquare | O | | | O | O | O | O | O | | | |
| SquaredDifference | kTfLiteBuiltinSquaredDifference | O | | | O | | O | O | O | | | |
| StridedSlice | kTfLiteBuiltinStridedSlice | O | O | since API Level 28 | O | O | O | O | O | O | | |
| Tanh | kTfLiteBuiltinTanh | O | O | since API Level 27 | O | O | O | O | O | O | | |
| Tile | kTfLiteBuiltinTile | O | O | since API Level 29 | O | O | | | | | | |
| Transpose | kTfLiteBuiltinTranspose | O | O | since API Level 28 | O | O | O | O | O | O | | |
| TransposeConv2D | kTfLiteBuiltinTransposeConv | O | O | since API Level 29 | O | O | O | O | O | O | | |

2.2.1.3. Supported Hardware Operations

The table below lists the supported TFLite operations (OPs) for hardware targets on different MediaTek platforms.

Note:

  • Each OP might have both hardware and software constraints.

  • For details on MDLA constraints, see MDLA 3.0 Guidelines.

  • To check the supported TFLite OP versions, use the --show-builtin-ops option in ncc-tflite.

| TFLite operation | MDLA 1.0 | MDLA 1.5/1.7 | MDLA 2.0 | MDLA 3.0 | VPU (asym8) | VPU (fp16) | MVPU 2.0 |
|---|---|---|---|---|---|---|---|
| ABS | O | O | O | O | | | |
| ADD | O | O | O | O | O | O | O |
| ARG_MAX | | | | | O | | O |
| ARG_MIN | | | | | O | | O |
| AVERAGE_POOL_2D | O | O | O | O | O | | O |
| BATCH_TO_SPACE_ND | O | O | O | O | O | | |
| CAST | | O | O | O | O | | O |
| CEIL | | | | | | | O |
| CHANNEL_SHUFFLE | | | | | | | O |
| CONCATENATION | O | O | O | O | O | | |
| CONV_2D | O | O | O | O | O | | |
| DEPTH_TO_SPACE | | O | O | O | O | | O |
| DEPTHWISE_CONV_2D | O | O | O | O | O | | |
| DEQUANTIZE | | O | O | O | O | O | O |
| DIV | | O | O | O | O | O | O |
| ELU | O | O | O | O | | | |
| EQUAL | | | | | O | | O |
| EXP | | | | | | | O |
| EXPAND_DIMS | O | O | O | O | O | | |
| FILL | | | | | | | O |
| FLOOR | | | | | | | O |
| FULLY_CONNECTED | O | O | O | O | O | O | |
| GATHER | | | | | O | | O |
| GREATER | | | | | O | | O |
| GREATER_EQUAL | | | | | O | | O |
| HARD_SWISH | | O | O | O | O | | O |
| L2_NORMALIZATION | | | | | O | | O |
| L2_POOL_2D | O | O | O | O | | | |
| LEAKY_RELU | | | | | | | O |
| LESS | | | | | O | | O |
| LESS_EQUAL | | | | | O | | O |
| LOCAL_RESPONSE_NORMALIZATION | | | | | | | |
| LOG | | | | | | | O |
| LOG_SOFTMAX | | | | | | | O |
| LOGICAL_AND | | | | | | | O |
| LOGICAL_NOT | | | | | | | O |
| LOGICAL_OR | | | | | | | O |
| LOGISTIC | O | O | O | O | O | | O |
| LSTM (QSTM) | | O | O | O | O | | O |
| MAX_POOL_2D | O | O | O | O | O | | O |
| MAXIMUM | O | O | O | O | O | | O |
| MEAN | O | O | O | O | O | | O |
| MINIMUM | O | O | O | O | O | | O |
| MIRRORPAD | | O | O | O | | | |
| MUL | O | O | O | O | O | O | O |
| NEG | | O | O | O | | | |
| NOT_EQUAL | | | | | O | | O |
| PACK | | O | O | O | O | | |
| PAD | O | O | O | O | O | | |
| POW | | O | O | O | O (SQRT) | | |
| PRELU | O | O | O | O | O | | O |
| QUANTIZE | | O | O | O | O | O | O |
| REDUCE_ANY | | | | | O | | O |
| REDUCE_MAX | | O | O | O | O | | O |
| REDUCE_MIN | | O | O | O | O | | O |
| RELU | O | O | O | O | O | | O |
| RELU_N1_TO_1 | O | O | O | O | O | | O |
| RELU6 | O | O | O | O | O | | O |
| RESHAPE | O | O | O | O | O | | O |
| RESIZE_BILINEAR | O | O | O | O | O | | |
| RESIZE_NEAREST | | O | O | O | O | | |
| ROUND | | | | | | | O |
| RSQRT | | | O | O | O | O | |
| SELECT | | | | | | | O |
| SLICE | O | O | O | O | O | | |
| SOFTMAX | | | O | O | O | O | O |
| SPACE_TO_BATCH_ND | | O | O | O | O | | |
| SPACE_TO_DEPTH | | O | O | O | O | | O |
| SPLIT | | O | O | O | O | | O |
| SPLIT_V | | O | O | O | O | | O |
| SQRT | | | O | O | O | O | |
| SQUARE | | O | O | O | O | O | |
| SQUARED_DIFFERENCE | | O | O | O | | | |
| SQUEEZE | | O | O | O | O | | |
| STRIDED_SLICE | | O | O | O | O | | |
| SUB | O | O | O | O | O | | O |
| SUM | | | | | | | O |
| TANH | O | O | O | O | O | | O |
| TILE | | | | | O | | |
| TOPK_V2 | | | | | O | | O |
| TRANSPOSE | | O | O | O | O | O | |
| TRANSPOSE_CONV | O | O | O | O | O | | |
| UNPACK | | O | O | O | | | |


2.2.2. MDLA 3.0 Guidelines

Note:

The following limitations may not match the MDLA hardware constraints exactly. Neuron might include software workarounds for MDLA hardware, or additional limitations that come from the current software implementation.

2.2.2.1. General Restrictions

The following limitations apply to all MDLA 3.0 operations, grouped by category.

Tensor Rank

Supported tensor ranks:

  • For operation (OP) types Conv3D, AvgPool3D, L2Pool3D, MinPool3D, and MaxPool3D: 5-D

  • For all other OP types: 0-D, 1-D, 2-D, 3-D, 4-D

Batch Size (N)

Valid batch sizes:

  • FULLY_CONNECTED: {1, 2, 4, 8}. FULLY_CONNECTED with any other batch size is converted to OP CONV_2D.

  • CONV_2D, DEPTHWISE_CONV_2D, TRANSPOSE_CONV: No batch size limit. If the batch size is greater than 65535, the OP is split into multiple OPs.

  • All other OPs: [1, 65535]

Height Size (H)

Valid range for input and output activations: [1, 65535]

Width Size (W)

Valid range for input and output activations: [1, 65535]

Channel Size (C)

Valid range for input and output activations: [1, 65535]

Data Type

Supported data types:

  • Asymmetric unsigned 8-bit

  • Asymmetric signed 8-bit

  • Symmetric signed 8-bit

  • Symmetric signed 16-bit

  • Symmetric signed 16-bit activation + Symmetric signed 8-bit weight

  • 16-bit floating point (FP16)

  • 32-bit floating point

    • Converted to FP16 if relax-FP32 is enabled

Per Channel Quantization

Only the following OPs support per channel quantization:

  • CONV_2D

  • DEPTHWISE_CONV_2D

  • TRANSPOSE_CONV

  • FULLY_CONNECTED

  • MTK_TRANSPOSE_CONV

  • PRELU

MDLA Hardware Buffer

MDLA has different internal buffers for different uses. If there is not a buffer of sufficient size for an operation, then MDLA cannot run the operation and reports “Unsupported”. To avoid internal buffer constraints:

  • Keep the input channel size small.

  • For operations that have stride, such as convolution and pooling, keep the stride values small in both width and height.

  • Keep filter size small in both width and height, especially for convolution-like operations.

2.2.2.2. Supported OPs Specification

Each entry below lists, in order: the OP name, the corresponding TFLite OP, the corresponding NNAPI operation (where one exists), and the restrictions that apply.

Abs

ABS

ABS

None

AvgPooling

AVERAGE_POOL_2D

AVERAGE_POOL_2D

  • Only NHWC format is supported.

  • Filter shape, stride, and paddings attributes must meet the following conditions:

    • If filter size is equal to input size (both H and W dimensions in output are equal to 1):

      • For quantized types: The input_height * input_width must be in the range [1, 2^20].

      • For floating-point types: The input_height and input_width must each satisfy one of the following constraints to avoid accuracy issues (a checker sketch follows this table):

        • input_height (or input_width) must be less than or equal to S, where S = 64.

        • input_height (or input_width) must be factorable in the form of “2^a * 3^b * 5^c * 7^d * N”, where N is 1 or a prime number less than or equal to S.

    • If filter size is not equal to input size:

      • Filter shape height and width must be in the range [1, 8].

      • Stride height must be in the range [1, filter_height].

      • Stride width must be in the range [1, filter_width].

      • Top and bottom paddings must be in the range [0, filter_height-1].

      • Left and right paddings must be in the range [0, filter_width-1].

BatchToSpace

BATCH_TO_SPACE_ND

BATCH_TO_SPACE_ND

Only NHWC format is supported.

Concat

CONCATENATION

CONCATENATION

None

Conv2D

CONV_2D

CONV_2D

  • NHWC and NCHW formats are supported.

  • Filter size

    • If stride is not 1x1, filter height and width must be in the range [1, 25].

    • Otherwise, filter width must be in the range [1, 31].

  • Stride

    • If dilation rate is equal to 1, stride height and width must be in {1, 2, 3, 4, 8}.

  • Padding

    • For 1x1 filter, there must be no padding.

    • Otherwise, padding must be in the range [0, 15].

  • Dilation rate

    • Dilation rate height must be in {1, 2, 4, 8}.

    • Dilation rate width must be in {1, 2, 4, 8}.

    • There are no limitations if the ncc-tflite option "--use-sw-dilated-conv" is enabled. This option applies a software solution for dilated convolution.

DepthwiseConv2D

DEPTHWISE_CONV_2D

DEPTHWISE_CONV_2D

  • Filter size

    • Filter height and width must be in the range [1, 25].

  • Channel multiplier

    • Channel multiplier must be in {1, 2, 4, 8, 16}.

    • Otherwise, channel multiplier must be equal to output channel (i.e., input channel is 1).

  • Other constraints are the same as for CONV_2D.

DepthToSpace

DEPTH_TO_SPACE

DEPTH_TO_SPACE

  • Only NHWC format is supported.

  • Input batch must be 1.

  • Output batch must be 1.

Dequantize

DEQUANTIZE

DEQUANTIZE

Input cannot be per channel quantization.

ElementWiseAdd

ADD

ADD

See Limitations of Broadcasting.

ElementWiseDiv

DIV

DIV

  • We recommend not applying this operation for quantized types because of accuracy issues.

  • See Limitations of Broadcasting.

ElementWiseMul

MUL

MUL

See Limitations of Broadcasting.

ElementWiseSub

SUB

SUB

  • See Limitations of Broadcasting.

  • The scale of input1 (minuend) must be greater than or equal to the scale of input2 (subtrahend).

Elu

ELU

ELU

None

FullyConnected

FULLY_CONNECTED

FULLY_CONNECTED

  • Filter input channel (i.e. the 2nd dimension of filter) must be in the range [1, 1048575].

  • FULLY_CONNECTED with dynamic weight is converted to CONV_2D.

  • Bias must be a constant tensor.

HardSwish

HARD_SWISH

HARD_SWISH

None

L2Pooling

L2_POOL_2D

L2_POOL_2D

  • Same as AVERAGE_POOL_2D, except that if the filter size is equal to the input size (both H and W dimensions in the output are equal to 1), then filter_height * filter_width must be in the range [1, 2^10].

  • Input activation with floating point data type is unsupported.

MaxPooling

MAX_POOL_2D

MAX_POOL_2D

  • Same as AVERAGE_POOL_2D.

  • Additionally supported: input dimension equal to output dimension, with SAME padding and stride 1

Maximum

MAXIMUM

MAXIMUM

See Limitations of Broadcasting.

Mean

MEAN

MEAN

None

Minimum

MINIMUM

MINIMUM

See Limitations of Broadcasting.

MirrorPad

MIRRORPAD

MIRRORPAD

Supported tensors: 4-D with padding on height or width direction.

Neg

NEG

NEG

None

Pack

PACK

(no NNAPI equivalent)

Cannot pack at last dimension.

Pad

PAD
PADV2

PAD
PAD_V2

None

Pow

POW

POW

Exponent must be a constant integer.

PRelu

PRELU

PRELU

  • Alpha must be a constant.

  • Alpha must be a scalar (0-D) or 1-D tensor.

QLSTM (5 inputs)

LSTM

QUANTIZED_16BIT_LSTM

The last dimension of input + the last dimension of output scratch must be:

  • 16-aligned

  • In the range [1, 1048575]

Quantize

QUANTIZE

QUANTIZE

None

ReduceMax

REDUCE_MAX

REDUCE_MAX

The size before reduced axis must be less than 65536.

ReduceMin

REDUCE_MIN

REDUCE_MIN

The size before reduced axis must be less than 65536.

ReLU
ReLU1
ReLU6

RELU
RELU_N1_TO_1
RELU6

RELU
RELU1
RELU6

None

Reshape

RESHAPE

RESHAPE

None

Resize::BILINEAR

RESIZE_BILINEAR

RESIZE_BILINEAR

  • Only NHWC format is supported.

  • Input Height must be in the range [1, 8192].

  • Input Width must be in the range [1, 8192].

Resize::NEAREST

RESIZE_NEAREST_NEIGHBOR

RESIZE_NEAREST_NEIGHBOR

  • Only NHWC format is supported.

  • Input Height must be in the range [1, 8192].

  • Input Width must be in the range [1, 8192].

RSqrt

RSQRT

RSQRT

None

Sigmoid

LOGISTIC

LOGISTIC

None

Slice

SLICE

SLICE

None

SoftMax

SOFTMAX

SOFTMAX

  • Axis must be -1; this means softmax is applied only along the input channel dimension.

  • Quantized types are dequantized to FP16 due to an accuracy issue.

SpaceToBatch

SPACE_TO_BATCH_ND

SPACE_TO_BATCH_ND

  • Only NHWC format is supported.

  • Input batch must be 1.

SpaceToDepth

SPACE_TO_DEPTH

SPACE_TO_DEPTH

  • Only NHWC format is supported.

  • Input batch must be 1.

Split

SPLIT

SPLIT

None

Sqrt

SQRT

SQRT

None

Square

SQUARE

(no NNAPI equivalent)

None

SquaredDifference

SQUARED_DIFFERENCE

(no NNAPI equivalent)

None

StridedSlice

STRIDED_SLICE

STRIDED_SLICE

Stride on the last dimension is unsupported.

Sum

SUM

SUM

None

Tanh

TANH

TANH

For quantized types, InputScale/OutputScale must be less than 842.

Transpose

TRANSPOSE

TRANSPOSE

None

TransposeConv2D

TRANSPOSE_CONV

TRANSPOSE_CONV_2D

  • Weight must be a constant tensor

  • Filter size

    • Filter height and width must be in the range [1, 25].

  • Stride

    • Stride height must be less than or equal to filter height.

    • Stride width must be less than or equal to filter width.

  • Other constraints are the same as for CONV_2D.

Unpack

UNPACK

(no NNAPI equivalent)

Cannot unpack at last dimension.
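
As a concrete reading of the AVERAGE_POOL_2D floating-point restriction above (the “2^a * 3^b * 5^c * 7^d * N” factorability rule), the sketch below checks whether a spatial dimension satisfies the rule with S = 64. This is an illustrative helper only, not part of the Neuron SDK.

```cpp
// Checks the MDLA global-average-pooling rule for floating-point types:
// a dimension is valid if it is <= S, or if stripping all factors of
// 2, 3, 5, and 7 leaves a remainder N that is 1 or a prime <= S.
// Illustrative only; not part of the Neuron SDK.
#include <cstdio>

static bool IsPrime(int n) {
  if (n < 2) return false;
  for (int d = 2; d * d <= n; ++d)
    if (n % d == 0) return false;
  return true;
}

static bool PoolDimSupported(int dim, int s = 64) {
  if (dim < 1) return false;
  if (dim <= s) return true;
  for (int f : {2, 3, 5, 7})
    while (dim % f == 0) dim /= f;  // strip 2^a * 3^b * 5^c * 7^d
  return dim == 1 || (dim <= s && IsPrime(dim));
}

int main() {
  std::printf("224 -> %s\n", PoolDimSupported(224) ? "ok" : "unsupported");  // 224 = 2^5 * 7
  std::printf("149 -> %s\n", PoolDimSupported(149) ? "ok" : "unsupported");  // prime > 64
  return 0;
}
```
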

2.2.2.3. Limitations of Broadcasting

  • Only broadcasting from a small tensor to a large tensor with compatible dimensions is supported.

    • Example 1: Input1 broadcasting to Input2 is supported.

    • Example 2: Input2 broadcasting to Input1 is supported.

    • Example 3: Input1 and Input2 broadcasting to each other is unsupported.

  • Hardware broadcasting is supported if either of the following conditions is met (a shape-check sketch follows this list):

    1. The small tensor has one of the following shapes:

      • []

      • [1]

      • [C]

      • [1, C]

      • [1, 1, C]

      • [1, 1, 1, C]

    2. The small tensor is broadcast on the batch or channel dimension.

      • Example 1: The shape of the small tensor is [1,H,W,C], where H,W,C are not equal to 1.

      • Example 2: The shape of the small tensor is [N,H,W,1], where N,H,W are not equal to 1.

      • Example 3: The shape of the small tensor is [1,H,W,1], where H,W are not equal to 1.

  • If the conditions for hardware broadcasting are not met, broadcasting is processed by software using multiple SPLIT and CONCAT.

    • If the small tensor is constant, the broadcasting is done at compile time. Bandwidth requirements might be larger at runtime.

    • If the small tensor is not constant, there are extra runtime DMA overheads.
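
The hardware-broadcasting conditions above reduce to a simple test on the shape of the smaller tensor. The sketch below is one illustrative reading of those rules for 4-D NHWC shapes (missing leading dimensions treated as 1); it is not SDK code.

```cpp
// Illustrative restatement of the MDLA hardware-broadcasting conditions
// for the "small" tensor in an element-wise OP, as a 4-D NHWC shape.
// Not SDK code; shapes with fewer dimensions use 1 for leading entries.
#include <array>
#include <cstdio>

// Condition 1: [], [1], [C], [1,C], [1,1,C], or [1,1,1,C] -- every
// dimension except the channel dimension is 1.
static bool IsChannelVector(const std::array<int, 4>& s) {
  return s[0] == 1 && s[1] == 1 && s[2] == 1;
}

// Condition 2: broadcast on the batch or channel dimension, e.g.
// [1,H,W,C], [N,H,W,1], or [1,H,W,1].
static bool BroadcastsOnBatchOrChannel(const std::array<int, 4>& s) {
  return s[0] == 1 || s[3] == 1;
}

static bool HardwareBroadcastSupported(const std::array<int, 4>& small) {
  return IsChannelVector(small) || BroadcastsOnBatchOrChannel(small);
}

int main() {
  // [1,1,1,64]: channel vector -> hardware broadcast supported.
  std::printf("%d\n", HardwareBroadcastSupported({1, 1, 1, 64}));
  // [2,56,1,64]: broadcast on W only -> software path (SPLIT/CONCAT).
  std::printf("%d\n", HardwareBroadcastSupported({2, 56, 1, 64}));
  return 0;
}
```
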


2.2.3. MVPU 2.0 Guidelines

2.2.3.1. General Restrictions

The following limitations apply to all MVPU 2.0 operations, grouped by category.

Tensor Rank

Supported tensor ranks: 1-D, 2-D, 3-D, 4-D

Batch Size (N)

  • Dynamic shape is not supported.

  • Valid range for input and output activations: [1, 65535]

Height Size (H)

  • Dynamic shape is not supported.

  • Valid range for input and output activations: [1, 65535]

Width Size (W)

  • Dynamic shape is not supported.

  • Valid range for input and output activations: [1, 65535]

Channel Size (C)

  • Dynamic shape is not supported.

  • Valid range for input and output activations: [1, 65535]

Data Type

Supported data types:

  • Asymmetric unsigned 8-bit (Asym U8)

  • Asymmetric signed 8-bit (Asym I8)

  • Symmetric 8-bit (Sym I8)

  • Symmetric 16-bit (Sym I16)

  • 8-bit boolean for logical operations (Bool 8)

  • 16-bit floating point (FP16)

  • 32-bit floating point (FP32)

  • 32-bit integer (Int32)

Data Format

Only NHWC format is supported.

2.2.3.2. TensorFlow and TFLite Operations

Each entry below lists the operation name(s), followed by the restrictions that apply, including supported quantization and floating-point data types.

ADD

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Support requantization

  • One of input tensors can be constant

  • Only support inputScale / (2 * maxInputScale) < 1

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Input: FP16

    • Output: FP16

ARG_MAX

ARG_MIN

  • Maximum input/output tensor rank: 4D

  • Do not support batch axis

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Int32

  • Floating data type

    • Input: FP16 / FP32

    • Output: Not supported

AVERAGE_POOL_2D

  • Input and output must be 4D

  • Filter W, H = [1:128]

  • Stride W = H

  • Support requantization

  • Support RELU/RELU1/RELU6 fusion

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

CAST

  • Maximum input/output tensor rank: 4D

  • Input must not be a constant

  • Quantization data type

    • Input: Asym U8 / Asym I8 / Int32

    • Output: Asym U8 / Asym I8 / Int32

  • Floating data type

    • Support casting from FP32 to Int32

    • Support casting from Int32 to FP32

    • Support casting from Int32 to FP16

CEIL

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Not supported

  • Floating data type

    • Input Value: FP16 / FP32

    • Output: FP16 / FP32

CROP_AND_RESIZE

  • Input and output must be 4D

  • Input boxes must be 2D

  • I/O must be in the same scale and zero point

  • The number of input box indices must be the same as the output batch size

  • Quantization data type

    • Input: Asym U8

    • Input boxes: FP32

    • Input box indices: Int32

    • Output value: Asym U8

  • Floating data type

    • Not supported

CHANNEL_SHUFFLE

  • Maximum input/output tensor rank: 4D

  • Input and output must be in the same scale and zero point

  • group_size = num_channels / num_groups

  • The number of channels must be divisible by num_groups.

  • The input scalar (the dimension along which to shuffle channels) must be in the range [-n, n-1]

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

DEPTH_TO_SPACE

  • Input and output must be 4D

  • Block size >= 1

  • Input and output must be in the same scale and zero point

  • Quantization data type

    • Not supported

  • Floating data type

    • Input Value: FP16 / FP32

    • Output: FP16 / FP32

DEQUANTIZE

  • Maximum input/output tensor rank: 4D

  • Input and Output must have the same shape

  • Input scale must be > 0

  • Per-channel quantization is not supported

  • Quantization data type

    • Not supported

  • Floating data type

    • Input: Asym U8 / Asym I8

    • Output: FP16

DIV

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Quantization data type

    • Not supported

  • Floating data type

    • Input: FP16

    • Output: FP16

EQUAL

NOT_EQUAL

GREATER

GREATER_EQUAL

LESS

LESS_EQUAL

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Quantization data type

    • Input: Asym U8 / Asym I8 / Bool 8 / Int32

    • Output: Bool 8

  • Floating data type

    • Not supported

EXP

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Not supported

  • Floating data type

    • Input Value: FP16 / FP32

    • Output: FP16 / FP32

FILL

  • The n-D output can be reshaped to 4-D, with each dimension < 16 bits

  • Input(fill value) and output must be in the same scale and zero point

  • Input 0 (describing output shape) must be 1D

  • Quantization data type

    • Input: Asym U8 / Asym I8 / Asym U16 / Asym I16 / UInt32 / Int32

    • Output: Asym U8 / Asym I8 / Asym U16 / Asym I16 / UInt32 / Int32

  • Floating data type

    • Input Value: FP16/ FP32

    • Output: FP16 / FP32

FLOOR

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Not supported

  • Floating data type

    • Input Value: FP16

    • Output: FP16

GATHER

  • Maximum input/output tensor rank: 4D

  • Input and output must be in the same scale and zero point

  • Only support single batch.

  • Do not support gathering in batch axis

  • Axis must be smaller than input rank

  • Input dimension + Indices dimension <= 4

  • Support requantization

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

HARD_SWISH

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Input: Asym U8 / Asym I8 / Sym I8

    • Output: Asym U8 / Asym I8 / Sym I8

  • Floating data type

    • Not supported

L2_NORMALIZATION

  • Maximum input/output tensor rank: 4D

  • Axis must be constant

  • Only support single axis, and axis can not be batch dimension

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

LEAKY_RELU

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

LOG

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Not supported

  • Floating data type

    • Input Value: FP16 / FP32

    • Output: FP16 / FP32

LOG_SOFTMAX

  • Support 2D/4D output

  • Only supported axis is the channel dimension

  • Beta > 0

  • Support RESHAPE fusion in front of OP

  • Quantization data type

    • Not supported

  • Floating data type

    • Input: FP32

    • Output: FP32

LOGICAL_AND

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Quantization data type

    • Input: Bool 8

    • Output: Bool 8

  • Floating data type

    • Not supported

LOGICAL_NOT

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Quantization data type

    • Input: Bool 8

    • Output: Bool 8

  • Floating data type

    • Not supported

LOGICAL_OR

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Quantization data type

    • Input: Bool 8

    • Output: Bool 8

  • Floating data type

    • Not supported

LOGISTIC

  • Maximum input/output tensor rank: 4D

  • Output scale = 1/256, Output zeropoint = 0 for Asym U8

  • Output scale = 1/256, Output zeropoint = -128 for Asym I8

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

MAX_POOL_2D

  • Input and output must be 4D

  • Weight W, H = [1:16]

  • Stride W = H

  • Support requantization

  • Support RELU/RELU1/RELU6 fusion

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Weight: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

MAXIMUM

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Support requantization

  • One input tensor can be constant

  • Only support inputScale / (2 * maxInputScale) < 1

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

MEAN

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

MINIMUM

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Support requantization

  • One input tensor can be constant

  • Only support inputScale / (2 * maxInputScale) < 1

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

MUL

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Support requantization

  • One input tensor can be constant

  • Only support inputProdScale / outputScale < 1

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Input: FP16 / FP32

    • Output: FP16 / FP32

PRELU

  • Maximum input/output tensor rank: 4D

  • Broadcast is only supported if the following conditions are true:

    • Alpha tensor rank must be 0D or 1D

    • Alpha tensor size must be 1 or the same as input channel size

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

QLSTM

  • A version of quantized LSTM, using 16 bit quantization for internal state

  • Supports NNAPI v1.2 behavior

  • Projection is not supported

  • Layer normalization is not supported

  • 16 inputs and 2 outputs

  • Quantization data type

    • Input: Asym U8

    • Output Cell State: Sym I16

    • Output Value: Asym U8

  • Floating data type

    • Not supported

QUANTIZE

  • Maximum input/output tensor rank: 4D

  • Input and Output must have the same shape

  • Output scale must be > 0

  • Per-channel quantization is not supported

  • Quantization data type

    • Not supported

  • Floating data type

    • Input: FP16

    • Output: Asym U8 / Asym I8

REDUCE_ANY

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Input: Bool 8

    • Output: Bool 8

  • Floating data type

    • Not supported

REDUCE_MAX

REDUCE_MIN

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

RELU

RELU1

RELU6

  • Maximum input/output tensor rank: 4D

  • Input and output must be in the same scale and zero point

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

RESHAPE

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Input: Asym U8 / Asym I8 / Asym U16 / Asym I16 / Bool 8 / UInt32 / Int32

    • Output: Asym U8 / Asym I8 / Asym U16 / Asym I16 / Bool 8 / UInt32 / Int32

  • Floating data type

    • Input: FP16 / FP32

    • Output: FP16 / FP32

ROUND

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Not supported

  • Floating data type

    • Input Value: FP16 / FP32

    • Output: FP16 / FP32

SELECT

  • Maximum input/output tensor rank: 4D

  • Input and output must be the same shape

  • One input tensor can be constant

  • Support requantization

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Input condition: Bool 8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

SOFTMAX

  • Support 2D/4D output

  • Only supported axis is the channel dimension

  • Beta > 0

  • Support RESHAPE fusion in front of OP

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Input: FP16 / FP32

    • Output: FP16 / FP32

SPACE_TO_DEPTH

  • Input and output must be 4D

  • Block size >= 1

  • Input and output must be in the same scale and zero point

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

SPLIT

  • mNumSplits = number of outputs

  • Input rank = Output rank

  • Support n-D Input0, but its merged shape rank must be >= 2 and <= 4

  • Input1 (axis) must be a constant scalar

  • Quantization data type

    • Input0: Asym U8 / Asym I8 / Asym U16 / Asym I16 / UInt32 / Int32

    • Input1: Int32

    • Output: Asym U8 / Asym I8 / Asym U16 / Asym I16 / UInt32 / Int32

  • Floating data type

    • Input: FP16 / FP32

    • Output: FP16 / FP32

SPLIT_V

  • mNumSplits = number of outputs

  • Input rank = Output rank

  • Support n-D Input0, but its merged shape rank must be >= 2 and <= 4

  • Input1 (axis) must be a constant scalar

  • Quantization data type

    • Input0: Asym U8 / Asym I8 / Asym U16 / Asym I16 / UInt32 / Int32

    • Input1: Int32

    • Output: Asym U8 / Asym I8 / Asym U16 / Asym I16 / UInt32 / Int32

  • Floating data type

    • Input: FP16 / FP32

    • Output: FP16 / FP32

SUB

  • Maximum input/output tensor rank: 4D

  • Support broadcast operation

  • Support requantization

  • One input tensor can be constant

  • Only support inputScale / (2 * maxInputScale) < 1

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

SUM

  • Maximum input/output tensor rank: 4D

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Not supported

TANH

  • Maximum input/output tensor rank: 4D

  • Output scale = 1/128, Output zeropoint = 128 for Asym U8

  • Output scale = 1/128, Output zeropoint = 0 for Asym I8

  • Quantization data type

    • Input: Asym U8 / Asym I8

    • Output: Asym U8 / Asym I8

  • Floating data type

    • Input: FP16 / FP32

    • Output: FP16 / FP32

TOPK_V2

  • Input and output must be 4D

  • Output values and indices must have the same dimensions

  • Batch size must be the same for both input and output

  • K value must be in the range [1, the size of last input dimension]

  • Quantization data type

    • Input: Asym U8

    • Output value: Asym U8

    • Output indices: Int32

  • Floating data type

    • Not supported

2.3. NeuroPilot SDK

The D9000 supports the following versions of NeuroPilot and Android.

  • NeuroPilot 5 for Android S