
April 6, 2021 | @QCOMResearch

Intelligence at scale through AI model efficiency
Qualcomm Technologies, Inc.

Agenda
- Why efficient machine learning is necessary for AI to proliferate
- Our latest research to make AI models more efficient
- Our open-source projects to scale efficient AI

AI is being used all around us
Increasing productivity, enhancing collaboration, and transforming industries: video monitoring, extended reality, smart cities, smart factories, autonomous vehicles, video conferencing, smart homes, smartphones.

AI video analysis is on the rise
Trend toward more cameras, higher resolution, and increased frame rate across devices.

Deep neural networks are energy hungry and growing fast
AI is being powered by the explosive growth of deep neural networks. Will we have reached the capacity of the human brain? The energy efficiency of a brain is 100x better than current hardware.

[Chart: weight parameter count, log scale 10^0 to 10^14, years 1940-2030]
- 1943: First NN (N ≈ 10)
- 1988: NetTalk (N ≈ 20K)
- 2009: Hinton's Deep Belief Net (N ≈ 10M)
- 2013: Google/Y! (N ≈ 1B)
- 2017: Very large neural networks (N = 137B)
- 2021: Extremely large neural networks (N = 1.6T)
- 2025: N = 100T = 10^14?

Source: Welling

Power and thermal efficiency are essential for on-device AI

The challenge of AI workloads:
- Very compute intensive
- Large, complicated neural network models
- Complex concurrencies
- Always-on
- Real-time

Constrained mobile environment:
- Must be thermally efficient for sleek, ultra-light designs
- Storage/memory bandwidth limitations
- Requires long battery life for all-day use

Holistic model efficiency research
Multiple axes to shrink AI models and efficiently run them on hardware:
- Quantization: learning to reduce bit-precision while keeping the desired accuracy
- Compression: learning to prune the model while keeping the desired accuracy
- Compilation: learning to compile AI models for efficient hardware execution
- Neural architecture search: learning to design smaller neural networks that are on par with or outperform hand-designed architectures on real hardware

Leading research to efficiently quantize AI models
Automated reduction in the precision of weights and activations while maintaining accuracy.

Promising results show that low-precision integer inference can become widespread: virtually the same accuracy between an FP32 and a quantized AI model through
- automated, data-free, post-training methods, and
- an automated training-based mixed-precision method,
with significant performance-per-watt improvements through quantization. (1)

Models are trained at high precision (32-bit floating point); inference then runs at lower precision. The savings in memory and compute increase performance per watt: (1)
- 16-bit integer: up to 4X
- 8-bit integer: up to 16X
- 4-bit integer: up to 64X
(A sketch of the underlying integer-grid arithmetic follows this slide.)

1: FP32 model compared to quantized model.
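To make the mechanics concrete, here is a minimal sketch of uniform affine quantization, the basic operation behind the integer formats above. It is a generic illustration with hypothetical tensor values, not code from the deck.

    import numpy as np

    def quantize(x: np.ndarray, bits: int):
        """Uniform affine quantization: map float values onto a 2^bits integer grid."""
        qmin, qmax = 0, 2 ** bits - 1
        scale = (x.max() - x.min()) / (qmax - qmin)   # step size of the integer grid
        zero_point = round(-x.min() / scale)          # integer that represents 0.0
        x_int = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int32)
        return x_int, scale, zero_point

    def dequantize(x_int, scale, zero_point):
        return (x_int.astype(np.float32) - zero_point) * scale

    w = np.random.randn(64).astype(np.float32)        # hypothetical FP32 weights
    for bits in (16, 8, 4):
        w_int, s, z = quantize(w, bits)
        err = np.abs(w - dequantize(w_int, s, z)).max()
        print(f"INT{bits}: max abs rounding error = {err:.5f}")

Fewer grid points means larger rounding error, which is exactly why the data-free, rounding-aware, and mixed-precision methods on the next slide are needed to keep accuracy at 8 and 4 bits.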

Pushing the limits of what's possible with quantization

Data-free quantization
How can we make quantization as simple as possible? We created an automated method that addresses bias and imbalance in weight ranges:
- No training
- Data free
Making 8-bit weight quantization ubiquitous: <1% accuracy drop for MobileNet V2 against the FP32 model.
(Data-Free Quantization Through Weight Equalization and Bias Correction, Nagel, van Baalen, et al., ICCV 2019)

AdaRound
Is rounding to the nearest value the best approach for quantization? We created an automated method for finding the best rounding choice (its per-layer objective is sketched after this slide):
- No training
- Minimal unlabeled data
- SOTA 8-bit results
Making 8-bit weight quantization ubiquitous: <2.5% accuracy drop for MobileNet V2 against the FP32 model.
(Up or Down? Adaptive Rounding for Post-Training Quantization, Nagel, Amjad, et al., ICML 2020)

Bayesian bits
Can we quantize layers to different bit widths based on precision sensitivity? We created a novel method to learn mixed-precision quantization:
- Training required
- Training data required
- Jointly learns bit-width precision and pruning
- SOTA mixed-precision results
Automating mixed-precision quantization and enabling the tradeoff between accuracy and kernel bit-width. Making 4-bit weight quantization ubiquitous: <1% accuracy drop for MobileNet V2 against the FP32 model, for a mixed-precision model with computational complexity equivalent to a 4-bit weight model.
(Bayesian Bits: Unifying Quantization and Pruning, van Baalen, Louizos, et al., NeurIPS 2020)

SOTA: state-of-the-art
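For reference, the AdaRound idea can be written compactly. The following sketch follows the general form of the cited ICML 2020 paper (notation reconstructed from the paper, not from the deck): each weight learns whether to round down or up.

    \widetilde{W} = s \cdot \mathrm{clip}\!\left(\left\lfloor W/s \right\rfloor + h(V),\; n,\; p\right),
    \qquad
    \min_{V}\; \left\lVert Wx - \widetilde{W}x \right\rVert_F^2
    + \lambda \sum_{i,j} \left(1 - \left|\,2\,h(V_{i,j}) - 1\,\right|^{\beta}\right)

Here s is the quantization step, n and p are the integer grid limits, and h(V) is a rectified sigmoid in [0, 1] that decides the rounding direction per weight; the regularizer anneals h(V) toward hard 0/1 choices. Consult the paper for the exact annealing schedule.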

Optimizing and deploying state-of-the-art AI models for diverse scenarios at scale is challenging
- Neural network complexity: many state-of-the-art neural network solutions are large, complex, and do not run efficiently on target hardware.
- Neural network diversity: for different tasks and use cases, many different neural networks are required.
- Device diversity: deploying neural networks to many different devices, with different configurations and changing software, is required.
- Cost: compute and engineering resources for training plus evaluation are too costly and time consuming.

NAS: Neural Architecture Search
An automated way to learn a network topology that can achieve the best performance on a certain task.
- Search space: the set of operations and how they can be connected to form valid network architectures
- Search algorithm: a method for sampling a population of good network-architecture candidates
- Evaluation strategy: a method to estimate the performance of sampled network architectures
A minimal sketch of how these three components fit together follows this list.
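The sketch below wires the three components together in the simplest possible way: uniform random sampling as the search algorithm and a stubbed evaluation strategy. All names are hypothetical; it shows the division of labor, not any specific NAS method.

    import random

    # Search space: operations per layer and how deep the network may be.
    SEARCH_SPACE = {"op": ["conv3x3", "conv5x5", "mbconv", "identity"],
                    "depth": [6, 8, 12]}

    def sample_architecture():
        """Search algorithm (here: uniform random sampling of one candidate)."""
        depth = random.choice(SEARCH_SPACE["depth"])
        return [random.choice(SEARCH_SPACE["op"]) for _ in range(depth)]

    def evaluate(arch):
        """Evaluation strategy (stub): a real system would train the candidate
        or query an accuracy predictor; here we return a placeholder score."""
        return random.random()

    best = max((sample_architecture() for _ in range(100)), key=evaluate)
    print("best candidate:", best)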

Existing NAS solutions do not address all the challenges
- High cost: brute-force search is expensive (>40,000 epochs per platform).
- Lack of diverse search: hard to search in diverse spaces with different block types, attention, and activations; repeated training phase for every new scenario.
- Do not scale: repeated training phase for every new device (>40,000 epochs per platform).
- Unreliable hardware models: require differentiable cost functions; repeated training phase for every new device.

Introducing new AI research: DONNA
Distilling Optimal Neural Network Architectures: efficient NAS with hardware-aware optimization.

A scalable method that finds Pareto-optimal network architectures, in terms of accuracy and latency, for any hardware platform at low cost. DONNA starts from an oversized pretrained reference architecture and produces a set of Pareto-optimal network architectures.

- Low cost: low start-up cost of 1,000-4,000 epochs, equivalent to training 2-10 networks from scratch
- Diverse search to find the best models: supports diverse spaces with different cell types, attention, and activation functions (ReLU, Swish, etc.)
- Scalable: scales to many hardware devices at minimal cost
- Reliable hardware measurements: uses direct hardware measurements instead of a potentially inaccurate hardware model

(Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces, Moons, Bert, et al., arXiv 2020)

DONNA 4-step process
Objective: build an accuracy model of the search space once, then deploy to many scenarios.

Step A: define the reference and search space once.
- Varying parameters: kernel size, expansion factors, network depth, network width, attention/activation, different efficient layer types
- Define the backbone: fixed channels, head and stem, blocks 1-5

Define the reference architecture and search space once
A diverse search space is essential for finding optimal architectures with higher accuracy.
- Select the reference architecture: the largest model in the search space.
- Chop the NN into blocks: fix the STEM, HEAD, # blocks, strides, and # channels at block edges.
- Choose a diverse factorized hierarchical search space, including variable kernel size, expansion rate, depth, # channels, cell type, activation, and attention.

[Diagram: STEM (Conv3x3 s2 + DWConv, ch=32); blocks 1-5 with strides s=2/s=1 and block-edge channels ch=32, 64, 96, 128, 196, 256; HEAD (Conv1x1 ch=1536, Avg, FC)]
Search options per block: kernel 3, 5, 7; expand 2, 3, 4, 6; depth 1, 2, 3, 4; attention SE / no SE; activation ReLU/Swish; cell type grouped, DW, ...; width scale 0.5x, 1.0x.

Ch: channel; SE: Squeeze-and-Excitation
A sampling sketch of such a factorized block space follows.
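A minimal sketch of sampling one candidate from a factorized block-level space like the one above; the option lists mirror the slide, while the names and structure are hypothetical.

    import random

    # Per-block options, taken from the slide; each block is configured independently.
    BLOCK_OPTIONS = {
        "kernel": [3, 5, 7],
        "expand": [2, 3, 4, 6],
        "depth": [1, 2, 3, 4],
        "attention": ["SE", "none"],
        "activation": ["ReLU", "Swish"],
        "cell_type": ["grouped", "DW"],
        "width_scale": [0.5, 1.0],
    }
    NUM_BLOCKS = 5  # STEM, HEAD, strides, and block-edge channels stay fixed

    def sample_candidate():
        """One architecture = one independent choice per option per block."""
        return [{k: random.choice(v) for k, v in BLOCK_OPTIONS.items()}
                for _ in range(NUM_BLOCKS)]

    print(sample_candidate())

The factorization is what keeps the space searchable: blocks are configured independently, so block-level results (weights, quality metrics) can be reused across candidates.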

DONNA 4-step process
Objective: build an accuracy model of the search space once, then deploy to many scenarios.

Step B: build an accuracy model via knowledge distillation (KD) once.
- Approximate ideal projections of the reference model through KD: each candidate block is trained to match the output of the corresponding reference block (blockwise MSE for blocks 1-5).
- Use the quality of the blockwise approximations to build the accuracy model.
A blockwise-distillation sketch follows this slide.
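A minimal PyTorch sketch of the blockwise distillation idea: a candidate block is fit to reproduce a frozen reference block's output, and the residual MSE becomes that block's quality metric. The shapes and modules are hypothetical stand-ins.

    import torch
    import torch.nn as nn

    # Hypothetical stand-ins: one reference block and one candidate replacement.
    reference_block = nn.Sequential(nn.Conv2d(32, 32, 5, padding=2), nn.ReLU()).eval()
    candidate_block = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())

    opt = torch.optim.Adam(candidate_block.parameters(), lr=1e-3)
    mse = nn.MSELoss()

    for step in range(50):                        # short blockwise training run
        x = torch.randn(8, 32, 28, 28)            # block input (real system: teacher activations)
        with torch.no_grad():
            target = reference_block(x)           # frozen reference block output
        loss = mse(candidate_block(x), target)    # fit candidate to the reference projection
        opt.zero_grad()
        loss.backward()
        opt.step()

    block_quality = loss.item()                   # per-block MSE feeds the accuracy predictor
    print(f"blockwise MSE: {block_quality:.4f}")

Because each block trains against a fixed target, all blocks in the library can be trained in parallel, which is what makes this phase cheap.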

Build the accuracy predictor via BKD once
A low-cost, hardware-agnostic training phase.
- Pretrain all blocks in the search space through blockwise knowledge distillation: fast block training, trivially parallelized, broad search space. Output: a block library of block-pretrained weights and block quality metrics.
- Quickly finetune a representative set of architectures: finetune sampled networks, fast network training, only 20-30 NNs required. Output: an architecture library of finetuned architectures.
- Accuracy predictor: fit a linear regression model (regularized ridge regression) for accurate predictions, as sketched below.

BKD: blockwise knowledge distillation
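A sketch of the predictor-fitting step under stated assumptions: each architecture is featurized by its per-block quality metrics from BKD, and a ridge regression maps those features to finetuned accuracy. The data here is synthetic; scikit-learn's Ridge provides the regularized linear fit.

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)

    # Synthetic stand-in data: 25 finetuned architectures, 5 blockwise MSE features each.
    block_mse = rng.uniform(0.0, 0.5, size=(25, 5))
    # Pretend accuracy degrades roughly linearly with per-block approximation error.
    accuracy = 80.0 - block_mse @ np.array([4.0, 6.0, 5.0, 3.0, 7.0]) \
               + rng.normal(0, 0.2, 25)

    predictor = Ridge(alpha=1.0).fit(block_mse, accuracy)   # regularized linear model

    # Any candidate in the search space can now be scored without finetuning it:
    candidate_features = rng.uniform(0.0, 0.5, size=(1, 5))
    print("predicted top-1:", predictor.predict(candidate_features)[0])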

DONNA 4-step process
Objective: build an accuracy model of the search space once, then deploy to many scenarios.

Step C: scenario-specific evolutionary search in 24h.
- The search trades off measured HW latency against predicted accuracy.
- Rerun per scenario: different compiler versions, different image sizes, different devices.

Qualcomm Snapdragon is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.

Evolutionary search with real hardware measurements
Scenario-specific search allows users to select optimal architectures for real-life deployments.
- Quick turnaround time: results in about 1 day using one measurement device.
- Accurate scenario-specific search: captures all intricacies of the hardware platform and software, e.g. the runtime version or devices.
- The loop: an NSGA-II sampling algorithm proposes end-to-end models; latency is measured on the target HW, while the task-accuracy predictor supplies predicted accuracy.

NSGA: Non-dominated Sorting Genetic Algorithm
A minimal sketch of this loop follows.
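The sketch below shows the shape of such a predictor-plus-measurement loop: a toy mutation-based search that keeps the Pareto front over (measured latency, predicted accuracy). It hand-rolls a simple evolutionary step rather than full NSGA-II, and measure_latency_on_device is a stub for a real on-device measurement.

    import random

    def measure_latency_on_device(arch):
        """Stub: in a DONNA-style search this is a real measurement on the target device."""
        return sum(b["kernel"] * b["depth"] for b in arch) + random.random()

    def predict_accuracy(arch):
        """Stub for the pretrained accuracy predictor (bigger blocks -> higher score)."""
        return 70 + 0.2 * sum(b["depth"] for b in arch) + random.random()

    def mutate(arch):
        child = [dict(b) for b in arch]
        b = random.choice(child)
        b["kernel"] = random.choice([3, 5, 7])
        b["depth"] = random.choice([1, 2, 3, 4])
        return child

    def pareto_front(points):
        """Keep candidates not dominated in (lower latency, higher accuracy)."""
        return [p for p in points
                if not any(q["lat"] <= p["lat"] and q["acc"] >= p["acc"] and q is not p
                           for q in points)]

    population = [[{"kernel": 3, "depth": 2} for _ in range(5)] for _ in range(8)]
    front = []
    for gen in range(20):
        for arch in [mutate(random.choice(population)) for _ in range(8)]:
            front.append({"arch": arch,
                          "lat": measure_latency_on_device(arch),
                          "acc": predict_accuracy(arch)})
        front = pareto_front(front)
        population = [p["arch"] for p in front] or population

    print(f"{len(front)} Pareto-optimal candidates")

Because latency comes from the device itself rather than a proxy model, the search needs no differentiable cost function and automatically absorbs compiler and runtime quirks.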

DONNA 4-step process
Objective: build an accuracy model of the search space once, then deploy to many scenarios.

Step D: sample and finetune.
- Use the KD-initialized blocks from step B to finetune any network in the search space in 15-50 epochs instead of 450.

Quickly finetune the predicted Pareto-optimal architectures
Finetune to reach full accuracy and complete hardware-aware optimization for on-device AI deployments.
- Initialize from the block-pretrained weights.
- Train with cross-entropy (CE) on ground-truth labels plus soft distillation on the logits of the BKD-reference network (the teacher).
- Result: predicted accuracy / HW latency becomes confirmed accuracy / HW latency.
A sketch of this combined loss follows.
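A common way to combine the two signals named on this slide is a weighted sum of hard-label cross-entropy and a temperature-softened KL term on the teacher logits. The sketch below is that generic recipe; the hyperparameters T and alpha are illustrative, not from the deck.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        """CE on ground-truth labels + soft distillation on teacher logits."""
        hard = F.cross_entropy(student_logits, labels)
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                        F.softmax(teacher_logits / T, dim=1),
                        reduction="batchmean") * T * T  # T^2 keeps gradient scale comparable
        return alpha * hard + (1 - alpha) * soft

    # Toy usage with random logits for a 10-class problem:
    s = torch.randn(8, 10, requires_grad=True)
    t = torch.randn(8, 10)
    y = torch.randint(0, 10, (8,))
    print(distillation_loss(s, t, y).item())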

DONNA finds state-of-the-art networks for on-device scenarios
Quickly optimize and make tradeoffs in model accuracy with respect to the deployment conditions that matter.

[Charts: top-1 validation accuracy [%] vs. # of parameters [M], FLOPS, FPS desktop GPU latency, FPS mobile SoC latency (1) (Adreno 660 GPU), and FPS mobile SoC latency (2) (Hexagon 780 processor); 224x224 and 672x672 images. DONNA models are up to 20% faster at similar accuracy.]

1: Qualcomm® Adreno™ 660 GPU in the Snapdragon 888 running on the Samsung Galaxy S21.
2: Qualcomm® Hexagon™ 780 processor in the Snapdragon 888 running on the Samsung Galaxy S21.
Qualcomm Adreno and Qualcomm Hexagon are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

DONNA provides MnasNet-level diversity at 100x lower cost
DONNA efficiently finds optimal models over diverse scenarios; the cost of training is a handful of architectures.*

Method   | Granularity | Macro-diversity | Search cost, 1 scenario [epochs] | Cost/scenario, 4 scenarios [epochs] | Cost/scenario, ∞ scenarios [epochs]
OFA      | Layer-level | Fixed           | 1200 + 10×[25-75]                | 550-1050                            | 250-750
DNA      | Layer-level | Fixed           | 770 + 10×450                     | 4700                                | 4500
MnasNet  | Block-level | Variable        | 40000 + 10×450                   | 44500                               | 44500
DONNA    | Block-level | Variable        | 4000 + 10×50                     | 1500                                | 500

*Training 1 model from scratch = 450 epochs.

DONNA finds state-of-the-art networks for on-device scenarios
DONNA applies directly to downstream tasks and non-CNN neural architectures without conceptual code changes. Quickly optimize and make tradeoffs in model accuracy with respect to the deployment conditions that matter.

[Charts: object detection (VAL mAP [%]) and vision transformers (predicted top-1 accuracy vs. # multiply-accumulate operations [FLOPS]), with mobile models, ResNet-50, ViT-B, and DeiT-B as references]

User perspective for DONNA
Build the accuracy model of the search space once, then deploy to many scenarios.

IN: an oversized pretrained reference architecture.
A. Define a search space of smaller, faster network architectures.
B. Create an accuracy predictor for all network architectures (roughly equal to training 2-10 nets from scratch).
C. Execute predictor-driven evolutionary search on-device (1 day per use case).
D. Finetune the best searched models, 9-30x faster vs. regular training (50 GPU-hrs/net).
OUT: a set of Pareto-optimal network architectures. Deploy the best models at 3 ms, 6 ms, 9 ms on different chips.

DONNA conclusions
- DONNA shrinks big networks in a hardware-efficient way.
- DONNA can be rerun for any new device or setting within a day.
- DONNA works on many different tasks out of the box.
- DONNA enables scalability and allows models to be easily updated after small changes rather than starting from scratch.

Leading AI research and fast commercialization
Driving the industry towards integer inference and power-efficient AI.

Quantization research:
- Relaxed Quantization (ICLR 2019)
- Data-free Quantization (ICCV 2019)
- AdaRound (ICML 2020)
- Bayesian Bits (NeurIPS 2020)

Quantization open-sourcing:
- AI Model Efficiency Toolkit (AIMET)
- AIMET Model Zoo

AIMET and AIMET Model Zoo are products of Qualcomm Innovation Center, Inc.

AIMET & AIMET Model Zoo
Open-source projects to scale model-efficient AI to the masses.

AIMET makes AI models small
An open-sourced GitHub project that includes state-of-the-art quantization and compression techniques from Qualcomm AI Research.

Workflow: trained AI model (TensorFlow or PyTorch) → AI Model Efficiency Toolkit (AIMET: compression, quantization) → optimized AI model → deployed AI model.

Features:
- State-of-the-art network compression tools
- State-of-the-art quantization tools
- Support for both TensorFlow and PyTorch
- Benchmarks and tests for many models
- Developed by professional software developers

If interested, please join the AIMET GitHub project: /quic/aimet

AIMET: providing advanced model efficiency features and benefits

Features:
- Quantization: state-of-the-art INT8 and INT4 performance; post-training quantization methods, including Data-Free Quantization and Adaptive Rounding (AdaRound, coming soon); quantization-aware training; quantization simulation.
- Compression: efficient tensor decomposition and removal of redundant channels in convolution layers; spatial singular value decomposition (SVD); channel pruning.
- Visualization: analysis tools for drawing insights for quantization and compression; weight ranges; per-layer compression sensitivity.

Benefits: lower memory bandwidth, lower power, lower storage, higher performance, maintained model accuracy, simple ease of use.

AIMET features and APIs are easy to use
Designed to fit naturally in the AI model development workflow for researchers, developers, and ISVs.
- User-friendly APIs, e.g.:
      compress_model(model, eval_callback=obj_det_eval, compress_scheme=Scheme.spatial_svd, ...)
      equalize_model(model, ...)
- APIs invoked directly from the pipeline
- Supports TensorFlow and PyTorch
- Direct algorithm API
- Architecture: framework-specific AIMET extensions sit on top of a model optimization library (techniques to compress & quantize models), with an algorithm API open to other frameworks
A fuller usage sketch follows.
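Expanding the slide's two calls into a runnable-shaped sketch: the call signatures follow the slide, but the module paths, the Scheme enum, and the evaluation callback are assumptions that vary across AIMET releases, so treat this as pseudo-usage to be checked against your version's docs at github.com/quic/aimet.

    # Hedged sketch of the slide's AIMET calls; module paths and enums below are
    # assumptions -- consult the documentation of your AIMET release.
    import torch
    from torchvision.models import mobilenet_v2

    from aimet_torch.cross_layer_equalization import equalize_model   # path: assumption
    from aimet_torch.compress import ModelCompressor, Scheme          # path/enum: assumption

    model = mobilenet_v2(pretrained=True).eval()

    def obj_det_eval(model, iterations=None):
        """User-supplied callback; must return the model's task score.
        Stand-in body -- plug a real validation loop in here."""
        return 0.5

    # Cross-layer equalization: rescale consecutive layers so weight ranges match
    # before quantization (the DFQ step shown on the next slide).
    equalize_model(model, input_shapes=(1, 3, 224, 224))

    # Compression via spatial SVD, driven by the evaluation callback.
    # NOTE: real releases require further arguments (eval iterations, input shape,
    # cost metric, per-scheme parameters), elided here just as on the slide.
    compressed_model, stats = ModelCompressor.compress_model(
        model,
        eval_callback=obj_det_eval,
        compress_scheme=Scheme.spatial_svd,
    )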

Data-Free Quantization (DFQ) results in AIMET
A post-training technique enabling INT8 inference with very minimal loss in accuracy.

Pipeline components:
- Cross-layer equalization: equalize weight ranges, bias absorption, avoid high activation ranges
- Bias correction: measure and correct the shift in layer outputs
- Weight quantization
- Activation range estimation: estimate activation ranges for quantization

DFQ example results: <1% reduction in accuracy between FP32 and INT8 for MobileNet-v2 (top-1 accuracy), ResNet-50 (top-1 accuracy), and DeepLabv3 (mean intersection over union).
A numeric sketch of the equalization step follows.
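To make the equalization step concrete, here is a minimal numpy sketch of cross-layer scaling between two consecutive layers, following the general recipe of the DFQ paper cited earlier (ICCV 2019): per-channel scales s_i = sqrt(r1_i / r2_i) equalize the weight ranges, and ReLU's positive homogeneity keeps the network function unchanged. This is an illustration, not AIMET's implementation.

    import numpy as np

    rng = np.random.default_rng(1)

    # Two consecutive dense layers with a ReLU in between: y = W2 @ relu(W1 @ x + b1).
    # Layer 1 gets deliberately imbalanced per-channel weight ranges.
    W1 = rng.normal(size=(16, 8)) * rng.uniform(0.1, 10, (16, 1))
    b1 = rng.normal(size=16)
    W2 = rng.normal(size=(4, 16))

    r1 = np.abs(W1).max(axis=1)          # per-output-channel range of layer 1
    r2 = np.abs(W2).max(axis=0)          # per-input-channel range of layer 2
    s = np.sqrt(r1 / r2)                 # scale making both ranges sqrt(r1 * r2)

    W1_eq, b1_eq = W1 / s[:, None], b1 / s   # scale layer 1 down ...
    W2_eq = W2 * s[None, :]                  # ... and layer 2 up by the same factor

    # The function is preserved because relu(z / s) * s == relu(z) for s > 0:
    x = rng.normal(size=8)
    y_orig = W2 @ np.maximum(W1 @ x + b1, 0)
    y_eq = W2_eq @ np.maximum(W1_eq @ x + b1_eq, 0)
    print("max output drift:", np.abs(y_orig - y_eq).max())   # ~1e-15

Balanced per-channel ranges are what make a single per-tensor quantization grid accurate without any training data, which is why this step comes first in the pipeline above.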

AdaRound is coming soon to AIMET
A post-training technique that makes INT8 quantization more accurate and INT4 quantization possible.

Bitwidth                     | Mean AP (mAP)
FP32                         | 82.20
INT8, baseline quantization  | 49.85
INT8, AdaRound quantization  | 81.21

<1% reduction in accuracy between FP32 and INT8 with AdaRound quantization.
AP: Average Precision

AIMET Model Zoo
Accurate pre-trained 8-bit quantized models for: image classification, semantic segmentation, pose estimation, speech recognition, super resolution, and object detection.

AIMET Model Zoo includes popular quantized AI models
Accuracy is maintained for INT8 models: less than 1% loss.*

Model                 | Metric          | FP32   | INT8
ResNet-50 (v1)        | Top-1 accuracy  | 75.21% | 74.96%
MobileNet-v2-1.4      | Top-1 accuracy  | 75%    | 74.21%
EfficientNet Lite     | Top-1 accuracy  | 74.93% | 74.99%
SSD MobileNet-v2      | mAP             | 0.2469 | 0.2456
RetinaNet             | mAP             | 0.35   | 0.349
Pose estimation       | mAP             | 0.383  | 0.379
SRGAN                 | PSNR            | 25.45  | 24.78

MobileNetV2           | Top-1 accuracy  | 71.67% | 71.14%
EfficientNet-lite0    | Top-1 accuracy  | 75.42% | 74.44%
DeepLabV3+            | mIoU            | 72.62% | 72.22%
MobileNetV2-SSD-Lite  | mAP             | 68.7%  | 68.6%
Pose estimation       | mAP             | 0.364  | 0.359
SRGAN                 | PSNR            | 25.51  | 25.5
DeepSpeech2           | WER             | 9.92%  | 10.22%

*: Comparison between the FP32 model and the INT8 model quantized with AIMET.
For further details, check out: /quic/aimet-model-zoo/

Baseline quantization: post-training quantization using a min-max based quantization grid.
AIMET quantization: model fine-tuned using Quantization ...
