
April 6, 2021 | @QCOMResearch

Intelligence at scale through AI model efficiency
Qualcomm Technologies, Inc.

Agenda
- Why efficient machine learning is necessary for AI to proliferate
- Our latest research to make AI models more efficient
- Our open-source projects to scale efficient AI

AI is being used all around us
Increasing productivity, enhancing collaboration, and transforming industries: video monitoring, extended reality, smart cities, smart factories, autonomous vehicles, video conferencing, smart homes, smartphones.

AI video analysis is on the rise
Trend toward more cameras, higher resolution, and increased frame rate across devices.

Deep neural networks are energy hungry and growing fast
AI is being powered by the explosive growth of deep neural networks. Will we have reached the capacity of the human brain? The energy efficiency of a brain is 100x better than current hardware.

[Chart: weight parameter count, log scale 10^0 to 10^14, years 1940-2030]
- 1943: First NN (N ≈ 10)
- 1988: NetTalk (N ≈ 20K)
- 2009: Hinton's Deep Belief Net (N ≈ 10M)
- 2013: Google/Y! (N ≈ 1B)
- 2017: Very large neural networks (N = 137B)
- 2021: Extremely large neural networks (N = 1.6T)
- 2025: N = 100T = 10^14?

Source: Welling

Power and thermal efficiency are essential for on-device AI

The challenge of AI workloads:
- Very compute intensive
- Large, complicated neural network models
- Complex concurrencies
- Always-on
- Real-time

Constrained mobile environment:
- Must be thermally efficient for sleek, ultra-light designs
- Storage/memory bandwidth limitations
- Requires long battery life for all-day use

Holistic model efficiency research
Multiple axes to shrink AI models and efficiently run them on hardware:
- Quantization: learning to reduce bit-precision while keeping the desired accuracy
- Compression: learning to prune the model while keeping the desired accuracy
- Compilation: learning to compile AI models for efficient hardware execution
- Neural architecture search: learning to design smaller neural networks that are on par with or outperform hand-designed architectures on real hardware

Leading research to efficiently quantize AI models
Automated reduction in the precision of weights and activations while maintaining accuracy.

Promising results show that low-precision integer inference can become widespread: virtually the same accuracy between an FP32 and a quantized AI model through
- automated, data-free, post-training methods, and
- an automated training-based mixed-precision method,
with significant performance-per-watt improvements through quantization. (1)

Models are trained at high precision (32-bit floating point); inference then runs at lower precision. The savings in memory and compute increase performance per watt: (1)
- 16-bit integer: up to 4X
- 8-bit integer: up to 16X
- 4-bit integer: up to 64X
(A sketch of the underlying integer-grid arithmetic follows this slide.)

1: FP32 model compared to quantized model.
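To make the mechanics concrete, here is a minimal sketch of uniform affine quantization, the basic operation behind the integer formats above. It is a generic illustration with hypothetical tensor values, not code from the deck.

    import numpy as np

    def quantize(x: np.ndarray, bits: int):
        """Uniform affine quantization: map float values onto a 2^bits integer grid."""
        qmin, qmax = 0, 2 ** bits - 1
        scale = (x.max() - x.min()) / (qmax - qmin)   # step size of the integer grid
        zero_point = round(-x.min() / scale)          # integer that represents 0.0
        x_int = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int32)
        return x_int, scale, zero_point

    def dequantize(x_int, scale, zero_point):
        return (x_int.astype(np.float32) - zero_point) * scale

    w = np.random.randn(64).astype(np.float32)        # hypothetical FP32 weights
    for bits in (16, 8, 4):
        w_int, s, z = quantize(w, bits)
        err = np.abs(w - dequantize(w_int, s, z)).max()
        print(f"INT{bits}: max abs rounding error = {err:.5f}")

Fewer grid points means larger rounding error, which is exactly why the data-free, rounding-aware, and mixed-precision methods on the next slide are needed to keep accuracy at 8 and 4 bits.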

Pushing the limits of what's possible with quantization

Data-free quantization
How can we make quantization as simple as possible? We created an automated method that addresses bias and imbalance in weight ranges:
- No training
- Data free
Making 8-bit weight quantization ubiquitous: <1% accuracy drop for MobileNet V2 against the FP32 model.
(Data-Free Quantization Through Weight Equalization and Bias Correction, Nagel, van Baalen, et al., ICCV 2019)

AdaRound
Is rounding to the nearest value the best approach for quantization? We created an automated method for finding the best rounding choice (its per-layer objective is sketched after this slide):
- No training
- Minimal unlabeled data
- SOTA 8-bit results
Making 8-bit weight quantization ubiquitous: <2.5% accuracy drop for MobileNet V2 against the FP32 model.
(Up or Down? Adaptive Rounding for Post-Training Quantization, Nagel, Amjad, et al., ICML 2020)

Bayesian bits
Can we quantize layers to different bit widths based on precision sensitivity? We created a novel method to learn mixed-precision quantization:
- Training required
- Training data required
- Jointly learns bit-width precision and pruning
- SOTA mixed-precision results
Automating mixed-precision quantization and enabling the tradeoff between accuracy and kernel bit-width. Making 4-bit weight quantization ubiquitous: <1% accuracy drop for MobileNet V2 against the FP32 model, for a mixed-precision model with computational complexity equivalent to a 4-bit weight model.
(Bayesian Bits: Unifying Quantization and Pruning, van Baalen, Louizos, et al., NeurIPS 2020)

SOTA: state-of-the-art
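For reference, the AdaRound idea can be written compactly. The following sketch follows the general form of the cited ICML 2020 paper (notation reconstructed from the paper, not from the deck): each weight learns whether to round down or up.

    \widetilde{W} = s \cdot \mathrm{clip}\!\left(\left\lfloor W/s \right\rfloor + h(V),\; n,\; p\right),
    \qquad
    \min_{V}\; \left\lVert Wx - \widetilde{W}x \right\rVert_F^2
    + \lambda \sum_{i,j} \left(1 - \left|\,2\,h(V_{i,j}) - 1\,\right|^{\beta}\right)

Here s is the quantization step, n and p are the integer grid limits, and h(V) is a rectified sigmoid in [0, 1] that decides the rounding direction per weight; the regularizer anneals h(V) toward hard 0/1 choices. Consult the paper for the exact annealing schedule.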

Optimizing and deploying state-of-the-art AI models for diverse scenarios at scale is challenging
- Neural network complexity: many state-of-the-art neural network solutions are large, complex, and do not run efficiently on target hardware.
- Neural network diversity: for different tasks and use cases, many different neural networks are required.
- Device diversity: deploying neural networks to many different devices, with different configurations and changing software, is required.
- Cost: compute and engineering resources for training plus evaluation are too costly and time consuming.

NAS: Neural Architecture Search
An automated way to learn a network topology that can achieve the best performance on a certain task.
- Search space: the set of operations and how they can be connected to form valid network architectures
- Search algorithm: a method for sampling a population of good network-architecture candidates
- Evaluation strategy: a method to estimate the performance of sampled network architectures
A minimal sketch of how these three components fit together follows this list.
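The sketch below wires the three components together in the simplest possible way: uniform random sampling as the search algorithm and a stubbed evaluation strategy. All names are hypothetical; it shows the division of labor, not any specific NAS method.

    import random

    # Search space: operations per layer and how deep the network may be.
    SEARCH_SPACE = {"op": ["conv3x3", "conv5x5", "mbconv", "identity"],
                    "depth": [6, 8, 12]}

    def sample_architecture():
        """Search algorithm (here: uniform random sampling of one candidate)."""
        depth = random.choice(SEARCH_SPACE["depth"])
        return [random.choice(SEARCH_SPACE["op"]) for _ in range(depth)]

    def evaluate(arch):
        """Evaluation strategy (stub): a real system would train the candidate
        or query an accuracy predictor; here we return a placeholder score."""
        return random.random()

    best = max((sample_architecture() for _ in range(100)), key=evaluate)
    print("best candidate:", best)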

Existing NAS solutions do not address all the challenges
- High cost: brute-force search is expensive (>40,000 epochs per platform).
- Lack of diverse search: hard to search in diverse spaces with different block types, attention, and activations; repeated training phase for every new scenario.
- Do not scale: repeated training phase for every new device (>40,000 epochs per platform).
- Unreliable hardware models: require differentiable cost functions; repeated training phase for every new device.

Introducing new AI research: DONNA
Distilling Optimal Neural Network Architectures: efficient NAS with hardware-aware optimization.

A scalable method that finds Pareto-optimal network architectures, in terms of accuracy and latency, for any hardware platform at low cost. DONNA starts from an oversized pretrained reference architecture and produces a set of Pareto-optimal network architectures.

- Low cost: low start-up cost of 1,000-4,000 epochs, equivalent to training 2-10 networks from scratch
- Diverse search to find the best models: supports diverse spaces with different cell types, attention, and activation functions (ReLU, Swish, etc.)
- Scalable: scales to many hardware devices at minimal cost
- Reliable hardware measurements: uses direct hardware measurements instead of a potentially inaccurate hardware model

(Distilling Optimal Neural Networks: Rapid Search in Diverse Spaces, Moons, Bert, et al., arXiv 2020)

DONNA 4-step process
Objective: build an accuracy model of the search space once, then deploy to many scenarios.

Step A: define the reference and search space once.
- Varying parameters: kernel size, expansion factors, network depth, network width, attention/activation, different efficient layer types
- Define the backbone: fixed channels, head and stem, blocks 1-5

Define the reference architecture and search space once
A diverse search space is essential for finding optimal architectures with higher accuracy.
- Select the reference architecture: the largest model in the search space.
- Chop the NN into blocks: fix the STEM, HEAD, # blocks, strides, and # channels at block edges.
- Choose a diverse factorized hierarchical search space, including variable kernel size, expansion rate, depth, # channels, cell type, activation, and attention.

[Diagram: STEM (Conv3x3 s2 + DWConv, ch=32); blocks 1-5 with strides s=2/s=1 and block-edge channels ch=32, 64, 96, 128, 196, 256; HEAD (Conv1x1 ch=1536, Avg, FC)]
Search options per block: kernel 3, 5, 7; expand 2, 3, 4, 6; depth 1, 2, 3, 4; attention SE / no SE; activation ReLU/Swish; cell type grouped, DW, ...; width scale 0.5x, 1.0x.

Ch: channel; SE: Squeeze-and-Excitation
A sampling sketch of such a factorized block space follows.
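A minimal sketch of sampling one candidate from a factorized block-level space like the one above; the option lists mirror the slide, while the names and structure are hypothetical.

    import random

    # Per-block options, taken from the slide; each block is configured independently.
    BLOCK_OPTIONS = {
        "kernel": [3, 5, 7],
        "expand": [2, 3, 4, 6],
        "depth": [1, 2, 3, 4],
        "attention": ["SE", "none"],
        "activation": ["ReLU", "Swish"],
        "cell_type": ["grouped", "DW"],
        "width_scale": [0.5, 1.0],
    }
    NUM_BLOCKS = 5  # STEM, HEAD, strides, and block-edge channels stay fixed

    def sample_candidate():
        """One architecture = one independent choice per option per block."""
        return [{k: random.choice(v) for k, v in BLOCK_OPTIONS.items()}
                for _ in range(NUM_BLOCKS)]

    print(sample_candidate())

The factorization is what keeps the space searchable: blocks are configured independently, so block-level results (weights, quality metrics) can be reused across candidates.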

DONNA 4-step process
Objective: build an accuracy model of the search space once, then deploy to many scenarios.

Step B: build an accuracy model via knowledge distillation (KD) once.
- Approximate ideal projections of the reference model through KD: each candidate block is trained to match the output of the corresponding reference block (blockwise MSE for blocks 1-5).
- Use the quality of the blockwise approximations to build the accuracy model.
A blockwise-distillation sketch follows this slide.
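A minimal PyTorch sketch of the blockwise distillation idea: a candidate block is fit to reproduce a frozen reference block's output, and the residual MSE becomes that block's quality metric. The shapes and modules are hypothetical stand-ins.

    import torch
    import torch.nn as nn

    # Hypothetical stand-ins: one reference block and one candidate replacement.
    reference_block = nn.Sequential(nn.Conv2d(32, 32, 5, padding=2), nn.ReLU()).eval()
    candidate_block = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())

    opt = torch.optim.Adam(candidate_block.parameters(), lr=1e-3)
    mse = nn.MSELoss()

    for step in range(50):                        # short blockwise training run
        x = torch.randn(8, 32, 28, 28)            # block input (real system: teacher activations)
        with torch.no_grad():
            target = reference_block(x)           # frozen reference block output
        loss = mse(candidate_block(x), target)    # fit candidate to the reference projection
        opt.zero_grad()
        loss.backward()
        opt.step()

    block_quality = loss.item()                   # per-block MSE feeds the accuracy predictor
    print(f"blockwise MSE: {block_quality:.4f}")

Because each block trains against a fixed target, all blocks in the library can be trained in parallel, which is what makes this phase cheap.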

Build the accuracy predictor via BKD once
A low-cost, hardware-agnostic training phase.
- Pretrain all blocks in the search space through blockwise knowledge distillation: fast block training, trivially parallelized, broad search space. Output: a block library of block-pretrained weights and block quality metrics.
- Quickly finetune a representative set of architectures: finetune sampled networks, fast network training, only 20-30 NNs required. Output: an architecture library of finetuned architectures.
- Accuracy predictor: fit a linear regression model (regularized ridge regression) for accurate predictions, as sketched below.

BKD: blockwise knowledge distillation
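A sketch of the predictor-fitting step under stated assumptions: each architecture is featurized by its per-block quality metrics from BKD, and a ridge regression maps those features to finetuned accuracy. The data here is synthetic; scikit-learn's Ridge provides the regularized linear fit.

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)

    # Synthetic stand-in data: 25 finetuned architectures, 5 blockwise MSE features each.
    block_mse = rng.uniform(0.0, 0.5, size=(25, 5))
    # Pretend accuracy degrades roughly linearly with per-block approximation error.
    accuracy = 80.0 - block_mse @ np.array([4.0, 6.0, 5.0, 3.0, 7.0]) \
               + rng.normal(0, 0.2, 25)

    predictor = Ridge(alpha=1.0).fit(block_mse, accuracy)   # regularized linear model

    # Any candidate in the search space can now be scored without finetuning it:
    candidate_features = rng.uniform(0.0, 0.5, size=(1, 5))
    print("predicted top-1:", predictor.predict(candidate_features)[0])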

DONNA 4-step process
Objective: build an accuracy model of the search space once, then deploy to many scenarios.

Step C: scenario-specific evolutionary search in 24h.
- The search trades off measured HW latency against predicted accuracy.
- Rerun per scenario: different compiler versions, different image sizes, different devices.

Qualcomm Snapdragon is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.

Evolutionary search with real hardware measurements
Scenario-specific search allows users to select optimal architectures for real-life deployments.
- Quick turnaround time: results in about 1 day using one measurement device.
- Accurate scenario-specific search: captures all intricacies of the hardware platform and software, e.g. the runtime version or devices.
- The loop: an NSGA-II sampling algorithm proposes end-to-end models; latency is measured on the target HW, while the task-accuracy predictor supplies predicted accuracy.

NSGA: Non-dominated Sorting Genetic Algorithm
A minimal sketch of this loop follows.
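The sketch below shows the shape of such a predictor-plus-measurement loop: a toy mutation-based search that keeps the Pareto front over (measured latency, predicted accuracy). It hand-rolls a simple evolutionary step rather than full NSGA-II, and measure_latency_on_device is a stub for a real on-device measurement.

    import random

    def measure_latency_on_device(arch):
        """Stub: in a DONNA-style search this is a real measurement on the target device."""
        return sum(b["kernel"] * b["depth"] for b in arch) + random.random()

    def predict_accuracy(arch):
        """Stub for the pretrained accuracy predictor (bigger blocks -> higher score)."""
        return 70 + 0.2 * sum(b["depth"] for b in arch) + random.random()

    def mutate(arch):
        child = [dict(b) for b in arch]
        b = random.choice(child)
        b["kernel"] = random.choice([3, 5, 7])
        b["depth"] = random.choice([1, 2, 3, 4])
        return child

    def pareto_front(points):
        """Keep candidates not dominated in (lower latency, higher accuracy)."""
        return [p for p in points
                if not any(q["lat"] <= p["lat"] and q["acc"] >= p["acc"] and q is not p
                           for q in points)]

    population = [[{"kernel": 3, "depth": 2} for _ in range(5)] for _ in range(8)]
    front = []
    for gen in range(20):
        for arch in [mutate(random.choice(population)) for _ in range(8)]:
            front.append({"arch": arch,
                          "lat": measure_latency_on_device(arch),
                          "acc": predict_accuracy(arch)})
        front = pareto_front(front)
        population = [p["arch"] for p in front] or population

    print(f"{len(front)} Pareto-optimal candidates")

Because latency comes from the device itself rather than a proxy model, the search needs no differentiable cost function and automatically absorbs compiler and runtime quirks.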

DONNA 4-step process
Objective: build an accuracy model of the search space once, then deploy to many scenarios.

Step D: sample and finetune.
- Use the KD-initialized blocks from step B to finetune any network in the search space in 15-50 epochs instead of 450.

Quickly finetune the predicted Pareto-optimal architectures
Finetune to reach full accuracy and complete hardware-aware optimization for on-device AI deployments.
- Initialize from the block-pretrained weights.
- Train with cross-entropy (CE) on ground-truth labels plus soft distillation on the logits of the BKD-reference network (the teacher).
- Result: predicted accuracy / HW latency becomes confirmed accuracy / HW latency.
A sketch of this combined loss follows.
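A common way to combine the two signals named on this slide is a weighted sum of hard-label cross-entropy and a temperature-softened KL term on the teacher logits. The sketch below is that generic recipe; the hyperparameters T and alpha are illustrative, not from the deck.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        """CE on ground-truth labels + soft distillation on teacher logits."""
        hard = F.cross_entropy(student_logits, labels)
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                        F.softmax(teacher_logits / T, dim=1),
                        reduction="batchmean") * T * T  # T^2 keeps gradient scale comparable
        return alpha * hard + (1 - alpha) * soft

    # Toy usage with random logits for a 10-class problem:
    s = torch.randn(8, 10, requires_grad=True)
    t = torch.randn(8, 10)
    y = torch.randint(0, 10, (8,))
    print(distillation_loss(s, t, y).item())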

DONNA finds state-of-the-art networks for on-device scenarios
Quickly optimize and make tradeoffs in model accuracy with respect to the deployment conditions that matter.

[Charts: top-1 validation accuracy [%] vs. # of parameters [M], FLOPS, FPS desktop GPU latency, FPS mobile SoC latency (1) (Adreno 660 GPU), and FPS mobile SoC latency (2) (Hexagon 780 processor); 224x224 and 672x672 images. DONNA models are up to 20% faster at similar accuracy.]

1: Qualcomm® Adreno™ 660 GPU in the Snapdragon 888 running on the Samsung Galaxy S21.
2: Qualcomm® Hexagon™ 780 processor in the Snapdragon 888 running on the Samsung Galaxy S21.
Qualcomm Adreno and Qualcomm Hexagon are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

DONNA provides MnasNet-level diversity at 100x lower cost
DONNA efficiently finds optimal models over diverse scenarios; the cost of training is a handful of architectures.*

Method   | Granularity | Macro-diversity | Search cost, 1 scenario [epochs] | Cost/scenario, 4 scenarios [epochs] | Cost/scenario, ∞ scenarios [epochs]
OFA      | Layer-level | Fixed           | 1200 + 10×[25-75]                | 550-1050                            | 250-750
DNA      | Layer-level | Fixed           | 770 + 10×450                     | 4700                                | 4500
MnasNet  | Block-level | Variable        | 40000 + 10×450                   | 44500                               | 44500
DONNA    | Block-level | Variable        | 4000 + 10×50                     | 1500                                | 500

*Training 1 model from scratch = 450 epochs.

DONNA finds state-of-the-art networks for on-device scenarios
DONNA applies directly to downstream tasks and non-CNN neural architectures without conceptual code changes. Quickly optimize and make tradeoffs in model accuracy with respect to the deployment conditions that matter.

[Charts: object detection (VAL mAP [%]) and vision transformers (predicted top-1 accuracy vs. # multiply-accumulate operations [FLOPS]), with mobile models, ResNet-50, ViT-B, and DeiT-B as references]

User perspective for DONNA
Build the accuracy model of the search space once, then deploy to many scenarios.

IN: an oversized pretrained reference architecture.
A. Define a search space of smaller, faster network architectures.
B. Create an accuracy predictor for all network architectures (roughly equal to training 2-10 nets from scratch).
C. Execute predictor-driven evolutionary search on-device (1 day per use case).
D. Finetune the best searched models, 9-30x faster vs. regular training (50 GPU-hrs/net).
OUT: a set of Pareto-optimal network architectures. Deploy the best models at 3 ms, 6 ms, 9 ms on different chips.

DONNA conclusions
- DONNA shrinks big networks in a hardware-efficient way.
- DONNA can be rerun for any new device or setting within a day.
- DONNA works on many different tasks out of the box.
- DONNA enables scalability and allows models to be easily updated after small changes rather than starting from scratch.

Leading AI research and fast commercialization
Driving the industry towards integer inference and power-efficient AI.

Quantization research:
- Relaxed Quantization (ICLR 2019)
- Data-free Quantization (ICCV 2019)
- AdaRound (ICML 2020)
- Bayesian Bits (NeurIPS 2020)

Quantization open-sourcing:
- AI Model Efficiency Toolkit (AIMET)
- AIMET Model Zoo

AIMET and AIMET Model Zoo are products of Qualcomm Innovation Center, Inc.

AIMET & AIMET Model Zoo
Open-source projects to scale model-efficient AI to the masses.

AIMET makes AI models small
An open-sourced GitHub project that includes state-of-the-art quantization and compression techniques from Qualcomm AI Research.

Workflow: trained AI model (TensorFlow or PyTorch) → AI Model Efficiency Toolkit (AIMET: compression, quantization) → optimized AI model → deployed AI model.

Features:
- State-of-the-art network compression tools
- State-of-the-art quantization tools
- Support for both TensorFlow and PyTorch
- Benchmarks and tests for many models
- Developed by professional software developers

If interested, please join the AIMET GitHub project: /quic/aimet

AIMET: providing advanced model efficiency features and benefits

Features:
- Quantization: state-of-the-art INT8 and INT4 performance; post-training quantization methods, including Data-Free Quantization and Adaptive Rounding (AdaRound, coming soon); quantization-aware training; quantization simulation.
- Compression: efficient tensor decomposition and removal of redundant channels in convolution layers; spatial singular value decomposition (SVD); channel pruning.
- Visualization: analysis tools for drawing insights for quantization and compression; weight ranges; per-layer compression sensitivity.

Benefits: lower memory bandwidth, lower power, lower storage, higher performance, maintained model accuracy, simple ease of use.

AIMET features and APIs are easy to use
Designed to fit naturally in the AI model development workflow for researchers, developers, and ISVs.
- User-friendly APIs, e.g.:
      compress_model(model, eval_callback=obj_det_eval, compress_scheme=Scheme.spatial_svd, ...)
      equalize_model(model, ...)
- APIs invoked directly from the pipeline
- Supports TensorFlow and PyTorch
- Direct algorithm API
- Architecture: framework-specific AIMET extensions sit on top of a model optimization library (techniques to compress & quantize models), with an algorithm API open to other frameworks
A fuller usage sketch follows.
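Expanding the slide's two calls into a runnable-shaped sketch: the call signatures follow the slide, but the module paths, the Scheme enum, and the evaluation callback are assumptions that vary across AIMET releases, so treat this as pseudo-usage to be checked against your version's docs at github.com/quic/aimet.

    # Hedged sketch of the slide's AIMET calls; module paths and enums below are
    # assumptions -- consult the documentation of your AIMET release.
    import torch
    from torchvision.models import mobilenet_v2

    from aimet_torch.cross_layer_equalization import equalize_model   # path: assumption
    from aimet_torch.compress import ModelCompressor, Scheme          # path/enum: assumption

    model = mobilenet_v2(pretrained=True).eval()

    def obj_det_eval(model, iterations=None):
        """User-supplied callback; must return the model's task score.
        Stand-in body -- plug a real validation loop in here."""
        return 0.5

    # Cross-layer equalization: rescale consecutive layers so weight ranges match
    # before quantization (the DFQ step shown on the next slide).
    equalize_model(model, input_shapes=(1, 3, 224, 224))

    # Compression via spatial SVD, driven by the evaluation callback.
    # NOTE: real releases require further arguments (eval iterations, input shape,
    # cost metric, per-scheme parameters), elided here just as on the slide.
    compressed_model, stats = ModelCompressor.compress_model(
        model,
        eval_callback=obj_det_eval,
        compress_scheme=Scheme.spatial_svd,
    )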

Data-Free Quantization (DFQ) results in AIMET
A post-training technique enabling INT8 inference with very minimal loss in accuracy.

Pipeline components:
- Cross-layer equalization: equalize weight ranges, bias absorption, avoid high activation ranges
- Bias correction: measure and correct the shift in layer outputs
- Weight quantization
- Activation range estimation: estimate activation ranges for quantization

DFQ example results: <1% reduction in accuracy between FP32 and INT8 for MobileNet-v2 (top-1 accuracy), ResNet-50 (top-1 accuracy), and DeepLabv3 (mean intersection over union).
A numeric sketch of the equalization step follows.
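To make the equalization step concrete, here is a minimal numpy sketch of cross-layer scaling between two consecutive layers, following the general recipe of the DFQ paper cited earlier (ICCV 2019): per-channel scales s_i = sqrt(r1_i / r2_i) equalize the weight ranges, and ReLU's positive homogeneity keeps the network function unchanged. This is an illustration, not AIMET's implementation.

    import numpy as np

    rng = np.random.default_rng(1)

    # Two consecutive dense layers with a ReLU in between: y = W2 @ relu(W1 @ x + b1).
    # Layer 1 gets deliberately imbalanced per-channel weight ranges.
    W1 = rng.normal(size=(16, 8)) * rng.uniform(0.1, 10, (16, 1))
    b1 = rng.normal(size=16)
    W2 = rng.normal(size=(4, 16))

    r1 = np.abs(W1).max(axis=1)          # per-output-channel range of layer 1
    r2 = np.abs(W2).max(axis=0)          # per-input-channel range of layer 2
    s = np.sqrt(r1 / r2)                 # scale making both ranges sqrt(r1 * r2)

    W1_eq, b1_eq = W1 / s[:, None], b1 / s   # scale layer 1 down ...
    W2_eq = W2 * s[None, :]                  # ... and layer 2 up by the same factor

    # The function is preserved because relu(z / s) * s == relu(z) for s > 0:
    x = rng.normal(size=8)
    y_orig = W2 @ np.maximum(W1 @ x + b1, 0)
    y_eq = W2_eq @ np.maximum(W1_eq @ x + b1_eq, 0)
    print("max output drift:", np.abs(y_orig - y_eq).max())   # ~1e-15

Balanced per-channel ranges are what make a single per-tensor quantization grid accurate without any training data, which is why this step comes first in the pipeline above.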

AdaRound is coming soon to AIMET
A post-training technique that makes INT8 quantization more accurate and INT4 quantization possible.

Bitwidth                     | Mean AP (mAP)
FP32                         | 82.20
INT8, baseline quantization  | 49.85
INT8, AdaRound quantization  | 81.21

<1% reduction in accuracy between FP32 and INT8 with AdaRound quantization.
AP: Average Precision

AIMET Model Zoo
Accurate pre-trained 8-bit quantized models for: image classification, semantic segmentation, pose estimation, speech recognition, super resolution, and object detection.

AIMET Model Zoo includes popular quantized AI models
Accuracy is maintained for INT8 models: less than 1% loss.*

Model                 | Metric          | FP32   | INT8
ResNet-50 (v1)        | Top-1 accuracy  | 75.21% | 74.96%
MobileNet-v2-1.4      | Top-1 accuracy  | 75%    | 74.21%
EfficientNet Lite     | Top-1 accuracy  | 74.93% | 74.99%
SSD MobileNet-v2      | mAP             | 0.2469 | 0.2456
RetinaNet             | mAP             | 0.35   | 0.349
Pose estimation       | mAP             | 0.383  | 0.379
SRGAN                 | PSNR            | 25.45  | 24.78

MobileNetV2           | Top-1 accuracy  | 71.67% | 71.14%
EfficientNet-lite0    | Top-1 accuracy  | 75.42% | 74.44%
DeepLabV3+            | mIoU            | 72.62% | 72.22%
MobileNetV2-SSD-Lite  | mAP             | 68.7%  | 68.6%
Pose estimation       | mAP             | 0.364  | 0.359
SRGAN                 | PSNR            | 25.51  | 25.5
DeepSpeech2           | WER             | 9.92%  | 10.22%

*: Comparison between the FP32 model and the INT8 model quantized with AIMET.
For further details, check out: /quic/aimet-model-zoo/

Baseline quantization: post-training quantization using a min-max based quantization grid.
AIMET quantization: model fine-tuned using Quantization ...
