DeepSeek-OCR: Contexts Optical Compression

Haoran Wei, Yaofeng Sun, Yukun Li
DeepSeek-AI

We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times the number of vision tokens (i.e., a compression ratio below 10×), the model achieves 96%+ OCR decoding precision; even at a compression ratio of around 20×, accuracy still remains near 60%.

Figure 1 | Figure (a) shows the compression ratio (number of text tokens in ground truth / number of vision tokens the model used) tested on the Fox [21] benchmark; Figure (b) shows performance comparisons on OmniDocBench [27]. DeepSeek-OCR can achieve state-of-the-art performance among end-to-end models while using the fewest vision tokens.

Contents
1 Introduction
2.1 Typical Vision Encoders in VLMs
2.2 End-to-end OCR Models
3 Methodology
3.1 Architecture
3.2.1 Architecture of DeepEncoder
3.2.2 Multiple resolution support
3.3 The MoE Decoder
3.4 Data Engine
3.4.3 General vision data
3.4.4 Text-only data
3.5 Training Pipelines
3.5.1 Training DeepEncoder
3.5.2 Training DeepSeek-OCR
4.1 Vision-text Compression Study
4.2 OCR Practical Performance
4.3 Qualitative Study
4.3.1 Deep parsing
4.3.2 Multilingual recognition
4.3.3 General vision understanding

1. Introduction

Current Large Language Models (LLMs) face significant computational challenges when processing long textual content due to quadratic scaling with sequence length. We explore a potential solution: leveraging the visual modality as an efficient compression medium for textual information. A single image containing document text can represent rich information using substantially fewer tokens than the equivalent digital text, suggesting that optical compression through vision tokens could achieve much higher compression ratios. This insight motivates us to reexamine vision-language models (VLMs) from an LLM-centric perspective, focusing on how vision encoders can enhance LLMs' efficiency in processing textual information, rather than on the basic VQA tasks [12, 16, 24, 32, 41] at which humans excel.

OCR tasks, as an intermediate modality bridging vision and language, provide an ideal testbed for this vision-text compression paradigm, as they establish a natural compression-decompression mapping between visual and textual representations while offering quantitative evaluation metrics.

Accordingly, we present DeepSeek-OCR, a VLM designed as a preliminary proof-of-concept for efficient vision-text compression. Our work makes three primary contributions:

First, we provide a comprehensive quantitative analysis of vision-text token compression ratios. Our method achieves 96%+ OCR decoding precision at 9-10× text compression, ~90% at 10-12× compression, and ~60% at 20× compression on the Fox [21] benchmarks featuring diverse document layouts (with actual accuracy being even higher when accounting for formatting differences between output and ground truth), as shown in Figure 1(a). The results demonstrate that compact language models can effectively learn to decode compressed visual representations, suggesting that larger LLMs could readily acquire similar capabilities through appropriate pretraining design.

Second, we introduce DeepEncoder, a novel architecture that maintains low activation memory and minimal vision tokens even with high-resolution inputs. It serially connects window-attention and global-attention encoder components through a 16× convolutional compressor. This design ensures that the window-attention component processes the large number of vision tokens, while the compressor reduces vision tokens before they enter the dense global-attention component, achieving effective memory and token compression.
Third, we develop DeepSeek-OCR based on DeepEncoder and DeepSeek3B-MoE [19, 20]. As shown in Figure 1(b), it achieves state-of-the-art performance among end-to-end models on OmniDocBench while using the fewest vision tokens. Additionally, we equip the model with capabilities for parsing charts, chemical formulas, simple geometric figures, and natural images to further enhance its practical utility. In production, DeepSeek-OCR can generate 33 million pages of data per day for LLMs or VLMs using 20 nodes (each with 8 A100-40G GPUs).

In summary, this work presents a preliminary exploration of using the visual modality as an efficient compression medium for textual information processing in LLMs. Through DeepSeek-OCR, we demonstrate that vision-text compression can achieve significant token reduction (7-20×) for different historical context stages, offering a promising direction for addressing long-context challenges in large language models. Our quantitative analysis provides empirical guidelines for VLM token allocation optimization, while the proposed DeepEncoder architecture showcases practical feasibility with real-world deployment capabilities. Although focused on OCR as a proof of concept, this paradigm opens new possibilities for rethinking how vision and language modalities can be synergistically combined to enhance computational efficiency in large-scale text processing and agent systems.

Figure 2 | Typical vision encoders in popular VLMs. Here are three types of encoders commonly used in current open-source VLMs, all of which suffer from their respective deficiencies. (Panels: a Vary/DeepSeek-VL-style dual-tower encoder, which requires two preprocessing passes, is hard to deploy, and does not support pipeline parallelism or extreme resolutions; an InternVL-series/DeepSeek-VL2-style tile-based encoder with low native resolution, overly small patches, a small global view, and too many vision tokens; and a Qwen2(.5)/3-VL-style NaViT adaptive-resolution encoder with tokens = (w // 14(16)) × (h // 14(16)), which produces too many vision tokens, needs long sequence lengths, and incurs large activations and slow inference.)

2.1. Typical Vision Encoders in VLMs

Current open-source VLMs employ three main types of vision encoders, as illustrated in Figure 2. The first type is a dual-tower architecture represented by Vary [36], which utilizes a parallel SAM [17] encoder to increase visual vocabulary parameters for high-resolution image processing. While offering controllable parameters and activation memory, this approach suffers from significant drawbacks: it requires dual image preprocessing, which complicates deployment and makes encoder pipeline parallelism challenging during training.

The second type is the tile-based method exemplified by InternVL2.0 [8], which processes images by dividing them into small tiles for parallel computation, reducing activation memory under high-resolution settings. Although capable of handling extremely high resolutions, this approach has notable limitations due to its typically low native encoder resolution (below 512×512), causing large images to be excessively fragmented and resulting in numerous vision tokens.

The third type is adaptive-resolution encoding represented by Qwen2-VL [35], which adopts the NaViT [10] paradigm to directly process full images through patch-based segmentation without tile parallelization. While this encoder can handle diverse resolutions flexibly, it faces substantial challenges with large images: massive activation memory consumption can cause GPU memory overflow, and sequence packing requires extremely long sequence lengths during training. Long vision-token sequences also slow down both the prefill and generation phases of inference.
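The token-count formula annotated in Figure 2 for NaViT-style encoders, tokens = (w // patch) × (h // patch), makes this concern concrete. The short Python sketch below works through it; the 14-pixel patch size and the example image sizes are illustrative assumptions rather than settings from any particular model.

    def navit_vision_tokens(width: int, height: int, patch: int = 14) -> int:
        # One vision token per patch, as annotated in Figure 2 for NaViT-style encoders.
        return (width // patch) * (height // patch)

    # A 1024x1024 page already yields 73 * 73 = 5329 vision tokens, and an A4 scan at
    # roughly 300 dpi (2480x3508) yields 44,250 tokens, which is why large inputs
    # inflate sequence length and activation memory.
    print(navit_vision_tokens(1024, 1024))   # 5329
    print(navit_vision_tokens(2480, 3508))   # 44250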
2.2. End-to-end OCR Models

OCR, particularly the document parsing task, has been a highly active topic in the image-to-text domain. With the advancement of VLMs, a large number of end-to-end OCR models have emerged, fundamentally transforming the traditional pipeline architecture (which required separate detection and recognition expert models) by simplifying OCR systems. Nougat [6] first employs an end-to-end framework for academic paper OCR on arXiv, demonstrating the potential of models in handling dense perception tasks. GOT-OCR2.0 [38] expands the scope of OCR 2.0 to include more synthetic image parsing tasks and designs an OCR model with performance-efficiency trade-offs, further highlighting the potential of end-to-end OCR research. Additionally, general vision models such as the Qwen-VL series [35], the InternVL series [8], and many of their derivatives continuously enhance their document OCR capabilities to explore the boundaries of dense visual perception. However, a crucial research question that current models overlook remains: how many vision tokens are actually needed to decode the text contained in a document image? This question holds significant importance for research into the principles of vision-text compression.

Figure 3 | The architecture of DeepSeek-OCR. DeepSeek-OCR consists of a DeepEncoder and a DeepSeek-3B-MoE decoder. DeepEncoder is the core of DeepSeek-OCR, comprising three components: a SAM [17] for perception dominated by window attention, a CLIP [29] for knowledge with dense global attention, and a 16× token compressor that bridges between them.

3. Methodology

3.1. Architecture

As shown in Figure 3, DeepSeek-OCR enjoys a unified end-to-end VLM architecture consisting of an encoder and a decoder. The encoder (namely DeepEncoder) is responsible for extracting image features and tokenizing as well as compressing visual representations. The decoder is used for generating the required result based on image tokens and prompts. DeepEncoder is approximately 380M in parameters, mainly composed of an 80M SAM-base [17] and a 300M CLIP-large [29] connected in series. The decoder adopts a 3B MoE architecture with about 570M activated parameters. In the following paragraphs, we will delve into the model components, data engineering, and training skills.

3.2.1. Architecture of DeepEncoder

To explore the feasibility of contexts optical compression, we need a vision encoder with the following features: 1. capable of processing high resolutions; 2. low activation at high resolutions; 3. few vision tokens; 4. support for multiple resolution inputs; 5. a moderate parameter count. However, as described in Section 2.1, current open-source encoders cannot fully satisfy all these conditions. Therefore, we design a novel vision encoder ourselves, named DeepEncoder.

DeepEncoder mainly consists of two components: a visual perception feature extraction component dominated by window attention, and a visual knowledge feature extraction component with dense global attention. To benefit from the pretraining gains of previous works, we use SAM-base (patch size 16) and CLIP-large as the main architectures for the two components, respectively. For CLIP, we remove the first patch embedding layer, since its input is no longer images but output tokens from the previous pipeline. Between the two components, we borrow from Vary [36] and use a 2-layer convolutional module to perform 16× downsampling of vision tokens. Each convolutional layer has a kernel size of 3, stride of 2, padding of 1, and the channels increase from 256 to 1024. Assuming we input a 1024×1024 image, DeepEncoder will segment it into 1024/16 × 1024/16 = 4096 patch tokens. Since the first half of the encoder is dominated by window attention and has only 80M parameters, the activation is acceptable. Before entering global attention, the 4096 tokens go through the compression module and the token count becomes 4096/16 = 256, thus making the overall activation memory controllable.
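A minimal PyTorch sketch of this 16× token compressor is given below. The kernel size, stride, padding, channel endpoints (256 → 1024), and the 4096 → 256 token budget follow the description above; the intermediate channel width and the GELU nonlinearity are assumptions, so this is an illustration rather than the released implementation.

    import torch
    import torch.nn as nn

    class TokenCompressor16x(nn.Module):
        """Two stride-2 convolutions that turn 4096 patch tokens (a 64x64 grid) into 256 tokens."""
        def __init__(self, in_ch: int = 256, mid_ch: int = 512, out_ch: int = 1024):
            super().__init__()
            # kernel 3, stride 2, padding 1, channels 256 -> ... -> 1024 as in the text;
            # mid_ch and the activation function are assumptions.
            self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=3, stride=2, padding=1)
            self.conv2 = nn.Conv2d(mid_ch, out_ch, kernel_size=3, stride=2, padding=1)
            self.act = nn.GELU()

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: (B, 4096, 256) from the SAM-base branch for a 1024x1024 input (patch size 16)
            b, n, c = tokens.shape
            side = int(n ** 0.5)                                   # 64
            x = tokens.transpose(1, 2).reshape(b, c, side, side)   # (B, 256, 64, 64)
            x = self.act(self.conv1(x))                            # (B, 512, 32, 32)
            x = self.conv2(x)                                      # (B, 1024, 16, 16)
            return x.flatten(2).transpose(1, 2)                    # (B, 256, 1024) -> CLIP-large

    # sanity check: 4096 tokens in, 256 tokens out
    print(TokenCompressor16x()(torch.randn(1, 4096, 256)).shape)   # torch.Size([1, 256, 1024])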
Figure 4 | To test model performance under different compression ratios (requiring different numbers of vision tokens) and to enhance the practicality of DeepSeek-OCR, we configure it with multiple resolution modes.

Table 1 | Multi-resolution support of DeepEncoder. For both research and application purposes, we design DeepEncoder with diverse native-resolution and dynamic-resolution modes.

  Mode      Type                 Resolution                              Vision Tokens   Processing
  Tiny      Native resolution    512×512                                 64              resize
  Small     Native resolution    640×640                                 100             resize
  Base      Native resolution    1024×1024                               256             padding
  Large     Native resolution    1280×1280                               400             padding
  Gundam    Dynamic resolution   n×640×640 local + 1024×1024 global      n×100+256       resize + padding
  Gundam-M  Dynamic resolution   n×1024×1024 local + 1280×1280 global    n×256+400       resize + padding

3.2.2. Multiple resolution support

Suppose we have an image with 1000 optical characters and we want to test how many vision tokens are needed for decoding. This requires the model to support a variable number of vision tokens; that is to say, DeepEncoder needs to support multiple resolutions. We meet this requirement through dynamic interpolation of positional encodings, and design several resolution modes for simultaneous model training to achieve the capability of a single DeepSeek-OCR model supporting multiple resolutions.

As shown in Figure 4, DeepEncoder mainly supports two major input modes: native resolution and dynamic resolution. Each of them contains multiple sub-modes.

Native resolution supports four sub-modes: Tiny, Small, Base, and Large, with corresponding resolutions and token counts of 512×512 (64), 640×640 (100), 1024×1024 (256), and 1280×1280 (400), respectively. Since Tiny and Small modes have relatively small resolutions, to avoid wasting vision tokens, images are processed by directly resizing the original shape. For Base and Large modes, in order to preserve the original image aspect ratio, images are padded to the corresponding size. After padding, the number of valid vision tokens is less than the actual number of vision tokens, with the calculation formula being:

    N_{valid} = \left\lceil N_{actual} \times \left( 1 - \frac{\max(w, h) - \min(w, h)}{\max(w, h)} \right) \right\rceil    (1)

where w and h represent the width and height of the original input image.

Dynamic resolution can be composed of two native resolutions. For example, Gundam mode consists of n×640×640 tiles (local views) and a 1024×1024 global view. The tiling method follows InternVL2.0 [8]. Supporting dynamic resolution is mainly for application considerations, especially for ultra-high-resolution inputs (such as newspaper images). Tiling is a form of secondary window attention that can effectively reduce activation memory further. It is worth noting that, due to our relatively large native resolutions, images will not be fragmented too much under dynamic resolution (the number of tiles is controlled within the range of 2 to 9). The vision token number output by DeepEncoder under Gundam mode is n×100+256, where n is the number of tiles. For images with both width and height smaller than 640, n is set to 0, i.e., Gundam mode degrades to Base mode. Gundam mode is trained together with the four native-resolution modes to achieve the goal of one model supporting multiple resolutions. Note that Gundam-master mode (1024×1024 local views + 1280×1280 global view) is obtained through continued training on a trained DeepSeek-OCR model. This is mainly for load balancing, as Gundam-master's resolution is too large and training it together with the other modes would slow down the overall training speed.
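To make the token accounting of these resolution modes concrete, the following sketch evaluates Equation (1) for the padded Base/Large modes and the n×100+256 rule for Gundam mode; it only restates the formulas given above.

    import math

    def valid_vision_tokens(w: int, h: int, actual_tokens: int) -> int:
        # Equation (1): tokens that remain valid after padding a w x h image to a square canvas.
        ratio = 1.0 - (max(w, h) - min(w, h)) / max(w, h)
        return math.ceil(actual_tokens * ratio)

    def gundam_vision_tokens(n_tiles: int) -> int:
        # Gundam mode: n local 640x640 tiles (100 tokens each) + one 1024x1024 global view (256 tokens).
        # n is kept in [2, 9]; n = 0 means the input degrades to Base mode.
        assert n_tiles == 0 or 2 <= n_tiles <= 9
        return n_tiles * 100 + 256

    print(valid_vision_tokens(2000, 1000, 256))  # Base mode on a 2:1 page -> 128 valid of 256 tokens
    print(gundam_vision_tokens(4))               # 4 tiles + global view -> 656 vision tokens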
3.3. The MoE Decoder

Our decoder uses DeepSeekMoE [19, 20], specifically DeepSeek-3B-MoE. During inference, the model activates 6 out of 64 routed experts and 2 shared experts, with about 570M activated parameters. The 3B DeepSeekMoE is very suitable for domain-centric (OCR, for us) VLM research, as it obtains the expressive capability of a 3B model while enjoying the inference efficiency of a 500M small model. The decoder reconstructs the original text representation from the compressed latent vision tokens of DeepEncoder as:

    f_{dec}: \mathbb{R}^{n \times d_{latent}} \to \mathbb{R}^{N \times d_{text}}, \quad \hat{X} = f_{dec}(Z), \quad n \le N    (2)

where Z \in \mathbb{R}^{n \times d_{latent}} are the compressed latent (vision) tokens from DeepEncoder and \hat{X} \in \mathbb{R}^{N \times d_{text}} is the reconstructed text representation. The function f_{dec} represents a non-linear mapping that can be effectively learned by compact language models through OCR-style training. It is reasonable to conjecture that LLMs, through specialized pretraining optimization, would demonstrate more natural integration of such capabilities.

3.4. Data Engine

We construct complex and diverse training data for DeepSeek-OCR, including: OCR 1.0 data, which mainly consists of traditional OCR tasks such as scene image OCR and document OCR; OCR 2.0 data, which mainly includes parsing tasks for complex artificial images, such as common charts, chemical formulas, and plane geometry; and general vision data, which is mainly used to inject certain general image understanding capabilities into DeepSeek-OCR and preserve the general vision interface.

Document data is the top priority for DeepSeek-OCR. We collect 30M pages of diverse PDF data covering about 100 languages from the Internet, with Chinese and English accounting for approximately 25M pages and other languages accounting for 5M. For this data, we create two types of ground truth: coarse annotations and fine annotations. Coarse annotations are extracted directly from the full dataset using fitz, aimed at teaching the model to recognize optical text, especially in minority languages. Fine annotations include 2M pages each for Chinese and English, labeled using advanced layout models (such as PP-DocLayout [33]) and OCR models (such as MinerU [34] and GOT-OCR2.0 [38]) to construct detection-and-recognition interleaved data. For minority languages, in the detection part, we find that the layout model enjoys certain generalization capabilities. In the recognition part, we use fitz to create small patch data to train a GOT-OCR2.0 model, then use the trained model to label small patches after layout processing, employing a model flywheel to create 600K data samples. During the training of DeepSeek-OCR, coarse labels and fine labels are distinguished using different prompts. The ground truth for fine-annotation image-text pairs can be seen in Figure 5.

Figure 5 | OCR 1.0 fine annotations display: (a) ground-truth image; (b) fine annotations with layouts. We format the ground truth into an interleaved layout-and-text format, where each paragraph of text is preceded by its coordinates and label in the original image. All coordinates are normalized into 1000 bins.

We also collect 3M Word documents, constructing high-quality image-text pairs without layout by directly extracting the content. This data mainly brings benefits to formulas and HTML-formatted tables. Additionally, we select some open-source data [28, 37] as supplements. For natural scene OCR, our model mainly supports Chinese and English. The image data sources come from LAION [31] and Wukong [13], labeled using PaddleOCR [9], with 10M data samples each for Chinese and English. Like document OCR, natural scene OCR can also control whether to output detection boxes through prompts.

Following GOT-OCR2.0 [38], we refer to chart, chemical formula, and plane geometry parsing data as OCR 2.0 data. For chart data, following OneChart [7], we use pyecharts and matplotlib to render 10M images, mainly including commonly used line, bar, pie, and composite charts. We define chart parsing as an image-to-HTML-table conversion task, as shown in Figure 6(a). For chemical formulas, we utilize the SMILES format from PubChem as the data source and render them into images using RDKit, constructing 5M image-text pairs. For plane geometry images, we follow SlowPerception [39] for generation. Specifically, we set the perception-ruler size to 4 to model each line segment. To increase the diversity of rendered data, we introduce geometric translation-invariant data augmentation, where the same geometric image is translated in the original image, corresponding to the same ground truth drawn at the centered position of the coordinate system. Based on this, we construct a total of 1M plane geometry parsing samples, as illustrated in Figure 6(b).

Figure 6 | (a) Image-text ground truth of a chart; (b) image-text ground truth of geometry. For charts, we do not use OneChart's [7] dictionary format, but instead use HTML table format as labels, which can save a certain amount of tokens. For plane geometry, we convert the ground truth to dictionary format, where the dictionary contains keys such as line segments, endpoint coordinates, line segment types, etc., for better readability. Each line segment is encoded in the SlowPerception [39] manner.
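As a small illustration of the chart branch of the OCR 2.0 data engine, the sketch below renders one bar chart with matplotlib and pairs it with an HTML-table label in the spirit of Figure 6(a). The category names, values, and file name are made up for the example; the real pipeline also uses pyecharts and covers line, pie, and composite charts.

    import matplotlib
    matplotlib.use("Agg")  # headless rendering for large-scale data generation
    import matplotlib.pyplot as plt

    # Hypothetical chart content for one training pair.
    categories = ["Q1", "Q2", "Q3", "Q4"]
    values = [12, 18, 9, 21]

    # 1) Render the chart image (the model's input).
    plt.figure(figsize=(4, 3))
    plt.bar(categories, values)
    plt.title("Quarterly sales")
    plt.tight_layout()
    plt.savefig("chart_000001.png", dpi=150)
    plt.close()

    # 2) Build the HTML-table ground truth (the model's target output).
    rows = "".join(f"<tr><td>{c}</td><td>{v}</td></tr>" for c, v in zip(categories, values))
    html_label = f"<table><tr><th>Quarter</th><th>Sales</th></tr>{rows}</table>"
    print(html_label)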
3.4.3. General vision data

DeepEncoder can benefit from CLIP's pretraining gains and has sufficient parameters to incorporate general visual knowledge. Therefore, we also prepare some corresponding data for DeepSeek-OCR. Following DeepSeek-VL2 [40], we generate relevant data for tasks such as captioning, detection, and grounding. Note that DeepSeek-OCR is not a general VLM, and this portion of data accounts for only 20% of the total. We introduce this type of data mainly to preserve the general vision interface, so that researchers interested in our model and general vision tasks can conveniently advance their work in the future.

3.4.4. Text-only data

To ensure the model's language capabilities, we introduce 10% in-house text-only pretraining data, with all data processed to a length of 8192 tokens, which is also the sequence length for DeepSeek-OCR. In summary, when training DeepSeek-OCR, OCR data accounts for 70%, general vision data accounts for 20%, and text-only data accounts for 10%.

3.5. Training Pipelines

Our training pipeline is very simple and consists mainly of two stages: a) training DeepEncoder independently; b) training DeepSeek-OCR. Note that the Gundam-master mode is obtained by continuing training on a pre-trained DeepSeek-OCR model with 6M sampled data. Since the training protocol is identical to the other modes, we omit the detailed description hereafter.

3.5.1. Training DeepEncoder

Following Vary [36], we utilize a compact language model [15] and use the next-token-prediction framework to train DeepEncoder. In this stage, we use all the OCR 1.0 and 2.0 data aforementioned, as well as 100M general data samples drawn from the LAION [31] dataset. All data is trained for 2 epochs with a batch size of 1280, using the AdamW [23] optimizer with a cosine annealing scheduler [22] and a learning rate of 5e-5. The training sequence length is 4096.

3.5.2. Training DeepSeek-OCR

After DeepEncoder is ready, we use the data described in Section 3.4 to train DeepSeek-OCR, with the entire training process conducted on the HAI-LLM [14] platform. The entire model uses pipeline parallelism (PP) and is divided into 4 parts, with DeepEncoder taking two parts and the decoder taking two parts. For DeepEncoder, we treat SAM and the compressor as the vision tokenizer, place them on PP0, and freeze their parameters, while treating the CLIP part as the input embedding layer and placing it on PP1 with unfrozen weights for training. For the language model part, since DeepSeek3B-MoE has 12 layers, we place 6 layers each on PP2 and PP3. We use 20 nodes (each with 8 A100-40G GPUs) for training, with a data parallelism (DP) of 40 and a global batch size of 640. We use the AdamW optimizer with a step-based scheduler and an initial learning rate of 3e-5. For text-only data, the training speed is 90B tokens/day, while for multimodal data, the training speed is 70B tokens/day.

4.1. Vision-text Compression Study

Table 2 | We test DeepSeek-OCR's vision-text compression ratio using all English documents with 600-1300 tokens from the Fox [21] benchmarks. Text tokens represent the number of tokens after tokenizing the ground-truth text with DeepSeek-OCR's tokenizer. Vision Tokens = 64 or 100 respectively denote the number of vision tokens output by DeepEncoder after resizing input images to 512×512 and 640×640. (Columns: Text Tokens; Precision and Compression at Vision Tokens = 64; Precision and Compression at Vision Tokens = 100; Pages.)

We select the Fox [21] benchmarks to verify DeepSeek-OCR's compression capability on text-rich documents, in order to preliminarily explore the feasibility and boundaries of contexts optical compression. We use the English document portion of Fox, tokenize the ground-truth text with DeepSeek-OCR's tokenizer (vocabulary size of approximately 129k), and select documents with 600-1300 tokens for testing, which happens to be 100 pages. Since the number of text tokens is not large, we only need to test performance in Tiny and Small modes, where Tiny mode corresponds to 64 vision tokens and Small mode corresponds to 100 vision tokens. We use the prompt without layout, "<image>\nFree OCR.", to control the model's output for these parsing tasks.
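A sketch of the page-selection and scoring loop described above is given below. The `tokenizer` and `run_ocr` objects are hypothetical stand-ins for DeepSeek-OCR's tokenizer and inference call, and the normalized edit-distance precision is an assumption about the exact metric; only the 600-1300 token filter, the prompt, and the compression-ratio definition come directly from the text.

    def edit_distance(a: str, b: str) -> int:
        # Plain dynamic-programming Levenshtein distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def evaluate_page(gt_text: str, image, num_vision_tokens: int, tokenizer, run_ocr):
        n_text = len(tokenizer.encode(gt_text))      # ground-truth text tokens
        if not 600 <= n_text <= 1300:                # page-selection rule used in the study
            return None
        pred = run_ocr(image, prompt="<image>\nFree OCR.", vision_tokens=num_vision_tokens)
        precision = 1.0 - edit_distance(pred, gt_text) / max(len(gt_text), 1)  # assumed normalization
        return {"compression": n_text / num_vision_tokens, "precision": precision}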
4.2. OCR Practical Performance

We further compare DeepSeek-OCR with existing systems on OmniDocBench [27]. All metrics in the comparison table are edit distances, where smaller values indicate better performance; "Tokens" denotes the average number of vision tokens used per page, and "†200dpi" means using fitz to interpolate the original image to 200 dpi. For the DeepSeek-OCR model, the values in parentheses in the "Tokens" column represent valid vision tokens, calculated according to Equation (1). The table reports overall, text, formula, table, and reading-order edit distances. The compared systems include pipeline models (Dolphin [11], Marker [1], Mathpix [2], MinerU-2.1.1 [34], MonkeyOCR-1.2B [18], PPstructure-v3 [9]) and end-to-end models (Nougat [6], SmolDocling [25], InternVL2-76B [8], Qwen2.5-VL-7B [5], OLMOCR [28], GOT-OCR2.0 [38], OCRFlux-3B [3], InternVL3-78B [42], Qwen2.5-VL-72B [5], dots.ocr [30], Gemini2.5-Pro [4], MinerU2.0 [34], and dots.ocr †200dpi [30]), alongside DeepSeek-OCR (end-to-end) evaluated in Tiny, Small, Large, Gundam, and Gundam-M †200dpi modes.
