Experimental & Molecular Medicine
https://doi.org/10.1038/s12276-025-01583-1
Article in Press
A survey on large language models in biology and chemistry
Received: 7 July 2025
Accepted: 27 August 2025
Published online: 15 November 2025
Cite this article as: Islambek Ashyrmamatov, Su Ji Gwak, Su-Young Jin et al. A survey on large language models in biology and chemistry. Exp Mol Med. (2025). https://doi.org/10.1038/s12276-025-01583-1
Islambek Ashyrmamatov, Su Ji Gwak, Su-Young Jin, Ikhyeong Jun, Umit V. Ucak, Jay-Yoon Lee & Juyong Lee
We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.
If this paper is published under a Transparent Peer Review model, then the peer review reports will be published with the final article.
© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
A Survey on Large Language Models in Biology and Chemistry
Islambek Ashyrmamatov1†, Su Ji Gwak2†, Su-Young Jin2, Ikhyeong Jun3, Umit V. Ucak1*, Jay-Yoon Lee2* and Juyong Lee1,3*
1 Research Institute of Pharmaceutical Science, College of Pharmacy, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
2 Graduate School of Data Science, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
3 Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
† These authors have contributed equally to this work.
* Corresponding authors: braket@snu.ac.kr, lee.jayyoon@snu.ac.kr, nicole23@snu.ac.kr
Abstract
Artificial intelligence (AI) is reshaping biomedical research by providing scalable computational frameworks suited to the complexity of biological systems. Central to this revolution are bio/chemical language models (LMs), including Large Language Models (LLMs), which are re-conceptualizing molecular structures as a form of "language" amenable to advanced computational techniques. This review critically examines the role of these models in biology and chemistry, tracing their evolution from molecular representation to molecular generation and optimization. It covers key molecular representation strategies for both biological macromolecules and small organic compounds (ranging from protein and nucleotide sequences to single-cell data, string-based chemical formats, graph-based encodings, and 3D point clouds), highlighting their respective advantages and inherent limitations in AI applications. The discussion further explores core model architectures, such as BERT-like encoders, GPT-like decoders, and encoder-decoder transformers, alongside sophisticated pre-training strategies like self-supervised learning, multi-task learning, and retrieval-augmented generation. Key biomedical applications, spanning protein structure and function prediction, de novo protein design, genomic analysis, molecular property prediction, de novo molecular design, reaction prediction, and retrosynthesis, are explored through representative studies and emerging trends. Finally, the review considers the emerging landscape of agentic and interactive AI systems, briefly showcasing their potential to automate and accelerate scientific discovery while addressing critical technical, ethical, and regulatory considerations that will shape the future trajectory of AI in biomedicine.
1. Introduction
Large language models (LLMs), built on deep neural architectures and trained on massive text corpora, have achieved state-of-the-art performance in language understanding, generation, and reasoning. Although originally developed for natural language, their core modeling principles are broadly transferable to symbolic scientific data. This has spurred growing interest in adapting LLMs to scientific domains, particularly in chemistry and biology.1,2
Scientific knowledge and understanding critically depend on the construction of formal representations that encode the structure and behavior of physical and biological systems. These representations are designed for fidelity in capturing domain-specific properties, but they rarely align with the distributional and syntactic patterns of language models. Thus, various approaches have been proposed for better alignment between LLMs and scientific representations.3,4
What enables LLMs to perform so effectively is not an understanding of individual tokens, but their ability to model the statistical structure that governs token composition. In scientific domains, a model's ability to infer properties depends on how well the input representation encodes the underlying structure. Thus, representational design is not peripheral but fundamental for developing scientific LLMs. It determines what models can learn, generalize, and, ultimately, discover. In addition, it is well known that the scale of the model architecture and training data is critical to the accuracy and emergent behaviors of LLMs.5 Thus, the success of scientific LLMs rests both on the scale and architecture of the models and on how effectively the representation translates domain structure into a learnable form.
Recent progress in using LLMs in biology and chemistry has been accelerated by the growth of curated, domain-specific datasets. Molecular and protein databases, along with the scientific literature, now support diverse training strategies, from self-supervised objectives to multimodal integration. However, much of this development remains fragmented, and systematic comparisons across chemical and biological domains are still limited.
In this review, we examine how LLMs are being adapted to the unique demands of chemical and biological topics. We focus on how representations, architectures, and training regimes influence model performance across domains and tasks. The foundational challenge lies in converting complex, multi-dimensional molecular information into formats that language models can process (Fig. 1). Our goal is to clarify what has been achieved, what remains challenging, and how these models will better serve scientific understanding.
2. Biological language models
The unprecedented success of large language models (LLMs) has opened a new paradigm in data analysis. In biology, diverse data types, such as protein sequences,6 structures,7 nucleotides,8 and species taxonomy,9 have been explored with these methods. The application of Transformer architectures to biological problems has led to significant breakthroughs, with AlphaFold2 (AF2)10 and RoseTTAFold (RF)11 emerging as landmark models in protein structure prediction. In parallel, ongoing research aims to describe biological complexity more accurately within these models (see Table 1).
2.1. Protein language models
The sequential nature of proteins has enabled the application of language modeling techniques from natural language processing. Early models such as ProtBERT,12 MSA Transformer,13 and ProtTrans14 leveraged core techniques from deep language models while exploring variations in both input formats (e.g., single sequences and multiple sequence alignments (MSAs)) and architectures (e.g., unidirectional and BERT-style bidirectional encoders). ESMFold2 achieves AlphaFold2-level accuracy in protein structure prediction without relying on MSAs, capturing contextual dependencies solely through language modeling. The scaling of model parameters and faster structure prediction highlight the potential of language models when trained on large-scale biological data. ProtMamba15 also showed that protein language modeling is feasible without MSAs; the model adopts a Mamba16-based state-space architecture instead of attention to handle long-range sequences.
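As a concrete illustration of how such sequence-only protein models are used in practice, the following minimal sketch extracts per-residue embeddings with the publicly released ProtBERT checkpoint. It assumes the Hugging Face transformers library and PyTorch are installed; the checkpoint name Rostlab/prot_bert and the space-separated input convention follow the model's public release, not this survey.

```python
# Minimal sketch: per-residue embeddings from ProtBERT (Rostlab/prot_bert).
# Assumes `pip install transformers torch`; not code from the surveyed papers.
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# ProtBERT expects space-separated residues; rare residues are mapped to X.
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(spaced, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per token ([CLS] and [SEP] included).
per_residue = outputs.last_hidden_state.squeeze(0)
print(per_residue.shape)  # (sequence length + 2, hidden size)
```

Embeddings obtained this way are commonly pooled or fed to a small downstream head for property or function prediction.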
Protein design aims to generate proteins with completely new functions and structures, and generative models can play a key role in this process. ProGen17 enables controlled protein sequence generation by incorporating conditioning tags into an autoregressive transformer architecture. ProGen218 and ProtGPT219 further improve upon previous models by leveraging more complex conditioning tags to generate sequences that satisfy both structural and functional constraints. Recently, diffusion architectures, developed for image generation from text prompts, have been adapted for protein structure generation. RFdiffusion20 incorporates spatial constraints through SE(3) equivariance, enabling more efficient and physically consistent sampling of protein structures. Such structural modeling has facilitated scaffolding tasks, and tools including ProteinMPNN21 and Foldseek22 have accelerated advances in protein design.
2.2. Protein structure models
Protein structure models predict the tertiary structures of proteins from their primary amino acid sequences. Traditionally, techniques such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM) have been employed to elucidate protein structures. However, these experimental methods are often constrained by high costs, time requirements, and technical limitations, resulting in a considerably slower accumulation of structural data compared to the rapidly expanding number of known protein sequences.23 This sequence-structure data imbalance (e.g., between UniProtKB24 and the PDB7) underscores the need for computational prediction approaches to complement experimental efforts.
AlphaFold (AF)25 and AlphaFold2 (AF2)10 have demonstrated outstanding performance in the field of protein structure prediction, as evidenced by their success in the 13th and 14th Critical Assessment of protein Structure Prediction (CASP13 and CASP14), respectively. AF2 consists of two primary modules: the Evoformer and the structure module. Unlike AF, which employs a ResNet-based convolutional neural network (CNN), AF2 introduces an attention-based Evoformer, enabling efficient processing of MSAs and pairwise residue interactions. The Evoformer can be interpreted as a biology-specific transformer, where MSAs are treated as sequences in natural language, capturing evolutionary patterns across homologous proteins. This approach has been more fully realized in protein language models (pLMs), which are designed to replace MSAs by implicitly modeling evolutionary information. The structure module allows for end-to-end learning from primary sequence to 3D structural reconstruction, achieving near-experimental accuracy.
Several platforms have been developed to extend the applicability and accessibility of protein structure models. ColabFold26 leverages a metagenomic sequence database (ColabFoldDB) to enhance the diversity and quality of MSAs, and it is implemented to run on web-based GPU resources through Google Colaboratory. This approach improves accessibility to high-accuracy protein structure prediction while effectively reducing computational resource burdens. Phyre2.227 is an upgraded platform for protein structure and function prediction that maintains a user-friendly interface while integrating AlphaFold-predicted structures as new templates. It enables large-scale structural analysis by utilizing a broader range of structural templates beyond those available in the PDB. Furthermore, it supports domain-level optimization and batch-mode prediction, thereby serving as a computational alternative that complements experimental studies.
2.3. Nucleotide language models
Unlike natural language, DNA does not possess an inherent concept of "words," and its composition is limited to just four nucleotides, adenine (A), thymine (T), guanine (G), and cytosine (C), as opposed to protein sequences, which are composed of approximately 20 amino acids. This limited alphabet reduces the overall information density, making the development of effective DNA language models more challenging.
Earlier approaches, such as DeepSite,28 utilized CNNs and recurrent neural networks (RNNs) for modeling DNA sequences. However, CNNs often struggle to capture long-range dependencies, and RNNs suffer from computational inefficiency and scalability issues. To address these limitations, DNABERT29 adopted masked language modeling (MLM) based on bidirectional encoder representations from transformers (BERT) with k-mer tokenization (known as n-grams in computer science), enabling more effective sequence representation. Subsequent models, including GROVER30 and DNABERT-2,31 leveraged Byte Pair Encoding (BPE),32 the tokenization employed by the SentencePiece33 framework, to flexibly define token units. This helped reduce sequence information loss and improved computational efficiency. As a result, transformer-based models have been successfully applied to tasks such as identifying promoters and transcription factor binding sites (TFBSs) directly from DNA sequences. Caduceus34 employs character-level (base-pair) tokenization, which ensures robustness to minor sequence variations. Furthermore, by modeling DNA sequences bidirectionally and incorporating reverse complement (RC) equivariance, Caduceus demonstrates superior performance on tasks such as regulatory site prediction and long-range SNP effect inference. Recently, research has moved beyond masked language modeling toward generative approaches, such as MegaDNA,35 a transformer-based DNA sequence generation model.
GenSLM36 is an RNA language model capable of mutation effect prediction; it captures the differences between original and mutated RNA sequences and predicts their functional effects. The model tokenizes RNA sequences using a codon-level vocabulary, which avoids frameshift issues. The study also addresses input lengths that exceed the maximum capacity of the standard transformer, a limitation that has been identified as a fundamental architectural bottleneck in early foundation models designed for nucleotide sequence analysis. Evo,37 HyenaDNA,38 and Caduceus34 have adopted specialized architectures, such as Hyena39 and Mamba, to support long-sequence modeling.
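To make these tokenization choices concrete, the short sketch below contrasts overlapping k-mer tokenization (DNABERT-style) with non-overlapping codon-level tokenization (GenSLM-style). It is plain Python; the helper names are ours, not from any of the cited models.

```python
# Sketch: two nucleotide tokenization schemes discussed above.
# Helper names (kmer_tokens, codon_tokens) are illustrative only.

def kmer_tokens(seq: str, k: int = 6) -> list[str]:
    """Overlapping k-mers (stride 1), DNABERT-style."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def codon_tokens(seq: str) -> list[str]:
    """Non-overlapping 3-mers (stride 3); preserves the reading frame, GenSLM-style."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

dna = "ATGGCGTACGTT"
print(kmer_tokens(dna, k=6))  # ['ATGGCG', 'TGGCGT', 'GGCGTA', ...]
print(codon_tokens(dna))      # ['ATG', 'GCG', 'TAC', 'GTT']
```

The overlapping scheme yields many redundant tokens per base, while the codon scheme compresses the sequence threefold and keeps mutations aligned to reading frames, which is why it avoids the frameshift artifacts noted above.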
2.4. Single-cell language models
With the accumulation of high-dimensional gene expression data, single-cell language models have emerged as a new frontier in biology. While proteins and nucleotides are naturally sequential, single-cell gene expression data are not. Therefore, a method of ranking genes based on their expression levels has been proposed: genes within a cell are treated as words in a sentence, and Transformer-based models are applied to capture their underlying dependencies, as in other biological language modeling tasks.
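A minimal sketch of this rank-based encoding, in the spirit of Geneformer's rank-value scheme (gene names, counts, and normalization factors below are made up for illustration):

```python
# Sketch: turn one cell's expression vector into a rank-ordered "gene sentence",
# in the spirit of Geneformer-style rank-value encoding (details simplified).
import numpy as np

genes = np.array(["CD3D", "MS4A1", "LYZ", "NKG7", "GAPDH"])
counts = np.array([12.0, 0.0, 85.0, 3.0, 40.0])  # expression in one cell

# Optional per-gene normalization so ubiquitously high genes are deprioritized
# (Geneformer uses corpus-wide non-zero medians; these numbers are invented).
gene_medians = np.array([5.0, 2.0, 50.0, 4.0, 60.0])
normalized = counts / gene_medians

order = np.argsort(-normalized)                  # highest expression first
sentence = genes[order][normalized[order] > 0]   # drop unexpressed genes
print(list(sentence))  # ['CD3D', 'LYZ', 'NKG7', 'GAPDH']
```

The resulting ordered gene list is what the Transformer consumes as a token sequence, so position now encodes relative expression rather than genomic order.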
Recent advances in single-cell representation learning have surpassed traditional marker-gene-based approaches in capturing cellular heterogeneity.40 scBERT41 addresses this limitation by leveraging full gene expression profiles, achieving strong performance in cell type annotation. Geneformer42 handles the non-sequential nature of gene expression data by ordering genes based on count statistics, also showing effectiveness in classification tasks. Building on this, scGPT43 takes gene embeddings as input tokens and outputs a cell embedding, jointly learning representations at both levels. It achieves state-of-the-art results across tasks such as cell type classification, perturbation prediction, batch correction, and multi-omics integration. These findings emphasize the value of large-scale single-cell datasets (e.g., the Human Cell Atlas,44 CellMarker45) and the potential of embedding models to capture cellular complexity.
At the same time, approaches have been proposed to leverage general-purpose LLMs to directly incorporate prior biological knowledge, going beyond gene sequence modeling alone. For example, despite being trained on common human languages, GPT-4 has shown the ability to perform automatic cell type annotation based on text prompts describing gene expression levels.46 Accordingly, GenePT47 and scELMo48 have constructed gene- and cell-level embeddings by applying text embedding APIs to a corpus of biomedical literature including the NCBI database. These approaches have been reported to outperform some biological data-driven models such as Geneformer.42 In addition, CancerGPT,49 a GPT-350 model fine-tuned on text corpora, predicts drug response pairs within rare tissue types by aligning textual representations with cellular information. Developing disease-specific models with refined cell embeddings may further advance precision medicine.
2.5. Biomolecule representations
Biological macromolecules such as proteins and nucleic acids can be represented through diverse modalities to support machine learning applications. Sequence-based representations use amino acid or nucleotide strings and serve as the foundation for protein and genomic language models such as ESM,2 ProtBERT,12 and DNABERT.29,31 Structural representations capture spatial information using atomic coordinates, contact maps, or distance matrices, which are leveraged in structure models like AF and ESMFold. Graph-based approaches abstract biomolecules into nodes and edges, enabling the use of geometric deep learning models such as the SE(3)-Transformer.51 Functional representations include Gene Ontology terms, protein family annotations, and subcellular localization, enriching models with biological context. At the cellular level, omics data such as scRNA-seq are encoded as high-dimensional expression vectors.
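For instance, the structural modalities mentioned above (distance matrices and contact maps) can be derived directly from atomic coordinates. The sketch below does this with NumPy on made-up Cα coordinates; the 8 Å contact threshold is a commonly used convention, not a value from the cited models.

```python
# Sketch: pairwise C-alpha distance matrix and binary contact map from 3D coordinates.
# Coordinates are made up; 8 angstroms is a common contact threshold.
import numpy as np

coords = np.array([    # (num_residues, 3) C-alpha coordinates, in angstroms
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [7.6, 0.5, 0.0],
    [9.0, 4.0, 1.0],
])

diff = coords[:, None, :] - coords[None, :, :]
dist = np.sqrt((diff ** 2).sum(-1))                    # (N, N) distance matrix
contacts = (dist < 8.0) & ~np.eye(len(coords), dtype=bool)
print(np.round(dist, 1))
print(contacts.astype(int))
```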
2.6. Tokenization strategies
Tokenization methods have evolved from traditional machine learning techniques, including k-mer approaches,52 to biomolecule-specialized strategies such as structure- and codon-based tokenization,53 which are critical for accurate and detailed biomolecular modeling. In protein and nucleotide models, k-mer tokenization (e.g., 3-mer, 6-mer) is used to capture local biochemical context, as seen in DNABERT and ProtBERT. Some models use byte-pair encoding (BPE) or unigram models trained on large corpora of sequences, such as DNABERT-2, ESM, and ProGen. Codon-based or codon-preserving tokenization is also adopted to avoid frame-shift artifacts in nucleotide modeling. scBERT employs the gene2vec approach to generate gene embeddings, which facilitates the application of the BERT architecture to single-cell RNA sequencing data. These customized strategies ensure efficient representation of biological syntax and semantics in pretrained language models.
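As an illustration of the corpus-trained BPE approach used by models like DNABERT-2 and GROVER, the sketch below trains a tiny byte-pair-encoding vocabulary on raw DNA strings with the Hugging Face tokenizers library. The corpus and vocabulary size are toy values, not those used by any cited model.

```python
# Sketch: train a toy BPE vocabulary on raw DNA strings, in the spirit of
# DNABERT-2/GROVER. Requires `pip install tokenizers`; the corpus is made up.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

corpus = ["ATGGCGTACGTTAGC", "ATGGCGGCGTACGTT", "TTAGCATGGCGTACG"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(
    vocab_size=64,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("ATGGCGTACGTT")
print(encoding.tokens)  # merged multi-nucleotide tokens learned from the corpus
```

Unlike fixed k-mers, the learned merges adapt token boundaries to recurring motifs in the training corpus, which is the source of the efficiency gains noted above.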
2.7. Application of BLMs in Biomedicine
2.7.1. Integrative modeling for molecular cell biology
AF2 demonstrated the strength of AI in protein structure prediction and has since inspired a wide range of follow-up studies. Models such as AlphaFold3,54 RoseTTAFoldNA,55 and RoseTTAFold All-Atom56 extend their focus beyond proteins to include other biologically relevant molecules such as RNA, DNA, and ligands. In particular, all-atom structure prediction introduces computational challenges in accurately reconstructing 3D coordinates. This reflects a growing recognition that structural accuracy is essential for understanding biomolecular function not only in proteins, but also in RNA, where structure plays a critical role in regulatory activity.57 Concurrently, large language model (LLM)-based methods have begun to incorporate structural information, moving beyond sequence modeling. ESM358 jointly embeds sequence, structure, and function, marking a transition toward multimodal representation. Specialized models such as ESM-DBP59 have also been developed to predict DNA-binding proteins, adopting hybrid approaches that leverage both sequence and structure features. In the context of unified modeling in biological language models, foundation models aim to learn comprehensive cellular representations by integrating diverse biological modalities. These include epigenetic marks, spatial transcriptomics, protein expression data, and perturbation signatures, which can be explored to gain a deeper understanding of cellular function.60 This integration signals a broader shift from modality-specific models toward unified representations that more faithfully reflect the inherent complexity of biological systems.
2.7.2. Multimodal foundation models
Multimodal Large Language Models (MLLMs) offer a framework for aligning heterogeneous data types such as clinical notes, protein sequences, and molecular structures.
BiomedGPT61 aligns natural language with biomedical modalities, particularly visual representations, to enable cross-modal reasoning for visual-language tasks. It focuses on applications such as diagnosis, summarization, and clinical decision support through flexible query answering. However, such models still exhibit limitations in reasoning across complex clinical scenarios, including the interpretation of radiological images and the resolution of textual conflicts. MediConfusion62 provides a diagnostic benchmark that systematically evaluates failure modes of multimodal medical LLMs.
Tx-LLM63 leverages the advantages of large-scale pretraining on diverse biological datasets. Specifically, it is trained on sequence-level information encompassing RNA, DNA, and protein sequences, as well as SMILES. This comprehensive approach enables positive transfer in end-to-end drug discovery tasks, outperforming models that do not incorporate biological sequence data. Similarly, BioMedGPT-10B64 contributes to drug discovery by specializing in protein and molecule question answering (QA), having been trained on cell sequences and protein and molecule structures. These advancements highlight the potential of LLMs to serve as unified multimodal platforms in biomedicine (Fig. 2).
3. Chemical language models
Chemical Language Models (CLMs) have been developed to learn the structure-activity relationships of small molecules from large-scale chemical data using various sequential representations of molecules, e.g., the Simplified Molecular Input Line Entry System (SMILES).65
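To show what a string-based molecular representation looks like in practice, here is a brief RDKit sketch (aspirin is an arbitrary example of ours). The same molecule admits many valid SMILES strings, so canonicalization, or deliberate random enumeration for data augmentation, is a common preprocessing step for CLMs.

```python
# Sketch: SMILES as a sequential molecular representation, using RDKit
# (`pip install rdkit`). Aspirin is an arbitrary example molecule.
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(Chem.MolToSmiles(mol))                 # canonical SMILES
print(Chem.MolToSmiles(mol, doRandom=True))  # one random-but-valid enumeration
print(mol.GetNumAtoms())                     # heavy-atom count
```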
3.1. Model Types
Similar to pLMs, most CLMs leverage Transformer architectures,66 akin to those in natural language processing, to understand, generate, and manipulate chemical structures and reactions. These models are broadly categorized based on their architectural design, each optimized for distinct tasks within cheminformatics and drug discovery. The primary model types include encoder-only (BERT-like) models, decoder-only (GPT-like) models, and encoder-decoder architectures, as well as emerging multi-modal LLMs that integrate diverse data formats (Fig. 3). These architectural choices dictate how the models process molecular representations and perform tasks ranging from property prediction to de novo molecular design and retrosynthesis.
3.1.1. Chemical encoders
Encoder-only transformer models, primarily inspired by BERT, are designed to extract contextual representations of molecules and are well-suited for property prediction and molecular understanding. ChemBERTa67 adapts the RoBERTa68 framework with MLM and multitask regression, where auxiliary property prediction tasks are defined using molecular features computed by RDKit.69 Mol-BERT70 applies MLM to learn chemically informed token-level dependencies and is fine-tuned for tasks such as property classification and activity prediction. MoLFormer71 extends this approach using linear attention and rotary embeddings, yielding compact representations useful for downstream regression and classification tasks, though it is limited to relatively small molecules. Further encoder variants refine token representations or integrate structural priors. MolRoPE-BERT72 enhances positional encoding, while MFBERT,73 SELFormer,74 and semi-RoBERTa75 introduce architectural modifications for greater chemical expressiveness. Graph-enhanced encoders like GROVER76 incorporate topological features directly, bridging the gap between sequence and graph representations.
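For context, RDKit-computed auxiliary regression targets of the kind ChemBERTa-style multitask pre-training uses look like the following; the specific descriptor set here is our illustrative choice, not the one from the paper.

```python
# Sketch: RDKit molecular descriptors usable as auxiliary multitask regression
# targets (the particular descriptor selection here is illustrative).
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
mol = Chem.MolFromSmiles(smiles)

targets = {
    "mol_wt": Descriptors.MolWt(mol),      # molecular weight
    "logp": Descriptors.MolLogP(mol),      # lipophilicity estimate
    "tpsa": Descriptors.TPSA(mol),         # topological polar surface area
    "h_donors": Descriptors.NumHDonors(mol),
}
print(targets)
```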
3.1.2. Chemical decoders
Decoder-only transformer models, following the GPT architecture, are optimized for autoregressive generation and have become essential in de novo molecular design. MolGPT77 prioritizes causality to learn token-wise dependencies and ultimately generates novel molecules. It supports conditional generation strategies to bias outputs toward specific chemical properties. GP-MoLFormer78 is a decoder-only adaptation of MoLFormer-XL71 optimized for tasks such as unconstrained molecule generation, scaffold completion, and conditional property optimization. Other GPT-based chemical models include SMILES-GPT79 and iupacGPT,80 both adapted from GPT-281 for molecular and nomenclature sequence generation. cMolGPT82 extends this framework for controllable generation under property or scaffold constraints. Taiga83 combines GPT modeling with reinforcement learning to guide molecule synthesis toward multi-objective goals.
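To show the decoding mechanics these models share, here is a toy autoregressive sampler over a miniature SMILES-like vocabulary. The transition table stands in for a trained decoder's next-token distribution and is entirely made up; real models condition on the full token history, not just the previous token.

```python
# Toy sketch of autoregressive SMILES-style generation. The transition table
# is invented and only illustrates token-by-token causal sampling with an
# end-of-sequence token; outputs are not guaranteed to be valid SMILES.
import random

random.seed(0)
VOCAB = ["C", "O", "N", "(", ")", "=", "<eos>"]
# next_token_probs[prev] -> probabilities aligned with VOCAB (made-up values)
next_token_probs = {
    "<bos>": [0.70, 0.10, 0.10, 0.05, 0.00, 0.00, 0.05],
    "C":     [0.40, 0.10, 0.10, 0.10, 0.05, 0.10, 0.15],
    "O":     [0.50, 0.00, 0.00, 0.00, 0.20, 0.00, 0.30],
    "N":     [0.50, 0.00, 0.00, 0.00, 0.20, 0.00, 0.30],
    "(":     [0.60, 0.20, 0.20, 0.00, 0.00, 0.00, 0.00],
    ")":     [0.50, 0.10, 0.10, 0.00, 0.00, 0.10, 0.20],
    "=":     [0.60, 0.20, 0.20, 0.00, 0.00, 0.00, 0.00],
}

tokens, prev = [], "<bos>"
while prev != "<eos>" and len(tokens) < 20:
    prev = random.choices(VOCAB, weights=next_token_probs[prev])[0]
    if prev != "<eos>":
        tokens.append(prev)
print("".join(tokens))  # a SMILES-like string, sampled token by token
```

Conditional generation, as in MolGPT or cMolGPT, amounts to making this next-token distribution depend on property or scaffold tokens prepended to the sequence.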
3.1