
Experimental & Molecular Medicine

/10.1038/s12276-025-01583-1

Article in Press

A survey on large language models in biology and chemistry

Received: 7 July 2025

Accepted: 27 August 2025

Published online: 15 November 2025

Cite this article as: Islambek Ashyrmamatov, Su Ji Gwak, Su-Young Jin et al. A survey on large language models in biology and chemistry. Exp Mol Med. (2025).

https://

/10.1038/s12276-025-01583-1

Islambek Ashyrmamatov, Su Ji Gwak, Su-Young Jin, Ikhyeong Jun, Umit V. Ucak, Jay-Yoon Lee & Juyong Lee

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

If this paper is publishing under a Transparent Peer Review model then Peer Review reports will publish with the final article.

© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

http://

/licenses/by/4.0/

.


A Survey on Large Language Models in Biology and Chemistry

Islambek Ashyrmamatov1†, Su Ji Gwak2†, Su-Young Jin2, Ikhyeong Jun3, Umit V. Ucak1*, Jay-Yoon Lee2* and Juyong Lee1,3*

1 Research Institute of Pharmaceutical Science, College of Pharmacy, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea

2 Graduate School of Data Science, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea

3 Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea

† These authors have contributed equally to this work.

* Corresponding authors: braket@snu.ac.kr, lee.jayyoon@snu.ac.kr, nicole23@snu.ac.kr

Abstract

Artificial intelligence (AI) is reshaping biomedical research by providing scalable computational frameworks suited to the complexity of biological systems. Central to this revolution are bio/chemical language models (LMs), including Large Language Models (LLMs), which are re-conceptualizing molecular structures as a form of "language" amenable to advanced computational techniques. This review critically examines the role of these models in biology and chemistry, tracing their evolution from molecular representation to molecular generation and optimization. This review covers key molecular representation strategies for both biological macromolecules and small organic compounds (ranging from protein and nucleotide sequences to single-cell data, string-based chemical formats, graph-based encodings, and 3D point clouds), highlighting their respective advantages and inherent limitations in AI applications. The discussion further explores core model architectures, such as BERT-like encoders, GPT-like decoders, and encoder-decoder transformers, alongside their sophisticated pre-training strategies like self-supervised learning, multi-task learning, and retrieval-augmented generation. Key biomedical applications, spanning protein structure and function prediction, de novo protein design, genomic analysis, molecular property prediction, de novo molecular design, reaction prediction, and retrosynthesis, are explored through representative studies and emerging trends. Finally, the review considers the emerging landscape of agentic and interactive AI systems, showcasing briefly their potential to automate and accelerate scientific discovery while addressing critical technical, ethical, and regulatory considerations that will shape the future trajectory of AI in biomedicine.


1. Introduction

Large language models (LLMs), built on deep neural architectures and trained on massive text corpora, have achieved state-of-the-art performance in language understanding, generation, and reasoning. Although originally developed for natural language, their core modeling principles are broadly transferable to symbolic scientific data. This has spurred growing interest in adapting LLMs to scientific domains, particularly in chemistry and biology.1,2

Scientific knowledge and understanding critically depend on the construction of formal representations that encode the structure and behavior of physical and biological systems. These representations are designed for fidelity in capturing domain-specific properties, but rarely align with the distributional and syntactic patterns of language models. Thus, various approaches have been proposed to better align LLMs with scientific representations.3,4

What enables LLMs to perform so effectively is not an understanding of individual tokens, but their ability to model the statistical structure that governs token composition. In scientific domains, a model’s ability to infer properties depends on how well the input representation encodes underlying structure. Thus, representational design is not peripheral but fundamental for developing scientific LLMs. It determines what models can learn, generalize, and ultimately, discover. In addition, it is well known that the scale of the model architecture and of the training data is critical to the accuracy and emergent behaviors of LLMs.5 Thus, the success of scientific LLMs rests both on the scale and architecture of the models and on how effectively the representation translates a domain structure into a learnable entity.

Recent progress in using LLMs in biology and chemistry has been accelerated by the growth of curated, domain-specific datasets. Molecular and protein databases, along with scientific literature, now support diverse training strategies, from self-supervised objectives to multimodal integration. However, much of this development remains fragmented, and systematic comparisons across chemical and biological domains are still limited.

In this review, we examine how LLMs are being adapted to the unique demands of chemical and biological topics. We focus on how representations, architectures, and training regimes influence model performance across domains and tasks. The foundational challenge lies in converting complex, multi-dimensional molecular information into formats that language models can process (Fig. 1). Our goal is to clarify what has been achieved, what remains challenging, and how these models will better serve scientific understanding.


2. Biological language models

The unprecedented success of large language models (LLMs) has opened a new paradigm in data analysis. In the field of biology, the utilization of various biological data such as protein sequences,6 structures,7 nucleotides,8 and species taxonomy9 has been considered. The application of Transformer architectures to biological problems has led to significant breakthroughs, with AlphaFold2 (AF2)10 and RoseTTAFold (RF)11 emerging as landmark models in protein structure prediction. In parallel, ongoing research is being conducted to describe biological complexity more accurately within the models (see Table 1).

2.1. Protein language models

The sequential nature of proteins has enabled the application of language modeling techniques from natural language processing. Early models such as ProtBERT,12 MSA Transformer,13 and ProtTrans14 leveraged core techniques from deep language models while exploring variations in both input formats, e.g., single sequences and multiple sequence alignments (MSAs), and architectures, e.g., unidirectional and BERT-style bidirectional encoders. ESMFold2 achieves AlphaFold2-level accuracy in protein structure prediction without relying on MSAs, capturing contextual dependencies solely through language modeling. The scaling of model parameters and faster structure prediction highlight the potential of language models when trained on large-scale biological data. ProtMamba15 also showed that protein language modeling is feasible without MSAs. The model adopts a Mamba16-based state space architecture instead of an attention-based one to handle long sequences.

Protein design aims to generate proteins with completely new functions and structures, and generative models can play a key role in the process. ProGen17 enables controlled protein sequence generation by incorporating conditioning tags into an autoregressive transformer architecture. ProGen218 and ProtGPT219 further improve upon previous models by leveraging more complex conditioning tags to generate sequences that satisfy both structural and functional constraints. Recently, diffusion architectures, developed for image generation from text prompts, have been adapted for protein structure generation. RFdiffusion20 incorporates spatial constraints through SE(3) equivariance, enabling more efficient and physically consistent sampling of protein structures. Such structural modeling has facilitated scaffolding tasks, and tools including ProteinMPNN21 and Foldseek22 have accelerated advances in protein design.

2.2. Protein structure models

Protein structure models predict the tertiary structures of proteins from their primary amino acid sequences. Traditionally, techniques such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM) have been employed to elucidate protein structures. However, these experimental methods are often constrained by high costs, time requirements, and technical limitations, resulting in a considerably slower accumulation of structural data compared to the rapidly expanding number of known protein sequences.23 This sequence-structure data imbalance (e.g., between UniProtKB24 and the PDB7) underscores the need for computational prediction approaches to complement experimental efforts.

AlphaFold (AF)25 and AlphaFold2 (AF2)10 have demonstrated outstanding performance in the field of protein structure prediction, as evidenced by their success in Critical Assessment of protein Structure Prediction 13 (CASP13) and CASP14, respectively. AF2 consists of two primary modules: the Evoformer and the structure module. Unlike AF, which employs a ResNet-based convolutional neural network (CNN), AF2 introduces an attention-based Evoformer, enabling efficient processing of MSAs and pairwise residue interactions. The Evoformer can be interpreted as a biology-specific transformer, where MSAs are treated as sequences in natural language, capturing evolutionary patterns across homologous proteins. This approach has been more fully realized in protein language models (pLMs), which are designed to replace MSAs by implicitly modeling evolutionary information. The structure module allows for end-to-end learning from primary sequence to 3D structural reconstruction, achieving near experimental accuracy.

Several platforms have been developed to extend the applicability and accessibility of protein structure models. ColabFold26 leverages a metagenomic sequence database (ColabFoldDB) to enhance the diversity and quality of MSAs, and it is implemented to run on web-based GPU resources through Google Colaboratory. This approach improves accessibility to high-accuracy protein structure prediction while effectively reducing computational resource burdens. Phyre2.227 is an upgraded platform for protein structure and function prediction that maintains a user-friendly interface while integrating AlphaFold-predicted structures as new templates. It enables large-scale structural analysis by utilizing a broader range of structural templates beyond those available in the PDB. Furthermore, it supports domain-level optimization and batch-mode prediction, thereby serving as a computational alternative that complements experimental studies.

2.3. Nucleotide language models

Unlike natural language, DNA does not possess an inherent concept of "words," and its composition is limited to just four nucleotides: adenine (A), thymine (T), guanine (G), and cytosine (C). Protein sequences, by contrast, are composed of approximately 20 amino acids. This limited alphabet reduces the overall information density, making the development of effective DNA language models more challenging.

Earlier approaches, such as DeepSite,28 utilized CNNs and recurrent neural networks (RNNs) for modeling DNA sequences. However, CNNs often struggle with capturing long-range dependencies, and RNNs suffer from computational inefficiency and scalability issues. To address these limitations, DNABERT29 adopted masked language modeling (MLM) based on bidirectional encoder representations from transformers (BERT) using k-mer tokenization (a.k.a. n-gram in computer science), enabling more effective sequence representation. Subsequent models, including GROVER30 and DNABERT2,31 leveraged Byte Pair Encoding (BPE),32 a tokenization method employed by the SentencePiece33 framework, to flexibly define token units. This helped reduce sequence information loss and improved computational efficiency. As a result, transformer-based models have been successfully applied to tasks such as identifying promoters and transcription factor binding sites (TFBSs) directly from DNA sequences. Caduceus34 employs character-level (base-pair) tokenization, which ensures robustness to minor sequence variations. Furthermore, by modeling DNA sequences bidirectionally and incorporating reverse complement (RC) equivariance, Caduceus demonstrates superior performance on tasks such as regulatory site prediction and long-range SNP effect inference. Recently, research has moved beyond masked language modeling toward generative approaches, such as MegaDNA,35 a transformer-based DNA sequence generation model.
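To make the tokenization choices above concrete, the following is a minimal sketch of overlapping k-mer tokenization and of the reverse complement operation that RC-equivariant models are built around. The function names `kmer_tokenize` and `reverse_complement` are illustrative assumptions and are not taken from DNABERT, Caduceus, or any other cited model.

```python
# Illustrative sketch only: helper names are hypothetical, not from any cited model.

def kmer_tokenize(seq: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA sequence into k-mers; stride=1 gives overlapping tokens."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Translation table mapping each base to its Watson-Crick partner.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    dna = "ATGCGTAC"
    print(kmer_tokenize(dna, k=3))      # overlapping 3-mers
    print(reverse_complement(dna))      # opposite strand
```

With stride 1, a single-base substitution changes up to k neighboring tokens, which is one reason character-level tokenization can be more robust to minor sequence variations.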

GenSLM36 is an RNA language model capable of mutation effect prediction by capturing the differences between original and mutated RNA sequences and predicting their functional effects. The model uses a codon-level vocabulary, which avoids frameshift issues, for tokenizing RNA sequences. The study addresses input lengths that exceed the maximum capacity of the standard transformer. This limitation has been identified as a fundamental architectural bottleneck in early foundation models designed for nucleotide sequence analysis. Evo,37 HyenaDNA,38 and Caduceus34 have adopted specialized architectures, such as Hyena39 and Mamba, to support long-sequence modeling.
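A codon-level vocabulary can be sketched as follows, assuming an in-frame coding sequence over the alphabet A, C, G, U. The names `CODON_VOCAB`, `codon_tokenize`, and `encode` are hypothetical and do not reproduce the GenSLM implementation.

```python
# Illustrative sketch only: not the GenSLM tokenizer.
from itertools import product

# All 64 codons over the RNA alphabet, mapped to integer ids.
CODON_VOCAB = {"".join(c): i for i, c in enumerate(product("ACGU", repeat=3))}


def codon_tokenize(rna: str) -> list[str]:
    """Split an in-frame RNA sequence into non-overlapping codons.

    Trailing bases that do not complete a codon are dropped.
    """
    rna = rna.upper()
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]


def encode(rna: str) -> list[int]:
    """Map codons to ids; raises KeyError on symbols outside A, C, G, U."""
    return [CODON_VOCAB[codon] for codon in codon_tokenize(rna)]


if __name__ == "__main__":
    print(codon_tokenize("AUGGCCUAA"))
    print(encode("AUGGCCUAA"))
```

Because tokens are aligned to the reading frame, every token corresponds to one amino acid or stop signal, and the sequence length seen by the model shrinks by a factor of three.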

2.4. Single-cell language models

With the accumulation of high-dimensional gene expression data, single-cell language models have emerged as a new frontier in biology. While proteins and nucleotides are naturally sequential, single-cell gene expression data are not inherently sequential. Therefore, a method of ranking genes based on their expression levels has been proposed. Genes within a cell are treated as words in a sentence, and Transformer-based models are applied to capture their underlying dependencies, as in other biological language modeling tasks.

Recent advances in single-cell representation learning have surpassed traditional marker gene-based approaches in capturing cellular heterogeneity.40 scBERT41 addresses the limitations of marker gene-based approaches by leveraging full gene expression profiles, achieving strong performance in cell type annotation. Geneformer42 handles the non-sequential nature of gene expression data by ordering genes based on count statistics, also showing effectiveness in classification tasks. Building on this, scGPT43 takes gene embeddings as input tokens and outputs a cell embedding, jointly learning representations at both levels. It achieves state-of-the-art results across tasks such as cell type classification, perturbation prediction, batch correction, and multi-omics integration. These findings emphasize the value of large-scale single-cell datasets (e.g., the Human Cell Atlas,44 CellMarker45) and the potential of embedding models to capture cellular complexity.

At the same time, approaches have been proposed to leverage general-purpose LLMs for directly incorporating prior biological knowledge, going beyond gene sequence modeling alone. For example, despite being trained on common human languages, GPT-4 has shown the ability to perform automatic cell type annotation based on text prompts describing gene expression levels.46 Accordingly, GenePT47 and scELMo48 have constructed gene- and cell-level embeddings by applying text embedding APIs to a corpus of biomedical literature, including the NCBI database. This approach has been reported to outperform some biological data-driven models such as Geneformer.42 In addition, CancerGPT,49 a GPT-350 model fine-tuned on corpora of text, predicts drug response pairs within rare tissue types by aligning textual representations with cellular information. Developing disease-specific models with refined cell embeddings may further advance precision medicine.

2.5. Biomolecule representations

Biological macromolecules such as proteins and nucleic acids can be represented through diverse modalities to support machine learning applications. Sequence-based representations use amino acid or nucleotide strings and serve as the foundation for protein and genomic language models such as ESM,2 ProtBERT,12 and DNABERT29,31. Structural representations capture spatial information using atomic coordinates, contact maps, or distance matrices, which are leveraged in structure models like AF and ESMFold. Graph-based approaches abstract biomolecules into nodes and edges, enabling the use of geometric deep learning models such as SE(3) Transformer.51 Functional representations include Gene Ontology terms, protein family annotations, and subcellular localization, enriching models with biological context. At the cellular level, omics data like scRNA-seq are encoded as high-dimensional expression vectors.
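As a small illustration of the structural representations mentioned above, the sketch below derives a distance matrix and a binary contact map from 3D coordinates. The coordinates are made up, and the 8 Å cutoff is an assumed, commonly used convention rather than a value taken from this review.

```python
# Illustrative sketch only: toy coordinates and an assumed 8 Angstrom cutoff.
import math

Coord = tuple[float, float, float]


def distance_matrix(coords: list[Coord]) -> list[list[float]]:
    """Pairwise Euclidean distances between residue coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def contact_map(coords: list[Coord], cutoff: float = 8.0) -> list[list[int]]:
    """Binary matrix with 1 where two residues lie closer than the cutoff."""
    return [[1 if d < cutoff else 0 for d in row] for row in distance_matrix(coords)]


if __name__ == "__main__":
    # Four hypothetical residues placed 3.8 Angstrom apart along a line.
    ca_atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    for row in contact_map(ca_atoms):
        print(row)
```

The same contact map can also be read as the adjacency matrix of a residue graph, which connects the structural and graph-based views.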

2.6. Tokenization strategies

Tokenization methods have evolved from traditional machine learning techniques, including k-mer approaches,52 to biomolecule-specialized strategies such as structure- and codon-based tokenization,53 which are critical for accurate and detailed biomolecular modeling. In protein and nucleotide models, k-mer tokenization (e.g., 3-mer, 6-mer) is used to capture local biochemical context, as seen in DNABERT and ProtBERT. Some models use byte-pair encoding (BPE) or unigram models trained on large corpora of sequences, such as DNABERT2, ESM, and ProGen. Codon-based or codon-preserving tokenization is also adopted to avoid frame-shift artifacts in nucleotide modeling. scBERT employs the gene2vec approach to generate gene embeddings, which facilitates the application of the BERT architecture to single-cell RNA sequencing data. These customized strategies ensure efficient representation of biological syntax and semantics in pretrained language models.

2.7. Application of biological language models (BLMs) in biomedicine

2.7.1. Integrative modeling for molecular cell biology

AF2 demonstrated the strength of AI in protein structure prediction and has since inspired a wide range of follow-up studies. Models such as AlphaFold3,54 RoseTTAFoldNA,55 and RoseTTAFold All-Atom56 extend their focus beyond proteins to include other biologically relevant molecules such as RNA, DNA, and ligands. In particular, all-atom structure prediction introduces computational challenges in accurately reconstructing 3D coordinates. This reflects a growing recognition that structural accuracy is essential for understanding biomolecular function not only in proteins, but also in RNA, where structure plays a critical role in regulatory activity.57 Concurrently, large language model (LLM)-based methods have begun to incorporate structural information, moving beyond sequence modeling. ESM358 jointly embeds sequence, structure, and function, marking a transition toward multimodal representation. Specialized models such as ESM-DBP59 have also been developed to predict DNA-binding proteins, adopting hybrid approaches that leverage both sequence and structure features. In the context of unified modeling in biological language models, foundation models aim to learn comprehensive cellular representations by integrating diverse biological modalities. These include epigenetic marks, spatial transcriptomics, protein expression data, and perturbation signatures, which can be explored to gain a deeper understanding of cellular function.60 The integration signals a broader shift from modality-specific models toward unified representations that more reasonably reflect the inherent complexity of biological systems.

2.7.2. Multimodal foundation models

Multimodal Large Language Models (MLLMs) offer a framework for aligning heterogeneous data types such as clinical notes, protein sequences, and molecular structures.

BiomedGPT61 aligns natural language with biomedical modalities, particularly visual representations, to enable cross-modal reasoning for visual-language tasks. It focuses on applications such as diagnosis, summarization, and clinical decision support through flexible query answering. However, such models still exhibit limitations in reasoning across complex clinical scenarios, including the interpretation of radiological images and the resolution of textual conflicts. MediConfusion62 provides a diagnostic benchmark that systematically evaluates failure modes of multimodal medical LLMs.

Tx-LLM63 leverages the advantages of large-scale pretraining on diverse biological datasets. Specifically, it is trained on sequence-level information encompassing RNA, DNA, and protein sequences, as well as SMILES. This comprehensive approach enables positive transfer in end-to-end drug discovery tasks, outperforming models that do not incorporate biological sequence data. Similarly, BioMedGPT-10B64 contributes to drug discovery by specializing in protein and molecule question answering (QA), having been trained on cell sequences as well as protein and molecule structures. These advancements highlight the potential of LLMs to serve as unified multimodal platforms in biomedicine (Fig. 2).

3. Chemical language models

Chemical Language Models (CLMs) have been proposed to learn the structure-activity relationship of small molecules from large-scale chemical data using various sequential representations of molecules, e.g., the Simplified Molecular Input Line Entry System (SMILES).65


3.1. Model types

Similar to pLMs, most CLMs leverage Transformer architectures,66 akin to those in natural language processing, to understand, generate, and manipulate chemical structures and reactions. These models are broadly categorized based on their architectural design, each optimized for distinct tasks within cheminformatics and drug discovery. The primary model types include encoder-only (BERT-like) models, decoder-only (GPT-like) models, and encoder-decoder architectures, as well as emerging multi-modal LLMs that integrate diverse data formats (Fig. 3). These architectural choices dictate how the models process molecular representations and perform tasks ranging from property prediction to de novo molecular design and retrosynthesis.

3.1.1. Chemical encoders

Encoder-only transformer models, primarily inspired by BERT, are designed to extract contextual representations of molecules and are well-suited for property prediction and molecular understanding. ChemBERTa67 adapts the RoBERTa68 framework with MLM and multitask regression, where auxiliary property prediction tasks are defined using molecular features computed by RDKit.69 Mol-BERT70 applies MLM to learn chemically informed token-level dependencies and is fine-tuned for tasks such as property classification and activity prediction. MoLFormer71 extends this approach using linear attention and rotary embeddings, yielding compact representations useful for downstream regression and classification tasks, though it is limited to relatively small molecules. Further encoder variants refine token representations or integrate structural priors. MolRoPE-BERT72 enhances positional encoding, while MFBERT,73 SELFormer,74 and semi-RoBERTa75 introduce architectural modifications for greater chemical expressiveness. Graph-enhanced encoders like GROVER76 incorporate topological features directly, bridging the gap between sequence and graph representations.
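The masked language modeling objective used by these encoders can be sketched on a SMILES string as follows. The regular expression is a deliberately simplified tokenizer (bracket atoms, Br, Cl, and two-digit ring closures are kept whole; everything else is split per character), and the masking omits the random-replacement and keep-unchanged variants used in BERT. All names are hypothetical and not taken from ChemBERTa, Mol-BERT, or MoLFormer.

```python
# Illustrative sketch only: simplified SMILES tokenizer and MLM masking.
import random
import re

# Bracket atoms, two-letter halogens, and %NN ring closures stay as one token.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens: list[str], mask_prob: float = 0.15, seed: int = 0):
    """Replace a random subset of tokens with [MASK].

    Returns (masked_tokens, labels), where labels hold the original token
    at masked positions and None elsewhere.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    tokens = tokenize_smiles("ClC[C@H](N)Br")
    print(tokens)
    print(mask_tokens(tokens, mask_prob=0.3))
```

During pre-training, the encoder is asked to recover the labels at the masked positions from the surrounding tokens, which is how it learns token-level chemical context.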

3.1.2. Chemical decoders

Decoder-only transformer models, following the GPT architecture, are optimized for autoregressive generation and have become essential in de novo molecular design. MolGPT77 prioritizes causality to learn token-wise dependencies and ultimately generates novel molecules. It supports conditional generation strategies to bias outputs toward specific chemical properties. GP-MoLFormer78 is a decoder-only adaptation of MoLFormer-XL71 and is optimized for tasks such as unconstrained molecule generation, scaffold completion, and conditional property optimization. Other GPT-based chemical models include SMILES-GPT79 and iupacGPT,80 both adapted from GPT-281 for molecular and nomenclature sequence generation. cMolGPT82 extends this framework for controllable generation under property or scaffold constraints. Taiga83 combines GPT modeling with reinforcement learning to guide molecule synthesis toward multi-objective goals.


3.1
