Experimental & Molecular Medicine
https://doi.org/10.1038/s12276-025-01583-1
Article in Press
A survey on large language models in biology and chemistry
Received: 7 July 2025
Accepted: 27 August 2025
Published online: 15 November 2025
Cite this article as: Islambek Ashyrmamatov, Su Ji Gwak, Su-Young Jin et al. A survey on large language models in biology and chemistry. Exp Mol Med. (2025). https://doi.org/10.1038/s12276-025-01583-1
Islambek Ashyrmamatov, Su Ji Gwak, Su-Young Jin, Ikhyeong Jun, Umit V. Ucak, Jay-Yoon Lee & Juyong Lee
We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.
If this paper is published under a Transparent Peer Review model, then the peer review reports will be published with the final article.
© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
A Survey on Large Language Models in Biology and Chemistry
Islambek Ashyrmamatov1†, Su Ji Gwak2†, Su-Young Jin2, Ikhyeong Jun3, Umit V. Ucak1*, Jay-Yoon Lee2* and Juyong Lee1,3*
1 Research Institute of Pharmaceutical Science, College of Pharmacy, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
2 Graduate School of Data Science, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
3 Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
† These authors have contributed equally to this work.
* Corresponding authors: braket@snu.ac.kr, lee.jayyoon@snu.ac.kr, nicole23@snu.ac.kr
Abstract
Artificial intelligence (AI) is reshaping biomedical research by providing scalable computational frameworks suited to the complexity of biological systems. Central to this revolution are bio/chemical language models (LMs), including Large Language Models (LLMs), which are re-conceptualizing molecular structures as a form of "language" amenable to advanced computational techniques. This review critically examines the role of these models in biology and chemistry, tracing their evolution from molecular representation to molecular generation and optimization. It covers key molecular representation strategies for both biological macromolecules and small organic compounds (ranging from protein and nucleotide sequences to single-cell data, string-based chemical formats, graph-based encodings, and 3D point clouds), highlighting their respective advantages and inherent limitations in AI applications. The discussion further explores core model architectures, such as BERT-like encoders, GPT-like decoders, and encoder-decoder transformers, alongside sophisticated pre-training strategies like self-supervised learning, multi-task learning, and retrieval-augmented generation. Key biomedical applications, spanning protein structure and function prediction, de novo protein design, genomic analysis, molecular property prediction, de novo molecular design, reaction prediction, and retrosynthesis, are explored through representative studies and emerging trends. Finally, the review considers the emerging landscape of agentic and interactive AI systems, briefly showcasing their potential to automate and accelerate scientific discovery while addressing critical technical, ethical, and regulatory considerations that will shape the future trajectory of AI in biomedicine.
1. Introduction
Large language models (LLMs), built on deep neural architectures and trained on massive text corpora, have achieved state-of-the-art performance in language understanding, generation, and reasoning. Although originally developed for natural language, their core modeling principles are broadly transferable to symbolic scientific data. This has spurred growing interest in adapting LLMs to scientific domains, particularly in chemistry and biology.1,2
Scientific knowledge and understanding critically depend on the construction of formal representations that encode the structure and behavior of physical and biological systems. These representations are designed for fidelity in capturing domain-specific properties, but they rarely align with the distributional and syntactic patterns of language models. Thus, various approaches have been proposed for better alignment between LLMs and scientific representations.3,4
What enables LLMs to perform so effectively is not an understanding of individual tokens, but their ability to model the statistical structure that governs token composition. In scientific domains, a model's ability to infer properties depends on how well the input representation encodes the underlying structure. Thus, representational design is not peripheral but fundamental for developing scientific LLMs. It determines what models can learn, generalize, and, ultimately, discover. In addition, it is well known that the scale of the model architecture and training data is critical to the accuracy and emergent behaviors of LLMs.5 Thus, the success of scientific LLMs rests both on the scale and architecture of the models and on how effectively the representation translates domain structure into a learnable form.
Recent progress in using LLMs in biology and chemistry has been accelerated by the growth of curated, domain-specific datasets. Molecular and protein databases, along with the scientific literature, now support diverse training strategies, from self-supervised objectives to multimodal integration. However, much of this development remains fragmented, and systematic comparisons across chemical and biological domains are still limited.
In this review, we examine how LLMs are being adapted to the unique demands of chemical and biological topics. We focus on how representations, architectures, and training regimes influence model performance across domains and tasks. The foundational challenge lies in converting complex, multi-dimensional molecular information into formats that language models can process (Fig. 1). Our goal is to clarify what has been achieved, what remains challenging, and how these models will better serve scientific understanding.
2. Biological language models
The unprecedented success of large language models (LLMs) has opened a new paradigm in data analysis. In biology, diverse data types, such as protein sequences,6 structures,7 nucleotides,8 and species taxonomy,9 have been explored with these methods. The application of Transformer architectures to biological problems has led to significant breakthroughs, with AlphaFold2 (AF2)10 and RoseTTAFold (RF)11 emerging as landmark models in protein structure prediction. In parallel, ongoing research aims to describe biological complexity more accurately within these models (see Table 1).
2.1. Protein language models
The sequential nature of proteins has enabled the application of language modeling techniques from natural language processing. Early models such as ProtBERT,12 MSA Transformer,13 and ProtTrans14 leveraged core techniques from deep language models while exploring variations in both input formats (e.g., single sequences and multiple sequence alignments (MSAs)) and architectures (e.g., unidirectional and BERT-style bidirectional encoders). ESMFold2 achieves AlphaFold2-level accuracy in protein structure prediction without relying on MSAs, capturing contextual dependencies solely through language modeling. The scaling of model parameters and faster structure prediction highlight the potential of language models when trained on large-scale biological data. ProtMamba15 also showed that protein language modeling is feasible without MSAs; the model adopts a Mamba16-based state-space architecture instead of attention to handle long-range sequences.
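As a concrete illustration of how such sequence-only protein models are used in practice, the following minimal sketch extracts per-residue embeddings with the publicly released ProtBERT checkpoint. It assumes the Hugging Face transformers library and PyTorch are installed; the checkpoint name Rostlab/prot_bert and the space-separated input convention follow the model's public release, not this survey.

```python
# Minimal sketch: per-residue embeddings from ProtBERT (Rostlab/prot_bert).
# Assumes `pip install transformers torch`; not code from the surveyed papers.
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# ProtBERT expects space-separated residues; rare residues are mapped to X.
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(spaced, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per token ([CLS] and [SEP] included).
per_residue = outputs.last_hidden_state.squeeze(0)
print(per_residue.shape)  # (sequence length + 2, hidden size)
```

Embeddings obtained this way are commonly pooled or fed to a small downstream head for property or function prediction.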
Protein design aims to generate proteins with completely new functions and structures, and generative models can play a key role in this process. ProGen17 enables controlled protein sequence generation by incorporating conditioning tags into an autoregressive transformer architecture. ProGen218 and ProtGPT219 further improve upon previous models by leveraging more complex conditioning tags to generate sequences that satisfy both structural and functional constraints. Recently, diffusion architectures, developed for image generation from text prompts, have been adapted for protein structure generation. RFdiffusion20 incorporates spatial constraints through SE(3) equivariance, enabling more efficient and physically consistent sampling of protein structures. Such structural modeling has facilitated scaffolding tasks, and tools including ProteinMPNN21 and Foldseek22 have accelerated advances in protein design.
2.2. Protein structure models
Protein structure models predict the tertiary structures of proteins from their primary amino acid sequences. Traditionally, techniques such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM) have been employed to elucidate protein structures. However, these experimental methods are often constrained by high costs, time requirements, and technical limitations, resulting in a considerably slower accumulation of structural data compared to the rapidly expanding number of known protein sequences.23 This sequence-structure data imbalance (e.g., between UniProtKB24 and the PDB7) underscores the need for computational prediction approaches to complement experimental efforts.
AlphaFold (AF)25 and AlphaFold2 (AF2)10 have demonstrated outstanding performance in the field of protein structure prediction, as evidenced by their success in the 13th and 14th Critical Assessment of protein Structure Prediction (CASP13 and CASP14), respectively. AF2 consists of two primary modules: the Evoformer and the structure module. Unlike AF, which employs a ResNet-based convolutional neural network (CNN), AF2 introduces an attention-based Evoformer, enabling efficient processing of MSAs and pairwise residue interactions. The Evoformer can be interpreted as a biology-specific transformer, where MSAs are treated as sequences in natural language, capturing evolutionary patterns across homologous proteins. This approach has been more fully realized in protein language models (pLMs), which are designed to replace MSAs by implicitly modeling evolutionary information. The structure module allows for end-to-end learning from primary sequence to 3D structural reconstruction, achieving near-experimental accuracy.
Several platforms have been developed to extend the applicability and accessibility of protein structure models. ColabFold26 leverages a metagenomic sequence database (ColabFoldDB) to enhance the diversity and quality of MSAs, and it is implemented to run on web-based GPU resources through Google Colaboratory. This approach improves accessibility to high-accuracy protein structure prediction while effectively reducing computational resource burdens. Phyre2.227 is an upgraded platform for protein structure and function prediction that maintains a user-friendly interface while integrating AlphaFold-predicted structures as new templates. It enables large-scale structural analysis by utilizing a broader range of structural templates beyond those available in the PDB. Furthermore, it supports domain-level optimization and batch-mode prediction, thereby serving as a computational alternative that complements experimental studies.
2.3. Nucleotide language models
Unlike natural language, DNA does not possess an inherent concept of "words," and its composition is limited to just four nucleotides, adenine (A), thymine (T), guanine (G), and cytosine (C), as opposed to protein sequences, which are composed of approximately 20 amino acids. This limited alphabet reduces the overall information density, making the development of effective DNA language models more challenging.
Earlier approaches, such as DeepSite,28 utilized CNNs and recurrent neural networks (RNNs) for modeling DNA sequences. However, CNNs often struggle to capture long-range dependencies, and RNNs suffer from computational inefficiency and scalability issues. To address these limitations, DNABERT29 adopted masked language modeling (MLM) based on bidirectional encoder representations from transformers (BERT) with k-mer tokenization (known as n-grams in computer science), enabling more effective sequence representation. Subsequent models, including GROVER30 and DNABERT-2,31 leveraged Byte Pair Encoding (BPE),32 the tokenization employed by the SentencePiece33 framework, to flexibly define token units. This helped reduce sequence information loss and improved computational efficiency. As a result, transformer-based models have been successfully applied to tasks such as identifying promoters and transcription factor binding sites (TFBSs) directly from DNA sequences. Caduceus34 employs character-level (base-pair) tokenization, which ensures robustness to minor sequence variations. Furthermore, by modeling DNA sequences bidirectionally and incorporating reverse complement (RC) equivariance, Caduceus demonstrates superior performance on tasks such as regulatory site prediction and long-range SNP effect inference. Recently, research has moved beyond masked language modeling toward generative approaches, such as MegaDNA,35 a transformer-based DNA sequence generation model.
GenSLM36 is an RNA language model capable of mutation effect prediction; it captures the differences between original and mutated RNA sequences and predicts their functional effects. The model tokenizes RNA sequences using a codon-level vocabulary, which avoids frameshift issues. The study also addresses input lengths that exceed the maximum capacity of the standard transformer, a limitation that has been identified as a fundamental architectural bottleneck in early foundation models designed for nucleotide sequence analysis. Evo,37 HyenaDNA,38 and Caduceus34 have adopted specialized architectures, such as Hyena39 and Mamba, to support long-sequence modeling.
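To make these tokenization choices concrete, the short sketch below contrasts overlapping k-mer tokenization (DNABERT-style) with non-overlapping codon-level tokenization (GenSLM-style). It is plain Python; the helper names are ours, not from any of the cited models.

```python
# Sketch: two nucleotide tokenization schemes discussed above.
# Helper names (kmer_tokens, codon_tokens) are illustrative only.

def kmer_tokens(seq: str, k: int = 6) -> list[str]:
    """Overlapping k-mers (stride 1), DNABERT-style."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def codon_tokens(seq: str) -> list[str]:
    """Non-overlapping 3-mers (stride 3); preserves the reading frame, GenSLM-style."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

dna = "ATGGCGTACGTT"
print(kmer_tokens(dna, k=6))  # ['ATGGCG', 'TGGCGT', 'GGCGTA', ...]
print(codon_tokens(dna))      # ['ATG', 'GCG', 'TAC', 'GTT']
```

The overlapping scheme yields many redundant tokens per base, while the codon scheme compresses the sequence threefold and keeps mutations aligned to reading frames, which is why it avoids the frameshift artifacts noted above.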
2.4. Single-cell language models
With the accumulation of high-dimensional gene expression data, single-cell language models have emerged as a new frontier in biology. While proteins and nucleotides are naturally sequential, single-cell gene expression data are not. Therefore, a method of ranking genes based on their expression levels has been proposed: genes within a cell are treated as words in a sentence, and Transformer-based models are applied to capture their underlying dependencies, as in other biological language modeling tasks.
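A minimal sketch of this rank-based encoding, in the spirit of Geneformer's rank-value scheme (gene names, counts, and normalization factors below are made up for illustration):

```python
# Sketch: turn one cell's expression vector into a rank-ordered "gene sentence",
# in the spirit of Geneformer-style rank-value encoding (details simplified).
import numpy as np

genes = np.array(["CD3D", "MS4A1", "LYZ", "NKG7", "GAPDH"])
counts = np.array([12.0, 0.0, 85.0, 3.0, 40.0])  # expression in one cell

# Optional per-gene normalization so ubiquitously high genes are deprioritized
# (Geneformer uses corpus-wide non-zero medians; these numbers are invented).
gene_medians = np.array([5.0, 2.0, 50.0, 4.0, 60.0])
normalized = counts / gene_medians

order = np.argsort(-normalized)                  # highest expression first
sentence = genes[order][normalized[order] > 0]   # drop unexpressed genes
print(list(sentence))  # ['CD3D', 'LYZ', 'NKG7', 'GAPDH']
```

The resulting ordered gene list is what the Transformer consumes as a token sequence, so position now encodes relative expression rather than genomic order.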
Recent advances in single-cell representation learning have surpassed traditional marker-gene-based approaches in capturing cellular heterogeneity.40 scBERT41 addresses this limitation by leveraging full gene expression profiles, achieving strong performance in cell type annotation. Geneformer42 handles the non-sequential nature of gene expression data by ordering genes based on count statistics, also showing effectiveness in classification tasks. Building on this, scGPT43 takes gene embeddings as input tokens and outputs a cell embedding, jointly learning representations at both levels. It achieves state-of-the-art results across tasks such as cell type classification, perturbation prediction, batch correction, and multi-omics integration. These findings emphasize the value of large-scale single-cell datasets (e.g., the Human Cell Atlas,44 CellMarker45) and the potential of embedding models to capture cellular complexity.
At the same time, approaches have been proposed to leverage general-purpose LLMs to directly incorporate prior biological knowledge, going beyond gene sequence modeling alone. For example, despite being trained on common human languages, GPT-4 has shown the ability to perform automatic cell type annotation based on text prompts describing gene expression levels.46 Accordingly, GenePT47 and scELMo48 have constructed gene- and cell-level embeddings by applying text embedding APIs to a corpus of biomedical literature including the NCBI database. These approaches have been reported to outperform some biological data-driven models such as Geneformer.42 In addition, CancerGPT,49 a GPT-350 model fine-tuned on text corpora, predicts drug response pairs within rare tissue types by aligning textual representations with cellular information. Developing disease-specific models with refined cell embeddings may further advance precision medicine.
2.5. Biomolecule representations
Biological macromolecules such as proteins and nucleic acids can be represented through diverse modalities to support machine learning applications. Sequence-based representations use amino acid or nucleotide strings and serve as the foundation for protein and genomic language models such as ESM,2 ProtBERT,12 and DNABERT.29,31 Structural representations capture spatial information using atomic coordinates, contact maps, or distance matrices, which are leveraged in structure models like AF and ESMFold. Graph-based approaches abstract biomolecules into nodes and edges, enabling the use of geometric deep learning models such as the SE(3)-Transformer.51 Functional representations include Gene Ontology terms, protein family annotations, and subcellular localization, enriching models with biological context. At the cellular level, omics data such as scRNA-seq are encoded as high-dimensional expression vectors.
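For instance, the structural modalities mentioned above (distance matrices and contact maps) can be derived directly from atomic coordinates. The sketch below does this with NumPy on made-up Cα coordinates; the 8 Å contact threshold is a commonly used convention, not a value from the cited models.

```python
# Sketch: pairwise C-alpha distance matrix and binary contact map from 3D coordinates.
# Coordinates are made up; 8 angstroms is a common contact threshold.
import numpy as np

coords = np.array([    # (num_residues, 3) C-alpha coordinates, in angstroms
    [0.0, 0.0, 0.0],
    [3.8, 0.0, 0.0],
    [7.6, 0.5, 0.0],
    [9.0, 4.0, 1.0],
])

diff = coords[:, None, :] - coords[None, :, :]
dist = np.sqrt((diff ** 2).sum(-1))                    # (N, N) distance matrix
contacts = (dist < 8.0) & ~np.eye(len(coords), dtype=bool)
print(np.round(dist, 1))
print(contacts.astype(int))
```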
2.6. Tokenization strategies
Tokenization methods have evolved from traditional machine learning techniques, including k-mer approaches,52 to biomolecule-specialized strategies such as structure- and codon-based tokenization,53 which are critical for accurate and detailed biomolecular modeling. In protein and nucleotide models, k-mer tokenization (e.g., 3-mer, 6-mer) is used to capture local biochemical context, as seen in DNABERT and ProtBERT. Some models use byte-pair encoding (BPE) or unigram models trained on large corpora of sequences, such as DNABERT-2, ESM, and ProGen. Codon-based or codon-preserving tokenization is also adopted to avoid frame-shift artifacts in nucleotide modeling. scBERT employs the gene2vec approach to generate gene embeddings, which facilitates the application of the BERT architecture to single-cell RNA sequencing data. These customized strategies ensure efficient representation of biological syntax and semantics in pretrained language models.
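As an illustration of the corpus-trained BPE approach used by models like DNABERT-2 and GROVER, the sketch below trains a tiny byte-pair-encoding vocabulary on raw DNA strings with the Hugging Face tokenizers library. The corpus and vocabulary size are toy values, not those used by any cited model.

```python
# Sketch: train a toy BPE vocabulary on raw DNA strings, in the spirit of
# DNABERT-2/GROVER. Requires `pip install tokenizers`; the corpus is made up.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

corpus = ["ATGGCGTACGTTAGC", "ATGGCGGCGTACGTT", "TTAGCATGGCGTACG"]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(
    vocab_size=64,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train_from_iterator(corpus, trainer)

encoding = tokenizer.encode("ATGGCGTACGTT")
print(encoding.tokens)  # merged multi-nucleotide tokens learned from the corpus
```

Unlike fixed k-mers, the learned merges adapt token boundaries to recurring motifs in the training corpus, which is the source of the efficiency gains noted above.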
2.7. Application of BLMs in Biomedicine
2.7.1. Integrative modeling for molecular cell biology
AF2 demonstrated the strength of AI in protein structure prediction and has since inspired a wide range of follow-up studies. Models such as AlphaFold3,54 RoseTTAFoldNA,55 and RoseTTAFold All-Atom56 extend their focus beyond proteins to include other biologically relevant molecules such as RNA, DNA, and ligands. In particular, all-atom structure prediction introduces computational challenges in accurately reconstructing 3D coordinates. This reflects a growing recognition that structural accuracy is essential for understanding biomolecular function not only in proteins, but also in RNA, where structure plays a critical role in regulatory activity.57 Concurrently, large language model (LLM)-based methods have begun to incorporate structural information, moving beyond sequence modeling. ESM358 jointly embeds sequence, structure, and function, marking a transition toward multimodal representation. Specialized models such as ESM-DBP59 have also been developed to predict DNA-binding proteins, adopting hybrid approaches that leverage both sequence and structure features. In the context of unified modeling in biological language models, foundation models aim to learn comprehensive cellular representations by integrating diverse biological modalities. These include epigenetic marks, spatial transcriptomics, protein expression data, and perturbation signatures, which can be explored to gain a deeper understanding of cellular function.60 This integration signals a broader shift from modality-specific models toward unified representations that more faithfully reflect the inherent complexity of biological systems.
2.7.2. Multimodal foundation models
Multimodal Large Language Models (MLLMs) offer a framework for aligning heterogeneous data types such as clinical notes, protein sequences, and molecular structures.
BiomedGPT61 aligns natural language with biomedical modalities, particularly visual representations, to enable cross-modal reasoning for visual-language tasks. It focuses on applications such as diagnosis, summarization, and clinical decision support through flexible query answering. However, such models still exhibit limitations in reasoning across complex clinical scenarios, including the interpretation of radiological images and the resolution of textual conflicts. MediConfusion62 provides a diagnostic benchmark that systematically evaluates failure modes of multimodal medical LLMs.
Tx-LLM63 leverages the advantages of large-scale pretraining on diverse biological datasets. Specifically, it is trained on sequence-level information encompassing RNA, DNA, and protein sequences, as well as SMILES. This comprehensive approach enables positive transfer in end-to-end drug discovery tasks, outperforming models that do not incorporate biological sequence data. Similarly, BioMedGPT-10B64 contributes to drug discovery by specializing in protein and molecule question answering (QA), having been trained on cell sequences and protein and molecule structures. These advancements highlight the potential of LLMs to serve as unified multimodal platforms in biomedicine (Fig. 2).
3. Chemical language models
Chemical Language Models (CLMs) have been developed to learn the structure-activity relationships of small molecules from large-scale chemical data using various sequential representations of molecules, e.g., the Simplified Molecular Input Line Entry System (SMILES).65
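To show what a string-based molecular representation looks like in practice, here is a brief RDKit sketch (aspirin is an arbitrary example of ours). The same molecule admits many valid SMILES strings, so canonicalization, or deliberate random enumeration for data augmentation, is a common preprocessing step for CLMs.

```python
# Sketch: SMILES as a sequential molecular representation, using RDKit
# (`pip install rdkit`). Aspirin is an arbitrary example molecule.
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(Chem.MolToSmiles(mol))                 # canonical SMILES
print(Chem.MolToSmiles(mol, doRandom=True))  # one random-but-valid enumeration
print(mol.GetNumAtoms())                     # heavy-atom count
```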
3.1. Model Types
Similar to pLMs, most CLMs leverage Transformer architectures,66 akin to those in natural language processing, to understand, generate, and manipulate chemical structures and reactions. These models are broadly categorized based on their architectural design, each optimized for distinct tasks within cheminformatics and drug discovery. The primary model types include encoder-only (BERT-like) models, decoder-only (GPT-like) models, and encoder-decoder architectures, as well as emerging multi-modal LLMs that integrate diverse data formats (Fig. 3). These architectural choices dictate how the models process molecular representations and perform tasks ranging from property prediction to de novo molecular design and retrosynthesis.
3.1.1. Chemical encoders
Encoder-only transformer models, primarily inspired by BERT, are designed to extract contextual representations of molecules and are well-suited for property prediction and molecular understanding. ChemBERTa67 adapts the RoBERTa68 framework with MLM and multitask regression, where auxiliary property prediction tasks are defined using molecular features computed by RDKit.69 Mol-BERT70 applies MLM to learn chemically informed token-level dependencies and is fine-tuned for tasks such as property classification and activity prediction. MoLFormer71 extends this approach using linear attention and rotary embeddings, yielding compact representations useful for downstream regression and classification tasks, though it is limited to relatively small molecules. Further encoder variants refine token representations or integrate structural priors. MolRoPE-BERT72 enhances positional encoding, while MFBERT,73 SELFormer,74 and semi-RoBERTa75 introduce architectural modifications for greater chemical expressiveness. Graph-enhanced encoders like GROVER76 incorporate topological features directly, bridging the gap between sequence and graph representations.
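For context, RDKit-computed auxiliary regression targets of the kind ChemBERTa-style multitask pre-training uses look like the following; the specific descriptor set here is our illustrative choice, not the one from the paper.

```python
# Sketch: RDKit molecular descriptors usable as auxiliary multitask regression
# targets (the particular descriptor selection here is illustrative).
from rdkit import Chem
from rdkit.Chem import Descriptors

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
mol = Chem.MolFromSmiles(smiles)

targets = {
    "mol_wt": Descriptors.MolWt(mol),      # molecular weight
    "logp": Descriptors.MolLogP(mol),      # lipophilicity estimate
    "tpsa": Descriptors.TPSA(mol),         # topological polar surface area
    "h_donors": Descriptors.NumHDonors(mol),
}
print(targets)
```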
3.1.2. Chemical decoders
Decoder-only transformer models, following the GPT architecture, are optimized for autoregressive generation and have become essential in de novo molecular design. MolGPT77 prioritizes causality to learn token-wise dependencies and ultimately generates novel molecules. It supports conditional generation strategies to bias outputs toward specific chemical properties. GP-MoLFormer78 is a decoder-only adaptation of MoLFormer-XL71 optimized for tasks such as unconstrained molecule generation, scaffold completion, and conditional property optimization. Other GPT-based chemical models include SMILES-GPT79 and iupacGPT,80 both adapted from GPT-281 for molecular and nomenclature sequence generation. cMolGPT82 extends this framework for controllable generation under property or scaffold constraints. Taiga83 combines GPT modeling with reinforcement learning to guide molecule synthesis toward multi-objective goals.
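To show the decoding mechanics these models share, here is a toy autoregressive sampler over a miniature SMILES-like vocabulary. The transition table stands in for a trained decoder's next-token distribution and is entirely made up; real models condition on the full token history, not just the previous token.

```python
# Toy sketch of autoregressive SMILES-style generation. The transition table
# is invented and only illustrates token-by-token causal sampling with an
# end-of-sequence token; outputs are not guaranteed to be valid SMILES.
import random

random.seed(0)
VOCAB = ["C", "O", "N", "(", ")", "=", "<eos>"]
# next_token_probs[prev] -> probabilities aligned with VOCAB (made-up values)
next_token_probs = {
    "<bos>": [0.70, 0.10, 0.10, 0.05, 0.00, 0.00, 0.05],
    "C":     [0.40, 0.10, 0.10, 0.10, 0.05, 0.10, 0.15],
    "O":     [0.50, 0.00, 0.00, 0.00, 0.20, 0.00, 0.30],
    "N":     [0.50, 0.00, 0.00, 0.00, 0.20, 0.00, 0.30],
    "(":     [0.60, 0.20, 0.20, 0.00, 0.00, 0.00, 0.00],
    ")":     [0.50, 0.10, 0.10, 0.00, 0.00, 0.10, 0.20],
    "=":     [0.60, 0.20, 0.20, 0.00, 0.00, 0.00, 0.00],
}

tokens, prev = [], "<bos>"
while prev != "<eos>" and len(tokens) < 20:
    prev = random.choices(VOCAB, weights=next_token_probs[prev])[0]
    if prev != "<eos>":
        tokens.append(prev)
print("".join(tokens))  # a SMILES-like string, sampled token by token
```

Conditional generation, as in MolGPT or cMolGPT, amounts to making this next-token distribution depend on property or scaffold tokens prepended to the sequence.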
3.1