2025 Report on Data and AI-Enabled Biological Design (English Edition), RAND


June 2025

Expert Insights

PERSPECTIVE ON A TIMELY POLICY ISSUE

ALLISON BERKE, FORREST W. CRAWFORD, TOBY WEBSTER, JAMES SMITH, SANA ZAKARIA, SELLA NEVO

Data and AI-Enabled Biological Design

Risks Related to Biological Training Data and Opportunities for Governance

PE-A3886-1

For more information on this publication, visit /t/PEA3886-1.

About RAND

RAND is a research organization that develops solutions to public policy challenges to help make communities throughout the world safer and more secure, healthier and more prosperous. RAND is nonprofit, nonpartisan, and committed to the public interest. To learn more about RAND, visit

Research Integrity

Our mission to help improve policy and decisionmaking through research and analysis is enabled through our core values of quality and objectivity and our unwavering commitment to the highest level of integrity and ethical behavior. To help ensure our research and analysis are rigorous, objective, and nonpartisan, we subject our research publications to a robust and exacting quality-assurance process; avoid both the appearance and reality of financial and other conflicts of interest through staff training, project screening, and a policy of mandatory disclosure; and pursue transparency in our research engagements through our commitment to the open publication of our research findings and recommendations, disclosure of the source of funding of published research, and policies to ensure intellectual independence. For more information, visit /about/research-integrity

RAND’s publications do not necessarily reflect the opinions of its research clients and sponsors.

Published by the RAND Corporation, Santa Monica, Calif.

RAND® is a registered trademark.

Limited Print and Electronic Distribution Rights

This publication and trademark(s) contained herein are protected by law. This representation of RAND intellectual property is provided for noncommercial use only. Unauthorized posting of this publication online is prohibited; linking directly to its webpage is encouraged. Permission is required from RAND to reproduce, or reuse in another form, any of its research products for commercial purposes. For information on reprint and reuse permissions, visit /about/publishing/permissions.


About This Paper

In this paper, we assess current knowledge about the link between biological data and the capabilities of artificial intelligence models trained on large volumes of biological data (AI-bio models), describe the anticipated impacts of new biological data sources, and outline potentially dangerous capabilities that could come from broad availability of certain types of biological data. We then recommend strategies to limit the potentially dangerous capabilities arising from biological data, including options for governance of experiments and data creation, governance of curation and aggregations of data, controls on access to collections of data, and governance of the use of data for model training.

The audience for this paper includes AI safety and security institutes in the United States and abroad, developers of AI models, organizations that compile and manage large biological datasets (particularly those available publicly that may be used to train AI models), and policymakers interested in biosecurity.

Meselson Center

RAND Global and Emerging Risks is a division of RAND that delivers rigorous and objective public policy research on the most consequential challenges to civilization and global security. This work was undertaken by the division’s Meselson Center, which is dedicated to reducing risks from biological threats and emerging technologies. The center combines policy research with technical research to provide policymakers with the information needed to prevent, prepare for, and mitigate large-scale catastrophes. For more information, contact meselson@rand.org.

Funding

This research was independently initiated and conducted within the Meselson Center using gifts for research at RAND’s discretion from philanthropic supporter Open Philanthropy, as well as gifts from other RAND supporters and income from operations. RAND donors and grantors have no influence over research findings or recommendations.

Acknowledgments

We are grateful to Roger Brent, Gerald Epstein, Joseph Fair, Alison Hottes, Jeffrey Lee, Greg McKelvey, Bria Persaud, Adeline Williams, peer reviewers Sarah Carter and Taylor Frey, and Henry Willis for helpful comments. We thank Epoch for sharing data on AI-bio model capabilities, as well as research partners, support staff, and publishing staff.

Summary

Modern artificial intelligence (AI) models trained on large volumes of biological data (AI-bio models) exhibit striking new capabilities, including prediction of protein folding and binding behavior, sequence generation, and prediction of higher-order functional properties. Capabilities have improved in recent years, and that trend is likely to continue. Decreasing costs of genomic sequencing and ongoing efforts to expand environmental biosurveillance could dramatically expand the amount of biological data that are used to train advanced AI models. Similarly, decreasing costs of computational resources, new AI model architectures, and new model training algorithms could permit AI models to be trained on much larger datasets.

AI-bio models could have many positive impacts on scientific research and health, including assisting in discovery of new therapeutics or predicting complex molecular folding and binding behavior in support of basic scientific research goals. But some AI-bio models may be dual use, providing both beneficial and potentially dangerous capabilities. Potentially dangerous capabilities include the abilities to design toxins, modify existing pathogens for increased virulence, or de novo design a virus, as explored in a recent report from the National Academies of Sciences, Engineering, and Medicine (2025a). A nefarious actor with access to a frontier AI-bio model might be able to use it to design a pathogen with harmful phenotypic characteristics that enhance transmissibility.[1]

[1] A frontier AI-bio model is a state-of-the-art foundational and general purpose model, as opposed to a model narrowly tuned to accomplish a particular task like sequence alignment.

Researchers have assessed the current generation of AI-bio models for dangerous capabilities and provided recommendations that may prevent AI-bio models from being used for malign purposes. But model capabilities are closely linked to the data used to train them, and much less attention has been devoted to the relationship between dangerous capabilities and biological training data. The data that are included (or excluded) in model training heavily influence the models’ capabilities and limitations. Training on diverse biological datasets (sequence, structure, and functional data, such as viral genomes or protein three-dimensional [3D] structures) can expand a model’s biological capabilities, whereas missing or incomplete data can create blind spots. Governance of data used to train AI-bio models could be a useful way to allow beneficial scientific research while safeguarding against potentially dangerous capabilities.

In this paper, we assess current knowledge about the link between biological data and AI-bio model capabilities, describe the anticipated impacts of new biological data sources, and outline potentially dangerous capabilities that could come from broad availability of certain types of biological data. We then recommend strategies to limit the potentially dangerous capabilities arising from biological data, including options for governance of experiments and data creation, governance of curation and aggregations of data, controls on access to collections of data, and governance of the use of data for model training.

Key Recommendations

1. AI model developers should explore and characterize the relationship between training data quantity and type and model capabilities. Potential implementation measures include
   a. producing better performance metrics and benchmarks for dangerous capabilities to help measure the evolution of AI-bio model capabilities
   b. validating links between biological data type and volume and dangerous capabilities to predict future trends in capabilities
   c. conducting tests to determine how restricting subsets of training data affects model capabilities, including beneficial and concerning or potentially dangerous capabilities
   d. considering the effects of large-scale data collection from new metagenomic biosurveillance programs and potential data aggregation solutions and access controls for these programs.

2. AI model developers should consider implementing data-focused mitigations as part of a portfolio of approaches to reduce the potential misuse of AI-enabled biology. Potential implementation measures include
   a. monitoring collection and aggregation of pathogen sequence, structure, and functional data to provide situational awareness about the volume and type of data that could be used to train dual-use AI-bio models
   b. monitoring data sources used to train general-purpose and dedicated models that can predict or design functional properties related to pathogenicity (e.g., activity, binding).

3. Policymakers should develop usage guidelines for any person or entity that uses U.S. government–funded biological datasets that could be used to train AI models. Potential implementation measures include
   a. evaluating the costs and benefits of implementing measures to control the use of federally funded data and ensuring their security and integrity
   b. advising the researchers using federally funded data to train AI models about usage guidelines for the capabilities of the resultant AI models, including advice on avoiding dual-use capabilities and avoiding creating the potential for misuse by training models on pathogen data
   c. considering developing guidance for federal agencies to implement stewardship of biological datasets that protects the use and quality of those datasets with regard to AI training.

4. AI developers and policymakers should conduct a capability assessment when collecting and aggregating pathogen data and when training models on pathogen data. This capability assessment could include predicted model capabilities and an assessment of the consequences of making functional pathogen data public and widely available. Potential implementation measures include
   a. developing guidelines for access control measures for future concerning or potentially dangerous datasets; these guidelines could include reviews of plans for any AI model training prior to access
   b. conducting risk assessments when collecting and aggregating pathogen data or when using those data to train AI models, including the degree to which such datasets or aggregations may already have been mirrored in repositories that are outside direct control, and monitoring data sources that can predict or design properties related to pathogenicity.


Contents

About This Paper
Summary
Tables

CHAPTER 1
Introduction
   Dangerous Capabilities Enabled by AI-Enabled Biology

CHAPTER 2
Biological Data and AI-Bio Models
   The Development and Identification of Dangerous Capabilities in AI-Bio Models
   Biosecurity Concerns from Data Used to Train Leading Models
   Characteristics Contributing to Data That Confer Dangerous Capabilities

CHAPTER 3
Data Governance Options for Dangerous Capability Reduction
   Governance of Experiments and Data Creation
   Governance of Access, Curation, and Use of Existing Data

CHAPTER 4
Recommendations
   Recommendations for Developers
   Recommendations for Policymakers
   Recommendation for Both Developers and Policymakers

Abbreviations
References


Tables

Table 2.1. Common Biological Datasets and Their Development, by Data Type
Table 2.2. Characteristics of Biological AI Models Trained on Large Datasets
Table 2.3. Dual-Use AI-Enabled Biological Design Capabilities
Table 2.4. Dual-Use Data and Its Use in Training AI Models
Table 2.5. Examples of Dual Use Justifications for Including Potentially Hazardous Pathogen Data and Example Sources

Chapter 1

Introduction

Artificial intelligence (AI) models are being trained on large-scale biological data, including sequences, structure, and annotations, most of which are derived from physical experiments. An AI-bio model is an AI model that has a specific purpose in the biological sciences and that has been trained on biological data.[2] Examples of models trained on large-scale biological data include AlphaFold, ESM-2, Evo, and ProGen2, as described in a March 2025 National Academies of Sciences, Engineering, and Medicine (NASEM) report (2025a). Data sources used to train these AI-bio models include the National Center for Biotechnology Information (NCBI)’s (undated) GenBank, the Protein Data Bank (PDB) (Sussman et al., 1998), the Swiss Bioinformatics Resource Portal Expasy (Gasteiger et al., 2003), and many others that span sequences, structures, and functional data (NASEM, 2025a). There has been major growth in biological data over the past few decades, and the growth rate remains rapid (Goh and Wong, 2020). A broad range of publicly available data and private data are used for model training. This includes pathogen databases, data on toxins, and other data on potentially concerning or dangerous organisms or biological materials. For the purposes of this work, we consider concerning or dangerous organisms or materials to be controlled pathogens (e.g., pathogens whose possession is controlled by the Federal Select Agent Program or by export controls) and new or engineered “transmissible biological agents with epidemic or pandemic potential” (NASEM, 2025a, p. 2).

[2] The U.S. government provides the following relevant definitions in Executive Order (EO) 14110:

“[A]rtificial intelligence” or “AI” has the meaning set forth in 15 U.S.C. 9401(3): a machine-based system that can, for a given set of human-defined objectives, make predictions, recommendations, or decisions influencing real or virtual environments. Artificial intelligence systems use machine- and human-based inputs to perceive real and virtual environments; abstract such perceptions into models through analysis in an automated manner; and use model inference to formulate options for information or action (EO 14110, 2023, Section 3).

“AI model” means a component of an information system that implements AI technology and uses computational, statistical, or machine-learning techniques to produce outputs from a given set of inputs (EO 14110, 2023, Section 3).

The current cohort of AI-bio models, beginning with AlphaFold in 2018 (Shi, 2024), has demonstrated excellent performance at such tasks as protein folding and structure prediction (for example, in 2018 and 2020, AlphaFold won the Critical Assessment of protein Structure Prediction [CASP], an international protein-folding prediction competition, and in 2022 the majority of entrants used AlphaFold [Jumper et al., 2021; Callaway, 2023]). Models can also design and predict sequences, mutational functions, and binding relationships. It could be the case that smaller models trained with less compute can display capabilities that are more concerning than those of larger models. In particular, this could be the case if the smaller models are given specific training, including access to training data directly related to dual-use capabilities.

Model capabilities are primarily measured through performance on benchmarks and evaluations, such as CASP (Kryshtafovych et al., 2023) (for protein folding), BioLP-bench (Ivanov, 2024) (for biological protocol generation), LAB-bench (Laurent et al., 2024) (for practical scientific knowledge), or GUANinE (Robson and Ioannidis, 2024) (for functional prediction from sequence data).[3] Accuracy compared with known data or experimental standards, lack of hallucinations and errors, efficiency, and specificity are components of model performance on these evaluations.

[3] BioLP-bench = Biological Lab Protocol benchmark; LAB-bench = Language Agent Biology Benchmark; GUANinE = Genome Understanding and ANnotation in silico Evaluation.

Increasing the quantity of training data used by AI-bio models is likely to improve the capabilities of these models, but the relationship between amount or type of training data and capability development is not well understood. We do not currently have evidence demonstrating how biological models’ performance improves with increased training data. Beyond training data quantity, training data diversity may also contribute to the capabilities of AI-bio models. As a result of shotgun sequencing and other massively parallel techniques that speed the process of sequencing large quantities of nucleic acids (Satam et al., 2023), sequence data are growing more rapidly than structure or annotation data. Annotation data, separate from sequence data, include information about the molecular properties, transcriptomic properties, and epigenomic properties of a given sequence or structure. These properties can include functional information, such as expression and activity data for a particular sequence. However, the majority of the applications of AI-bio models involve predicting structure and function (as a component of annotation), meaning that additional sequence data may not directly correlate to additional functionality for AI-bio models.

In this paper, we assess current knowledge about the link between biological data and AI-bio model capabilities, describe the anticipated impacts of new sources of biological data, and outline dangerous AI model capabilities that could arise from broad availability of certain types of biological data. We recommend strategies to limit the potentially dangerous capabilities arising from biological data, including options for governance of experiments and data creation, governance of curation and aggregations of data, controls on access to collections of data, and governance of the use of data for model training. This paper does not identify specific types of pathogen data that should be restricted, nor does it develop quantitative relationships between types and quantities of training data and model capabilities.

Dangerous Capabilities Made Possible by AI-Enabled Biology

AI-bio models enable people with biological expertise and education to perform biological prediction and design tasks more quickly and efficiently, and they could allow people with little or no biological education to perform biological analysis and modeling that would otherwise likely require greater expertise (Pannu et al., 2024; Bloomfield et al., 2024). AI-bio models have greatly improved the state of protein folding and structure prediction abilities since the development of AlphaFold, a program that enables the in silico design of proteins and peptides with desired structural or sequence characteristics (Jumper et al., 2021; European Molecular Biology Laboratory–European Bioinformatics Institute [EMBL-EBI], 2025). The improvement in AI biotool capabilities has been hypothesized to enable the design of novel or enhanced pathogens, biotoxins, and other potentially dangerous biological materials (Hunter, 2024). The use of AI-bio models to perform phenotypic prediction from a sequence can allow for the efficient design of biological materials that possess a certain phenotype, enabling the design of molecules that evade immune or antibiotic binding, or have predicted binding capabilities (Lu et al., 2023; Sandbrink et al., 2023; Abramson et al., 2024). AI-bio models have substantial and exciting beneficial uses, such as in drug discovery or basic research, but their potential for harm should not go unexplored (Wheeler, 2025).

In this paper, we provide capability assessment guidance. This guidance is intended to be applied to types of biological data that could be used to train dual-use AI-bio models. This work draws on the authors’ experience and expertise in biology, biotechnology, and governance of dual-use research. In addition, we conducted a review of relevant research and gray literature (including think tank and industry reports, commentaries, or recommendations) related to AI-enabled biology, biological data, and governance of data.

We focus on providing options and recommendations on governance of biological data used to train AI-bio models to prevent deliberate misuse of AI-bio models in the design of biological weapons. In this paper, we do not consider ways that AI-bio models could be used to assist developers of bioweapons across the bioweapon development process, including in stages earlier than the design of a weapon (such as experimentation around both the development and inclusion of functional capabilities in a non-weaponized biological construct) (Mouton, Lucas, and Guest, 2023).

Chapter 2

Biological Data and AI-Bio Models

There are many ways to categorize biological data (Wooley and Lin, 2005), and several are relevant to our purposes. We have chosen to use three simple categories (sequence, structure, and functional data) that we feel best reflect how data are used to train AI models and how that relates to the biological capabilities of these models, especially those models that are related to pathogens. Table 2.1 describes common sources of sequence, structure, and functional data used to train AI-bio models. Many of these data sources are public and growing, thereby providing additional training data for AI-bio models.

Model capabilities are understood to increase as the quantity of training data increases, but the relationship is not directly linear, and this is thought to hold true for both beneficial model capabilities and dangerous capabilities (Maug, O’Gara, and Besiroglu, 2024). It would be useful to be able to predict how the capabilities of AI models will change over time. For large language models, which are trained on data that are not exclusively biological data and have more general functionality to respond to natural-language queries and generate text or images, we can observe and characterize relationships between model performance on a given task; amount of compute used for training; and the type, quality, and volume of data used to train it. High-level relationships have been observed in this way for large language models and are often referred to as “scaling laws” (Kaplan et al., 2020). For language models, increasing training compute, data volume, and model size generally improves performance.

There is early evidence showing that some biological models’ performances improve with increased training compute (Workman, 2024) and model size (Hesslow, 2022). As of the time of writing, there is no evidence demonstrating how biological models’ performances improve with increased training data. If one observes that biological models’ performances do scale with increased training data or that training data volume becomes a limiting factor when scaling compute or model size, then interventions focused on the generation, availability, or use of data may influence model development.
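The kind of relationship discussed in this chapter, between training data volume and model performance, is commonly summarized for language models as a power law. As a minimal sketch, assuming performance is measured as a loss that falls as a power of dataset size (loss = a * size^(-alpha)), the following Python estimates a and alpha by least squares in log-log space. The function name fit_power_law and every number below are hypothetical and synthetic, chosen only for illustration; they are not results from Kaplan et al. (2020) or from any biological model.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-alpha) by ordinary least squares on logs.

    Taking logs gives log(loss) = log(a) - alpha * log(size), a straight
    line whose slope is -alpha and whose intercept is log(a).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, made-up measurements generated from an exact power law
# (a = 5.0, alpha = 0.1). Real evaluations would be noisy.
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [5.0 * s ** -0.1 for s in sizes]

a, alpha = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, alpha = {alpha:.3f}")
```

A fitted exponent close to zero would suggest that additional data of that type buys little extra performance, which is the situation in which data-focused interventions would matter least. Whether any such fit holds for biological models is, as the text notes, not yet established.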

Table 2.1. Common Biological Datasets and Their Development, by Data Type

Data Type | Example Datasets | Number of Entries (As of February 2025) | Annual Growth Rate (As of 2025) | Method of Data Generation | Accessibility
DNA sequence | GenBank, NCBI, USA (NCBI, undated), est. 1982 | 3.4 billion sequences | 31.3% | DNA sequencing | Public
DNA sequence | European Nucleotide Archive, Europe (European Nucleotide Archive, undated), est. 1982 | 5 billion sequences | 23% | DNA sequencing | Public
DNA sequence | DNA Data Bank of Japan, Japan (Bioinformation and DNA Data Bank of Japan Center, undated), est. 1986 | 193 trillion sequences | 9.3% | DNA sequencing | Public

Protein

seq
