June 2025
Expert Insights
PERSPECTIVE ON A TIMELY POLICY ISSUE
ALLISON BERKE, FORREST W. CRAWFORD, TOBY WEBSTER, JAMES SMITH, SANA ZAKARIA, SELLA NEVO
Data and AI-Enabled Biological Design
Risks Related to Biological Training Data and Opportunities for Governance
PE-A3886-1
For more information on this publication, visit /t/PEA3886-1.
About RAND
RAND is a research organization that develops solutions to public policy challenges to help make communities throughout the world safer and more secure, healthier and more prosperous. RAND is nonprofit, nonpartisan, and committed to the public interest. To learn more about RAND, visit .
Research Integrity
Our mission to help improve policy and decisionmaking through research and analysis is enabled through our core values of quality and objectivity and our unwavering commitment to the highest level of integrity and ethical behavior. To help ensure our research and analysis are rigorous, objective, and nonpartisan, we subject our research publications to a robust and exacting quality-assurance process; avoid both the appearance and reality of financial and other conflicts of interest through staff training, project screening, and a policy of mandatory disclosure; and pursue transparency in our research engagements through our commitment to the open publication of our research findings and recommendations, disclosure of the source of funding of published research, and policies to ensure intellectual independence. For more information, visit /about/research-integrity.
RAND’s publications do not necessarily reflect the opinions of its research clients and sponsors.
Published by the RAND Corporation, Santa Monica, Calif.
© 2025 RAND Corporation
RAND® is a registered trademark.
Limited Print and Electronic Distribution Rights
This publication and trademark(s) contained herein are protected by law. This representation of RAND intellectual property is provided for noncommercial use only. Unauthorized posting of this publication online is prohibited; linking directly to its webpage is encouraged. Permission is required from RAND to reproduce, or reuse in another form, any of its research products for commercial purposes. For information on reprint and reuse permissions, visit /about/publishing/permissions.
About This Paper
In this paper, we assess current knowledge about the link between biological data and the capabilities of artificial intelligence models trained on large volumes of biological data (AI-bio models), describe the anticipated impacts of new biological data sources, and outline potentially dangerous capabilities that could come from broad availability of certain types of biological data. We then recommend strategies to limit the potentially dangerous capabilities arising from biological data, including options for governance of experiments and data creation, governance of curation and aggregations of data, controls on access to collections of data, and governance of the use of data for model training.
The audience for this paper includes AI safety and security institutes in the United States and abroad, developers of AI models, organizations that compile and manage large biological datasets (particularly those available publicly that may be used to train AI models), and policymakers interested in biosecurity.
Meselson Center
RAND Global and Emerging Risks is a division of RAND that delivers rigorous and objective public policy research on the most consequential challenges to civilization and global security. This work was undertaken by the division’s Meselson Center, which is dedicated to reducing risks from biological threats and emerging technologies. The center combines policy research with technical research to provide policymakers with the information needed to prevent, prepare for, and mitigate large-scale catastrophes. For more information, contact meselson@rand.org.
Funding
This research was independently initiated and conducted within the Meselson Center using gifts for research at RAND’s discretion from philanthropic supporter Open Philanthropy, as well as gifts from other RAND supporters and income from operations. RAND donors and grantors have no influence over research findings or recommendations.
Acknowledgments
We are grateful to Roger Brent, Gerald Epstein, Joseph Fair, Alison Hottes, Jeffrey Lee, Greg McKelvey, Bria Persaud, Adeline Williams, peer reviewers Sarah Carter and Taylor Frey, and Henry Willis for helpful comments. We thank Epoch for sharing data on AI-bio model capabilities, as well as research partners, support staff, and publishing staff.
Summary
Modern artificial intelligence (AI) models trained on large volumes of biological data (AI-bio models) exhibit striking new capabilities, including prediction of protein folding and binding behavior, sequence generation, and prediction of higher-order functional properties. Capabilities have improved in recent years, and this trend is likely to continue. Decreasing costs of genomic sequencing and ongoing efforts to expand environmental biosurveillance could dramatically expand the amount of biological data that are used to train advanced AI models. Similarly, decreasing costs of computational resources, new AI model architectures, and new model training algorithms could permit AI models to be trained on much larger datasets.
AI-bio models could have many positive impacts on scientific research and health, including assisting in discovery of new therapeutics or predicting complex molecular folding and binding behavior in support of basic scientific research goals. But some AI-bio models may be dual use, providing both beneficial and potentially dangerous capabilities. Potentially dangerous capabilities include the abilities to design toxins, modify existing pathogens for increased virulence, or de novo design a virus, as explored in a recent report from the National Academies of Sciences, Engineering, and Medicine (2025a). A nefarious actor with access to a frontier AI-bio model might be able to use it to design a pathogen with harmful phenotypic characteristics that enhance transmissibility.¹
Researchers have assessed the current generation of AI-bio models for dangerous capabilities and provided recommendations that may prevent AI-bio models from being used for malign purposes. But model capabilities are closely linked to the data used to train them, and much less attention has been devoted to the relationship between dangerous capabilities and biological training data. The data that are included in (or excluded from) model training heavily influence the models’ capabilities and limitations. Training on diverse biological datasets (sequence, structure, and functional data, such as viral genomes or protein three-dimensional [3D] structures) can expand a model’s biological capabilities, whereas missing or incomplete data can create blind spots. Governance of data used to train AI-bio models could be a useful way to allow beneficial scientific research while safeguarding against potentially dangerous capabilities.
¹ A frontier AI-bio model is a state-of-the-art foundational and general purpose model, as opposed to a model narrowly tuned to accomplish a particular task like sequence alignment.
In this paper, we assess current knowledge about the link between biological data and AI-bio model capabilities, describe the anticipated impacts of new biological data sources, and outline potentially dangerous capabilities that could come from broad availability of certain types of biological data. We then recommend strategies to limit the potentially dangerous capabilities arising from biological data, including options for governance of experiments and data creation, governance of curation and aggregations of data, controls on access to collections of data, and governance of the use of data for model training.
The audience for this paper includes AI safety and security institutes in the United States and abroad, developers of AI models, organizations that compile and manage large biological datasets (particularly those available publicly that may be used to train AI models), and policymakers interested in biosecurity.
Key Recommendations
1. AI model developers should explore and characterize the relationship between training data quantity and type and model capabilities. Potential implementation measures include
   a. producing better performance metrics and benchmarks for dangerous capabilities to help measure the evolution of AI-bio model capabilities
   b. validating links between biological data type and volume and dangerous capabilities to predict future trends in capabilities
   c. conducting tests to determine how restricting subsets of training data affects model capabilities, including beneficial and concerning or potentially dangerous capabilities
   d. considering the effects of large-scale data collection from new metagenomic biosurveillance programs and potential data aggregation solutions and access controls for these programs.
2. AI model developers should consider implementing data-focused mitigations as part of a portfolio of approaches to reduce the potential misuse of AI-enabled biology. Potential implementation measures include
   a. monitoring collection and aggregation of pathogen sequence, structure, and functional data to provide situational awareness about the volume and type of data that could be used to train dual-use AI-bio models
   b. monitoring data sources used to train general-purpose and dedicated models that can predict or design functional properties related to pathogenicity (e.g., activity, binding).
3. Policymakers should develop usage guidelines for any person or entity that uses U.S. government–funded biological datasets that could be used to train AI models. Potential implementation measures include
   a. evaluating the costs and benefits of implementing measures to control the use of federally funded data and ensuring their security and integrity
   b. advising the researchers using federally funded data to train AI models about usage guidelines for the capabilities of the resultant AI models, including advice on avoiding dual-use capabilities and avoiding creating the potential for misuse by training models on pathogen data
   c. considering developing guidance for federal agencies to implement stewardship of biological datasets that protects the use and quality of those datasets with regard to AI training.
4. AI developers and policymakers should conduct a capability assessment when collecting and aggregating pathogen data and when training models on pathogen data. This capability assessment could include predicted model capabilities and an assessment of the consequences of making functional pathogen data public and widely available. Potential implementation measures include
   a. developing guidelines for access control measures for future concerning or potentially dangerous datasets; these guidelines could include reviews of plans for any AI model training prior to access
   b. conducting risk assessments when collecting and aggregating pathogen data or when using those data to train AI models, including the degree to which such datasets or aggregations may already have been mirrored in repositories that are outside direct control, and monitoring data sources that can predict or design properties related to pathogenicity.
Contents
About This Paper
Summary
Tables
CHAPTER 1
Introduction
  Dangerous Capabilities Enabled by AI-Enabled Biology
CHAPTER 2
Biological Data and AI-Bio Models
  The Development and Identification of Dangerous Capabilities in AI-Bio Models
  Biosecurity Concerns from Data Used to Train Leading Models
  Characteristics Contributing to Data That Confer Dangerous Capabilities
CHAPTER 3
Data Governance Options for Dangerous Capability Reduction
  Governance of Experiments and Data Creation
  Governance of Access, Curation, and Use of Existing Data
CHAPTER 4
Recommendations
  Recommendations for Developers
  Recommendations for Policymakers
  Recommendation for Both Developers and Policymakers
Abbreviations
References
Tables
Table 2.1. Common Biological Datasets and Their Development, by Data Type
Table 2.2. Characteristics of Biological AI Models Trained on Large Datasets
Table 2.3. Dual-Use AI-Enabled Biological Design Capabilities
Table 2.4. Dual-Use Data and Its Use in Training AI Models
Table 2.5. Examples of Dual Use Justifications for Including Potentially Hazardous Pathogen Data and Example Sources
Chapter 1
Introduction
Artificial intelligence (AI) models are being trained on large-scale biological data, including sequences, structure, and annotations, most of which are derived from physical experiments. An AI-bio model is an AI model that has a specific purpose in the biological sciences and that has been trained on biological data.² Examples of models trained on large-scale biological data include AlphaFold, ESM-2, Evo, and ProGen2, as described in a March 2025 National Academies of Sciences, Engineering, and Medicine (NASEM) report (2025a). Data sources used to train these AI-bio models include the National Center for Biotechnology Information (NCBI)’s (undated) GenBank, the Protein Data Bank (PDB) (Sussman et al., 1998), the Swiss Bioinformatics Resource Portal Expasy (Gasteiger et al., 2003), and many others that span sequences, structures, and functional data (NASEM, 2025a). There has been major growth in biological data over the past few decades, and the growth rate remains rapid (Goh and Wong, 2020). A broad range of publicly available data and private data are used for model training. These data include pathogen databases, data on toxins, and other data on potentially concerning or dangerous organisms or biological materials. For the purposes of this work, we consider concerning or dangerous organisms or materials to be controlled pathogens (e.g., pathogens whose possession is controlled by the Federal Select Agent Program or by export controls) and new or engineered “transmissible biological agents with epidemic or pandemic potential” (NASEM, 2025a, p. 2).
The current cohort of AI-bio models, beginning with AlphaFold in 2018 (Shi, 2024), has demonstrated excellent performance at such tasks as protein folding and structure prediction (for example, in 2018 and 2020, AlphaFold won the Critical Assessment of protein Structure Prediction [CASP], an international protein-folding prediction competition, and in 2022 the majority of entrants used AlphaFold [Jumper et al., 2021; Callaway, 2023]). Models can also design and predict sequences, mutational functions, and binding relationships. It could be the case that smaller models trained with less compute can display capabilities that are more concerning than those of larger models. In particular, this could be the case if the smaller models are given specific training, including access to training data directly related to dual-use capabilities.
² The U.S. government provides the following relevant definitions in Executive Order (EO) 14110:
“[A]rtificial intelligence” or “AI” has the meaning set forth in 15 U.S.C. 9401(3): a machine-based system that can, for a given set of human-defined objectives, make predictions, recommendations, or decisions influencing real or virtual environments. Artificial intelligence systems use machine- and human-based inputs to perceive real and virtual environments; abstract such perceptions into models through analysis in an automated manner; and use model inference to formulate options for information or action (EO 14110, 2023, Section 3).
“AI model” means a component of an information system that implements AI technology and uses computational, statistical, or machine-learning techniques to produce outputs from a given set of inputs (EO 14110, 2023, Section 3).
Model capabilities are primarily measured through performance on benchmarks and evaluations, such as CASP (Kryshtafovych et al., 2023) (for protein folding), BioLP-bench (Ivanov, 2024) (for biological protocol generation), LAB-bench (Laurent et al., 2024) (for practical scientific knowledge), or GUANinE (Robson and Ioannidis, 2024) (for functional prediction from sequence data).³ Accuracy compared with known data or experimental standards, lack of hallucinations and errors, efficiency, and specificity are components of model performance on these evaluations.
Increasing the quantity of training data used by AI-bio models is likely to improve the capabilities of these models, but the relationship between amount or type of training data and capability development is not well understood. We do not currently have evidence demonstrating how biological models’ performance improves with increased training data. Beyond training data quantity, training data diversity may also contribute to the capabilities of AI-bio models. As a result of shotgun sequencing and other massively parallel techniques that speed the process of sequencing large quantities of nucleic acids (Satam et al., 2023), sequence data are growing more rapidly than structure or annotation data. Annotation data, separate from sequence data, include information about the molecular properties, transcriptomic properties, and epigenomic properties of a given sequence or structure. These properties can include functional information, such as expression and activity data for a particular sequence. However, the majority of the applications of AI-bio models involve predicting structure and function (as a component of annotation), meaning that additional sequence data may not directly correlate to additional functionality for AI-bio models.
In this paper, we assess current knowledge about the link between biological data and AI-bio model capabilities, describe the anticipated impacts of new sources of biological data, and outline dangerous AI model capabilities that could arise from broad availability of certain types of biological data. We recommend strategies to limit the potentially dangerous capabilities arising from biological data, including options for governance of experiments and data creation, governance of curation and aggregations of data, controls on access to collections of data, and governance of the use of data for model training. This paper does not identify specific types of pathogen data that should be restricted, nor does it develop quantitative relationships between types and quantities of training data and model capabilities.
³ BioLP-bench = Biological Lab Protocol benchmark; LAB-bench = Language Agent Biology Benchmark; GUANinE = Genome Understanding and ANnotation in silico Evaluation.
Dangerous Capabilities Made Possible by AI-Enabled Biology
AI-bio models enable people with biological expertise and education to perform biological prediction and design tasks more quickly and efficiently, and they could allow people with little or no biological education to perform biological analysis and modeling that would otherwise likely require greater expertise (Pannu et al., 2024; Bloomfield et al., 2024). AI-bio models have greatly improved the state of protein folding and structure prediction abilities since the development of AlphaFold, a program that enables the in silico design of proteins and peptides with desired structural or sequence characteristics (Jumper et al., 2021; European Molecular Biology Laboratory–European Bioinformatics Institute [EMBL-EBI], 2025). The improvement in AI biotool capabilities has been hypothesized to enable the design of novel or enhanced pathogens, biotoxins, and other potentially dangerous biological materials (Hunter, 2024). The use of AI-bio models to perform phenotypic prediction from a sequence can allow for the efficient design of biological materials that possess a certain phenotype, enabling the design of molecules that evade immune or antibiotic binding, or have predicted binding capabilities (Lu et al., 2023; Sandbrink et al., 2023; Abramson et al., 2024). AI-bio models have substantial and exciting beneficial uses, such as in drug discovery or basic research, but their potential for harm should not go unexplored (Wheeler, 2025).
In this paper, we provide capability assessment guidance. This guidance is intended to be applied to types of biological data that could be used to train dual-use AI-bio models. This work draws on the authors’ experience and expertise in biology, biotechnology, and governance of dual-use research. In addition, we conducted a review of relevant research and gray literature (including think tank and industry reports, commentaries, or recommendations) related to AI-enabled biology, biological data, and governance of data.
We focus on providing options and recommendations on governance of biological data used to train AI-bio models to prevent deliberate misuse of AI-bio models in the design of biological weapons. In this paper, we do not consider ways that AI-bio models could be used to assist developers of bioweapons across the bioweapon development process, including in stages earlier than the design of a weapon (such as experimentation around both the development and inclusion of functional capabilities in a non-weaponized biological construct) (Mouton, Lucas, and Guest, 2023).
Chapter 2
Biological Data and AI-Bio Models
There are many ways to categorize biological data (Wooley and Lin, 2005), and several are relevant to our purposes. We have chosen to use three simple categories (sequence, structure, and functional data) that we feel best reflect how data are used to train AI models and how that relates to the biological capabilities of these models, especially those models that are related to pathogens. Table 2.1 describes common sources of sequence, structure, and functional data used to train AI-bio models. Many of these data sources are public and growing, thereby providing additional training data for AI-bio models.
Model capabilities are understood to increase as the quantity of training data increases, but the relationship is not directly linear, and this is thought to hold true for both beneficial model capabilities and dangerous capabilities (Maug, O’Gara, and Besiroglu, 2024). It would be useful to be able to predict how the capabilities of AI models will change over time. For large language models, which are trained on data that are not exclusively biological data and have more general functionality to respond to natural-language queries and generate text or images, we can observe and characterize relationships between model performance on a given task; amount of compute used for training; and the type, quality, and volume of data used to train it. High-level relationships have been observed in this way for large language models and are often referred to as “scaling laws” (Kaplan et al., 2020). For language models, increasing training compute, data volume, and model size generally improves performance. There is early evidence showing that some biological models’ performances improve with increased training compute (Workman, 2024) and model size (Hesslow, 2022). As of the time of writing, there is no evidence demonstrating how biological models’ performances improve with increased training data. If one observes that biological models’ performances do scale with increased training data or that training data volume becomes a limiting factor when scaling compute or model size, then interventions focused on the generation, availability, or use of data may influence model development.
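
To make the idea of a scaling relationship concrete, the following is a minimal sketch of how a power-law relationship between training data volume and model loss could be estimated, in the spirit of the language-model scaling laws cited above. It is an illustration only: the function name `fit_power_law`, the assumed functional form (loss ≈ a × size^(−b)), and all numbers are hypothetical and synthetic. No such relationship has been demonstrated for AI-bio models, and the values below are not measurements of any model.

```python
import math


def fit_power_law(sizes, losses):
    """Fit loss ~ a * size**(-b) by ordinary least squares on log-log values.

    Hypothetical helper for illustration; returns the tuple (a, b).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of the log-log line equals -b
    intercept = mean_y - slope * mean_x    # intercept equals log(a)
    return math.exp(intercept), -slope


# Synthetic data generated from an assumed power law (a = 5.0, b = 0.1).
# These are NOT measurements of any real model.
synthetic_sizes = [1e6, 1e7, 1e8, 1e9]
synthetic_losses = [5.0 * s ** -0.1 for s in synthetic_sizes]

a, b = fit_power_law(synthetic_sizes, synthetic_losses)

# Extrapolate the fitted curve to a larger hypothetical dataset size.
predicted_loss = a * (1e10) ** (-b)
print(f"a = {a:.3f}, b = {b:.3f}, predicted loss at 1e10 = {predicted_loss:.3f}")
```

If real evaluation results showed that such a fit described AI-bio model performance well, the fitted exponent would indicate how strongly additional training data translate into capability, which is the kind of evidence that would bear on data-focused interventions.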
Table 2.1. Common Biological Datasets and Their Development, by Data Type
| Data Type | Example Datasets | Number of Entries (As of February 2025) | Annual Growth Rate (As of 2025) | Method of Data Generation | Accessibility |
|---|---|---|---|---|---|
| DNA sequence | GenBank, NCBI, USA (NCBI, undated), est. 1982 | 3.4 billion sequences | 31.3% | DNA sequencing | Public |
| DNA sequence | European Nucleotide Archive, Europe (European Nucleotide Archive, undated), est. 1982 | 5 billion sequences | 23% | DNA sequencing | Public |
| DNA sequence | DNA Data Bank of Japan, Japan (Bioinformation and DNA Data Bank of Japan Center, undated), est. 1986 | 193 trillion sequences | 9.3% | DNA sequencing | Public |
Protein
seq