




Data Provenance Support in Relational Databases for Stored Procedures

Winly Jurnawan and Uwe Röhm
School of Information Technologies, University of Sydney, Sydney NSW 2006, Australia

Abstract

The increasing amounts of data produced by automated scientific instruments require scalable data management platforms for storing, transforming and analyzing scientific data. At the same time, it is paramount for scientific applications to keep track of provenance information for quality control purposes and to be able to retrace workflow steps. Relational database systems are designed to efficiently manage and analyze large data volumes, and modern extensible database systems can also host complex data transformations as stored procedures. However, the relational model does not naturally support data provenance or lineage tracking. In this paper, we focus on providing data provenance management in
relational databases for stored procedures. Our approach, called PSP, leverages the XML capabilities of SQL:2003 to keep track of the lineage of data that has been processed by any stored procedure in a relational database as part of a scientific workflow. We show how this approach can be implemented in a state-of-the-art DBMS and discuss how the captured provenance data can be efficiently queried and analyzed.

Keywords: Provenance, Stored Procedure, Relational Database.

1 Introduction

Scientific research is currently experiencing a digital revolution. Computer-supported scientific instruments such as next-generation DNA sequencers or radio telescopes can conduct large experimental series automatically in a 24x7 setting and generate massive amounts of data. For instance, in bioinformatics, the 1000 Genomes Project employs next-generation DNA sequencing technology that generates approximately
75 terabytes of data per week in just one of three participating labs. Another example is the exponential growth of the NCBI GenBank data since 1998, which grew to more than 56 billion base pairs (approx. 56 Gb) by 2005. In physics, a prominent example is the Large Hadron Collider (LHC) project, which expects data volumes to hit the exabyte (10^18 B) scale.

These large-scale scientific experiments depend on a reliable and scalable production environment. For example, the 1000 Genomes Project operates more than 30 next-generation DNA sequencers in parallel, automatically producing terabytes of raw sequencing data that have to be further processed, analyzed and archived. Given the high experimental data throughput generated by the various sequencing machines, which are supervised by many scientists and run in 24x7 operation, it is impossible to keep track of the experimental process manually. The traditional lab log book fails in such an environment because it cannot keep up with the data throughput. Hence, computer support for keeping track of these processes is needed; this support is often referred to as data provenance.

Provenance, in brief, is the history of a piece of data, collected throughout its lifetime. The history refers to the processes that produced, manipulated and affected the data. E-science research uses provenance information in various ways, depending on the purpose of the research and the requirements for historical information. For instance, in human genome sequencing, the amount of chemicals used in the wet lab is crucial to ensuring the reliability of the sequencing results.

There have been several attempts to manage scientific data with relational databases and to implement scientific processes inside a DBMS using existing extensibility features. Examples include dbBLAST for alignment searches on large gene databases, SDSS in astronomy research, and SciDB for the physics community. Recent work has also looked into supporting a whole DNA sequencing pipeline for high-throughput genomics inside a relational database using CLR-based stored procedures. These attempts indicate that large-scale experimental research, which is still mostly organized around files, will slowly move towards database-centric approaches. This development brings a new challenge to the database community, because there is no common model for capturing and representing data provenance for database-centric e-science
research.

In this paper, we propose a conceptual provenance model to track workflow provenance for e-science tools that are implemented in a relational database. We implemented Provenance for Stored Procedure (PSP) as a proof of concept for our provenance model. It is implemented inside the relational database using CLR stored procedures, with provenance data represented in SQL/XML format. In this initial work, we focus on the provenance of data that is manipulated by stored procedures. Our main contributions are:

1. A conceptual provenance model for relational databases that consists of three main components: data, agent and process.
2. An XML-based representation for provenance data in relational databases. The XML format is chosen to accommodate provenance data requirements that change over time.
3. Two variants of PSP implemented as proofs of concept, Naïve PSP and Centralize PSP, which differ in their provenance data storage scheme. We evaluated both approaches using dbBLAST as the agent and the human genome as the data set.

The remainder of this paper is structured as follows. Section 2 presents other work on provenance that is related to our research. The conceptual foundation of our provenance model for relational databases is presented in Section 3. In Section 4, we present the implementation of PSP and instances of our approaches. Section 5 presents the experimental results and their analysis. Sections 6 and 7 present our future work and conclusions.

2 Related Work

There are many existing provenance approaches, implemented on top of various technologies such as scripting architectures, service-oriented architectures, and relational database architectures. In this section, we focus on the provenance approaches that are most closely related to relational database architectures.

Two major categories of provenance are kept in scientific research: data flow provenance and data workflow provenance. Data flow provenance focuses on recording how or where data has been copied and moved across databases. Similarly, data flow provenance has been described as "where provenance", which mainly keeps track of the source location(s) of the current data. Data flow provenance captures and stores information such as the source of the data, the type of operation performed to retrieve the data (e.g. copy, move, insert), who
performed the operations, and the dates and versions of the data. Data workflow provenance, on the other hand, focuses on recording the history of the processes that have been applied to particular data up to its current stage. It can be considered the recipe for producing the current state of that particular data. Workflow provenance is usually used in research experiments and explorations, where the provenance information provides verification and explanation of the results produced. However, scientists use provenance information in various ways, depending on the purpose of the research and the requirements for historical information. Moreover, these requirements will vary over time. The following existing works are related to our research.

CPDB (Copy Paste Database) is an implementation of the previously introduced where-provenance concept. The main focus of CPDB is on provenance for curated databases, where most of the content is copied or derived from other sources. Thus, CPDB focuses on data provenance, which records the source of the data, rather than workflow provenance, which records the series of processes or events that affect the data. The actual provenance information is represented as XML data and stored separately from the actual data; in CPDB, it is stored in an auxiliary table. CPDB is implemented outside the relational database as a Java application.

Another implementation is REDUX, a provenance management system that captures provenance information for workflows built on the Windows Workflow Foundation (WinWF). REDUX stores its provenance information in an RDBMS, which supports better data management and data queries. However,
it is still an implementation outside the relational database system that merely uses a relational database to store the provenance data. Another feature provided by REDUX is smart replay, which enables the user to replay what has happened to particular data before, that is, the history of the data.

There are also implementations of provenance capture inside the database, namely DBNotes and Trio. DBNotes was originally an implementation of annotation management for data in relational databases. The annotations of annotated data are propagated when that data is transformed. DBNotes takes advantage of the annotations propagated during transformation, using them as the provenance trace. The authors also define provenance annotations, where every piece of data is annotated with its address. Hence, as the annotations are propagated, one of the annotations contains the original address of the data. DBNotes can be categorized as where provenance: it traces the origin of data rather than the workflow.

Trio, on the other hand, is an implementation of provenance tracking that traces information about view data and its transformations in a data warehouse environment through a query inversion method. Trio operates in an RDBMS environment where data are queried, copied, moved, etc. It provides data provenance because it is mainly concerned with keeping track of the source or origin of a particular tuple in a view table. It uses the inversion model to automatically trace the source of the data for sets of tuples created by view queries. These inverse queries are recorded at the granularity of a tuple in a table called the Lineage table.

This related work shows that there are existing provenance approaches implemented inside the database, but only for data flow provenance or annotation-based provenance. Even though REDUX captures workflow provenance, the actual system is still not implemented in the relational database; it only uses a relational database as data storage. We believe that implementing workflow provenance purely inside the
relational database is the gap that we can fill in this research area.

3 Provenance Model

In this section, we present a provenance model that tracks workflow provenance in a relational database for data that is manipulated by stored procedures, written in either SQL or a CLR-based (Common Language Runtime) language. In our approach, we assume that all processes are carried out inside the relational database by an agent, which in our case is a stored procedure. These processes accept input data, manipulate data and generate output results. The object manipulated by the agent is the data, which refers to the actual or result data (records in a relational database table). It can be data with provenance or data without any provenance attached to it. In this model, provenance data is generated by the facilitator, with contents such as the agent that performs the process, the execution
time, the user who invokes the agent, the input query used in the process, etc.

3.1 Provenance Model

There are three main components in our provenance model: facilitator, data, and agent, as depicted in Fig. 1. The facilitator is the central part of our model because it facilitates the process carried out by the agent on the data. Since the facilitator facilitates the process, it has the authority to collect information about the process, the agent that participates in the process, and the data that also participates in the process.

Fig. 1. Provenance for Stored Procedure (PSP) Model

Below are short descriptions of each component:

1. data is the actual data or records in the relational database. It refers to the records or rows in a table, which are the objects processed by the agent. The data can be plain data or data with provenance attached to it; in the latter case, the provenance must comply with our provenance model.
2. agent is a stored procedure in a relational database; it is the active component that processes data. The agent can be a Transact-SQL stored procedure or a CLR stored procedure. It accepts input data, manipulates data and generates output results (data). We assume that all data generated by the agent is in relational table form.
3. facilitator is the central component of our model, because it facilitates the processes in which data and agent participate, and collects and generates provenance for them. The facilitator accepts user input that invokes the agent to process data, and while it facilitates the process it collects the provenance pertaining to the process and generates the workflow provenance on the fly. Finally, it returns the result (the output data generated by the agent) and adds provenance data to the result in the form of an extra column called provenance (refer to Fig. 3). If the previous data contains provenance, it simply appends the new provenance to the existing provenance, on the condition that the existing provenance data complies with our provenance model. In this way, we are able to trace back the entire process
history of this particular data.

The output of our model is exactly the same as the result of the agent (in database table form), except that one provenance column in XML format is appended to the result (refer to Fig. 3).

3.2 Provenance Data Representation

Provenance defines the value of particular data, and each scientific domain has its own interests in and view of provenance information. For example, in bioinformatics, information about data origins (where was the data derived from?) is very valuable, while in commerce the time (which user first purchased this share?) that describes a certain transaction is crucial. Provenance data therefore varies from one domain to another, which compels us to design a provenance data representation that is flexible enough for every domain.

We considered two common options for data representation in a relational database: a relational table and an SQL/XML representation. A relational table has a rigid structure in which all columns have to be predefined, and adding an extra column may take considerable effort. XML, on the other hand, is a semi-structured representation and is more flexible. Extending the XML schema (e.g. when new information needs to be captured) does not affect the existing provenance data, which does not work quite as well with a relational table representation. Because provenance requirements change over time, the XML representation is the better choice for us.

Although the XML representation is very flexible, we need to constrain it. We created an XML schema to define the structure to follow and the data types to use for particular data. The purposes of this restriction are to ensure that correct XML data is created and to simplify the
provenance data retrieval. Note that the XML schema can be modified at any time to cope with changes in provenance requirements.

Fig. 2. The Sample of PSP Provenance Data with two nodes

3.3 Provenance Mapping Approach

The provenance data generated by PSP must be mapped to the result data returned by the agent; otherwise it would defeat the purpose of provenance. In this paper, we present two provenance mapping approaches that map the result data to its provenance: the Naïve mapping approach and the Centralize mapping approach.

3.3.1 Naïve Provenance Mapping Approach

The Naïve provenance mapping approach is a straightforward mapping, because it simply attaches the provenance data to the end of each tuple. The provenance data is in XML format and is attached by adding an extra column (provenance) to the result table. Fig. 3 depicts a series of data returned by the agent with the additional provenance attached to it in the form of a column. The GI_NUM, MSP_que_s, Score, and compliment columns are the result returned by the agent dbBLAST. The Naïve provenance mapping approach simply attaches the provenance as a new column (provenance) to the result.

3.3.2 Centralize Provenance Mapping Approach

The Centralize mapping approach
is an optimization of the Naïve approach that does not store the provenance in a tightly coupled manner. Instead of mapping the provenance data by adding an extra provenance column to the result data, it stores the provenance data in a central table (called ProvenanceSystem) and uses a unique ID to map between the provenance data in the central table and the result data returned by the agent. Fig. 4 depicts a series of data returned by the agent and the ProvenanceSystem table, which are linked by the unique ID. This mapping approach reduces the redundancy of provenance data when several results share a common provenance. For instance, in Fig. 4 the results with GI_NUM 2230-2231 share the same provenance, which is mapped by ID 3.

3.4 Provenance with Series of Processes

In many cases, data is processed in a workflow pipeline that consists of multiple processes. In this paper, we assume that all processes are carried out sequentially in the relational database. Our provenance model handles multiple processes by simply attaches t
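
As an illustration only, the two mapping approaches of Section 3.3 and the appending of one provenance node per process can be sketched in a few lines of Python. This is not the CLR-based PSP implementation: the XML element names (provenance, process, agent, user, query, executedAt) are assumptions, since the schema of Fig. 2 is not reproduced here; the helper function names are hypothetical; and a plain dictionary stands in for the ProvenanceSystem table.

```python
import xml.etree.ElementTree as ET


def new_node(agent, user, query, executed_at):
    """Build one provenance node for a single agent invocation.

    The element names are illustrative assumptions, not the PSP schema.
    """
    node = ET.Element("process")
    ET.SubElement(node, "agent").text = agent
    ET.SubElement(node, "user").text = user
    ET.SubElement(node, "query").text = query
    ET.SubElement(node, "executedAt").text = executed_at
    return node


def append_node(existing_xml, node):
    """Append a node to existing provenance XML, or start a new document."""
    root = ET.fromstring(existing_xml) if existing_xml else ET.Element("provenance")
    root.append(node)
    return ET.tostring(root, encoding="unicode")


def naive_mapping(rows, prov_xml):
    """Naive approach: attach the full provenance XML to every result tuple."""
    return [dict(row, provenance=prov_xml) for row in rows]


def centralize_mapping(rows, prov_xml, central_table):
    """Centralize approach: store each distinct provenance once and let
    the result tuples reference it by ID.

    central_table is a dict {id: xml} standing in for ProvenanceSystem.
    """
    for prov_id, stored in central_table.items():
        if stored == prov_xml:
            break  # this provenance is already stored; reuse its ID
    else:
        prov_id = len(central_table) + 1
        central_table[prov_id] = prov_xml
    return [dict(row, provenance_id=prov_id) for row in rows]


# Hypothetical two-step pipeline; agent names other than dbBLAST, the user,
# the queries and the timestamps are made up for illustration.
step1 = append_node(None, new_node("dbBLAST", "alice", "SELECT ...", "2010-01-01T10:00:00"))
step2 = append_node(step1, new_node("filter_hits", "alice", "SELECT ...", "2010-01-01T10:05:00"))

rows = [{"GI_NUM": 2230}, {"GI_NUM": 2231}]
naive_result = naive_mapping(rows, step2)

provenance_system = {}
central_result = centralize_mapping(rows, step2, provenance_system)
```

In this sketch the Naïve result carries one copy of the XML per tuple, whereas the Centralize result carries only an integer ID per tuple and the XML is stored once, which is the redundancy reduction described in Section 3.3.2.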