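Harbin Institute of Technology (Weihai), School of Software, Weiruan Club (威软俱乐部)
2009 Recruitment Test: Python Section

1. About the test

The Python section of this recruitment test consists of 1 main problem containing 5 sub-problems. Each sub-problem is worth 20% of the Python section's score. When submitting answers, save the design document in Word 2003 format and write it to a consistent standard. Submit all program code as Python source files (*.py), with a header at the top of each file stating the runtime environment and the required Python version.

2. Background

In everyday life, natural-language material is written and read by people. As computing and networking have developed, more and more applications need a way to process natural language by computer; natural language processing (NLP) arose to meet that need. Bioinformatics is a rapidly emerging interdisciplinary field at the intersection of biology, computer science, mathematics and other areas. Its research goal is to reveal "the complexity of the information structure of the genome and the fundamental rules of the genetic language".

This test is a sub-topic of a research project and concerns both bioinformatics and natural language processing. This document tries to describe the problem as clearly as possible and to give, in as much detail as possible, the background knowledge needed to solve it (bioinformatics, algorithm analysis and design, natural language processing, and supplementary material). The supplementary background material will be uploaded to the CMS module or to the FTP server for you to download.

Please note: the background given in this document is enough to complete everything the problem asks for. The supplementary material is optional and intended only for system optimization and extracurricular reading.

3. Experimental data used in the problem

3.1. OMIM

Mendelian Inheritance in Man (MIM) is a database that catalogues the currently known genetic diseases and links them to the relevant genes in the human genome. The database has also been published in book form, the latest edition being the 12th. It has an online version as well, called Online Mendelian Inheritance in Man (OMIM), which can be reached at /sites/entrez?db=omim. The markers on records (*, #, % and others) each denote a different kind of record.

OMIM is one of the objects of study in this system. To make programming easier, we have uploaded an electronic copy of OMIM to the school FTP server; you can download it from 00/开发文档/俱乐部纳新/OMIM/. Please respect copyright when citing the data.

3.2. MeSH

OMIM, like other natural-language material, is full of synonyms. For example, “病人A发热3天” and “患者A发烧3日” (both meaning "patient A has had a fever for 3 days") are semantically the same, but because of the synonyms a program cannot compare them directly. MeSH is a dictionary of medical subject headings published by the U.S. National Library of Medicine (see /mesh/MBrowser.html); it lists the synonyms and common abbreviations of medical terms. The dictionary also provides the parent-child (hierarchical) relationships between keywords; see the files themselves. For the file format and usage, consult online resources. A preprocessed copy of the dictionary has been placed at 00/开发文档/俱乐部纳新/Mesh/. Please respect copyright when using it.

4. Problem description

4.1: Using Python, reorganize the MeSH dictionary into a list of entries in alphabetical order, with fields separated by tab characters. The format is:

term 1 <tab> ID 1
term 2 <tab> ID 2

Looking up terms in a file organized this way is somewhat faster. Write a program that performs this task. Note: because the original file is large, processing speed is a key criterion for judging the program. Requirement: package all source code and documentation as q1.rar and upload it.

4.2: After completing the step above, optimize the file further to improve keyword lookup performance. Indexing is a common technique for this kind of optimization: organize the file in a dictionary-like format, place a partial table of contents and index information at the head of the file, and mark in advance the positions of frequently queried content. Keyword lookup can then be faster than binary search. Design a small program that does this and integrate it with the previous program. Requirement: package all source code and documentation as q2.rar and upload it.

4.3: After completing the steps above, analysing the OMIM data by program requires using the reorganized MeSH dictionary to replace synonyms in OMIM with their standard spelling. Write a small program that normalizes every medical term in the OMIM records to the standard spelling in MeSH. Requirement: package all source code and documentation as q3.rar and upload it.

4.4: After completing the steps above, we begin to analyse the OMIM text. The research needs an automatic way to judge how similar two OMIM records are. Scores range from 0 to 1: identical records score 1, completely unrelated records score 0, and partially related records score between 0 and 1. There are many common scoring methods. For example, if record A frequently mentions the keyword "fever" and record B mentions it too, then A and B are similar to some degree. Terms in a parent-child relationship are also potentially related, for example "finger" and "hand". These hierarchical relationships are represented in the MeSH dictionary; see the relevant files. One reference method for scoring record similarity is given in the excerpt below:

Methods

The OMIM database

The OMIM database contains record-based textual information, one gene or one genetic disorder per record. OMIM also contains literature references and links to other databases. We have used the full-text (TX) and clinical synopsis (CS) fields of all records that describe genetic disorders. We will refer to this combination of the TX and CS fields as a record. OMIM is a rich data set containing 16 357 TX records of which 5132 describe a disease phenotype (semi-automatically selected, manually verified). The remaining records contain variation, mutation, gene/protein, or other information. OMIM was originally designed to be read by humans, not by computer. We have automatically extracted the phenotypic features from OMIM using text analysis techniques.

Creation of feature vectors

We used the anatomy (A) and the disease (C) sections of the medical subject headings vocabulary (MeSH) to extract terms from OMIM. MeSH terms and their plurals and components are concepts. MeSH provides a standardized way to retrieve information that uses different terminology to refer to the same concepts.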
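The term-to-concept matching described above can be illustrated with a small sketch. It is not the method used in the excerpt: the `TERM_TO_CONCEPT` table and the `count_concepts` helper are hypothetical, the entries are made up for illustration, and only single-word terms are matched. Real entries would come from the MeSH files.

```python
import re
from collections import Counter

# Hypothetical term-to-concept table: every synonym or plural maps to one
# concept name. Real entries would be loaded from the MeSH dictionary.
TERM_TO_CONCEPT = {
    "retina": "Retina",
    "retinas": "Retina",
    "eye": "Eye",
    "eyes": "Eye",
    "fever": "Fever",
    "pyrexia": "Fever",
}


def count_concepts(text, term_to_concept=TERM_TO_CONCEPT):
    """Count how often the terms of each concept occur in one record.

    Simplification (an assumption of this sketch): the text is split into
    lowercase words, so multi-word terms are not recognized.
    """
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        concept = term_to_concept.get(word)
        if concept is not None:
            counts[concept] += 1
    return counts


record = "Pyrexia and fever were noted; both eyes showed retina damage."
print(count_concepts(record))
```

Because "pyrexia" and "fever" map to the same concept, both occurrences are counted under one entry, which is the point of using a standardized vocabulary.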
Its size and internal hierarchical structure make it a rich dictionary, which is needed to match the OMIM texts. MeSH concepts serve as phenotype features characterizing OMIM records: every entry in the feature vectors represents an MeSH concept. The number of times the terms for a given concept are found in an OMIM record reflects the concept's relevance to the phenotype. Nonspecific concepts like syndrome or disease were excluded.

Refinement of the feature vectors

MeSH concepts can be very broad like Eye or more specific like Retina. MeSH includes a concept hierarchy that describes relationships such as Eye → Retina → Photoreceptors. Eye is called a hypernym of Retina, which in turn is a hypernym of Photoreceptors, etc. Conversely, Retina is called a hyponym of Eye. To ensure that the concepts Eye and Retina are recognized as similar, we use the MeSH hierarchy to encode this similarity in the feature vectors by increasing the value of all hypernyms as described in equation (1) (Figure 1).

$$ r_c = r_{c,\mathrm{counted}} + \frac{\sum r_{\mathrm{hypos}}}{n_{\mathrm{hypo},c}} \qquad (1) $$

For any concept c, its relevance r_c becomes the actual count of the concept in a document, r_{c,counted}, plus the relevance sum of the concept's hyponyms, r_hypos. This sum is divided by the number of hyponyms, n_{hypo,c}. This equation is applied iteratively from the most detailed level in the MeSH tree, till the highest hypernym level is reached.

Not all concepts in the OMIM records are equally informative. For example, retina pigment epithelium occurs rarely, and thus provides more specific information than very frequently occurring terms such as Brain. We allowed for differences in the importance of concept frequencies by using the inverse document frequency measure [15].

$$ gw_c = \log\frac{N}{n_c} \qquad (2) $$

The inverse document frequency or global weight of concept c (gw_c) is the logarithm of the total number of records analyzed (N; N = 5080) divided by the number of records that contain concept c, n_c.

Not all OMIM records contain equally extensive descriptions.
These differences will make a comparison between records difficult because the diversity and the frequency of concepts in the larger records will exceed those in the smaller records. Equation (3) was used to (partly) correct for these record size differences [15]. The local weight of concept c in a record is a function of the concept's frequency r_c divided by the frequency of the most occurring MeSH concept in that record, r_mf. The three feature vector corrections were applied in the order of equations (1)-(3).

Comparing OMIM records

The similarity between OMIM records can be quantified by comparing the feature vectors that are expanded and corrected (equations (1)-(3)). Similarities between feature vectors were determined by the cosines of their angles (equation (4)) [16]. The similarity between the feature vectors X and Y (s(X, Y)) is a function of their respective concept fr
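The scoring steps described in the Methods excerpt (hypernym propagation, global weight, local weight, cosine similarity) can be sketched as follows. This is a minimal illustration under stated assumptions, not the excerpt's implementation. All function names are hypothetical and the natural logarithm is an assumed base. The exact form of equation (3) is not given in the text, so the plain ratio r_c / r_mf stands in for it. Multiplying relevance by the global weight before the local correction is also an assumption.

```python
import math


def propagate(counted, children):
    """Hypernym propagation as described for equation (1):
    r_c = counted_c + sum(r of hyponyms) / number of hyponyms.
    Recursion evaluates the most detailed concepts first.
    Assumes `children` describes an acyclic hierarchy."""
    relevance = {}

    def visit(concept):
        if concept not in relevance:
            hypos = children.get(concept, [])
            extra = sum(visit(h) for h in hypos) / len(hypos) if hypos else 0.0
            relevance[concept] = counted.get(concept, 0) + extra
        return relevance[concept]

    for concept in set(counted) | set(children):
        visit(concept)
    return relevance


def global_weights(records):
    """Inverse document frequency as described for equation (2):
    gw_c = log(N / n_c). The natural log is an assumption."""
    n = len(records)
    doc_freq = {}
    for rec in records:
        for concept, r in rec.items():
            if r > 0:
                doc_freq[concept] = doc_freq.get(concept, 0) + 1
    return {c: math.log(n / df) for c, df in doc_freq.items()}


def local_weights(rec):
    """Placeholder for equation (3): the text only says the local weight is
    a function of r_c / r_mf, so the plain ratio is used (an assumption)."""
    r_mf = max(rec.values(), default=0)
    return {c: r / r_mf for c, r in rec.items()} if r_mf else {}


def feature_vectors(counted_records, children):
    """Apply the three corrections in the order (1), (2), (3)."""
    expanded = [propagate(rec, children) for rec in counted_records]
    gw = global_weights(expanded)
    weighted = [{c: r * gw[c] for c, r in rec.items() if r > 0}
                for rec in expanded]
    return [local_weights(rec) for rec in weighted]


def cosine(x, y):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(v * y.get(c, 0.0) for c, v in x.items())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


# Toy hierarchy and concept counts, made up for illustration.
children = {"Eye": ["Retina"], "Retina": ["Photoreceptors"]}
records = [{"Retina": 2}, {"Eye": 1, "Photoreceptors": 1}, {"Brain": 3}]
vecs = feature_vectors(records, children)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

In the toy data the first two records share no counted concept, yet they receive a non-zero similarity because propagation credits Eye and Retina in both. The third record mentions only Brain and scores 0 against the first.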
