Corpus-Based Vocabulary Learning (Methods and Resources)

Outline
• Introduction to the concept of a corpus
• Major corpora in China and abroad
• Applications of corpora in foreign language teaching and learning
• Free online corpora (COCA, BNC, Lextutor)
• Software tools
• Resource sharing

What is a corpus?
• Corpus = “a body of naturally occurring text”
• The texts were produced without their creators knowing that they would be used for linguistic analysis
• Newspapers, magazine articles, short stories, academic journals, etc.
• Good genre balance (spoken, fiction, magazines, newspaper, academic)
• Current: not 100-year-old novels
• Large: at least 100 million words
• More words than you would see / hear in a whole lifetime
• Annotated: tagged for part of speech and lemma (e.g. the beat, they beat, and beat as)

Definitions from Chinese scholars
• A corpus is collected language material that has not been organized or processed. (Dai Weidong, 1999)
• A corpus, also called source material, is a collection of naturally occurring language material, both written and spoken. It can serve as the starting point for describing a language or as a means of testing hypotheses about a language. (Chen Jiansheng, 1989)
• A corpus is a “warehouse” of language material built for a specific purpose with a specific method.
• A corpus is a large electronic text collection of a given size, built according to linguistic principles by randomly sampling naturally occurring running language in the form of texts or discourse fragments. In essence, a corpus uses random sampling of natural language use so that a language sample of a given size represents the whole population of language use defined for a particular study. (Yang Huizhong, 2002)

Major corpora outside China
• Brown (1963-64): The Brown University Corpus of Present Day American English. Contains 1 million words of written English from around 1961. Directed by Francis and Kucera.
• COBUILD: directed by John Sinclair; one of the largest corpora to date, with more than 500 million words.
• COCA: Corpus of Contemporary American English; more than 400 million words, 1990-2010.
• BNC: British National Corpus; more than 100 million words. Oxford University / Longman / Chambers Harrap.
• ICE: International Corpus of English; one spoken and one written sub-corpus, 1 million words.
• The Bank of English: 250 million words. Longman / Collins / University of Birmingham.

English learner corpora in China

Name     Type             Institution                                      L1 background   Size (10,000 words)
HKUST    Written          Hong Kong University of Science and Technology   Cantonese       2500
TSLC     Written          University of Hong Kong                          Cantonese       300
CLEC     Written          Guangdong University of Foreign Studies et al.   Chinese         100
COLSEC   Spoken           Shanghai Jiao Tong University et al.             Chinese         50
MSEE     Written/Spoken   South China Normal University                    Chinese         87.6
SWECCL   Written/Spoken   Nanjing University                               Chinese         200

• CLEC, the Chinese Learner English Corpus (Gui Shichun & Yang Huizhong, 2003): a written English corpus of more than 1 million words from Chinese secondary school students, College English Band 4 and Band 6 students, and lower- and upper-year English majors. It is a learner corpus annotated for language errors.
• SWECCL, the Spoken and Written English Corpus of Chinese Learners: made up of two sub-projects, the Spoken English Corpus of Chinese Learners (SECCL) and the Written English Corpus of Chinese Learners (WECCL), with 2 million words in total. Hosted by Nanjing University (Wen Qiufang, Wang Lifei & Liang Maocheng 2005: 2).
• JDEST: 1980s; the first corpus built in China; Shanghai Jiao Tong University; Gui Shichun, Yang Huizhong; academic texts.

Applications of corpora in foreign language teaching and learning
• Rule-based and probability-based practical applications, e.g. automated essay scoring and machine translation
• Research on the target language and on interlanguage
• Dictionary compilation, e.g. the Collins Cobuild Advanced Learner's English Dictionary
• Testing
• Textbook writing
• Translation studies
• Language learning: autonomous, research-oriented learning based on large amounts of authentic language input, e.g. synonym discrimination, semantic prosody, colligation, collocation research, syntactic analysis, and discourse analysis

Examples

Quiz: order by frequency
vigilant, flabbergasted, lost, rinky-dink, miserable

Quiz: order by frequency (answer)
• lost (#2691)
• miserable (#5841, “sad, hopeless”)
• vigilant (#11831, “watching over”)
• flabbergasted (#21701, “extremely surprised”)
• rinky-dink (#44681; “small, cheap, worthless”)

Obvious errors: not in corpus
• Corpus of Contemporary American English (COCA): fall down carefully: no occurrences
• “unrecycling”
• Google: unrecycling (100 hits: lot / little?; they refer to that trashcan picture)
• COCA: unrecycling: no occurrences
• COCA: other words with *recycl* (recycling, nonrecyclable, etc.)
• x* recyclable: negative words before recyclable

Problems: civilized visitor | set up the ecosystem | ecosystem scenery
• *set up the ecosystem: verbs with ecosystem as an object
• no virtuous near duck

Word meaning: collocates
• slippery near crafty: no occurrences
• adjectives near slippery: dangerous
• arouse: collocates (nearby words) near arouse: suspicions, sexually, anger

Four major difficulties in foreign language learning
• native-like pronunciation
• native way of thinking
• discrimination of synonyms
• idiomatic collocation

Synonym discrimination
Synonyms can be told apart by looking at different types of meaning:
• grammatical meaning
• lexical meaning
• denotative meaning
• associative meaning
• connotative meaning
• stylistic meaning
• affective meaning
• collocative meaning

Examples of corpus methods in teaching
• A corpus approach to the autonomous learning of advanced English vocabulary
• Using the SketchEngine tool to teach collocation and synonym discrimination
• An empirical study of teaching verb-noun collocations with online corpora

Free online corpora: an introduction
COCA, BNC, Lextutor

Corpus of Contemporary American English (COCA)
• 410+ million words (cf. British National Corpus, 100m)
• More words than the average speaker will hear in a lifetime
• From more than 160,000 texts
• 20 million words each year from 1990-2010
• Balanced across spoken, fiction, popular magazines, newspapers, and academic journals (20% in each genre each year)
• Freely available online since March 2008
• 60,000-70,000 unique users each month
• Complete, context-sensitive help files online

A good article to learn about COCA (in Chinese): Wang, Xingfu, Liu Guohui, Mark Davies (2008) “The Corpus of Contemporary American English - A Useful Tool for English Teaching and Research”. Computer-Assisted Foreign Language Education in China. 5:24-31

Composition of COCA
410+ million words (1990-present): same composition each year
• Spoken (83 million words): Transcripts of unscripted conversation from more than 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc.).
• Fiction (79 million words): Short stories and plays from literary magazines, children's magazines, popular magazines, first chapters of first edition books 1990-present, and movie scripts.
• Popular Magazines (84 million words): Nearly 100 different magazines, with a good mix (overall, and by year) between specific domains (news, health, home and gardening, women, financial, religion, sports, etc.). A few examples are Time, Men's Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, Sports Illustrated, etc.
• Newspapers (79 million words): Ten newspapers from across the US, including USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle, etc. In most cases, there is a good mix between different sections of the newspaper, such as local news, opinion, sports, financial, etc.
• Academic Journals (79 million words): Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system (e.g. a certain percentage from B (philosophy, psychology, religion), D (world history), L (education), T (technology), etc.), both overall and by number of words per year.

Searching the free online corpus COCA
• COCA online search home page
• COCA search page
• Searching COCA for seldom
• Results for seldom (list view)
• Results for seldom (chart view)
• Expanded context for seldom in COCA (example)

Searching the free online corpus BNC
• BNC home page
• BNC search page
• Searching BNC for outcome
• Results for outcome in BNC (list view)
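
The quiz and collocate examples earlier in these slides rest on two basic corpus operations: ranking words by frequency and counting the words that occur near a node word. The following is only a minimal sketch of those two operations, assuming a tiny invented sample text held in memory rather than COCA or BNC. The names SAMPLE, tokenize, rank_by_frequency, and collocates are hypothetical and do not come from any corpus tool mentioned here.

```python
import re
from collections import Counter

# Tiny invented sample text (an assumption for illustration only);
# a real study would query COCA or BNC through their own interfaces.
SAMPLE = (
    "The report did arouse suspicions among the staff. "
    "His remarks arouse anger in the crowd. "
    "Nothing could arouse suspicions faster than silence. "
    "The steps were slippery and dangerous after the rain."
)


def tokenize(text):
    """Lowercase the text and return its word tokens (hyphenated words kept whole)."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def rank_by_frequency(tokens, words):
    """Order the given words from most to least frequent in the token list."""
    counts = Counter(tokens)
    return sorted(words, key=lambda w: counts[w], reverse=True)


def collocates(tokens, node, window=4):
    """Count the words found within `window` tokens on either side of each node hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts


tokens = tokenize(SAMPLE)
print(rank_by_frequency(tokens, ["anger", "suspicions", "the"]))
print(collocates(tokens, "arouse", window=2).most_common(3))
# A word that never appears near the node simply has a count of zero,
# which mirrors a "no occurrences" result in an online corpus search.
print(collocates(tokens, "slippery", window=2)["crafty"])
```

Online corpora such as COCA work on part-of-speech tagged and lemmatized text, so their collocate searches can also filter by word class (for example, adjectives near slippery); this sketch works on raw word forms only.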




