GrammaticalComplexityOfSymbolicSequencesABrief_第1页
GrammaticalComplexityOfSymbolicSequencesABrief_第2页
GrammaticalComplexityOfSymbolicSequencesABrief_第3页
GrammaticalComplexityOfSymbolicSequencesABrief_第4页
GrammaticalComplexityOfSymbolicSequencesABrief_第5页
已阅读5页,还剩47页未读 继续免费阅读

付费下载

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、Grammatical ComplexityOf Symbolic Sequences:A Brief IntroductonBailin HaoT-Life Research Center, Fudan UniversityInstitute of Theoretical Physics, Academia SinicaThe Santa Fe Institute, New Mexico, USAhttp:/ Paradigms in TheoreticalDescription of Nature Deterministic: based on periodicities and recu

2、rrences, from Kepler to Yang-Mills Stochastic: based on randomness, from Brownian motion to MSR field theory of hydrodynamics and molecular motors Fractal, self-similar, scale invariant: from phase transitions and critical phenomena to chaotic dynamics Finiteness is the unifying Physics: languages语言

3、学语言学(language 而非 philology)方法方法 统计语言学 “字”的频度和关联 Zipf 定律 代数语言学:生成语法和语法复杂性 串行生成:Chomsky体系 平行生成:Lindenmayer 体系(来自发育生物学) 可因式化语言(Factorizable language)自然语言与遗传语言自然语言与遗传语言相似处:相似处:多义性 冗余度 容错和纠错 长程关联 均基于离散的排列组合系统有某些语法,但不能完全生成方言、个体差异性演化、突变、灭绝历史“垃圾”、古语、“化石”外来语、横向交换相异处:相异处: 标点符号和间隔不同 两种语言的相互作用 二维、三维的相互作用 重复序列的数

4、目和作用An Observation u d c s b tcharge, mass, flavor, charm, p n e charge, mass, spin, magnetic momentum, H C N O P atomic number, ion radius, valence, affinity, H2O NO CO2 molecular weight, polarity, a c g t A D E F G H W Y VBRCA1 PDGFA PROGRAMME:Coarse-Grained Description of NatureUse of Symbols and

5、 Symbolic StringsLanguageGrammar and Complexity (Chomsky, Lindenmayer, etc.)So far this programme has been best realized in the study of dynamics by using Symbolic Dynamics.There have been preliminary attempts in analyzing biological sequences.It may not be a coincidence that the two systems in the

6、universe that most impress us with their open-ended complex design life and mind are based on discrete combinatorial systems. Many biologists believe that if inheritance were not discrete, evolution as we know it could not have taken place.S. Pinker, The Language Instinct (1995)Simple ExamplesAt the

7、 level of words: DOG GOD At sentence level: Dog bites Man Man bites DogN C EGF (Epidermal GF)N C Chymotrypsin (胰凝乳蛋白酶) N C Urokinase (UK) (尿激酶)N C Factor IX (凝血因子IX, X-mas抗血友病因子)N C Plasminogen (纤维蛋白融酶原) 几种丝氨酸蛋白酶的几种丝氨酸蛋白酶的domain组合组合 B.Alberts 等,Mol.Biology of the Cell 第三版 1994. P.123Ca 结合蛋白含3个-s-s-G

8、C 语法复杂性 字母表 例1. = a, c, g, t 例2. = A, C, D W, Y 例3. = a, z, A, Z, +, , 字母表中各种字母组成的一切字母串 (包括空串) * *的任何子集是基于的一种语言语法 = 字母表,初始字母,产生规则 基于该语法的语言Classification of Formal LanguagesChomsky Hierarchy Sequential production rulesLindenmayer SystemsParallel production rulesGenerative Grammar S Sentence NP Noun P

9、hraseVP Verb PhraseAdj AdjectiveArt ArticleS if S then SS either S or SNon-Terminal and Terminal SymbolsN boy | girl | scientist | V sees | believes | loves | eats | Adj young | good | beautiful | Art a | one | theS NP VPVP V NPNP (Art) Adj* NChomsky 语法层次 N 非终结字母集(工作用符号) T 终结字母集 S N 起始字母 P = 生成规则(x

10、y)的集合 x, y 为字母串 关于 x, y 的不同规定导致不同语法 语法 G = (N, T, P, S)0 类语法 x (N T)* N(N T)* y (N T)*至少含有一个非终结字母1 类语法 上下文有关语法 x = t1 a t2 t1, t2 T* a N 2 类语法 上下文无关语法 x = a N 3 类语法 正规语法 x = a y = b 或 bc a, c N b = 空 或 b T形式语言的Chomsky层次层语言计算机存储要求0递归可数REL图灵机(万能计算机)无根1上下文有关CSL线性有界自动机比例於输入字长2上下文无关CFL下推自动机下推区(堆栈)3正规RGL有

11、限自动机不要求 RLRRRLRRab(i)(ii)RLabcbcdA transfer function (a, R) = bA Finite State Automaton(FSA)FSA: Finite State Automata Deterministic FSA Non-Deterministic SFA Equivalence of DFSA and NDFSA: subset construction Minimal DFSA Myhill-Nerode theorem (1958): number of nodes in minDFSAA Pushdown AutomatonP

12、ushdown list StackFirst In Last Out (FILO)A Turing MachineAlan M. Turing (1912-1954)FSA + R/W tapeChurch-Turing Thesis (1936):Any effective (mechanical) computation canbe carried out by a Turing machineTerminals = a, b, cNon-terminal = A, BSequential rules: B aBAc | abc bA bb cA Ac B abc B aBAc aabc

13、Ac aabAcc B abAc aaBAcAc aaBAAc aaabcAAc aaabAcAc aaabbAcc Example: ai b ici | i0 CSL Rules to Generate Gene-Like Sequences( according to David Searls )gene upstream transcript downstreamtranscript 5-untranslated-region start-codon coding-region 3-untranslated-regioncoding-region codon coding-region

14、 | stop-codon | splice | coding regioncodon lys | asn | thr | met | glu | his | pro | asp | ala | gly | tyr | trp | phe | leu | ile | ser | arg | gln | val | cysstart-codon metstop-codon taa | tag | tga leu tt purine | ct base (6)ser ag pyrimidine | tc base (6)arg ag purine | cg base (6)val gt base

15、pro cc base (4)ala gc base gly gg base (4)thr ac base (4) ile at pyrimidine | ata (3)lys aa purine asn aa pyrimidine (2)gln ca purine his ca pyrimidine (2)glu ga purine cys tg pyrimidine (2)phe tt pyrimidine tyr ta pyrimidine (2)asp ga pyrimidine (2)met atg trp tggbase m a | c | g | t purine a | gpr

16、imidine c | tsplice intron intron gt | intron-body | agsplice a a intron splice c c intronsplice t t intron splice g g intron a splice intron a c splice intron c t splice intron t g splice intron gupstream enhancer promotor enhancer enhancer promotor silencer isolator These rules are capable to gene

17、rate an unlimitedset of gene-like sequences, mostly biological nonsense. They may be used to recognize gene-like segments in long DNA sequences.Syntax versus Semantics: texts vs. grammar. Physics behind this coarse-grained description:stereochemistry, interaction between proteins andDNA chains, meta

18、llic ions etc.Symbolic Dynamics Languages19911999谢惠民定理和猜测谢惠民定理和猜测 单峰映射揉序列对应的语言中的单峰映射揉序列对应的语言中的正规语言只有周期序列和最终周正规语言只有周期序列和最终周期序列两种期序列两种 如何走向比正规语言高一级的上如何走向比正规语言高一级的上下文无关语言?下文无关语言? 猜测:单峰映射揉序列对应的语猜测:单峰映射揉序列对应的语言中没有上下文无关语言言中没有上下文无关语言Subintervals determined by the periodic kneadingSequence (RLRRC) Order of

19、visits in the periodic kneadingSequence (RLLRC)Transformations of subintervals a c + d (on reading L) b d (on reading R) c b + c (on reading R) d a (on reading R)InputLRRRqabcdd1100c1010b0010a0001Transfer FunctionsRLac, dbdcb, cdaRLa,b,c,d a,b,c,d c,dc,da,b,ca,b,cb,c,dc,db,c,da,b,c,dStefan matrix fo

20、r 256P in Feigenbaum cascadeStefan matrix for F13=233; Case (a)Stefan matrix for F13=233. Case (b)Stefan matrix for F13=233. Case (c)Stefan matrix for F13=233. Case (d)Symbolic Dynamics Languages19911999 br bl ar al albr blar Alphabet: S = ar, al, br, bl Production rules: Initial symbol (axiom) = ar

21、 Grammar: G = (S, P, ) Language: L (G) S*Development of Anabaena catenula (串珠藻项圈藻属) br ar ar albr bl al al blarP = Lindenmayer SystemsParallel production rules. Finer classification D0L Deterministic, no interaction, i.e., context-free0L non-deterministic, no interaction IL non-deterministic, with I

22、nteraction, i.e., context sensitiveT0L with Table of production rulesTIL E0L Extended to non-terminal symbolsET0L EIL REL of ChomskyRGL Regular CFL Context-FreeCSL Context-Sensitive REL Recursively Enumerable CSL CFL RGL FINDOLRELChomsky LindenmayerIndexed0:REL1:CSLINDET0LE0L2:CFL3:RGLILT0L0LD0LEILL

23、 = aibici | i 0 CSLG = (S, T, ) = abc S = a, b, c T = t1, t2 T1 = a aa, b bb, c cc T2 = a , b , c T0LExample a la Lindenmayer Dyck language: A language of nested parentheses Many types of parentheses Finite depth of nesting Context-free languageOur case: Only 3 types of parentheses Shallow nesting C

24、onjecture (Xie): may be regular language模糊语言学 形式推广不难:Z .G .Yu (喻祖国2001) 如何定量地引用生物知识 Consensus 序列和权重矩阵随机语法 隐马可夫链 = 随机正规语法 更高阶的随机语法? Factorizable Languages Symbolic dynamics leads to factorizable languages A complete genome defines a factorizable langauge An amino acid sequence with unique reconstruct

25、ion (at certain K) defines a factorizable language Modeling in Biology Cells Tissues Organs “Systems”: circulation, respiration, reproduction, neural, sensory, musclular, etc. Organisms, population, ecosystems Animals versus plants Plant development, morphology, physiology and pathologyModeling of Plant MorphologyBy using L-System P. Prusinkiewicz, J. Hanan, Lindenmayer Systems, Fractals, and Plants, LN in Biomath., vol. 79, Springer, 1989 P. Prusinkiewicz, A. Lindenmayer, The Algorithmic Beau

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论