An Improved Web Page Watermarking

Zhu Ping1, Ding Wei2, Lu Ming2
1. College of Information and Security Engineering, Wuhan, 430070
2. College of Computer Science and Technology, Wuhan, 430070

Abstract: Web page watermarking is a branch of text watermarking research, and embedding a watermark into a web page is relatively difficult. Taking the particular characteristics of web pages into account, this paper proposes an improved web page watermarking scheme that protects the integrity of individual web pages against tampering and also protects the copyright, integrity, and consistency of the whole web site. The scheme can be used for copyright protection, can test whether a web page has been tampered with, and can locate the tampering.

Keywords: text digital watermarking, web page watermarking, fragile watermarking, robust watermarking

1. Introduction

Digital watermarking technology, whose basic idea derives from early steganography, secretly embeds specific markers into digital content. Such markers are usually invisible and can be seen only through a special detector or reader. According to the type of carrier, digital watermarking can be divided into text digital watermarking, image watermarking, and video watermarking. Text usually consists of words, sentences, paragraphs, punctuation, and other regular structures. Because text carries little redundant information, it is not easy to embed a watermark into it without the watermark being discovered by attackers. Common text watermarking techniques include line-shift coding, word-shift coding, and feature coding. The web page watermarking in this paper is a form of text digital watermarking.

A web page differs from an ordinary plain-text document. An HTML document is an unformatted text file made up of tags and web page content.
Tags, which are not case-sensitive, control the format and display of the web page content and can be divided into single tags and paired tags. A single tag is used alone, in the form <tag>; a paired tag consists of a start tag and an end tag, in the form <tag>web page content</tag>. At present, schemes for embedding a text digital watermark into a web page are based on the characteristics of HTML syntax and tags. They include methods based on invisible characters (for example, web browsers ignore extra tab and space characters in an HTML document), methods based on the case-insensitivity of HTML tags, and methods based on the order of tag attributes.

2. Word Watermarking

Word watermarking generates a hash string for every word, accumulates the ASCII values of all the characters in that string, and uses the sum as the seed of a pseudo-random function to generate the 0-1 code of the word.

In a web page, all the words in the body text between the HTML tags are extracted and a watermark is generated for each word. First, each word is hashed with the SHA-1 secure hash algorithm (with key key1) to produce a 160-bit binary sequence, that is, a hexadecimal string of length 40 over the characters 0 to f. Then the ASCII values of the characters in the hexadecimal string are accumulated to obtain a sum. Finally, the sum is used as the seed of a pseudo-random sequence algorithm to generate a six-bit 0-1 sequence, which is the word watermark of the word. These 0-1 sequences are then converted into the corresponding "space-tab" sequences and embedded into the web page, where the browser does not display them.
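The word watermark steps described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper does not say how key1 enters the hash (here it is simply prepended to the word) or which pseudo-random generator is used (here java.util.Random), and the class and method names are hypothetical. The mapping of "0" to a space and "1" to a tab follows the extraction rule given in Section 4.3.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Random;

class WordWatermark {
    /** SHA-1 of (key + word) as a 40-character lowercase hex string.
     *  Prepending the key is an assumption; the paper only names key1. */
    static String sha1Hex(String word, String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest((key + word).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 must be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    /** Sum of the ASCII codes of all characters in the string. */
    static int asciiSum(String s) {
        int sum = 0;
        for (char c : s.toCharArray()) {
            sum += c;
        }
        return sum;
    }

    /** Six-bit 0-1 watermark: a PRNG seeded with the ASCII sum of the hash string.
     *  java.util.Random stands in for the unspecified generator. */
    static String wordWatermark(String word, String key) {
        Random rng = new Random(asciiSum(sha1Hex(word, key)));
        StringBuilder bits = new StringBuilder();
        for (int i = 0; i < 6; i++) {
            bits.append(rng.nextInt(2));
        }
        return bits.toString();
    }

    /** Convert a 0-1 string to its "space-tab" form: '0' -> space, '1' -> tab. */
    static String toWhitespace(String bits) {
        return bits.replace('0', ' ').replace('1', '\t');
    }
}
```

For a given word and key, wordWatermark always returns the same six bits, which is what allows the detector to regenerate the watermark later and compare it with the extracted one.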
The generation process is shown in Figure 1 and expressed by formula (1):

$$WW(w_i) = \mathrm{random}(\mathrm{sum}(\mathrm{hash}(w_i, key_1))), \quad i = 1, 2, \ldots, m \qquad (1)$$

Figure 1. The word watermarking generation process of the web page (each word is hashed with SHA-1 under key1 into a hex string c1, c2, c3, ...; the ASCII code values of the characters are accumulated; the accumulated sum seeds the pseudo-random sequence generation algorithm random, which outputs the six-bit 0-1 word watermark).

Here $w_i$ is the i-th word in the body of the web page, $WW(w_i)$ is its watermark, $key_1$ is the key of the hash algorithm, m is the total number of words in the body, and sum is the accumulation step.

3. Line Watermarking

Line watermarking generates a character string for each line with the hash algorithm, accumulates the ASCII values of all the characters in that string, and uses the sum as the seed of the pseudo-random function to generate the 0-1 code of the line.

The line watermark of a web page is generated as follows. First, all the words in the line are extracted and their word watermarks are generated by the method above. These word watermarks are then combined, which yields the line watermark of the line. The process is shown in Figure 2 and expressed by formula (2):

$$LW(l_i) = WW(w_{i1}) \circ WW(w_{i2}) \circ \cdots \circ WW(w_{ij}) \circ \cdots \circ WW(w_{in}), \quad i = 1, 2, \ldots, m \qquad (2)$$

Here $l_i$ is the i-th line, $LW(l_i)$ is the line watermark of that line, $WW(w_{ij})$ is the word watermark of the j-th word in the i-th line, and $\circ$ denotes the operation used to combine the word watermarks of a line. m is the total number of lines and n is the total number of words in the line.

Figure 2. The line watermarking generation process of the web page

4. Improved Watermarking Algorithm

4.1 Generation of the Watermark

The applications and purposes of text digital watermarking differ from one web page to another, so the requirements on the robust watermark differ, and so does the procedure for generating the watermark to be embedded. Since the word watermark and the line watermark protect the web page from tampering, they must be closely tied to the specific content of the page and are embedded at random positions in it. The watermark distributed over the pages of the web site's navigation is used to identify the copyright, so it is generated from the owner information of the web page, a serial number, an icon, and so on. For ease of description, only English web pages are considered; the scheme can easily be applied to web pages in other character sets.

The watermark generation process is as follows. Every character in Java can be expressed in binary form, and the binary sequence can be obtained with the String method getBytes. To embed the watermark that proves copyright ownership, the author information or the serial number is converted to bytes and then encoded into a binary sequence, which gives the watermark sequence $W_m = \{w_i\}$ (i = 1, 2, ..., m). This sequence is then encrypted with the key sequence $Key_2 = \{k_j\}$ (j = 1, 2, ..., n), which gives the encrypted watermark $W'_m = \{w'_i\}$ (i = 1, 2, ..., m). The operation is expressed by formula (3):

$$w'_i = w_i \oplus k_j, \quad i = 1, \ldots, m; \quad j = i \bmod n \qquad (3)$$

Here m is the length of the watermark sequence and n is the length of the key.
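The encryption step of formula (3) can be illustrated with the short Java sketch below. It assumes the combining operation is bitwise XOR, which is consistent with Section 4.3 recovering the plaintext from the same formula and the same key; it also assumes UTF-8 bytes expanded most significant bit first and 0-based indices. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class CopyrightWatermark {
    /** Expand text into a 0-1 array: 8 bits per UTF-8 byte, most significant bit first. */
    static int[] toBits(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        int[] bits = new int[bytes.length * 8];
        for (int i = 0; i < bits.length; i++) {
            bits[i] = (bytes[i / 8] >> (7 - i % 8)) & 1;
        }
        return bits;
    }

    /** Rebuild the text from a 0-1 array produced by toBits. */
    static String fromBits(int[] bits) {
        byte[] bytes = new byte[bits.length / 8];
        for (int i = 0; i < bytes.length * 8; i++) {
            bytes[i / 8] |= bits[i] << (7 - i % 8);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** XOR every watermark bit with the key bit at index i mod n (key reused cyclically).
     *  XOR is an assumption; applying the method twice restores the input. */
    static int[] xorWithKey(int[] bits, int[] key) {
        int[] out = new int[bits.length];
        for (int i = 0; i < bits.length; i++) {
            out[i] = bits[i] ^ key[i % key.length];
        }
        return out;
    }
}
```

Because XOR is its own inverse, calling xorWithKey on the encrypted sequence with the same key returns the original bits, and fromBits then yields the author information or serial number again.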
The key is applied cyclically when the watermark sequence is longer than the key.

4.2 Embedding of the Watermark

The watermarks in a web page serve different functions, so the embedding procedures, the embedding methods, and the choice of embedding points differ for each kind of watermark. Across the different types of watermark, a "0" in the watermark sequence is represented by a space or by a tag letter that stays in lowercase, and a "1" is represented by a tab or by a tag letter that is changed to uppercase. The specific embedding processes are as follows.

1. Word watermark embedding

Suppose there are m words in the body of a web page, which yield m word watermarks. Since the words of an English web page are separated by spaces and the browser ignores extra spaces or tabs between words, watermark information can be embedded after each word, which gives m positions. The initial position sequence is 1, 2, ..., m. To improve invisibility, so that no real information can be recovered even if the watermark is illegally extracted, the watermark of a word is not embedded after that word itself. Instead, a new position sequence $l_1, l_2, \ldots, l_m$ is generated by a pseudo-random number algorithm (a shuffling algorithm) controlled by key3. The word watermark $WW(w_i)$ is converted into a "space-tab" sequence and embedded at its new position $l_i$; that is, the watermark generated from the i-th word is embedded after the $l_i$-th word.

2. Line watermark embedding

Suppose there are n lines in the body of a web page, which yield n line watermarks under the line watermark generation procedure.
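The key-controlled reordering of embedding positions described for the word watermarks can be sketched as follows. The paper does not name the shuffling algorithm or the form of key3, so this sketch assumes key3 is a long seed and uses Collections.shuffle with a seeded java.util.Random as a stand-in; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class PositionShuffle {
    /** Key-controlled permutation l_1..l_m of the positions 1..m.
     *  The same key always produces the same permutation, so the detector
     *  can recompute where each watermark was embedded. */
    static List<Integer> shuffledPositions(int m, long key3) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 1; i <= m; i++) {
            positions.add(i);
        }
        // Seeded shuffle: an assumption standing in for the paper's algorithm
        Collections.shuffle(positions, new Random(key3));
        return positions;
    }
}
```

With this sketch, the watermark of the i-th word would be embedded after word number shuffledPositions(m, key3).get(i - 1). Note that java.util.Random is not a cryptographically secure generator; a real deployment would likely want a keyed generator with stronger guarantees.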
Likewise, a new position sequence is generated for them by the pseudo-random number algorithm controlled by key4, and each line watermark is inserted at its new position. The details are as follows.

If the encoded length of the line watermark is greater than the number of letters in the HTML tags of the line, part of the watermark is embedded into the tags by changing the case of the tag letters. The excess watermark code is converted into a "space-tab" sequence and embedded after the line.

If the encoded length of the line watermark is less than the number of letters in the HTML tags of the line, the watermark is embedded into the tags cyclically.

If the encoded length of the line watermark is equal to the number of letters in the HTML tags of the line, all the watermark information is embedded into the tags.

3. Watermark embedding in the navigation pages

For English web pages encoded in UTF-8, an English letter corresponds to an 8-bit binary sequence and a Chinese character to a 24-bit binary sequence. Suppose there are n navigation pages and the identification information used to generate the watermark contains m English letters. The generated and encrypted watermark is then 8m bits long and should be distributed evenly over the n web pages. The lengths of the partial watermarks embedded in the n pages are therefore $(\lfloor 8m/n \rfloor, \ldots, \lfloor 8m/n \rfloor, \lfloor 8m/n \rfloor + 8m \bmod n)$.

First, the HTML document is preprocessed. All the English letters in the tags (excluding the attribute values) are set to lowercase, and the number of tag letters between the start tag and the end tag of the document is counted; this tag-letter count is denoted n.
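The division of the 8m-bit watermark over the n navigation pages can be made concrete with a small Java sketch. It assumes integer division for 8m/n, with the remainder added to the last page as in the length tuple above; the class and method names are hypothetical.

```java
import java.util.Arrays;

class WatermarkPartition {
    /** Part lengths for totalBits spread over the pages:
     *  floor(totalBits / pages) each, with the remainder added to the last page. */
    static int[] partLengths(int totalBits, int pages) {
        int[] lengths = new int[pages];
        Arrays.fill(lengths, totalBits / pages);
        lengths[pages - 1] += totalBits % pages;
        return lengths;
    }

    /** Cut a 0-1 string into consecutive parts with the lengths above. */
    static String[] split(String bits, int pages) {
        int[] lengths = partLengths(bits.length(), pages);
        String[] parts = new String[pages];
        int start = 0;
        for (int p = 0; p < pages; p++) {
            parts[p] = bits.substring(start, start + lengths[p]);
            start += lengths[p];
        }
        return parts;
    }
}
```

For example, identification information of m = 5 letters gives 40 bits; over n = 3 navigation pages the part lengths are 13, 13, and 14. Concatenating the parts in page order restores the full sequence, which is the merge step used during copyright verification.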
If the watermark sequence to be embedded in the web page is longer than the tag-letter count n, a meaningless HTML tag is appended before the end tag so that the page can hold all of its watermark information. The carrier document can then contain the whole watermark with little impact on its visual appearance. If the watermark sequence is shorter than or equal to n, it is embedded into the web page cyclically until the end tag is reached.

Next, the preprocessed HTML document is traversed from the start tag to the end tag. Whenever the pointer is on a letter that belongs to an HTML tag, the case of the letter is set according to the encrypted watermark sequence $w'_i$ (i = 1, 2, ..., m) computed by formula (3).

In summary, to improve invisibility, the embedding positions of the word watermarks and the line watermarks are first shuffled by the pseudo-random number sequence generation algorithm. The HTML document is then traversed and each word watermark is embedded after its shuffled position, using the English words stored in a HashMap and the word watermark key. Next, the line watermarks stored in the array line[line_num] are embedded at their shuffled positions. Finally, for the watermark in a navigation page, the HTML document is traversed from the start tag together with the binary watermark sequence $W_m = \{w_i\}$ (i = 1, 2, ..., m). Let i be the pointer into the watermark sequence. When the document pointer is on a letter that is part of an HTML tag name or attribute name, the letter is changed to uppercase if $w_i = 1$ and left unchanged if $w_i = 0$. This continues until i > m.

4.3 Testing of the Watermark

Key3 and key4 are needed to determine the embedding positions of the word watermarks and line watermarks in order to check the integrity of the web page content. For an ordinary web page under test, new word watermarks and line watermarks are regenerated by the watermark generation algorithm. At the same time, the embedding position sequence is computed from the keys, and the word watermarks and line watermarks are extracted from the corresponding positions in the web page. The extraction rules are: a space after a word or line gives "0", a tab after a word or line gives "1", a lowercase letter in an HTML tag gives "0", and an uppercase letter gives "1". Finally, the word watermarks and line watermarks extracted from the page under test are compared with the regenerated ones. If they are identical, the page has not been tampered with; otherwise it has.

To prove the copyright of the digital work, the web pages behind the navigation bar of the web site are processed as follows. First, the length of the watermark embedded in each navigation page is computed in the same way as in the embedding algorithm. Next, the HTML document is traversed character by character from the start tag to the end tag. If the pointer is on a letter that belongs to a tag name, "0" is extracted when the letter is lowercase and "1" when it is uppercase, until the length of the extracted binary sequence equals the length of the watermark embedded in that page. Finally, the partial watermarks extracted from all the web pages behind the navigation bar are merged, which yields the encrypted watermark sequence.
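The case-based embedding and extraction in tag letters described above can be sketched in Java as follows. This is a simplified illustration with hypothetical names: it uses only the letters of tag names as carriers (the paper also uses attribute names), relies on a naive scan rather than a real HTML parser, and throws an exception when the page has too few carrier letters instead of appending a meaningless tag.

```java
import java.util.ArrayList;
import java.util.List;

class TagCaseWatermark {
    /** Indices of the letters that make up tag names, e.g. "body" in <body> and </body>. */
    static List<Integer> carrierIndices(String html) {
        List<Integer> carriers = new ArrayList<>();
        boolean inName = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inName = true;                 // a tag name may start here
            } else if (inName && c == '/' && html.charAt(i - 1) == '<') {
                // slash of an end tag: the tag name follows
            } else if (inName && Character.isLetter(c)) {
                carriers.add(i);
            } else {
                inName = false;                // anything else ends the tag name
            }
        }
        return carriers;
    }

    /** Embed bits cyclically over all carriers: '1' -> uppercase, '0' -> lowercase. */
    static String embed(String html, String bits) {
        List<Integer> carriers = carrierIndices(html);
        if (bits.length() > carriers.size()) {
            throw new IllegalArgumentException("not enough tag letters for the watermark");
        }
        char[] out = html.toCharArray();
        for (int k = 0; k < carriers.size(); k++) {
            int idx = carriers.get(k);
            boolean one = bits.charAt(k % bits.length()) == '1';
            out[idx] = one ? Character.toUpperCase(out[idx]) : Character.toLowerCase(out[idx]);
        }
        return new String(out);
    }

    /** Read back the first `length` bits: uppercase -> '1', lowercase -> '0'. */
    static String extract(String html, int length) {
        List<Integer> carriers = carrierIndices(html);
        StringBuilder bits = new StringBuilder();
        for (int k = 0; k < length; k++) {
            bits.append(Character.isUpperCase(html.charAt(carriers.get(k))) ? '1' : '0');
        }
        return bits.toString();
    }
}
```

As a usage example, embedding the bits 1011 into <html><body>Hi</body></html> yields <HtML><BoDY>Hi</BoDY></HtML>, and extracting four bits from that result returns 1011. The page renders the same because HTML tag names are not case-sensitive.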
The plaintext binary sequence is then recovered according to formula (3) and the key sequence $Key_2 = \{k_j\}$ (j = 1, 2, ..., n). The effective information used to generate the watermark (such as the author information of the web page or the serial number) is reconstructed with the constructor new String(byte[]) to prove the copyright.

5. Experimental Analysis

Following the detailed description of the web page watermarking algorithm, code was written to implement the related functions and verify the correctness of the algorithm. Figure 3 is a screenshot of the source code of the original web page; Figure 4 is a screenshot of the source code with the fragile watermark embedded; Figure 5 is the source code of the tampered web page.

Figure 3. The source code of the original web page

Figure 4. The source code into which the fragile watermarking is embedded

Figure 5. The partial source code of the tampered web page

Web pages may suffer from three kinds of attack:
One, the content of the web page is tampered with, but the watermark is intact;
Two, the content of the web page is tampered with and the watermark is damaged;
Three, the content of the web page is tampered with and a forged watermark is embedded.

In each case, tampering is detected by comparing the regenerated watermark with the extracted one. The attackers know neither the key used to generate the watermark nor the key used to generate the embedding positions, so a forged watermark differs from the real watermark code and is not embedded at the real positions.

The robust web page watermarking scheme embeds the watermark information evenly across the navigation pages of the web site. When the copyright has to be proved, the partial watermarks are extracted from these navigation pages and combined into the complete copyright information to identify the ownership.
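The comparison between regenerated and extracted fragile watermarks, which detects all three kinds of attack, can be sketched as follows. The names are hypothetical, and the sketch assumes the watermarks have already been regenerated and the whitespace runs already read from their key-determined positions; it only shows the decoding of a "space-tab" run and the position-by-position comparison used to locate tampering.

```java
import java.util.ArrayList;
import java.util.List;

class TamperCheck {
    /** Decode a run of spaces and tabs: space -> '0', tab -> '1'. */
    static String bitsFromWhitespace(String run) {
        StringBuilder bits = new StringBuilder();
        for (char c : run.toCharArray()) {
            if (c == ' ') {
                bits.append('0');
            } else if (c == '\t') {
                bits.append('1');
            }
        }
        return bits.toString();
    }

    /** Indices (0-based) at which the extracted watermark is missing
     *  or differs from the regenerated one. An empty list means no tampering found. */
    static List<Integer> tamperedPositions(List<String> regenerated, List<String> extracted) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < regenerated.size(); i++) {
            boolean missing = i >= extracted.size();
            if (missing || !regenerated.get(i).equals(extracted.get(i))) {
                bad.add(i);
            }
        }
        return bad;
    }
}
```

A non-empty result both signals that the page was modified and points to the words or lines whose watermarks no longer match, which corresponds to the tamper localization mentioned in the abstract.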
Figure 6 shows the source code with the robust watermark embedded.

Figure 6. The source code with the robust watermark embedded
