




Web Log Pre-processing

Abstract

Over the past decade, the rapid growth of the Internet, and in particular the arrival of the Web 2.0 era and of browser/server (B/S) applications, has brought blogs, virtual communities, online office tools, e-commerce, e-government, B2B, C2C and other Web applications. The Web has become one of the core elements of human life and work. How can the value of a Web site be increased? How can users be given a better experience and helped to find the information they need quickly? How can e-commerce applications become more competitive and survive the fierce competition on the Internet? The answers to these questions have to be found in the vast amounts of Web data. The combination of data mining technology and Internet applications has therefore produced a very active and important field of study: Web mining.

Every Web server keeps an access log file with a similar structure and content, so Web logs have naturally become an important data source for Web mining, and mining them has general and practical significance. However, Web log data is large in volume and contains a lot of noise, so it is not directly suitable for Web mining and must be pre-processed first. Data pre-processing accounts for more than 50% of the total workload.
This paper introduces Web logs, the methods of log pre-processing, and the usual algorithms for finding maximal forward paths and frequent traversal paths. On this basis, a simple analysis is carried out on Web log data from turkuamk.fi: the ECLF format logs are parsed, the data is pre-processed, a Web log data warehouse is established, and the maximal forward paths and frequent paths are obtained.

Keywords: data mining, data pre-processing, Web logs, Web mining

CHAPTER 1 INTRODUCTION
CHAPTER 2 WEB LOG PRE-PROCESSING
2.1 Web log format
2.2 Data cleaning
2.3 User identification
2.4 Session identification
2.5 Path completion
2.6 Transaction identification
2.7 Maximal forward paths
2.8 Frequent access paths
2.9 Web log pre-processing
2.9.1 The credibility of the log
2.9.2 Dynamic Web sites
2.9.3 Other problems
CHAPTER 3 IMPLEMENTATION OF A PRE-PROCESSING SYSTEM
3.1 The user click event model
3.2 Creating a data warehouse
3.2.1 Introduction to the data tables
3.2.2 The data loading process
3.3 Pre-processing based on the data warehouse
3.3.1 The maximal forward path algorithm
3.3.2 Finding frequent traversal paths
3.3.3 Example
3.4 Analysis of the experimental data
3.4.1 Frequent access
3.4.2 The maximal forward path
3.4.3 Frequent traversal paths
CHAPTER 4 CONCLUSIONS
ACKNOWLEDGEMENTS
REFERENCES
APPENDIX

INTRODUCTION

Over the past decade, the rapid growth of the Internet, and in particular the arrival of the Web 2.0 era and of browser/server (B/S) applications, has brought blogs, virtual communities, online office tools, e-commerce, e-government, B2B, C2C and other Web applications. The Web has become one of the core elements of human life and work. How can the value of a Web site be increased? How can users be given a better experience and helped to find the information they need quickly? How can e-commerce applications become more competitive and survive the fierce competition on the Internet? The answers to these questions have to be found in the vast amounts of Web data.
The combination of data mining technology and Internet applications has therefore produced a very active and important research area: Web mining.

Web applications differ from one another, but every Web server keeps an access log file with a similar structure, so mining it has general and practical significance. Unless otherwise stated, the Web logs mentioned in this paper are the access logs on the Web server side.

Of course, when mining a specific Web application, the best and most accurate method is to take the needs of Web mining into account while building the application and to have the application record the useful information in a custom-format log. This method is not generally applicable, however, so it is not discussed here.

By mining Web server log files, it is possible to analyse the pages accessed by users and user groups, to cluster pages and users by similarity, and to find frequent access paths. The results can be used to optimize access paths, improve the site topology and provide personalized services. Mining Web log files for user interests and access patterns makes it possible to highlight, for each user, the content and related links that the user is interested in. Analysing the regularities in the access log reveals the interests of users and potential users, which helps the site adjust its marketing strategy continually and gain a greater competitive advantage. Web log mining is therefore of great significance, specifically in the following respects.

Personalized service

Personalized service finds a user's interests from that single user's browsing history in order to provide a personalized interface to each user.
A simple process based on Web log mining is as follows: the current user's session is matched against the usage patterns obtained through Web log mining to determine the user's current interests, and a set of links, advertisements, products or other services that may interest the user is then recommended on that basis.

System improvement

Web server logs were originally designed to provide statistics to Web site and system administrators for site management. Analysing the logs helps in studying Web caching, network transmission, load balancing and data distribution strategies, and leads to conclusions for improving the performance of the Web system. Web traffic behaviour can be analysed and applied to Web caching to balance access, reduce congestion and optimize transmission. In addition, analysing unusual large-scale traffic and frequent access errors helps to prevent intrusion into and spoofing of the Web site and to remove invalid links.

Web site structure design

Web log mining provides Web site designers with detailed user feedback. It helps them adjust the topology and content of the Web site according to how users actually browse it, and optimize the site in order to serve users better.

Support for business decision-making

For an e-commerce site, analysing the log reveals users' buying trends and supports the study of user psychology and business decisions. Analysing the source (referrer) URLs makes it possible to adjust investment and effectively increase site traffic.

Search engine optimization

Analysing the behaviour of Web crawlers in the Web log makes it possible to adjust the structure of the site for search engine optimization, so that the site is indexed better by search engines.

Data mining requires data pre-processing because real-world data is mostly incomplete, noisy and inconsistent, and comes in a variety of formats.
For data mining algorithms, incorrect input data may lead to wrong or inaccurate mining results. At the same time, data mining algorithms usually handle data in a fixed format, whereas real data exists in a wide range of formats, so the data has to be processed into a form the algorithms can use. A data mining algorithm may also mine only part of the data in a database, in which case the useful data has to be extracted. How to repair incomplete and inconsistent real-world data, how to remove noisy data, how to convert existing data into a format usable by data mining algorithms, how to extract useful data, and how to integrate multiple data sources are the tasks that data pre-processing has to complete.

Data pre-processing accounts for 50% of the work in the entire data mining process. Its results are the input of the data mining algorithms and directly affect the quality of the mining, so data pre-processing holds an important place in data mining research. Researchers have proposed many effective data pre-processing techniques. The common ones are data cleaning, which removes noise from the data and corrects inconsistencies; data integration, which merges multiple data sources into one consistent data store; and data transformation and data reduction, which compress the data by aggregation, removal of redundant features, or clustering. Applying data pre-processing techniques before data mining can greatly improve the quality of the data mining model and reduce the time and disk space required for the actual mining. In other words, pre-processing improves the quality of the data, which helps to improve the accuracy and performance of the subsequent mining process. Quality decisions must rest on quality data, so data pre-processing is an important step in the knowledge discovery process.
Detecting abnormal data, correcting the data as early as possible and reducing the data all pay off well in the data mining process.

Web log pre-processing is the process of cleaning, filtering and recombining Web logs for Web log mining. Its purpose is to remove the attributes and data that are useless for Web log mining and to convert the Web log data into a form that the mining algorithms can recognize.

Web log pre-processing consists of several steps: data cleaning, user identification, session identification, path completion and transaction identification.

Chapter 2 of this paper introduces the general approach to the major steps of log pre-processing and the problems that may appear during it and, on the basis of log pre-processing, introduces algorithms for computing maximal forward paths and frequent access paths. Chapter 3 presents the pre-processing of an actual Web log: a data warehouse is set up first, the log is loaded into the database, and the cleaning and analysis work is then carried out. Chapter 3 also proposes a user click event model, together with algorithms for maximal forward paths and frequent access paths based on this model.

WEB LOG PRE-PROCESSING

Web log pre-processing consists of data cleaning, user identification, session identification, path completion and transaction identification. For transaction identification, we give an algorithm for maximal forward references.
Frequent access paths (large reference sequences) are a measure frequently used in Web mining and are built on maximal forward references. After introducing the general pre-processing steps, we give the general algorithm for finding them.

Web log format

The most commonly used Web server software records its log files in one of three open log file formats: the NCSA Common Log Format (CLF), the NCSA Extended Common Log Format (ECLF), and the W3C Extended Log File Format (ExLF) [1]. The Extended Common Log Format is shown in Table 1. If the data for a field is unavailable, the Web server writes a dash (-) in that field.

Table 1 Extended Common Log Format (ECLF)

Field                        Description
Remote host (remote host)    Host name of the user submitting the request; in general the IP address is recorded.
rfc931                       Remote login name of the user on a multi-user system, as identified by the system; it almost always contains -.
Authorized user (authuser)   User name used for HTTP user authentication.
Date (date)                  Date and time of the request.
Request (request)            Request line of the HTTP request from the client. If the requested file exists, this field gives the URL of the requested file and the method used to access it.
Status (status)              Status code, which indicates whether the file was requested successfully.
Bytes (bytes)                Number of bytes actually transferred, excluding the HTTP header.
Referrer (referrer)          URL of the page containing the link that was clicked to reach this page; if there is no such link, - is stored. The data is extracted from the Referer field of the HTTP header, which is sent along with the page request.
User agent (agent)           Name and version of the requesting browser, taken from the User-Agent field of the HTTP header.

The status field holds status codes in five categories:

(1) Status codes beginning with 1 are informational; server administrators and developers can use them to provide information. 100: Continue; 101: Switching Protocols.
(2) Status codes beginning with 2 indicate success. 200: the operation was successful.
(3) Status codes beginning with 3 indicate redirection: the requested resource exists at another URL.
(4) Status codes beginning with 4 indicate an error. The most common is 404: File not found.
(5) Status codes beginning with 5 indicate that the Web server cannot fulfil the request, because of network problems or a problem on the server itself. 500: Internal Server Error.

The referrer and user agent fields are what ECLF adds compared with CLF. The ExLF format is not used in this paper and is not described here.

The request field contains the request method and the URL of the requested resource. The request methods are OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE and CONNECT. We are concerned with the GET method, which retrieves the resource identified by the URL.

The agent field can be used to identify the browser, so that a Web program can optimize its output for different browsers. Some robots can also be identified through this field, for example YahooSeeker/1.2 (compatible; Mozilla 4.0; MSIE 5.5; yahooseeker at yahoo-inc dot com;/help/us/shop/merchant/).

The following is a typical segment of Web log data (ECLF):

Log data here

Data cleaning

A Web log contains information about every HTTP request, but not every record is meaningful for mining. For example, when a user requests a page, the browser also downloads pictures, video, CSS files and JS files while the page is being viewed. Data cleaning removes these requests and reduces the amount of data.

Data cleaning can be carried out in the following respects:

1 URL: For a general site, only HTML files are relevant to user sessions, and files with suffixes such as gif, jpg and js can be filtered out of the log. For special sites, such as a photo site, the filter can be reset accordingly. For some dynamic Web sites, where all content is generated dynamically, the filtering rules must be adjusted to the application.
2 Requested action: Only GET actions need to be kept.
3 Return status: Only the records of successful requests need to be kept; records that return error codes such as 404 and 501 are removed. The error records are very important for site maintenance and security analysis, so a system administrator should analyse them.
4 Requesting IP: To remove accesses from Web robots, a filter list of robot IP addresses can be created.

In our experiment, cleaning 550.8 MB (3,078,210 rows) of log files left 44.4 MB (265,016 rows), which greatly reduces the data volume.

As a further clean-up step, the data must be unified, because different URLs may point to the same page (one physical host may correspond to more than one domain name), for example:

links here

User identification

A user is an individual who accesses one or more servers through a browser. Because of caches, firewalls and proxy servers, it is in practice very difficult to identify a user uniquely. The information in the log that can be used to identify a user includes the IP address, session cookies, the browser and the operating system.

Because multiple users may access the site through a proxy, a single IP address can correspond to multiple users, so users are difficult to distinguish by IP address alone.

Using the IP address together with the browser and operating system as the user ID also has its difficulties. Users' operating systems and browsers are relatively concentrated, so a large number of users sharing the same IP address cannot be distinguished.

Session cookies write a unique identity to each user. This raises user privacy issues, and the user's browser may simply not support cookies, or the user may delete or modify them, so cookies cannot be fully relied on.
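The ECLF fields and the cleaning rules described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in this paper: the regular expression, the function names `parse_line`, `keep_record` and `clean_log`, the robot address in `ROBOT_IPS` and the sample log lines are all invented for the example, and treating only 2xx status codes as successful is an assumption.

```python
import re
from typing import Optional

# One ECLF line: host rfc931 authuser [date] "request" status bytes "referrer" "agent"
ECLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Suffixes of embedded resources to drop (adjust per site, e.g. for a photo site).
IGNORED_SUFFIXES = (".gif", ".jpg", ".js", ".css")
# Hypothetical robot filter list; a real list would be maintained separately.
ROBOT_IPS = {"192.0.2.10"}


def parse_line(line: str) -> Optional[dict]:
    """Split one ECLF line into its fields, or return None if it does not match."""
    match = ECLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    parts = record["request"].split()
    record["method"] = parts[0] if parts else ""
    record["url"] = parts[1] if len(parts) > 1 else ""
    record["status"] = int(record["status"])
    return record


def keep_record(record: dict) -> bool:
    """Apply the four cleaning rules: URL suffix, method, status, robot IP."""
    path = record["url"].split("?", 1)[0].lower()
    if path.endswith(IGNORED_SUFFIXES):
        return False
    if record["method"] != "GET":
        return False
    if not 200 <= record["status"] < 300:  # assumption: only 2xx counts as success
        return False
    return record["host"] not in ROBOT_IPS


def clean_log(lines):
    """Parse all lines and keep only the records that survive cleaning."""
    parsed = (parse_line(line) for line in lines)
    return [rec for rec in parsed if rec is not None and keep_record(rec)]


# Invented sample lines, for illustration only.
SAMPLE_LINES = [
    '192.0.2.1 - - [10/Oct/2011:13:55:36 +0300] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '192.0.2.1 - - [10/Oct/2011:13:55:37 +0300] "GET /logo.gif HTTP/1.1" 200 512 "http://example.org/index.html" "Mozilla/5.0"',
    '192.0.2.2 - - [10/Oct/2011:13:56:01 +0300] "GET /missing.html HTTP/1.1" 404 209 "-" "Mozilla/5.0"',
    '192.0.2.3 - - [10/Oct/2011:13:56:30 +0300] "POST /form HTTP/1.1" 200 10 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [10/Oct/2011:13:57:00 +0300] "GET /index.html HTTP/1.1" 200 2326 "-" "ExampleBot/1.0"',
]

if __name__ == "__main__":
    for rec in clean_log(SAMPLE_LINES):
        print(rec["host"], rec["method"], rec["url"], rec["status"])
```

Keeping the rules in one small predicate makes it easy to adjust them for dynamic sites, where suffix filtering alone is not enough.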
To identify users accurately, session information can be kept on the server side, including the session ID, the user name of a registered user and the pages visited. Some Web servers, such as Apache, can record cookies through additional modules; otherwise the support of the Web application is needed, so the approach is not very general.

This paper discusses general log pre-processing and, because of the limitations of the data, identifies users only by heuristic rules. If the IP addresses differ, the requests come from different users. If the IP address is the same but the agent differs, they are also treated as different users. If both are the same, we check whether the requested page can be reached from the pages already visited; if it cannot, the request is treated as coming from a new user.

Session identification

Session identification divides a user's access records into individual sessions. A timeout mechanism is usually used to divide sessions: if the time between two page requests exceeds a certain limit, the user is considered to have started a new session.

Many Web applications use 30 minutes as the default timeout (PHP, for example, has its own default session timeout value). In addition to this default, an empirical value of 25.5 minutes has been obtained from experiments.

Path completion

Path completion makes up for the omissions caused by caching of the requeste
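The heuristic user identification and the timeout-based session split described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the method implemented in this paper: the function name `split_sessions`, the record layout and the sample data are hypothetical, a user is approximated by the (IP address, agent) pair only, and the third rule (reachability of the requested page from the visit history) is left out because it needs the site topology.

```python
from datetime import datetime, timedelta


def split_sessions(records, timeout=timedelta(minutes=30)):
    """Group requests into sessions.

    Each record is a dict with 'host', 'agent', 'time' (datetime) and 'url'.
    A user is approximated by the (host, agent) pair; a new session starts
    when the gap between two requests of that user exceeds the timeout.
    Returns a list of (user, [urls]) tuples in order of session start.
    """
    sessions = []
    current = {}  # user -> (index of the open session, time of last request)
    for rec in sorted(records, key=lambda r: r["time"]):
        user = (rec["host"], rec["agent"])
        state = current.get(user)
        if state is None or rec["time"] - state[1] > timeout:
            sessions.append((user, []))  # first request or timeout: new session
            index = len(sessions) - 1
        else:
            index = state[0]
        sessions[index][1].append(rec["url"])
        current[user] = (index, rec["time"])
    return sessions


# Invented sample requests, for illustration only.
T0 = datetime(2011, 10, 10, 13, 0)
SAMPLE_RECORDS = [
    {"host": "192.0.2.1", "agent": "BrowserX", "time": T0, "url": "/a"},
    {"host": "192.0.2.2", "agent": "BrowserX", "time": T0 + timedelta(minutes=5), "url": "/a"},
    {"host": "192.0.2.1", "agent": "BrowserX", "time": T0 + timedelta(minutes=10), "url": "/b"},
    {"host": "192.0.2.1", "agent": "BrowserY", "time": T0 + timedelta(minutes=12), "url": "/a"},
    {"host": "192.0.2.1", "agent": "BrowserX", "time": T0 + timedelta(minutes=50), "url": "/c"},
]

if __name__ == "__main__":
    for user, urls in split_sessions(SAMPLE_RECORDS):
        print(user, urls)
```

The timeout is a parameter, so the 25.5 minute empirical value can be tried by passing `timeout=timedelta(minutes=25.5)`.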