Seven Layers of Optimization

Going from Detail to Summary

If one looks at the evolution of the data warehouse arena, a clear pattern emerges: projects are compromised in order to overcome technical barriers, at the expense of business value. One such compromise is the automatic use of summary data, or of data models geared towards a single application.

Given the diverse nature of the readers, I will give some brief background in order to level-set the discussion. In the beginning, companies would simply extract data off the operational systems and put it into another database for business users to analyze. Typically the database product used for analysis was the same one used for OLTP (On-Line Transaction Processing); the data was separated so that the longer-running and more complex analysis workloads would not interfere with the more mission-critical OLTP work. It is also important to note that most operational systems store data in a normalized model, that is, one where the data is organized according to rules that eliminate redundancy and simplify data relationships. While good for the OLTP workload, such a model often kept most databases from running the analytical queries well, for reasons discussed below.

As the data volume on the analytical systems grew, queries took longer and longer to run. Two factors came into play here. First, since the analytical systems did not need to support the OLTP workload, many developers started to pre-plan the answer sets. This pre-planning included creating summary tables (data that is pre-aggregated), placing indexes on the system (eliminating the need to scan large volumes of data), denormalizing the model (putting data into one table as opposed to having tables that need to be joined together), and storing the data in sorted order (thus avoiding any sort activity). These are the four areas that most databases have problems with, and all of them require prior knowledge of the analysis and reports being requested.

The second factor was that the people in charge of developing these systems did not have a background in real data warehouse environments. They were trained to optimize performance by manipulating database parameters and data models. In the early days this was deemed acceptable: users previously could get little to no information, so it was considered a step forward.

As the user communities became larger and the queries became more complex, it was noticed that users could no longer define in advance all the analysis they wanted to run, and even if they could, the time to process and store the answer sets was becoming unmanageable. The solution was to once again extract data to other systems that specialized in "slicing and dicing" data. These became known as OLAP (On-Line Analytical Processing) systems.

Today, users are demanding not only more complex and flexible analysis but also more timely information. Data must be available 7x24, and many business users are demanding that data be available within hours, in some cases minutes, of the event occurring. At one retailer there is a goal to have data in the data warehouse within 5 seconds of the register scan! All of these pressures make the process of moving data to different platforms, manipulating and pre-planning answer sets, and having to know the query beforehand a huge barrier to the business moving forward.
Rather than complicating the architecture, developers should be simplifying the systems, with less data movement and propagation. Again, this is where having people trained in the area of data warehousing makes a big difference in how far your solution will take you.

If you have read previous articles from me, you should be aware of my stand on using detail data. For those not in the know: having been trained in data warehousing, I have long been a strong proponent that companies keep detail data in the lowest form possible, in a functionally neutral data model. This enables the business community to ask any number of questions from any angle of the business. The basic premise is that one can always aggregate detail data, but one can never decompose summary data.

Once this position is known, I get two questions. The first is usually "does this mean I should never have summary tables?" Of course the answer to that question is "no". My position is not that there is no value in summary tables, but that one should not replace the detail data with only summary data. It must be an environment of complementary purposes. I know that most people would not believe me anyway if I said you should never have a summary table. One clarification here: I get into the same line of reasoning when discussing normalized versus denormalized (star schema, snowflake, or other) models. The core data model needs to be designed around a business-neutral, detail data structure. The steps below apply to changes in the data model as well as to the level of data summarization.

Once the first question is answered, I always get this next question: "So how do I know when to go to summary tables, and how do I go about that process?" As I have answered this many times, I decided to write this paper. I will attack the process part of the question first and discuss the justification later. To address the question, one must realize that the primary reason you build summary tables and denormalize the data is performance. If you were able to ask any question against any amount of data and get an instant response, you would never build summaries. The bottom line is that you build these structures as a workaround to other limitations. Their very existence requires more space, more data management, and more time between the occurrence of an event and the ability to take action.

The important point to take away is that going from the detail level to a summary level is not a single step, nor is it a process to be undertaken without understanding the consequences. There are many levels and options in between. The best way to illustrate this is by stepping through the process.

The 7 Steps of performance enhancement

1) Detail, Normalized data

The lowest layer is to have the atomic data in a normalized business model. This means that the model is not specific to any function or business group. It also means that the data is at the lowest level of detail necessary to support the business objectives. This may not necessarily be transaction-level data from day one, though it should be able to go down to that level as the business requirements change. When using the detail tables, most companies use "access views" for security and concurrency reasons. These are fine as long as the views are strict reflections of the underlying tables.

With this foundation, the first order of business is to discover whether the desired performance can be obtained by simply running optimized SQL directly against the detail data. Oftentimes it is simply a matter of the tool or user not executing the most efficient statements. This type of arrangement can be seen in many successful accounts, and it can be argued that it is just this arrangement that makes them more successful than others.
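To make step 1 concrete, here is a minimal sketch of an access view, using invented table and column names. The Teradata-style locking modifier is included only because it is a common convention for access views on that platform and should be treated as an assumption; the essential point is that the view is a strict, one-to-one reflection of the underlying detail table, with no filtering, aggregation, or derived columns.

-- Hypothetical detail table exposed through a strict "access view".
-- The view adds nothing beyond access control and (on Teradata) an
-- access-level lock so analytical reads do not block ongoing loads.
CREATE VIEW dw_access.sales_detail_v AS
LOCKING ROW FOR ACCESS
SELECT sale_id,
       sale_ts,
       store_id,
       item_id,
       quantity,
       sale_amount
FROM   dw_core.sales_detail;

Because the view projects every column of the base table unchanged, any query written against it is still, in effect, a query against the detail data.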
2) Views

The first step up the ladder to higher performance is to implement views. Typically views will not greatly affect performance, though in some cases they can make quite a difference. Those cases arise where a third-party tool wants a specific data model; the use of views may cause the tool to generate "better" SQL. You can also incorporate optimized join techniques into the views, rather than requiring every tool and user to be knowledgeable about those techniques. Another benefit of this approach is the added layer of security over the data tables themselves. One particular example of this usage was a telecommunications account that used views to create a "logical star schema" model for use with the Microstrategy tools.

3) Indexes

The next step up the process is to add indexes. These can range from simple techniques, such as secondary indexes, to more complex structures such as covered indexes, join indexes, or aggregate indexes. In the majority of cases this will resolve the performance requirement. The biggest advantage here is that the indexes are maintained at the same time as the base data tables. This means that while there is overhead required to keep and automatically update these indexes, the answer sets will always be in agreement with the detail (because you are essentially going against the detail data). This is a frequent practice against the smaller reference tables; usage on the larger, more frequently updated tables should be carefully considered.
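Step 2's views follow the same CREATE VIEW pattern shown above, so the sketch below illustrates step 3 instead. The object names continue the earlier invented example, and the syntax is Teradata-flavored and offered as an assumption rather than a definitive reference; the point it demonstrates is that the database maintains these structures together with the base table, so the answer sets cannot drift out of agreement with the detail.

-- A secondary index to avoid full-table scans on a common predicate.
CREATE INDEX (store_id) ON dw_core.sales_detail;

-- A Teradata-style aggregate join index: the optimizer can satisfy
-- store/item revenue questions from this structure, and it is updated
-- automatically whenever the detail table changes.
CREATE JOIN INDEX dw_core.sales_by_store_item_ji AS
SELECT store_id,
       item_id,
       SUM(sale_amount) AS total_amount,
       SUM(quantity)    AS total_quantity
FROM   dw_core.sales_detail
GROUP BY store_id, item_id;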
4) System expansion

Now comes the tricky part. If indexes and views do not resolve the performance need, you have some decisions to make. An important one concerns your data management strategy and your longer-term business requirements. If your goal is a real-time data warehouse, you need to understand that adding the "downstream" processes described below will create a barrier to that goal. One option is to increase the system configuration. Of course, this needs to be a cost-justifiable solution; it is important to understand what type of expansion would be required and the business benefit derived from meeting the performance requirement. If expanding the system is not an option, or not feasible, then you can proceed to the following steps. These all require that data be maintained separately from the detail data, and they must be undertaken with full recognition of the cost, disk space, data management, and time lags that will be introduced. One customer had a philosophy of a single store of data and no downstream processing: they decided it was better to get a two-minute response to a query mere hours after a transaction occurred than a sub-second response to a query one week after the transaction occurred.

Please keep in mind that these next steps still assume the data physically resides on the same platform as the detail data. This leverages the capability of the BYNET and Teradata SQL statements.

5) "Rational" summaries and denormalization

Once you have decided to proceed down this road, you have prioritized higher performance over flexibility of analysis. This is not to imply it is a bad decision; rather, it must be made within a strategy of co-existence rather than replacement.

So what does "rational" mean here? In this step you are building summary tables and denormalizations that are not completely tailored to a specific function. For example, you may have a summary table at branch level as opposed to account level: you keep basically the same data structure, but at a higher level of aggregation. The same goes for the denormalization. The classic example is that, according to normalized data modeling, you would never put zip code together with city and state in the same address table; if you have the zip code, you know the city and state via a reference table. Yet we all know that practically any time you ask for the zip code you are also going to ask for the city, so these columns are carried in the same table to avoid the joins.

These are common examples where the step is taken from a generic position of greater good to the user community, not to appease a single functional purpose. In many accounts you can actually see these tables in the original data model, as people assume you must have them in order to run queries. Again, while this may be forgiven in the case of the relatively smaller, and less frequently maintained, reference tables, it becomes more of an issue as the data volume or data volatility increases.
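A minimal sketch of step 5, continuing the invented names from the earlier examples: a store-level daily summary that keeps the same general shape as the detail table, and an address table that deliberately carries city and state alongside the zip code to avoid a join. The CREATE TABLE ... AS ... WITH DATA form is Teradata-flavored and, like the refresh approach mentioned in the comment, is an assumption for illustration.

-- "Rational" summary: the same shape as the detail, just one level up
-- (store by day instead of individual sale), refreshed by a separate
-- batch job after the detail load completes.
CREATE TABLE dw_summary.store_daily_sales AS
( SELECT store_id,
         CAST(sale_ts AS DATE) AS sale_date,
         SUM(quantity)         AS total_quantity,
         SUM(sale_amount)      AS total_amount
  FROM   dw_core.sales_detail
  GROUP  BY 1, 2
) WITH DATA;

-- "Rational" denormalization: city and state are functionally
-- dependent on zip_code, but are carried anyway because nearly every
-- query that asks for the zip code also asks for the city.
CREATE TABLE dw_core.customer_address
( customer_id INTEGER NOT NULL,
  street_line VARCHAR(100),
  zip_code    CHAR(5),
  city        VARCHAR(50),  -- redundant with the zip reference table
  state_code  CHAR(2)       -- redundant with the zip reference table
);

Note that, unlike the indexes in step 3, these objects are not maintained automatically with the detail data; the refresh process is exactly the downstream cost and time lag the text warns about.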
6) "Irrational" summaries and denormalization

If you still do not get the performance that you need, you are starting to run out of options. The "irrational" step means that you are now going to build summaries and data models geared specifically to a single function. The table will have the exact columns you need in the report, and any pretense of flexible analysis is forgotten. These tables will usually require a great deal more time and cost for data management and disk storage.

There are reasons for this type of operation. One account has a very complex process defined to segment a customer and to define a visit; the process has very arcane business relationships and intricate rules, and there is little chance of getting that analysis to run in a timely manner against the normalized detail data.

One situation that occurred in Asia was an account using 30 tables, all of which had the same columns for customer information. When asked why this was done, the response was that 30 different reports were requested and each needed the same customer header information. This is an example of the practice becoming irrational, and it hints at larger foundational problems.

7) Expansion or export of data off the system

If you are still unable to get the desired performance, it is time to make hard decisions. You will either have to expand the system to drive the performance, or export data off the platform to a specialized system for the functional analysis. Both options come at a relatively high cost. As before, you need to understand the business value that can be driven, and the extra data movement and management required to keep the extracted data in agreement with the detail data.

Justifying the Steps

Of course, all of the above actions require that you understand the cost involved with each step and the benefits derived. They also require that you make decisions that support your longer-term goals while at the same time supporting your short-term needs. In many cases you may create summary tables or add denormalized models that will eventually be dropped as the functions evolve over time. This is fine as long as the elimination of those tables does not cause interruptions and massive application changes.

So how does one justify moving up the ladder? One method is to look at each step as a cost versus benefit analysis. What is the cost to fully manage the new index, summary table, or denormalized data model? The cost needs to include the physical aspects, such as disk space and the resources to manage the structure, as well as the opportunity lost due to the time delay introduced by maintaining the process. Once this is known, you can balance it against the benefits.

One benefit is clearly faster performance when the query is run, and the opportunity gained by the quicker response. Other benefits may be higher user concurrency and throughput, user satisfaction and productivity, and better utilization of third-party tools.

Another aspect to consider is the frequency of the queries and how consistently the performance holds. For example, one account was running 1,000 queries a week. Seventy percent of the queries asked summary-level questions against detail data and ran approximately 4 minutes each; against a summary table they executed in approximately 6 seconds. Doing the math: 700 queries times 4 minutes is 2,800 minutes, while 700 queries times 6 seconds is 70 minutes, a difference of 2,730 minutes of processing. It was also found that the summary table would take 105 minutes of maintenance a week, still saving 2,625 minutes each week. This was a pretty clear case for building the summary table.

It is interesting to note that over time the query usage shifted from one where many queries were resolved by summary data to one where the clear majority of queries go against the detail data. With fewer queries benefiting from the summary table, it was simply dropped, without impact to other processes.

Of course, the formula is not the only criterion to use when deciding whether to take the step up the ladder, though it is a good sanity check.
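The sanity check above can be written as a simple formula. The symbols are shorthand introduced here for illustration only: N is the number of weekly queries that could use the summary, t_d and t_s are the per-query times against detail and summary, and t_m is the weekly maintenance time for the summary structure. Plugging in the numbers from the example (6 seconds = 0.1 minutes):

\[
\text{net weekly saving} = N\,(t_d - t_s) - t_m
= 700 \times (4\ \text{min} - 0.1\ \text{min}) - 105\ \text{min}
= 2{,}730 - 105 = 2{,}625\ \text{minutes}.
\]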
