Are you taking full advantage of the ability to presummarize your data?
By Lawrence Corr
"The single most dramatic way to affect performance in a large data warehouse is to provide a proper set of aggregate (summary) records ... in some cases speeding queries by a factor of 100 or even 1,000. No other means exist to harvest such spectacular gains."
Those are Ralph Kimball's words from "Aggregate Navigation With (Almost) No Metadata" (DBMS magazine, August 1996). If the results are so spectacular - and I don't hear anyone arguing they aren't - why then is aggregation so underused?
First, I believe that the answer lies in our relational database culture or folklore. We were all taught not to redundantly store what could be calculated. This, by the way, is a restriction that users of multidimensional databases have happily ignored. Second, many of us are not clear on what constitutes a good set of aggregates.
One way to get more comfortable with aggregate tables, and to see how mandatory they are, is to think of them as indexes. We would not dream of implementing any significant OLTP or data warehouse database without indexes. These traditional indexes usually duplicate the information content of indexed columns, yet we don't disparage this duplication as "redundancy," because of the benefits.
Traditional indexes get us very quickly to a small number of qualifying records - enough to fill a screen in an OLTP application or satisfy an operational report. However, in the data warehouse, the vast majority of queries ask for millions of records to be summarized, in many cases causing the database to bypass indexes altogether and perform table scans. To speed up these queries, we need another kind of index that can quickly and logically get us to millions of records. I say "logically" because we need only the summarized results of these millions of fact records, as no one but the data miner wants to see the individual records.
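To make the "summary index" idea concrete, here is a minimal sketch using SQLite; the table names, columns, and figures are invented for illustration, not taken from the article.

```python
import sqlite3

# A tiny daily sales fact table and a monthly aggregate built from it.
# All names and numbers here are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_fact (
        day TEXT, product_key INTEGER, store_key INTEGER,
        quantity INTEGER, revenue REAL
    )""")
conn.executemany("INSERT INTO sales_fact VALUES (?,?,?,?,?)", [
    ("2007-01-03", 1, 10, 2, 19.98),
    ("2007-01-17", 1, 10, 1, 9.99),
    ("2007-02-05", 1, 10, 3, 29.97),
    ("2007-01-03", 2, 10, 5, 24.95),
])

# The aggregate acts as a "summary index": one row per product per month,
# standing in for what would be millions of detail rows in a real warehouse.
conn.execute("""
    CREATE TABLE sales_month_agg AS
    SELECT substr(day, 1, 7) AS month, product_key,
           SUM(quantity) AS quantity, SUM(revenue) AS revenue
    FROM sales_fact
    GROUP BY month, product_key
""")

# A monthly query now reads the small aggregate instead of scanning detail,
# yet returns the same answer as summing the fact table.
agg = conn.execute(
    "SELECT month, SUM(revenue) FROM sales_month_agg "
    "GROUP BY month ORDER BY month"
).fetchall()
print(agg)
```

The I/O saving comes entirely from the reduced row count: the aggregate is read in place of the detail it summarizes.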
Once we start thinking of aggregates as summary indexes we get two tremendous benefits:
First, we stop talking about data marts as containers of only departmental, summarized data and the data warehouse as (the slow) holder of atomic, detailed transaction data. Instead, we naturally think of the summary indexes (aggregates) belonging in the same database as the low-level atomic data that is indexed. We can then reclaim the term data mart to more usefully describe the logically complete building block that we use to incrementally deliver high-performance enterprise data warehouses one subject area at a time.
Second, we become very clear about how aggregates should be used. We don't expect to tell the database to use a particular index when we construct a query, barring the occasional use of hints in the hand-tuned SQL within production systems. What we do expect is for the database optimizer to quietly choose the best combination of indexes behind the scenes. Such transparent use of indexes allows a DBA to drop unused indexes and rebuild better ones.
Ideally, aggregates should be used and administered in exactly the same way. We should not point our queries directly at aggregates. Instead, we should take advantage of aggregate navigation that automatically rewrites queries to access the best presently available aggregates. For a detailed description of how aggregate navigation might work, take a look at Ralph Kimball's aforementioned article. We can now view aggregation navigation as another form of optimization that should be offered by database query optimizers or intelligent middleware.
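In the spirit of Kimball's article, a navigator can be sketched in a few lines: given the dimensional grains a query needs, choose the smallest registered table that can serve them and rewrite the query against it. Every table name, grain, and row count below is a hypothetical illustration, not a description of any real product.

```python
# Registered tables: (name, grains stored, row count).
AGGREGATES = [
    ("sales_fact",             {"day", "product", "store"}, 1_000_000_000),
    ("sales_month_agg",        {"month", "product", "store"}, 40_000_000),
    ("sales_month_region_agg", {"month", "region"},              500_000),
]

# Grains each stored grain can roll up to: day detail can answer monthly
# queries, store detail can answer regional queries, and so on.
ROLLS_UP_TO = {
    "day": {"day", "month"}, "month": {"month"},
    "store": {"store", "region"}, "region": {"region"},
    "product": {"product"},
}

def servable(stored_grains):
    """All query grains a table with these stored grains can answer."""
    out = set()
    for g in stored_grains:
        out |= ROLLS_UP_TO[g]
    return out

def navigate(needed_grains):
    """Pick the smallest table able to answer the query, the way a
    navigator chooses the best presently available aggregate."""
    fits = [(rows, name) for name, grains, rows in AGGREGATES
            if needed_grains <= servable(grains)]
    return min(fits)[1] if fits else None

print(navigate({"month", "product"}))  # the small monthly aggregate wins
print(navigate({"day", "product"}))    # only the base fact table will do
```

Because the choice is made behind the scenes, the DBA can drop an aggregate from the registry or add a better one without touching a single query, exactly as with traditional indexes.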
If we continue to view data marts as mere departmental summaries, we fall into the trap of hard-coding access to these aggregates in queries written against specific marts; it then becomes unclear how such queries might drill through automatically to detail data held in a differently structured, more atomic database. This obsolete take on data marts is an example of design that hasn't kept pace with technology as aggregate navigation has become more widely available.
Once we start to see aggregates in their true light, we can start planning to take full advantage of them. Although aggregates can profoundly improve query performance, their advantage must be balanced against the cost of storing, building, maintaining, and administering them. The legendary retail data warehouse that collapsed under the weight of more than 2,500 aggregate tables and took more than 24 hours to refresh is a particularly dramatic example of imbalance. The following guidelines can help you decide on the right number of aggregates for your situation.
Set an overall aggregate storage limit. Consider setting aside a reasonable amount of storage for aggregates - say, enough to double the overall storage used. This limit might initially seem high, but we willingly accept this kind of overhead for traditional indexes, so why not for summary indexes? Significantly undershooting the figure means spending more money on processing power and memory to speed up queries; significantly overshooting it brings diminishing returns on your storage investment, because you lack the time to maintain so many additional tables. The goal is to let the DBA provide a dynamic portfolio of aggregates that can be adjusted to match changing demands as the data warehouse matures.
Define aggregates that are small enough. Each aggregate should be 10 to 20 times smaller than the fact table or existing aggregate on which it is based, while still containing enough information to answer common queries. An aggregate's size determines its performance improvement; in the final analysis, it is the reduced I/O activity that provides the performance gain. If the aggregate table is only two or three times smaller than its base table, the limited improvement will not justify the cost of maintaining it.
Unfortunately, it is very easy to define aggregates that are much larger than you anticipate. How many times smaller would a monthly product sales aggregate be than a daily product sales fact table? If your answer is 30 times, you are forgiven, but you are probably wrong. The reason is "sparsity failure." Is every product sold every day? Probably not. Is every product sold at least once a month? Much more likely (otherwise we stop offering that product), in which case the monthly aggregate may not even be 10 times smaller.
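The arithmetic behind sparsity failure is easy to sketch. The product counts and sparsity percentages below are assumptions made up for the sake of the example, not figures from the article.

```python
# Suppose 10% of products sell on any given day, but 90% of products
# sell at least once in any given month (both figures are assumptions).
products, days, months = 10_000, 365, 12
daily_sparsity, monthly_sparsity = 0.10, 0.90

# Rows actually present at each grain (only combinations that occur).
daily_rows = products * days * daily_sparsity        # daily fact rows
monthly_rows = products * months * monthly_sparsity  # monthly aggregate rows

# The naive expectation is a 30x reduction (365 days / 12 months);
# sparsity failure shrinks it dramatically.
reduction = daily_rows / monthly_rows
print(f"naive expectation: 30x; actual: {reduction:.1f}x")
```

Under these assumed percentages the monthly aggregate is only about three times smaller than the daily fact table - far short of the naive 30x and well below the 10-to-20x target.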
Spread aggregates. Our goal should be to accelerate a broad range of queries, including a few that will be 1,000 times faster. To meet this goal we need to spread our aggregates: we don't just want 10 aggregates that each offer a 10-fold improvement; we need a few relatively large aggregate tables that offer a 10-fold improvement and a larger number of small tables that are 100 or 1,000 times faster. It is useful to plot the performance improvement factors of your aggregates against the amount of space used. To visualize aggregate spread, I use an Excel pivot table, so I can see the aggregate space used for the whole warehouse or drill down on specific fact tables. (See Figure.)
This example shows a poor spread - with less than half the allocated space for aggregates in use - and of that, far too much taken by aggregates that are less than 10 times faster (smaller) than the base fact tables. I need to work on increasing the two green slices.
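A spread like the one plotted in the figure can be tabulated with a short script. The table names, row counts, and base fact size below are invented for illustration; like the figure, the example deliberately shows a poor spread.

```python
# Base fact table size and candidate aggregates (rows as a proxy for space).
# All names and sizes are hypothetical.
base_fact_rows = 1_000_000_000
aggregates = {
    "month_product_store": 120_000_000,  # barely 8x smaller than base
    "month_product_promo":  25_000_000,
    "month_product":          9_000_000,
    "month_region":             400_000,
    "year_region":               12_000,
}

def bucket(factor):
    """Classify an aggregate by its size-reduction (speed-up) factor."""
    if factor >= 1000:
        return ">=1000x"
    if factor >= 100:
        return "100-999x"
    if factor >= 10:
        return "10-99x"
    return "<10x"

# Total space consumed per improvement bucket - a poor spread shows most
# space sunk into aggregates that are less than 10 times smaller.
spread = {}
for name, rows in aggregates.items():
    b = bucket(base_fact_rows / rows)
    spread[b] = spread.get(b, 0) + rows

for b in (">=1000x", "100-999x", "10-99x", "<10x"):
    print(b, spread.get(b, 0))
```

Here the under-10x bucket dominates the space used, the same imbalance the figure illustrates: the fix is to retire the near-base-size aggregate and invest the space in many smaller, faster ones.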
Think of aggregates as indexes. Planning a good spread helps enormously, but in many instances aggregates are still not as easy to define or maintain as traditional indexes, even with the latest DBMS features. The good news is that if you have adopted dimensional models for your atomic detail data, it becomes easier to recognize the opportunities to use aggregates and to then describe them.
Guest columnist Lawrence Corr [firstname.lastname@example.org] designs large data warehouses through DecisionOne Consulting, based in the United Kingdom.
Tuesday, December 25, 2007