Minnu's Blog on Informatica & Data warehouse concepts


Monday, June 16, 2008

What's all in a Snow Flake Model?

"When can I use a snowflake?" is a question data warehouse designers or developers, came across hundreds of times. The usual answer that it's a bad idea to expose the end users to a physical snowflake design, because it almost always compromises understandability and performance. But in certain situations a snowflake design is not only acceptable, but recommended.
Concern with snowflaking comes from the abuse that the entity-relation approach or from suggestions of an relational database or application modellers who encourages it by creating a plethora of little tables by squeezing out all of the redundancy from the dimension table. The blizzard of snowflaked tables can seriously freeze a data warehouse. All of the little tables intimidate end users, who don't like visual complexity, and all of the little tables destroy the performance of cross browsing amongst dimensional attributes that aren't hierarchically related.

The way to create a classic snowflake, let us remind ourselves, is to remove low cardinality attributes from a dimension table and place these attributes in a secondary dimension table connected by a snowflake key. In cases where a set of attributes form a multilevel hierarchy, the resulting string of tables looks a little like a snowflake - hence the name.
But having issued this warning, We have few cases where variations on a snowflake are not only acceptable, but are the keys to a successful design.

The customer dimension is probably the most challenging dimension in a data warehouse. In a large organization, the customer dimension can be huge, with millions of records, and wide, with dozens of attributes.

To make matters worse, the biggest customer dimensions commonly contain two categories of customers, which I will call "visitor" and "customer." Visitors are anonymous. You may see them more than once, but you don't know their names or anything else about them. On a Web site, the only knowledge you have about visitors is a cookie indicating they have returned. In a retail operation, a visitor engages in an anonymous transaction. Customers, conversely, are reliably registered with your company. You know customers' names, addresses, and as much demographic and historical data as you care to elicit directly from them or purchase from third parties.

Let us assume that at the most granular level of your data collection, 80 percent of the fact table measurements involve visitors and 20 percent involve customers. You accumulate just two simple behavior scores for visitors consisting only of recency (when they last visited you) and frequency (how many times they have visited).

On the other hand, let us assume you have 50 attributes and measures for a customer, covering all the components of location, payment behavior, credit behavior, directly elicited demographic attributes, and purchased demographic attributes.

Now you combine visitors and customers into a single logical dimension called shopper. You give the visitor or customer a single, permanent shopper ID, but make the key to the table a surrogate key so that you can track changes to the shopper over time. Logically, the shopper dimension has the following attributes.

The attributes for both visitors and customers are:
- Shopper surrogate key
- Shopper ID (fixed ID for each physical shopper)
- Recency
- Frequency.

Attributes for customers only are:
- Five name attributes
- 10 location attributes
- 10 behavior attributes
- 25 demographic attributes.
Note the importance of including the recency and frequency information as dimensional attributes rather than as facts and overwriting them as time progresses. This decision makes the shopper dimension very powerful. You can do classic shopper segmentation directly off the dimension without navigating a fact table in a complex application.
Assuming that many of the final 50 customer attributes are textual, you could have a total record width of 500 bytes or more. Suppose you have 20 million shoppers (16 million visitors and four million registered customers). Obviously, you are worried that in 80 percent of your records, the trailing 50 fields contain no data! In a 10GB dimension, this condition gets your attention.

This is a clear case where, depending on the database, you want to introduce a snowflake. You should break the dimension into a base dimension and a snowflake subdimension. All the visitors share a single record in the subdimension, which contains special null attribute values. (See FIGURE Below)

FIGURE . A Shopper dimension where 80 percent of the records have 50 null attributes

Image Hosted by ImageShack.us


Assume you are modelling for a bank which processes credit applications for corporate customers. Though it calculates the risk based on many factors to satisfy regulatory requirements, One of the important factors it also considers is the Audited Accounts of the customer for past few years. Now the customers submit their financial statement yearly / Semi-Yearly / Quarterly / Monthly based on the covenants & conditions that they agree with the bank at the time of their credit approval

So the model should accomadate mutliple financial statements for a single customer. Here you can model a snowflake dimension CUSTOMER & CUSTOMER_FINANACIALS having 1-N relationship


Banks, brokerage houses, and insurance companies all have trouble modeling their product dimensions because each of the individual products has a host of special attributes not shared by other products. Except for a set of common "core" attributes, a checking account doesn't look very much like a mortgage or certificate of deposit. They even have different numbers of attributes. If you try to build a single product dimension with the union of all possible attributes, you end up with hundreds of attributes, most of which are empty in a given record.

The answer in this case is to build a context-dependent snowflake. You isolate the core attributes in a base product dimension table, and include a snowflake key in each base record that points to its proper extended product subdimension. (See Figure below.)

FIGURE . A Financial Product dimension with a subdimension for each product type.

Image Hosted by ImageShack.us

This solution is not a conventional relational join! The snowflake key must connect to the particular subdimension table that a specific product type defines. Usually you can accomplish this task by constructing a relational view for each product type that hardwires the correct join path.


Unlike the financial products dimensions, each of the separate calendars can have the same number of attributes describing fiscal periods, seasons, and holidays. But there may be hundreds of separate calendars. An international retailer may have to deal with a calendar for each foreign country.

In this case you modify the snowflake design to let the snowflake key join to a single calendar subdimension. (See Figure Below.) But the subdimension has higher cardinality than the base dimension! The key for the subdimension is both the snowflake key and the organization key.

Fig: A Calendar dimension with a higher cardinality subdimension.

Image Hosted by ImageShack.us

In this situation, you must specify a single organization in the subdimension before evaluating the join between the tables. When done correctly, the subdimension has a one-to-one relationship with the base dimension as if the two tables were a single entity. Now the entire multienterprise data warehouse can be queried through the calendar of any constituent organization.


These three examples show how variations of snowflake designs can be very useful. I hope you feel more confident about answering the question, "When can I use a snowflake?" When you are thinking about design alternatives, you should separate the issues of physical design from those of logical design. Physical design drives performance. Logical design drives understandability. You can certainly use snowflake designs if you maximize both of these goals.

Courtesy: Ralph Kimball