Data Warehouse Knowledge Base: April 2008

Wednesday, April 30, 2008

Types of Facts and Fact Tables

Types of Facts
- Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
- Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
- Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns:
- Date
- Store
- Product
- Sales_Amount

The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represents the total sales amount for that week.
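
As a minimal sketch of additivity, the Python snippet below sums Sales_Amount along each of the three dimensions in turn. The rows, store names, and amounts are invented purely for illustration, and total_by is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical fact rows: (date, store, product, sales_amount).
sales_fact = [
    ("2008-04-28", "Store A", "Widget", 100.0),
    ("2008-04-28", "Store B", "Widget", 150.0),
    ("2008-04-29", "Store A", "Gadget", 200.0),
    ("2008-04-29", "Store B", "Widget", 50.0),
]

def total_by(rows, column_index):
    """Sum sales_amount grouped by the dimension stored at column_index."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[column_index]] += row[3]
    return dict(totals)

# Each grouping yields a meaningful business number:
# sales per day, sales per store, and sales per product.
by_date = total_by(sales_fact, 0)
by_store = total_by(sales_fact, 1)
by_product = total_by(sales_fact, 2)
grand_total = sum(row[3] for row in sales_fact)
```

Whichever dimension we group by, the result answers a sensible question, which is what we expect from an additive fact.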

Say we are a bank with the following fact table:
- Date
- Account
- Current_Balance
- Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact: it makes sense to add it up across all accounts (what is the total current balance for all accounts in the bank?), but it does not make sense to add it up through time (adding up the current balances of a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, because it does not make sense to add it up at either the account level or the day level.
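
The semi-additive behaviour of Current_Balance can be sketched in a few lines of Python. The account IDs and balances are made up, and taking the latest balance is only one common way to summarize a balance over time (an average is another); neither is prescribed by the text above.

```python
# Hypothetical end-of-day snapshot rows: (date, account, current_balance).
balance_fact = [
    ("2008-04-28", "ACC-1", 1000.0),
    ("2008-04-28", "ACC-2", 500.0),
    ("2008-04-29", "ACC-1", 1200.0),
    ("2008-04-29", "ACC-2", 500.0),
]

def bank_total_on(rows, date):
    """Summing across accounts for a single date is meaningful."""
    return sum(balance for d, _, balance in rows if d == date)

def closing_balance(rows, account):
    """Across time, take the latest balance instead of a sum."""
    history = [(d, balance) for d, acc, balance in rows if acc == account]
    return max(history)[1]  # ISO-formatted dates sort chronologically as strings

total_apr_29 = bank_total_on(balance_fact, "2008-04-29")
acc1_closing = closing_balance(balance_fact, "ACC-1")

# This sum can be computed, but the number has no business meaning.
acc1_summed_over_days = sum(b for _, acc, b in balance_fact if acc == "ACC-1")
```

Here the bank-wide total for April 29 is a real quantity, while the sum of one account's balances over two days is just an arithmetic artifact.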

Types of Fact Tables
Based on the above classifications, there are two types of fact tables:
- Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact table are mostly additive facts. The first example presented here is a cumulative fact table.

- Snapshot: This type of fact table describes the state of things at a particular instant in time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Bill Inmon vs. Ralph Kimball

In the data warehousing field, we often hear discussions about whether a person's or organization's philosophy falls into Bill Inmon's camp or Ralph Kimball's camp. We describe below the difference between the two.

Bill Inmon's paradigm
The data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.

Ralph Kimball's paradigm
The data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.

There is no right or wrong between these two ideas, as they represent different data warehousing philosophies. In reality, the data warehouses in most enterprises are closer to Ralph Kimball's idea. This is because most data warehouses started out as a departmental effort, and hence they originated as a data mart. Only when more data marts are built later do they evolve into a data warehouse.

MOLAP, ROLAP, And HOLAP

In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

Advantages
- Excellent performance: MOLAP cubes are built for fast data retrieval, and are optimal for slicing and dicing operations.
- Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages
- Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
- Requires additional investment: Cube technology is often proprietary and does not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.
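
To make the idea of pre-generated calculations concrete, here is a toy Python sketch that pre-aggregates every combination of two dimensions when the "cube" is built. It is only an illustration under invented data; real MOLAP products use their own proprietary storage, and build_cube and the ALL marker are hypothetical names.

```python
from collections import defaultdict
from itertools import product

# Hypothetical detail rows: (store, product, sales_amount).
detail_rows = [
    ("Store A", "Widget", 100.0),
    ("Store A", "Gadget", 200.0),
    ("Store B", "Widget", 150.0),
]

ALL = "*"  # marker meaning "all members of this dimension"

def build_cube(rows):
    """Pre-aggregate every (store, product) cell, including roll-ups."""
    cube = defaultdict(float)
    for store, prod, amount in rows:
        # Each detail row contributes to four cells: the leaf cell, the two
        # one-dimension roll-ups, and the grand total.
        for store_key, prod_key in product((store, ALL), (prod, ALL)):
            cube[(store_key, prod_key)] += amount
    return dict(cube)

cube = build_cube(detail_rows)

# At query time each answer is a simple lookup; the detail rows are not scanned.
store_a_total = cube[("Store A", ALL)]
widget_total = cube[(ALL, "Widget")]
grand_total = cube[(ALL, ALL)]
```

Three detail rows already produce eight stored cells, which hints at why the work done at build time limits how much data fits comfortably in a cube.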

ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.
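
The following sketch shows the idea with Python's built-in sqlite3 module: every slice the user picks becomes one more condition in the WHERE clause. The table name, columns, and sales_total helper are assumptions made for this example, not part of any particular ROLAP tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (sale_date TEXT, store TEXT, product TEXT, sales_amount REAL)"
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [
        ("2008-04-28", "Store A", "Widget", 100.0),
        ("2008-04-28", "Store B", "Widget", 150.0),
        ("2008-04-29", "Store A", "Gadget", 200.0),
    ],
)

def sales_total(filters):
    """Build and run a SUM query, adding one WHERE condition per slice."""
    allowed = {"sale_date", "store", "product"}  # only known dimension columns
    conditions, params = [], []
    for column, value in filters.items():
        if column not in allowed:
            raise ValueError(f"unknown dimension: {column}")
        conditions.append(f"{column} = ?")
        params.append(value)
    sql = "SELECT COALESCE(SUM(sales_amount), 0) FROM sales_fact"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return conn.execute(sql, params).fetchone()[0]

all_sales = sales_total({})
store_a = sales_total({"store": "Store A"})                             # slice
store_a_widgets = sales_total({"store": "Store A", "product": "Widget"})  # dice
```

Each call issues a fresh query against the relational table, which is also why response time depends on the size of the underlying data.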

Advantages
- Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
- Can leverage functionalities inherent in the relational database: Often, relational databases already come with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages
- Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
- Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.

HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.
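
A rough sketch of this split, again using sqlite3 with invented data: summary questions are answered from a small pre-aggregated structure, and detail questions "drill through" to the relational rows. The names summary_cube, store_total, and drill_through are hypothetical, and real HOLAP engines are far more elaborate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_detail (store TEXT, product TEXT, sales_amount REAL)")
conn.executemany(
    "INSERT INTO sales_detail VALUES (?, ?, ?)",
    [
        ("Store A", "Widget", 100.0),
        ("Store A", "Gadget", 200.0),
        ("Store B", "Widget", 150.0),
    ],
)

# "Cube" side: summary totals per store, computed once up front.
summary_cube = dict(
    conn.execute("SELECT store, SUM(sales_amount) FROM sales_detail GROUP BY store")
)

def store_total(store):
    """Summary question: answered from the pre-aggregated cube."""
    return summary_cube[store]

def drill_through(store):
    """Detail question: go back to the relational rows behind the summary cell."""
    cursor = conn.execute(
        "SELECT product, sales_amount FROM sales_detail WHERE store = ? ORDER BY product",
        (store,),
    )
    return cursor.fetchall()

store_a_summary = store_total("Store A")
store_a_detail = drill_through("Store A")
```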

Conceptual, Logical, And Physical Data Models

There are three levels of data modeling: conceptual, logical, and physical. This section explains the differences among the three, the order in which each one is created, and how to go from one level to the next.

Conceptual Data Model
Features of the conceptual data model include:
- Includes the important entities and the relationships among them.
- No attribute is specified.
- No primary key is specified.
At this level, the data modeler attempts to identify the highest-level relationships among the different entities.

Logical Data Model
Features of the logical data model include:
- Includes all entities and relationships among them.
- All attributes for each entity are specified.
- The primary key for each entity is specified.
- Foreign keys (keys identifying the relationship between different entities) are specified.
- Normalization occurs at this level.
At this level, the data modeler attempts to describe the data in as much detail as possible, without regard to how they will be physically implemented in the database.

In data warehousing, it is common for the conceptual data model and the logical data model to be combined into a single step (deliverable).

The steps for designing the logical data model are as follows:
- Identify all entities.
- Specify primary keys for all entities.
- Find the relationships between different entities.
- Find all attributes for each entity.
- Resolve many-to-many relationships.
- Normalization.
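
As an illustration of the "resolve many-to-many relationships" step, the sketch below models a hypothetical case where a customer can hold several accounts and an account can have several holders. The entity names and sample data are assumptions for this example; the relationship is resolved with an associative entity (AccountHolder) that carries one foreign key to each side.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Customer:
    customer_id: int  # primary key
    name: str

@dataclass(frozen=True)
class Account:
    account_id: int  # primary key
    account_type: str

@dataclass(frozen=True)
class AccountHolder:
    """Associative entity resolving the Customer <-> Account many-to-many."""
    customer_id: int  # foreign key to Customer
    account_id: int   # foreign key to Account

customers = [Customer(1, "Manveen"), Customer(2, "Alex")]
accounts = [Account(10, "Checking"), Account(11, "Savings")]
holdings = [AccountHolder(1, 10), AccountHolder(2, 10), AccountHolder(1, 11)]

def accounts_for(customer_id):
    """All account IDs held by one customer."""
    return sorted(h.account_id for h in holdings if h.customer_id == customer_id)

def holders_of(account_id):
    """All customer IDs that hold one account."""
    return sorted(h.customer_id for h in holdings if h.account_id == account_id)
```

With the associative entity in place, each remaining relationship is one-to-many, which is what the later steps expect.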

Physical Data Model
Features of the physical data model include:
- Specification of all tables and columns.
- Foreign keys are used to identify relationships between tables.
- Denormalization may occur based on user requirements.
- Physical considerations may cause the physical data model to be quite different from the logical data model.
At this level, the data modeler will specify how the logical data model will be realized in the database schema.

The steps for physical data model design are as follows:
- Convert entities into tables.
- Convert relationships into foreign keys.
- Convert attributes into columns.
- Modify the physical data model based on physical constraints / requirements.
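
The conversion steps can be sketched with SQLite through Python's sqlite3 module. The customer, account, and account_holder tables are hypothetical: each entity becomes a table, each attribute a column, and each relationship a foreign key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
-- Entities become tables; attributes become columns.
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE account (
    account_id   INTEGER PRIMARY KEY,
    account_type TEXT NOT NULL
);
-- Relationships become foreign keys.
CREATE TABLE account_holder (
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    account_id  INTEGER NOT NULL REFERENCES account(account_id),
    PRIMARY KEY (customer_id, account_id)
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Manveen')")
conn.execute("INSERT INTO account VALUES (10, 'Checking')")
conn.execute("INSERT INTO account_holder VALUES (1, 10)")

# A row pointing at a customer that does not exist violates the foreign key.
try:
    conn.execute("INSERT INTO account_holder VALUES (99, 10)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The last step in the list, adjusting for physical constraints, is where indexes, partitioning, or denormalization would come in; those choices depend on the target database and are not shown here.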

Slowly changing dimensions

The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a nutshell, this applies to cases where the value of an attribute for a record varies over time.

The following example illustrates the problem:

Manveen is a customer with Gucci. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record:
Customer Key | Name | State
-----------------------------------
1001 | Manveen | Illinois

At a later date, in January 2003, she moved to Los Angeles, California. How should Gucci now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.

There are in general three ways to solve this type of problem, and they are categorized as follows:

Type 1: The new record replaces the original record. No trace of the old record exists.
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.

In our example, recall we originally have the following table:
Customer Key | Name | State
-------------------------------------
1001 | Manveen | Illinois

After Manveen moved from Illinois to California, the new information replaces the original record, and we have the following table:
Customer Key | Name | State
-------------------------------------
1001 | Manveen | California
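
A minimal sketch of a Type 1 change using sqlite3 follows. The table name customer_dim and the helper apply_type1_change are assumptions for illustration; the point is simply that the change is an in-place UPDATE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, name TEXT, state TEXT)"
)
conn.execute("INSERT INTO customer_dim VALUES (1001, 'Manveen', 'Illinois')")

def apply_type1_change(customer_key, new_state):
    """Overwrite the attribute in place; the previous value is not kept."""
    conn.execute(
        "UPDATE customer_dim SET state = ? WHERE customer_key = ?",
        (new_state, customer_key),
    )

apply_type1_change(1001, "California")
rows = conn.execute("SELECT customer_key, name, state FROM customer_dim").fetchall()
```

After the update the table holds a single row for key 1001, and nothing in it records that the state was ever Illinois.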

Advantages
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

Disadvantages
- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Manveen lived in Illinois before.

Usage
About 50% of the time.

When to use Type 1
Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.

Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.

In our example, recall we originally have the following table:
Customer Key | Name | State
------------------------------------
1001 | Manveen | Illinois

After Manveen moved from Illinois to California, we add the new information as a new row into the table:
Customer Key | Name | State
--------------------------------------
1001 | Manveen | Illinois
1005 | Manveen | California
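
A Type 2 change can be sketched as an INSERT rather than an UPDATE. The table and helper names are hypothetical, and SQLite assigns the next available integer as the new key, so the value will generally not be the 1005 shown in the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, name TEXT, state TEXT)"
)
conn.execute("INSERT INTO customer_dim VALUES (1001, 'Manveen', 'Illinois')")

def apply_type2_change(name, new_state):
    """Insert a new row with a fresh key; the old row stays untouched."""
    cursor = conn.execute(
        "INSERT INTO customer_dim (name, state) VALUES (?, ?)",
        (name, new_state),
    )
    return cursor.lastrowid  # key generated by the database for the new row

new_key = apply_type2_change("Manveen", "California")
rows = conn.execute(
    "SELECT customer_key, name, state FROM customer_dim ORDER BY customer_key"
).fetchall()
```

Both rows now coexist, so facts recorded before the move can keep pointing at key 1001 while new facts use the new key.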

Advantages
- This allows us to accurately keep all historical information.

Disadvantages
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.

Usage
About 50% of the time.

When to use Type 2
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.

Type 3: The original record is modified to reflect the change.
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active.

In our example, recall we originally have the following table:
Customer Key | Name | State
-----------------------------------
1001 | Manveen | Illinois

To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
- Customer Key
- Name
- Original State
- Current State
- Effective Date

After Manveen moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):
Customer Key | Name | Original State | Current State | Effective Date
-------------------------------------------------------------------------
1001 | Manveen | Illinois | California | 15-JAN-2003

Advantages
- This does not increase the size of the table, since the existing row is updated in place.
- This allows us to keep some part of history.

Disadvantages
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Manveen later moves to Texas on December 15, 2003, the California information will be lost.
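
The Type 3 behaviour, including the loss of intermediate history, can be sketched as follows. The column names mirror the list above; the assumption that a new customer starts with current_state equal to original_state and an empty effective date is ours, as are the helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_dim (
        customer_key   INTEGER PRIMARY KEY,
        name           TEXT,
        original_state TEXT,
        current_state  TEXT,
        effective_date TEXT
    )
""")
# Assumed starting point: current state equals original state, no effective date yet.
conn.execute("INSERT INTO customer_dim VALUES (1001, 'Manveen', 'Illinois', 'Illinois', NULL)")

def apply_type3_change(customer_key, new_state, effective_date):
    """Update only the current-value columns; original_state is left alone."""
    conn.execute(
        "UPDATE customer_dim SET current_state = ?, effective_date = ? WHERE customer_key = ?",
        (new_state, effective_date, customer_key),
    )

def fetch(customer_key):
    """Return the single dimension row for the given key."""
    return conn.execute(
        "SELECT * FROM customer_dim WHERE customer_key = ?", (customer_key,)
    ).fetchone()

apply_type3_change(1001, "California", "2003-01-15")
after_first_move = fetch(1001)

apply_type3_change(1001, "Texas", "2003-12-15")
after_second_move = fetch(1001)  # California no longer appears anywhere
```

After the second move the row still remembers Illinois and now shows Texas, but the California period has been overwritten.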

Usage
Type 3 is rarely used in actual practice.

When to use Type 3
Type 3 slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur a finite number of times.

Glossary of most common terms

The dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal form, commonly used for transactional (OLTP) type systems. As you can imagine, the same data would then be stored differently in a dimensional model than in a 3rd normal form model.

To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.

Hierarchy: The specification of levels that represents the relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.

Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: A date column, a store column, and a sales amount column.

Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specify how that particular quarter is represented on a report (for example, first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.

Star Schema: In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.
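
A simple star can be sketched in SQLite as one fact table joined to two lookup tables. All table names, columns, and sample rows are invented for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_lookup  (date_key INTEGER PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE store_lookup (store_key INTEGER PRIMARY KEY, store_name TEXT, region TEXT);

-- The fact table sits in the middle and references each lookup table.
CREATE TABLE sales_fact (
    date_key     INTEGER REFERENCES date_lookup(date_key),
    store_key    INTEGER REFERENCES store_lookup(store_key),
    sales_amount REAL
);

INSERT INTO date_lookup  VALUES (20080429, '2008-04', 2008), (20080501, '2008-05', 2008);
INSERT INTO store_lookup VALUES (1, 'Store A', 'East'), (2, 'Store B', 'West');
INSERT INTO sales_fact   VALUES (20080429, 1, 100.0), (20080429, 2, 150.0), (20080501, 1, 200.0);
""")

# A typical star query: join the fact table to its lookups, then group by attributes.
sales_by_region_and_month = conn.execute("""
    SELECT s.region, d.month, SUM(f.sales_amount)
    FROM sales_fact AS f
    JOIN date_lookup  AS d ON d.date_key  = f.date_key
    JOIN store_lookup AS s ON s.store_key = f.store_key
    GROUP BY s.region, d.month
    ORDER BY s.region, d.month
""").fetchall()
```

Note that the lookup tables never join to each other; every join passes through the fact table at the center of the star.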

Snowflake Schema: The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. The main advantage of the snowflake schema is reduced disk storage, together with joins against smaller lookup tables, which can improve query performance. The main disadvantage of the snowflake schema is the additional maintenance effort needed due to the increased number of lookup tables.

Drill Across: Data analysis across dimensions.

Drill Down: Data analysis to a child attribute.

Drill Through: Data analysis that goes from an OLAP cube into the relational database.

Drill Up: Data analysis to a parent attribute.
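
Drill down and drill up can be pictured as re-aggregating the same measure at a child or parent level of a hierarchy. The sketch below uses a hypothetical Year → Quarter → Month hierarchy with made-up sales figures; aggregate and LEVELS are illustrative names.

```python
from collections import defaultdict

# Hypothetical sales rows at the Month level: (year, quarter, month, sales_amount).
monthly_sales = [
    (2008, "Q1", "Jan", 10.0),
    (2008, "Q1", "Feb", 20.0),
    (2008, "Q1", "Mar", 30.0),
    (2008, "Q2", "Apr", 40.0),
]

LEVELS = ["year", "quarter", "month"]  # parent-to-child order in the hierarchy

def aggregate(rows, level):
    """Total sales at the given hierarchy level, keyed by the path to that level."""
    depth = LEVELS.index(level) + 1
    totals = defaultdict(float)
    for row in rows:
        totals[row[:depth]] += row[3]
    return dict(totals)

by_quarter = aggregate(monthly_sales, "quarter")  # starting view
by_year = aggregate(monthly_sales, "year")        # drill up: Quarter -> Year
by_month = aggregate(monthly_sales, "month")      # drill down: Quarter -> Month
```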

Surrogate Keys: A critical part of data warehouse design is the creation and use of surrogate keys in dimension tables. A surrogate key is the primary key for a dimension table and is independent of any keys provided by source data systems. Surrogate keys are created and maintained in the data warehouse and should not encode any information about the contents of records; automatically increasing integers make good surrogate keys. The original key for each record is carried in the dimension table but is not used as the primary key. Surrogate keys provide the means to maintain data warehouse information when dimensions change. Special keys are used for date and time dimensions, but these keys differ from surrogate keys used for other dimension tables.
In contrast to surrogate keys used in other dimension tables, date and time dimension keys should be "smart." A suggested key for a date dimension is of the form "yyyymmdd". This format is easy for users to remember and incorporate into queries. It is also a recommended surrogate key format for fact tables that are partitioned into multiple tables by date.
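
Both kinds of key can be sketched in a few lines of Python. Storing the "yyyymmdd" key as an integer is an assumption here (a string would work as well), and the source IDs such as "CRM-0042" are invented; surrogate_key_for is a hypothetical helper that hands out automatically increasing integers while remembering the original key.

```python
from datetime import date
from itertools import count

def date_key(d):
    """Smart key for the date dimension in yyyymmdd form, e.g. 20080430."""
    return d.year * 10000 + d.month * 100 + d.day

_next_key = count(1)      # automatically increasing integers
_key_by_source_id = {}    # original (source-system) key -> surrogate key

def surrogate_key_for(source_id):
    """Return the existing surrogate key for a source record, or assign a new one."""
    if source_id not in _key_by_source_id:
        _key_by_source_id[source_id] = next(_next_key)
    return _key_by_source_id[source_id]

april_30 = date_key(date(2008, 4, 30))
first = surrogate_key_for("CRM-0042")
second = surrogate_key_for("CRM-0099")
first_again = surrogate_key_for("CRM-0042")  # same source record, same key
```

The surrogate keys carry no meaning of their own, whereas the date key can be read directly, which is the contrast the text draws.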
