Tuesday, June 10, 2008

How to resolve "OLE DB error: OLE DB or ODBC error: Login failed for user 'NT AUTHORITY\NETWORK SERVICE" in SSAS

Envrionment
- You have SQL server installed on your local machine with localhost as the only instance
- You have SQL server analysis serviecs installed on the machine
- You used SSAS to build a cube and tried to deploy the cube on localhost and got error OLE DB error: OLE DB or ODBC error: Login failed for user 'NT AUTHORITY\NETWORK SERVICE

To resolve this error, do the following
- Right click on My Computer and click Manage
- On the left hand side, expand Services and Applications and click on Services
- Locate service SQL Server Analysis Service
- Right click on it, and click on Properties
- Go to LogOn tab
- Ensure that radio button Local System Account is checked
- Now restart the service.

Try building the cube again. It should work fine.

The following is the screen shot of the final screen.

Wednesday, May 28, 2008

Create Time Dimension in SQL Server

Use dbname

GO

-- Create the time dimension table
CREATE TABLE [dbo].[Dim_Time](
[TimeKey] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[Date] [varchar](8) NOT NULL,
[Year] [smallint] NOT NULL,
[NameOfMonth] [varchar](9) NOT NULL,
[Month] [varchar](2) NOT NULL,
[DayOfWeek] [varchar](9) NOT NULL,
[DayOfMonth] [smallint] NOT NULL,
[WeekOfYear] [smallint] NOT NULL,
)

GO

DECLARE @startdate VARCHAR(10)
DECLARE @enddate VARCHAR(10)

-- Specify date range for which you want to create time dimension
SET @startdate='2008-01-01';
SET @enddate='2015-12-31';

-- Populate the time dimension
WITH mycte AS
(
SELECT CAST(@startdate AS DATETIME) DateValue
UNION ALL
SELECT DateValue + 1 FROM mycte WHERE DateValue + 1 < @enddate

)

INSERT INTO Dim_Time([Date],[Year],[NameOfMonth],[Month],[DayOfWeek],[DayOfMonth],[WeekOfYear])
SELECT CONVERT (VARCHAR,datevalue,112)
,DATEPART(YY, DateValue)
,DATENAME(MM, DateValue)
,DATEPART(MM, DateValue)
,DATENAME(DW, DateValue)
,DATEPART(DD, DateValue)
,DATEPART(WK, DateValue)
from mycte OPTION (MAXRECURSION 0)

Wednesday, May 7, 2008

How to handle Early arriving facts

Data warehouses are usually built around the ideal normative assumption that measured activity (the fact records) arrive in the data warehouse at the same time as the context of the activity (the dimension records). When we have both the fact records and the correct contemporary dimension records, we have the luxury of bookkeeping the dimension keys first, and then using these up-to-date
keys in the accompanying fact records.

Basically three things can happen when we bookkeep the dimension records.
- If the dimension entity (say Customer) is a new member of the dimension, we assign a fresh new surrogate dimension key.
- If the dimension entity is a REVISED version of a Customer, we use the Type 2 slowly changing dimension technique of assigning a new surrogate key and storing the revised Customer description as a new dimension record.
- Finally if the Customer is a familiar, unchanged member of the dimension, we just use the dimension key we already have for that Customer.

For several years, we have been aware of special modifications to these procedures to deal with Late Arriving Facts, namely fact records that come into the warehouse very much delayed. This is a messy situation because we have to search back in history within the data warehouse to decide how to assign the right dimension keys that were in effect when the activity occurred at the right point in the past. See the Intelligent Enterprise article (Backward in Time) on this subject at www.intelligententerprise.com/000929/webhouse.jhtml. If we have Late Arriving Facts, is it possible to have Early Arriving Facts? How can this happen? Are there situations where this is important?

An early arriving fact takes place when the activity measurement arrives at the data warehouse without its full context. In other words, the statuses of the dimensions attached to the activity measurement are ambiguous or unknown for some period of time. If we are living in the conventional batch update cycle of one or more days latency, we can usually just wait for the dimensions to be reported to us. For example, the identification of the new customer may come in a separate feed delayed by several hours. We may just be able to wait until the dependency is resolved. But if we are in a real time data warehousing situation in which the fact record must be made visible NOW, and we don’t know when the dimensional context will arrive, we have some interesting choices.

Our real-time bookkeeping needs to be revised, again using Customer as the problem dimension.
- If the natural Customer key on the incoming fact record can be recognized, then we provisionally attach the surrogate key for the existing most recent version of that Customer to the fact table, but we also hold open the possibility that we will get a revised version of this Customer reported to us ata later time.
- we add the revised Customer record to the dimension with a new surrogate key and then go in and destructively modify the fact record’s foreign key to the Customer table.
- Finally, if we believe that the Customer is new, we assign a new Customer surrogate key with a set of dummy attribute values in a new Customer dimension record. We then return to this dummy dimension record at a later time and make Type 1 (overwrite) changes to its attributes when we get more complete information on the new Customer. At least this step avoids destructively changing any fact table keys.
There is no way to avoid a brief provisional period where the dimensions are “not quite right.” But these bookkeeping steps try to minimize the impact of the unavoidable updates to keys and other fields. If these early arriving records are all housed in a “hot partition” pinned in memory, then aggregate fact table records should not be necessary. Only when the hot partition is conventionally loaded into the static data warehouse tables at the end of the day (and when the dimensions have caught up with the facts) do you need to build the aggregates.

Thursday, May 1, 2008

Common indexing techniques for Datawarehouses

B-Tree Index: Two representations (rowid and bitmap) are implemented at the leaves of the index depending on the cardinality of the data.

Advantages
- It speeds up knownqueries.
- It is well suited for high cardinality.
- The space requirement is independent of the cardinality of the indexed column.
- It is relatively inexpensive when we update the indexed column since individual rows are locked.

Disadvantages
- It performs inefficiently with low cardinality data
- It does not support ad hoc queries. More I/O operations are needed for a wide range of queries.
- The indexes can not be combined before fetching the data. ·

Databases that implement this type of indexing
Most of commercial products (like Oracle, Informix, Red Brick) implement this type of indexing

Pure Bitmap Index: An array of bits is utilized to represent each unique column value of each row in a table, setting the bits corresponding to the row either ON(valued 1) or OFF(valued 0). The equality encoding scheme is used.

Advantages
- It is well suited for lowcardinality columns.
- It utilizes bitwise operations.
- The indexes can be combined before fetching raw data.
- It uses low space
- It works well with parallel machine.
- It is easy to build.
- It performs efficiently with columns involving scalar functions (e.g., COUNT).
- It is easy to add new indexed value.
- It is suitable for OLAP.

Disadvantages
- It performs inefficiently with high cardinality data.
- It is very expensive when we update index column. The whole bitmap segment of the updated row is locked so the other row can not be updated until the lock is released.
- It does not handle spare data well.

Databases that implement this type of indexing
- Oracle
- Informix
- Sybase
- Informix
- Red Brick
- DB2

Encoded Bitmap Index: The index is the binary Bit-Sliced Index built on the attribute domain

Advantages
- It uses space efficiently.
- It performs efficiently with wide range query.

Disadvantages
- It performs inefficiently with equality queries.
- It is very difficult to find a good encoding scheme.
- It is rebuilt every time when a new indexed value for which we run out of bit to represent is added.

Databases that implement this type of indexing
- DB2

Bitmap Join Index: The index is built by restriction of a column on the dimension table in the fact table.

Advantages
- It is flexible.
- It performs efficiently.
- It supports star queries.

Disadvantages
- The order of indexed column is important.

Databases that implement this type of indexing
- Oracle
- Informix
- Red Brick

Projection Index: The index is built by storing actual values of column(s) of indexed table.

Advantages
- It speeds up the performance when a few columns in the table are retrieved.

Disadvantages
- It can be used only to retrieve raw data (i.e., column list in selection).

Databases that implement this type of indexing
- Sybase

Rapidly (large slowly) changing dimensions

A dimension is considered to be a rapidly changing dimension if one or more of its attributes changes frequently in many rows. For a rapidly changing dimension, the dimension table can grow very large from the application of numerous Type 2 changes. The terms "rapid" and "large" are relative, of course. For example, a customer table with 50,000 rows and an average of 10 changes per customer per year will grow to about five million rows in 10 years, assuming the number of customers does not grow. This may be an acceptable growth rate. On the other hand, only one or two changes per customer per year for a ten million row customer table will cause it to grow to hundreds of millions of rows in ten years.

Tracking bands can be used to reduce the rate of change of many attributes that have continuously variable values such as age, size, weight, or income. For example, income can be categorized into ranges such as [0-14,999], [15,000-24,999], [25,000-39,999], and so on, which reduce the frequency of change to the attribute. Although Type 2 change records should not be needed to track age, age bands are often used for other purposes, such as analytical grouping. Birth date can be used to calculate exact age when needed. Business needs will determine which continuously variable attributes are suitable for converting to bands.

Often, the correct solution for a dimension with rapidly changing attributes is to break the offending attributes out of the dimension and create one or more new dimensions. Consider the following example.

An important attribute for customers might be their account status (good, late, very late, in arrears, suspended), and the history of their account status. Over time many customers will move from one of these states to another. If this attribute is kept in the customer dimension table and a Type 2 change is made each time a customer's status changes, an entire row is added only to track this one attribute. The solution is to create a separate account_status dimension with five members to represent the account states.

A foreign key in the customer table points to the record in the account_status dimension table that represents the current account status of that customer. A Type 1 change is made to the customer record when the customer's account status changes. The fact table also contains a foreign key for the account_status dimension. When a new fact record is loaded into the fact table, the customer id in the incoming fact record is used to look up the current account_table key in the customer record and populate it into the fact record. This captures a customer's account history in the fact table. In addition to the benefit of removing the rapidly changing item from the customer dimension, the separate account status dimension enables easy pivot analysis of customers by current account status in OLAP cubes. However, to see the entire account history for a customer, the fact table must be joined to the customer table and the account_status table and then filtered on customer id, which is not very efficient for frequent queries for a customer's account history.

This scenario works reasonably well for a single rapidly changing attribute. What if there are ten or more rapidly changing attributes? Should there be a separate dimension for each attribute? Maybe, but the number of dimensions can rapidly get out of hand and the fact table can end up with a large number of foreign keys. One approach is to combine several of these mini-dimensions into a single physical dimension. This is the same technique used to create what is often called a "junk" dimension that contains unrelated attributes and flags to get them out of the fact table. However, it is still difficult to query these customer attributes well because the fact table must be involved to relate customers to their attributes. Unfortunately, business users are often very interested in this kind of historical information, such as the movement of a customer through the various account status values.

If business users frequently need to query a dimension that has been broken apart like this, the best solution is to create a "factless" schema that focuses on attribute changes. For example, consider a primary data warehouse schema that keeps track of customers' purchases. The Customer dimension has been developed as a Type 2 slowly changing dimension, and account status has been pulled out into a separate dimension. Create a new fact table, CustomerChanges, that tracks only the changes to the customer and account status. A sample schema is illustrated in the following figure(click to enlarge) .



The fact table, CustomerChanges, receives a new row only when a change is made to the Customer table that includes information about the customer's current account status. The fact table has no numeric measure or fact; an entry in the table signifies that an interesting change has occurred to the customer. Optionally, the CustomerChanges schema can track the reason for the change in the CustomerChangeReason and AccountChangeReason dimension tables. Sample values for the account_change_reason might include "Customer terminated account", "Account closed for non-payment", and "Outstanding balance paid in full".

Attribute history tables like this are neither dimension tables nor fact tables in the usual sense. The information in this kind of table is something like Quantity on Hand in an inventory fact table, which cannot be summarized by adding. However, unlike Quantity on Hand in an inventory table, these attributes do not change on a fixed periodic basis, so they cannot be numerically quantified and meaningfully averaged unless the average is weighted by the time between events.

Best Practices

1. Have unknown member in Dimension table
It is often useful to have a pre-established "no such member" or "unknown member" record in each dimension to which orphan fact records can be tied during the update process. Business needs and the reliability of consistent source data will drive the decision as to whether such placeholder dimension records are required

2. Avoid using GUIDs as keys
You should avoid using GUID as key in the data warehouse database. GUIDs may be used in data from distributed source systems, but they are difficult to use as table keys. GUIDs use a significant amount of storage (16 bytes each), cannot be efficiently sorted, and are difficult for humans to read. Indexes on GUID columns may be relatively slower than indexes on integer keys because GUIDs are four times larger.

Wednesday, April 30, 2008

Types of Facts and Fact Tables

Types of Facts
- Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
- Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
- Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns:
- Date
- Store
- Product
- Sales_Amount

The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represent the total sales amount for that week.

Say we are a bank with the following fact table:
- Date
- Account
- Current_Balance
- Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add them up for the account level or the day level.

Types of Fact Tables
Based on the above classifications, there are two types of fact tables:
- Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example presented here is a cumulative fact table.

- Snapshot: This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.

Bill Inmon vs. Ralph Kimball

In the data warehousing field, we often hear about discussions on where a person / organization's philosophy falls into Bill Inmon's camp or into Ralph Kimball's camp. We describe below the difference between the two.

Bill Inmon's paradigm
Data warehouse is one part of the overall business intelligence system. An enterprise has one data warehouse, and data marts source their information from the data warehouse. In the data warehouse, information is stored in 3rd normal form.

Ralph Kimball's paradigm
Data warehouse is the conglomerate of all data marts within the enterprise. Information is always stored in the dimensional model.

There is no right or wrong between these two ideas, as they represent different data warehousing philosophies. In reality, the data warehouse in most enterprises are closer to Ralph Kimball's idea. This is because most data warehouses started out as a departmental effort, and hence they originated as a data mart. Only when more data marts are built later do they evolve into a data warehouse.

MOLAP, ROLAP, And HOLAP

In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

MOLAP
This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube. The storage is not in the relational database, but in proprietary formats.

Advantages
- Excellent performance: MOLAP cubes are built for fast data retrieval, and is optimal for slicing and dicing operations.
- Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence, complex calculations are not only doable, but they return quickly.

Disadvantages
- Limited in the amount of data it can handle: Because all calculations are performed when the cube is built, it is not possible to include a large amount of data in the cube itself. This is not to say that the data in the cube cannot be derived from a large amount of data. Indeed, this is possible. But in this case, only summary-level information will be included in the cube itself.
- Requires additional investment: Cube technology are often proprietary and do not already exist in the organization. Therefore, to adopt MOLAP technology, chances are additional investments in human and capital resources are needed.

ROLAP
This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.

Advantages
- Can handle large amounts of data: The data size limitation of ROLAP technology is the limitation on data size of the underlying relational database. In other words, ROLAP itself places no limitation on data amount.
- Can leverage functionalities inherent in the relational database: Often, relational database already comes with a host of functionalities. ROLAP technologies, since they sit on top of the relational database, can therefore leverage these functionalities.

Disadvantages
- Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries) in the relational database, the query time can be long if the underlying data size is large.
- Limited by SQL functionalities: Because ROLAP technology mainly relies on generating SQL statements to query the relational database, and SQL statements do not fit all needs (for example, it is difficult to perform complex calculations using SQL), ROLAP technologies are therefore traditionally limited by what SQL can do. ROLAP vendors have mitigated this risk by building into the tool out-of-the-box complex functions as well as the ability to allow users to define their own functions.

HOLAP
HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can "drill through" from the cube into the underlying relational data.

Conceptual, Logical, And Physical Data Models

There are three levels of data modeling. They are conceptual, logical, and physical. This section will explain the difference among the three, the order with which each one is created, and how to go from one level to the other.

Conceptual Data Model
Features of conceptual data model include:
- Includes the important entities and the relationships among them.
- No attribute is specified.
- No primary key is specified.
At this level, the data modeler attempts to identify the highest-level relationships among the different entities.

Logical Data Model
Features of logical data model include:
- Includes all entities and relationships among them.
- All attributes for each entity are specified.
- The primary key for each entity specified.
- Foreign keys (keys identifying the relationship between different entities) are specified.
- Normalization occurs at this level.
At this level, the data modeler attempts to describe the data in as much detail as possible, without regard to how they will be physically implemented in the database.

In data warehousing, it is common for the conceptual data model and the logical data model to be combined into a single step (deliverable).

The steps for designing the logical data model are as follows:
- Identify all entities.
- Specify primary keys for all entities.
- Find the relationships between different entities.
- Find all attributes for each entity.
- Resolve many-to-many relationships.
- Normalization.

Physical Data Model
Features of physical data model include:
- Specification all tables and columns.
- Foreign keys are used to identify relationships between tables.
- Denormalization may occur based on user requirements.
- Physical considerations may cause the physical data model to be quite different from the logical data model.
At this level, the data modeler will specify how the logical data model will be realized in the database schema.

The steps for physical data model design are as follows
- Convert entities into tables.
- Convert relationships into foreign keys.
- Convert attributes into columns.
Modify the physical data model based on physical constraints / requirements

Slowly changing dimensions

The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a nutshell, this applies to cases where value of the attribute for a record varies over time.

The following example describes it

Manveen is a customer with Gucci. She first lived in Chicago, Illinois. So, the original entry in the customer lookup table has the following record:
Customer Key | Name | State
-----------------------------------
1001 | Manveen | Illinois


At a later date, she moved to Los Angeles, California on January, 2003. How should Gucci now modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.

There are in general three ways to solve this type of problem, and they are categorized as follows:

Type 1: The new record replaces the original record. No trace of the old record exists.
In Type 1 Slowly Changing Dimension, the new information simply overwrites the original information. In other words, no history is kept.

In our example, recall we originally have the following table:
Customer Key | Name | State
-------------------------------------
1001 | Manveen | Illinois


After Manveen moved from Illinois to California, the new information replaces the new record, and we have the following table:
Customer Key | Name | State
-------------------------------------
1001 | Manveen | California


Advantages
- This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

Disadvantages
- All history is lost. By applying this methodology, it is not possible to trace back in history. For example, in this case, the company would not be able to know that Manveen lived in Illinois before.

Usage
About 50% of the time.

When to use Type 1
Type 1 slowly changing dimension should be used when it is not necessary for the data warehouse to keep track of historical changes.


Type 2: A new record is added into the customer dimension table. Therefore, the customer is treated essentially as two people.
In Type 2 Slowly Changing Dimension, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key.

In our example, recall we originally have the following table:
Customer Key | Name | State
------------------------------------
1001 | Manveen | Illinois


After Manveen moved from Illinois to California, we add the new information as a new row into the table:
Customer Key | Name | State
--------------------------------------
1001 | Manveen | Illinois
1005 | Manveen | California


Advantages
- This allows us to accurately keep all historical information.

Disadvantages
- This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.
- This necessarily complicates the ETL process.

Usage
About 50% of the time.

When to use Type 2
Type 2 slowly changing dimension should be used when it is necessary for the data warehouse to track historical changes.


Type 3: The original record is modified to reflect the change.
In Type 3 Slowly Changing Dimension, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active.

In our example, recall we originally have the following table:
Customer Key | Name | State
-----------------------------------
1001 | Manveen | Illinois


To accommodate Type 3 Slowly Changing Dimension, we will now have the following columns:
- Customer Key
- Name
- Original State
- Current State
- Effective Date

After Manveen moved from Illinois to California, the original information gets updated, and we have the following table (assuming the effective date of change is January 15, 2003):
Customer Key | Name | Original State | Current State | Effective Date
-------------------------------------------------------------------------
1001 | Manveen | Illinois | California | 15-JAN-2003

Advantages
- This does not increase the size of the table, since new information is updated.
- This allows us to keep some part of history.

Disadvantages
- Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Manveen later moves to Texas on December 15, 2003, the California information will be lost.

Usage
Type 3 is rarely used in actual practice.

When to use Type 3
Type III slowly changing dimension should only be used when it is necessary for the data warehouse to track historical changes, and when such changes will only occur for a finite number of time.

Glossary of most common terms

Dimensional data model is most often used in data warehousing systems. This is different from the 3rd normal form, commonly used for transactional (OLTP) type systems. As you can imagine, the same data would then be stored differently in a dimensional model than in a 3rd normal form model.

To understand dimensional data modeling, let's define some of the terms commonly used in this type of modeling:

Dimension: A category of information. For example, the time dimension.

Attribute: A unique level within a dimension. For example, Month is an attribute in the Time Dimension.

Hierarchy: The specification of levels that represents relationship between different attributes within a dimension. For example, one possible hierarchy in the Time dimension is Year → Quarter → Month → Day.

Fact Table: A fact table is a table that contains the measures of interest. For example, sales amount would be such a measure. This measure is stored in the fact table with the appropriate granularity. For example, it can be sales amount by store by day. In this case, the fact table would contain three columns: A date column, a store column, and a sales amount column.

Lookup Table: The lookup table provides the detailed information about the attributes. For example, the lookup table for the Quarter attribute would include a list of all of the quarters available in the data warehouse. Each row (each quarter) may have several fields, one for the unique ID that identifies the quarter, and one or more additional fields that specifies how that particular quarter is represented on a report (for example, first quarter of 2001 may be represented as "Q1 2001" or "2001 Q1").

A dimensional model includes fact tables and lookup tables. Fact tables connect to one or more lookup tables, but fact tables do not have direct relationships to one another. Dimensions and hierarchies are represented by lookup tables. Attributes are the non-key columns in the lookup tables.

In designing data models for data warehouses / data marts, the most commonly used schema types are Star Schema and Snowflake Schema.

Star Schema: In the star schema design, a single object (the fact table) sits in the middle and is radially connected to other surrounding objects (dimension lookup tables) like a star. A star schema can be simple or complex. A simple star consists of one fact table; a complex star can have more than one fact table.

Snowflake Schema: The snowflake schema is an extension of the star schema, where each point of the star explodes into more points. The main advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and joining smaller lookup tables. The main disadvantage of the snowflake schema is the additional maintenance efforts needed due to the increase number of lookup tables.

Drill Across: Data analysis across dimensions.

Drill Down: Data analysis to a child attribute.

Drill Through: Data analysis that goes from an OLAP cube into the relational database.

Drill Up: Data analysis to a parent attribute

Surrogate Keys: A critical part of data warehouse design is the creation and use of surrogate keys in dimension tables. A surrogate key is the primary key for a dimension table and is independent of any keys provided by source data systems. Surrogate keys are created and maintained in the data warehouse and should not encode any information about the contents of records; automatically increasing integers make good surrogate keys. The original key for each record is carried in the dimension table but is not used as the primary key. Surrogate keys provide the means to maintain data warehouse information when dimensions change. Special keys are used for date and time dimensions, but these keys differ from surrogate keys used for other dimension tables.
In contrast to surrogate keys used in other dimension tables, date and time dimension keys should be "smart." A suggested key for a date dimension is of the form "yyyymmdd". This format is easy for users to remember and incorporate into queries. It is also a recommended surrogate key format for fact tables that are partitioned into multiple tables by date