Data Warehouse Knowledge Base

Tuesday, June 10, 2008

How to resolve "OLE DB error: OLE DB or ODBC error: Login failed for user 'NT AUTHORITY\NETWORK SERVICE" in SSAS

Envrionment
- You have SQL server installed on your local machine with localhost as the only instance
- You have SQL server analysis serviecs installed on the machine
- You used SSAS to build a cube and tried to deploy the cube on localhost and got error OLE DB error: OLE DB or ODBC error: Login failed for user 'NT AUTHORITY\NETWORK SERVICE

To resolve this error, do the following
- Right click on My Computer and click Manage
- On the left hand side, expand Services and Applications and click on Services
- Locate service SQL Server Analysis Service
- Right click on it, and click on Properties
- Go to LogOn tab
- Ensure that radio button Local System Account is checked
- Now restart the service.

Try building the cube again. It should work fine.

The following is the screen shot of the final screen.

Wednesday, May 28, 2008

Create Time Dimension in SQL Server

Use dbname

GO

-- Create the time dimension table
CREATE TABLE [dbo].[Dim_Time](
[TimeKey] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[Date] [varchar](8) NOT NULL,
[Year] [smallint] NOT NULL,
[NameOfMonth] [varchar](9) NOT NULL,
[Month] [varchar](2) NOT NULL,
[DayOfWeek] [varchar](9) NOT NULL,
[DayOfMonth] [smallint] NOT NULL,
[WeekOfYear] [smallint] NOT NULL,
)

GO

DECLARE @startdate VARCHAR(10)
DECLARE @enddate VARCHAR(10)

-- Specify date range for which you want to create time dimension
SET @startdate='2008-01-01';
SET @enddate='2015-12-31';

-- Populate the time dimension
WITH mycte AS
(
SELECT CAST(@startdate AS DATETIME) DateValue
UNION ALL
SELECT DateValue + 1 FROM mycte WHERE DateValue + 1 < @enddate

)

INSERT INTO Dim_Time([Date],[Year],[NameOfMonth],[Month],[DayOfWeek],[DayOfMonth],[WeekOfYear])
SELECT CONVERT (VARCHAR,datevalue,112)
,DATEPART(YY, DateValue)
,DATENAME(MM, DateValue)
,DATEPART(MM, DateValue)
,DATENAME(DW, DateValue)
,DATEPART(DD, DateValue)
,DATEPART(WK, DateValue)
from mycte OPTION (MAXRECURSION 0)

Wednesday, May 7, 2008

How to handle Early arriving facts

Data warehouses are usually built around the ideal normative assumption that measured activity (the fact records) arrive in the data warehouse at the same time as the context of the activity (the dimension records). When we have both the fact records and the correct contemporary dimension records, we have the luxury of bookkeeping the dimension keys first, and then using these up-to-date
keys in the accompanying fact records.

Basically three things can happen when we bookkeep the dimension records.
- If the dimension entity (say Customer) is a new member of the dimension, we assign a fresh new surrogate dimension key.
- If the dimension entity is a REVISED version of a Customer, we use the Type 2 slowly changing dimension technique of assigning a new surrogate key and storing the revised Customer description as a new dimension record.
- Finally if the Customer is a familiar, unchanged member of the dimension, we just use the dimension key we already have for that Customer.

For several years, we have been aware of special modifications to these procedures to deal with Late Arriving Facts, namely fact records that come into the warehouse very much delayed. This is a messy situation because we have to search back in history within the data warehouse to decide how to assign the right dimension keys that were in effect when the activity occurred at the right point in the past. See the Intelligent Enterprise article (Backward in Time) on this subject at www.intelligententerprise.com/000929/webhouse.jhtml. If we have Late Arriving Facts, is it possible to have Early Arriving Facts? How can this happen? Are there situations where this is important?

An early arriving fact takes place when the activity measurement arrives at the data warehouse without its full context. In other words, the statuses of the dimensions attached to the activity measurement are ambiguous or unknown for some period of time. If we are living in the conventional batch update cycle of one or more days latency, we can usually just wait for the dimensions to be reported to us. For example, the identification of the new customer may come in a separate feed delayed by several hours. We may just be able to wait until the dependency is resolved. But if we are in a real time data warehousing situation in which the fact record must be made visible NOW, and we don’t know when the dimensional context will arrive, we have some interesting choices.

Our real-time bookkeeping needs to be revised, again using Customer as the problem dimension.
- If the natural Customer key on the incoming fact record can be recognized, then we provisionally attach the surrogate key for the existing most recent version of that Customer to the fact table, but we also hold open the possibility that we will get a revised version of this Customer reported to us ata later time.
- we add the revised Customer record to the dimension with a new surrogate key and then go in and destructively modify the fact record’s foreign key to the Customer table.
- Finally, if we believe that the Customer is new, we assign a new Customer surrogate key with a set of dummy attribute values in a new Customer dimension record. We then return to this dummy dimension record at a later time and make Type 1 (overwrite) changes to its attributes when we get more complete information on the new Customer. At least this step avoids destructively changing any fact table keys.
There is no way to avoid a brief provisional period where the dimensions are “not quite right.” But these bookkeeping steps try to minimize the impact of the unavoidable updates to keys and other fields. If these early arriving records are all housed in a “hot partition” pinned in memory, then aggregate fact table records should not be necessary. Only when the hot partition is conventionally loaded into the static data warehouse tables at the end of the day (and when the dimensions have caught up with the facts) do you need to build the aggregates.

Thursday, May 1, 2008

Common indexing techniques for Datawarehouses

B-Tree Index: Two representations (rowid and bitmap) are implemented at the leaves of the index depending on the cardinality of the data.

Advantages
- It speeds up knownqueries.
- It is well suited for high cardinality.
- The space requirement is independent of the cardinality of the indexed column.
- It is relatively inexpensive when we update the indexed column since individual rows are locked.

Disadvantages
- It performs inefficiently with low cardinality data
- It does not support ad hoc queries. More I/O operations are needed for a wide range of queries.
- The indexes can not be combined before fetching the data. ·

Databases that implement this type of indexing
Most of commercial products (like Oracle, Informix, Red Brick) implement this type of indexing

Pure Bitmap Index: An array of bits is utilized to represent each unique column value of each row in a table, setting the bits corresponding to the row either ON(valued 1) or OFF(valued 0). The equality encoding scheme is used.

Advantages
- It is well suited for lowcardinality columns.
- It utilizes bitwise operations.
- The indexes can be combined before fetching raw data.
- It uses low space
- It works well with parallel machine.
- It is easy to build.
- It performs efficiently with columns involving scalar functions (e.g., COUNT).
- It is easy to add new indexed value.
- It is suitable for OLAP.

Disadvantages
- It performs inefficiently with high cardinality data.
- It is very expensive when we update index column. The whole bitmap segment of the updated row is locked so the other row can not be updated until the lock is released.
- It does not handle spare data well.

Databases that implement this type of indexing
- Oracle
- Informix
- Sybase
- Informix
- Red Brick
- DB2

Encoded Bitmap Index: The index is the binary Bit-Sliced Index built on the attribute domain

Advantages
- It uses space efficiently.
- It performs efficiently with wide range query.

Disadvantages
- It performs inefficiently with equality queries.
- It is very difficult to find a good encoding scheme.
- It is rebuilt every time when a new indexed value for which we run out of bit to represent is added.

Databases that implement this type of indexing
- DB2

Bitmap Join Index: The index is built by restriction of a column on the dimension table in the fact table.

Advantages
- It is flexible.
- It performs efficiently.
- It supports star queries.

Disadvantages
- The order of indexed column is important.

Databases that implement this type of indexing
- Oracle
- Informix
- Red Brick

Projection Index: The index is built by storing actual values of column(s) of indexed table.

Advantages
- It speeds up the performance when a few columns in the table are retrieved.

Disadvantages
- It can be used only to retrieve raw data (i.e., column list in selection).

Databases that implement this type of indexing
- Sybase

Rapidly (large slowly) changing dimensions

A dimension is considered to be a rapidly changing dimension if one or more of its attributes changes frequently in many rows. For a rapidly changing dimension, the dimension table can grow very large from the application of numerous Type 2 changes. The terms "rapid" and "large" are relative, of course. For example, a customer table with 50,000 rows and an average of 10 changes per customer per year will grow to about five million rows in 10 years, assuming the number of customers does not grow. This may be an acceptable growth rate. On the other hand, only one or two changes per customer per year for a ten million row customer table will cause it to grow to hundreds of millions of rows in ten years.

Tracking bands can be used to reduce the rate of change of many attributes that have continuously variable values such as age, size, weight, or income. For example, income can be categorized into ranges such as [0-14,999], [15,000-24,999], [25,000-39,999], and so on, which reduce the frequency of change to the attribute. Although Type 2 change records should not be needed to track age, age bands are often used for other purposes, such as analytical grouping. Birth date can be used to calculate exact age when needed. Business needs will determine which continuously variable attributes are suitable for converting to bands.

Often, the correct solution for a dimension with rapidly changing attributes is to break the offending attributes out of the dimension and create one or more new dimensions. Consider the following example.

An important attribute for customers might be their account status (good, late, very late, in arrears, suspended), and the history of their account status. Over time many customers will move from one of these states to another. If this attribute is kept in the customer dimension table and a Type 2 change is made each time a customer's status changes, an entire row is added only to track this one attribute. The solution is to create a separate account_status dimension with five members to represent the account states.

A foreign key in the customer table points to the record in the account_status dimension table that represents the current account status of that customer. A Type 1 change is made to the customer record when the customer's account status changes. The fact table also contains a foreign key for the account_status dimension. When a new fact record is loaded into the fact table, the customer id in the incoming fact record is used to look up the current account_table key in the customer record and populate it into the fact record. This captures a customer's account history in the fact table. In addition to the benefit of removing the rapidly changing item from the customer dimension, the separate account status dimension enables easy pivot analysis of customers by current account status in OLAP cubes. However, to see the entire account history for a customer, the fact table must be joined to the customer table and the account_status table and then filtered on customer id, which is not very efficient for frequent queries for a customer's account history.

This scenario works reasonably well for a single rapidly changing attribute. What if there are ten or more rapidly changing attributes? Should there be a separate dimension for each attribute? Maybe, but the number of dimensions can rapidly get out of hand and the fact table can end up with a large number of foreign keys. One approach is to combine several of these mini-dimensions into a single physical dimension. This is the same technique used to create what is often called a "junk" dimension that contains unrelated attributes and flags to get them out of the fact table. However, it is still difficult to query these customer attributes well because the fact table must be involved to relate customers to their attributes. Unfortunately, business users are often very interested in this kind of historical information, such as the movement of a customer through the various account status values.

If business users frequently need to query a dimension that has been broken apart like this, the best solution is to create a "factless" schema that focuses on attribute changes. For example, consider a primary data warehouse schema that keeps track of customers' purchases. The Customer dimension has been developed as a Type 2 slowly changing dimension, and account status has been pulled out into a separate dimension. Create a new fact table, CustomerChanges, that tracks only the changes to the customer and account status. A sample schema is illustrated in the following figure(click to enlarge) .

The fact table, CustomerChanges, receives a new row only when a change is made to the Customer table that includes information about the customer's current account status. The fact table has no numeric measure or fact; an entry in the table signifies that an interesting change has occurred to the customer. Optionally, the CustomerChanges schema can track the reason for the change in the CustomerChangeReason and AccountChangeReason dimension tables. Sample values for the account_change_reason might include "Customer terminated account", "Account closed for non-payment", and "Outstanding balance paid in full".

Attribute history tables like this are neither dimension tables nor fact tables in the usual sense. The information in this kind of table is something like Quantity on Hand in an inventory fact table, which cannot be summarized by adding. However, unlike Quantity on Hand in an inventory table, these attributes do not change on a fixed periodic basis, so they cannot be numerically quantified and meaningfully averaged unless the average is weighted by the time between events.

Best Practices

1. Have unknown member in Dimension table
It is often useful to have a pre-established "no such member" or "unknown member" record in each dimension to which orphan fact records can be tied during the update process. Business needs and the reliability of consistent source data will drive the decision as to whether such placeholder dimension records are required

2. Avoid using GUIDs as keys
You should avoid using GUID as key in the data warehouse database. GUIDs may be used in data from distributed source systems, but they are difficult to use as table keys. GUIDs use a significant amount of storage (16 bytes each), cannot be efficiently sorted, and are difficult for humans to read. Indexes on GUID columns may be relatively slower than indexes on integer keys because GUIDs are four times larger.

Wednesday, April 30, 2008

Types of Facts and Fact Tables

Types of Facts
- Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
- Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
- Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.

Let us use examples to illustrate each of the three types of facts. The first example assumes that we are a retailer, and we have a fact table with the following columns:
- Date
- Store
- Product
- Sales_Amount

The purpose of this table is to record the sales amount for each product in each store on a daily basis. Sales_Amount is the fact. In this case, Sales_Amount is an additive fact, because you can sum up this fact along any of the three dimensions present in the fact table -- date, store, and product. For example, the sum of Sales_Amount for all 7 days in a week represent the total sales amount for that week.

Say we are a bank with the following fact table:
- Date
- Account
- Current_Balance
- Profit_Margin

The purpose of this table is to record the current balance for each account at the end of each day, as well as the profit margin for each account for each day. Current_Balance and Profit_Margin are the facts. Current_Balance is a semi-additive fact, as it makes sense to add them up for all accounts (what's the total current balance for all accounts in the bank?), but it does not make sense to add them up through time (adding up all current balances for a given account for each day of the month does not give us any useful information). Profit_Margin is a non-additive fact, for it does not make sense to add them up for the account level or the day level.

Types of Fact Tables
Based on the above classifications, there are two types of fact tables:
- Cumulative: This type of fact table describes what has happened over a period of time. For example, this fact table may describe the total sales by product by store by day. The facts for this type of fact tables are mostly additive facts. The first example presented here is a cumulative fact table.

- Snapshot: This type of fact table describes the state of things in a particular instance of time, and usually includes more semi-additive and non-additive facts. The second example presented here is a snapshot fact table.