Retail Sales
Overview
Four-step dimensional design process Transaction-level fact tables Additive and non-additive facts Sample dimension table attributes Causal dimensions Degenerate dimensions Extending an existing dimension model Snowflaking dimension attributes Avoiding the “too many dimensions” trap Surrogate keys
1
Four-Step Dimensional Design Process 1.
Select the business process to model.
2.
not business department or function E.g., purchasing, ordering, shipping, invoicing, inventorying
Declare the grain of the business process.
Specifies individual fact table row E.g., individual line item on sales ticket, daily snapshot of the inventory levels for a product
Four-Step Dimensional Design Process 3.
Choose the dimensions that apply for each fact table row.
4.
Identify the numeric (measured) facts that will populate each fact table row.
Q: How do business people describe the data that results from the business process? E.g., date, product, store, customer, transaction type Q: What are we measuring? Typical facts are numeric additive figures E.g., quantity ordered, dollar cost amount
In making decisions regarding the 4 steps, consider both the requirements as well as the realities of the source data
2
Retail Case Study
Large grocery chain: 100 grocery stores over 10regions Each store:
Departments: grocery, frozen foods, dairy, meat, produce, bakery, floral, health/beauty aids, etc. 60,000 products (SKUs = stock keeping units) on shelves 55,000 SKUs with UPCs 5,000 SKUs without UPCs but with assigned SKU numbers
Data is collected:
from cash s into a point-of-sale (POS) system at back door where vendors make deliveries
Retail Case Study – Cont’d
Management concerns
Logistics of ordering, stocking, and selling products Maximizing profit Product pricing Lowering cost of acquisition and overhead Use of promotions to increase sales temporary price reductions newspaper ads grocery store displays coupons
3
Step 1. Select the Business Process
Decide what business process to model, by combining an understanding of the business requirements with an understanding of data realities. The first dimensional model built should be the one
with the most impact, that answers the most pressing business questions, is readily accessible for data extraction.
In retail case study: POS retail sales Business Question: What products are selling in which stores on what days and under what promotional conditions?
Step 2. Declare the Grain
What level of data detail should be made available in the dimensional model? Choose the most atomic information captured by the business process.
Atomic data Most detailed, cannot be subdivided Facilitates ad hoc, unexpected usage and ability to drill down to details
Case study grain: individual line item on a POS transaction
4
Step 3. Choose the Dimensions
A careful grain statement determines the primary dimensions. It is then usually possible to add additional dimensions. If an additional desired dimension violates the grain by causing additional fact rows to be generated, then the grain statement must be revised to accommodate this dimension. Case study dimensions: date, product, store, promotion
Preliminary Retail Sales Schema
POS Sales Transaction Fact
Product Dimension
Promotion Key (PK) Promotion attributes TBD
Date Dimension
Product Key (PK) Product attributes TBD
Promotion Dimension
Date Key (FK) Product Key (FK) Store Key (FK) Promotion Key (FK) POS Transaction Number Other facts TBD
Date Key (PK) Date attributes TBD
Store Dimension
Store Key (PK) Store attributes TBD
5
Step 4. Identify the Facts
Picking the business measurements for the fact table: true to the grain. Case study - Facts collected by POS system:
Sales quantity, sales price/unit, sales $ amount, standard cost $ amount Gross Profit = cost – sales
Recommendation: Include in fact table even though it can be calculated. Eliminates the possibility of error.
For non-additive measurements such as percentages and ratios (e.g., gross margin) store the numerator (gross profit) and denominator ($ revenue) in the fact table. The ratio can be calculated in a data access tool for any slice of the fact table. Caution: Calculate the ratio of the sums, not the sum of the ratios
Date Dimension
Ubiquitous in every data mart Use verbose, self-explanatory values rather than coded values. They are used as column headers in reports. By decoding in the database, we ensure consistency across different application environments.
E.g., Holiday Indicator – use values: Holiday, Nonholiday; as opposed to Y/N
Date Key should be an integer rather than a date data type Data warehouses need an explicit date dimension table to describe fiscal periods, seasons, holidays, weekends, and other calendar calculations that are not ed by the SQL date function. If transaction time is of interest, we may need a separate Time Dimension table
6
Product Dimension
Describes every SKU in the store Fill this dimension with as many descriptive attributes as possible. “Robust dimension attributes deliver robust analytic slicing and dicing capabilities.” Hierarchies = groups of attributes Merchandise hierarchy
SKUs roll up to brands to categories to departments. Each is a many-to-one relationship
Although there will be redundancy, no need to normalize. Given the relative size of the dimension (as compared to the fact table) space saving is minimal.
Store Dimension
The store dimension: Store Key (PK), Store Name, Store Number (Natural Key), Store Address, … Possible to represent multiple hierarchies in a dimension table
Store to any geographic attribute (e.g., ZIP, county, state) Store to store district to region
7
Promotion Dimension
Describes the promotion conditions under which a product is sold Called a “causal dimension” – describes factors thought to cause a change in product sales (price reductions, ads, displays, coupons) Could keep all 4 causal mechanisms in a single dimension
… or split into 4 separate dimensions
They are highly correlated, so not much difference in space requirements More efficient browsing for finding out how various promotions are used together May be more understandable to business istration may be more straightforward
To avoid null keys in the fact table (violation of referential integrity), for line items not being promoted include a row in the promotion dimension to indicate “No Promotion in Effect”
Factless Fact Table
Q: Which products were under promotion but did not sell? Cannot answer yet. POS sales fact table has only products that were sold Answer: Create Promotion Coverage Factless Fact Table
Factless Fact Table = has no measurement metrics Contains date, product, store, and promotion keys
Two-step process to answer Q:
Query Promotion Coverage table: products under promotion on given date From POS Sales Fact table: products sold Answer is the set difference of above
8
Degenerate Dimension (DD)
Dimension keys used in fact table without corresponding dimension tables In case study: POS Transaction # Still useful for grouping by transaction Common DDs: order numbers, invoice numbers Fact table primary key: Product Key and POS Transaction Number
Retail Schema Extensibility
Original schema extends gracefully because POS transaction data was modeled at its most granular level. Premature aggregation limits ability to extend if new dimensions do not apply to higher grain Case study new dimensions:
Frequent Shopper Clerk Time of Day
9
Schema Extensibility
Dimensional models can handle extensions without invalidating existing applications: New dimension attributes – simply add columns to dimension table. If new attribute is only available after point in time, populate old dimension records with something like “Not Available” New dimensions – add foreign field keys to fact table New measured facts – add to fact table. If not at the same grain, then need separate fact table Dimension becoming more granular – create new dimension. May imply more granular fact table, in which case, may have to rebuild the fact table. Addition of a completely new data source involving existing and new dimensions – usually needs new fact table
Resisting Dimension Normalization
Snowflaking = Dimension table normalization
Redundant attributes are removed from the denormalized dimension table and are placed in normalized secondary dimension tables Fully snowflaked schema = 3NF ER diagram
The dimension tables must not be normalized, and should remain as flat tables. Numerous tables and s usually translate into slower query performance. Efforts to normalize any of the tables in a dimensional database solely in order to save disk space are a waste of time. Disk space savings gained by normalizing the dimension tables are typically less than one percent of the total disk space needed for the overall schema. Normalized dimension tables destroy the ability to browse within a dimension or across dimensions (e.g., list package types for each brand in a category). SQL needed becomes too complex. The fact table is naturally normalized.
10
Too Many Dimensions
Too many dimensions increase space requirements for the fact table. A very large number of dimensions typically means that several dimensions are not completely independent and should be combined. A single hierarchy should not be captured in separate dimensions.
Surrogate Keys
Surrogate keys are integers assigned sequentially as needed to populate a dimension. They serve to dimension tables to the fact table. Avoid embedding intelligence in the data warehouse keys. Benefits:
Surrogate keys buffer the DW environment from operational changes. What happens when operations decide to recycle numbers after some period of inactivity? Fine for operational systems, but problematic for DW if it is using numbers as a PK. Can more easily integrate data from multiple operational systems, even if they lack consistent source keys. Performance advantages because small size of surrogate keys leads to smaller fact tables Surrogate keys are used to one of the primary techniques for handling changes in dimension table attributes.
11
Inventory
Overview
Value chain implications Inventory periodic snapshot model, transaction and accumulating snapshot models Semi-additive facts Enhanced inventory facts Data Warehouse bus architecture and matrix Conformed dimensions and facts
12
Value Chain
The value chain identifies the natural, logical flow of an organization’s primary activities. Operational source systems produce transactions or snapshots at each step in the value chain. They generate interesting performance metrics along the way. Each business process generates one or more fact tables.
Inventory Models
Inventory periodic snapshot
Inventory transactions
Inventory level of each product measured daily (or weekly) – represented as a separate row in a fact table As products move through the warehouse, all transactions with impact on inventory levels are recorded
Inventory accumulating snapshot
One fact table row for each product updated as the product moves through the warehouse
13
Inventory Periodic Snapshot Model
Business need
Analysis of daily quantity-on-hand inventory levels by product and store
Business process
Granularity
Dimensions
Retail store inventory Daily inventory by product at each store Date, product, store
Fact
Quantity on hand
Inventory Periodic Snapshot Model Challenge
Very dense (huge) fact table
As opposed to retail sales, which was sparse because only about 10% of products sell each day
60,000 items in 100 stores = 6,000,000 rows If 14 bytes per row: 84MB per day One-year period: 365 x 84MB = 30GB Solution: Reduce snapshot frequencies over time
Last 60 days at daily level Weekly snapshots for historical data For a 3-year period =208 snapshots vs. 3x365=1095 snapshots; reduction by a factor of 5
14
Semiadditive Facts Inventory levels (quantity on hand) are additive across products or stores, but NOT across dates = semi-additive facts Compare to Retail Sales:
once the product is sold it is not counted again
Static level measurements (inventory, balances…) are not additive across date dimension; to aggregate over time use average over number of time periods.
Enhanced Inventory Facts
Number of turns = total quantity sold / daily average quantity on hand Days’ supply = final quantity on hand / average quantity sold Gross profit = value at latest selling price - value at cost Gross margin = gross profit / value at latest selling price GMROI (Gross Margin Return On Inventory)
Need additional facts:
GMROI = number of turns * gross margin measures effectiveness of inventory investment high = lot of turns and more profit, low = low turns and low profit quantity sold, value at cost, value at latest selling price
GMROI is not additive and, therefore, is not stored in enhanced fact table. It is calculated from the constituent columns.
15
Inventory Transactions Model
Record every transaction that affects inventory
Receive product Place product into inspection hold Release product from inspection hold Return product to vendor due to inspection failure Place product in bin Authorize product for sale Pick product from bin Package product for shipment Ship product to customer Receive product from customer Return product to inventory from customer return Remove product from inventory
Inventory Transactions Model - Con’t
Dimensions: date, warehouse, product, vendor, inventory transaction type. The transaction-level fact table contains the most detailed information possible about the inventory. It is useful for measuring the frequency and timing of specific transaction types. It is impractical for broad data warehouse questions that span dates or products. To give a more cumulative view of a process, some form of snapshot table often accompanies a transaction fact table.
16
Inventory Accumulating Snapshot Model
Build one record in the fact table for each product delivery to the warehouse Track disposition of a product until it leaves the warehouse
Receiving Inspection Bin placement Authorization to sell Picking Boxing Shipping
The philosophy of the inventory accumulating snapshot fact table is to provide an updated status of the product shipment as it moves through above milestones. Rarely used in long-running, continuously replenished inventory processes. More on this in chapter 5.
Value Chain Integration
Both business and IT organizations are interested in value chain integration Desire to look across the business to better evaluate overall performance Data marts may correspond to different business processes Need to look consistently at dimensions shared between business processes Need an integrated data warehouse architecture If dimension table attributes in various marts are identical, each mart is queried separately; the results are then outer-ed based on a common dimension attribute = drill across
17
Data Warehouse Bus Architecture
Cannot built the enterprise data warehouse in one step. Building isolated pieces will defeat consistency goal. Need an architected incremental approach data warehouse bus architecture. By defining a standard bus interface for the data warehouse environment, separate data marts can be implemented by different groups at different times. The separate data marts can be plugged together and usefully coexist if they adhere to the standard.
Data Warehouse Bus Architecture – Cont’d
During architecture phase, team designs a master suite of standardized dimensions and facts that have uniform interpretation across the enterprise. Separate data marts are then developed adhering to this architecture.
18
Data Warehouse Bus Matrix
The rows of the bus matrix correspond to business processes data marts Separate rows should be created if:
the sources are different, the processes are different, or a row represents more than what can be tackled in a single implementation iteration.
Creating the DW bus matrix is a very important up-front deliverable of a DW implementation. The DW bus matrix is a hybrid resource: technical design tool, project management tool, and communication tool.
Conformed Dimensions
Conformed dimensions are:
Conformed dimensions have consistent
identical, or strict mathematical subsets of the most granular, detailed dimension. Dimension keys Attribute column names Attribute definitions Attribute values
If two marts have dimensions (e.g., customer, product) that are not conformed, then they cannot be used together
19
Types of Dimension Conformity
Mean same thing
Rolled-up level of granularity
Roll-up dimensions conform to the base-level atomic dimension if they are a strict subset of that atomic dimension. (see Fig. 3.9)
Dimension subset at same level of granularity
Single shared table or physical copy Consistent data content, data interpretation, presentation
At same level but one represents only a subset of rows
Combination of above
Centralized Dimension Authority
The major responsibility of the centralized dimension authority is to:
establish, maintain, and publish the conformed dimensions to all client data marts.
90% of up-front data architecture effort Political challenge
20
Conformed Facts
In general, facts table data is not duplicated explicitly in multiple data marts. If facts live in more than one location, then their definitions and equations must be the same and they must be called the same. If it is impossible to conform a fact exactly, then different names should be given to different interpretations. This will make it less likely that incompatible facts will not be used in a calculation.
Procurement
21
Overview
Value chain reinforcement Blended versus separate transaction schema Slowly changing dimension techniques
Procurement Case Study
Procurement involves a wide range of activities:
Negotiation of contracts Issuing of purchase requisitions & POs Tracking receipts Authorizing payments
Common analytic requirements:
Which materials/products are purchased most frequently? Who supplies them? At what prices? Across the enterprise, are there opportunities to negotiate contracts by consolidating suppliers, single sourcing? Are we purchasing from the preferred vendors? How are vendors performing?
22
Procurement Transactions 1.
Business process to model: Procurement
2. 3.
Grain: 1 row per procurement transaction Key dimensions:
Transaction date Product Vendor (1 row for each vendor) Contract (1 row for each set of negotiated with a vendor) Procurement transaction type
Transaction date and product are conformed dimensions
4.
Transactions: purchase requisitions, purchase orders, shipping notifications, receipts, and payments
Measured facts:
Procured units Transaction amount
Multiple- vs. Single-Transaction Fact Tables: Issues
Business s view purchase orders, shipping notices, warehouse receipts, and vendor payments as separate & unique processes Several procurements transactions come from different source systems
Several transaction types have different dimensionality
Purchasing system: purchase requisitions and purchase orders Warehousing system: shipping notices and warehouse receipts s payable system: vendor payments E.g., discounts applicable to vendor payments but not to other types of transactions
Control numbers such as PO # generated during procurement process are candidates for degenerate dimensions Design decision:
Build separate fact tables for each transaction type, or Build a blended transaction fact table with a transaction type dimension
No simple answer.
23
Multiple- vs. Single-Transaction Fact Tables - Con’t.
Questions to ask:
What are the s’ analytic requirements? How do s usually analyze data? Do they analyze multiple transaction types together or do they usually look at a single transaction type? Are there multiple unique business processes? Yes leaning towards separate tables Are multiple source systems involved? Yes leaning towards separate tables What is the dimensionality of the facts? Do some dimensions apply only to some transaction types? Yes leaning towards separate tables
Solution: multiple transaction fact tables
Multiple Fact Tables
Multiple fact tables for procurement processes Advantages:
Richer, more descriptive dimensions and attributes Simplified staging activities, since operational data exist in separate source systems Loading data into separate fact tables will be less complex than attempting to integrate from multiple sources
Disadvantages:
More time to manage and ister: more tables to load, index, and aggregate
24
Slowly Changing Dimensions
We have assumed dimensions to be independent of time, but some dimensions (other than natural keys) may change slowly with time Need to track change, without full-blown normalized structure; without making every dimension time-dependent For each attribute in our dimension tables, we must specify a strategy to handle change
Techniques for Dealing with Dimension Change – Type 1: Overwrite the Value
Overwrite the old attribute value in the dimension row, replacing it with the current value. the attribute always reflects the most recent assignment The type 1 response is easy to implement, but it does not maintain any history of prior attribute values Question: Is there a business need for retaining the old attribute value?
25
Techniques for Dealing with Dimension Change – Type 2: Add a Dimension Row
Create a new dimension row reflecting the new attribute Two separate surrogate keys – one for old row and one for new
Could also use a “most recent row indicator” to tell us which of the two rows is the current Fact table is again untouched Adding a dimension row is the primary technique for accurately tracking SCD attributes Advantage:
Product Key is used as the primary key instead of the SKU number, which is the natural key and is the same for both rows
New dimension row automatically partitions history in fact table – pre-change fact rows use the pre-change surrogate key No need to revisit preexisting aggregation tables
Disadvantage:
Accelerated dimension table growth Does not allow us to associate the new attribute value with the old fact file or vice versa
Techniques for Dealing with Dimension Change – Type 3: Add a Dimension Column
Add a new dimension column containing the old attribute value (E.g., “Prior Department”) Overwrite the old value with the new More appropriate when there is a need to associate new attribute values with old fact history
E.g., business need to track both old and new values of department attribute both forward and backward
Management can use either value for analysis Allows for observing new and historical fact data by either the new or prior attribute values Used less frequently Inappropriate to track numerous intermediate attribute values
26
Hybrid Slowly Changing Dimension Techniques
Two approaches that combine the basic SCD techniques:
Predictable changes with multiple version overlays Unpredictable changes with singleversion overlay
These approaches provide more flexibility at the cost of greater complexity
Predictable Changes with Multiple Version Overlays
Used in cases of sales organization realignments Example: Over a 5-year period the sales organization is reorganized five times. At first sight, candidate for Type 2 approach (add dimension row), but more complex business requirements. E.g.,
Report each year’s sales using the district map for that year Report each year’s sales using the district map from an arbitrary different year Report an arbitrary span of years’ sales using a single district map from a chosen year.
Type 3 is also inappropriate because >2 district maps Because changes are predictable, an extension of Type 3 is possible Multiple District columns:
Current District; District 2001; District 2002; …
27
Unpredictable Changes with SingleVersion Overlay
Preserve historical accuracy surrounding unpredictable attribute changes while ing the ability to report historical data according to the current values Issue a new dimension row (type 2) to capture the change and add a new dimension column to track the historical value (type 3). Also, overwrite “Current Department” value (Type 1).
More Rapidly Changing Dimensions
Break off the rapidly changing attributes into one or more separate dimensions Two foreign keys in fact table: 1. 2.
Primary dimension table Rapidly changing attribute(s)
28