SQLPASS Pre-Conference Lessons Learned – Building a Microsoft Data Warehousing Platform (Brian Knight)

Written by belle
Posted November 2, 2009 at 6:12 pm

I just finished the SQLPASS pre-conference from Brian Knight – Building a Microsoft Data Warehousing Platform, and I’m so pumped to do more BI! Reporting, Data Warehousing, Data Mining – all fun stuff.

In a gist, this is what we learned today (and it’s a lot of acronyms):
– SSIS, ETL
– SSAS
– Cubes, Data Warehouse
– Data Mining
– MDX and DMX (on a very very high level)

It was a lot of details and demos jammed in one day, but I won’t have it any other way, and overall the session was just great! Loved it!

Here are some bits and pieces of information / tips / useful sites / resources (probably mostly for my benefit, so I remember. But if you find it useful, then all the better):

65% of data warehousing projects fail, and one of the reasons for this is lack of communication. Business users want something, but devs deliver something else.
Waterfall model for DW projects – hard to react. Agile is better – and Brian recommends 1 month iterations (ie complete deliverable cycle per month)
Why data warehouse? Consolidate data, aggregates, reporting, analytics, archived data (ex if you need to keep data for years and years)
Business Key = Alternate Key = PK from source
Surrogate Keys – insulates data warehouse from source changes
In one of Pragmatic Works’ projects, they lowered SSIS loading time from 4.5 hours to 3 mins – all about choosing “better” way. In SSIS most components are in memory and are synchronous.
- Most SSIS components are synchronous – good for performance
- Some are asynchronous and partially blocking, not so good – for example UNION ALL and MERGE – slight overhead
- Some are asynchronous and fully blocking – not good at all – for example sort or aggregate transforms – but sometimes there are reasons to use these
In BIDS, devenv.exe -NOSPLASH to prevent splash screen
In SSIS – Protection Levels for Packages – EncryptSensitiveWithUserKey generally a bad idea because it associates the package with your profile; if you transfer package, might say package is corrupt. Best to Encrypt All With Password; ServerStorage relies on msdb
Different types of dimensions
- Type 0 – Changing Attribute – updates; no history kept
- Type 1 – Fixed Attribute – cannot change under any circumstance; ignores change
- Type 2 – helps track changes; biggest disadvantage is additional storage; it creates a new record per update
Good practice – smaller packages; easier to change and maintain
Slowly Changing Dimension Wizard (SCD) – data types need to be lined up, need to be very careful; easy to break; also when you need to customize later on and relaunch the wizard, previous customizations will be dropped. You should keep documentation on how to recreate your customizations
Another problem with SCD – similar to a cursor, goes through row by row; alternative way is to create staging table first. Use T-SQL for what it’s good at, SSIS for what it’s good at. If you can do sorts in T-SQL, do the sorts in T-SQL
Inferred members – data that not in source, but is in Fact; for example, if it takes 2 weeks to add a salesperson in HR, but salesperson already has sales from Day 1
Better load – “Fast Load”; also configure “Maximum Insert Commit Size”
Every member should has an unknown member
ALWAYS alias your columns
Alternatives to detecting updates – instead of specifying all columns in a multi OR expression, use CHECKSUM or HASHBYTES. Problem with CHECKSUM is it is not guaranteed to be unique. Ex.
```
HASHBYTES('sha', ISNULL(col1, '') + ISNULL(col2, ''))
```
So in your expression column can just have something similar to: (DT_WSTR, 64)yourhashcol_src != (DT_WSTR,64)yourhashcol_dw
SQL Cache Transform – creates cache file, can be reused multiple times; 80% faster in SQL Server 2008; SQL Server 2005 was bad with threading
Terminologies – data warehouse (SSAS 2008)
- Measure Group = Fact Table
- Attributes = Columns in a Dimension
- Member = Row in an Attribute
Hierarchies are shortcuts for end users; you can ultimately hide actual attributes (AttributeHierarchyVisible = True), and just expose the hierarchies
Always reconnect to cube after dimension or attribute changes so changes are apparent
Set up attribute relationships ex:
```
Date SK -- Date -- Month -- Quarter -- Year 
```
Note that uniqueness of quarter is really Year + Quarter.
Set up constraints/uniqueness through the “Key Column”. Set up what’s being displayed to user using “Name Column”. You will probably need to explore your dimensions and see which attributes need this.
For any time dimensions, add “Time Intelligence”. As a side effect, this automatically adds additional filters like “Yesterday”, “This Week”, “Last Week” etc in Excel
“Mushroom Cloud” – Invalid queries, but you are getting invalid results. For example if you drag Product then EmployeeCount. This gives you the EmployeeCount value for Product.
In Cube Editor, MeasureGroup, set IgnoreUnrelatedDimension = False helps prevent “mushroom clouds”; result will display blank instead of giving you invalid value
Discretization Method – allows you to create dynamic buckets of data
- Automatic
- EqualAddress – your just specify number of groups
- Clusters – groups based on your data
HideMemberIf (trying to remember why this option was highlighted in the session)
MDX is language for querying the cube/SSAS; used for creating KPIs, calculations
Tuple in MDX points to a cell; you can work with six (6) different axes. Good book is MS Press MDX Step by Step
When you want to show data, put in SELECT; if you don’t but you want to filter by it, put in WHERE
Parenthesis in an MDX helps you represent a tuple
SSRS and Excel can generate some MDX
Calculations < Calculated Member does not store anything in the cube, just the metadata; actual calculations done on the fly
- PRO: Only formula is stored in the cube; don’t need extra storage
- CON: Only formula is stored in the cube; performance will be worse
Usage Based Optimization : Go to SSMS < Connect to SSAS < Set CreateQueryLogTable = true; you can also choose to “Review Aggregation Usage”
How to lock data in your cube : Go to your dimension then go to Role, and add roles/users
Option: Enable Visual Totals – like an implicit WHERE; Off by default, which means users will still see all data even if data is locked down; for example, if user can only see US data, his/her totals should be just for US, but s/he will see US Totals, then a different Grand Total. Enable this to lock down these totals to only the data they can see
Deployment wizard in SSAS – creates .asdatabase file that contains metadata required to deploy the cube
You dont need to always process the whole cube; you can do incremental processing
Partitioning in SSAS – in about 50K records, you will start to feel the slowness
SQL Server Standard allows 3 partitions; SQL Server Enterprise allows any number
New feature in SQL Server 2008 Enterprise – Proactive Caching
In SQL Server 2008, you can use profiler on SSAS to deconstruct MDX queries. In Excel, you can also right click on some columns then check MDX
There is an option to do offline OLAP : Excel < Options < OLAP Tools < Offline OLAP. You can get to your data, but actions and drillthrough are off
Double hop issue in SSRS and SSAS when you build reports on top of the cube : check Kerberos, SPN
You don’t need a cube to do data mining! You can also allow your users to do predictions based on flat tables or views.
Sample data mining scenario: used to auto approve insurance claims; can be used for fraud detection; can be integrated with SSIS in conditional tasks
For data mining structures, Decision Trees is the best way to start; as far as alogrithms go, you can even create your own, or purchase 3rd party algorithms
Users can get/do data mining through Excel
Users can do drillthroughs with data mining

Useful Resources / Sites

pragmaticworks.com
sqlshare.com
ssas-info.com
dtsxchange.com
sqlserverdatamining.com
Marco Russo’s Expert Cube Development with Microsoft SQL Server 2008 Analysis Services
Brian Knight’s 24-Hour Trainer: Microsoft SQL Server 2008 Integration Services (Wrox Programmer to Programmer)
MS Press Microsoft SQL Server 2008 MDX Step by Step (Step By Step (Microsoft))