I just finished the SQLPASS pre-conference from Brian KnightBuilding a Microsoft Data Warehousing Platform, and I’m so pumped to do more BI! Reporting, Data Warehousing, Data Mining – all fun stuff.

In a gist, this is what we learned today (and it’s a lot of acronyms):
– SSIS, ETL
– SSAS
– Cubes, Data Warehouse
– Data Mining
– MDX and DMX (on a very very high level)

It was a lot of details and demos jammed in one day, but I won’t have it any other way, and overall the session was just great! Loved it!

Here are some bits and pieces of information / tips / useful sites / resources (probably mostly for my benefit, so I remember. But if you find it useful, then all the better):

  • 65% of data warehousing projects fail, and one of the reasons for this is lack of communication. Business users want something, but devs deliver something else.
  • Waterfall model for DW projects – hard to react. Agile is better – and Brian recommends 1 month iterations (ie complete deliverable cycle per month)
  • Why data warehouse? Consolidate data, aggregates, reporting, analytics, archived data (ex if you need to keep data for years and years)
  • Business Key = Alternate Key = PK from source
  • Surrogate Keys – insulates data warehouse from source changes
  • In one of Pragmatic Works’ projects, they lowered SSIS loading time from 4.5 hours to 3 mins – all about choosing “better” way. In SSIS most components are in memory and are synchronous.
    • Most SSIS components are synchronous – good for performance
    • Some are asynchronous and partially blocking, not so good – for example UNION ALL and MERGE – slight overhead
    • Some are asynchronous and fully blocking – not good at all – for example sort or aggregate transforms – but sometimes there are reasons to use these
  • In BIDS, devenv.exe -NOSPLASH to prevent splash screen :)
  • In SSIS – Protection Levels for Packages – EncryptSensitiveWithUserKey generally a bad idea because it associates the package with your profile; if you transfer package, might say package is corrupt. Best to Encrypt All With Password; ServerStorage relies on msdb
  • Different types of dimensions
    • Type 0 – Changing Attribute – updates; no history kept
    • Type 1 – Fixed Attribute – cannot change under any circumstance; ignores change
    • Type 2 – helps track changes; biggest disadvantage is additional storage; it creates a new record per update
  • Good practice – smaller packages; easier to change and maintain
  • Slowly Changing Dimension Wizard (SCD) – data types need to be lined up, need to be very careful; easy to break; also when you need to customize later on and relaunch the wizard, previous customizations will be dropped. You should keep documentation on how to recreate your customizations
  • Another problem with SCD – similar to a cursor, goes through row by row; alternative way is to create staging table first. Use T-SQL for what it’s good at, SSIS for what it’s good at. If you can do sorts in T-SQL, do the sorts in T-SQL
  • Inferred members – data that not in source, but is in Fact; for example, if it takes 2 weeks to add a salesperson in HR, but salesperson already has sales from Day 1
  • Better load – “Fast Load”; also configure “Maximum Insert Commit Size”
  • Every member should has an unknown member
  • ALWAYS alias your columns
  • Alternatives to detecting updates – instead of specifying all columns in a multi OR expression, use CHECKSUM or HASHBYTES. Problem with CHECKSUM is it is not guaranteed to be unique. Ex.
    HASHBYTES('sha', ISNULL(col1, '') + ISNULL(col2, ''))

    So in your expression column can just have something similar to: (DT_WSTR, 64)yourhashcol_src != (DT_WSTR,64)yourhashcol_dw

  • SQL Cache Transform – creates cache file, can be reused multiple times; 80% faster in SQL Server 2008; SQL Server 2005 was bad with threading
  • Terminologies – data warehouse (SSAS 2008)
    • Measure Group = Fact Table
    • Attributes = Columns in a Dimension
    • Member = Row in an Attribute
  • Hierarchies are shortcuts for end users; you can ultimately hide actual attributes (AttributeHierarchyVisible = True), and just expose the hierarchies
  • Always reconnect to cube after dimension or attribute changes so changes are apparent
  • Set up attribute relationships ex:
    Date SK -- Date -- Month -- Quarter -- Year 

    Note that uniqueness of quarter is really Year + Quarter.

  • Set up constraints/uniqueness through the “Key Column”. Set up what’s being displayed to user using “Name Column”. You will probably need to explore your dimensions and see which attributes need this.
  • For any time dimensions, add “Time Intelligence”. As a side effect, this automatically adds additional filters like “Yesterday”, “This Week”, “Last Week” etc in Excel
  • “Mushroom Cloud” – Invalid queries, but you are getting invalid results. For example if you drag Product then EmployeeCount. This gives you the EmployeeCount value for Product.
  • In Cube Editor, MeasureGroup, set IgnoreUnrelatedDimension = False helps prevent “mushroom clouds”; result will display blank instead of giving you invalid value
  • Discretization Method – allows you to create dynamic buckets of data
    • Automatic
    • EqualAddress – your just specify number of groups
    • Clusters – groups based on your data
  • HideMemberIf (trying to remember why this option was highlighted in the session)
  • MDX is language for querying the cube/SSAS; used for creating KPIs, calculations
  • Tuple in MDX points to a cell; you can work with six (6) different axes. Good book is MS Press MDX Step by Step
  • When you want to show data, put in SELECT; if you don’t but you want to filter by it, put in WHERE
  • Parenthesis in an MDX helps you represent a tuple
  • SSRS and Excel can generate some MDX
  • Calculations < Calculated Member does not store anything in the cube, just the metadata; actual calculations done on the fly
    • PRO: Only formula is stored in the cube; don’t need extra storage
    • CON: Only formula is stored in the cube; performance will be worse
  • Usage Based Optimization : Go to SSMS < Connect to SSAS < Set CreateQueryLogTable = true; you can also choose to “Review Aggregation Usage”
  • How to lock data in your cube : Go to your dimension then go to Role, and add roles/users
  • Option: Enable Visual Totals – like an implicit WHERE; Off by default, which means users will still see all data even if data is locked down; for example, if user can only see US data, his/her totals should be just for US, but s/he will see US Totals, then a different Grand Total. Enable this to lock down these totals to only the data they can see
  • Deployment wizard in SSAS – creates .asdatabase file that contains metadata required to deploy the cube
  • You dont need to always process the whole cube; you can do incremental processing
  • Partitioning in SSAS – in about 50K records, you will start to feel the slowness
  • SQL Server Standard allows 3 partitions; SQL Server Enterprise allows any number
  • New feature in SQL Server 2008 Enterprise – Proactive Caching
  • In SQL Server 2008, you can use profiler on SSAS to deconstruct MDX queries. In Excel, you can also right click on some columns then check MDX
  • There is an option to do offline OLAP : Excel < Options < OLAP Tools < Offline OLAP. You can get to your data, but actions and drillthrough are off
  • Double hop issue in SSRS and SSAS when you build reports on top of the cube : check Kerberos, SPN
  • You don’t need a cube to do data mining! You can also allow your users to do predictions based on flat tables or views.
  • Sample data mining scenario: used to auto approve insurance claims; can be used for fraud detection; can be integrated with SSIS in conditional tasks
  • For data mining structures, Decision Trees is the best way to start; as far as alogrithms go, you can even create your own, or purchase 3rd party algorithms
  • Users can get/do data mining through Excel
  • Users can do drillthroughs with data mining

Useful Resources / Sites

VN:F [1.9.22_1171]
Rating: 9.0/10 (2 votes cast)
VN:F [1.9.22_1171]
Rating: 0 (from 0 votes)
SQLPASS Pre-Conference Lessons Learned - Building a Microsoft Data Warehousing Platform (Brian Knight), 9.0 out of 10 based on 2 ratings
Be Sociable, Share!