SCOM DWH aggregations data loss: Tips and Tricks

10 Apr

 

This ‘short’ post is about the DWH aggregations again. It contains some tips on how not to lose any data.

!!! Everything I suggest and do here is at your own risk and totally unsupported unless you have instructions from Microsoft support. !!!

The problem:

You run a performance report over 1 month and notice that you are missing some days of aggregated hourly/daily data. As far as you knew, you were not having any trouble... until now.

image

Analyze:

First we check whether there are any aggregations that are not completed yet.
Run the SQL query below on the DWH database:

-- checking the aggregations that still have to be processed

SELECT     COUNT(*) AS Aggr_behind, Dataset.DatasetDefaultName
FROM         StandardDatasetAggregationHistory INNER JOIN
                      Dataset ON StandardDatasetAggregationHistory.DatasetId = Dataset.DatasetId
WHERE     (StandardDatasetAggregationHistory.DirtyInd = 1)
GROUP BY Dataset.DatasetDefaultName

The result could look like the one shown below. The Aggr_behind number is the number of aggregations that are not completed yet.

image
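The count alone does not tell you how far back the backlog goes. A minimal sketch of an extended check follows; I am assuming here that StandardDatasetAggregationHistory also has the AggregationTypeId and AggregationDateTime columns (they are not used in the query above, so verify this on your own DWH first).

```sql
-- Sketch: per dataset and aggregation type, how many aggregations are
-- outstanding and which time range they cover.
-- Assumption: AggregationTypeId and AggregationDateTime exist on the history table.
SELECT  ds.DatasetDefaultName,
        h.AggregationTypeId,
        COUNT(*)                   AS Aggr_behind,
        MIN(h.AggregationDateTime) AS Oldest_outstanding,
        MAX(h.AggregationDateTime) AS Newest_outstanding
FROM    StandardDatasetAggregationHistory AS h
        INNER JOIN Dataset AS ds ON h.DatasetId = ds.DatasetId
WHERE   h.DirtyInd = 1
GROUP BY ds.DatasetDefaultName, h.AggregationTypeId
ORDER BY ds.DatasetDefaultName, h.AggregationTypeId;
```

If Oldest_outstanding is many days in the past, the backlog is not just the normal one or two intervals that are still being filled.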

With a number this high we have a serious problem. In that case, just follow my previous blog post on how to solve this. It was written for missing State data but can also be applied to performance data. Look at the FIX: part to kick off the aggregation processing.

(https://michelkamp.wordpress.com/2012/03/23/dude-where-my-availability-report-data-from-the-scom-dwh/)

But if you see a number around 2 for the performance data set (see picture below), it means that 2 aggregations still have to be processed. This is what we want to see, so everything seems okay. But why are we missing the date period 01-02-2012 till 20-01-2012?

image

We could have 2 scenarios here:

1. The data was simply not provided to the DWH.

2. The data was provided but due to stage/aggregation problems not processed.

For case 1 we have to look at the agents to see what went wrong. That is out of scope for this post.

For case 2 we have some solutions, see below.
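To get a first idea of which case you are in, you could compare the days that have aggregated rows with the days that still have RAW rows. This is only a rough sketch: I am assuming the standard Perf.vPerfHourly and Perf.vPerfRaw views with a DateTime column, and the 31-day window is just an example value. The RAW query can be heavy on a big DWH, so run it off-hours.

```sql
-- Sketch: hourly aggregated rows per day over the last 31 days.
-- Assumption: the view Perf.vPerfHourly exists and has a [DateTime] column.
SELECT  CAST([DateTime] AS date) AS Sample_day,
        COUNT(*)                 AS Hourly_rows
FROM    Perf.vPerfHourly
WHERE   [DateTime] >= DATEADD(day, -31, GETUTCDATE())
GROUP BY CAST([DateTime] AS date)
ORDER BY Sample_day;

-- Sketch: RAW rows per day over the same window.
-- Assumption: the view Perf.vPerfRaw exists and has a [DateTime] column.
SELECT  CAST([DateTime] AS date) AS Sample_day,
        COUNT(*)                 AS Raw_rows
FROM    Perf.vPerfRaw
WHERE   [DateTime] >= DATEADD(day, -31, GETUTCDATE())
GROUP BY CAST([DateTime] AS date)
ORDER BY Sample_day;
```

A day that has RAW rows but no hourly rows points to case 2. A day that is missing from both results does not prove case 1, because the RAW rows may simply have been groomed away already.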

Case 2

First let me explain how the aggregation process works from a helicopter view. I am sure I am missing some details (so feel free to add to or correct me on this!)

image

Looking at the picture above:

1. The SCOM Management Server DWH writer data source writes the performance data to a RAW staging table.

2. The DWH staging process processes this data by copying the RAW rows into a process table. Sometimes the table is simply renamed and recreated, if the new RAW data count is less than a configured number. If you have a big number of new RAW rows, the rows are copied in batches to minimize the transaction log impact. Finally the RAW data is copied into the RAW data partition tables.

3. The Standard Maintenance process generates the aggregation sets that have to be processed in step 4. During this process, aggregation process rows are created in the aggregation history table with a Dirty Indication (DirtyInd) of 1.

4. The RAW staged partition data is processed into aggregated hourly and daily data. When an aggregation is complete, the Dirty Indication for that aggregation is set to 0.

5. The stored procedure reads the just-aggregated data.

6. The data received from step 5 is used to generate the report for the end user.
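Steps 3 and 4 leave a trace that you can follow in the aggregation history table. A minimal sketch, assuming the history table has the AggregationDateTime and AggregationTypeId columns and that the performance dataset's DatasetDefaultName starts with 'Performance' (check the names returned by the query in the Analyze part):

```sql
-- Sketch: the 48 most recent aggregation intervals of the performance
-- dataset and whether they are still dirty (1) or completed (0).
-- Assumptions: column names AggregationDateTime / AggregationTypeId and
-- the 'Performance%' name filter.
SELECT TOP (48)
        h.AggregationDateTime,
        h.AggregationTypeId,
        h.DirtyInd
FROM    StandardDatasetAggregationHistory AS h
        INNER JOIN Dataset AS ds ON h.DatasetId = ds.DatasetId
WHERE   ds.DatasetDefaultName LIKE 'Performance%'
ORDER BY h.AggregationDateTime DESC;
```

In a healthy DWH I would expect only the newest intervals to have DirtyInd = 1.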

 

So now that we know the data flow, what could be wrong?

The answer is to be found in the grooming process. Yes, the grooming process. The data in the RAW partition tables from step 2 has a grooming/retention period, which is 10 days by default. So if your aggregation is broken for more than 10 days (and you didn’t detect this) you will LOSE your RAW data, and as a result the aggregation process will have nothing to aggregate. So no performance data, resulting in our root problem: the date gap in the report.
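You can look up the retention that is currently configured before changing anything. A minimal sketch, assuming StandardDatasetAggregation can be joined to Dataset on DatasetId in the same way as the history table:

```sql
-- Sketch: current retention (MaxDataAgeDays) per dataset and aggregation type.
-- Assumption: StandardDatasetAggregation has a DatasetId column.
SELECT  ds.DatasetDefaultName,
        a.AggregationTypeId,
        a.MaxDataAgeDays,
        a.GroomStoredProcedureName
FROM    StandardDatasetAggregation AS a
        INNER JOIN Dataset AS ds ON a.DatasetId = ds.DatasetId
ORDER BY ds.DatasetDefaultName, a.AggregationTypeId;
```

The row with AggregationTypeId = 0 for the performance dataset is the RAW retention this post is about.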

Solution:

Pfff... nice, all this theory, but how do I fix this?

Simply by: 😉

1. Manually inserting the missing RAW data and kicking off the aggregation process. I will blog about how to do this later (probably after MMS).

2. Preventing this from happening again.

To prevent this you can increase the retention/grooming period from 10 days to, let’s say, 30 days. Check that you have enough DB space first. Execute the query below:

UPDATE StandardDatasetAggregation
SET MaxDataAgeDays = 30
WHERE GroomStoredProcedureName = 'PerformanceGroom' AND AggregationTypeId = 0

Now you will have 30 days to solve your aggregation problems. Of course this is only a workaround that gives you more air to breathe while you fix your aggregation problems.
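If you want to be a bit more careful with the update, you can wrap it in a transaction and only commit when exactly one row was changed. This is a sketch of that idea, not an official procedure; the expectation of exactly one matching row is my assumption.

```sql
BEGIN TRANSACTION;

-- Show the current value before changing it.
SELECT  AggregationTypeId, MaxDataAgeDays, GroomStoredProcedureName
FROM    StandardDatasetAggregation
WHERE   GroomStoredProcedureName = 'PerformanceGroom' AND AggregationTypeId = 0;

UPDATE  StandardDatasetAggregation
SET     MaxDataAgeDays = 30
WHERE   GroomStoredProcedureName = 'PerformanceGroom' AND AggregationTypeId = 0;

-- Assumption: exactly one row should match. @@ROWCOUNT must be read
-- directly after the UPDATE, so nothing may be placed in between.
IF @@ROWCOUNT = 1
    COMMIT TRANSACTION;
ELSE
    ROLLBACK TRANSACTION;
```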

The best way is to monitor this proactively. Since we can monitor everything, we create a monitor that checks the outstanding aggregations every 60 minutes and alerts when a threshold is hit. You can use the query from the Analyze part of this post to do this. I would set the threshold to 10, so you will be notified when your aggregation process is 10 aggregation sets behind (about 10 hours). If I have time before I go to MMS I will blog about this extra monitor, because you can’t build this one with the normal DB watcher. And of course I will use the VS Authoring extensions for it.
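The check behind such a monitor could look like the sketch below. It is the query from the Analyze part with the threshold added; the variable name @Threshold is just something I made up for this example.

```sql
-- Sketch: return only the datasets that are more than @Threshold
-- aggregations behind. No rows returned means healthy.
DECLARE @Threshold int;
SET @Threshold = 10;

SELECT  ds.DatasetDefaultName,
        COUNT(*) AS Aggr_behind
FROM    StandardDatasetAggregationHistory AS h
        INNER JOIN Dataset AS ds ON h.DatasetId = ds.DatasetId
WHERE   h.DirtyInd = 1
GROUP BY ds.DatasetDefaultName
HAVING  COUNT(*) > @Threshold;
```

The monitor would then alert whenever this query returns one or more rows.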

Happy scomming.

Michel

3 Responses to “SCOM DWH aggregations data loose Tip and Tricks”

  1. Alex August 3, 2012 at 15:02 #

    Hello Michel,

    Thank you very much for the article! I am using it in the troubleshooting. Would you be so kind to elaborate on the following items?

    1. Where to search for the information which you mentioned in your recommendations: “The answer we have to search at the grooming process (?) yes, the grooming process. The data in the RAW partitions tables from step 2 has a grooming/retention period. This period is standard 10 days. So if your aggregation is broken for more than 10 days (and you didn’t detected this) you will LOOSE your RAW data and as a result the aggregation process will have nothing to aggregate. So no performance data, resulting in our root problem the date gap in the report.”

    2. Manually insert the missing RAW data and kickoff the aggregation process. I will blog post on how to do this later. (would be after the MMS) Could you please share the link to the post?

    Thank you very much for your time and help!

  2. Alex August 3, 2012 at 16:12 #

    Michel, how would you recommend the SCOM agent troubleshooting?
