Archive | March, 2013

Get a grip on the DWH aggregations

24 Mar

 

The problem

If you run availability or performance reports with an aggregation type of daily or hourly, you may find that the reports are empty. This problem is described a lot on the web, and I have written a couple of blog posts myself on how to fix it. But as you know, we use SCOM to monitor stuff, so why not monitor the aggregation processing itself and alert when it falls behind? That’s our mission today…

Analyze

Using SQL Server Management Studio, we can run a query against the data warehouse database to read out the state of the aggregation processing. The query looks like this:

-- Aggregations still to be processed, per aggregation type and dataset
Select AggregationTypeId, Datasetid, (Select SchemaName From StandardDataSet Where Datasetid = StandardDataSetAggregationHistory.Datasetid) AS SchemaName,
  COUNT(*) AS [Count], MIN(AggregationDateTime) AS [First], MAX(AggregationDateTime) AS [Last]
From StandardDataSetAggregationHistory
Where LastAggregationDurationSeconds IS NULL -- no duration recorded yet = not processed
group by AggregationTypeId, Datasetid

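The output shows how many aggregations still have to be processed per dataset and per aggregation type (AggregationTypeId 20 = hourly, 30 = daily), together with the oldest (First) and newest (Last) pending aggregation time.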

[Screenshot: query output]

So in this case there is no problem. But I have seen SCOM environments where the state aggregations were so far behind that they were almost impossible to catch up. This brings up a point: the state aggregations in particular are the tricky ones. If you have many ‘flipping’ monitors, there will be a lot of state changes and therefore a lot of aggregation data to process. This processing takes a lot of SQL CPU power and disk space. In most of the cases I have seen, the root cause of the failure was a lack of free space in the tempdb data files or in the transaction log.
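To make the query’s output concrete without a SCOM data warehouse at hand, here is a minimal Python sketch that runs the same grouping against a tiny in-memory sqlite3 mock of the two tables. It is an illustration only: the dataset IDs and rows are made-up sample data, only the columns the query touches are modelled (the real tables have more), and sqlite stands in for SQL Server.

```python
import sqlite3

# Mock of the two data warehouse tables the query reads. Only the columns the
# query uses are modelled; the dataset IDs and rows are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StandardDataSet (DatasetId TEXT, SchemaName TEXT);
CREATE TABLE StandardDataSetAggregationHistory (
    DatasetId TEXT,
    AggregationTypeId INTEGER,
    AggregationDateTime TEXT,
    LastAggregationDurationSeconds INTEGER
);
""")
conn.executemany("INSERT INTO StandardDataSet VALUES (?, ?)",
                 [("ds-state", "State"), ("ds-perf", "Perf")])
conn.executemany(
    "INSERT INTO StandardDataSetAggregationHistory VALUES (?, ?, ?, ?)",
    [
        ("ds-state", 20, "2013-03-20 10:00", None),  # hourly, not processed yet
        ("ds-state", 20, "2013-03-20 11:00", None),  # hourly, not processed yet
        ("ds-state", 30, "2013-03-20 00:00", None),  # daily, not processed yet
        ("ds-perf", 20, "2013-03-20 10:00", 12),     # hourly, already processed
    ],
)

AGGREGATION_TYPES = {20: "Hourly", 30: "Daily"}


def pending_aggregations(conn):
    """Same grouping as the SQL query: a NULL duration means still pending."""
    return conn.execute("""
        SELECT AggregationTypeId, DatasetId,
               (SELECT SchemaName FROM StandardDataSet s
                WHERE s.DatasetId = h.DatasetId) AS SchemaName,
               COUNT(*), MIN(AggregationDateTime), MAX(AggregationDateTime)
        FROM StandardDataSetAggregationHistory h
        WHERE LastAggregationDurationSeconds IS NULL
        GROUP BY AggregationTypeId, DatasetId
        ORDER BY AggregationTypeId, DatasetId
    """).fetchall()


for type_id, _, schema, count, first, last in pending_aggregations(conn):
    print(f"{AGGREGATION_TYPES[type_id]} {schema}: {count} pending ({first} to {last})")
```

The processed row for the "Perf" dataset does not show up at all, which is exactly what you want to see for a healthy dataset.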

Solution

In SCOM every dataset that is aggregated has its own target object. This target is named ‘Standard Data Set’. You can find these objects here:

[Screenshot: the Standard Data Set objects in the SCOM console]

If you compare the screenshot with your own SCOM console, you will notice that your datasets don’t show the green healthy state… That’s why you are reading this post, so let’s add that state.

I wanted to give every dataset a health state based on how many aggregations it still has to process. So we build a monitor that runs the query above for every dataset and changes the health state when a threshold is hit. We will also add a rule that collects this ‘aggregations behind’ count so it can be shown in a trend graph.

I used VSAE (Visual Studio Authoring Extensions) for this, and I will share only the idea, not the code. Why not? I believe you have to know what you are doing, and copying and pasting teaches you nothing if you have never built it yourself from start to finish.

The real work

Open a new VSAE project and add an empty MP fragment and a PowerShell fragment.

[Screenshot: the VSAE project with the MP fragment and PowerShell fragment]

Then create a data source that reads the aggregation count. This is done with PowerShell and the SQL Server snap-in.

[Screenshot: the PowerShell script]

The PowerShell script takes the GUID of the dataset (a property of the target) as input and returns a property bag with the pending aggregation counts (daily and hourly). I made the script somewhat smarter by having it read the data warehouse location from the registry.
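The author’s PowerShell data source is not published, so here is a hypothetical Python sketch of the same contract: dataset ID in, a property-bag-like dictionary with hourly and daily pending counts out. The function and constant names are made up for illustration, the registry lookup is left out (the caller passes in an open database connection), and sqlite3 is used only for the demo.

```python
import sqlite3

# Hypothetical sketch: BACKLOG_QUERY and aggregation_backlog() are names made up
# for illustration. "?" is sqlite3's placeholder style; a SQL Server client
# library may use a different parameter style.
BACKLOG_QUERY = """
    SELECT AggregationTypeId, COUNT(*)
    FROM StandardDataSetAggregationHistory
    WHERE DatasetId = ? AND LastAggregationDurationSeconds IS NULL
    GROUP BY AggregationTypeId
"""

HOURLY, DAILY = 20, 30  # AggregationTypeId values from the query output


def aggregation_backlog(conn, dataset_id):
    """Return a property-bag-like dict with the pending hourly and daily counts."""
    counts = dict(conn.execute(BACKLOG_QUERY, (dataset_id,)).fetchall())
    # An aggregation type with nothing pending returns no row, so default to 0.
    return {"Hourly": counts.get(HOURLY, 0), "Daily": counts.get(DAILY, 0)}


if __name__ == "__main__":
    demo = sqlite3.connect(":memory:")
    demo.execute("""CREATE TABLE StandardDataSetAggregationHistory (
        DatasetId TEXT, AggregationTypeId INTEGER,
        LastAggregationDurationSeconds INTEGER)""")
    demo.executemany(
        "INSERT INTO StandardDataSetAggregationHistory VALUES (?, ?, ?)",
        [("ds-1", 20, None), ("ds-1", 20, None), ("ds-1", 30, 5), ("ds-2", 30, None)],
    )
    print(aggregation_backlog(demo, "ds-1"))  # {'Hourly': 2, 'Daily': 0}
```

Filtering on one dataset per call matches the design in the post: the monitor runs once per Standard Data Set target, so each instance only needs its own counts.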

Now use this data source in a monitor module type to create a three-state monitor. And since the data source is a separate module, we can also reuse it in a rule that collects the ‘aggregations behind’ count for the trend graph. Yes, I know, this is easier said than done…

Below is a snapshot of the data source module:

[Screenshot: the data source module]

And below is a snapshot of the monitor module type:

[Screenshot: the monitor module type]

And the monitor itself. Create one for hourly (not shown) and one for daily.

[Screenshot: the daily monitor]
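The decision a three-state monitor makes for one counter can be sketched in a few lines of Python. In the MP this logic lives in the monitor module type and the monitor’s configuration; the sketch below only shows the idea, and the threshold values are made-up examples, as is the choice to treat the thresholds as inclusive.

```python
# Hypothetical default thresholds; tune them to your own environment. The
# hourly and daily monitors can each be given their own pair of values.
def health_state(pending, warning=10, critical=50):
    """Map a pending-aggregation count to a three-state health value."""
    if warning >= critical:
        raise ValueError("warning threshold must be below critical threshold")
    if pending >= critical:
        return "Critical"
    if pending >= warning:
        return "Warning"
    return "Healthy"


bag = {"Hourly": 75, "Daily": 2}  # example property bag values
print(health_state(bag["Hourly"]))                        # Critical
print(health_state(bag["Daily"], warning=2, critical=5))  # Warning
```

Separate thresholds per monitor make sense because one pending daily aggregation represents far more data than one pending hourly aggregation.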

Finally, for trending, we have to create a collection rule.

[Screenshot: the collection rule]

Notice that the monitor and the collection rule both target “Microsoft.SystemCenter.DataWarehouse.DataSet” (display name “Standard Data Set”), and notice the Run As profile that is used.
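SCOM performance data is identified by an object name, a counter name and an instance name, plus the value, so the collection rule has to map each property-bag value onto those fields. The Python sketch below only models that mapping; in the MP itself it is configuration on the rule. The object name "DWH Aggregation", the counter names and using the dataset schema name as instance are hypothetical choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerfSample:
    """One performance data point, as shown in a SCOM performance view."""
    object_name: str
    counter_name: str
    instance_name: str
    value: float


def to_perf_samples(bag, instance_name, object_name="DWH Aggregation"):
    """Turn one property bag into one sample per aggregation type.

    The object and counter names are hypothetical; use whatever names you
    want to see in the performance view.
    """
    return [
        PerfSample(object_name, f"{agg_type} aggregations behind",
                   instance_name, float(count))
        for agg_type, count in sorted(bag.items())
    ]


for sample in to_perf_samples({"Hourly": 2, "Daily": 0}, instance_name="State"):
    print(sample)
```

Using one counter per aggregation type keeps the hourly and daily trends as separate lines in the same graph.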

The result

Once you have built and deployed the MP, you will see two extra monitors on the Standard Data Set targets shown above. Open Health Explorer to check that all is OK.

[Screenshot: Health Explorer for a Standard Data Set]

The dataset above has had a problem. For details, open the performance view and you will see the aggregation trend.

[Screenshot: the aggregation trend in the performance view]

In this case the hourly state aggregations were far behind, so I followed one of my own blog posts to solve it: I manually ran the state aggregation process in a loop to speed up the processing.
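The exact procedure is in that earlier post. As a hedged illustration of the idea only: keep running the aggregation step until the backlog for the dataset is gone, with an upper bound on the number of runs. On the data warehouse, this step is commonly the StandardDatasetMaintenance stored procedure called for the dataset in question; treat that as a pointer to verify against the earlier post, not a recipe. In the Python sketch below, both callables are hypothetical stand-ins for the real database calls.

```python
def catch_up(run_maintenance, pending_count, max_runs=100):
    """Run the maintenance step repeatedly until nothing is pending.

    run_maintenance and pending_count are hypothetical callables that would
    wrap the data warehouse calls. max_runs stops a loop that never catches up.
    Returns the number of maintenance runs performed.
    """
    runs = 0
    while runs < max_runs and pending_count() > 0:
        run_maintenance()
        runs += 1
    return runs


# Fake backlog for illustration: each run processes two aggregations.
backlog = {"pending": 5}


def fake_maintenance():
    backlog["pending"] = max(0, backlog["pending"] - 2)


print(catch_up(fake_maintenance, lambda: backlog["pending"]))  # 3
```

The `max_runs` cap matters in practice: if tempdb or the transaction log is full, the backlog will not shrink and an unbounded loop would just keep hammering the SQL server.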

The End.

Yes, I know this post is a bit ‘cloudy’ and not something you can download and import. But I hope that by sharing the idea I have triggered you to try it yourself.

Happy SCOMMING!

Michel Kamp

Aggrr, Missis SCOM, sometimes you drive me nuts!

12 Mar

Here we are again. I am sure you remember a situation where you were certain everything was configured correctly, you had even restarted the server, and it still didn’t work… NO??

Okay, so this time it was my turn…

The problem:

I was trying to discover a new network device, but network discovery wouldn’t find it. Why? I checked everything, and it wasn’t connectivity or security: normally, when those issues apply, you see the device as a pending, failed discovery attempt. So what could be wrong?

Analyze:

1) First I checked the discovery rule. The network devices are specified explicitly.

[Screenshot: the discovery rule in explicit mode]

I made sure the device was in the explicit discovery list.

[Screenshot: the device in the explicit discovery list]

2) I started the discovery and checked the progress events.

[Screenshot: the discovery progress events]

Hey, what’s that? One device excluded?? Strange, because exclude filters can only be applied to recursive discovery…

So it had to be something in the discovery rule…

The solution:

So this ‘excluded’ event caught my attention.

1) Open the discovery rule and change the discovery method to Recursive discovery.

[Screenshot: discovery method set to Recursive discovery]

2) Now select the Exclude Filters tab

[Screenshot: the Exclude Filters tab]

And what do we see here? The network device I was trying to discover.

3) Now remove this exclude filter entry and save the discovery template WITHOUT RUNNING IT.

[Screenshot: the exclude filter entry removed]

4) Reopen the discovery rule, change the discovery mode back to ‘explicit’, and save it again.

[Screenshot: discovery mode set back to explicit]

5) Now run the discovery.

6) And what do we see… the device is discovered successfully!

[Screenshot: the device discovered successfully]

Conclusion

It looks like this rule was initially created in recursive discovery mode. After a while, the operator decided to remove a device from the discovered device inventory. When you do that, an exclude filter entry is created automatically. Later the operator changed the discovery mode to explicit (don’t ask why). The exclude filter stayed active even though it is not visible in explicit mode, and that caused the problem I faced…
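The trap can be summarised in a toy model: the exclude list belongs to the rule, survives a switch of discovery mode, and still filters the explicit device list. The Python sketch below only models the behaviour observed in this post; it is not how SCOM stores discovery rules, and the class, the method names and the 10.0.0.5 address are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DiscoveryRule:
    """Hypothetical model of a network discovery rule (not a SCOM API)."""
    mode: str = "recursive"
    explicit_devices: list = field(default_factory=list)
    exclude_filters: set = field(default_factory=set)

    def remove_from_inventory(self, device):
        # Removing a discovered device creates an exclude filter entry.
        self.exclude_filters.add(device)

    def set_mode(self, mode):
        # Switching the discovery mode leaves the exclude filters untouched.
        self.mode = mode

    def explicit_targets(self):
        # Explicit devices that are also excluded are skipped, which shows up
        # as an "excluded" device in the discovery progress events.
        return [d for d in self.explicit_devices if d not in self.exclude_filters]


rule = DiscoveryRule()
rule.remove_from_inventory("10.0.0.5")
rule.set_mode("explicit")
rule.explicit_devices.append("10.0.0.5")
print(rule.explicit_targets())  # []
rule.exclude_filters.discard("10.0.0.5")  # the fix: remove the exclude entry
print(rule.explicit_targets())  # ['10.0.0.5']
```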

The end

I hope I saved you a lot of time spent almost hating Missis SCOM ;-)

Happy SCOMMING

Michel Kamp MVP System Center Operations Manager