Archive | April, 2012

SCOM DWH aggregation data loss Tips and Tricks

10 Apr

 

This ‘short’ post will be about the DWH aggregations again. It contains some tips on how not to lose any data.

!!! Everything I suggest and do here is at your own risk and totally unsupported unless you were instructed to do so by Microsoft support. !!!

The problem:

You run a performance report over one month and notice that some days of aggregated hourly/daily data are missing. As far as you knew you weren’t having any trouble… till now.

[image]

Analyze:

First we are going to check whether there are any aggregations that have not completed yet.
Run the SQL query below against the DWH database:

-- check the aggregations still to be processed

SELECT COUNT(*) AS Aggr_behind, Dataset.DatasetDefaultName
FROM StandardDatasetAggregationHistory
INNER JOIN Dataset ON StandardDatasetAggregationHistory.DatasetId = Dataset.DatasetId
WHERE StandardDatasetAggregationHistory.DirtyInd = 1
GROUP BY Dataset.DatasetDefaultName

The result could look like the output below. The Aggr_behind number shows how many aggregations have not been completed yet.

[image]

In this case, with such a high number, we have a serious problem. You can then simply follow my previous blog post on how to solve this; it is about missing state data but can also be applied to performance data. Look at the FIX: part to kick off the aggregation processing.

(https://michelkamp.wordpress.com/2012/03/23/dude-where-my-availability-report-data-from-the-scom-dwh/)
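
Roughly speaking, kicking off the aggregation processing means running the standard dataset maintenance stored procedure for the affected dataset so it works through the backlog. The sketch below is only an illustration of that idea; the dataset name filter is an assumption, so check the DatasetDefaultName values in your own DWH first.

-- sketch: kick off maintenance (and thus aggregation processing) for the performance dataset
-- the LIKE filter is an assumption; verify the exact DatasetDefaultName in your DWH
DECLARE @DatasetId uniqueidentifier
SELECT @DatasetId = DatasetId FROM Dataset WHERE DatasetDefaultName LIKE 'Performance%'
EXEC StandardDatasetMaintenance @DatasetId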

But if you see a number around 2 for the performance data set (see picture below), it means only 2 aggregations still have to be processed. This is what we want to see, so everything seems okay. But why are we missing the date period 01-02-2012 till 20-01-2012?

[image]

We could have 2 scenarios here:

1. The data was simply not provided to the DWH.

2. The data was provided but, due to staging/aggregation problems, never processed.

For case 1 we have to look at the agents to see what went wrong. That is out of scope for this post.

For case 2 we have some solutions; see below.

Case 2

First let me explain how the aggregation process works from a helicopter view. I am sure I am missing some details (so feel free to add to / correct me on this!)

[image]

Looking at the picture above: (click on it to expand)

1. The SCOM management server DWH writer data source writes the performance data to a RAW staging table.

2. The DWH staging process handles this data by copying the RAW rows into a process table. If the number of new RAW rows is below a configured threshold the table is simply renamed and recreated; if you have a large number of new RAW rows the rows are copied in batches to minimize the transaction log impact. Finally the RAW data is copied into the RAW data partition tables.

3. The standard maintenance process generates the aggregation sets that have to be processed in step 4. During this process, aggregation rows are created in the aggregation history table with a dirty indication (DirtyInd) of 1 (see the query sketch after this list).

4. The RAW staged partition data is processed into aggregated hourly and daily data. When an aggregation is complete, its dirty indication is set to 0.

5. The report stored procedure reads the just-aggregated data.

6. The data received from step 5 is used to generate the report for the end user.
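
If you want to see exactly which intervals are still waiting for step 4, you can query the aggregation history directly. The sketch below assumes the standard AggregationDateTime and AggregationTypeId columns (20 = hourly, 30 = daily) in StandardDatasetAggregationHistory, so treat it as a starting point.

-- sketch: list the oldest dirty interval and backlog per dataset and aggregation type
SELECT Dataset.DatasetDefaultName,
       StandardDatasetAggregationHistory.AggregationTypeId,
       MIN(StandardDatasetAggregationHistory.AggregationDateTime) AS OldestDirtyInterval,
       COUNT(*) AS DirtyIntervals
FROM StandardDatasetAggregationHistory
INNER JOIN Dataset ON StandardDatasetAggregationHistory.DatasetId = Dataset.DatasetId
WHERE StandardDatasetAggregationHistory.DirtyInd = 1
GROUP BY Dataset.DatasetDefaultName, StandardDatasetAggregationHistory.AggregationTypeId
ORDER BY OldestDirtyInterval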

 

So, now that we know the data flow, what could be wrong?

The answer lies in the grooming process. The grooming process? Yes, the grooming process. The data in the RAW partition tables from step 2 has a grooming/retention period. This period is 10 days by default. So if your aggregation is broken for more than 10 days (and you didn’t detect it) you will LOSE your RAW data, and as a result the aggregation process will have nothing to aggregate. So no performance data, resulting in our root problem: the date gap in the report.
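
You can check what retention is currently configured with a query like the one below. It is a sketch that assumes the same StandardDatasetAggregation columns as the update statement further down; AggregationTypeId 0 is the RAW data.

-- sketch: show the current grooming/retention settings per dataset and aggregation type
SELECT Dataset.DatasetDefaultName,
       StandardDatasetAggregation.AggregationTypeId,
       StandardDatasetAggregation.MaxDataAgeDays,
       StandardDatasetAggregation.GroomStoredProcedureName
FROM StandardDatasetAggregation
INNER JOIN Dataset ON StandardDatasetAggregation.DatasetId = Dataset.DatasetId
ORDER BY Dataset.DatasetDefaultName, StandardDatasetAggregation.AggregationTypeId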

Solution:

Pfff… all nice theory, but how do I fix this?

Simply by: 😉

1. Manually inserting the missing RAW data and kicking off the aggregation process. I will write a blog post on how to do this later (it will be after MMS).

2. Preventing this from happening again.

To prevent this you can increase the retention/grooming period from 10 days to, let’s say, 30 days. Check whether you have enough DB space first. Execute the query below:

update StandardDatasetAggregation
set MaxDataAgeDays = 30
where GroomStoredProcedureName = 'PerformanceGroom' and AggregationTypeId = 0

Now you will have 30 days to solve your aggregation problems. Of course this is only a workaround to give you some breathing room while you fix them.

The best way is to monitor this proactively. Since we can monitor everything, we create a monitor that checks the outstanding aggregations every 60 minutes and alerts when a threshold is hit. You can use the query from the analyze part of this post to do this. I would set the threshold to 10, so you will be notified when your aggregation process has a backlog of 10 datasets (about 10 hours). If I have time before I leave for MMS I will blog about this extra monitor, because you can’t build this one with the normal DB watcher. And of course I will use the VS Authoring Extensions for this.
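
As a rough idea of what the data source for such a monitor could run, the sketch below boils the analyze query down to a single value (the largest backlog over all datasets) that can be compared against the threshold; the exact shape will depend on how you build the monitor.

-- sketch: single value for a threshold monitor, the largest aggregation backlog over all datasets
SELECT MAX(Backlog.Aggr_behind) AS MaxAggrBehind
FROM (
    SELECT COUNT(*) AS Aggr_behind
    FROM StandardDatasetAggregationHistory
    WHERE DirtyInd = 1
    GROUP BY DatasetId
) AS Backlog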

Happy scomming.

Michel

SCOM DWH aggregation process takes all your disk space

2 Apr

Short post on a very strange issue I solved this week. The DWH database was taking all the log space it could get. In some cases, when you have a lot of aggregations waiting due to a state change burst, this would be normal: you add some extra log space to the DWH and remove it after the processing has succeeded. But this time it was hungry and took 80 GB+ of log space, so my alarm bells went off; this is for sure not normal.

Analyze:

I opened SQL Server Management Studio and executed the query below. It shows the log space usage per database.

-- transaction log usage info
DBCC SQLPERF(logspace)

The output showed that the log of the DWH database was 100% used. Okay, no real news; we knew this already.

Now we must find out why.

We check whether the database is in simple recovery mode. This is the default setting for the SCOM DWH, and yes, it is configured correctly. Since a simple-mode DB releases its log pages when a transaction is completed (committed or rolled back), it must be that a transaction isn’t completed or never will be.
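
If you prefer to verify the recovery model with a query instead of the database properties dialog, something like the sketch below works on SQL Server 2005 and later; OperationsManagerDW is the default DWH database name.

-- sketch: check the recovery model of the DWH database
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = 'OperationsManagerDW'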

Let’s look up these open transaction(s).

Execute the SQL below to see the open transactions:

-- virtual log file (VLF) info; a status of 2 means the VLF is still in use and cannot be truncated
DBCC LOGINFO

It returned over 60K rows with status 2. So now we are sure it is caused by an open transaction. Let’s find the guilty process causing this never-closed transaction.

Execute the SQL below to find the process:

-- shows the oldest active transaction and its SPID; if no SPID is shown but an old replicated LSN is, look at replication
DBCC OPENTRAN

The output was really not what I expected:

Transaction information for database 'OperationsManagerDW'.
Replicated Transaction Information:
Oldest distributed LSN : (0:0:0)
Oldest non-distributed LSN : (4881975:876:1)

Hmm, strange; we should have seen this form of result, showing the SPID of the process causing the transaction:

Transaction information for database 'OperationsManagerDW'.

Oldest active transaction:
    SPID (server process ID): 76
    UID (user ID) : -1
    Name          : user_transaction
    LSN           : (250110:77:2)
    Start time    : Apr  2 2012 11:38:30:477AM
    SID           : 0x010500000000000515000000f848abc31e7f1b2ca574174c53040000
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Solution

Attentive readers will have noticed the words “Replicated Transaction Information” in the results. Hmm, but I don’t use replication!! Even when I used SQL Server Management Studio to check, it did not show any replication configuration.

After some Bing work I found the system stored procedure that force-removes any replication configuration. So I executed the SQL below:

-- remove all replication; after this the transaction log can be truncated
sp_removedbreplication @dbname = 'OperationsManagerDW'

And what do you think? Yes, problem solved. All log space was returned!!

Check it by running DBCC SQLPERF(logspace) again and you will see the log usage is back to almost 0%.

 

Happy SCOMMING,

Michel Kamp [MVP] 😉

Yheeaaaa it’s party time: I have the MVP 2012 Award !!!!

1 Apr

For the last 11 years I have been working with MOM 2000, MOM 2005, SCOM 2007 and SCOM 2012. In the beginning I blogged a lot on MOM 2000 and MOM 2005. After changing companies I had to conform to a no-blog policy. Meanwhile I contributed a lot to the SCOM 2007/R2 TAP and SCOM 2012 TAP programs and worked the last 5 years full time on/with the SCOM product. Six months ago the company dropped the no-blog policy and I started posting again. And see, my work has been noticed! 🙂

Thanks to everyone who appreciates my contributions to the SCOM community. If you are at MMS I will buy you a round of beer!

19:53 CET

Dear Michel Kamp,
Congratulations! We are pleased to present you with the 2012 Microsoft® MVP Award! This award is given to exceptional technical community leaders who actively share their high quality, real world expertise with others. We appreciate your outstanding contributions in System Center Cloud and Datacenter Management technical communities during the past year.
Also in this email:

  • About your MVP Award Gift
  • How to claim your award benefits
  • Your MVP Identification Number
  • MVP Award Program Code of Conduct

The Microsoft MVP Award provides us the unique opportunity to celebrate and honor your significant contributions and say “Thank you for your technical leadership.”

Lourdes Orive
Group Manager
Community & Influencer Programs
Microsoft

 

Mike Hickman
Group Manager
Community Engagement
Microsoft