Archive | Data Warehouse

Get a grip on the DWH aggregations

24 Mar

 

The problem

If you run availability or performance reports with an aggregation type of daily or hourly, the reports come up empty. This problem is described a lot on the web, and I have also written a couple of blog posts on how to fix it. But as you know, we use SCOM to monitor stuff, so why not monitor this aggregation processing and alert when a processing delay occurs? That's our mission today…

Analyze

Using SQL Server Management Studio and a SQL query on the data warehouse DB, we can read out the aggregation processing. The query looks like this:

SELECT AggregationTypeId, DatasetId,
       (SELECT SchemaName FROM StandardDataSet WHERE DatasetId = StandardDataSetAggregationHistory.DatasetId) AS SchemaName,
       COUNT(*) AS 'Count', MIN(AggregationDateTime) AS 'First', MAX(AggregationDateTime) AS 'Last'
FROM StandardDataSetAggregationHistory
WHERE LastAggregationDurationSeconds IS NULL
GROUP BY AggregationTypeId, DatasetId

The output shows how many aggregations still have to be processed per aggregation type (20 = hourly, 30 = daily).

image

So in this case we have no problem. But I have seen SCOM environments where the state aggregations were so far behind that it was almost impossible to fix. This brings up a point: the state aggregations in particular are the tricky ones. If you have many 'flipping' monitors there will be a lot of state changes, and therefore a lot of aggregation data to process. This processing takes a lot of SQL CPU power and also disk space. In most of these cases, free space in tempdb or the transaction log turned out to be the root cause of the failure.
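
If you suspect the same bottleneck, you can check the log usage and the free space in tempdb up front. A small sketch (the DBCC command also appears later in this archive; the tempdb DMV query is my own addition, not from the original post):

-- transaction log usage per database
DBCC SQLPERF(LOGSPACE)

-- free space in the tempdb data files (pages are 8 KB)
SELECT SUM(unallocated_extent_page_count) * 8 / 1024 AS Free_MB
FROM tempdb.sys.dm_db_file_space_usage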

Solution

In SCOM we have a target for every dataset. This target is named 'Standard Data Set'. You can find it here:

image

If you compare the screenshot with the results in your SCOM console, you will notice that you don't have the green healthy state… And that's why you are reading this post. So let's add this state.

I wanted to give every dataset a health state based on how many aggregations it still has to process. So we make a monitor that executes the query above for every dataset and changes the health state when a threshold is hit. We will also add a rule so that this 'aggregations behind' count is put in a trend graph.

I have used VSAE for this, and I will not share the code but only the idea. Why not? I believe you have to know what you are doing, and by copying & pasting you don't learn anything if you haven't done it once from start to end.

The real work

Open a new VSAE project and add an empty MP fragment and a PowerShell fragment.

image

Then you create a datasource that reads the aggregation count. This is done using PowerShell and the SQL snap-in.

image

The PowerShell script takes the GUID of the dataset (a property of the target) as input and returns a property bag with the aggregation counts (daily and hourly) as output. I made the script somewhat intelligent by reading from the registry where the data warehouse is located.
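
I won't reproduce the full script either, but the query it runs per dataset can be pieced together from the queries elsewhere in this archive. A sketch of that per-dataset variant, with @DatasetId standing in for the GUID the target passes to the script:

-- sketch: outstanding aggregations for a single dataset, identified by its GUID
DECLARE @DatasetId uniqueidentifier
SET @DatasetId = '00000000-0000-0000-0000-000000000000'   -- dataset GUID from the target property

SELECT AggregationTypeId,          -- 20 = hourly, 30 = daily
       COUNT(*) AS 'Count'
FROM StandardDataSetAggregationHistory
WHERE DatasetId = @DatasetId
  AND LastAggregationDurationSeconds IS NULL
GROUP BY AggregationTypeId

The script then puts the two counts into the property bag (one value for hourly, one for daily) that the monitor and the collection rule consume.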

Now we use this datasource in a monitor type to create a three-state monitor. And since we have created a datasource module, we can also create a rule that collects the 'aggregations behind' count for the trend graph. Yes, I know, this is easier to type than to do…

Below a snap of the datasource module

image

And below a snap of the monitor module type

image

and the monitor. Create one for hourly (not shown) and one for daily.

image

Finally, for trending, we have to create a collection rule.

image

Notice that the monitor and the collection rule target "Microsoft.SystemCenter.DataWarehouse.DataSet", alias "Standard Data Set", and notice the Run As profile.

The result

When you have constructed the MP and built/deployed it, you will see two extra monitors on the Standard Data Set targets as shown above. Open the Health Explorer to see if all is OK.

image

The dataset above has had a problem. To see some details, view the performance counters and you will see the aggregations trend.

image

In this case the hourly state aggregations were way behind, so I followed one of my own blog posts to solve it: manually executing the state aggregation process in a loop to speed up the processing.

The End.

Yes, I know this post is a bit 'cloudy' and not something you can download and import. But I hope that by sharing the idea I have triggered you to try it yourself.

Happy SCOMMING!

Michel Kamp

No Mr. SCOM, I told you it's not an availability state report but a performance state report I want!

16 Jan

Sometimes you wonder why not all the reports are as they should be. For example, you are of course familiar with the availability report. Just pick a target and a period and you will get a nice report telling you when a target went unhealthy.

image

The challenge.

Okay, nice… but I want a report based not on the availability data but on the performance, configuration or security data. But wait, this is built into the availability report, isn't it?

Looking at the report description:

Description:

“For every managed object within System Center Operations Manager, monitors configured in each of the disciplines below determine an objects time in state and then roll-up to an objects overall health. The availability report by default shows an objects time in state as per the monitors that roll-up within the availability discipline.

Entity health

Availability   <= this you get

Configuration <= this you want

Performance <= ..

Security <= ..

Oh no, it looks like not. So yes, it's a real challenge. That's the way we like it.

Solution

The availability report seems to have been intended for this, but in the end it looks like the SCOM product team decided to lock it to 'availability' only. I know this because when you look into the report definition you will see:

image

So the report uses only the availability rollup as state calculation data. AND this parameter is hidden, even for gurus like us. How dare they 😉

So we can solve this in several ways. The root of the solution is that we want to change the value 'System.Health.AvailabilityState' to 'System.Health.PerformanceState', 'System.Health.ConfigurationState' or 'System.Health.SecurityState' to get the report state type we want.

1) Export the report from Reporting Services and edit the hidden value to false. Import the report, open it in the SCOM console and change the MonitorName value to, for example, System.Health.PerformanceState. Run the report and you are done.

2) Make a normal report run using the unmodified availability report and save it to a management pack. Now export the MP, open it in Notepad and edit the MP.

3) Make a normal report run using the unmodified availability report and save it as a favorite. Now open SQL Server Management Studio, look up the report in the table dbo.favoritereport and change the ReportParameterValues to the modified parameters.

I know you are thinking right now… what would you do Michel…

I would go for option 1, because I would also change the report definition to have a proper name such as 'Performance availability' etc. and save it under a different name as well. But be aware: if you only change the report value to hidden = false and don't change the report file name, then the next time you import a new service pack or MP version your report could be overwritten… So with that said, go for the safer one and choose option 2.
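
The walkthrough below follows option 2. For completeness, if you prefer option 3, a hedged sketch of the SQL could look like the following; dbo.FavoriteReport and ReportParameterValues are the only names taken from above, everything else is an assumption, so inspect the table first and narrow the WHERE clause to the favorite you just saved:

-- look up the favorite you just saved (operational database, at your own risk)
SELECT * FROM dbo.FavoriteReport

-- swap the hidden monitor parameter inside the saved parameter values
-- WARNING: as written this touches every favorite containing the availability state
UPDATE dbo.FavoriteReport
SET ReportParameterValues = REPLACE(CAST(ReportParameterValues AS nvarchar(max)),
    'System.Health.AvailabilityState', 'System.Health.PerformanceState')
WHERE CAST(ReportParameterValues AS nvarchar(max)) LIKE '%System.Health.AvailabilityState%'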

Let’s go!

1) So make the normal availability report in the SCOM console

2) Save it to an MP

image

3) Export the MP

4) Edit the MP with notepad

image

5) Import it in SCOM (leave the MP version number unchanged)

6) Wait a few minutes and you will see the report in the console

Below is the end result. Also notice that you can still click through to the sub-reports, and that these reports are also of the state type you wanted!

 image

Yes, I know you will have to do this for each of the three report types, because you can't change the monitor type at runtime. In the end the decision is yours whether to use option 1, 2 or 3.

The End

Every time I tell myself: make a short blog post! And every time I notice that I am failing… But who cares… (yes okay… my wife) 😉

Happy scomming!

Michel Kamp

Don’t let the data warehouse write action fool you!

26 Sep

Yes, I know, it's been a long time since I posted. Vacation and work pressure were, and still are, the reason. But nevertheless I will share a problem I ran into that looks small but can have a big impact.

The problem.

You have a workflow with a PowerShell/VBS script that outputs a property bag with performance data. The performance data contains multiple counters. This performance data is written to the OpsDB and the DWH DB. All seems to work: you see the performance counters in the native console. So you assume everything is okay, because the DWH write action writes the same counters to the DWH… but when you look in the DWH you see that only one counter is stored. And you are sure the workflow outputted multiple counters…

Below are the performance counters in the native console. All four perf counters are there (yellow) in the ops console.

image

Below the DWH.

You see only one rule (yellow); this was the first counter in the property bag.

clip_image002
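
If you want to do the same check without the screenshot, a query along these lines shows which counters actually made it into the DWH. This is a sketch from memory using the standard DWH views vPerformanceRule, vPerformanceRuleInstance and Perf.vPerfRaw, so verify the names in your own warehouse:

-- sketch: which perf counters have rows in the DWH (view names as I recall them)
SELECT DISTINCT pr.ObjectName, pr.CounterName, pri.InstanceName
FROM Perf.vPerfRaw praw
JOIN vPerformanceRuleInstance pri ON pri.PerformanceRuleInstanceRowId = praw.PerformanceRuleInstanceRowId
JOIN vPerformanceRule pr ON pr.RuleRowId = pri.RuleRowId
WHERE pr.ObjectName = 'SMB File Transfer'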

What could be wrong ???

Analyze

The workflow looks like this:

<Rule ID="TransferFile.ReadSec" Enabled="true" Target="FileTransferClient" ConfirmDelivery="true" Remotable="true" Priority="Normal" DiscardLevel="100">
  <Category>Custom</Category>
  <DataSources>
    <DataSource ID="SMBFileTransfer" TypeID="FileTransfer">
      <Debug>false</Debug>
      <IntervalSeconds>300</IntervalSeconds>
    </DataSource>
  </DataSources>
  <WriteActions>
    <WriteAction ID="ToOps" TypeID="SystemCenter!Microsoft.SystemCenter.CollectPerformanceData" />
    <WriteAction ID="ToDWH" TypeID="SCDW!Microsoft.SystemCenter.DataWarehouse.PublishPerformanceData" />
  </WriteActions>
</Rule>

1. First you check what the property bag output from the SMBFileTransfer datasource contains:

<Collection>
  <DataItem type="System.PropertyBagData" time="2012-09-20T19:55:28.0638791+02:00" sourceHealthServiceId="0F6B7345-4C8E-CFAF-BD7A-454E6C94B77F"><Property Name="Instance" VariantType="8">c:\destionation</Property><Property Name="Counter" VariantType="8">Read Transfer Kbyte Sec</Property><Property Name="Value" VariantType="5">14450.625</Property></DataItem>
  <DataItem type="System.PropertyBagData" time="2012-09-20T19:55:28.1079971+02:00" sourceHealthServiceId="0F6B7345-4C8E-CFAF-BD7A-454E6C94B77F"><Property Name="Instance" VariantType="8">c:\destionation</Property><Property Name="Counter" VariantType="8">Read Transfer Total Sec</Property><Property Name="Value" VariantType="5">0.3</Property></DataItem>
  <DataItem type="System.PropertyBagData" time="2012-09-20T19:55:28.1079971+02:00" sourceHealthServiceId="0F6B7345-4C8E-CFAF-BD7A-454E6C94B77F"><Property Name="Instance" VariantType="8">c:\destionation</Property><Property Name="Counter" VariantType="8">Write Transfer Kbyte Sec</Property><Property Name="Value" VariantType="5">14450.625</Property></DataItem>
  <DataItem type="System.PropertyBagData" time="2012-09-20T19:55:28.1079971+02:00" sourceHealthServiceId="0F6B7345-4C8E-CFAF-BD7A-454E6C94B77F"><Property Name="Instance" VariantType="8">c:\destionation</Property><Property Name="Counter" VariantType="8">Write Transfer Total Sec</Property><Property Name="Value" VariantType="5">0.3</Property></DataItem>
</Collection>

You see multiple counter values that have to be converted to performance data.

2. Now we check the converted performance data using the WFAnalyzer. See below; it looks okay.

Recieved DataItem <DataItem type="System.Performance.Data" time="2012-09-20T19:55:28.1109383+02:00" sourceHealthServiceId="0F6B7345-4C8E-CFAF-BD7A-454E6C94B77F"><TimeSampled>2012-09-20T19:55:28.0638791+02:00</TimeSampled><ObjectName>SMB File Transfer</ObjectName><CounterName>Read Transfer Kbyte Sec</CounterName><InstanceName>c:\destionation</InstanceName><IsNull Type="Boolean">false</IsNull><Value>14450.625</Value></DataItem>

Recieved DataItem <DataItem type="System.Performance.Data" time="2012-09-20T19:55:28.1109383+02:00" sourceHealthServiceId="0F6B7345-4C8E-CFAF-BD7A-454E6C94B77F"><TimeSampled>2012-09-20T19:55:28.1079971+02:00</TimeSampled><ObjectName>SMB File Transfer</ObjectName><CounterName>Read Transfer Total Sec</CounterName><InstanceName>c:\destionation</InstanceName><IsNull Type="Boolean">false</IsNull><Value>0.3</Value></DataItem>

Recieved DataItem <DataItem type="System.Performance.Data" time="2012-09-20T19:55:28.1109383+02:00" sourceHealthServiceId="0F6B7345-4C8E-CFAF-BD7A-454E6C94B77F"><TimeSampled>2012-09-20T19:55:28.1079971+02:00</TimeSampled><ObjectName>SMB File Transfer</ObjectName><CounterName>Write Transfer Kbyte Sec</CounterName><InstanceName>c:\destionation</InstanceName><IsNull Type="Boolean">false</IsNull><Value>14450.625</Value></DataItem>

Recieved DataItem <DataItem type="System.Performance.Data" time="2012-09-20T19:55:28.1109383+02:00" sourceHealthServiceId="0F6B7345-4C8E-CFAF-BD7A-454E6C94B77F"><TimeSampled>2012-09-20T19:55:28.1079971+02:00</TimeSampled><ObjectName>SMB File Transfer</ObjectName><CounterName>Write Transfer Total Sec</CounterName><InstanceName>c:\destionation</InstanceName><IsNull Type="Boolean">false</IsNull><Value>0.3</Value></DataItem>

3. The next step is to check the write actions. This also looks okay. The "ToDWH" write action should write the data to the DWH.

<WriteActions>
  <WriteAction ID="ToOps" TypeID="SystemCenter!Microsoft.SystemCenter.CollectPerformanceData" />
  <WriteAction ID="ToDWH" TypeID="SCDW!Microsoft.SystemCenter.DataWarehouse.PublishPerformanceData" />
</WriteActions>

All looks okay….

Solution

After some mailing with the OM development team the answer was found: writing multiple counters to the DWH from one property bag output is NOT supported! The DWH write module has a one-to-one reference map, which means one rule can contain only one counter. Be aware that no error is reported when this happens.

The only way to solve this is to make one rule for every performance counter you want to store in the DWH. Use a condition detection in the rule to filter out the correct performance counter. See below for an example.

<Rule ID="TransferFile.ReadSec" Enabled="true" Target="FileTransferClient" ConfirmDelivery="true" Remotable="true" Priority="Normal" DiscardLevel="100">
  <Category>Custom</Category>
  <DataSources>
    <DataSource ID="SMBFileTransfer" TypeID="OPS.SMB.Performance.FileTransfer">
      <Debug>false</Debug>
      <IntervalSeconds>300</IntervalSeconds>
    </DataSource>
  </DataSources>
  <ConditionDetection ID="Filter" TypeID="System!System.ExpressionFilter">
    <Expression>
      <SimpleExpression>
        <ValueExpression>
          <XPathQuery Type="String">CounterName</XPathQuery>
        </ValueExpression>
        <Operator>Equal</Operator>
        <ValueExpression>
          <Value Type="String">Read Transfer Total Sec</Value>
        </ValueExpression>
      </SimpleExpression>
    </Expression>
  </ConditionDetection>
  <WriteActions>
    <WriteAction ID="ToOps" TypeID="SystemCenter!Microsoft.SystemCenter.CollectPerformanceData" />
    <WriteAction ID="ToDWH" TypeID="SCDW!Microsoft.SystemCenter.DataWarehouse.PublishPerformanceData" />
  </WriteActions>
</Rule>

THE END

Maybe this will help you. Till next Time.

Happy SCOMMING

Michel Kamp

SCOM DWH aggregations data loss tips and tricks

10 Apr

 

This 'short' post will be about the DWH aggregations again. It contains some tips on how not to lose any data.

!!! Everything I suggest and do here is at your own risk and totally unsupported unless instructed by Microsoft support. !!!

The problem:

You run a performance report over one month. You notice that you are missing some days of aggregated hourly/daily data. As far as you knew you weren't having any trouble… till now.

image

Analyze:

First we are going to check whether there are any aggregations that have not been completed yet.
Run the SQL query below on the DWH database:

-- checking the to-be-processed aggregations --------------

SELECT     COUNT(*) AS Aggr_behind, Dataset.DatasetDefaultName
FROM         StandardDatasetAggregationHistory INNER JOIN
                      Dataset ON StandardDatasetAggregationHistory.DatasetId = Dataset.DatasetId
WHERE     (StandardDatasetAggregationHistory.DirtyInd = 1)
GROUP BY Dataset.DatasetDefaultName

The result could be as shown below. The Aggr_behind number shows you the aggregations that are not completed yet.

image

In this case, with such a high number, we have a serious problem. Then you just follow my previous blog post on how to solve this; it is about missing state data but can also be applied to performance data. Look at the FIX part to kick off the aggregation processing.

(https://michelkamp.wordpress.com/2012/03/23/dude-where-my-availability-report-data-from-the-scom-dwh/)

But if you see a number around 2 for the performance dataset (see picture below), it means only 2 aggregations still have to be processed. This is what we want to see, so everything seems okay. But then why are we missing the date period 01-02-2012 till 20-01-2012?

image

We could have 2 scenarios here:

1. The data was simply not provided to the DWH.

2. The data was provided, but due to staging/aggregation problems it was not processed.

For case 1 we have to look at the agents to see what went wrong. That is out of scope for this post.

For case 2 we have some solutions, see below.

Case 2

First let me explain, at helicopter view, how the aggregation process works. I am sure I am missing some details (so feel free to add to / correct me on this!).

image

Looking at the picture above:

1. The SCOM management server DWH writer datasource writes the performance data to a RAW staging table.

2. The DWH staging process processes this data by copying the RAW rows into a process table. Sometimes the table is simply renamed and recreated, if the new RAW data count is below a configured number. If you have a big number of new RAW rows, the rows are copied in batches to minimize the transaction log impact. Finally the RAW data is copied into the RAW data partition tables.

3. The standard maintenance process generates the aggregation sets that have to be processed in step 4. During this process, aggregation rows are created in the aggregation history table with a dirty indication (DirtyInd) of 1.

4. The RAW partition data is processed into aggregated hourly and daily data. When an aggregation is complete, the dirty indication for that aggregation is set to 0.

5. The report stored procedure reads the freshly aggregated data.

6. The data from step 5 is used to generate the report for the end user.

 

So, now that we know the data flow, what could be wrong?

The answer has to be searched for in the grooming process. (?) Yes, the grooming process. The data in the RAW partition tables from step 2 has a grooming/retention period. By default this period is 10 days. So if your aggregation is broken for more than 10 days (and you didn't detect it), you will LOSE your RAW data, and as a result the aggregation process will have nothing to aggregate. So no performance data, resulting in our root problem: the date gap in the report.
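
Before touching anything you can check the current retention per dataset. A small sketch using the same StandardDatasetAggregation columns as the update further below (AggregationTypeId 0 is, as far as I know, the RAW data):

-- current grooming/retention settings per dataset and aggregation type
SELECT sd.SchemaName, sda.AggregationTypeId, sda.MaxDataAgeDays, sda.GroomStoredProcedureName
FROM StandardDatasetAggregation sda
INNER JOIN StandardDataset sd ON sd.DatasetId = sda.DatasetId
ORDER BY sd.SchemaName, sda.AggregationTypeId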

Solution:

Pfff… nice, all this theory stuff, but how do I fix this?

Simply by: 😉

1. Manually inserting the missing RAW data and kicking off the aggregation process. I will blog about how to do this later (after MMS).

2. Preventing this from happening again.

To prevent this, you can increase the retention/grooming period from 10 days to, let's say, 30 days. Check first whether you have enough DB space. Execute the query below:

UPDATE StandardDatasetAggregation
SET MaxDataAgeDays = 30
WHERE GroomStoredProcedureName = 'PerformanceGroom' AND AggregationTypeID = 0

Now you will have 30 days to solve your aggregation problems. Of course this is only a workaround to give you more air to breathe while fixing your aggregation problems.

The best way is to monitor it proactively. Since we can monitor everything, we create a monitor that checks the outstanding aggregations every 60 minutes and alerts when a threshold is hit. You can use the query from the Analyze part of this post to do this. I would set the threshold to 10, so you are notified when your aggregation process has a backlog of 10 aggregations (about 10 hours). If I have time before I go to MMS I will blog about this extra monitor, because you can't build this one with the normal DB watcher. And of course I will use the VS Authoring Extensions for this.
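
As a sketch of what that check could run (my own variant of the query from the Analyze part, not the author's actual monitor), a HAVING clause returns only the datasets that are over the threshold:

-- sketch: datasets with more than 10 outstanding aggregations
SELECT Dataset.DatasetDefaultName, COUNT(*) AS Aggr_behind
FROM StandardDatasetAggregationHistory
INNER JOIN Dataset ON StandardDatasetAggregationHistory.DatasetId = Dataset.DatasetId
WHERE StandardDatasetAggregationHistory.DirtyInd = 1
GROUP BY Dataset.DatasetDefaultName
HAVING COUNT(*) > 10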

Happy scomming.

Michel

SCOM DWH aggregation process takes all your disk space

2 Apr

A short post on a very strange issue I solved this week. The DWH database was taking all the log space it could get. In some cases, when you have a lot of aggregations waiting due to a state change burst, this would be normal. So you add some extra log space to the DWH and remove it after the processing has succeeded. But this time it was hungry and took 80 GB+ of log space. So my alarm bells went off: this is for sure not normal.

Analyze:

I opened SQL Server Management Studio and executed the query below. This query shows me the free log space per database.

-- transaction usage info
DBCC SQLPERF(logspace)

The output showed that the log of the DWH database was 100% used. Okay, no real news, we knew this already.

Now we must find out why.

We check whether the database is in simple recovery mode. This is the default setting for the SCOM DWH, and yes, it is configured correctly. Since a simple-mode DB releases its log pages when a transaction completes (is committed or rolled back), it must be that a transaction isn't completed, or never will be.
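
To verify the recovery model yourself (a one-liner I added, assuming the default database name OperationsManagerDW):

-- check the recovery model of the data warehouse database
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = 'OperationsManagerDW'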

Let's look up the open transaction(s).

Execute the SQL below to see the open transactions:

-- open transactions; if status = 2 the transaction is still open
DBCC LOGINFO

It returned over 60K rows with status 2. So now we are sure it is caused by an open transaction. Let's find the guilty process causing this never-closed transaction.

Execute the SQL below to get the process:

-- shows the process SPID; if no SPID is shown but only an old LSN, look at replication
DBCC OPENTRAN

The output was really not what I expected:

Transaction information for database 'OperationsManagerDW'.
Replicated Transaction Information:
Oldest distributed LSN : (0:0:0)
Oldest non-distributed LSN : (4881975:876:1)

Hmm, strange; we should have seen this form of result, showing the process SPID causing the transaction:

Transaction information for database 'OperationsManagerDW'.

Oldest active transaction:
    SPID (server process ID): 76
    UID (user ID) : -1
    Name          : user_transaction
    LSN           : (250110:77:2)
    Start time    : Apr  2 2012 11:38:30:477AM
    SID           : 0x010500000000000515000000f848abc31e7f1b2ca574174c53040000
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

Solution

Attentive readers will have noticed the words "Replicated Transaction Information" in the results. Hmm, but I don't use replication!! Even when I used SQL Server Management Studio to check the replication, it did not show any replication configuration.

After some Bing work I found the system stored procedure to force-delete any replication configuration. So I executed the SQL below:

-- remove all replication; afterwards the transaction log can be truncated
sp_removedbreplication @dbname = 'OperationsManagerDW'

And what do you think? Yes, problem solved. All log space was returned!!

Check it by running DBCC LOGINFO again and you will see it is almost down to 0%.

 

Happy SCOMMING,

Michel Kamp [MVP] 😉

Dude where is my Availability Report data from the SCOM DWH ??

23 Mar

"Houston, we've got a problem!" When I run an availability report, the data isn't complete. I'm missing a huge number of days. The graph shows UP (Monitoring unavailable), but I am really sure the server was up and monitored!!

image

 

Analyze:

Don't panic, we are going to solve this (I hope…). First we are going to look up the days we are missing. Simply click on the white bar, and the detail report will be rendered.

image

Okay, it looks like we are missing most of the data from 4-3-2012 onwards. And we see strange gaps in the data that is present.

Okay, that's what the report says, but I am a core-stuff guy, so I check it this way:

Open a SQL session and connect to the DWH DB. Run the query below. The last aggregated data will be on the first row, so you know the date of the last data you have. We set the DateTime range to the same period we used in the report.

SELECT ManagedEntity.FullName, vStateHourly.*
FROM ManagedEntityMonitor
INNER JOIN ManagedEntity ON ManagedEntityMonitor.ManagedEntityRowId = ManagedEntity.ManagedEntityRowId
INNER JOIN State.vStateHourly ON ManagedEntityMonitor.ManagedEntityMonitorRowId = State.vStateHourly.ManagedEntityMonitorRowId
WHERE ManagedEntity.FullName LIKE 'Microsoft.Windows.Computer:opsrms01%'
  AND vStateHourly.DateTime BETWEEN '20120301' AND '20120401'
ORDER BY vStateHourly.DateTime DESC

The output will be:

image

So the last successful hourly aggregation was 02-03-2012 (dd-mm-yyyy). Hmmm, but when I look at the rendered report I see periods of data after this date??? I must confess I really have no idea why 😉

Now we have to find the root cause and fix these missing aggregations. Luckily we can enable debug information for the aggregation process, so we can see more of what is going wrong.

Open SQL and run the query below to enable debugging for the State dataset.

UPDATE [OperationsManagerDW].[dbo].[StandardDataset]
SET [DebugLevel] = 5
WHERE [SchemaName] = 'State'
GO

Now we can see the debug data with this query:

SELECT TOP (100) DATEADD(hh, 1, DebugMessage.MessageDateTime) AS CET_datetime, StandardDataset.SchemaName, DebugMessage.MessageLevel, DebugMessage.MessageText
FROM DebugMessage WITH (NOLOCK)
INNER JOIN StandardDataset ON DebugMessage.DatasetId = StandardDataset.DatasetId
WHERE StandardDataset.SchemaName = N'State'
ORDER BY MessageDateTime DESC

The output will be as below:

image

It looks like my aggregation process is way behind !

Since the DWH maintenance runs as a sequence, it means that when an earlier procedure fails (let's say the event staging), the later ones won't be hit. So I look in the debug table for other messages with 'failed' in the text.

Notice that we are now going a little off track; our main problem was the incomplete State report, but now we are looking at the events. Just follow me.

SELECT TOP (100) DATEADD(hh, 1, DebugMessage.MessageDateTime) AS CET_datetime, StandardDataset.SchemaName, DebugMessage.MessageLevel, DebugMessage.MessageText
FROM DebugMessage WITH (NOLOCK)
INNER JOIN StandardDataset ON DebugMessage.DatasetId = StandardDataset.DatasetId
WHERE MessageText LIKE '%Failed%'
ORDER BY MessageDateTime DESC

Oh no, this is not good:

image

It looks like the event staging is broken. The error is:

Failed to process stagingarea for data set. Error 777971002, Procedure EventProcessStaging, Line 398, Message: Sql execution failed. Error 515, Level 16, State 2, Procedure DebugMessageInsert, Line 15, Message: Cannot insert the value NULL into column 'MessageText', table 'OperationsManagerDW.dbo.DebugMessage'; column does not allow nulls. INSERT fails.

Mmmm. When I look at the error, I see that the debug procedure that writes to the debug log has a problem writing a debug message. Strange error… So we have to find the real error. I opened the stored procedure "EventProcessStaging", and there I found a BUG… brrrr. The variable @InsertTableName is not set to a value before it is used as part of the debug message variable @MessageText. Concatenating NULL into the message makes the whole message NULL, so the insert into the (NOT NULL) MessageText column fails and an exception is raised. I fixed this by moving the SQL that assigns @InsertTableName above its first use (this applies to SCOM 2007 and 2012!). I have already raised a bug request with Microsoft through the TAP program. This only occurs when you set the debug level > 3.

For you it simply means: don't set it above 3, or fix this stored procedure at your own risk (as I have done ;-)). In our case the debug level had already been above 3 for the State dataset for the last month. So because the event processing was breaking the total maintenance (bad architecture, sorry), all my state staging was stopped, and that caused my empty reports.
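
So if you did raise the debug level as shown above, set it back down when you are done troubleshooting. A small sketch, the same kind of update as before; I'm assuming 0 here, pick whatever low value you normally run with:

-- set the debug level back to a low value when you are done
UPDATE [OperationsManagerDW].[dbo].[StandardDataset]
SET [DebugLevel] = 0
WHERE [SchemaName] = 'State'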

So now that we know this, we can go back to the real issue: fixing the missing states.

Fix:

This is now very simple, and if you Bing it you can find plenty of info. One I found very helpful in the beginning was this one from the Microsoft SCOM team ( http://blogs.technet.com/b/operationsmgr/archive/2011/09/06/standard-dataset-maintenance-troubleshooter-for-system-center-operations-manager-2007.aspx )

What I do most of the time is:

Set an Enabled = false override on the rule "Standard Data Warehouse Data Set maintenance rule" for all instances of "Standard Data Set".

SNAGHTML15ff6a4a

Now I am really sure no maintenance process is running.

And I run my own maintenance process every minute. Because I know that catching up the state data aggregation will take some time, and I don't want to create problems for the other datasets (performance, events, …), I also run the important ones in the same script.

Open a query to the DWH and run:

USE [OperationsManagerDW]
DECLARE @DataSet uniqueidentifier

PRINT 'starting loop of StandardDatasetMaintenance jobs'

WHILE (1 = 1)
BEGIN
    PRINT getdate()
    PRINT 'Start StandardDatasetMaintenance'

    SET @DataSet = (SELECT DatasetId FROM StandardDataset WHERE SchemaName = 'Perf')
    EXEC StandardDatasetMaintenance @DataSet

    SET @DataSet = (SELECT DatasetId FROM StandardDataset WHERE SchemaName = 'Exchange2010')
    EXEC StandardDatasetMaintenance @DataSet

    SET @DataSet = (SELECT DatasetId FROM StandardDataset WHERE SchemaName = 'State')
    EXEC StandardDatasetMaintenance @DataSet

    SET @DataSet = (SELECT DatasetId FROM StandardDataset WHERE SchemaName = 'Event')
    EXEC StandardDatasetMaintenance @DataSet
    --EXEC StandardDatasetProcessStaging @Dataset

    PRINT getdate()
    PRINT 'END StandardDatasetMaintenance'
    WAITFOR DELAY '00:01'
END

Now check the debug log on a regular basis to see whether the state aggregation has completed.

You can also use the query below:

-----------------------------------------------------------------------
-- check first and last aggregation time of the still-to-be-processed data
-- first and last date must be equal
-----------------------------------------------------------------------
DECLARE @DataSet AS uniqueidentifier
SET @DataSet = (SELECT DataSetId FROM StandardDataSet WHERE SchemaName = 'State')
SELECT AggregationTypeId, COUNT(*) AS 'Count', MIN(AggregationDateTime) AS 'First', MAX(AggregationDateTime) AS 'Last'
FROM StandardDataSetAggregationHistory
WHERE DataSetId = @DataSet AND LastAggregationDurationSeconds IS NULL
GROUP BY AggregationTypeId

So let's check whether the process is running okay. Simply rerun the report. The output will be:

image

Looks like it's all going to be alright. Just be patient.

DO NOT FORGET:

After the states are complete, remove the overrides; otherwise you will for sure have the same problem (and more) again.

Not to be continued:

I really hope not, because in my case the DWH is almost 1 TB in size, and at that size it can be very complex and tricky to solve this sort of problem. So if Mr. Murphy is reading this, please skip my place…

Happy SCOMMING,

Michel Kamp