What is that growth?

Unless you’re a special sort of person, capacity planning and monitoring isn’t really a fun task. We have an SSIS package that we use to collect file statistics every six hours. The names have been changed, and the structure is different, but given the table

CREATE TABLE [FileStats] (
    CheckTime DATETIME2(0),
    FileID INTEGER,
    FileSizeKB BIGINT
)

how do we figure out what’s a normal growth and what’s abnormal?

Let’s start by getting our last two checks (I only wanted files in active checks so I filtered by checks within the last day):

SELECT
    rn = ROW_NUMBER() OVER (PARTITION BY FileID ORDER BY CheckTime DESC),
    FileID,
    FileSizeKB,
    CheckTime
FROM FileStats
WHERE CheckTime >= DATEADD(DAY,-1,GETDATE());

Then I want to compare what has happened between the last two checks (which is why I ran a ROW_NUMBER() over the partitioned set. There’s probably a better way to do this, but it’s how I solved it.

SELECT
    FileID,
    FileSizeKB,
    CheckTime,
    LastFileSize = LAG(FileSizeKB,1,NULL) OVER (PARTITION BY FileID ORDER BY rn DESC),
    LastCheckDate = LAG(CheckTime,1,NULL) OVER (PARTITION BY FileID ORDER BY rn DESC)
FROM (
    SELECT
        rn = ROW_NUMBER() OVER (PARTITION BY FileID ORDER BY CheckTime DESC),
        FileID,
        FileSizeKB,
        CheckTime
    FROM FileStats
    WHERE CheckTime >= DATEADD(DAY,-1,GETDATE())
) b
WHERE b.rn <= 2

This is the basic information needed to compare the last two checks, but they haven’t been compared yet.

To begin comparison, calculate growth percent difference as the basic measure of file growth. The formula is D = (x2-x1) / x1.

Calculating this gives some truly fascinating numbers, both large and small but doesn’t given any insight into how to understand these numbers. Now it’s time to compare this with historical data. This will calculate our standard deviation and average percent difference throughout history. Essentially, this calculates the data file’s propensity to change through time.


SELECT
    FileID,
    AvgPctChange = AVG(PctChange),
    StdevPctChange = STDEVP(PctChange)
FROM (
    SELECT
        FileID,
        PctChange = (FileSizeKB - LAG(FileSizeKB,1,NULL) OVER (PARTITION BY FileID ORDER BY CheckTime DESC))
        / (LAG(FileSizeKB,1,NULL) OVER (PARTITION BY FileID ORDER BY CheckTime DESC) * 1.)
    FROM FileStats ) avgs
GROUP BY FileID

To actually make the decision, the current change needs to be compared with the observed historical changes. To do this, basic statistical measures are employed

SELECT
    FileID,
    LastCheckTime,
    LastFileSize,
    CheckTime,
    FileSizeKB,
    GrowthPctDiff = (FileSizeKB-LastFileSize) / (LastFileSize * 1.),
    t_avg.AvgPctChange,
    t_avg.StdevPctChange,
    z =
        (((FileSizeKB-LastFileSize) / (LastFileSize * 1.)) - t_avg.AvgPctChange)
            / t_avg.StdevPctChange
 FROM (
    SELECT
        FileID,
        FileSizeKB,
        CheckTime,
        LastFileSize = LAG(FileSizeKB,1,NULL) OVER (PARTITION BY FileID ORDER BY rn DESC),
        LastCheckDate = LAG(CheckTime,1,NULL) OVER (PARTITION BY FileID ORDER BY rn DESC)
    FROM (
        SELECT
            rn = ROW_NUMBER() OVER (PARTITION BY FileID ORDER BY CheckTime DESC),
            FileID,
            FileSizeKB,
            CheckTime
        FROM FileStats
        WHERE CheckTime >= DATEADD(DAY,-1,GETDATE())
        ) b
   WHERE b.rn <= 2) agg
JOIN (
    SELECT
        DatabaseFileID,
        AvgPctChange = AVG(PctChange),
        StdevPctChange = STDEVP(PctChange)
    FROM (
        SELECT
            DatabaseFileID,
            PctChange = (FileSizeKB - LAG(FileSizeKB,1,NULL) OVER (PARTITION BY DatabaseFileID ORDER BY CheckTime DESC))
                 / (LAG(FileSizeKB,1,NULL) OVER (PARTITION BY DatabaseFileID ORDER BY CheckTime DESC) * 1.)
        FROM FileStats ) avgs
    GROUP BY DatabaseFileID) t_avg
        ON agg.DatabaseFileID = t_avg.DatabaseFileID
WHERE agg.LastFileSize IS NOT NULL
      AND ((FileSizeKB-LastFileSize) / (LastFileSize * 1.)) > 0

This query will return the historical statistics and percent difference of any data file that has exhibited any growth over the last two checks. This percent difference is evaluated against the average and is assigned a z-score (how many standard deviations away from the mean this change is). A z-score of 1.96 would indicate statistical significance and would indicate a growth that falls outside of the “normal” distribution. With this query (and assuming you’ve set up the appropriate collection and retention structures), you can find significant data file growths.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s