Unless you’re a special sort of person, capacity planning and monitoring aren’t really fun tasks. We have an SSIS package that we use to collect file statistics every six hours. The names have been changed, and the structure is different, but given the table

CREATE TABLE [FileStats]
(
    CheckTime  DATETIME2(0),
    FileID     INTEGER,
    FileSizeKB BIGINT
);

how do we figure out what’s a normal growth and what’s abnormal?
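As an aside, the collection step itself can be sketched in one statement. This is an assumption about how the package works, not its actual dataflow; it supposes the source is sys.database_files, where size is reported in 8 KB pages:

```sql
-- A minimal sketch of the collection step (assumption: the data
-- ultimately comes from sys.database_files in each database).
INSERT INTO [FileStats] (CheckTime, FileID, FileSizeKB)
SELECT SYSDATETIME(),
       file_id,
       size * 8   -- size is in 8 KB pages; multiply to get KB
FROM sys.database_files;
```

The real package presumably loops over databases and lands the rows in a central repository, but the shape of the data is the same.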

Let’s start by getting our last two checks (I only wanted files in active checks so I filtered by checks within the last day):

SELECT rn = ROW_NUMBER() OVER (PARTITION BY FileID ORDER BY CheckTime DESC),
       FileID,
       FileSizeKB,
       CheckTime
FROM FileStats
WHERE CheckTime >= DATEADD(DAY, -1, GETDATE());

Then I want to compare what has happened between the last two checks (which is why I ran a ROW_NUMBER() over the partitioned set). There’s probably a better way to do this, but it’s how I solved it.

SELECT FileID,
       FileSizeKB,
       CheckTime,
       LastFileSize = LAG(FileSizeKB, 1, NULL) OVER (PARTITION BY FileID ORDER BY rn DESC),
       LastCheckDate = LAG(CheckTime, 1, NULL) OVER (PARTITION BY FileID ORDER BY rn DESC)
FROM
(
    SELECT rn = ROW_NUMBER() OVER (PARTITION BY FileID ORDER BY CheckTime DESC),
           FileID,
           FileSizeKB,
           CheckTime
    FROM FileStats
    WHERE CheckTime >= DATEADD(DAY, -1, GETDATE())
) b
WHERE b.rn <= 2;

This is the basic information needed to compare the last two checks, but they haven’t been compared yet.

To begin the comparison, calculate the percent difference as the basic measure of file growth. The formula is *D = (x2 - x1) / x1*.
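As a quick sanity check of the formula, here’s a worked example for a file growing from 1,000 KB to 1,250 KB. Note the `* 1.` trick: it forces decimal division so integer math doesn’t truncate the result, and the same trick shows up in the queries below.

```sql
-- Worked example of the percent-difference formula
-- for a file growing from 1,000 KB to 1,250 KB.
SELECT D = (1250 - 1000) / (1000 * 1.);  -- 0.25, i.e. 25% growth
```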

Calculating this gives some truly fascinating numbers, both large and small, but it doesn’t give any insight into how to interpret them. Now it’s time to compare this with historical data. The next query calculates the standard deviation and average *percent difference* across the entire history. Essentially, this measures each data file’s propensity to change over time.

SELECT FileID,
       AvgPctChange = AVG(PctChange),
       StdevPctChange = STDEVP(PctChange)
FROM
(
    -- LAG over CheckTime ascending returns the prior check's size,
    -- so PctChange is measured in the direction the file actually grew
    SELECT FileID,
           PctChange = (FileSizeKB - LAG(FileSizeKB, 1, NULL) OVER (PARTITION BY FileID ORDER BY CheckTime))
                       / (LAG(FileSizeKB, 1, NULL) OVER (PARTITION BY FileID ORDER BY CheckTime) * 1.)
    FROM FileStats
) avgs
GROUP BY FileID;

To actually make the decision, the current change needs to be compared with the observed historical changes. To do this, basic statistical measures are employed:

SELECT agg.FileID,
       agg.LastCheckDate,
       agg.LastFileSize,
       agg.CheckTime,
       agg.FileSizeKB,
       GrowthPctDiff = (agg.FileSizeKB - agg.LastFileSize) / (agg.LastFileSize * 1.),
       t_avg.AvgPctChange,
       t_avg.StdevPctChange,
       -- NULLIF guards against a divide-by-zero for files whose size
       -- has never changed (standard deviation of zero)
       z = (((agg.FileSizeKB - agg.LastFileSize) / (agg.LastFileSize * 1.)) - t_avg.AvgPctChange)
           / NULLIF(t_avg.StdevPctChange, 0)
FROM
(
    SELECT FileID,
           FileSizeKB,
           CheckTime,
           LastFileSize = LAG(FileSizeKB, 1, NULL) OVER (PARTITION BY FileID ORDER BY rn DESC),
           LastCheckDate = LAG(CheckTime, 1, NULL) OVER (PARTITION BY FileID ORDER BY rn DESC)
    FROM
    (
        SELECT rn = ROW_NUMBER() OVER (PARTITION BY FileID ORDER BY CheckTime DESC),
               FileID,
               FileSizeKB,
               CheckTime
        FROM FileStats
        WHERE CheckTime >= DATEADD(DAY, -1, GETDATE())
    ) b
    WHERE b.rn <= 2
) agg
JOIN
(
    SELECT FileID,
           AvgPctChange = AVG(PctChange),
           StdevPctChange = STDEVP(PctChange)
    FROM
    (
        SELECT FileID,
               PctChange = (FileSizeKB - LAG(FileSizeKB, 1, NULL) OVER (PARTITION BY FileID ORDER BY CheckTime))
                           / (LAG(FileSizeKB, 1, NULL) OVER (PARTITION BY FileID ORDER BY CheckTime) * 1.)
        FROM FileStats
    ) avgs
    GROUP BY FileID
) t_avg ON agg.FileID = t_avg.FileID
WHERE agg.LastFileSize IS NOT NULL
  AND (agg.FileSizeKB - agg.LastFileSize) / (agg.LastFileSize * 1.) > 0;

This query returns the historical statistics and the percent difference for any data file that has grown between the last two checks. That percent difference is evaluated against the historical average and assigned a z-score (the number of standard deviations the change falls from the mean). A z-score above 1.96 falls outside 95% of the “normal” distribution of changes, a common threshold for statistical significance. With this query (and assuming you’ve set up the appropriate collection and retention structures), you can find significant data file growths.
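To turn this into something alertable, one option is to wrap the whole query in a CTE and filter on the z-score. This is just a sketch: the CTE name GrowthStats and the 1.96 cutoff are my assumptions, and the cutoff is worth tuning against your own data.

```sql
-- Sketch: flag only statistically significant growths.
-- Assumes the full query above has been pasted into a CTE named GrowthStats.
WITH GrowthStats AS
(
    /* ... the final query from above ... */
    SELECT 1 AS placeholder  -- replace with the real query
)
SELECT FileID, CheckTime, GrowthPctDiff, z
FROM GrowthStats
WHERE z >= 1.96   -- roughly the top 2.5% of historical changes
ORDER BY z DESC;
```

From there it’s a short hop to an Agent job or a Reporting Services subscription that only emails when rows come back.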