More on the lognormal model
When I recently pointed out how the records exposed in data breaches seem to follow a lognormal distribution, an astute reader, Patrick Florer, asked how a very large breach would affect how well the lognormal model fits the data. It turns out that it essentially doesn’t affect it at all. I’ll try to explain this in three ways. One appeals to intuition, one is a proof by picture, and the third actually crunches some numbers to show this.
Let’s suppose that there has been a data breach that’s been reported but we don’t know the size of yet. Let’s suppose that this hypothetical data breach has exposed 200 million records and see how adding that single extra data point to the existing data would affect the accuracy of the lognormal model that has a logmean of 3.5 and a logdeviation of 1.2. I'll be a bit sloppy here and use "mean" instead of "logmean" and "standard deviation" instead of "logdeviation," but that's just to make the following a bit more understandable. I've also used base 10 logs here. That's because most people immediately know what the base 10 log of 1 million is, while almost nobody knows what the natural log of 1 million is.
In the first place, note that the lognormal model deals with the logarithms of the sizes of breaches. So while 200 million certainly sounds like a big number, its log is only 8.3, which doesn’t sound quite as big. And note that 8.3 is only 4 standard deviations above the mean of 3.5. Events 4 standard deviations from the mean aren’t that common, but they’re also not really that rare. The so-called Six Sigma quality control process tries to manage events that are 6 standard deviations from the mean, for example.
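The arithmetic behind that "4 standard deviations" claim can be checked with a few lines of Python. This is only a sketch using the model parameters quoted above (logmean 3.5, logdeviation 1.2, base 10 logs); the variable names are made up for illustration.

```python
import math
from statistics import NormalDist

# Model parameters quoted in the post (base 10 logs).
log_mean = 3.5
log_sd = 1.2

# The hypothetical breach: 200 million records.
records = 200_000_000

log_size = math.log10(records)           # log of the breach size
z = (log_size - log_mean) / log_sd       # standard deviations above the mean
tail = 1 - NormalDist().cdf(z)           # chance of a log this large or larger

print(f"log10 size: {log_size:.2f}")
print(f"standard deviations above the mean: {z:.2f}")
print(f"upper tail probability: {tail:.1e}")
```

The log comes out at about 8.3 and the z-score at about 4.0, with an upper tail probability of roughly 3 in 100,000 under the model. That's uncommon, but nowhere near the territory of a 6 standard deviation event.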
And because the logs of a lognormal distribution follow a normal distribution, which is symmetric about its mean, we would expect that a really large breach would have the same effect on the accuracy of the model as a really small breach would. So while our intuition might lead us to believe that a really large breach would throw off a lognormal model, it actually has essentially the same effect as a very small breach.
That’s the appeal to intuition (also known as “hand waving”) part.
Another way to look at how a single large breach affects the lognormal model is to look at how well the data fits the model after this additional data point is added to it. Here is a graph that shows how well the existing data fits the lognormal model. In this graph, the straight line is the model and the squares represent actual breaches. There’s already a breach that exposed close to 100 million records, where the log of the number of records exposed is about 8. Our hypothetical big breach just adds a single data point slightly past that one and fairly close to it. In this case, you might expect that this doesn't really change things much.
That’s the proof by picture. It’s still not very rigorous, but it might give you an idea of why adding a single breach that exposed 200 million records doesn’t really affect the lognormal model of breaches much.
A more careful approach would add the hypothetical point to the data set and see if the distributions that you see in the old data set and the new, bigger data set are significantly different. If we look at these two distributions, we find that both are consistent with a lognormal model with a logmean of 3.5 and a logdeviation of 1.2. We can use a Shapiro-Wilk test for normality on the logs of the breach sizes, for example, to check this. If we do this we get p-values that are much greater than 0.05 in both cases, so we can't reject normality of the logs either with or without the additional data point, and both data sets agree with the same lognormal model with a reasonable degree of accuracy.
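To get a feel for why a single point changes so little, here is a rough sketch in Python. It does not use the real breach data and it does not run a Shapiro-Wilk test. Instead it draws synthetic logs from a normal distribution with the mean and standard deviation quoted above, then refits those two parameters after adding the hypothetical 200 million record breach. The sample size of 1,000 is purely an assumption for illustration, since the size of the actual data set isn't given here.

```python
import math
import random
from statistics import mean, stdev

rng = random.Random(0)   # fixed seed so the sketch is repeatable

# ASSUMPTION: 1,000 synthetic breaches, not the real data set.
n = 1000
logs = [rng.gauss(3.5, 1.2) for _ in range(n)]

# Add the hypothetical breach of 200 million records (base 10 log).
logs_plus = logs + [math.log10(200_000_000)]

# Fit the mean and standard deviation of the logs, before and after.
m0, s0 = mean(logs), stdev(logs)
m1, s1 = mean(logs_plus), stdev(logs_plus)

print(f"mean of logs:   {m0:.3f} -> {m1:.3f}")
print(f"std dev of logs: {s0:.3f} -> {s1:.3f}")
```

With this many points, both fitted parameters move by less than 0.01 or so, which is tiny compared to a standard deviation of 1.2. A smaller data set would move more, so the size of the shift depends on the assumed sample size. If SciPy is available, a normality test such as scipy.stats.shapiro could then be run on each list to compare the p-values.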
So the bottom line is that the lognormal model of data breaches seems to fit the available data fairly well, and it’s still fairly accurate if we add a hypothetical very large breach to the historical data.




