Markus Stocker

Between information technology and environmental science with a flair for economics, the clarinet, and the world of soups and salads.

How can you tell it is?

If you analyze data, you realize, time and again, that you need to familiarize yourself with a rather messy ensemble of rows and columns. That they are all nicely aligned is misleading.

Assuming you deal with univariate data, you might begin exploring it with the help of histograms. On a lucky day, the data might turn out to look roughly bell shaped. Mostly, though, it will not remind you of a textbook probability distribution.

Yet, on closer inspection, you might notice that fitting a bell curve wouldn’t be entirely unrealistic if it weren’t for that one peak along the right tail. What to do? Let’s remove the corresponding data, arguing that it amounts to outliers, human error, faulty hardware, or otherwise erroneous data.

This assumes that a distribution underlies the sample data you are looking at. If most of the data looks Gaussian, then it might be safe to flag the remainder as “not quite fitting” and discard it. Is it? I’m cautious about such a conclusion. Sometimes, though, it’s exactly what you need to do.
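To make "flag" concrete, here is a minimal sketch, not the procedure I used on any real data. It assumes the bulk of the sample is roughly Gaussian and scores each value with a modified z-score built from the median and the median absolute deviation (MAD). The function name `flag_tail`, the threshold of 3.5, and the synthetic sample are all illustrative assumptions. The point is that it returns flags and deletes nothing, so the flagged values stay around to be looked at.

```python
import random
import statistics


def flag_tail(values, threshold=3.5):
    """Return one boolean per value: True if it sits far from the bulk.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    center = statistics.median(values)
    # Median absolute deviation: a spread estimate that a small peak in
    # the tail barely influences, unlike the standard deviation.
    mad = statistics.median(abs(v - center) for v in values)
    if mad == 0:
        # No spread to compare against, so nothing can be flagged.
        return [False] * len(values)
    # 0.6745 rescales the MAD so that the score is comparable to a
    # z-score when the bulk of the data is Gaussian.
    return [abs(0.6745 * (v - center) / mad) > threshold for v in values]


# Synthetic stand-in for real measurements (made-up numbers).
rng = random.Random(42)
bulk = [rng.gauss(10.0, 1.0) for _ in range(500)]  # the bell-shaped part
peak = [rng.gauss(18.0, 0.2) for _ in range(20)]   # the peak in the right tail
sample = bulk + peak

flags = flag_tail(sample)
flagged = [v for v, f in zip(sample, flags) if f]
kept = [v for v, f in zip(sample, flags) if not f]
print(len(flagged), "values flagged for inspection,", len(kept), "kept")
```

Whether the flagged values are errors is a question the code cannot answer. It only tells you where to look.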

An example. Recently, I was staring at such a histogram for vibration sensor data from an ongoing data capture effort. It had a peak in its tail. Erroneous? Well, it was. As it turned out, I had labeled a banana as an apple. I was staring at a histogram for apples, and there in the tail was data corresponding to an observation that truly was a banana.

However, it’s not always the right thing to do. As Fenger [1] recounts, the existence of the so-called “ozone hole” could have been observed as early as 1978 if “satellites ozone measurements below a certain value had [not] been discarded by the data treatment as being erroneous.”

[1] Jes Fenger. Air pollution in the last 50 years - From local to global. Atmospheric Environment 43(1), 13-22, 2009