At Influential, we strive to provide insights into social data by scoring our social accounts on over 10k data-points. Scores give users the ability to easily compare social accounts, even across social-networks.
We use standard scoring (z-scores) to normalize our data. Accounts are scored on a “curve”, similar to classes in academia. We take the (normal) distribution of each data-point (across all accounts for a given social-network), and use it to determine how many standard deviations each account is from the average. Each of these “z-scores” is then scaled to an intuitive grade that most people will be familiar with.
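As a minimal sketch of the idea, the snippet below computes z-scores for one data-point and maps them onto a 0-100 grade. The function names, the sample values, and the grade scale (a center of 75 with 10 points per standard deviation) are assumptions made for illustration, not our production scale.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return how many standard deviations each value lies from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation across all accounts
    if sigma == 0:
        # Every account has the same value, so nobody deviates from the mean.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def to_grade(z, center=75.0, points_per_sd=10.0):
    """Map a z-score to a 0-100 grade.

    center and points_per_sd are hypothetical parameters chosen for this
    example; the result is clamped to the 0-100 range.
    """
    return max(0.0, min(100.0, center + points_per_sd * z))


# Made-up raw values for a single data-point across eight accounts.
raw_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
zs = z_scores(raw_values)
grades = [to_grade(z) for z in zs]
```

An account sitting exactly on the mean lands on the center of the scale, and every standard deviation above or below it moves the grade by a fixed number of points.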
Scoring is straightforward when data falls into a normal distribution, but that is often not the case. Most of our raw data is skewed, meaning one of the “tails” of our distribution is longer than the other. The longer tail pulls the mean (average) away from the 50th percentile, meaning a significant majority of the accounts would score either below or above average, depending on the direction of the skew. This makes our score less intuitive, and thus less useful.
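A small made-up example shows the effect. The values below are invented for illustration, with one large outlier forming a long right tail; the helper name is an assumption as well.

```python
from statistics import mean, median


def share_below_mean(values):
    """Return the fraction of values that fall strictly below the mean."""
    mu = mean(values)
    return sum(1 for v in values if v < mu) / len(values)


# Invented right-skewed sample: most accounts are small, one is very large.
sample = [1, 2, 2, 3, 3, 3, 4, 5, 8, 40]

sample_mean = mean(sample)      # pulled upward by the long tail
sample_median = median(sample)  # the 50th percentile
below = share_below_mean(sample)
```

Here the mean sits well above the median, so most of the accounts would be graded as below average even though they are typical for the group.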
Luckily, most of our skewed distributions approximate lognormal distributions, which look like normal distributions that have been logarithmically scaled. We can transform these distributions back to normal distributions by applying the inverse of this scale (another logarithmic function), which is determined by machine-learning algorithms. Once the distribution is normal, we can go back to scoring as usual.
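The sketch below illustrates the transformation on simulated data. It uses a plain natural logarithm as a stand-in for the learned scale described above, so the transform, the simulated sample, and the `skewness` helper are all assumptions for illustration rather than our actual pipeline.

```python
import math
import random
from statistics import mean, pstdev


def skewness(values):
    """Return the sample skewness (third standardized moment)."""
    mu = mean(values)
    sigma = pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])


random.seed(42)  # fixed seed so the example is repeatable

# Simulated lognormal data standing in for a skewed raw data-point.
raw = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]

# Assumed transform: a natural log undoes the lognormal scaling.
logged = [math.log(v) for v in raw]

raw_skew = skewness(raw)        # strongly positive: long right tail
logged_skew = skewness(logged)  # close to zero: roughly symmetric
```

Once the skewness is near zero, the mean and the 50th percentile roughly coincide again, and z-scores can be computed on the transformed values as before.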
Why Scaling Works
Let’s say we are watching 10 swimmers compete at the Olympics. As each competitor finishes, the gap between consecutive times increases (2 seconds between 1st and 2nd vs. 20 seconds between 9th and 10th). In this scenario, the distribution of times is skewed toward the faster swimmers, and we can argue that a one-second difference is much more significant between first and second than between ninth and tenth. Scaling our distribution allows us to account for this change in significance. This principle is also applied when scaling social data.
Although we score individual data-points on a normal distribution, our overall scores are a weighted average of these individual scores. Averaging multiple independent, normally distributed scores can produce an overall distribution with a smaller standard deviation than the individual ones. Averaging interdependent distributions may also cause the overall distribution to deviate from normality. It is important, then, for the overall distribution to be displayed, so that the score may be viewed in the proper context (for instance, if few scores are above 85, then an 84 is a very good score).
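To see why the spread shrinks, the simulation below averages four independent scores per account. The weights, the per-score mean of 75, and the standard deviation of 10 are hypothetical values chosen for the example. For independent scores with equal standard deviation and weights that sum to 1, the standard deviation of the weighted average is that standard deviation multiplied by the square root of the sum of the squared weights.

```python
import math
import random
from statistics import pstdev

random.seed(0)  # fixed seed so the example is repeatable

WEIGHTS = [0.4, 0.3, 0.2, 0.1]  # hypothetical weights, summing to 1
SCORE_MEAN = 75.0               # assumed mean of each individual score
SCORE_SD = 10.0                 # assumed standard deviation of each score


def overall_score(weights):
    """Weighted average of independent, normally distributed scores."""
    return sum(w * random.gauss(SCORE_MEAN, SCORE_SD) for w in weights)


overall = [overall_score(WEIGHTS) for _ in range(20000)]

observed_sd = pstdev(overall)
# Theoretical spread of the weighted average of independent scores.
expected_sd = SCORE_SD * math.sqrt(sum(w * w for w in WEIGHTS))
```

With these assumed weights the overall scores cluster noticeably closer to the mean than any single score does, which is why a score a few points below a rarely reached threshold can still be a very good one.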