Lying Statistics and the Lying Politicians That Use Them

Ok... first, a disclaimer. I don't often expose this, because, for some reason, guitar players, and musicians in general, are not supposed to have brains. Though my own brain is probably not even close to the most educated in the musician sphere*, I do have a degree in Math/Computer Science, with a minor in Philosophy and emphasis in statistics. Advanced statistics is VERY difficult. The whizzes in stats become actuaries after passing a series of tests, largely known for their insane difficulty. I'm NOT a whiz... in fact, I fell asleep in statistics class and fell off my chair one day.

This blog entry doesn't touch on advanced statistics and you don't need more than a 6th grade education to understand this. In other words: Donald Trump -- just stop here. You'll be lost after this point.

Today, I'm writing about how politicians -- or their corporate sponsors -- like to quote statistics to support their spin of the day. However, in many instances, the exact same data used to tout their viewpoint also can be used to support the exact opposite opinion.

How does this work? As an example, I'm going to cite some hypothetical data on household income, and poverty levels.

Example 1: Using "Average" and "Median" to determine income distribution

Suppose your friendly neighborhood pundit enthusiastically announces that under the current administration, AVERAGE income has increased by 15%.

Well -- this sounds like the administration policies must be really wise and effective. But, let's deconstruct this to see how this is not at ALL a meaningful indicator of income distribution.

Suppose we're working with this data set (reduced for simplification):

Incomes of the population of Dullsville:

SAMPLE 1: {$15,000, $16,000, $17,000, $17,500, $16,500}

Obviously, there are very few people in Dullsville, which flies in the face of experience. But it's pretty easy to calculate the Average -- you just add all the incomes together and then divide by the number of data items, in this case, 5.

I've done the math for you -- sum of incomes: $82,000, 5 persons, $82,000 / 5 = $16,400.

OK.... this would appear to be a pretty income-challenged town... an average income of $16,400, about $10,000 short of the poverty level for a family of 4. These people are most likely not eating very well. (Sad to say, the yearly income of an earner working a minimum wage job @ $7.25/hr, 40 hr/week, 52 weeks/year is just $15,080. To reach the federal poverty level, a family with a single earner must make at least $12.50/hr.)

But let's make this addition to the data sample. The owner of the sweatshop where the other five persons in Dullsville toil in desperation decides to move his home to Dullsville. His annual income is $14,000,000,000. Let's add this into our existing data set. Now, the "average" income is calculated to be $2,333,347,000 -- over $2 billion per family! Wow... now the town is one of the richest in the world! That tycoon can brag that during his reign, the average income of Dullsville increased by over $2.3 billion! (Lying bastard!)

Clearly, the statistic of Average Income is monumentally misleading -- in fact so bad that it's useless.
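For anyone who would rather let a computer do the adding and dividing, here is a minimal Python sketch of the same Average calculation, before and after the tycoon arrives. It leans on the standard library's statistics.mean; the variable names are invented for illustration, and the incomes are the hypothetical Dullsville figures from above.

```python
from statistics import mean

# SAMPLE 1 from the post: the five original (hypothetical) Dullsville incomes.
sample_1 = [15_000, 16_000, 17_000, 17_500, 16_500]

# The tycoon's (equally hypothetical) annual income.
tycoon_income = 14_000_000_000

# Average = sum of the incomes divided by how many incomes there are.
average_before = mean(sample_1)
average_after = mean(sample_1 + [tycoon_income])

print(f"Average before the tycoon: ${average_before:,.0f}")
print(f"Average after the tycoon:  ${average_after:,.0f}")
```

One extra data point is enough to drag the Average from five figures to ten.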

There's a second measure called The Median. This value is determined by ordering the data in increasing value -- smallest to largest. The median is the value in the center. If the number of data points is even, the two middle values are averaged to determine the Median. In the above example, before the Dullsville, Inc. magnate descends upon the town, the median income would be $16,500. The Median shows that there are as many data points below this value as are above it. With this simple data set, and before the addition of Mr. Dullsville, both the Average and the Median are really fairly accurate descriptions of the economic reality of Dullsville.
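The sort-and-pick-the-middle recipe can be sketched in a few lines of Python. This is only an illustration under the post's made-up numbers: the helper name middle_value is hypothetical, and the standard library's statistics.median is used alongside it as a cross-check.

```python
from statistics import median


def middle_value(values):
    """Median by hand: sort, then take the middle (or average the two middles)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: average the two middle values


# SAMPLE 1 from the post, plus the hypothetical tycoon.
sample_1 = [15_000, 16_000, 17_000, 17_500, 16_500]
tycoon_income = 14_000_000_000

median_before = middle_value(sample_1)
median_after = middle_value(sample_1 + [tycoon_income])

# The hand-rolled version should agree with the standard library.
assert median_before == median(sample_1)
assert median_after == median(sample_1 + [tycoon_income])

print(f"Median before the tycoon: ${median_before:,.0f}")
print(f"Median after the tycoon:  ${median_after:,.0f}")
```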

The problem with the Median as a measure is that, like the Average, wildly scattered data can seriously skew this measure of wealth. Let's say that four VERY impoverished families move to Dullsville... so the data set becomes:

SAMPLE 2: {$0, $2, $3, $4, $10,000, $15,000, $16,000, $17,000, $18,000}.

Now, the Average income is calculated as $8,445.44. The Median is $10,000. Obviously neither of these tells us much about the income of Dullsville. Even if the tycoon of Dullsville's income is substituted for that of the previously richest family, the Average rises to around $1.56 billion, but the Median remains $10,000. Neither statistic is at all descriptive of the actual financial state of Dullsville.
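The SAMPLE 2 figures can be double-checked with a short Python sketch. It assumes the tycoon simply replaces the largest value in the list; the variable names are made up for this example.

```python
from statistics import mean, median

# SAMPLE 2 from the post (hypothetical incomes).
sample_2 = [0, 2, 3, 4, 10_000, 15_000, 16_000, 17_000, 18_000]
tycoon_income = 14_000_000_000

average_2 = mean(sample_2)
median_2 = median(sample_2)

# Assumption: the tycoon's income is swapped in for the previously richest family.
with_tycoon = sorted(sample_2)[:-1] + [tycoon_income]
average_with_tycoon = mean(with_tycoon)
median_with_tycoon = median(with_tycoon)

print(f"SAMPLE 2 average: ${average_2:,.2f}, median: ${median_2:,}")
print(f"With tycoon  average: ${average_with_tycoon:,.2f}, median: ${median_with_tycoon:,}")
```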

How to more clearly interpret data

Accurately interpreting data is VERY complex. This is why actuaries are the elites of statisticians. I think casinos also employ actuaries to keep their business in the black. It's also very difficult to condense data such as income to a single value that is at all significant, and few people have the patience to digest anything at all complicated -- just ask Dr. Fauci. (If you're still reading at this point, CONGRATULATIONS... you're an unusual American!)

But if we don't understand at least how statistics can be manipulated to deceive, then how are we to make informed decisions with respect to our votes?

To restate, there IS no one number that can express a reliable measure of a data set, but there are other figures that can be calculated and that can help to understand data.

The first of these is The Range of data. This is simply the difference between the highest value and the lowest value in a dataset. Clearly, knowing the range gives you a clue that can help to understand the spread of data. In SAMPLE 2 above, for example, the range is $18,000 (excluding the tycoon's income). In SAMPLE 1, the range is a mere $2,500. From this, we can determine that the data in SAMPLE 1 is more consistent. So, possibly, the average and median values are more indicative of a real measure.
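As a quick sketch, the Range is a one-liner in Python. The function name value_range is hypothetical (Python already uses range for something else entirely), and the data are the two made-up samples from this post.

```python
def value_range(values):
    """Range of a data set: highest value minus lowest value."""
    return max(values) - min(values)


# The two hypothetical Dullsville samples from the post (tycoon excluded).
sample_1 = [15_000, 16_000, 17_000, 17_500, 16_500]
sample_2 = [0, 2, 3, 4, 10_000, 15_000, 16_000, 17_000, 18_000]

range_1 = value_range(sample_1)
range_2 = value_range(sample_2)

print(f"Range of SAMPLE 1: ${range_1:,}")
print(f"Range of SAMPLE 2: ${range_2:,}")
```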

A second summary figure is The Variance. The variance is generally an interim step toward calculating The Standard Deviation. To calculate the variance, you first calculate the Average, then take the difference between each data point and the Average and square it; then average this set of squared differences. Here are the steps for SAMPLE 1 as I calculated the Variance:

VARIANCE OF SAMPLE 1:

1. Average = (15,000+16,000+17,000+17,500+16,500) / 5 = (82,000 / 5) = 16,400

2. Differences between each data point and the Average, squared:

(16,400 - 15,000) ^ 2 = 1,400 ^ 2 = 1,960,000
(16,400 - 16,000) ^ 2 =   400 ^ 2 =   160,000
(16,400 - 17,000) ^ 2 =   600 ^ 2 =   360,000
(16,400 - 17,500) ^ 2 = 1,100 ^ 2 = 1,210,000
(16,400 - 16,500) ^ 2 =   100 ^ 2 =    10,000

(Note that for the difference, I'm using "absolute values", that is, the minus sign is ignored. It doesn't matter, because a negative number squared is positive anyway.)

3. Average of differences squared from Step 2:

(1,960,000 + 160,000 + 360,000 + 1,210,000 + 10,000) = 3,700,000
3,700,000 / 5 = 740,000

This is the Variance, abbreviated by the Greek letter "σ²" (Sigma squared).

4. The standard deviation, "σ" (Sigma), is the square root of the variance.
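The four steps above can be sketched in Python more or less line for line. This is a minimal illustration using the "Population" form of the calculation (dividing by the number of data points); the variable names are invented for the example.

```python
from math import sqrt

# SAMPLE 1 from the post (hypothetical incomes).
sample_1 = [15_000, 16_000, 17_000, 17_500, 16_500]

# Step 1: the Average.
average = sum(sample_1) / len(sample_1)

# Step 2: difference between each data point and the Average, squared.
squared_differences = [(income - average) ** 2 for income in sample_1]

# Step 3: the Variance is the average of those squared differences.
variance = sum(squared_differences) / len(sample_1)

# Step 4: the Standard Deviation is the square root of the Variance.
standard_deviation = sqrt(variance)

print(f"Average:            {average:,.0f}")
print(f"Variance:           {variance:,.0f}")
print(f"Standard deviation: {standard_deviation:,.2f}")
```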

If this seems like a lot of ballyhoo just to calculate a number that seems no more informative than just average and median -- well, it's not for naught. Once you know the standard deviation, you can determine which values lie more than one standard deviation away from the average.

Let's take the example of the incomes of five families in Dullsville -- not including the tycoon. I'm going to do the math off blog -- you don't want to deal with this stuff, I know, which is how we got into trouble in the first place. But here's a summary of the calculations of the Standard Deviation of SAMPLE 1:

OK... according to my potentially flawed calculation, the standard deviation is roughly $860. (I used the formula for calculating the standard deviation in a "Population"... this is when the data are the only values we're interested in. There's a slightly different formula for calculating Sigma for a sample from a larger population.)
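To see the difference between the two formulas, here is a small Python sketch. The standard library's statistics module offers both: pstdev divides by the number of data points (the "Population" version), while stdev divides by one less than that (the version for a sample drawn from a larger population). The variable names are made up for illustration.

```python
from statistics import pstdev, stdev

# SAMPLE 1 from the post (hypothetical incomes).
sample_1 = [15_000, 16_000, 17_000, 17_500, 16_500]

# "Population" formula: divide the summed squared differences by n.
population_sigma = pstdev(sample_1)

# "Sample" formula: divide by n - 1 instead, which gives a slightly larger value.
sample_sigma = stdev(sample_1)

print(f"Population standard deviation: {population_sigma:,.2f}")
print(f"Sample standard deviation:     {sample_sigma:,.2f}")
```

With only five data points the gap between the two is noticeable; with thousands of data points it mostly washes out.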

Once we know the standard deviation, we can inspect the data and see which data points are within ONE STANDARD DEVIATION of the average. In my original sample, these would be the data points between about $15,540 and $17,260 ($16,400 plus or minus $860) -- three of the data points, $16,000, $16,500, and $17,000, lie within one standard deviation of the average. In a normally distributed population, a statistician would expect about 68% of data points to lie within one standard deviation. In my tiny population, it's 3/5 or 60% -- not too bad for totally trumped up (oops, excuse me) data.
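Counting the data points inside that one-standard-deviation band is easy to sketch in Python. This assumes the "Population" formula (statistics.pstdev) and treats the band as inclusive at both ends; the variable names are hypothetical.

```python
from statistics import mean, pstdev

# SAMPLE 1 from the post (hypothetical incomes).
sample_1 = [15_000, 16_000, 17_000, 17_500, 16_500]

average = mean(sample_1)
sigma = pstdev(sample_1)  # population standard deviation

# The band that is one standard deviation either side of the average.
lower = average - sigma
upper = average + sigma

within_one_sigma = [income for income in sample_1 if lower <= income <= upper]
share_within = len(within_one_sigma) / len(sample_1)

print(f"Band: ${lower:,.0f} to ${upper:,.0f}")
print(f"Inside the band: {sorted(within_one_sigma)}")
print(f"Share inside: {share_within:.0%}")
```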

If I add back the tycoon's exorbitant income, it will lie far beyond one standard deviation from the average, even after both are recalculated -- so we spot it as an anomaly.
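A rough Python sketch of that anomaly check: recalculate the average and standard deviation with the tycoon included, then measure how many standard deviations each income sits from the average. The names distance_in_sigmas and the one-sigma cutoff are illustrative assumptions, not a standard outlier test.

```python
from statistics import mean, pstdev

# SAMPLE 1 plus the tycoon (all hypothetical incomes from the post).
sample_1 = [15_000, 16_000, 17_000, 17_500, 16_500]
tycoon_income = 14_000_000_000
incomes = sample_1 + [tycoon_income]

# Recalculate both figures with the tycoon included.
average = mean(incomes)
sigma = pstdev(incomes)


def distance_in_sigmas(value):
    """How many standard deviations a value lies from the average."""
    return abs(value - average) / sigma


# Assumed cutoff: anything more than one standard deviation away is flagged.
anomalies = [income for income in incomes if distance_in_sigmas(income) > 1]
tycoon_distance = distance_in_sigmas(tycoon_income)

print(f"Flagged as anomalies: {anomalies}")
print(f"Tycoon's distance from the average: {tycoon_distance:.2f} standard deviations")
```

Note that the five ordinary incomes stay inside the band here only because the tycoon has inflated the standard deviation so much; the check flags him, not them.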

Likewise, for the data sample that includes extremely low values, the average and median values are equally meaningless.

What does this mean in practical terms?

Well, for starters, DJT repeatedly bragged that:

"... median household income is up $5,000 since I took office"

We can see that, on its own, this is a meaningless claim. We know that the divide between high-echelon earners and the rest of us increased significantly, because of tax cuts that heavily benefitted higher incomes. The number of people in this group has increased -- the 1% is now the 1.25% or something like that. With a population of 300,000,000-plus people, this means that the number of gazillionaires has increased from 3 million to 3.75 million. As the Dullsville example showed, the median barely registers a change like that at the top, so the figure is silent about the already wealthy becoming even more wealthy. It does NOT mean that all Americans' incomes are $5,000 higher. In fact, it says absolutely NOTHING about how income is spread across the lower half of the population. To make ANY conclusion about the state of affluence in the general population, we would have to calculate the Standard Deviation and eliminate all those values that lie outside ONE STANDARD DEVIATION as anomalies.

Equally important for understanding the economic status of the general population, we would also discount the low values that lie beyond one standard deviation.

The overall point is that clearly, the ex-president -- and probably every president before and since -- has used and will use statistics to support whatever point they're trying to make, fully knowing that mean and median are completely meaningless on their own.

Just a note -- all the increased income brags ignore one extremely significant point: inflation. Average income in 1970 was $52,000 (again, with the AVERAGE.) To purchase what at that time would cost $52,000 (say a house), would now cost $350,000+ (unless you're in Austin, in which case that house would cost about $1.25 million using the 25 year inflation on the proportional increase on estimated value of my own house as a guide.)

------------------------------------------------------------------------------------------

*The Big Brain Musician award most likely belongs to one of these:

Brian May (lead guitar of Queen), PhD in astrophysics;
Phil Alvin (lead vocal of The Blasters), advanced degree in math;
Dexter Holland (lead vocal, The Offspring), PhD in Molecular Biology;
Art Garfunkel, masters in Math;
Sterling Morrison (Velvet Underground), PhD in Medieval Literature from UT, no less!!!;
Milo Aukerman (vocal, The Descendents), PhD in Molecular Biology from USC;

And my all-time favorite (besides Leonardo Da Vinci):
Charles Ives, who founded and ran a successful insurance company, in the process advancing many innovations in financial services. This allowed him the freedom to write the music that he WANTED to, instead of the music he NEEDED to. He's one of my life models.

From the Wikipedia article on Charles Ives:
"Igor Stravinsky praised Ives. In 1966 he said: [Ives] was exploring the 1960s during the heyday of
Strauss and Debussy. Polytonality; atonality; tone clusters; perspectivistic effects; chance; statistical
composition; permutation; add-a-part, practical-joke, and improvisatory music: these were Ives’s
discoveries a half-century ago as he quietly set about devouring the contemporary cake before the rest of
us even found a seat at the same table."

Three Rights Make a Left
