Data Analysis Pitfalls: Spurious Correlations & Jumping to the Wrong Conclusions

Since the beginning of this month (June 2019) there has been a lot of speculation about just what Google did to its algorithm. Sites like Mercola’s Natural Health website lost 99% of their traffic. Since quite a few sites in the health sector were affected, it has been easy for many to assume that Google has a problem with this type of site. And since other health sites benefited, the reasoning goes, Google must be picking and choosing its favorites through manual intervention.

It’s fine (and fun) to speculate on these cause-and-effect theories, but it doesn’t really help affected sites remedy their situation. The fact is that a manual action against this site (and other similar sites reporting the same type of issue) is probably the wrong answer. It’s an easy conclusion to come to, but that doesn’t make it the right one.

Data Analysis Relies on Other Data to be Useful

In this modern age of computers and information, we’re presented with data all the time. We hear statements like “science doesn’t lie” and “the numbers prove…” It can be easy to look at a set of facts and jump to conclusions. The problem is that the mistake usually isn’t in the data or the science; it’s in the conclusions we’ve drawn from them.

If you’ve been in business for a while, you’ve certainly seen some of this paradoxical information in play. Sales are up 20% this quarter, yet somehow there’s less money to show for it? Going just by your sales, there should be more money, right? Twenty percent more, to be exact. The data doesn’t lie, so where is the mistake?

The mistake isn’t in the accounting; it’s in the assumption we made that a 20% increase in sales equates to 20% more money (or any more money at all). We need to take more into account here, like profit margins, cost of goods, overall efficiency, and so on.

A few years back I was helping a friend work out food costs for his restaurant – something he should have been doing all along, but for some reason wasn’t. We figured out the cost of everything that went into making a BLT sandwich and realized that at the $6.50 he was charging for the sandwich, he was actually losing about 68 cents. In the end, the more BLTs he sold, the less he made. If he sold 100 in a month, he wasn’t making $650 – he was actually losing $68.
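The BLT math is easy to sanity-check. This is a minimal sketch, taking the roughly 68-cents-per-sandwich loss at face value; the ingredient cost here is an assumed figure chosen to produce that loss, not a number from the actual restaurant:

```python
# Unit economics of the hypothetical BLT. Only the $6.50 menu price and
# the ~68-cent per-sandwich loss come from the story; the food cost is
# an assumed figure that reproduces them.
menu_price = 6.50
food_cost = 7.18  # assumed: bacon, lettuce, tomato, bread, mayo, labor on the line...

unit_profit = menu_price - food_cost      # negative means losing money per sandwich
monthly_units = 100
monthly_profit = unit_profit * monthly_units

print(f"Per sandwich: ${unit_profit:+.2f}")
print(f"Per 100/month: ${monthly_profit:+.2f}")
```

The point of the exercise: revenue (100 × $6.50 = $650) tells you nothing until you subtract what each sale actually costs.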

So, while it’s true that the numbers don’t lie, any conclusions you draw from them need to take all the factors into account. Every single one of them. And even if the food costs check out, that doesn’t mean profits aren’t being affected by labor costs, fixed costs, and incidentals which could come up, too.

Cause, Effect & Spurious Correlations

Another pitfall in analyzing data is assuming that two sets of numbers which line up are necessarily related. It is a fact that Google applies manual penalties to sites for various reasons. If you get one of these penalties, you’re going to lose almost all your traffic from Google. It is also a fact that Mercola (and many other sites) lost as much as 99% of their traffic after this June update. (Though I do have a feeling that the 99% number is inflated a bit for sensationalism.) Knowing these two things, it’s easy to assume that Google must have penalized these websites.

And, unfortunately, many have jumped to that conclusion.

Correlation of suicides by crashing motor vehicles with US sales of German cars

Just because two sets of data or facts line up, it doesn’t necessarily mean that one causes the other. For example, over a ten-year period, the number of German cars sold in the US correlates with the number of suicides by crashing motor vehicles.

These numbers are astounding and can only mean one thing – that German cars on the road make people want to kill themselves. We obviously need to do something about this – at once!

Or not.

Spurious correlations like this one are easy to see through – there’s simply no way these two things actually relate to one another. With other data, it’s not so easy, because the numbers simply look like they should be related. As with my previous example, we have to look at a lot more than the headline figures in order to be fairly certain that we’ve found the right cause to align with our observed effect.
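Here’s a hedged sketch of how easily this happens. The figures below are invented for illustration (not the real car-sales or suicide statistics), but any two series that both happen to trend upward over the same ten years will show a strong Pearson correlation:

```python
# Two unrelated series that both trend upward over ten "years".
# All numbers are invented for illustration.
import numpy as np

years = np.arange(10)
german_car_sales = 500_000 + 12_000 * years \
    + 1_000 * np.array([3, -5, 8, -2, 6, -7, 4, -1, 9, -6])   # made-up yearly noise
vehicle_suicides = 100 + 2 * years \
    + np.array([1, -1, 2, 0, -2, 1, 0, 2, -1, 1])             # made-up yearly noise

# Pearson correlation coefficient between the two series
r = np.corrcoef(german_car_sales, vehicle_suicides)[0, 1]
print(f"Pearson r = {r:.2f}")  # a strong correlation, with no causation in sight
```

The shared upward trend does all the work; the correlation coefficient can’t tell you that the two series have nothing to do with each other.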

Science: Proof of Hypothesis not Finding Answers from Proof

Scientific Method
The overall process involves making conjectures (hypotheses), deriving predictions from them as logical consequences, and then carrying out experiments based on those predictions to determine whether the original conjecture was correct.

Scientific Method Process – Wikipedia

Now that we understand the pitfalls and limitations of data analysis, how do we get on track to understanding the numbers?

Data analysis is more like science than math. With the scientific method, we start with a hypothesis and then create tests or experiments to figure out whether it is correct. If you come across results which don’t match your conjecture, you need to adjust the hypothesis and start over again.

A common mistake in analyzing data is to look at the data (which represents the results of our experiments), form a conclusion about what it’s telling us, and then stop there. To be certain that your analysis is correct, you need to take your conclusion (your new hypothesis) and run it back through more tests to make sure it all holds up.

In our German cars example above, my conclusion that German cars make people want to kill themselves can be tested – and likely disproved – in several ways. Though it may be hard to get data from the past, we can continue monitoring these statistics while also watching for any sign that a German car was present in the period leading up to each incident. We could look at the victims’ suicide notes to see if their stated reasons mention anything that might give us insight.

To flip the coin for a moment, it may seem obvious to us that the example here has absolutely no basis in fact – that it’s all a coincidence. Except, we truly can’t be certain of that until we’ve tested it, too.
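One cheap way to test the coincidence hypothesis is to ask whether the relationship survives once you remove what the two series obviously share – here, the upward trend over time. This is a minimal sketch with invented numbers (not the real data): if the correlation was only ever the shared trend, correlating the detrended residuals should make it vanish.

```python
# Testing the "it's all a coincidence" hypothesis by detrending.
# All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(42)
years = np.arange(10, dtype=float)
car_sales = 10 * years + rng.normal(0, 3, 10)   # pretend: German car sales
suicides = 2 * years + rng.normal(0, 1, 10)     # pretend: motor vehicle suicides

r_raw = np.corrcoef(car_sales, suicides)[0, 1]  # strong, but driven by the trend

# Subtract each series' own linear trend, then correlate the residuals.
res_sales = car_sales - np.polyval(np.polyfit(years, car_sales, 1), years)
res_suicides = suicides - np.polyval(np.polyfit(years, suicides, 1), years)
r_detrended = np.corrcoef(res_sales, res_suicides)[0, 1]

print(f"raw r = {r_raw:.2f}, detrended r = {r_detrended:.2f}")
```

When the detrended correlation collapses toward zero, you have evidence that the original alignment was the shared trend talking, not one series driving the other.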

Test, Test Some More, and then Test Again

With all the conjecture as to what happened in the last Google update, we have a lot of hypotheses to test. Here are a few of the most common ones that I’ve heard in relation to the Mercola story:

  • Google is manually penalizing any site that has to do with natural medicine
  • Google is (manually or otherwise) penalizing sites where Wikipedia has negative mentions of them
  • Google has a new (and potentially flawed and exploitable) method of determining whether claims are facts or not.
  • <insert your hypothesis here>

In the case of Mercola, there were also some fundamental SEO issues which may have helped to amplify the effect on them. From the looks of it, they have already remedied several of these. It’s difficult to run tests or make improvements based upon the first two hypotheses in my list, because they rely on factors that only Google controls. So unless Google admits to it, or all other possibilities are exhausted, we can’t know either way.

I do know that the scenario fits into my analysis of the June 2019 Google core update. Mercola does provide a lot of external citations, but those are all for well-established medical facts – and those citations sit at the bottom of the page, not clearly hooked into the context of the content where the claim is made. (Wikipedia’s citations, by contrast, are both structured and directly connected to the points where each claim is made, through links between the claim and the citation.)

If the World Wide Web is the world, and links are the roads, then Google hates a cul-de-sac.

Stockbridge Truslow – Equestics

The biggest problem, though, is that in any claim that it makes outside of mainstream and established medical fact, Mercola only links to and cites itself. For Google to be able to put a claim into its frame as a valid facet or element of an entity (the subject of a claim) there needs to be corroboration.

For example, I can go on all day long saying that Google can determine facts, but it isn’t likely to recognize that claim unless I back it up with corroborating statements from other sources which concur with my claims. It’s even better if I can cite a source which explains how it all works. Granted, both of these articles are 4 years old and describe a technology that was still in development, so for my claim to be truly trusted, I also need to cite some sort of source which corroborates that this stuff is happening now.

If your site was affected in a similar way…

  • Be careful not to focus on things you can’t control. Throwing up your hands and saying, “Google just hates me” does nothing to help you figure out if your hypothesis is true.
  • Try looking at some of the suggestions above, understand some of the concepts outlined in the links here – and give it a shot. The worst that happens is nothing. The best that happens is that you start to climb back up the ranks.
  • Try coming up with other hypotheses and things you can try – and try them.
  • Beware the pitfalls of spurious correlations and jumping to the wrong conclusion.
  • Don’t give up. Look at the data, hypothesize, test and repeat.

