
Beyond the Numbers: Decoding the Dance of Causation and Correlation

Updated: Mar 14


Causation and Correlation

The survey showed that 60% of startup founders use data analytics in their businesses. Of those founders, 75% are aware of the correlation vs. causation trap, yet only 50% are confident they can avoid it.


That's an exciting and essential issue, and one we can address here. Let's start with the basics: correlation doesn't equal causation. Just because two things are correlated does not mean one causes the other. For example, there is a high correlation between the number of ice cream cones sold and the number of shark attacks. However, this does not mean that eating ice cream causes shark attacks. In reality, ice cream sales and shark attacks are likely driven by the same factor: hot weather.


Put simply:

  • Correlation is a statistical measure of how strongly two variables move together.

  • Causation is a relationship in which a change in one variable produces a change in the other.
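The ice cream/shark example above can be made concrete in a few lines of Python. All the numbers below are invented for illustration; both series track the same confounder (temperature), so they correlate strongly with each other even though neither causes the other:

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical monthly data: temperature drives both of the other series.
temperature   = [12, 15, 19, 24, 28, 31, 30, 26, 20, 14]   # avg °C
ice_cream     = [20, 28, 40, 60, 85, 99, 95, 70, 45, 22]   # cones sold (k)
shark_attacks = [1, 1, 2, 4, 6, 8, 7, 5, 2, 1]             # incidents

print(pearson_r(ice_cream, shark_attacks))  # close to 1.0
print(pearson_r(temperature, ice_cream))    # also close to 1.0
```

A high coefficient between ice cream and shark attacks tells you nothing about which, if either, causes the other; it only tells you they move together.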

Discovering a correlation between two metrics is valuable because it enables you to predict future outcomes. Understanding the cause behind a phenomenon is even more significant, because it opens the possibility of making changes and influencing outcomes. It's important to note that causation is rarely a simple one-to-one relationship; multiple factors (often across multiple steps) combine to produce a particular result. Take, for instance, summertime car crashes, where elements like alcohol consumption, inexperienced drivers, increased daylight hours, and summer vacations all contribute to the overall occurrence. Consequently, despite the noblest intentions, attaining a 100% causal explanation is uncommon. Instead, multiple independent metrics come into play, each explaining a portion of the behavior of the dependent metric. Nonetheless, even partial causality holds substantial value in understanding and managing outcomes.
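One way to picture "each metric explains a portion of the behavior" is the R² of a regression: adding a second driver explains more of the outcome's variance than either driver alone. This is a minimal sketch on simulated, invented data (NumPy assumed available), loosely echoing the car-crash example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical standardized drivers of summertime crashes (simulated)
alcohol  = rng.normal(size=n)
daylight = rng.normal(size=n)
noise    = rng.normal(size=n)

# The outcome depends partly on each driver, plus unexplained variation
crashes = 0.6 * alcohol + 0.3 * daylight + 0.5 * noise

def r_squared(X, y):
    """Share of y's variance explained by a least-squares fit on X."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

r2_alcohol = r_squared(np.column_stack([alcohol]), crashes)
r2_both = r_squared(np.column_stack([alcohol, daylight]), crashes)
print(r2_alcohol, r2_both)  # the two-driver fit explains more variance
```

Neither R² reaches 1.0: part of the behavior stays unexplained, which is exactly the "partial causality" the paragraph above describes.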


You prove causality by finding a correlation, then running an experiment in which you control the other variables and measure the difference. This is hard to do: no two users are identical, and it's often impossible to subject a statistically significant number of people to a properly controlled experiment in the real world. Interpreting the data can quickly become arduous. Still, there is hope in those situations.


If you have a sufficiently large user sample, conducting a reliable test becomes possible even without controlling all other variables, because their impact becomes relatively insignificant. This is why companies like Google can experiment with subtle factors like hyperlink color, and Microsoft can precisely measure the impact of slower page load times on search rates. For the average startup, it is more practical to run simpler tests that focus on a few variables and assess their impact on the business. Two significant factors here are the size of your business and the complexity of your product: the bigger the organization, the more extensive its analytical oversight should be. That's no excuse for smaller organizations to avoid analytical rigor, though; rigor is vital to your scaling pace and your ability to catch up with competitors or get ahead.
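As a sketch of what "a sufficiently large sample" buys you, here is a two-proportion z-test for a hypothetical link-color A/B test. All counts are invented; note that even with 10,000 users per arm, a modest lift can fall short of the usual 0.05 significance threshold:

```python
from math import erf, sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical: variant A converts 11.8%, variant B 11.0%
z, p = two_proportion_z(conv_a=1180, n_a=10_000, conv_b=1100, n_b=10_000)
print(f"z={z:.2f}, p={p:.3f}")
```

Here p lands just above 0.05, so the observed lift could plausibly be noise; a larger sample, or a larger true effect, is needed before acting on it.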


Before delving into the examples and challenges, it's essential to emphasize the significant role of causal reasoning. As human beings, we are inherently attuned to cause and effect in our everyday lives. Whether it's playing pool or considering the effects of vaccination, we constantly contemplate causality. When we aim the cue ball at a particular angle, we ponder whether it will result in the remaining balls finding the corner pocket. Similarly, when considering vaccination, we weigh the likelihood of contracting COVID based on our decision. We regularly make such decisions, some successful and others not so much (oh, life). Whenever we contemplate the possible consequences of our choices, whether consciously or subconsciously, we inherently consider the underlying cause. We envision how the world would unfold under different circumstances: What if we take action X? What if we opt for alternative Y instead? While attributing human consciousness to this thought process might be ambitious, causation is poised to bring a transformative revolution in data utilization. I would dare suggest that causality lies at the heart of AI-like "thinking" within the current AI-driven madness. That may be why many founders struggle to reduce the risk of the correlation vs. causation trap. Or is that what is driving our next technological revolution? Both? I don't know…


Let's run this through examples from financial services and retail. The companies are real, but the cases are hypothetical, chosen to illustrate these situations. These are well-managed companies; I'm only using their names to make the examples more concrete.


FinServ company: LTCM

Metric: Historical volatility

Assumption: LTCM believed historical volatility was a reliable predictor of future market behavior.


Issue: LTCM, a hedge fund, relied heavily on historical volatility and historical price relationships as key inputs to its investment strategy, assuming that patterns observed in past data would persist in the future. However, it failed to consider other factors and the limitations of using historical data as a predictor. In 1998, when Russia defaulted on its debt and global markets seized up, LTCM suffered catastrophic losses because it had not adequately accounted for unforeseen events and their cascading effects. Relying on historical volatility alone reflected a flawed understanding of the relationship between correlation and causality, and it led to a flawed investment approach.


Retail company: J.C. Penney

Metric: Everyday low-pricing (EDLP) strategy

Assumption: J.C. Penney believed implementing an everyday low-pricing strategy would increase sales and customer loyalty.


Issue: In 2012, J.C. Penney implemented a new pricing strategy, eliminating frequent discounts in favor of a simplified everyday low-pricing model. The assumption was that customers would respond positively to transparent and consistent pricing. However, this decision was based on a correlation between lower prices and increased sales at other retailers. J.C. Penney failed to consider other factors, such as brand perception, customer behavior, and its overall value proposition. As a result, the strategy backfired, producing decreased sales and customer dissatisfaction. The company mistakenly assumed a causal relationship between lower prices and increased sales, failing to fully understand the complexities of customer buying behavior.


Conclusion: Getting this done is a ton of work. I'm not saying this to discourage anyone; I know it's critical to many companies' success. But some metrics and decisions can only be verified through an analytically intensive process, so you should devise at least a minimum-level approach to these matters. If your organization builds rigor around its critical metrics and the correlations between them, you will start noticing it in conversations at multiple levels. Your RevOps team must remain skeptical of phrases like "We have always done it that way," "What else could we do?" or (my favorite) "I've done this in my previous company." Jokes aside, something isn't accurate just because someone tells you it is. Previous experience is valuable, but every company has different products and operates in a different market, and digging deeper into the underlying causation makes all the difference. Focusing on a handful of metrics and verifying correlation vs. causation significantly de-risks your decision-making. As you embark on this journey with the rest of your teams, you eventually reach the point where you feel comfortable with the risk you're taking. That's pretty good.


Here are some more practical suggestions for GTM teams, especially analysts, RevOps, and Finance groups:


  1. Use multiple data sources. Leveraging several independent data sources minimizes the likelihood of drawing false correlations: if the same correlation holds across sources, the risk of an erroneous association drops.

  2. Use statistical tests. Statistical tests let you assess the magnitude and significance of a correlation, so decisions built on that correlation are better informed.

  3. Consider multiple factors. When evaluating the causal relationship between two variables, it is crucial to account for other influences on the correlation. For instance, if you investigate whether ice cream sales cause shark attacks, you must consider additional variables, such as hot weather, that may drive the observed correlation.

  4. Leverage a control group. A control group is helpful when determining whether two variables are causally related: it is a group of people not exposed to the variable you are testing. By comparing the control group's results to those of the exposed group, you can better understand whether the variable is driving the correlation. It's more than just a method for the pharma industry.

  5. Run a time series analysis. If you have data on two variables over time, a time series analysis can help you identify patterns, such as one variable consistently moving before the other, that suggest a causal relationship.

  6. Control for confounding variables. Confounders are variables that could affect the correlation between the two variables you care about. Identifying and controlling for them is essential when establishing causality; randomization, matching, careful experimental design, or more curated and relevant data all reduce the risk.

  7. Use machine learning to identify causal relationships. I know this sounds techy, but a growing number of tools use algorithms that look for patterns in the data suggesting that one variable causes another.

  8. Use a difference-in-differences (DD) model. DD is a statistical method for estimating the causal effect of an intervention on an outcome. It compares two groups, a treatment group that received the intervention and a control group that did not, and then compares how the outcome changed over time in each. Don't be swayed by the fancy name.

  9. Run RCTs (randomized controlled trials), the gold standard for determining causal relationships. RCTs randomly assign participants to either a treatment group, which receives the intervention being studied, or a control group, which does not. The results are then used to determine whether the intervention has a causal effect on the outcome.
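For the DD model in item 8, the simple two-period, two-group case reduces to a one-line calculation. The sales figures below are hypothetical, standing in for stores that adopted a pricing change (treatment) versus stores that did not (control):

```python
def diff_in_diff(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """DD estimate: treatment-group change minus control-group change."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# Hypothetical avg weekly sales before/after the intervention
effect = diff_in_diff(treat_pre=100.0, treat_post=112.0,
                      ctrl_pre=98.0, ctrl_post=103.0)
print(effect)  # 7.0
```

The control group's change (+5) captures what would have happened anyway, so only the remaining +7 is attributed to the intervention. This assumes the two groups would have trended in parallel without it, which is the key assumption to check before trusting a DD estimate.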
