#2883: Correlation Beyond Pearson: 5 Techniques You Need

Pearson, Spearman, Kendall, partial, distance correlation — when to use each one and why most people stop too soon.

Episode Details
Episode ID
MWP-3052
Published
Duration
29:17
Audio
Direct link
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Correlation analysis is one of the first things taught in statistics — and one of the most frequently misapplied. This episode walks through the full toolkit, starting with the three foundational measures: Pearson's r for linear relationships between normally distributed data, Spearman's rho for monotonic relationships that are robust to outliers, and Kendall's tau for small samples with ties. Each has strengths and blind spots, and the classic Anscombe quartet demonstrates why a correlation coefficient without a scatterplot is a press release, not an analysis.

From there, the episode moves to advanced techniques that address real-world complexity. Partial correlation isolates the relationship between two variables while controlling for confounders — the tool that makes the ice-cream-and-drowning correlation vanish when you control for temperature. Distance correlation, a more recent development from 2007, detects any form of dependence, including non-monotonic relationships like Y = X² that Pearson and Spearman both miss entirely. Finally, canonical correlation analysis extends the framework to entire sets of variables, finding linear combinations that maximize correlation between two groups — a workhorse in genomics, neuroscience, and multivariate statistics. The episode covers regularization, the kernel trick, and the critical warning about autocorrelation in time series data.



Corn
Daniel sent us this one — he's asking about correlation analysis, both the basics and the advanced stuff. He wants to know what techniques are out there, what the pitfalls are, and how to think about correlation beyond just "here's a number, ship it." There's a lot lurking under the surface here, because correlation is one of those things everyone learns in week one of statistics and then... mostly gets wrong for the rest of their career.
Herman
The prompt gets at something really important — what do you actually do when the obvious Pearson correlation isn't enough? Because the standard intro stats version of correlation is basically the statistical equivalent of beige wallpaper. It works fine on tidy, linear, normally distributed data and completely falls apart everywhere else.
Corn
Which is everywhere.
Herman
Which is basically everywhere, yes. So let's start with the foundation and then build up. Pearson's r — the one everyone knows. It measures linear correlation between two continuous variables, ranges from negative one to positive one, zero means no linear relationship. It was developed by Karl Pearson in the eighteen nineties, building on Francis Galton's earlier work on regression.
Corn
Galton, who also gave us eugenics and the concept of statistical regression to the mean. A mixed legacy, let's say.
Herman
But the math stuck. Pearson's r is essentially covariance divided by the product of standard deviations. The formula normalizes everything so you get a unitless number. And that is both its strength and its weakness — it's wonderfully interpretable until it isn't.
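Herman's formula fits in a couple of lines. A minimal sketch in Python with synthetic data (numpy and scipy assumed; the numbers are illustrative, not from the episode):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.5, size=200)   # linear signal plus noise

# Pearson's r: covariance divided by the product of standard deviations
r_manual = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))
r_scipy, p_value = stats.pearsonr(x, y)          # same number, plus a p-value
```

The normalization is what makes r unitless: rescaling either variable leaves the coefficient unchanged.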
Corn
The "until it isn't" is doing a lot of work there. What breaks first?
Herman
Pearson's r is catastrophically sensitive to outliers. A single extreme point can drag your correlation from zero point eight to zero point two, or create a phantom correlation where none exists. There's a classic dataset — the Anscombe quartet from nineteen seventy-three — four sets of data with identical means, variances, and correlations, but the scatterplots look completely different.
Corn
Anscombe's quartet should be mandatory viewing before anyone is allowed to report a correlation coefficient. One of the sets is basically a straight line with one outlier that destroys the fit. Another is a perfect parabola — a completely nonlinear relationship the coefficient can't express. All four give you the same r of about zero point eight two.
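The quartet is small enough to check directly. A quick verification in Python, using the values from Anscombe's nineteen seventy-three paper (numpy assumed):

```python
import numpy as np

# Anscombe's quartet (Anscombe, 1973): four datasets, one shared r
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

# All four correlations come out at roughly 0.816
rs = [np.corrcoef(a, b)[0, 1] for a, b in [(x, y1), (x, y2), (x, y3), (x4, y4)]]
```

Identical coefficients, wildly different scatterplots — which is the whole point.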
Herman
And that brings us to the first rule of correlation analysis: always plot your data. A correlation coefficient without a scatterplot is a press release, not an analysis.
Corn
"A press release, not an analysis." I'm putting that on a mug.
Herman
That's Pearson. Now, what do you do when your data isn't normally distributed or the relationship isn't linear? That's where Spearman's rank correlation comes in. Spearman's rho — developed by Charles Spearman in nineteen oh four — works by converting your data to ranks and then computing Pearson's r on those ranks.
Corn
Instead of asking "do these numbers move together," you're asking "do these rankings move together."
Herman
And that makes it robust to outliers and works for monotonic relationships — relationships that always go up or always go down, even if they're not straight lines. If Y consistently increases as X increases, but the curve bends, Spearman catches it. Pearson might miss it.
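That definition can be sketched directly — rank both variables, then run Pearson on the ranks (illustrative data; scipy assumed):

```python
import numpy as np
from scipy import stats

x = np.linspace(0.0, 5.0, 50)
y = np.exp(x)                     # monotonic but strongly nonlinear

rho, _ = stats.spearmanr(x, y)    # exactly 1: the ordering is preserved perfectly
r, _ = stats.pearsonr(x, y)       # noticeably less than 1: the curve bends

# Spearman's rho is literally Pearson's r computed on the ranks
rho_manual, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
```

Spearman catches the monotonic relationship that Pearson understates.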
Corn
Monotonic but not linear — like diminishing returns.
Herman
And Spearman doesn't assume normality, which makes it the go-to for a lot of real-world data. Then there's Kendall's tau, developed by Maurice Kendall in nineteen thirty-eight. It's also rank-based but uses a different approach — it looks at concordant and discordant pairs. For every pair of observations, it asks: do they point in the same direction or opposite directions?
Corn
It's counting agreements versus disagreements in ordering.
Herman
Kendall's tau tends to be more robust than Spearman with small samples and handles ties better. It's also got a cleaner probabilistic interpretation — tau is basically the probability of concordance minus the probability of discordance for a randomly selected pair. I find it elegant in a way Spearman isn't.
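The concordant-versus-discordant counting Herman describes can be written out directly. A brute-force O(n²) sketch, fine for small samples (scipy assumed; with continuous data there are no ties, so tau-a matches scipy's default tau-b):

```python
import numpy as np
from scipy import stats
from itertools import combinations

rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = x + rng.normal(size=40)

# Count pairs that agree in ordering (concordant) vs disagree (discordant)
conc = disc = 0
for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
    s = (xi - xj) * (yi - yj)
    if s > 0:
        conc += 1
    elif s < 0:
        disc += 1

n_pairs = len(x) * (len(x) - 1) // 2
tau_manual = (conc - disc) / n_pairs   # P(concordant) - P(discordant)
tau_scipy, _ = stats.kendalltau(x, y)
```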
Corn
When would you reach for Kendall over Spearman?
Herman
Small samples, lots of ties in the data, or when you want that probabilistic interpretation. In practice, Spearman is more common, but Kendall is arguably better behaved statistically. They'll usually give you similar answers, but when they diverge, trust Kendall.
Corn
We've covered the big three. Pearson for linear and normal, Spearman for monotonic and robust, Kendall for small samples and ties. That's the starter pack.
Herman
Here's where most people stop. But the prompt is asking about advanced techniques, and this is where it gets genuinely interesting. Because the fundamental limitation of all three of these is that they measure bivariate association — the relationship between exactly two variables. In the real world, you almost always have more than two variables interacting.
Corn
That's where you get the classic "ice cream sales correlate with drowning deaths" problem. Both go up in summer. The correlation is real but the causal interpretation is nonsense.
Herman
The lurking variable problem. And the technique designed to handle this is partial correlation. Partial correlation measures the relationship between two variables while controlling for one or more other variables. You're essentially asking: if I hold Z constant, what's the residual correlation between X and Y?
Corn
In the ice cream example, you'd control for temperature and watch the correlation between ice cream sales and drownings vanish.
Herman
The math is straightforward — you regress X on Z, regress Y on Z, take the residuals from both regressions, and then correlate those residuals. What's left is the relationship between X and Y that isn't explained by Z.
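That residual recipe is short enough to show in full. A sketch with synthetic confounded data — the variable names echo the ice-cream example, and the numbers are illustrative (numpy and scipy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 500
temperature = rng.normal(size=n)                     # the confounder Z
ice_cream = 2 * temperature + rng.normal(size=n)     # X: driven by Z
drownings = 3 * temperature + rng.normal(size=n)     # Y: also driven by Z

raw_r, _ = stats.pearsonr(ice_cream, drownings)      # strong but spurious

def residuals(v, z):
    """Residuals of a simple linear regression of v on z."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)

# Partial correlation of X and Y controlling for Z:
# correlate what's left of each after removing Z's contribution
partial_r, _ = stats.pearsonr(residuals(ice_cream, temperature),
                              residuals(drownings, temperature))
```

The raw correlation is large; the partial correlation collapses toward zero once temperature is held fixed.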
Corn
This scales to multiple control variables?
Herman
You can compute partial correlations controlling for entire sets of variables. This is foundational in fields like epidemiology and econometrics, where you're constantly trying to isolate relationships from confounders. But there's a trap here — partial correlation assumes linear relationships among all variables. If the confounder relationship is nonlinear, your partial correlation can be misleading.
Corn
Everything in statistics is a model with assumptions, and every assumption is a lie waiting to be exposed.
Herman
That's the most Corn thing you've ever said.
Corn
I have my moments. So partial correlation handles confounders — what's next?
Herman
Let's talk about distance correlation. This is a much more recent development — Gábor Székely and his colleagues introduced it in two thousand seven. And it solves a fundamental problem. Pearson, Spearman, Kendall — they all measure some specific kind of association. If two variables are related but the relationship is non-monotonic, those measures can give you zero even when there's a clear dependence.
Corn
Give me an example of non-monotonic dependence.
Herman
Y equals X squared, with X spread symmetrically around zero. As X increases, Y first decreases, then increases. Pearson gives you essentially zero. So does Spearman. There's clearly a relationship — a perfect deterministic one — but all the standard measures miss it completely.
Corn
Because they're all measuring "does Y go up when X goes up" in some form, and here Y goes up and down.
Herman
Distance correlation solves this. The intuition is beautiful: instead of measuring how values covary, you measure how distances between pairs of points covary. For any two pairs of observations, you look at the distance between the X values and the distance between the Y values. If X and Y are related in any way, pairs that are close in X will tend to be close in Y in some systematic pattern.
Corn
You're comparing the distance matrices of the two variables.
Herman
And distance correlation has a property that is almost magical: it equals zero if and only if the variables are statistically independent. Not just linearly independent, not just monotonic independent — fully independent. For any dependence at all, distance correlation is positive.
Corn
That's a strong claim. Any dependence whatsoever?
Herman
In the population limit, at least. With finite samples, you're estimating it, so there's noise. But the theoretical property is that distance correlation characterizes independence completely. No other correlation measure does that.
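A compact implementation of the sample statistic from the two thousand seven paper, applied to the Y equals X squared case where Pearson reads zero (numpy only; note the n-by-n distance matrices):

```python
import numpy as np

def dcor(x, y):
    """Sample distance correlation (Székely et al., 2007), O(n^2) memory."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])   # pairwise distances within x
    b = np.abs(y[:, None] - y[None, :])   # pairwise distances within y
    # Double-center: subtract row means and column means, add back grand mean
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                # squared distance covariance (>= 0)
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

x = np.linspace(-1, 1, 201)
y = x ** 2            # non-monotonic: Pearson ~0, dependence is perfect
r = np.corrcoef(x, y)[0, 1]
d = dcor(x, y)
```

Here `r` is essentially zero while the distance correlation is clearly positive — exactly the gap Herman describes.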
Corn
Why isn't this everywhere? Why do people still use Pearson?
Herman
A few reasons. It's computationally heavier — you're working with n-by-n distance matrices, so it scales poorly to massive datasets. It's less intuitive to explain. And honestly, a lot of fields are just slow to adopt new statistical methods. But for moderate-sized datasets where you don't know the functional form of the relationship, distance correlation is arguably the best tool we have.
Corn
There are tests based on it?
Herman
Yes, there's a permutation test for distance correlation that gives you a p-value for dependence. You shuffle one variable, recompute the distance correlation thousands of times, and see where your observed value falls in that null distribution. It's computationally intensive but conceptually clean.
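The shuffle-and-recompute scheme, self-contained (numpy only; 500 permutations here for speed — real analyses often use more):

```python
import numpy as np

def dcor(x, y):
    """Sample distance correlation, compact O(n^2) form."""
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return np.sqrt((A * B).mean() / np.sqrt((A * A).mean() * (B * B).mean()))

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 100)
y = x ** 2 + rng.normal(scale=0.05, size=100)   # clear nonlinear dependence

observed = dcor(x, y)
# Null distribution: shuffling y breaks any dependence with x
null = np.array([dcor(x, rng.permutation(y)) for _ in range(500)])
p_value = (1 + np.sum(null >= observed)) / (1 + len(null))
```

The observed statistic sits far out in the tail of the permutation null, so the p-value is small.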
Corn
We've got Pearson, Spearman, Kendall, partial correlation, distance correlation. What's the next rung up the ladder?
Herman
Canonical correlation analysis. This is where we stop looking at pairs of variables and start looking at sets of variables. CCA finds linear combinations of variables in one set that are maximally correlated with linear combinations of variables in another set.
Corn
Instead of correlating X with Y, you're correlating a weighted sum of X-one through X-n with a weighted sum of Y-one through Y-m.
Herman
And it gives you not just one pair of canonical variates but multiple pairs, each orthogonal to the previous ones, each capturing the next strongest mode of covariation between the two sets. It's like principal component analysis, but instead of maximizing variance within one set, you're maximizing correlation between two sets.
Corn
This sounds like the kind of thing that gets used in genomics or neuroscience, where you have thousands of measurements on one side and behavioral outcomes on the other.
Herman
Classic use case. You've got gene expression data — thousands of genes — and you want to see how they relate to a set of clinical measurements. CCA finds the combinations of genes that most strongly correlate with combinations of symptoms. Harold Hotelling developed it in nineteen thirty-six, and it's been a workhorse in multivariate statistics ever since.
Corn
What are the pitfalls with CCA?
Herman
Overfitting is the big one. With high-dimensional data — more variables than observations — CCA will find perfect correlations even with random noise. Regularization is essential in modern applications. There's regularized CCA, sparse CCA that forces many weights to zero for interpretability, kernel CCA for nonlinear relationships. The basic idea has spawned a whole family of methods.
Corn
Kernel CCA — that's where you map everything into a higher-dimensional space first?
Herman
Right, using the kernel trick from machine learning. You implicitly transform your variables into a feature space where relationships that are nonlinear in the original space become linear. Then you do CCA in that transformed space. It's powerful but even more prone to overfitting, so you need careful cross-validation.
Corn
We've covered the spectrum from "correlate two columns in Excel" to "kernelized regularized sparse canonical correlation analysis." There's something satisfying about that arc.
Herman
There's one more I want to mention because it addresses a specific pain point: autocorrelation in time series. If you compute a standard correlation between two time series — say, GDP and stock prices — you're going to get a misleading answer because both series are correlated with their own past values.
Corn
The spurious regression problem. Granger and Newbold's classic paper.
Herman
From nineteen seventy-four. They showed that if you regress two independent random walks on each other, you get statistically significant correlations most of the time. The solution for correlation specifically is to work with differenced data or to use the cross-correlation function, which computes correlation at different time lags.
Corn
You're asking not just "are these related" but "does X lead Y by three months?"
Herman
The cross-correlation function gives you a correlation coefficient for each lag. X at time t correlated with Y at time t-minus-k. This is fundamental in time series econometrics, signal processing, any domain where timing matters. And it surfaces lead-lag relationships that simple contemporaneous correlation completely misses.
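A sketch of the cross-correlation function on a synthetic series where Y echoes X three steps later (numpy only; statsmodels offers a `ccf` helper, but the idea fits in a few lines):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
x = rng.normal(size=n)
y = np.zeros(n)
y[3:] = x[:-3] + rng.normal(scale=0.3, size=n - 3)  # y echoes x, lag 3
y[:3] = rng.normal(scale=0.3, size=3)

def cross_corr(x, y, k):
    """Correlation of x at time t with y at time t + k (k >= 0: x leads y)."""
    if k == 0:
        return float(np.corrcoef(x, y)[0, 1])
    return float(np.corrcoef(x[:-k], y[k:])[0, 1])

ccf = {k: cross_corr(x, y, k) for k in range(10)}
best_lag = max(ccf, key=ccf.get)   # the lag with the strongest correlation
```

The contemporaneous correlation at lag zero is near zero; the lag-three correlation is large — the lead-lag structure simple correlation misses.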
Corn
Which brings us to the philosophical question lurking behind all of this. Correlation is not causation — everyone knows that. But what is correlation actually telling you?
Herman
That's the question, isn't it? Correlation is a measure of association, and association can arise for many reasons. Direct causation, reverse causation, common causes, selection bias, measurement error, sheer coincidence. The correlation coefficient itself is silent on the mechanism.
Corn
Yet people desperately want it to speak. There's an entire industry of "X is correlated with Y, therefore you should do Z" that skips the hard work of identifying mechanisms.
Herman
This is where the techniques we've discussed become tools for investigation rather than endpoints. Partial correlation helps you rule out confounders. Cross-correlation helps you establish temporal precedence — which is one of the Bradford Hill criteria for causation. Distance correlation tells you whether there's any dependence at all worth investigating.
Corn
Bradford Hill — the nine criteria for causal inference in epidemiology. Strength, consistency, specificity, temporality, biological gradient, plausibility, coherence, experiment, analogy.
Herman
Temporality — the cause must precede the effect — is the only one that's strictly necessary. Cross-correlation directly addresses that for time series. But even with a strong cross-correlation at a plausible lag, you still haven't proven causation. You've just made the case more plausible.
Corn
Let's talk about a specific pitfall that I think deserves more attention: restriction of range. If you only look at a narrow slice of your data, correlations can vanish or appear out of nowhere.
Herman
Classic example is SAT scores and college GPA. Within a highly selective university, the correlation might look weak because everyone has high SAT scores. You've restricted the range on the predictor. But across the full population of college students, the correlation is substantial. This trips up people constantly.
Corn
Because they're computing correlations on their existing customers, or their current employees, or whatever convenient sample is sitting in the database, and forgetting that the sample isn't representative of the population they're trying to reason about.
Herman
And the math is unforgiving — the formula for the correlation in the full population given the restricted correlation involves the ratio of variances. If the variance in your sample is much smaller than in the population, your observed correlation can be dramatically attenuated.
Corn
Another one: ecological correlation. Correlating group-level averages and then interpreting the result as if it applies to individuals.
Herman
The ecological fallacy. A classic from sociology — Emile Durkheim found that Protestant regions had higher suicide rates, but that doesn't mean individual Protestants were more likely to commit suicide. The correlation at the aggregate level doesn't necessarily hold at the individual level. And in fact, the sign can flip — that's Simpson's paradox territory.
Corn
Simpson's paradox is the ultimate cautionary tale for correlation analysis. A trend appears in several groups of data but disappears or reverses when the groups are combined.
Herman
The famous Berkeley graduate admissions case from nineteen seventy-three. Overall, men were admitted at a higher rate than women. But broken down by department, women were admitted at equal or higher rates in most departments. The apparent discrimination was an artifact of women applying to more competitive departments.
Corn
The aggregate correlation pointed in one direction, the disaggregated correlations pointed in another. And both were "correct" in the sense that the math was fine — the interpretation was the problem.
Herman
Which brings us to a practical framework. When I'm doing correlation analysis, I try to follow a workflow. Step one: plot everything. Scatterplots, distributions, time series plots if relevant. Step two: check for outliers and decide how to handle them — remove, transform, or use robust methods. Step three: choose your correlation measure based on what you've seen in the plots.
Step four: think about confounders. What else could be driving both variables? Compute partial correlations if you have data on plausible confounders. Step five: check for subgroup effects — could the relationship differ across categories? Step six: if it's time series, check for autocorrelation and use cross-correlation functions. Step seven: don't overinterpret. Report the uncertainty. Report the assumptions. And never, ever say "therefore" when you mean "is associated with."
Corn
"Never say 'therefore' when you mean 'is associated with.'" That might be the best statistical advice I've ever heard.
Herman
It's aspirational. I violate it all the time. But I try.
Corn
Let's dig into something you mentioned earlier — the computational scaling issue with distance correlation. You said it works with n-by-n distance matrices. At what point does it become impractical?
Herman
The naive computation is O of n squared in both time and memory. For a million observations, that's a trillion pairwise distances. You're not doing that on a laptop. But there's been progress. Some recent work uses random projections or binning to approximate distance correlation in near-linear time, and a two thousand nineteen paper by Chaudhuri and Hu gave a fast exact algorithm for the one-dimensional case.
Corn
You project the data onto random lines, compute one-dimensional distance correlations, and average?
Herman
The one-dimensional distance correlation can be computed quickly — in n log n time after sorting — and by averaging over enough random projections, you get a consistent estimate of the full distance correlation. It's a clever trick that makes the method feasible for much larger datasets.
Corn
The trade-off is variance — you're introducing approximation error.
Herman
Right, but with enough projections, that variance becomes manageable. It's the same principle as Monte Carlo integration. You're trading computational cost for statistical precision in a controlled way.
Corn
Let's circle back to something more basic that I think gets overlooked: the distinction between correlation and agreement. A correlation of zero point nine doesn't mean two measurements agree — it means they move together. If I have a scale that's consistently ten pounds too high, it correlates perfectly with the true weight but doesn't agree with it at all.
Herman
That's such an important point. Correlation measures linear association, not agreement. For agreement, you want measures like the intraclass correlation coefficient or Bland-Altman plots. The ICC is specifically designed for situations where you want to know if two measurements are interchangeable — same mean, same scale. Pearson doesn't care about either.
Corn
This trips up people in medical research constantly. Two devices measuring the same thing, high Pearson correlation, everyone concludes they're equivalent. But the new device could be systematically biased and wildly variable, and Pearson wouldn't flag it.
Herman
Bland and Altman's nineteen eighty-six Lancet paper on this is one of the most cited statistical papers of all time, and people still get it wrong. Their method is to plot the difference between measurements against their average. It's brilliantly simple — you can see bias, heteroscedasticity, outliers, all at a glance.
Corn
Heteroscedasticity — the variance changes across the range. Another thing standard correlation methods don't handle well.
Herman
Another reason to plot before computing. A funnel shape in the scatterplot — narrow at one end, wide at the other — tells you the correlation isn't uniform across the range. Maybe it's strong for low values and weak for high values, or vice versa. A single correlation coefficient collapses all of that variation into one misleading number.
Corn
We've got agreement versus association, heteroscedasticity, restriction of range, ecological fallacy, Simpson's paradox, autocorrelation, outliers, nonlinearity, confounding. The list of ways to misuse correlation is longer than the list of correlation techniques.
Herman
That's before we get to the multiple testing problem. If you compute correlations between every pair of variables in a dataset with a hundred variables, that's nearly five thousand correlations. At a five percent significance level, you expect about two hundred fifty false positives. Without correction, you'll "discover" patterns in pure noise.
Corn
The garden of forking paths, correlation edition. P-hacking by browsing.
Herman
Bonferroni correction is the simplest fix — divide your significance threshold by the number of tests. But it's conservative. The Benjamini-Hochberg procedure controls the false discovery rate and is more powerful. Either way, if you're doing exploratory correlation analysis on high-dimensional data, you need some form of multiple testing correction, or you're just telling stories about noise.
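Herman's numbers are easy to reproduce. A sketch on pure noise — fifty independent variables, so every "discovery" below is false — comparing naive counting, Bonferroni, and a hand-rolled Benjamini-Hochberg (scipy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n, p = 100, 50
data = rng.normal(size=(n, p))      # pure noise: no real correlations exist

# p-value for every pair of columns: 50 * 49 / 2 = 1225 tests
pvals = np.array([stats.pearsonr(data[:, i], data[:, j])[1]
                  for i in range(p) for j in range(i + 1, p)])

naive_hits = np.sum(pvals < 0.05)                   # ~5% of 1225, all false
bonferroni_hits = np.sum(pvals < 0.05 / len(pvals))

# Benjamini-Hochberg: largest k with p_(k) <= (k / m) * alpha
order = np.sort(pvals)
m = len(pvals)
below = order <= (np.arange(1, m + 1) / m) * 0.05
bh_hits = (np.max(np.nonzero(below)[0]) + 1) if below.any() else 0
```

The naive count "discovers" dozens of patterns in noise; both corrections bring that down to essentially zero here.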
Corn
Which, to be fair, is a thriving industry.
Herman
It's the business model of a surprising number of things, yes.
Corn
Let's talk about mutual information for a moment. It's not a correlation measure in the classical sense, but it captures dependence. How does it fit into this landscape?
Herman
Mutual information comes from information theory — it's Shannon's concept from nineteen forty-eight. It measures how much knowing one variable reduces your uncertainty about another. If X and Y are independent, mutual information is zero. If Y is a deterministic function of X, mutual information is the entropy of Y.
Corn
Like distance correlation, it captures any form of dependence, not just linear or monotonic.
Herman
And it's got a beautiful mathematical foundation. The catch is estimation — estimating mutual information from data is hard, especially in high dimensions. There are k-nearest-neighbor methods, kernel density methods, and more recently, neural network-based estimators. But none of them are as straightforward as computing Pearson's r.
Corn
The curse of dimensionality hits mutual information estimation particularly hard.
Herman
The k-NN estimator by Kraskov, Stögbauer, and Grassberger from two thousand four is the most widely used — it's in basically every mutual information toolbox. But it's still finicky with small samples and high dimensions.
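scikit-learn's `mutual_info_regression` wraps a k-nearest-neighbor estimator in this family. A sketch on the same Y equals X squared shape from earlier (illustrative; estimates are in nats and carry estimator noise):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(7)
x = rng.uniform(-1, 1, 1000)
y_dep = x ** 2 + rng.normal(scale=0.05, size=1000)  # non-monotonic dependence
y_ind = rng.normal(size=1000)                        # genuinely independent

mi_dep = mutual_info_regression(x.reshape(-1, 1), y_dep, random_state=0)[0]
mi_ind = mutual_info_regression(x.reshape(-1, 1), y_ind, random_state=0)[0]
```

Like distance correlation, the estimate is clearly positive for the parabola and near zero for independent noise — a dependence Pearson would score as zero either way.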
Corn
For practical dependence detection, distance correlation often wins on ease of use, even though mutual information has the deeper theoretical pedigree.
Herman
That's my read, yes. Distance correlation has the "drop it in and it works" property that mutual information estimation hasn't quite achieved yet. But the field is moving fast.
Corn
What about correlation for categorical data? We've been mostly talking about continuous variables.
Herman
That's a whole parallel universe. For nominal categories, you've got Cramér's V, which is derived from the chi-squared statistic and ranges from zero to one. For ordinal categories — Likert scales, rankings — you've got polychoric correlation, which assumes there's an underlying continuous normal variable that's been discretized.
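Cramér's V falls straight out of the chi-squared statistic. A sketch on a made-up three-by-three contingency table with a strong diagonal (scipy assumed; the counts are illustrative):

```python
import numpy as np
from scipy import stats

# Contingency table: rows = group, columns = outcome
table = np.array([[30, 10,  5],
                  [10, 25, 10],
                  [ 5, 10, 30]])

chi2, p, dof, expected = stats.chi2_contingency(table)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))), in [0, 1]
n = table.sum()
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))
```

V near zero means no association; values toward one mean the table concentrates along some pattern, as the diagonal does here.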
Corn
Polychoric correlation — that's the one where you're estimating what the Pearson correlation would be if you could observe the underlying continuous variable.
Herman
It's widely used in psychometrics and structural equation modeling. For binary variables, it reduces to the tetrachoric correlation. These are all based on the idea that the categories are crude measurements of something continuous.
Corn
If you don't buy that assumption?
Herman
Then you're back to rank-based methods or Cramér's V. There's no single right answer — it depends on what you believe about the data-generating process. Which is true of basically everything we've discussed.
Corn
That might be the unifying theme of this entire episode. Every technique has assumptions. The skill isn't in knowing the formulas — it's in knowing when the assumptions are violated and what to do about it.
Herman
That's what separates statistical literacy from recipe-following. A recipe says "compute Pearson correlation, if p is less than zero point zero five, you win." Statistical literacy says "look at the data, think about the generating process, choose the right tool, report the uncertainty, and for heaven's sake don't claim causation without a mechanism."
Corn
The prompt asked about basic and advanced techniques. I think we've covered the techniques. But the meta-lesson might be more valuable: correlation analysis is as much about skepticism as it is about computation.
Herman
The best correlation analyst is the one who trusts their own results the least.
Corn
Verifies everything three different ways before believing it.
Herman
Which is exhausting, but it's the job.
Corn
To summarize the toolkit for anyone keeping score at home. Basic: Pearson, Spearman, Kendall. Always plot first. Intermediate: partial correlation for controlling confounders, cross-correlation for time series, Cramér's V and polychoric for categorical data. Advanced: distance correlation for detecting any form of dependence, canonical correlation for relating sets of variables, mutual information for information-theoretic dependence. And throughout: check for outliers, check for range restriction, check for subgroup effects, correct for multiple comparisons, and never confuse correlation with agreement or association with causation.
Herman
That's the syllabus. And honestly, if someone internalizes even half of that, they're ahead of ninety percent of people who report correlations for a living.
Corn
The bar is low, but the ladder is tall.
Herman
That's another mug.
Corn
I'm building a whole kitchen set at this point.
Herman
One thing I want to add before we wrap — there's been some interesting work recently on correlation in high-dimensional settings. When you have more variables than observations, the sample correlation matrix is a terrible estimate of the population correlation matrix. Random matrix theory gives you tools to understand what's signal and what's noise.
Corn
This is Marchenko-Pastur distribution territory?
Herman
The eigenvalue distribution of a random correlation matrix follows a known law. Eigenvalues that fall outside that distribution are potentially meaningful structure. It's used in finance for cleaning correlation matrices before portfolio optimization — you essentially shrink the noisy eigenvalues and keep the ones that stick out.
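A sketch of the idea on pure noise: every eigenvalue of the sample correlation matrix should fall inside the Marchenko-Pastur support, so on real data anything outside it is candidate structure (numpy only; finite samples fluctuate slightly at the edges):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 1000, 100                      # observations x variables, pure noise
data = rng.normal(size=(n, p))
corr = np.corrcoef(data, rowvar=False)
eigs = np.linalg.eigvalsh(corr)

# Marchenko-Pastur support for a noise correlation matrix: (1 +/- sqrt(p/n))^2
q = p / n
lam_minus = (1 - np.sqrt(q)) ** 2
lam_plus = (1 + np.sqrt(q)) ** 2
outliers = np.sum((eigs < lam_minus) | (eigs > lam_plus))
```

On noise, essentially nothing escapes the bulk; in the finance application Herman mentions, the eigenvalues that do stick out above `lam_plus` are the ones kept after cleaning.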
Corn
Even the humble correlation matrix has a whole field of study dedicated to figuring out which parts of it are real.
Herman
Which feels like a fitting place to land. Correlation is simple to compute and bottomless to understand. Every time you think you've mastered it, there's another layer.
Corn
Somewhere out there, someone is computing a Pearson correlation on two columns of dirty data, not plotting anything, and writing a press release about their groundbreaking discovery.
Herman
The circle of life.

And now: Hilbert's daily fun fact.

Hilbert: In eighteen twelve, astronomer Honoré Flaugergues observed a bright, transient lunar phenomenon from his observatory in French Guiana — he named the effect "clair de terre lunaire," believing it was reflected earthlight, though modern astronomers still debate what he actually saw.
Corn
...clair de terre lunaire. That's going to be stuck in my head all day.
Herman
This has been My Weird Prompts. Our producer is Hilbert Flumingtop. If you enjoyed this, leave us a review wherever you get your podcasts — it helps more than you'd think. I'm Herman Poppleberry.
Corn
I'm Corn. Go plot your data.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.