Smarter Correlation for Postgres with pgxicor
When analyzing relationships between variables in PostgreSQL,
many turn to the Pearson correlation coefficient (corr(X, Y)
).
While useful, it only captures linear relationships.
Many real-world datasets contain relationships that aren’t strictly linear. Consider cases like:
- Diminishing returns in economics (e.g., increasing ad spend vs. revenue growth)
- Biological or medical data with exponential growth patterns
- Logarithmic or polynomial trends in machine learning features
In these scenarios, corr(X, Y)
might report a weak correlation.
This is where the xi (ξ) correlation coefficient comes into play:
Unlike Pearson’s correlation,
Xi Correlation can detect both linear and nonlinear functional relationships between variables.
For more information on XI, see the original paper. A New Coefficient of Correlation S. Chatterjee (2020)
pgxicor is a PostgreSQL extension that introduces the Xi Correlation (xicor(X, Y)
) aggregate function.
CREATE EXTENSION xicor;
CREATE TABLE xicor_test (x float8, y float8);
INSERT INTO xicor_test (x, y)
VALUES
(1.0, 2.0),
(2.5, 3.5),
(3.0, 4.0),
(4.5, 5.5),
(5.0, 6.0);
-- Query to calculate the Xi correlation using the aggregate function
SELECT xicor(x, y)
FROM xicor_test;
I’ve published this extension on GitHub on Florents-Tselai/pgxicor.
If you’re interested in this, also check out vasco; Another similar extension is based on the Maximal Information Coefficient (MIC). A standalone C implementation of ξ is also available libxicor.