Extending SQL Databases with Python

Florents (Flo) Tselai

Tsel.Ai

Lé Me

  • Data generalist
  • Early Python for Data adopter
  • Probably one of the first Pandas user in Greece (2012)
  • Shifted toward in-DB Analytics & Ops
  • Core PostgreSQL Contributor
  • Extension developer pgxicor, vasco, pgJQ, liteJQ, pgPDF, spat
  • diofanti.org: diavgeia for humans. Monitoring the Greek Gov. 
  • See you at PGConf.EU 2025 (October, Riga)

Procedural vs Relational

# Find all employees older than 30
result = []
for emp in employees:
    if emp.age > 30:
        result.append(emp)
-- Find all employees older than 30
SELECT *
FROM employees
WHERE age > 30;

Q: Where’s the data, where the execution happens ?

SQL Database

  1. Compiler: SQL → Execution Plan → Tabular Result
  2. Infinite memory emulator

Enter Pandas

import pandas as pd

employees = pd.read_sql("SELECT * FROM employees", con)
result = employees[employees["age"] > 30]
print(result)

Too much of anything hurts

Where’s Pandas Today ?

  • In-memory eventually → Out-Of-Memory OOM
  • Interactive Programming (Notebooks) is not sustainable for large codebases
  • Complexity integrating with BI tools
  • Declarative eventually wins

Let’s Reinvent SQL

Those who avoid using SQL are bound to reinvent it…. Poorly!

Use SQL More!

  • Less complexity
  • Less devops
  • Built-in integration with BI tools
  • Easier to learn if you try
  • Improves performance: pushdown optimization!
  • ACID Compliance!

SQL

Standard Query Language

SQL = Extensible

– You want standard?

– Yes!

– I’ll give you standard

– I want custom too!

– !@#$%&&

How to import in SQL ?

SQL Extensions

  • More powerful than they sound
  • Whole companies built on DB extensions (Citus, TimescaleDB, PipelineDB)
  • Whole domains (PostGIS)
  • More powerful than they sound

Why & Extend SQL

  • Custom functions
  • Custom types

PostgreSQL

  • The best database in the world!
  • Simple (textbook-db), pragmatic robust
  • Extensible by design

pgPDF: pdf type for PostgreSQL

CREATE EXTENSION pgpdf;
SELECT '/tmp/pgintro.pdf'::pdf;
            pdf                                        
-------------------------
 PostgreSQL Introduction 
 Digoal.Zhou 
 7/20/2011Catalog 
  PostgreSQL Origin 
SELECT pdf_title('/tmp/pgintro.pdf');
SELECT pdf_author('/tmp/pgintro.pdf');

pgPDF: pdf type for PostgreSQL

CREATE TABLE pdfs(name text primary key, doc pdf);

INSERT INTO pdfs VALUES ('pgintro', '/tmp/pgintro.pdf');
INSERT INTO pdfs VALUES ('pgintro', '/tmp/sample.pdf');

SELECT 'DOC START ' || doc;

SELECT name FROM pdfs WHERE doc::text LIKE '%Postgres%';

pgPDF & http

Combining extensions.

CREATE EXTENSION pgpdf;
CREATE EXTENSION http;

SELECT pdf_read_bytes(text_to_bytea(content))
FROM http_get('https://....pdf');

pgJQ: JQ in PostgreSQL

CREATE EXTENSION pgjq;
       
SELECT jq('[{"bar": "baz", "balance": 7.77, "active":false}]'::jsonb, 
          '.[0].bar');
    jq  
---------
"baz"
(1 row)

select jq('{
  "runner": 1,
  "message": "jobStatus",
  "jobs": [
    {
      "id": 9,
      "is_successfull": true
    },
    {
      "id": 100,
      "is_successfull": false,
      "metdata": {
        "environ": "prod"
      }
    }
  ]
}'::jsonb, 
          '.jobs[] | select(.is_successfull == $is_success and .id == 100) | .', 
          '{"is_success": false, "id": 100}');

                                  jq                                  
----------------------------------------------------------------------
 {"id": 100, "metdata": {"environ": "prod"}, "is_successfull": false}
(1 row)

spat: Like Redis, but it’s PostgreSQL!

CREATE EXTENSION spat;

SELECT SPSET('key', 'value');
SELECT SPGET('key'); -- value

SELECT LPUSH('list1', 'elem1');
SELECT LPUSH('list1', 'elem2');
SELECT LPOP('list1'); -- elem2

SELECT SADD('set1', 'elem1', 'elem2');
SELECT SISMEMBER('set1', 'elem1'); -- t

SELECT HSET('h1', 'f1', 'Hello');
SELECT HGET('h1', 'f1'); -- Hello

Extending Postgres With Python

  • Mostly done with C, but not necessarily
  • Has a complete extension-building infrastructure Makefile-based.
  • PL/Python: Procedural Python in SQL

py_string_join in PL/Python

CREATE OR REPLACE FUNCTION py_string_join(
    elements text[],
    separator text DEFAULT ', '
)
RETURNS text
AS 
$$
    return separator.join(elements)
$$ LANGUAGE plpython3u
   IMMUTABLE
   COST 10
   PARALLEL SAFE;

-- Custom separator
SELECT py_string_join(ARRAY['2025','08','29'], '-');
-- Result: "2025-08-29"

-- Empty separator
SELECT py_string_join(ARRAY['a','b','c'], '');
-- Result: "abc"

moving_avg in PL/Python

CREATE OR REPLACE FUNCTION moving_avg(
    vals float8[], win int DEFAULT 3) RETURNS float8[]
AS $$
    result = []
    n = len(vals)
    for i in range(n):
        start = max(0, i - win + 1)
        window_vals = vals[start:i+1]
        avg = sum(window_vals) / len(window_vals)
        result.append(avg)
    return result
$$ LANGUAGE plpython3u IMMUTABLE COST 100 PARALLEL UNSAFE;

SELECT moving_avg(ARRAY[10, 20, 30, 40, 50], 2);

    moving_avg
------------------
{10,15,25,35,45}
(1 row)

CREATE OR REPLACE FUNCTION py_zip(
    arr1 text[],
    arr2 text[]
)
RETURNS TABLE (first text, second text)
AS $$
    for a, b in zip(arr1, arr2):
        yield (a, b)
$$ LANGUAGE plpython3u 
       IMMUTABLE PARALLEL SAFE;

SELECT * FROM py_zip(ARRAY['a','b','c'], ARRAY['1','2','3']);
 first | second 
-------+--------
 a     | 1
 b     | 2
 c     | 3
(3 rows)

-- Different lengths: Python zip stops at the shortest
SELECT * FROM py_zip(ARRAY['x','y','z'], ARRAY['10','20']);
 first | second 
-------+--------
 x     | 10
 y     | 20
(2 rows)

PL/Python Considerations

  • Ergonomics with imported libs can be tricky
  • Notice plpython3u. It can be unsafe!
  • Need super priviliges to update
  • You can’t really version control: src is in DB!

SQLite & DuckDB

Both are serverless, in-process, file-based

  • Both embed easily in apps
  • No network daemon needed
  • Great for analytics, prototyping & data sharing.
  • Can probably fit all of your company’s data.
  • SQLite: OLTP-oriented
  • DuckDB: OLAP-oriented

Extending SQLite With Python

  • import sqlite3
  • Probably the most underappreciated std Python feature

py_string_join in SQLite

import sqlite3
conn = sqlite3.connect(":memory:")
# or conn = sqlite3.connect("/path/to/db")

def py_string_join(elements, separator=", "):
    if isinstance(elements, str):
        elements = elements.split(",")
    return separator.join(elements)

conn.create_function("py_string_join", 2, py_string_join)

cur = conn.cursor()
cur.execute("SELECT py_string_join('apple,banana,cherry', ' | ')")
print(cur.fetchone()[0])  # apple | banana | cherry

py_string_join in DuckDB

import duckdb
import numpy as np   # DuckDB UDFs rely on NumPy types internally

con = duckdb.connect()

def py_string_join(elements, separator=", "):
    if isinstance(elements, str):
        elements = elements.split(",")
    return separator.join(elements)

con.create_function(
    "py_string_join",
    py_string_join,
    return_type=duckdb.typing.VARCHAR
)

print(con.execute("SELECT py_string_join('apple,banana,cherry', ' | ')").fetchone()[0])
# → apple | banana | cherry

Custom Aggregates: Finding Hidden Correlations

  • Built-in aggregates (SUM, AVG, COUNT) = core analytics tools
  • Plain SELECT = mostly descriptive reporting
  • Custom aggregates can uncover deeper patterns

SELECT corr(x, y)

  • Built-in correlation → detects mostly linear relationships
  • Real data often shows non-linear dependencies
  • Modern correlation metrics (e.g. Maximal Information Coefficient, ξ (xi) correlation coefficient) go beyond linearity
  • Ideal case for extensibility.

pgxicor: XI (ξ) Correlation Coefficient in Postgres

That’s written in C.

CREATE TABLE xicor_test (id serial PRIMARY KEY, x float8, y float8);
INSERT INTO xicor_test (x, y)
VALUES
    (1.0, 2.0),
    (2.5, 3.5),
    (3.0, 4.0),
    (4.5, 5.5),
    (5.0, 6.0);

SELECT xicor(x, y) FROM xicor_test;
    xicor   
------------
    0.5
(1 row)

XI (ξ) Correlation Coefficient in SQLite3

import sqlite3
import numpy as np
from scipy.stats import chatterjeexi
class ChatterjeeXiAgg:
    def __init__(self):
        self.x, self.y = [], []
    def step(self, x, y):
        if x is not None and y is not None:
            self.x.append(x)
            self.y.append(y)
    def finalize(self):
        if not self.x:
            return None
        return float(chatterjeexi(np.array(self.x), 
                                  np.array(self.y), method="asymptotic"))

XI (ξ) Correlation Coefficient in SQLite3

con = sqlite3.connect(":memory:")
con.create_aggregate("chatterjee_xi", 2, ChatterjeeXiAgg)

cur = con.cursor()
cur.execute("CREATE TABLE t(x REAL, y REAL)")
cur.executemany("INSERT INTO t VALUES (?, ?)", [
    (1, 2),
    (2, 5),
    (3, 7),
    (4, 8),
    (5, 10),
])
print(cur.execute("SELECT chatterjee_xi(x, y) FROM t").fetchone()[0])

Advice for Pythonistas

  • Apollo Guidance Computer (AGC): 72KB. Think before you ask for more resources.
  • Learn SQL.
  • Learn how DBs Work.
  • Before you move data, move queries & programs.
  • Move data only if necessary.
  • Doing BI? Learn about window functions. Vibe code!
  • Be declarative! See Zen of Python.
  • Be diligent with your imports. Stick to standard lib.

Thank You!

Florents (Flo) Tselai

Tsel.Ai

flo@tsel.ai