py-db-extensions

Extending SQL Databases with Python

Florents (Flo) Tselai

Lé Me

Data generalist
Early Python for Data adopter
Probably ~~one of~~ the first Pandas user in Greece (2012)
Shifted toward in-DB Analytics & Ops
Core PostgreSQL Contributor
Extension developer pgxicor, vasco, pgJQ, liteJQ, pgPDF, spat
diofanti.org: diavgeia for humans. Monitoring the Greek Gov.
See you at PGConf.EU 2025 (October, Riga)

Procedural vs Relational

# Find all employees older than 30
result = []
for emp in employees:
    if emp.age > 30:
        result.append(emp)

-- Find all employees older than 30
SELECT *
FROM employees
WHERE age > 30;

Q: Where’s the data, where the execution happens ?

SQL Database

Compiler: SQL → Execution Plan → Tabular Result
Infinite memory emulator

Enter Pandas

import pandas as pd

employees = pd.read_sql("SELECT * FROM employees", con)
result = employees[employees["age"] > 30]
print(result)

Too much of anything hurts

Where’s Pandas Today ?

In-memory eventually → Out-Of-Memory OOM
Interactive Programming (Notebooks) is not sustainable for large codebases
Complexity integrating with BI tools
Declarative eventually wins

Let’s Reinvent SQL

Those who avoid using SQL are bound to reinvent it…. Poorly!

Use SQL More!

Less complexity
Less devops
Built-in integration with BI tools
Easier to learn if you try
Improves performance: pushdown optimization!
ACID Compliance!

SQL

Standard Query Language

SQL = Extensible

– You want standard?

– Yes!

– I’ll give you standard

– I want custom too!

– !@#$%&&

How to `import` in SQL ?

SQL Extensions

More powerful than they sound
Whole companies built on DB extensions (Citus, TimescaleDB, PipelineDB)
Whole domains (PostGIS)
More powerful than they sound

Why & Extend SQL

Custom functions
Custom types

PostgreSQL

The best database in the world!
Simple (textbook-db), pragmatic robust
Extensible by design

pgPDF: `pdf` type for PostgreSQL

CREATE EXTENSION pgpdf;

SELECT '/tmp/pgintro.pdf'::pdf;

            pdf                                        
-------------------------
 PostgreSQL Introduction 
 Digoal.Zhou 
 7/20/2011Catalog 
  PostgreSQL Origin

SELECT pdf_title('/tmp/pgintro.pdf');
SELECT pdf_author('/tmp/pgintro.pdf');

pgPDF: `pdf` type for PostgreSQL

CREATE TABLE pdfs(name text primary key, doc pdf);

INSERT INTO pdfs VALUES ('pgintro', '/tmp/pgintro.pdf');
INSERT INTO pdfs VALUES ('pgintro', '/tmp/sample.pdf');

SELECT 'DOC START ' || doc;

SELECT name FROM pdfs WHERE doc::text LIKE '%Postgres%';

pgPDF & http

Combining extensions.

CREATE EXTENSION pgpdf;
CREATE EXTENSION http;

SELECT pdf_read_bytes(text_to_bytea(content))
FROM http_get('https://....pdf');

pgJQ: JQ in PostgreSQL

CREATE EXTENSION pgjq;
       
SELECT jq('[{"bar": "baz", "balance": 7.77, "active":false}]'::jsonb, 
          '.[0].bar');
    jq  
---------
"baz"
(1 row)

select jq('{
  "runner": 1,
  "message": "jobStatus",
  "jobs": [
    {
      "id": 9,
      "is_successfull": true
    },
    {
      "id": 100,
      "is_successfull": false,
      "metdata": {
        "environ": "prod"
      }
    }
  ]
}'::jsonb, 
          '.jobs[] | select(.is_successfull == $is_success and .id == 100) | .', 
          '{"is_success": false, "id": 100}');

                                  jq                                  
----------------------------------------------------------------------
 {"id": 100, "metdata": {"environ": "prod"}, "is_successfull": false}
(1 row)

spat: Like Redis, but it’s PostgreSQL!

CREATE EXTENSION spat;

SELECT SPSET('key', 'value');
SELECT SPGET('key'); -- value

SELECT LPUSH('list1', 'elem1');
SELECT LPUSH('list1', 'elem2');
SELECT LPOP('list1'); -- elem2

SELECT SADD('set1', 'elem1', 'elem2');
SELECT SISMEMBER('set1', 'elem1'); -- t

SELECT HSET('h1', 'f1', 'Hello');
SELECT HGET('h1', 'f1'); -- Hello

Extending Postgres With Python

Mostly done with C, but not necessarily
Has a complete extension-building infrastructure Makefile-based.
PL/Python: Procedural Python in SQL

`py_string_join` in PL/Python

CREATE OR REPLACE FUNCTION py_string_join(
    elements text[],
    separator text DEFAULT ', '
)
RETURNS text
AS 
$$
    return separator.join(elements)
$$ LANGUAGE plpython3u
   IMMUTABLE
   COST 10
   PARALLEL SAFE;

-- Custom separator
SELECT py_string_join(ARRAY['2025','08','29'], '-');
-- Result: "2025-08-29"

-- Empty separator
SELECT py_string_join(ARRAY['a','b','c'], '');
-- Result: "abc"

moving_avg in PL/Python

CREATE OR REPLACE FUNCTION moving_avg(
    vals float8[], win int DEFAULT 3) RETURNS float8[]
AS $$
    result = []
    n = len(vals)
    for i in range(n):
        start = max(0, i - win + 1)
        window_vals = vals[start:i+1]
        avg = sum(window_vals) / len(window_vals)
        result.append(avg)
    return result
$$ LANGUAGE plpython3u IMMUTABLE COST 100 PARALLEL UNSAFE;

SELECT moving_avg(ARRAY[10, 20, 30, 40, 50], 2);

    moving_avg
------------------
{10,15,25,35,45}
(1 row)

CREATE OR REPLACE FUNCTION py_zip(
    arr1 text[],
    arr2 text[]
)
RETURNS TABLE (first text, second text)
AS $$
    for a, b in zip(arr1, arr2):
        yield (a, b)
$$ LANGUAGE plpython3u 
       IMMUTABLE PARALLEL SAFE;

SELECT * FROM py_zip(ARRAY['a','b','c'], ARRAY['1','2','3']);
 first | second 
-------+--------
 a     | 1
 b     | 2
 c     | 3
(3 rows)

-- Different lengths: Python zip stops at the shortest
SELECT * FROM py_zip(ARRAY['x','y','z'], ARRAY['10','20']);
 first | second 
-------+--------
 x     | 10
 y     | 20
(2 rows)

PL/Python Considerations

Ergonomics with imported libs can be tricky
Notice plpython3u. It can be unsafe!
Need super priviliges to update
You can’t really version control: src is in DB!

SQLite & DuckDB

Both are serverless, in-process, file-based

Both embed easily in apps
No network daemon needed
Great for analytics, prototyping & data sharing.
Can probably fit all of your company’s data.
SQLite: OLTP-oriented
DuckDB: OLAP-oriented

Extending SQLite With Python

import sqlite3
Probably the most underappreciated std Python feature

`py_string_join` in SQLite

import sqlite3
conn = sqlite3.connect(":memory:")
# or conn = sqlite3.connect("/path/to/db")

def py_string_join(elements, separator=", "):
    if isinstance(elements, str):
        elements = elements.split(",")
    return separator.join(elements)

conn.create_function("py_string_join", 2, py_string_join)

cur = conn.cursor()
cur.execute("SELECT py_string_join('apple,banana,cherry', ' | ')")
print(cur.fetchone()[0])  # apple | banana | cherry

`py_string_join` in DuckDB

import duckdb
import numpy as np   # DuckDB UDFs rely on NumPy types internally

con = duckdb.connect()

def py_string_join(elements, separator=", "):
    if isinstance(elements, str):
        elements = elements.split(",")
    return separator.join(elements)

con.create_function(
    "py_string_join",
    py_string_join,
    return_type=duckdb.typing.VARCHAR
)

print(con.execute("SELECT py_string_join('apple,banana,cherry', ' | ')").fetchone()[0])
# → apple | banana | cherry

Custom Aggregates: Finding Hidden Correlations

Built-in aggregates (SUM, AVG, COUNT) = core analytics tools
Plain SELECT = mostly descriptive reporting
Custom aggregates can uncover deeper patterns

`SELECT corr(x, y)`

Built-in correlation → detects mostly linear relationships
Real data often shows non-linear dependencies
Modern correlation metrics (e.g. Maximal Information Coefficient, ξ (xi) correlation coefficient) go beyond linearity
Ideal case for extensibility.

pgxicor: XI (ξ) Correlation Coefficient in Postgres

That’s written in C.

CREATE TABLE xicor_test (id serial PRIMARY KEY, x float8, y float8);
INSERT INTO xicor_test (x, y)
VALUES
    (1.0, 2.0),
    (2.5, 3.5),
    (3.0, 4.0),
    (4.5, 5.5),
    (5.0, 6.0);

SELECT xicor(x, y) FROM xicor_test;
    xicor   
------------
    0.5
(1 row)

XI (ξ) Correlation Coefficient in SQLite3

import sqlite3
import numpy as np
from scipy.stats import chatterjeexi
class ChatterjeeXiAgg:
    def __init__(self):
        self.x, self.y = [], []
    def step(self, x, y):
        if x is not None and y is not None:
            self.x.append(x)
            self.y.append(y)
    def finalize(self):
        if not self.x:
            return None
        return float(chatterjeexi(np.array(self.x), 
                                  np.array(self.y), method="asymptotic"))

XI (ξ) Correlation Coefficient in SQLite3

con = sqlite3.connect(":memory:")
con.create_aggregate("chatterjee_xi", 2, ChatterjeeXiAgg)

cur = con.cursor()
cur.execute("CREATE TABLE t(x REAL, y REAL)")
cur.executemany("INSERT INTO t VALUES (?, ?)", [
    (1, 2),
    (2, 5),
    (3, 7),
    (4, 8),
    (5, 10),
])
print(cur.execute("SELECT chatterjee_xi(x, y) FROM t").fetchone()[0])

Advice for Pythonistas

Apollo Guidance Computer (AGC): 72KB. Think before you ask for more resources.
Learn SQL.
Learn how DBs Work.
Before you move data, move queries & programs.
Move data only if necessary.
Doing BI? Learn about window functions. Vibe code!
Be declarative! See Zen of Python.
Be diligent with your imports. Stick to standard lib.

Thank You!

Florents (Flo) Tselai

Tsel.Ai

flo@tsel.ai

Extending SQL Databases with Python

Lé Me

Procedural vs Relational

SQL Database

Enter Pandas

Too much of anything hurts

Let’s Reinvent SQL

Use SQL More!

SQL

SQL = Extensible

How to import in SQL ?

SQL Extensions

Why & Extend SQL

PostgreSQL

pgPDF: pdf type for PostgreSQL

pgPDF: pdf type for PostgreSQL

pgPDF & http

pgJQ: JQ in PostgreSQL

spat: Like Redis, but it’s PostgreSQL!

Extending Postgres With Python

py_string_join in PL/Python

moving_avg in PL/Python

PL/Python Considerations

SQLite & DuckDB

Extending SQLite With Python

py_string_join in SQLite

py_string_join in DuckDB

Custom Aggregates: Finding Hidden Correlations

SELECT corr(x, y)

pgxicor: XI (ξ) Correlation Coefficient in Postgres

XI (ξ) Correlation Coefficient in SQLite3

XI (ξ) Correlation Coefficient in SQLite3

Advice for Pythonistas

Thank You!

How to `import` in SQL ?

pgPDF: `pdf` type for PostgreSQL

pgPDF: `pdf` type for PostgreSQL

`py_string_join` in PL/Python

`py_string_join` in SQLite

`py_string_join` in DuckDB

`SELECT corr(x, y)`