Empowering Government Transparency with PostgreSQL

Florents (Flo) Tselai

tselai.com

About Me

Florents (Flo) Tselai | tselai.com

  • Analytics & Data Engineering Background
  • PostgreSQL extensions
  • SQLite & DuckDB too.
  • Custom algos as extensions pgxicor, vasco
  • Postgres in ETL pipelines: pgJQ pgPDF PDF type for Postgres
  • Redis-like in-memory DB in PostgreSQL: spat

What’s diofanti.org

It’s a BI tool for citizens & taxpayers.

diofanti = PostgreSQL(JSONB + FTS)

Impact

  • Routinely used by investigative journalists.
  • Many big co’s have had their shady contracts discovered & canceled. #sorry_not_sorry
  • Government opposition (at all levels)
  • User stories are always surprising!

Data ?

diavgeia

  • Launched in 2011 to answer: “Where did the money go?”
  • Decision not published = not executed.
  • Publishes decisions made by the broader Greek state
  • Payments big or payments, gov contracts, building permits,
  • Decision = pdf = json document
  • Based on ElasticSearch
  • On purpose Missing data (see jsonb_strip_nulls)

70M decisions so far

20K decisions / day

How it started?

Visa Problems

  • 2018 waiting for my mom’s citizenship decision to be published
  • Dilemma: Renew for a 10-year visa ($$$) or renew for a short-term one ?
  • Decision could be published any time.

My Mom’s not tall! She has to wait!

But her son is a geek.

Can I Automate ?

Scrap & search for mom’s name in daily decisions

Discovered Diavgeia API!

Personal philosophy

Unix Tools + Declarative SQL

1st attempt (2018 - PG12)

  • Crawl with curl
  • query with jq
  • Index in SQLite
  • Coordinate with Makefiles
  • Too brittle for analytics

I want to use more SQL for orchestration

Discovered psql-http Extension

https://github.com/pramsey/pgsql-http

CREATE EXTENSION http;

SELECT content
FROM http_get('http://httpbun.com/ip');

Problem: User Agent is Blocked (anti-scraping)

pgPDF: pdf type for PostgreSQL

CREATE EXTENSION pgpdf;
SELECT '/tmp/pgintro.pdf'::pdf;
            pdf                                        
-------------------------
 PostgreSQL Introduction 
 Digoal.Zhou 
 7/20/2011Catalog 
  PostgreSQL Origin 
SELECT pdf_title('/tmp/pgintro.pdf');
SELECT pdf_author('/tmp/pgintro.pdf');

pgPDF: pdf type for PostgreSQL

CREATE TABLE pdfs(name text primary key, doc pdf);

INSERT INTO pdfs VALUES ('pgintro', '/tmp/pgintro.pdf');
INSERT INTO pdfs VALUES ('pgintro', '/tmp/sample.pdf');

SELECT 'DOC START ' || doc;

SELECT name FROM pdfs WHERE doc::text LIKE '%Postgres%';

pgPDF & http

Combining extensions.

CREATE EXTENSION pgpdf;
CREATE EXTENSION http;

SELECT pdf_read_bytes(text_to_bytea(content))
FROM http_get('https://....pdf');

pgJQ: JQ in PostgreSQL

CREATE EXTENSION pgjq;
       
SELECT jq('[{"bar": "baz", "balance": 7.77, "active":false}]'::jsonb, 
          '.[0].bar');
    jq  
---------
"baz"
(1 row)

select jq('{
  "runner": 1,
  "message": "jobStatus",
  "jobs": [
    {
      "id": 9,
      "is_successfull": true
    },
    {
      "id": 100,
      "is_successfull": false,
      "metdata": {
        "environ": "prod"
      }
    }
  ]
}'::jsonb, 
          '.jobs[] | select(.is_successfull == $is_success and .id == 100) | .', 
          '{"is_success": false, "id": 100}');

                                  jq                                  
----------------------------------------------------------------------
 {"id": 100, "metdata": {"environ": "prod"}, "is_successfull": false}
(1 row)

Workflows

Using the best data workflow engineering tool!

Greek FTS-specific Improvements

  • Core PG FTS config for Greek lacks some stopwords
  • Tried to push to core but there was more bureaucracy
  • Extension https://github.com/Florents-Tselai/pg_fts_greek

Looks complex. Why ?

Started with PG12, now we’re at PG18!

  • Functionality didn’t exist yet (i.e. JSON_TABLE or certain SQL/JSON operationss)
  • I didn’t know about some functionality. (e.g. GENERATED COLUMNS –> Tablespaces)
  • I was lazy to replace already-working functionality

Today

As simple as

make get-new-decisions
make index-missing
make get-decisions-text

Why

  • I learned better SQL/JSON as it was being improved
  • Virtual generated columns (soon)

The (Near) Future

Thank you!

florents@tselai.com