We Need More DSLs in the Database
TL;DR: Some thoughts around pgJQ: Use jq in Postgres
In my projects, I typically work with data residing outside the database at first,
be it scraped data, API responses,
or data from different stores glued together in pipelines.
The first version of a working pipeline involves plenty of curl, jq, sed, awk,
and regular expressions between database queries.
Such an approach gets a working pipeline up asap and lets me iterate on top of it.
As the project evolves, I resist the temptation to persist data and instead pipe them together.
When I need some persistence, I prefer /dev/shm
and shared memory as a storage layer and IPC medium:
curl writes to /dev/shm, and jq reads from there.
When I need expensive computations, I prefer to write database-level logic
using procedural languages, or to refine my queries.
But this approach has its limits, and at some point certain symptoms appear. At first, syntax suffers: some one-liners become heredocs, then heredocs become auto-generated files communicating via pipes. Then available memory steadily shrinks, and I find myself creating a “buffer” between memory and disc. Then disc space fills up too fast, and I start considering binary representations and compression.
And then the realization kicks in:
I feel an intimidating presence behind me,
I turn my head, and I see the elephant in the room:
I am recreating that thing we call a database.
It is then I know that I should succumb to the temptation
and use the proper tool for the job.
I smile back at Slonik,
he extends his trunk and offers me psql.
Motivation
In a recent repeat
of that old dance routine,
I found myself siphoning data
from a Postgres instance through a sequence of jq filters
and then writing the results back to Postgres.
Some of these filters were pretty complicated, and I couldn’t remember
their details, but I knew they worked,
as the schema had stayed the same.
I pushed the data to Postgres and started trying to do the same extractions
in the typical Postgres fashion:
using jsonb
and its assorted operators and functions.
Some of these transformations involved heavy usage of jsonpath
expressions
as a replacement for jq
filters.
That was a tedious and complicated process with lots of trial and error,
and as I did not have stable tests, I was not sure if I had covered all the
corner cases.
It was then that I realized what I was doing.
I was using a DSL (jsonpath)
within a DSL (SQL)
as a replacement for another DSL (jq),
with both operating on the same data abstraction,
only with different syntax.
Indeed, the way
jsonpath
works in Postgres is pretty straightforward and convenient:
in a function like jsonb_path_func(jsonb, jsonpath)
you supply a normal SQL string,
which is cast to a special jsonpath object
and given special semantics.
For example:
select *
from jsonb_path_query('{"a":[1,2,3,4,5]}',
'$.a[*] ? (@ >= $min && @ <= $max)',
'{"min":2, "max":4}')
You express your program’s logic between two single quotes. Everything else happens behind the scenes and is opaque to the user.
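To see that cast in isolation, the same string literal can be converted to the jsonpath type explicitly, just like any other typed literal:
select '$.a[*] ? (@ >= $min && @ <= $max)'::jsonpath;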
That got me thinking:
What if I could use
jq
expressions instead of jsonpath
ones?
Just plug them in and forget about the tedious translation process. Then, I could lift and shift all the existing logic and reuse it in Postgres.
What if I had a Postgres function that could look like this:
select jq(jsonb, jqprog)
I could re-use the jq
programs I already have,
but have them executed within the database,
without moving data around.
It would be a pushdown optimization:
sending the program to the data, instead of fetching data from the disc
and piping it to another process, which is the typical jq way.
This would be simpler, cleaner, and more performant.
And, of course, it would allow me to quickly choose between
jsonpath
and jq
syntax, depending on what I’d like to achieve.
For example, when working with arrays, I prefer jq
to jsonpath
syntax.
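To give one hypothetical flavor of that preference: deduplicating an array is a single built-in filter in jq, while jsonpath has no real equivalent, and the plain jsonb route means unnesting and re-aggregating:
-- jq: '.a | unique'
select jsonb_agg(distinct x order by x)
from jsonb_array_elements('{"a":[3,1,3,2]}'::jsonb -> 'a') as x;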
In the jsonpath
example above, you’ll notice those arguments
to the expression program are passed as a jsonb object,
so I would like to be able to do the same with jqprog
as well.
Postgres also offers operators (mainly used in the WHERE
clause of a query)
to filter jsonb values against a predicate, like:
select * from tbl where jsonb_col @? jsonpath
Similar constructs (functions and operators) are available throughout Postgres;
in full-text search, for instance, the analogous abstractions are tsvector
and tsquery.
This is merely syntactic sugar, but it has a big impact on the conciseness
and maintainability of the resulting query.
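For illustration, here is what that sugar looks like on both sides:
-- jsonpath: does this document match the predicate?
select '{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] ? (@ > 4)';

-- full-text search: does this tsvector match the tsquery?
select to_tsvector('english', 'the quick brown fox') @@ to_tsquery('english', 'quick & fox');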
So I would like something similar, but instead of filtering only,
I would prefer to chain multiple jq
filters together
to extract parts of a jsonb
object
sequentially, like:
select jsonb @? jqprog @? jqprog ...
Implementation
I imagined having this as a Postgres extension,
gave it a name (pgJQ), and started working on the implementation.
Following the same logic as jsonpath,
a jqprog
is just a text-based type,
nothing fancy there.
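For the curious, the usual shape of such a text-like type in an extension’s SQL script is sketched below. This is a hypothetical illustration of the standard pattern (a shell type, C in/out functions, storage borrowed from text), not pgJQ’s actual definitions, which may differ:
-- hypothetical sketch, not pgJQ's actual SQL script
CREATE TYPE jqprog;

CREATE FUNCTION jqprog_in(cstring) RETURNS jqprog
    AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;
CREATE FUNCTION jqprog_out(jqprog) RETURNS cstring
    AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;

CREATE TYPE jqprog (
    INPUT = jqprog_in,
    OUTPUT = jqprog_out,
    LIKE = text
);

-- the entry point the post wishes for: jq(jsonb, jqprog)
CREATE FUNCTION jq(jsonb, jqprog) RETURNS jsonb
    AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT;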
A jsonpath, though, is scanned and parsed
by a full-fledged compiler within Postgres (using flex and bison)
and is executed by an internal execution engine
as a chain (tree) of path items that operate on the target jsonb object.
You can read more about this here
if you’re interested.
But I certainly wasn’t interested in creating another implementation of jq.
I wanted something that could be hacked in 2-3 days.
Thus, my only option was to re-use the vanilla jq
compiler.
There’s no such thing as a jq compiler, you might say.
Well, there is, but it is hidden within jq.
I knew from experience that CLI-first tools tend to tailor their API to
their CLI, and jq was no exception.
It is stream-oriented,
and its C API
is tailored around the CLI version of jq
as a tool.
Sure, there are the structs and their interfaces,
but there’s no clear separation
between library and command-line tool,
as there is, for instance, with libcurl and curl_tool.
It didn’t take me long to realize I had to do some heavy code reading.
I ended up deciphering, breaking, and fixing
jq
’s main
function to see how the whole thing works.
To make this work with Postgres,
I had to trick the jq compiler into thinking it reads from a stream,
when in fact it reads from a SQL text variable packaged as a Datum
(a blob of bytes).
Check this commit.
The fmemopen function came in handy for this:
FILE *file_json = fmemopen(json_string, strlen(json_string), "r");
Then, the final jv
result (jv
is the core abstraction within jq
) is
passed to the function below
that recursively parses it to build a JsonbValue
, which is the result we’re
interested in.
JsonbValue
is the in-memory representation of the JSON object in Postgres.
JsonbValue *JvObject_to_JsonbValue(jv *obj, JsonbParseState **jsonb_state, bool is_elem)
This was inspired by a similar effort
in jsonb_plpython
that converts a Python dict
to a jsonb object.
Usage
You can see examples in pgJQ
’s README,
but the gist is that with pgJQ,
you can now write basic filters like this:
SELECT jq('[{"bar": "baz", "balance": 7.77, "active": false}]'::jsonb, '.[0].bar');
jq
-------
"baz"
(1 row)
And more complex jq
programs like this:
SELECT jq('[true,false,[5,true,[true,[false]],false]]',
'(..|select(type=="boolean")) |= if . then 1 else 0 end');
jq
-----------------------------
[1, 0, [5, 1, [1, [0]], 0]]
(1 row)
You can, for example, use jq’s
select function within SQL’s select
expression.
SELECT jq('[1,5,3,0,7]', '(.[] | select(. >= 2)) |= empty');
jq
--------
[1, 0]
(1 row)
If you want to pass dynamic arguments to jqprog
,
you can pass them as jsonb
objects
and refer to them as $var
.
select jq('{
"runner": 1,
"message": "jobStatus",
"jobs": [
{
"id": 9,
"is_successfull": true
},
{
"id": 100,
"is_successfull": false,
"metdata": {
"environ": "prod"
}
}
]
}'::jsonb,
'.jobs[] | select(.is_successfull == $is_success and .id == 100) | .',
'{"is_success": false, "id": 100}');
jq
----------------------------------------------------------------------
{"id": 100, "metdata": {"environ": "prod"}, "is_successfull": false}
(1 row)
Reflections
Developing pgJQ and the process I followed prompted me to think more about the topic.
Postgres is very welcoming to DSLs. For those not familiar with the Postgres extension infrastructure, it is pretty straightforward yet powerful.
In my case, though,
that simplicity is, to a great extent, due to the fact that text is the
universal interface.
JSON is text, jq
is text, jsonpath
is text, so the whole plumbing
works easily.
Text is the universal interface, after all.
Sure,
Postgres has beautiful source code,
an abundance of documentation,
and a well-defined API,
but that factor alone greatly simplified things.
A lot of logic can fit between two single quotes.
Datum as a universal interface.
Indeed, everything is based on text,
but that text is parsed into a Datum
.
Datum is the core abstraction behind Postgres
and every SQL type is handled as a Datum
.
Considering the rich type system of Postgres,
plus the network of extensions,
there is probably a Datum
-based
implementation of any semantic type you might need:
text, JSON, arrays, geometric
and geographic types,
time series,
and of course vectors.
All of these can now live within the same process space.
That is no small thing and can have (and is having) a significant impact.
In vasco, for example,
another Postgres extension I built recently,
I had to model the problem of an association score between two variables.
That, too, was straightforward.
One or two CREATE TYPE
s here and there,
some glue C code,
and I brought my domain requirements to the database.
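As a hypothetical illustration of what that looks like (not vasco’s actual definitions), a domain type for such a score can be as small as:
-- hypothetical, for illustration only: a composite type for an
-- association score between two variables
CREATE TYPE assoc_score AS (
    var_x text,
    var_y text,
    score double precision
);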
I wonder if there could be a more general pattern there,
particularly around using Postgres transforms
more systematically.
I’m thinking of something Datum
-based
but serving purposes similar to protobuf or Avro.
Could that glue C code even be code-generated?
The database is living its Pandas moment.
I consider Datum the equivalent of the DataFrame,
and it can succeed where that failed (?).
People came to Pandas for pd.read_csv,
stayed for the performance and convenience of pd.concat and pd.merge,
but left when the API became cumbersome, painful, and bloated.
At least, that’s the case for me.
There’s maybe a lesson there.
The more domain logic comes to the database, the more expressive power we’ll need to express that logic. SQL will remain the workhorse behind the scenes, but let’s be honest: SQL is fairly “general” as a DSL; it can probably express any logic conceivable. But that is also where it starts being abused. A lot of SQL propagates itself due to its set-oriented nature: we want just a simple transformation, yet we have to remodel it with CTEs, subqueries, and lateral joins.
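A small, hypothetical illustration of that remodeling: bumping every number in a nested array is a one-line jq filter, while the jsonb route needs a subquery and an aggregate just to rebuild the array:
-- jq: '.a[] |= . + 1'
select jsonb_set(j, '{a}',
       (select jsonb_agg(x::int + 1)
        from jsonb_array_elements_text(j -> 'a') as x))
from (select '{"a": [1, 2, 3]}'::jsonb as j) s;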
We can do better there: we can augment SQL with much, much smaller DSLs, with the goal not of replacing SQL but of making it less awkward for particular domains and scenarios, and letting it shine where it does best.