
We Need More DSLs in the Database

27 Sep 2023

TL;DR: Below are some thoughts around pgJQ: Use jq in Postgres

In my projects, I typically work with data residing outside the database at first, be it scraped data, API responses, or data from different stores glued together in pipelines. The first version of a working pipeline involves many curl, jq, sed, awk, and regular expressions between database queries. Such an approach lets me have a working pipeline asap and iterate on top of it.

As the project evolves, I resist the temptation to persist data and instead pipe them together. When I need some persistence, I prefer /dev/shm, i.e. shared memory, as a storage layer and IPC medium: curl writes to /dev/shm, and jq reads from there. When computations get expensive, I prefer to write database-level logic using procedural languages or to refine my queries.

But this approach has its limits, and at some point, certain symptoms appear. At first, the syntax suffers: one-liners become heredocs, then heredocs become auto-generated files communicating via pipes. Then available memory steadily shrinks, and I find myself creating a "buffer" between memory and disc. Then disc space fills up too fast, and I start considering binary representations and compression.

And then the realization kicks in: I feel an intimidating presence behind me, I turn my head, and I see the elephant in the room: I am recreating that thing we call a database. That is when I know I should succumb to the temptation and use the proper tool for the job. I smile back at Slonik; he extends his trunk and offers me psql.

Motivation

In a recent repeat of that old dance routine, I found myself siphoning data from a Postgres instance into a sequence of jq filters and then writing the results back to Postgres. Some of these filters were pretty complicated, and I couldn't remember their details, but I knew they worked, as the schema had stayed the same.

I pushed the data to Postgres and started trying to do the same extractions in the typical Postgres fashion: using jsonb and its assorted operators and functions. Some of these transformations involved heavy usage of jsonpath expressions as a replacement for jq filters. That was a tedious and complicated process with lots of trial and error, and as I did not have stable tests, I was not sure whether I had covered all the corner cases.

It was then that I realized what I was doing. I was using a DSL (jsonpath), within a DSL (SQL) as a replacement for another DSL (jq), both operating on the same data abstraction, only with different syntax.

Indeed, the way jsonpath works in Postgres is pretty straightforward and convenient: in a function like jsonb_path_func(jsonb, jsonpath), you supply a normal SQL string, which is cast to a special jsonpath object and given special semantics.

For example:

select * 
from jsonb_path_query('{"a":[1,2,3,4,5]}', 
                      '$.a[*] ? (@ >= $min && @ <= $max)', 
                      '{"min":2, "max":4}')

You express your program's logic between two single quotes. Everything else happens behind the scenes and is opaque to the user.

That got me thinking:

What if I could use jq expressions instead of jsonpath ones?

Just plug them in and forget about the tedious translation process. Then, I could lift and shift all the existing logic and reuse it in Postgres.

What if I had a Postgres function that could look like this:

select jq(jsonb, jqprog)

I could re-use the jq programs I already have, but have them executed within the database, without moving data around.

It would be a pushdown optimization: sending the program to the data instead of fetching the data from the disc and piping it to another process, which is the typical jq way. This would be simpler, cleaner, and more performant. And, of course, it would allow me to quickly choose between jsonpath and jq syntax, depending on what I'd like to achieve. For example, when working with arrays, I prefer jq to jsonpath syntax.

In the jsonpath example above, you'll notice that the arguments to the expression program are passed as a jsonb object, so I would like to be able to do the same with jqprog as well.

Postgres also offers operators (mainly used in the WHERE clause of a query) to filter jsonb values against a predicate, like:

select * from tbl where jsonb @? jsonpath

Similar constructs (functions and operators) are available throughout Postgres; for instance, in full-text search, where the analogous abstractions are tsvector and tsquery. This is merely syntactic sugar, but it has a big impact on the conciseness and maintainability of the resulting query.
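For comparison, here is that pattern in full-text search (plain Postgres, nothing pgJQ-specific): one function builds the searchable value, another builds the query DSL, and an operator glues the two together.

select to_tsvector('english', 'the quick brown fox') @@ to_tsquery('english', 'quick & fox');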

So I would like something similar, but instead of filtering only, I would prefer to chain multiple jq filters together to extract parts of a jsonb object sequentially, like:

select jsonb @? jqprog @? jqprog ...
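A minimal sketch of how such chaining could be wired up, assuming a jq(jsonb, jqprog) function that returns a single jsonb value; the operator name @| below is purely illustrative, chosen so it doesn't clash with the built-in jsonb @? jsonpath operator:

create operator @| (
    leftarg  = jsonb,
    rightarg = jqprog,
    function = jq
);

-- operators are left-associative, so each filter's jsonb result feeds the next one
select '{"a": {"b": [1, 2, 3]}}'::jsonb @| '.a' @| '.b[0]';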

Implementation

I imagined having this as a Postgres extension, gave it a name (pgJQ), and started working on the implementation. Following the same logic as jsonpath, a jqprog is just a text-based type; nothing fancy there.
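For readers unfamiliar with extension scripts, here is a hypothetical sketch of what declaring such a text-based type and the accompanying function could look like; the names and details are illustrative, not pgJQ's actual script.

-- shell type first, so the I/O functions can reference it
create type jqprog;

create function jqprog_in(cstring) returns jqprog
    as 'MODULE_PATHNAME' language c immutable strict;
create function jqprog_out(jqprog) returns cstring
    as 'MODULE_PATHNAME' language c immutable strict;

-- a variable-length type stored just like text
create type jqprog (
    input          = jqprog_in,
    output         = jqprog_out,
    internallength = variable,
    storage        = extended
);

create function jq(jsonb, jqprog) returns jsonb
    as 'MODULE_PATHNAME' language c immutable strict;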

Jsonpath, though, is scanned and parsed by a full-fledged compiler within Postgres (using flex and bison) and is executed by an internal execution engine as a chain (tree) of path items that work on the target jsonb object. You can read more about this here if you're interested. But I certainly wasn't interested in creating another implementation of jq.

I wanted something that could be hacked together in two or three days. Thus, my only option was to re-use the vanilla jq compiler. There is no such thing as a standalone jq compiler, of course; there is one, but it is hidden within jq itself.

I knew from experience that CLI-first tools tend to tailor their API to their CLI, and jq is no exception. It is stream-oriented, and its C API is built around jq the command-line tool.

Sure, there are the structs and their interfaces, but there's no clear separation between library and tool, as there is with, for instance, libcurl and the curl tool.

It didn't take me long to realize I had to do some heavy code reading. I ended up deciphering, breaking, and fixing jq's main function to see how the whole thing works.

To make this work with Postgres, I had to trick the jq compiler into thinking it was reading from a stream, when in fact the input was a SQL text variable packaged as a Datum (a blob of bytes). Check this commit.

The fmemopen function came in handy for this:

/* expose the in-memory JSON text (unpacked from its Datum) as a read-only stream for jq to read */
FILE *file_json = fmemopen(json_string, strlen(json_string), "r");

Then, the final jv result (jv is the core abstraction within jq) is passed to the function below that recursively parses it to build a JsonbValue, which is the result we're interested in. JsonbValue is the in-memory representation of the JSON object in Postgres.

JsonbValue *JvObject_to_JsonbValue(jv *obj, JsonbParseState **jsonb_state, bool is_elem)

This was inspired by a similar effort in jsonb_plpython, which converts a Python dict to a jsonb object.

Usage

You can see examples in pgJQ's README, but the gist is that with pgJQ, you can now write basic filters like this:

SELECT jq('[{"bar": "baz", "balance": 7.77, "active": false}]'::jsonb, '.[0].bar');
  jq   
-------
 "baz"
(1 row)

And more complex jq programs like this:

SELECT jq('[true,false,[5,true,[true,[false]],false]]',
          '(..|select(type=="boolean")) |= if . then 1 else 0 end');
             jq              
-----------------------------
 [1, 0, [5, 1, [1, [0]], 0]]
(1 row)

You can, for example, use jq's select function within SQL's SELECT expression:

SELECT jq('[1,5,3,0,7]', '(.[] | select(. >= 2)) |= empty');
   jq   
--------
 [1, 0]
(1 row)

If you want to pass dynamic arguments to jqprog, you can pass them as a jsonb object and refer to them as $var:

select jq('{
  "runner": 1,
  "message": "jobStatus",
  "jobs": [
    {
      "id": 9,
      "is_successfull": true
    },
    {
      "id": 100,
      "is_successfull": false,
      "metdata": {
        "environ": "prod"
      }
    }
  ]
}'::jsonb,
          '.jobs[] | select(.is_successfull == $is_success and .id == 100) | .',
          '{"is_success": false, "id": 100}');
                                  jq                                  
----------------------------------------------------------------------
 {"id": 100, "metdata": {"environ": "prod"}, "is_successfull": false}
(1 row)

Reflections

Developing pgJQ and the process I followed prompted me to think more about the topic.

Postgres is very welcoming to DSLs. For those not familiar with it, the Postgres extension infrastructure is pretty straightforward yet powerful.

In my case, though, that simplicity is owed, to a great extent, to the fact that text is the universal interface. JSON is text, jq is text, jsonpath is text, so the whole plumbing works easily. Sure, Postgres has beautiful source code, an abundance of documentation, and a well-defined API, but that text factor alone greatly simplified things. A lot of logic can fit between two single quotes.

Datum as a universal interface. Indeed, everything is based on text, but that text is parsed into a Datum. Datum is the core abstraction behind Postgres, and every SQL type is handled as a Datum. Considering the rich type system of Postgres, plus the network of extensions, there is probably a Datum-based implementation of any semantic type you might need: text, JSON, arrays, geometric and geographic types, time series, and, of course, vectors. All of these can now live within the same process space. That is no small thing and can have (and is having) a significant impact.

In vasco, for example, another Postgres extension I built recently, I had to model an association score between two variables. That was equally straightforward: one or two CREATE TYPEs here and there, some glue C code, and I had brought my domain requirements into the database.

I wonder if there could be a more general pattern there, particularly around using Postgres transforms more systematically. I'm thinking of something Datum-based but serving purposes similar to protobuf or Avro. Could that glue C code even be code-generated?
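For context, a transform is what lets a procedural language see a Datum as a native value. A rough sketch of what that looks like today with jsonb_plpython, mentioned earlier; the count_jobs function is made up for illustration:

create extension jsonb_plpython3u cascade;

-- thanks to the transform, the jsonb argument arrives as a native Python dict
create function count_jobs(payload jsonb) returns int
language plpython3u
transform for type jsonb
as $$
return len(payload.get("jobs", []))
$$;

select count_jobs('{"jobs": [{"id": 9}, {"id": 100}]}'::jsonb);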

The database is living its Pandas moment. I consider the Datum the equivalent of the DataFrame, and it can perhaps succeed where the DataFrame failed. People came to Pandas for pd.read_csv, stayed for the performance and convenience of pd.concat and pd.join, but left when the API became cumbersome, painful, and bloated. At least, that was the case for me. There is maybe a lesson in there.

As more domain logic moves into the database, we will need more expressive power to express it. SQL will remain the workhorse behind the scenes, but let's be honest: SQL is fairly "general" as a DSL; it can probably express any logic conceivable. But that is also when it starts being abused. A lot of SQL propagates itself because of its set-oriented nature: we want just a simple transformation, yet we have to remodel it with CTEs, subqueries, and lateral joins.

We can do better there: we can augment SQL with much, much smaller DSLs, aiming not to replace SQL but to make it less awkward for particular domains and scenarios, and to let it shine where it does best.