If you have ever tried to perform a linguistic text search on specific parts of a complex JSONB document in PostgreSQL, you know the struggle. You are typically forced to choose between two imperfect options: fast-but-imprecise global indexing, or precise-but-painful SQL unnesting.

I recently submitted a patch to the PostgreSQL Commitfest to solve this exact problem. It introduces a new boolean operator, tsmatch, directly into the SQL/JSON path engine. It works like this:

select jsonb_path_query('[null, 1, "running", "runs", "ran", "run", "runner", "jogging"]', 
                        '$[*] ? (@ tsmatch "run" tsconfig "english")');
 jsonb_path_query 
------------------
 "running"
 "runs"
 "run"
(3 rows)

This patch integrates Full-Text Search (FTS) natively into path expressions. Modeled after the standard like_regex operator, tsmatch provides:

  • Granular Precision: Filter nested structures without flattening the document.
  • Horizontal Correlation: Match multiple conditions (e.g., “User is Alice” AND “Comment contains ‘Performance’”) within the same object scope.
  • Configurable Parsing: Supports custom dictionaries and query parsers on a per-predicate basis.
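
To make the "horizontal correlation" point concrete, here is a sketch of what such a predicate could look like with this operator. The document shape (comments carrying both an author and a body field) is hypothetical; the syntax follows the tsmatch example shown above:

```sql
-- Both conditions must hold within the SAME comment object, so a document
-- where Alice wrote one comment and "performance" appears only in a
-- different comment does not match.
SELECT jsonb_path_query(
  '[{"author": "alice", "body": "great performance win"},
    {"author": "bob",   "body": "minor cleanup"}]',
  '$[*] ? (@.author == "alice"
           && @.body tsmatch "performance" tsconfig "english")'
);
```

Doing the same with a flattened tsvector is impossible, because flattening discards the object boundaries that make this per-element correlation meaningful.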

To show the motivation behind this, let's set up a working example:

CREATE TABLE fts_json_test (id serial PRIMARY KEY, doc jsonb);

INSERT INTO fts_json_test (doc)
SELECT jsonb_build_object(
   'title', CASE WHEN i % 10 = 0 THEN 'database performance optimization' ELSE 'standard database' END,
   'metadata', jsonb_build_object('tags', jsonb_build_array('tech', 'sql')),
   'comments', jsonb_build_array(
           jsonb_build_object('body', 'this is a comment'),
           jsonb_build_object('body', CASE WHEN i % 20 = 0 THEN 'i saw a performance gain' ELSE 'no gain' END)
       )
   )
FROM generate_series(1, 100000) i;

We start with the new tsmatch operator. We want to find documents where the path $.comments[*].body matches the term "performance".

SELECT count(*)
FROM fts_json_test
WHERE doc @? '$.comments[*] ? (@.body tsmatch "performance" tsconfig "english")';
Aggregate  (actual time=322.206..322.206 rows=1 loops=1)
   ->  Seq Scan on fts_json_test (actual time=0.066..322.091 rows=5000 loops=1)
         Filter: (doc @? '$."comments"[*]?(@."body" tsmatch "performance" ...)')
         Rows Removed by Filter: 95000
 Execution Time: 322.219 ms

This is the baseline for correctness. It correctly identified exactly 5,000 rows (every 20th row, per the i % 20 = 0 condition above). The execution time was 322 ms using a standard sequential scan. The query is readable, concise, and structurally accurate.

Now, let’s try the most common approach: converting the whole document to a vector.

SELECT count(*)
FROM fts_json_test
WHERE jsonb_to_tsvector('english', doc, '["all"]') @@ to_tsquery('english', 'performance');
Finalize Aggregate  (actual time=296.065..296.688 rows=1 loops=1)
   ->  Gather  (actual time=295.838..296.684 rows=2 loops=1)
         Workers Planned: 1
         ->  Parallel Seq Scan on fts_json_test (actual time=0.123..294.106 rows=5000 loops=2)
               Filter: (jsonb_to_tsvector(...) @@ '''perform'''::tsquery)
               Rows Removed by Filter: 45000
 Execution Time: 296.703 ms

Danger: the count is wrong. Look at the Parallel Seq Scan: rows=5000 combined with loops=2 means 10,000 matches in total.

Why? Because jsonb_to_tsvector flattens the entire document, the filter matched the 5,000 real comments plus 5,000 false positives where "performance" appeared only in the title.
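
You can see the flattening directly on a tiny document: jsonb_to_tsvector merges every string value into a single vector, discarding which key each lexeme came from.

```sql
-- Title and comment bodies collapse into one tsvector, so the @@ match
-- cannot tell which field a lexeme came from.
SELECT jsonb_to_tsvector(
         'english',
         '{"title": "database performance optimization",
           "comments": [{"body": "no gain"}]}'::jsonb,
         '["all"]'
       ) @@ to_tsquery('english', 'performance');
-- Returns true even though no comment mentions "performance".
```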

So the trade-off is clear: You gain a few milliseconds (296ms vs 322ms), but you lose data integrity.

Finally, the traditional way to achieve precision: exploding the array with SQL.

SELECT count(*)
FROM fts_json_test t,
     jsonb_array_elements(t.doc->'comments') AS c
WHERE to_tsvector('english', c->>'body') @@ to_tsquery('english', 'performance');
Finalize Aggregate  (actual time=141.692..142.274 rows=1 loops=1)
   ->  Gather  (actual time=141.567..142.272 rows=2 loops=1)
         ->  Nested Loop  (actual time=0.142..140.579 rows=2500 loops=2)
               ->  Parallel Seq Scan on fts_json_test t
               ->  Function Scan on jsonb_array_elements c
 Execution Time: 142.316 ms

Correct Count: rows=2500 * loops=2 = 5,000. It is accurate. Speed: 142 ms. It is faster here because the documents are small and parallelization kicked in efficiently.

Complexity: Look at the query plan. We are running a Nested Loop over a Function Scan: every row is joined against its own unnested comments array. In real-world scenarios with larger documents, this "explode-and-filter" strategy creates massive memory pressure.

The implementation relies on GIN path-matching for index pruning and heap re-checks for precision. Caching is scoped to the JsonPathExecContext, so the tsquery and configuration OID lookup are compiled once per execution ('compile-once' efficiency) while respecting the stability requirements of prepared statements.
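
In practice, the pruning piece can be an ordinary jsonb GIN index on the document column; a sketch, assuming the patched @? operator participates in index pruning the way other jsonpath predicates do with jsonb_path_ops (the index name is made up for the example):

```sql
-- A standard GIN index on the document column. Structural path matching
-- can prune candidate rows; the tsmatch predicate is then re-checked
-- against each surviving heap tuple for precision.
CREATE INDEX fts_json_test_doc_idx
    ON fts_json_test USING gin (doc jsonb_path_ops);
```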

If you care about better JSON support in Postgres, check out the patch here: https://commitfest.postgresql.org/patch/6436/