With all these LLMs flying around, PDFs have become popular data sources again. Wait, can a blob of binary be considered a data source, just because human eyes can interpret it visually? Well, no. Unless you can run SQL queries on top of them. This post discusses that.

Parsing PDFs with SQL

PDFs are usually parsed via a tool like pdftotext, first and piped into the database. I find it more convenient, though, to query the filesystem with SQL.

That’s why I built pgPDF GitHub; a Postgres extension that allows you to

SELECT pdf_read_file('/path/file.pdf') → text

It’s a thin wrapper around poppler, really, nothing fancy in itself.

For that to work, you’ll need the following packages installed, along with their header files: poppler glib and cairo

cd /tmp
git clone https://github.com/Florents-Tselai/pgpdf.git
cd pgpdf
make
make install # may need sudo

See installation if you get trouble with building the extension.

Getting Data

Let’s download some PDF files in a /tmp/pdfs directory. This should be a directory where Postgres can read from: i.e. the (usually) postgres user executing the server process. That’s different than the user executing the psql command!

mkdir -p /tmp/pdfs && cd /tmp/pdfs && wget https://wiki.postgresql.org/images/e/ea/PostgreSQL_Introduction.pdf https://afiskon.github.io/static/2017/postgresql-sharding-and-ha-pgconf-asia-2017.pdf https://momjian.us/main/writings/pgsql/mvcc.pdf https://momjian.us/main/writings/pgsql/inside_shmem.pdf https://momjian.us/main/writings/pgsql/locking.pdf

Storing the Data

We’ll store the data in a simple table. The columns txt and bytes will store the PDF files in text and binary format respectively.

We’ll also store a tsvector for each PDF. A tsvector represents a document in a form optimized for text search. It stores a list of distinct lexemes (words) and their positions in the document, allowing for efficient searching and ranking of results based on their relevance to a search query. Lexemes are language-specific, thus, depending on the language we’re interested in, we should store the appropriate tsvector.

This operation is computationally expensive, but it needs to be run only once for each document, so it makes sense to store it in a generated column.

Subsequent FTS queries are executed against these tsvectors, not against the txt column. We’re not doing simple pattern matching here.

CREATE TABLE pdfs (
    id      int     generated always as identity,
    path    text,
    name    text     generated always as ( (substring(path FROM '.*/([^/]+)$')) ) stored,
    txt     text,
    bytes   bytea,
    
    tsvec_si tsvector generated always as (to_tsvector('simple', txt)) stored,
    tsvec_en tsvector generated always as (to_tsvector('english',txt)) stored
);
INSERT INTO pdfs(path)
SELECT '/tmp/pdfs/' || pg_ls_dir('/tmp/pdfs');
CREATE EXTENSION pgpdf;

Now we’ll use the pdf_read_file function from pgpdf to actually read the file from its path, parse it into text (poppler is used behind the scenes) and store it in the txt column.

UPDATE  pdfs
SET     txt = pdf_read_file(path);

Notice, that this UPDATE will also trigger the update on the generated columns tsvec_si and tsvec_en which store the tsvector of each document.

Parsing PDF from bytea

If you don’t have the actual file available, and you have stored its contents in a bytea column, you can use the pdf_read_bytes(bytea) function to populate the txt column.

Running FTS Queries

FTS in Postgres can be extremely powerful (see docs) but it generally takes the form of tsvector @@ tsquery operator.

A tsquery defines the filter for matching documents against the tsvector data type. There are also different flavors available: plainto_tsquery, phraseto_tsquery and websearch_to_tsquery. (see docs)

To search for documents mentioning Postgres

SELECT name
FROM pdfs
WHERE tsvec_en @@ to_tsquery('english', 'Postgres');
                      name                       
-------------------------------------------------
 mvcc.pdf
 PostgreSQL_Introduction.pdf
 inside_shmem.pdf
 postgresql-sharding-and-ha-pgconf-asia-2017.pdf
 locking.pdf
(5 rows)

To search for documents mentioning Postgres and sharding

SELECT name
FROM pdfs
WHERE tsvec_en @@ to_tsquery('english', 'Postgres & Sharding');
                      name                       
-------------------------------------------------
 postgresql-sharding-and-ha-pgconf-asia-2017.pdf
(1 row)

Improving Performance

Our sample table is too small to worry about performance, but in general, one can crate GIN indices on tsvector columns.

CREATE INDEX ON pdfs USING GIN(tsvec_si);
CREATE INDEX ON pdfs USING GIN(tsvec_en);