Full Text Search on PDFs With Postgres
With all these LLMs flying around, PDFs have become popular data sources again. Wait, can a blob of binary be considered a data source, just because human eyes can interpret it visually? Well, no. Unless you can run SQL queries on top of them. This post discusses that.
Parsing PDFs with SQL
PDFs are usually parsed via a tool like pdftotext, first and piped into the database. I find it more convenient, though, to query the filesystem with SQL.
That’s why I built pgPDF GitHub; a Postgres extension that allows you to
SELECT pdf_read_file('/path/file.pdf') → text
It’s a thin wrapper around poppler, really, nothing fancy in itself.
For that to work, you’ll need the following packages installed, along with their header files: poppler glib and cairo
cd /tmp
git clone https://github.com/Florents-Tselai/pgpdf.git
cd pgpdf
make
make install # may need sudo
See installation if you get trouble with building the extension.
Getting Data
Let’s download some PDF files in a /tmp/pdfs
directory.
This should be a directory where Postgres can read from:
i.e. the (usually) postgres
user executing the server process.
That’s different than the user executing the psql
command!
mkdir -p /tmp/pdfs && cd /tmp/pdfs && wget https://wiki.postgresql.org/images/e/ea/PostgreSQL_Introduction.pdf https://afiskon.github.io/static/2017/postgresql-sharding-and-ha-pgconf-asia-2017.pdf https://momjian.us/main/writings/pgsql/mvcc.pdf https://momjian.us/main/writings/pgsql/inside_shmem.pdf https://momjian.us/main/writings/pgsql/locking.pdf
Storing the Data
We’ll store the data in a simple table.
The columns txt
and bytes
will
store the PDF files in text and binary format respectively.
We’ll also store a tsvector for each PDF. A tsvector represents a document in a form optimized for text search. It stores a list of distinct lexemes (words) and their positions in the document, allowing for efficient searching and ranking of results based on their relevance to a search query. Lexemes are language-specific, thus, depending on the language we’re interested in, we should store the appropriate tsvector.
This operation is computationally expensive, but it needs to be run only once for each document, so it makes sense to store it in a generated column.
Subsequent FTS queries are executed against these tsvectors,
not against the txt
column.
We’re not doing simple pattern matching here.
CREATE TABLE pdfs (
id int generated always as identity,
path text,
name text generated always as ( (substring(path FROM '.*/([^/]+)$')) ) stored,
txt text,
bytes bytea,
tsvec_si tsvector generated always as (to_tsvector('simple', txt)) stored,
tsvec_en tsvector generated always as (to_tsvector('english',txt)) stored
);
INSERT INTO pdfs(path)
SELECT '/tmp/pdfs/' || pg_ls_dir('/tmp/pdfs');
CREATE EXTENSION pgpdf;
Now we’ll use the pdf_read_file
function from pgpdf
to actually read the file from its path
,
parse it into text (poppler
is used behind the scenes)
and store it in the txt
column.
UPDATE pdfs
SET txt = pdf_read_file(path);
Notice, that this UPDATE
will also trigger the update
on the generated columns tsvec_si
and tsvec_en
which store the tsvector
of each document.
Parsing PDF from bytea
If you don’t have the actual file available,
and you have stored its contents in a bytea
column,
you can use the pdf_read_bytes(bytea)
function
to populate the txt
column.
Running FTS Queries
FTS in Postgres can be extremely powerful (see docs)
but it generally takes the form of tsvector @@ tsquery
operator.
A tsquery defines the filter
for matching documents against the tsvector data type.
There are also different flavors available:
plainto_tsquery
, phraseto_tsquery
and websearch_to_tsquery
.
(see docs)
To search for documents mentioning Postgres
SELECT name
FROM pdfs
WHERE tsvec_en @@ to_tsquery('english', 'Postgres');
name
-------------------------------------------------
mvcc.pdf
PostgreSQL_Introduction.pdf
inside_shmem.pdf
postgresql-sharding-and-ha-pgconf-asia-2017.pdf
locking.pdf
(5 rows)
To search for documents mentioning Postgres
and sharding
SELECT name
FROM pdfs
WHERE tsvec_en @@ to_tsquery('english', 'Postgres & Sharding');
name
-------------------------------------------------
postgresql-sharding-and-ha-pgconf-asia-2017.pdf
(1 row)
Improving Performance
Our sample table is too small
to worry about performance,
but in general, one can crate GIN
indices
on tsvector
columns.
CREATE INDEX ON pdfs USING GIN(tsvec_si);
CREATE INDEX ON pdfs USING GIN(tsvec_en);