pgpdf: pdf type for Postgres

In a previous post I presented pgPDF (GitHub), a Postgres extension to access PDF files directly from Postgres. I received some positive feedback, so I decided to polish it a bit more.

The key update is that pdf is now a valid Postgres type: a varlena (variable-length) object of bytes.

One can create a pdf using simple casts from text or bytea. In the former case, the input text is interpreted as the path to a PDF file. The latter is useful if you don’t have the PDF file in your filesystem but have already stored its content in a bytea column.

SELECT '/tmp/pgintro.pdf'::pdf;
SELECT pg_read_binary_file('/tmp/pgintro.pdf')::pdf;
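The bytea route can be sketched as follows. The table pdf_blobs and its columns are hypothetical names used only for illustration; they are not part of pgpdf. The only pgpdf feature relied on is the bytea-to-pdf cast described above.

```sql
-- Hypothetical table holding raw PDF content as bytea.
CREATE TABLE pdf_blobs (
    id      serial PRIMARY KEY,
    content bytea NOT NULL
);

-- Load the file content once (pg_read_binary_file reads from the
-- server's filesystem and needs the appropriate privileges).
INSERT INTO pdf_blobs (content)
VALUES (pg_read_binary_file('/tmp/pgintro.pdf'));

-- Cast the stored bytes to pdf when querying; no file is needed anymore.
SELECT id, content::pdf FROM pdf_blobs;
```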

The actual PDF parsing and validation are done by Poppler.

Having a pdf type allows us to define all sorts of useful SQL functions.

Check out the README.md for the list of all available functions. Below are a few highlights.

The simplest thing is extracting the text content of a PDF file from the filesystem.

SELECT '/tmp/pgintro.pdf'::pdf;

To access a specific page:

SELECT pdf_page('/tmp/pgintro.pdf', 1);

There are also functions available to access the PDF metadata. For example, to get the author’s name and the software used to create the PDF:

SELECT pdf_author('/tmp/pgintro.pdf');
SELECT pdf_creator('/tmp/pgintro.pdf');

Sometimes the creation and modification dates are useful as well, typically for reports that are generated periodically. You can get those, too:

SELECT pdf_creation('/tmp/pgintro.pdf');
SELECT pdf_modification('/tmp/pgintro.pdf');
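The metadata functions can also be combined in a single query. The sketch below assumes only the functions shown above and the text-to-pdf cast; the inline list f(path) is an illustrative stand-in for wherever your file paths come from.

```sql
-- One row of metadata per file path; add more paths to the VALUES list.
SELECT path,
       pdf_author(path::pdf)       AS author,
       pdf_creator(path::pdf)      AS creator,
       pdf_creation(path::pdf)     AS created,
       pdf_modification(path::pdf) AS modified
FROM (VALUES ('/tmp/pgintro.pdf')) AS f(path);
```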

Full-text search is naturally supported, since one can cast a pdf value to text.

SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('postgres');
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('oracle');
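For more than a handful of files, re-parsing each PDF on every search gets expensive. One possible approach, sketched here, is to store the documents in a table and materialize a tsvector column for searching. The table documents, its columns, and the index name are hypothetical; the sketch uses only the casts shown above plus standard Postgres full-text search.

```sql
-- Hypothetical table with a pdf column.
CREATE TABLE documents (
    name text PRIMARY KEY,
    doc  pdf,
    tsv  tsvector
);

INSERT INTO documents (name, doc)
VALUES ('pgintro', '/tmp/pgintro.pdf'::pdf);

-- Extract the text once and store the search vector.
UPDATE documents SET tsv = to_tsvector('english', doc::text);

-- A GIN index lets the search avoid scanning every row.
CREATE INDEX documents_tsv_idx ON documents USING gin (tsv);

SELECT name
FROM documents
WHERE tsv @@ to_tsquery('english', 'postgres');
```

The tsv column has to be refreshed whenever doc changes, for example by re-running the UPDATE.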