pgpdf: pdf type for Postgres
In a previous post I presented pgPDF GitHub, a Postgres extension to access PDF files directly from Postgres. I received some positive feedback, so I decided to polish it a bit more.
The key update is that pdf
is now a valid Postgres type:
A varlena
object of bytes.
One can create a pdf
using simple casts from text
or bytea
.
In the former case, the input text is considered a PDF filepath.
The latter is useful if you don’t have the PDF file in your filesystem but
have already stored its content in a bytea
column.
SELECT '/tmp/pgintro.pdf'::pdf;
SELECT pg_read_binary_file('/tmp/pgintro.pdf')::pdf;
The actual PDF parsing and validation is done by poppler.
Having a pdf
type allows us to define all sorts of useful SQL functions.
Check out the README.md For the list of all available functions. Below are a few highlights.
The simplest thing is extracting the text content of a PDF file from the filesystem.
SELECT '/tmp/pgintro.pdf'::pdf;
To access a specific page:
SELECT pdf_page('/tmp/pgintro.pdf', 1);
There are also functions available to o access the pdf
metadata.
For example, to get the author’s name and PDF creator software used:
SELECT pdf_author('/tmp/pgintro.pdf');
SELECT pdf_creator('/tmp/pgintro.pdf');
Sometimes, it is also useful, for example, the creation and modification dates. Typical for reports generated periodically. You can do that, too:
SELECT pdf_creation('/tmp/pgintro.pdf');
SELECT pdf_modification('/tmp/pgintro.pdf');
Full-text search is naturally supported,
since one can cast a pdf
type into text
.
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('postgres');
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('oracle');