pgpdf: pdf type for Postgres
In a previous post I presented pgPDF (on GitHub), a Postgres extension for reading PDF files directly from Postgres. I received some positive feedback, so I decided to polish it a bit more.
The key update is that pdf is now a valid Postgres type: a varlena (variable-length) object of bytes.
One can create a pdf value with a simple cast from text or bytea.
In the former case, the input text is interpreted as the path to a PDF file.
The latter is useful if you don't have the PDF file in your filesystem but
have already stored its content in a bytea column.
SELECT '/tmp/pgintro.pdf'::pdf;
SELECT pg_read_binary_file('/tmp/pgintro.pdf')::pdf;
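Because pdf is a regular type, it can also serve as a column type. The following is a minimal sketch, assuming the extension is installed under the name pgpdf; the table pdf_docs and its column names are hypothetical and chosen only for illustration.

```sql
-- Assumption: the extension is registered under the name "pgpdf".
CREATE EXTENSION IF NOT EXISTS pgpdf;

-- Hypothetical table holding one PDF per row.
CREATE TABLE pdf_docs (
    name text PRIMARY KEY,
    doc  pdf
);

-- Load from a file path (text -> pdf cast).
INSERT INTO pdf_docs (name, doc)
VALUES ('pgintro', '/tmp/pgintro.pdf'::pdf);

-- Load from raw bytes (bytea -> pdf cast).
INSERT INTO pdf_docs (name, doc)
VALUES ('pgintro_bytes', pg_read_binary_file('/tmp/pgintro.pdf')::pdf);
```

Both rows should end up holding the same document; which route to prefer depends on whether the file is reachable from the database server or its bytes are already in the database.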
The actual PDF parsing and validation is done by poppler.
Having a pdf type allows us to define all sorts of useful SQL functions.
Check out the README.md for the list of all available functions. Below are a few highlights.
The simplest thing is extracting the text content of a PDF file from the filesystem.
SELECT '/tmp/pgintro.pdf'::pdf;
To access a specific page:
SELECT pdf_page('/tmp/pgintro.pdf', 1);
There are also functions to access the PDF metadata.
For example, to get the author's name and the software that created the PDF:
SELECT pdf_author('/tmp/pgintro.pdf');
SELECT pdf_creator('/tmp/pgintro.pdf');
Sometimes it is also useful to know the creation and modification dates, typically for reports that are generated periodically. You can get those, too:
SELECT pdf_creation('/tmp/pgintro.pdf');
SELECT pdf_modification('/tmp/pgintro.pdf');
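The metadata functions can be combined in a single query so that each file is cast to pdf only once. This is a rough sketch rather than a recommended pattern: the column aliases are made up for illustration, and the ORDER BY assumes that pdf_modification returns a sortable type such as a timestamp.

```sql
-- List the paths to inspect; add more rows to VALUES as needed.
WITH files(path) AS (
    VALUES ('/tmp/pgintro.pdf')
),
docs AS (
    -- Cast each path to pdf once and reuse the value below.
    SELECT path, path::pdf AS doc
    FROM files
)
SELECT path,
       pdf_author(doc)       AS author,
       pdf_creator(doc)      AS creator,
       pdf_creation(doc)     AS created,
       pdf_modification(doc) AS modified
FROM docs
-- Assumption: the modification date is sortable (e.g. a timestamp).
ORDER BY modified DESC;
```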
Full-text search is naturally supported,
since one can cast a pdf type into text.
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('postgres');
SELECT '/tmp/pgintro.pdf'::pdf::text @@ to_tsquery('oracle');
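In Postgres, text @@ tsquery is shorthand for to_tsvector(text) @@ tsquery, so the queries above parse the PDF and rebuild the search vector on every call. For repeated searches it may be worth storing the vector once and indexing it. The sketch below assumes the 'english' text search configuration is appropriate; the table pdf_index and the index name are hypothetical.

```sql
-- Hypothetical table that keeps one precomputed search vector per file.
CREATE TABLE pdf_index (
    path text PRIMARY KEY,
    tsv  tsvector
);

-- Extract the text via the pdf -> text cast and store its tsvector.
INSERT INTO pdf_index (path, tsv)
SELECT path, to_tsvector('english', path::pdf::text)
FROM (VALUES ('/tmp/pgintro.pdf')) AS files(path);

-- A GIN index lets @@ lookups avoid scanning every row.
CREATE INDEX pdf_index_tsv_idx ON pdf_index USING gin (tsv);

-- Find the files that mention "postgres".
SELECT path
FROM pdf_index
WHERE tsv @@ to_tsquery('english', 'postgres');
```

The trade-off is that the stored vector has to be refreshed whenever the underlying file changes.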