Are you tired of the repetitive process of downloading PDFs manually, extracting their content, and then inserting the parsed text into your PostgreSQL database? If so, you’re not alone. The traditional approach—downloading files, using external tools to extract text, and then importing it into a database—is cumbersome and time-consuming.

Fortunately, PostgreSQL is a powerful database that, with the right extensions, can automate this workflow seamlessly. By combining my pgpdf extension with Paul Ramsey’s excellent pgsql-http extension, you can fetch and parse PDFs directly inside PostgreSQL. This enables a fully automated, SQL-driven pipeline for handling PDFs, making document storage, search, and analysis effortless.

The Power of pgpdf + pgsql-http

With just a simple SQL query, you can:

  • Fetch a PDF from the web.
  • Read its binary content directly within PostgreSQL.
  • Parse and extract the text.
  • Make the extracted text available for queries.

All of this happens entirely inside PostgreSQL—no need for external scripts, manual downloads, or third-party applications.

Getting Started

To take advantage of this streamlined workflow, first install the required extensions (both must already be built and available on your PostgreSQL server):

CREATE EXTENSION pgpdf;
CREATE EXTENSION http;

Once these are installed, you can fetch, parse, and store PDFs effortlessly. Here’s a simple SQL query that does it all:

SELECT pdf_read_bytes(text_to_bytea(content))
FROM http_get('https://wiki.postgresql.org/images/e/e3/Hooks_in_postgresql.pdf');

What’s Happening Here?

  • http_get(url) – This function, provided by pgsql-http, makes an HTTP GET request to fetch the PDF from the given URL. The result includes fields such as status, content_type, and content (which holds the raw response body).
  • text_to_bytea(content) – Since http_get returns the response body as text, we need to convert it to bytea to properly handle the binary PDF format.
  • pdf_read_bytes(bytea) – This function, provided by pgpdf, takes the raw PDF bytes, parses the document, and extracts its text content.
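
Since http_get also returns the HTTP status and content type, it can be worth guarding against failed downloads before handing the bytes to the parser. A minimal sketch (the status and content_type columns come from pgsql-http's response type; the exact content_type served for this URL is an assumption):

SELECT pdf_read_bytes(text_to_bytea(content))
FROM http_get('https://wiki.postgresql.org/images/e/e3/Hooks_in_postgresql.pdf')
WHERE status = 200
  AND content_type = 'application/pdf';

If the request fails or returns something other than a PDF, the query simply yields no rows instead of erroring inside the parser.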

Practical Use Cases

By leveraging this workflow, you can:

  • Automate PDF ingestion – No need to manually download files. Just store URLs and let PostgreSQL handle fetching and parsing.
  • Enable full-text search on PDFs – Store parsed text in tsvector columns and use GIN indexes for lightning-fast search.
  • Archive and analyze documents – Store structured metadata alongside extracted text for better organization and insights.

Taking It Further

Storing Extracted Content in a Table

To keep track of parsed documents, create a table and insert extracted text:

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    url TEXT NOT NULL,
    content TEXT
);

INSERT INTO documents (url, content)
SELECT 'https://wiki.postgresql.org/images/e/e3/Hooks_in_postgresql.pdf',
       pdf_read_bytes(text_to_bytea(content))
FROM http_get('https://wiki.postgresql.org/images/e/e3/Hooks_in_postgresql.pdf');
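
If you ingest many documents, repeating the URL twice gets tedious. One option is to wrap the fetch-and-parse step in a small helper function; the name ingest_pdf here is just an illustration, not part of either extension:

CREATE FUNCTION ingest_pdf(pdf_url text) RETURNS void AS $$
    INSERT INTO documents (url, content)
    SELECT pdf_url, pdf_read_bytes(text_to_bytea(content))
    FROM http_get(pdf_url);
$$ LANGUAGE sql;

SELECT ingest_pdf('https://wiki.postgresql.org/images/e/e3/Hooks_in_postgresql.pdf');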

Searching Extracted Content

For full-text search, use tsvector and GIN indexing:

ALTER TABLE documents ADD COLUMN tsv_content tsvector;
UPDATE documents SET tsv_content = to_tsvector('english', content);
CREATE INDEX idx_documents_tsv ON documents USING GIN(tsv_content);
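
On PostgreSQL 12 and later, an alternative is to declare tsv_content as a stored generated column, so it stays in sync automatically whenever content changes, with no manual UPDATEs (the GIN index is created the same way):

ALTER TABLE documents
    ADD COLUMN tsv_content tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;

Note that the text search configuration must be spelled out explicitly ('english' here), since generated columns require immutable expressions.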

Now you can efficiently search for keywords within all stored PDFs:

SELECT url, content
FROM documents
WHERE tsv_content @@ to_tsquery('english', 'PostgreSQL');

Specifying the 'english' configuration in to_tsquery keeps the query consistent with the to_tsvector('english', …) call used to populate the column.
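
To surface the most relevant documents first, or to show a short snippet instead of the full extracted text, you can combine the match with ts_rank and ts_headline (the search phrase here is just an example):

SELECT url,
       ts_rank(tsv_content, query) AS rank,
       ts_headline('english', content, query) AS snippet
FROM documents,
     to_tsquery('english', 'PostgreSQL & hooks') AS query
WHERE tsv_content @@ query
ORDER BY rank DESC;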

Conclusion

By combining pgpdf and pgsql-http, you unlock a powerful, SQL-native workflow for handling PDFs in PostgreSQL. No more manual downloads, external parsing scripts, or cumbersome imports—just simple, elegant SQL queries that do all the work for you.

If you deal with PDFs regularly and want a seamless, automated way to store and query them in PostgreSQL, this approach is a game-changer. Try it out and transform how you handle PDFs in your database-driven applications!