LLMs: The Missing Compiler for Unix Tools
Most of my data engineering work begins with a collection of scripts stitched together by a Makefile, plus a README.md that acts as a documentation logbook. The Makefile is a collection of targets that use psql, python3 -c (a Python program passed in as a string instead of a file), jq, curl, and some of the more popular GNU Core Utils, such as cat, head, join, sort, and xargs.
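To make this concrete, here’s a minimal sketch of the shape these Makefiles take; the URL, table name, and file names are all made up for illustration:

```make
# Hypothetical pipeline: fetch JSON from an API, flatten it with jq,
# and bulk-load the result into Postgres.
DB_URL ?= postgres://localhost/analytics
API    ?= https://api.example.com/v1/events

raw.json:
	curl -s "$(API)" > $@

events.csv: raw.json
	jq -r '.events[] | [.id, .ts, .name] | @csv' $< > $@

load: events.csv
	psql "$(DB_URL)" -c "\copy events FROM 'events.csv' CSV"

.PHONY: load
```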
Nothing fancy, really.
More often than not, this first iteration ends up being the final implementation, too. What starts as a quick prototype often solidifies into the actual pipeline simply because it works well enough, and there’s little incentive to rewrite it.
After years of following this pattern, I’ve built up my toolbox of go-to scripts and make targets. Whenever I join a new project, I can hit the ground running without getting bogged down by the usual bureaucracy or tooling setup. That kind of overhead is common when switching between freelancing gigs, but having a personal toolkit helps me skip straight to the work that matters.
One doesn’t have to know the advanced features of an underlying tool to be productive. I mean, I know what sed does and I can spot use cases for it. But I never managed to learn regular expressions properly—whatever “properly” means. I never became an awk expert either; I understand how it works generally, I can spot when it’s useful, and there are a handful of recipes I’ve memorized and can use. But I wouldn’t say I can program in awk like I can in Python. I often opt for a python3 -c approach, with a full 20-line Python script piped through xargs instead.
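To give a flavor of that pattern (the input file and the script are hypothetical), the kind of inline Python I mean looks like this:

```bash
# Sketch: xargs feeds each URL from urls.txt to the inline script
# as sys.argv[1], one invocation per line.
cat urls.txt | xargs -n1 python3 -c '
import sys, urllib.request

url = sys.argv[1]
with urllib.request.urlopen(url) as r:
    print(url, r.headers.get("Content-Length", "?"))
'
```

Not elegant, but it keeps me in a language where I can actually debug.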
When I want expressivity and have a Postgres instance available, I can push a lot of logic into an SQL query and run it through psql—not to query data, but to use an SQL function (e.g., generate_series to generate time ranges). There are so many data pipelines out in the wild that I’ve refactored into this shape.
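For example, something along these lines, assuming a reachable Postgres instance with default connection settings:

```bash
# Sketch: no tables involved; psql is just evaluating a function.
# Emits one date per line for the last seven days.
psql -At -c "SELECT generate_series(
    date_trunc('day', now()) - interval '6 days',
    date_trunc('day', now()),
    interval '1 day')::date"
```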
I can hear the argument about readability and obfuscation, one-liners, and so on. But in the grand scheme of things, I’ve found that if one is careful enough, diligent enough, and embraces the declarative nature of make, things can be kept tidy for years to come. I won’t get into that, though.
The main problem with this approach has always been that these tools are deceptively tricky to get right—especially their syntax. Unix tools are indeed powerful but can become obscure and intimidating for newcomers. For example, once you start combining make variables and Bash variables, things quickly get out of hand. It doesn’t take long before you’re riding the insanity express—with no brakes and no clear way off.
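A small, made-up example of where it bites: make expands $(...) before the shell ever sees the recipe, so shell variables need a doubled dollar sign:

```make
NAME := report

split:
	for f in data/*.csv; do \
	    echo "processing $$f into $(NAME)"; \
	done
```

Forget the second dollar and $f silently expands to an empty string, with no error to point you anywhere.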
However, today, LLMs have rendered scripting a commodity. What used to require a mix of tribal knowledge, geek culture, and years of experience is suddenly just a prompt away. The barrier is no longer mastering syntax—it’s simply knowing that a tool exists and roughly what it does. With that, the LLM fills in the rest, making the Unix philosophy more accessible and composable than ever.
Pointing the LLM to a specific tool can make a huge difference. Suppose your raw data exists in two tables in two different databases, and you want to “join” them. If you ask how to do it by talking about SQL and databases, it may start suggesting foreign data wrappers, Presto, or some fancy multi-database execution engine. But if you explicitly tell it to use diff, it will get the idea.
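A rough sketch of what I mean, with made-up connection strings and table names: dump a sorted key column from each database as plain text, then let diff do the comparing:

```bash
psql "$DB_A" -Atc "SELECT id FROM orders   ORDER BY id" > a.txt
psql "$DB_B" -Atc "SELECT id FROM invoices ORDER BY id" > b.txt
diff a.txt b.txt   # rows present on one side but not the other
```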
To that extent, it wouldn’t be a reach to say that LLMs can act as a powerful compiler, turning an almost-natural-language description of data workflows (albeit one expressed through a Makefile) into an executable pipeline of simple Unix tools.
Try it in your next data engineering project: describe your project scenario in a README.md, sketch a Makefile with some basic variables and targets, and ask an LLM to fill in the actual implementation.
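For instance, a skeleton like this, every name in it hypothetical, is usually enough context for the model to write the recipe bodies:

```make
# README.md describes the scenario; the LLM fills in the recipes.
SRC_URL ?= https://example.com/export.csv.gz
DB_URL  ?= postgres://localhost/warehouse

fetch:       ## download and decompress the raw export
clean_data:  ## drop malformed rows, normalize timestamps
load:        ## bulk-load the cleaned file into Postgres
```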