LLMs: The Missing Compiler for Unix Tools
Most of my data engineering work begins with a collection of scripts stitched together by a Makefile, plus a README.md that acts as a documentation logbook. The Makefile is a collection of targets that use psql, python3 -c (a Python program passed in as a string instead of a file), jq, curl, and some of the more popular GNU Core Utils, such as cat, head, join, sort, and xargs.
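To make this concrete, here’s a minimal sketch of the shape these Makefiles take; the URL, table name, and file names are all made up for illustration:

```make
# Hypothetical pipeline: fetch JSON from an API, flatten it with jq,
# and bulk-load the result into Postgres.
DB_URL ?= postgres://localhost/analytics
API    ?= https://api.example.com/v1/events

raw.json:
	curl -s "$(API)" > $@

events.csv: raw.json
	jq -r '.events[] | [.id, .ts, .name] | @csv' $< > $@

load: events.csv
	psql "$(DB_URL)" -c "\copy events FROM 'events.csv' CSV"

.PHONY: load
```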
Nothing fancy, really.
More often than not, this first iteration ends up being the final implementation, too. What starts as a quick prototype often solidifies into the actual pipeline simply because it works well enough, and there’s little incentive to rewrite it.
After years of following this pattern, I’ve built up my toolbox of go-to scripts and make targets. Whenever I join a new project, I can hit the ground running without getting bogged down by the usual bureaucracy or tooling setup. That kind of overhead is common when switching between freelancing gigs, but having a personal toolkit helps me skip straight to the work that matters.
One doesn’t have to know the advanced features of an underlying tool to be productive. I mean, I know what sed does and I can spot use cases for it. But I never managed to learn regular expressions properly—whatever “properly” means. I never became an awk expert either; I understand how it works generally, I can spot when it’s useful, and there are a handful of recipes I’ve memorized and can use. But I wouldn’t say I can program in awk like I can in Python. I often opt for a python3 -c approach, with a full 20-line Python script piped through xargs instead.
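To give a flavor of that pattern (the input file and the script are hypothetical), the kind of inline Python I mean looks like this:

```bash
# Sketch: xargs feeds each URL from urls.txt to the inline script
# as sys.argv[1], one invocation per line.
cat urls.txt | xargs -n1 python3 -c '
import sys, urllib.request

url = sys.argv[1]
with urllib.request.urlopen(url) as r:
    print(url, r.headers.get("Content-Length", "?"))
'
```

Not elegant, but it keeps me in a language where I can actually debug.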
When I want expressivity and have a Postgres instance available, I can push a lot of logic into an SQL query and run it through psql—not to query data, but to use an SQL function (e.g., generate_series to generate time ranges). There are so many data pipelines out in the wild that I’ve refactored into this shape.
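For example, something along these lines, assuming a reachable Postgres instance with default connection settings:

```bash
# Sketch: no tables involved; psql is just evaluating a function.
# Emits one date per line for the last seven days.
psql -At -c "SELECT generate_series(
    date_trunc('day', now()) - interval '6 days',
    date_trunc('day', now()),
    interval '1 day')::date"
```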
I can hear the argument about readability and obfuscation, one-liners, and so on. But in the grand scheme of things, I’ve found that if one is careful enough, diligent enough, and embraces the declarative nature of make, things can be kept tidy for years to come. I won’t get into that, though.
The main problem with this approach has always been that these tools are deceptively tricky to get right—especially their syntax. Unix tools are indeed powerful but can become obscure and intimidating for newcomers. For example, once you start combining make variables and Bash variables, things quickly get out of hand. It doesn’t take long before you’re riding the insanity express—with no brakes and no clear way off.
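A small, made-up example of where it bites: make expands $(...) before the shell ever sees the recipe, so shell variables need a doubled dollar sign:

```make
NAME := report

split:
	for f in data/*.csv; do \
	    echo "processing $$f into $(NAME)"; \
	done
```

Forget the second dollar and $f silently expands to an empty string, with no error to point you anywhere.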
However, today, LLMs have rendered scripting a commodity. What used to require a mix of tribal knowledge, geek culture, and years of experience is suddenly just a prompt away. The barrier is no longer mastering syntax—it’s simply knowing that a tool exists and roughly what it does. With that, the LLM fills in the rest, making the Unix philosophy more accessible and composable than ever.
Pointing the LLM to a specific tool can make a huge difference. Suppose your raw data exists in two tables in two different databases, and you want to “join” them. If you ask how to do it by talking about SQL and databases, it may start suggesting foreign data wrappers, Presto, or some fancy multi-database execution engine. But if you explicitly tell it to use diff, it will get the idea.
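A rough sketch of what I mean, with made-up connection strings and table names: dump a sorted key column from each database as plain text, then let diff do the comparing:

```bash
psql "$DB_A" -Atc "SELECT id FROM orders   ORDER BY id" > a.txt
psql "$DB_B" -Atc "SELECT id FROM invoices ORDER BY id" > b.txt
diff a.txt b.txt   # rows present on one side but not the other
```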
To that extent, it wouldn’t be a reach to say that LLMs can act as a powerful compiler, turning an almost-natural-language description of data workflows (albeit one expressed through a Makefile) into an executable pipeline of simple Unix tools.
Try it in your next data engineering project: describe your project scenario in a README.md, sketch a Makefile with some basic variables and targets, and ask an LLM to fill in the actual implementation.
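For instance, a skeleton like this, every name in it hypothetical, is usually enough context for the model to write the recipe bodies:

```make
# README.md describes the scenario; the LLM fills in the recipes.
SRC_URL ?= https://example.com/export.csv.gz
DB_URL  ?= postgres://localhost/warehouse

fetch:       ## download and decompress the raw export
clean_data:  ## drop malformed rows, normalize timestamps
load:        ## bulk-load the cleaned file into Postgres
```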