07 Nov 2019
This made it to the Hacker News front-page too: some very interesting comments there
A specter is haunting Data Science—the specter of SQL. All the powers of new Data Management have entered into a holy alliance to exorcise this specter; Data Scientists and engineers, managers and developers, hipsters and disruptors.
Especially among the young breeds of data practitioners, there is a growing consensus that SQL is not “cool”, not good enough or even worse “SQL is not professional enough and real data scientists should code”. My own experience, however, has taught me otherwise. Whether it is at the first stages of the pipeline like data collection and cleaning or the latter ones like feature engineering and report generation, I have come to appreciate the power and versatility of SQL and the effectiveness of RDBMS.
In fact, it is not only the tool nature of SQL and RDBMS that appeals to me but rather the whole “SQL tradition” that goes with and encompasses them. It is the level of pragmatism that established and propelled data management since Codd introduced the relational model in the 70s.
In this essay, I collected some of my thoughts on the topic of SQL and how it fits into modern data practice. It was motivated after a wave of disbelief I faced when at a presentation last week I claimed that modern data projects could benefit immensely by adopting SQL early on and sticking with it: a “start with SQLite, scale with Postgres” philosophy.
Every data scientist occasionally writes a piece of code that “flattens” a dictionary. This nested structure either comes in as a response from an API endpoint or as a query result from a NoSQL database like MongoDB or ElasticSearch. More junior engineers even couple their discovery with a “look ma! no brackets!” enthusiasm.
The schemaless nature of modern NoSQL systems virtually eliminated the need to have a well-thought data model design - with “virtually” being the operative word. You could just dump the data into a flexible “collection” of “documents” instead of a stricter “table” of “rows” and use a powerful query language to extract what you want later. Why? Because you cannot possibly anticipate how would the requirements change later on, and changing the schema in an RDBMS can become tricky. It sounds like a fair point; definitely is so in theory.
The more I work with existing NoSQL deployments however, the more I believe that their schemaless nature has become an excuse for sloppiness and unwillingness to dwell on a project’s data model beforehand. I see too many applications that handle “plain old” transactional data relying on MongoDB as their primary database from time-zero. Gradually then, they end up having to “unnest fields of a particular collection”, “create intermediate collections that act as anchors” or even worse commissioning projects that offer “a tabular layer on top of their NoSQL database” either by a cronjob that transforms into the data into a tabular format to a relational database, or even introducing a “Python library that everyone should use in order to get their data in a tabular format because that is why Scikit-Learn expects”.
This path means that there is yet another database system that should be managed, yet another project under version control and yet some more accidental complexity introduced by the new moving pieces in the data pipeline. And of course, degraded data quality and performance are to be expected. On the one hand, because ensuring ACID-like compliance - especially durability - across the whole pipeline is too hard to achieve and on the other hand, too much data is copied around and through the disk and network IO or worse operated on with “in-memory join-like operations with Pandas”, because Mongo’s lookups and aggregations seem so awkward and difficult to get right.
Here is the elephant-in-the-room-question that too many people think during the daily SCRUMs but they fear to raise: “Why didn’t we use an RDBMS in the first place? “. There could be too many reasons for that decision - both technical and organizational, I will just focus on the “we were not sure about the requirements that early on”.
Most major RDBMS today do offer some kind of schemaless support, typically with a JSON data type with rich query semantics. These implementations, having already gone through some major releases can confidently serve even the most demanding of cases. One can now model the “known” part of his data model in a typical relational manner and dump his “raw and unstructured” data into JSON columns. No need to “denormalize all the things” just because some element of the domain is “unstructured”. Why should one have to sacrifice joins on the “customers” table and have to reside to lookups by customer_id which is a nested field under the “purchases” collection?
The good thing with this approach is that one can have a single database for both their structured and unstructured data without sacrificing ACID-compliance. Data integrity can be ensured both at the referential level (i.e. foreign keys), data quality can be tightly managed with constraints, indices, and triggers. Performance can be boosted at will, given that one is not reluctant to just think and look at the documentation. My favorite tool in that regard is none other than “indices on expressions”. Plus, there are numerous index types one can choose specifically tailored to their needs.
For extreme cases where data is mostly read-heavy one can go even further. In a recent project, for example, I had to collect data from an API (~100M of rich JSON documents) and make them easier to analyze by typical business intelligence analysts and Excel users. My implementation was simple yet powerful. Data was crawled and dumped periodically in a “raw data table” as JSON documents and all further analysis was done through SQL views (materialized or not). It took some time to set up the basic views manually but after some guidance, even the less technical-adept of BI analysts could define their own views or query existing ones. It was tricky to achieve great performance at the first iterations; but later, the indices required were made apparent and once defined, things run smoothly. Postgres has even support for indices on materialized views - see here.
ETL is the cash burn machine of modern data-driven endeavors and the necessary evil in every data scientist’s daily life. And yet it is probably the less well thought of piece in the data pipeline. Countless machine learning engineers start their model selection work with the hopes of using random forests and support vectors, only to realize later that not enough clean data was available and they have to use simple regression.
My main objection with data cleaning and transformation today is its “decentralized nature”. Each data scientist does their own cleaning procedure in their own client code with “df.dropna()” being thrown around like a ragdoll and before you know it each analyst have their own distorted - almost subjective - view of the data, It goes without saying that maintaining and tracking these so-called “cleaning processes” is impossible, as they are split in libraries under version control, notebooks usually not under version control and of course rules - usually residing in the database.
In my view, the data pipeline would be smoother and cleaner if more data cleaning processes were pushed to the database level. First of all: types. Modern programming paradigms and languages have given type definitions a bad name (I think Java is to blame for that, but that’s another story). Setting the database column, for example, to “timestamp with timezone” is the best way to enforce a “fail fast, fairly early” policy so that bad data is rejected upon insertion and does not propagate further into the pipeline.
Modern data types are usually accompanied with rich relevant algebra allowing for fine-grained operations without sacrificing power and expressivity - check out, for example, the “date algebra” or “geo algebra” one can do with Postgres. All these are possible with tools like Pandas of course, but why to pay the performance tax and most importantly: why pollute your client code with lambda functions and one-liners to extract simple features like “day of the week” or “hour of the day”. Leave these to the database and invest more brainpower and Python keystrokes on the more sophisticated parts of the analysis.
Let’s focus on another typical feature of relational databases: triggers and stored procedures. They both can be a significant tool in one’s data cleaning & transformation toolbox. In the project I mentioned above, each document had a set of fields rich with business information (IDs mainly) that could offer a clear relational structure to the data model. Whenever one document was inserted as a JSON payload, a piece of code had to run, extract the relevant fields and insert them as rows to other tables, by checking integrity constraints, adding foreign keys and so on. During the first iterations, I used jq and standard Python JSON processing to perform these CRUD operations. They worked just fine until a) performance took a hit and b) rules become increasingly more complex and verbosity was too much to bear. It was then that I decided to write a database trigger function, being called on every insert of the JSON documents. The function would extract each interesting field and insert appropriate rows in other tables accordingly. All within the same transaction! No need to move data up and down the data pipeline, parse dump JSON payloads and battle with encoding mismatches. The trigger function itself was also written in the “schema.sql” file which meant that everything was clear and documented along with the other relevant schema information, thus easier to track and manage.
Database server programming used to be quite difficult to get used to, due to the fact that one has to write procedural code in an inherently declarative environment. Today however things have improved dramatically: syntax is sweeter and one can even use procedural languages to write their trigger and stored procedures functions. With Postgres, one can even write Python and Perl code in the database!
Suppose you are a data scientist: Chances are your Jupyter notebook has a pretty standard format. The first cells are used to build a DataFrame populated with data from a CSV file or a database query. Pandas pros can even do this in one single line as they know that the DataFrame constructor provides a lot of arguments. Typically this DataFrame has a few columns at first. And then “feature engineering” begins and creative-constructive chaos ensues: more columns are added subsequently to the dataset and it grows horizontally. Novices will be tempted to write for-loops to add more columns; if you are a more seasoned professional though you will squeeze some more juice from your machine and use vectorized operations instead. Practically this means that you use one-liners sequentially to define your columns as there’s probably already an implemented function you’re just calling and applying on each cell. You may still be thinking procedurally in your head (that for-loop temptation is still there) but in essence, you are doing declarative programming. You are not defining how your independent variable is calculated but rather what is its meaning: this preference for “what” instead of “how” is the essence of declarative programming.
Modern data science is all about tailoring that dataset (the X matrix) to your model’s needs before you feed the data to it. If this process is not smooth and fast enough, you risk spending more mental capacity on building it yourself piece by piece rather than thinking of creative ways to add more features, describe them in a declarative way and iterate quickly. That is what SQL is good for and this is why it has been so successful for decades.
My experience has shown that the more features I create at the query level, the more flexibility I have in experimenting with different feature vectors and the quicker the model selection and evaluation becomes. When you are writing a query, your database becomes a canvas you can draw beautiful models on. No need to jump up and down between disk and memory - or database and Pandas if you prefer. You can freely combine data from different tables, do simple or complex operations between various columns and let the query optimizer do the heavy lifting of figuring out the best way to create the dataset for you!
SQL and relational databases have come a long way and nowadays offer almost any function a data scientist could ask. I mentioned above things like “date algebra” and “geo algebra”. Take text processing and free text search as other examples. It used to be the case that to have decent text processing and search you had to use NLTK or ElasticSearch. Yet another dependency, yet another database. Postgres (even SQLite and other major relational databases) offer some text manipulation functions and free-text search functionality that are just good enough for most applications. Does this eliminate the need for NLTK or ElasticSearch? Absolutely not. But why commit on such complexities upfront when you could test the waters first? Better yet, save your NLTK-fu for NLP-based features more sophisticated than “description text length”. Deploy your ElasticSearch cluster only if you need something more complex than “posts that contain at least two of these keywords my marketing guys sent me”.
Even collaboration becomes easier as different people can extend and build upon the same database views that contain the most fundamental of features - columns.
If you are lucky enough that your organization has a system deployed like PrestoDB, you can take your feature engineering game at whole other level, as you have ready-made SQL layer on top of all your potential data sources.
There are a couple of caveats in leveraging SQL however. First of all, due to its declarative nature, SQL will almost always give you results, but they just may not be what you’re asking for. SQL requires meticulousness and care as debugging is quite difficult to perform. You cannot print “I’m in” to check if there's a wrong loop condition and so on. Virtually, the only debugging you can do is to check and reverse-engineer the execution plan.
On another more “cultural” aspect, one thing I have noticed is that best practices and concepts like “clean code” and “maintainability” are not so prevalent in the SQL world. I could attribute this to many aspects but I will just underline the fact that too many “business people” use SQL. They see it as an “ad hoc tool of obtaining data” which it is correct and pragmatic as an approach, but we should try and steer them to approaching SQL as a codebase which will be used by others and should be leveraged as a tool for communicating both with the database and with fellow programmers.
Relational databases usually make more sense financially too. Distributed systems like MongoDB and ElasticSearch are money-hungry beasts and can kill your technology and human resources budget; unless you are absolutely certain and have run the numbers and decided that they do really make sense for your case.
In one project I had to replace a 3-node managed MongoDB cluster with only 40GB of disk space with a Postgres instance with 8GB of RAM and 1TB of disk space at one-fifth of the monthly cost. The performance was significantly better with the default configuration even and stability improved too. As a rule of thumb, vertical scalability is in most cases more economical than the horizontal one.
The keyword in the anecdote above is the “managed” part. With NoSQL databases, it is famously hard to get their deployment right. In many cases, you have to hire more manpower to make sure they are running 24-7 until a vendor comes in and promises to take this weight off your shoulders. Their pricing looks cheap at first but the cost can double or triple if you require marginally more disk space, or your indices need more memory to speed-up search results. In such cases, you decide to hire a data engineer like myself to make sure that your beast is tamed, works fine, is stable and really helps your business.
Am I claiming that NoSQL databases are a mistake or that they cannot help? Of course not. I am just claiming that their deployment and management can be needlessly complex and over-demanding for most companies - especially with transactional and read-heavy data.
Performance and stability with relational databases can be better out of the box. When it is not, you can browse the documentation and tweak the necessary configuration parameters. “So you can with NoSQL,” you say. True, but have you tried optimizing your MongoDB or ElasticSearch cluster at the deployment level? And I am talking about smarter things, beyond the “let’s add more Java heap size”. It’s not easy. In fact, I argue that is too complex in most cases. There are so many moving pieces and tradeoffs too hard to get right.
On the other hand, if you check Postgres’ configuration file, most of the parameters are straightforward and tradeoffs not so hard to spot. Restarting or reloading the database without significant downtime is usually smooth as it well-integrated and tested with process managers like systemd or supervisor. Such iterations are no easy feat for distributed systems as too many things can go wrong and leave your cluster in a “red state”.
PS: If you think your data pipeline could use some improvement, feel free to send me an e-mail to discuss