Florents Tselai Think, Code, Read, Sleep, Repeat

Introducing Pandas-Sets: Set-oriented Operations in Pandas

26 Dec 2018

I frequently find myself storing standard Python set objects in DataFrame columns. This usually happens when I have some kind of tags or labels column for each observation. It can also be the output of a groupby operation where the end result needs to be a list-like (or set-like) object before it's aggregated. Set operations (union, intersection, etc.) come in handy in such cases.

To tackle those scenarios, however, I end up writing code like df.tags.map(lambda x: set(x).add(elem)), which, apart from being ugly, doesn't compose into pandas-style one-liners: set.add mutates in place and returns None, so the mapped result is a column of None values rather than a new column of sets.

Ideally, I would like to be able to treat the tags column as a set-like one, so I could write code like df.tags.set.add(elem), filter like df[df.tags.set.contains(elem)], or combine like df.tags.set.union({'t1', 't2', 't3'}).
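For comparison, here is roughly what those three operations look like with plain Series.map and no extension. This is only a sketch: the sample data and variable names are made up for illustration.

```python
import pandas as pd

# Made-up sample data: one set of tags per post.
df = pd.DataFrame({
    "post": [1, 2, 3],
    "tags": [{"python", "pandas"}, {"philosophy"}, {"pandas"}],
})

elem = "data"

# Add an element to every set without mutating the originals:
# `s | {elem}` builds a new set, whereas `s.add(elem)` returns None.
with_elem = df.tags.map(lambda s: s | {elem})

# Boolean mask for membership, usable as a row filter.
has_pandas = df.tags.map(lambda s: "pandas" in s)
pandas_posts = df[has_pandas]

# Union of every row's set with a fixed set of tags.
unioned = df.tags.map(lambda s: s | {"t1", "t2"})
```

It works, but every operation needs its own lambda, which is exactly the noise an accessor is meant to hide.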

To achieve this, I wrote pandas-sets, a Pandas extension that adds set-like properties to existing Series objects, provided that they already store set objects.

You can check out the code on GitHub.

The pandas_sets package adds a .set accessor to any pandas Series object; it's like .dt for datetimes or .str for strings, but for sets.

It exposes all public methods available in the standard set.
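To give an idea of how an accessor like this can be wired up, here is a minimal sketch built on pandas' public register_series_accessor hook. It is not the actual pandas-sets code: the accessor name demo_set, the class name, and the two methods are hypothetical and chosen so they don't clash with the real .set accessor.

```python
import pandas as pd


# Hypothetical accessor name, picked to avoid clashing with the real `.set`.
@pd.api.extensions.register_series_accessor("demo_set")
class DemoSetAccessor:
    def __init__(self, series):
        # pandas passes in the Series the accessor was called on.
        self._series = series

    def contains(self, elem):
        # Element-wise membership test; returns a boolean Series.
        return self._series.map(lambda s: elem in s).astype(bool)

    def union(self, other):
        # Element-wise union; each result is a new set, originals are untouched.
        other = set(other)
        return self._series.map(lambda s: s | other)


tags = pd.Series([{"python", "pandas"}, {"philosophy"}, {"pandas"}])
mask = tags.demo_set.contains("pandas")
extended = tags.demo_set.union({"x"})
```

Once the class is registered, tags.demo_set.contains("pandas") reads much like a .str call, and the result can be used directly as a filter mask.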

Using it is pretty simple. First install with pip.

pip install pandas-sets

Then, just import the pandas_sets package and it will register a .set accessor to any Series object.

import pandas_sets
import pandas as pd

df = pd.DataFrame({'post': [1, 2, 3, 4],
                   'tags': [{'python', 'pandas'},
                            {'philosophy', 'strategy'},
                            {'scikit-learn'},
                            {'pandas'}]})

# Keep only the posts tagged 'pandas'
pandas_posts = df[df.tags.set.contains('pandas')]

# Add more tags to the filtered posts
pandas_posts.tags.set.update({'data', 'analysis'})

The implementation is very primitive for now and draws heavily from pandas' core StringMethods implementation.

Next steps include: further testing with edge-case scenarios, adding detailed docstrings and more fine-grained NA handling.
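As one illustration of what NA handling could look like, here is a sketch of a membership test that reports a configurable value for rows that don't hold a set, loosely modelled on the na parameter of Series.str.contains. The function name and its na argument are assumptions for this example, not part of pandas-sets.

```python
import pandas as pd


def contains(series, elem, na=False):
    """Element-wise membership test that tolerates missing rows.

    `na` is the value reported for rows that do not hold a set
    (hypothetical parameter, for illustration only).
    """
    def check(value):
        if isinstance(value, (set, frozenset)):
            return elem in value
        return na

    return series.map(check)


# One row has no tags at all.
tags = pd.Series([{"pandas"}, None, {"python"}], dtype=object)
mask = contains(tags, "pandas")
```

With the default na=False, the missing row simply fails the test, so the mask can be used for filtering without an extra fillna step.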

Some day it may be incorporated into pandas core itself.


Tags #projects #code #python #pandas