26 Dec 2018
I frequently find myself storing standard Python set objects in DataFrame columns.
This usually happens when I have some kind of a tags or labels column for each observation.
It can also be the output of a groupby operation where the end result needs to be a list-like (or set-like) object before it's aggregated.
Using set operations (union, intersection etc.) can come in handy in such cases.
To tackle those scenarios, however, I end up writing code like df.tags.map(lambda x: set(x) | {elem}),
which, apart from being ugly, also doesn't allow for pandas-like, immutability-based composition (aka one-liners).
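For reference, here is a minimal sketch of that manual approach using plain pandas only. The long_df frame and its values are made up for illustration. It first builds a set-valued column with groupby, then does an "add" and a membership filter by hand with map.

```python
import pandas as pd

# Hypothetical long-format data: one row per (post, tag) pair.
long_df = pd.DataFrame({
    "post": [1, 1, 2, 2, 3, 4],
    "tag": ["python", "pandas", "philosophy", "strategy", "scikit-learn", "pandas"],
})

# groupby + agg(set) collapses each post's tags into one set per row.
df = long_df.groupby("post")["tag"].agg(set).reset_index(name="tags")

elem = "data"

# Non-mutating "add": build a new set per row. Calling set.add inside the
# lambda would mutate in place and return None for every row.
with_elem = df.tags.map(lambda s: s | {elem})

# Membership filter, written by hand.
pandas_posts = df[df.tags.map(lambda s: "pandas" in s)]
```

Every operation needs its own lambda, which is the repetition a dedicated accessor is meant to remove.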
Ideally, I would like to be able to treat the tags column as a set-like one,
so I could write code like df.tags.set.add(elem),
filter like df[df.tags.set.contains(elem)],
or combine like df.tags.set.union({'t1', 't2', 't3'}).
To achieve this, I wrote pandas-sets, a pandas extension that adds set-like properties to existing Series objects, provided that they already store set objects.
You can check out the code on GitHub.
The pandas_sets package adds a .set accessor to any pandas Series object; it's like .dt for datetime or .str for string, but for set.
It exposes all public methods available in the standard set.
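To give a feel for how such an accessor can be wired up, here is a minimal sketch built on pandas' public register_series_accessor decorator. This is not the pandas-sets implementation: the accessor name demo_set, the DemoSetAccessor class and the three methods are hypothetical, chosen only to illustrate the mechanism.

```python
import pandas as pd


# Hypothetical accessor name "demo_set", chosen so it cannot clash with
# the real ".set" accessor that pandas_sets registers.
@pd.api.extensions.register_series_accessor("demo_set")
class DemoSetAccessor:
    def __init__(self, series):
        # pandas passes in the Series the accessor was called on.
        self._series = series

    def contains(self, elem):
        # Element-wise membership test, returning a boolean Series.
        return self._series.map(lambda s: elem in s)

    def union(self, other):
        # Element-wise union, returning a new Series of new sets.
        return self._series.map(lambda s: s | set(other))

    def len(self):
        # Number of elements in each row's set.
        return self._series.map(len)


tags = pd.Series([{"python", "pandas"}, {"philosophy"}, {"pandas"}])
mask = tags.demo_set.contains("pandas")
```

With this in place, tags[mask] selects the rows whose set contains "pandas", and tags.demo_set.union({"data"}) returns new sets without touching the original ones.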
Using it is pretty simple. First install with pip.
pip install pandas-sets
Then, just import the pandas_sets package and it will register a .set accessor to any Series object.
import pandas as pd
import pandas_sets  # importing the package registers the .set accessor

df = pd.DataFrame({
    'post': [1, 2, 3, 4],
    'tags': [{'python', 'pandas'},
             {'philosophy', 'strategy'},
             {'scikit-learn'},
             {'pandas'}],
})

# keep only the posts tagged 'pandas'
pandas_posts = df[df.tags.set.contains('pandas')]

pandas_posts.tags.set.add('data')
pandas_posts.tags.set.update({'data', 'analysis'})
The implementation is very primitive for now and draws heavily from pandas' core StringMethods implementation.
Next steps include further testing with edge-case scenarios, adding detailed docstrings, and more fine-grained NA handling.
Some day it may be incorporated into pandas core itself.
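To make the NA-handling point concrete, here is a rough sketch of what NA-aware behaviour could look like in plain pandas. The contains helper and its na parameter are hypothetical and say nothing about how pandas-sets behaves today. The na argument is loosely modelled on the one that Series.str.contains accepts.

```python
import pandas as pd


def contains(series, elem, na=pd.NA):
    """Hypothetical helper: element-wise membership test that skips non-set cells."""
    def check(cell):
        if isinstance(cell, (set, frozenset)):
            return elem in cell
        return na  # missing or non-set cell: use the caller's fill value
    return series.map(check)


tags = pd.Series([{"python", "pandas"}, None, {"scikit-learn"}])

raw = contains(tags, "pandas")             # the missing cell stays missing
mask = contains(tags, "pandas", na=False)  # the missing cell counts as "no match"
matches = tags[mask]
```

Passing na=False gives a clean boolean mask for filtering, while the default keeps missing rows visibly missing in the result.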