
PyPocketExplore: Collecting, Exploring and Predicting Popular Items on Pocket

16 Aug 2017

First off, I just want to mention that I love Pocket and everything about it. I do consider myself a Pocket power-user as well; I have all sorts of scripts and cron jobs that fetch items from the web and add them to my Pocket.

Because of its simplicity and quality, a lot of people save a lot of pages there, making it a unique place to discover high-quality content from around the web through its recommendations feature. It also means that Pocket can internally map each URL to a simple but important metric: its saves count.

To make this data available, I built a tool that "exposes" it by scraping content from Pocket Explore. Using this tool, I was able to download >300k items (web pages) along with their saves count. Needless to say, a dataset like this can be extremely useful.

In this post:

  • I present the PyPocketExplore Python package.
  • I do some basic exploration of a snapshot of the data with Pandas.
  • I train a simple predictive model that tries to predict how popular (in Pocket terms) a web page will be.

PyPocketExplore is a CLI-based and web-based API to access Pocket Explore data. It can be used to collect data about the most popular Pocket items for different topics.

An example usage would be crawling the data and using it as a training set to predict the number of Pocket saves for a web page.

The easiest way to install the package is through PyPI. This should get you up and running pretty quickly.

$ pip install PyPocketExplore

The CLI has two modes: topic and batch.

With the first one (pypocketexplore topic) you can download items from specific topics and output them to a nicely formatted JSON file.

Usage: pypocketexplore topic [OPTIONS] [LABEL]...

  Download items for specific topics

Options:
  --limit INTEGER  Limit items to download
  --out TEXT       JSON output filepath
  --nlp            If set, also downloads the page and applies NLP (through
                   NLTK)

For example, this command

$ pypocketexplore topic python data sex books --nlp --out life_topics.json

will go through the corresponding pages: https://getpocket.com/explore/python, https://getpocket.com/explore/data, https://getpocket.com/explore/sex, https://getpocket.com/explore/books one-by-one and then:

  • scrape and extract the immediately available data for each item (item_id, title, saves count, excerpt and url)
  • run each item's url through the awesome Newspaper library (in parallel)
  • apply NLP to each item's text
  • save the results to life_topics.json

In the end you'll have a rich dataset full of text to play with and, of course, a popularity metric - pretty cool to experiment with. You can check out a sample here (e-mail me if you want the full dataset).
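The --nlp enrichment is essentially Newspaper's standard pipeline. Roughly, the per-item step looks like the following minimal sketch (not the package's actual code; enrich_item is a hypothetical helper, and it assumes newspaper3k and nltk are installed):

# Minimal sketch of the per-item enrichment step (Newspaper's .nlp() needs
# NLTK's 'punkt' tokenizer data to be downloaded).
from newspaper import Article

def enrich_item(item):
    """Download an item's page, parse it and apply Newspaper's NLP step."""
    article = Article(item['url'])
    article.download()
    article.parse()
    article.nlp()  # fills article.keywords and article.summary
    item['article'] = {
        'title': article.title,
        'text': article.text,
        'authors': article.authors,
        'keywords': article.keywords,
        'summary': article.summary,
    }
    return item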

Wondering how it works? It's simple enough.

Each topic on Pocket Explore has its own nicely-formatted page; so nicely formatted, in fact, that it was too tempting not to scrape. Each topic page also lists a set of related topics, which one can crawl through pretty easily. For example, after scraping https://getpocket.com/explore/python one can then scrape the related topics: programming, javascript, google, windows, java, linux, data science, python 3, developer.

This essentially means that one can crawl through the whole graph of topics by following the related topics as edges.
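In code terms, a breadth-first walk over that graph could look roughly like this (a sketch only: the requests/BeautifulSoup calls are real, but the CSS selector for related-topic links is a guess, not the one PyPocketExplore actually uses):

# Rough sketch of a breadth-first crawl over the Pocket Explore topic graph.
from collections import deque

import requests
from bs4 import BeautifulSoup

def crawl_topics(seed_topics, max_topics=100):
    seen, queue = set(seed_topics), deque(seed_topics)
    while queue and len(seen) < max_topics:
        topic = queue.popleft()
        resp = requests.get('https://getpocket.com/explore/{}'.format(topic))
        soup = BeautifulSoup(resp.text, 'html.parser')
        # Hypothetical selector: any link pointing to another /explore/ page
        for a in soup.select('a[href*="/explore/"]'):
            related = a['href'].rstrip('/').split('/')[-1]
            if related and related not in seen:
                seen.add(related)
                queue.append(related)
    return seen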

To do this, one of course needs a set of seed topics to initiate the crawling process. To get these seeds, the pypocketexplore batch mode fetches the taxonomy labels provided by IBM Watson and then walks through the graph. I figured Pocket uses IBM Watson to label its items, so this kind of reverse-engineering makes sense. (Sorry, Pocket guys.)

Usage: pypocketexplore batch [OPTIONS]

  Download items for all topics recursively.  USE WITH CAUTION!

Options:
  --n INTEGER      Max number of total items to download
  --limit INTEGER  Limit items to download per topic
  --out TEXT       JSON output filepath
  --nlp            If set, also downloads the page and applies NLP (through
                   NLTK)
  --mongo TEXT     Mongo DB URI to save items
  --help           Show this message and exit.

CAUTION This mode with all goodies enabled will take a few days to run and will collect around 300k unique items across 8k topics. I have tried to space out the requests to Pocket's servers and to handle rate limit errors, but one can never be sure with such things.
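For what it's worth, the spacing/back-off idea amounts to something like this generic sketch (not the package's actual retry logic):

# Generic sketch: space out requests and back off exponentially on HTTP 429.
import time

import requests

def polite_get(url, delay=2, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.get(url)
        if resp.status_code == 429:      # rate limited: wait longer, then retry
            time.sleep(delay * 2 ** attempt)
            continue
        time.sleep(delay)                # pause between successive requests
        return resp
    raise RuntimeError('Rate limit retries exhausted for {}'.format(url))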

To have access to a standalone web API you need to clone the repo locally first.

$ git clone git@github.com:Florents-Tselai/PyPocketExplore.git
$ cd PyPocketExplore
$ pip install -r requirements.txt

To run the API application, use the flask command, just as in the Flask Quickstart:

$ cd PyPocketExplore
$ export FLASK_APP=./PyPocketExplore/pypocketexplore/api/api.py
$ export FLASK_DEBUG=1 ## if you run in debug mode.
$ flask run
 * Running on http://localhost:5000/
  • GET /api/topic/{topic} - Get topic data

Example topics: python, finance, business and more

Example GET /api/topic/python

Response

[
    {
        "excerpt": "For part 1, see here. All the software written for this project is in Python. I’m not an expert python programmer, far from it but the huge number of available libraries and the fact that I can make some sense of it all without having spent a lifetime in Python made this a fairly obvious choice.",
        "image": "https://d33ypg4xwx0n86.cloudfront.net/direct?"url"=https%3A%2F%2Fjacquesmattheij.com%2Fusb-microscope.jpg&resize=w750",
        "item_id": "1731527024",
        "saves_count": 223,
        "title": "Sorting 2 Tons of Lego, The software Side · Jacques Mattheij",
        "topic": "python",
        "url": "https://jacquesmattheij.com/sorting-lego-the-software-side"
    },

    {
        "excerpt": "There are lots of free resources for learning Python available now. I wrote about some of them way back in 2013, but there’s even more now then there was then! In this article, I want to share these resources with you.",
        "image": "https://d33ypg4xwx0n86.cloudfront.net/direct?"url"=https%3A%2F%2Fdz2cdn1.dzone.com%2Fstorage%2Farticle-thumb%2F5158392-thumb.jpg&resize=w750",
        "item_id": "1727350036",
        "saves_count": 59,
        "title": "Free Python Resources",
        "topic": "python",
        "url": "https://dzone.com/articles/free-python-resources"
    },

    {
        "excerpt": "A surprisingly versatile Swiss Army knife — with very long blades!TL;DRWe (an investment bank in the Eurozone) are deploying Jupyter and the Python scientific stack in a corporate environment to provide employees and contractors with an interactive computing environment with to help them leve",
        "image": "https://d33ypg4xwx0n86.cloudfront.net/direct?"url"=https%3A%2F%2Fcdn-"image"s-1.medium.com%2Fmax%2F1600%2F1%2AmeN9gfB_nuwmGGwLQzhVQA.png&resize=w750",
        "item_id": "1726489646",
        "saves_count": 41,
        "title": "Jupyter & Python in the corporate LAN",
        "topic": "python",
        "url": "https://medium.com/@olivier.borderies/jupyter-python-in-the-corporate-lan-109e2ffde897"
    },
    ...
]
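From Python, fetching a topic through this endpoint is a one-liner with requests (assuming the API is running locally on port 5000, as above):

import requests

# Grab the items for the "python" topic from the local PyPocketExplore API
items = requests.get('http://localhost:5000/api/topic/python').json()
print(len(items), items[0]['title'], items[0]['saves_count'])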

Let's walk through a case:

$ pypocketexplore batch

YOU SHOULD NOT RUN THIS LIGHT-HEARTEDLY

(... many hours later ...)

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import json
plt.style.use('fivethirtyeight')

%matplotlib inline
In [2]:
items = pd.DataFrame(json.load(open('./topics.json')))
items['saves_count_datetime'] = pd.to_datetime(items['saves_count_datetime'], unit='s')
items.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 562325 entries, 0 to 562324
Data columns (total 9 columns):
article                 556046 non-null object
excerpt                 562325 non-null object
image                   539239 non-null object
item_id                 562325 non-null object
saves_count             562325 non-null int64
saves_count_datetime    562325 non-null datetime64[ns]
title                   562325 non-null object
topic                   562325 non-null object
url                     562325 non-null object
dtypes: datetime64[ns](1), int64(1), object(7)
memory usage: 38.6+ MB
In [3]:
print("We have {} unique items and {} unique topics".format(items.item_id.nunique(), items.topic.nunique()))
We have 332661 unique items and 8576 unique topics

As an item (identified by item_id) may have been scraped via multiple topics, we keep only the latest snapshot (max saves_count_datetime). We do, however, want to keep track of these multiple topics, as they may be useful for feature engineering; for example, an item matched to multiple topics may appear more often in recommendations and thus get saved more frequently, and so on.

To handle this we create an index that maps each item_id to a set of topics.

In [4]:
gb = items.groupby('item_id', as_index=True)
topics_index = gb.agg({'topic': lambda labels: set(labels)})
topics_index = topics_index.rename(columns={'topic': 'topics'})
topics_index.sample(n=5)
Out[4]:
topics
item_id
1676127241 {speedmaster, omega speedmaster}
902764933 {snack}
21252824 {puzzles}
379180236 {prepping}
389094029 {sean rad, humane society of, converse, david k.}

We now join this index to our DataFrame while keeping only the latest instance (based on saves_count_datetime) of each item. Since that operation leaves no duplicate item_ids, we can use item_id as the index.

In [5]:
items = items.sort_values('saves_count_datetime', ascending=True)
gb = items.groupby('item_id', as_index=True)
data = gb.nth(-1).join(topics_index)

Let's have a look at the important columns

In [6]:
data[['title','saves_count_datetime', 'topics', 'saves_count']].sample(n=5)
Out[6]:
title saves_count_datetime topics saves_count
item_id
1209414455 20 Best Kodi Addons for 2016: Updated, working... 2017-07-22 05:49:46.380750 {kodi} 484
1597282913 Kevin Feige Addresses Marvel Movie Villain Cri... 2017-07-23 11:28:02.375134 {kevin feige} 20
15271194 Domain Models: Employing the Domain Model Pattern 2017-07-21 16:34:38.523443 {martin fowler} 293
845346439 What Really Is The Best Headline Length? 2017-07-22 03:33:49.808923 {headlines} 160
1650149196 Intel's $15 billion purchase of Mobileye shake... 2017-07-21 08:09:08.763839 {east jerusalem} 169

We can now also drop the obsolete topic column

In [7]:
data = data.drop('topic', axis=1)

Let's take a look at the DataFrame we have now

In [8]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 332661 entries, 1000004597 to 999965884
Data columns (total 8 columns):
article                 328553 non-null object
excerpt                 332661 non-null object
image                   317307 non-null object
saves_count             332661 non-null int64
saves_count_datetime    332661 non-null datetime64[ns]
title                   332661 non-null object
url                     332661 non-null object
topics                  332661 non-null object
dtypes: datetime64[ns](1), int64(1), object(6)
memory usage: 32.8+ MB
In [9]:
from itertools import chain

ax = pd.Series(list(chain.from_iterable(data.topics))
         ).value_counts(ascending=False
                       ).head(20).plot.barh(figsize=(5,15))

ax.set_title('Most Saved Topics (Top 20)')
ax.invert_yaxis()

All of these topics have 100 items because, as far as I can tell, Pocket Explore displays somewhere around 50-100 items per topic.

The article column is one we haven't touched yet, but it is full of information about the parsed and cleaned article.

Let's take a look at one sample.

In [10]:
data.article[0]
Out[10]:
{'additional_data': {},
 'authors': ['Azam Ahmed', 'Paulina Villegas'],
 'images': ['https://static01.nyt.com/images/2015/08/03/world/03mexico-web/03mexico-web-facebookJumbo.jpg',
  'https://static01.nyt.com/images/2015/07/13/world/americas/how-el-chapo-got-out-1436802970864/how-el-chapo-got-out-1436802970864-thumbStandard-v2.png',
  'https://static01.nyt.com/images/2015/08/03/world/03mexico-web/03mexico-web-master1050.jpg',
  'https://static01.nyt.com/images/2015/07/18/world/18MEXICO5/18MEXICO5-thumbStandard-v2.jpg'],
 'keywords': ['death',
  'mexico',
  'killing',
  'huge',
  'veracruz',
  'rally',
  'harassment',
  'worked',
  'killed',
  'espinosa',
  'journalists',
  'city',
  'journalist'],
 'meta_description': 'Ruben Espinosa, who worked for the prominent magazine Proceso, and four other people were found bound and tortured in an apartment in Mexico City.',
 'meta_favicon': 'https://static01.nyt.com/favicon.ico',
 'meta_img': 'https://static01.nyt.com/images/2015/08/03/world/03mexico-web/03mexico-web-facebookJumbo.jpg',
 'meta_keywords': ['Murders  Attempted Murders and Homicides',
  'Espinosa  Ruben (d 2015)',
  'News and News Media',
  'Demonstrations  Protests and Riots',
  'Proceso (Magazine)',
  'Mexico'],
 'movies': [],
 'publish_date': 1438552800.0,
 'source_url': 'https://www.nytimes.com',
 'summary': 'PhotoMEXICO CITY — A crowd of several thousand people gathered Sunday in the capital to denounce the death of a Mexican photographer killed early Saturday morning, the seventh journalist killed in Mexico this year.\nThe bodies of Ruben Espinosa, who worked for the prominent magazine Proceso, and four other people were found bound and tortured in an apartment in the Narvarte neighborhood of Mexico City, according to news reports.\nMr. Espinosa often covered politics in Veracruz, a state in southeast Mexico known to be a hostile place for journalists, and he spoke out against the harassment of fellow journalists.\nOf the seven journalists killed in 2015, four worked in Veracruz.\nAdvertisement Continue reading the main storyTo those at the rally, the killing was a reminder of the violence journalists face here.',
 'tags': [],
 'text': 'Photo\n\nMEXICO CITY — A crowd of several thousand people gathered Sunday in the capital to denounce the death of a Mexican photographer killed early Saturday morning, the seventh journalist killed in Mexico this year.\n\nThe bodies of Ruben Espinosa, who worked for the prominent magazine Proceso, and four other people were found bound and tortured in an apartment in the Narvarte neighborhood of Mexico City, according to news reports.\n\nMr. Espinosa often covered politics in Veracruz, a state in southeast Mexico known to be a hostile place for journalists, and he spoke out against the harassment of fellow journalists.\n\nOf the seven journalists killed in 2015, four worked in Veracruz. Since 2010, 13 journalists have been killed there in the tenure of Gov. Javier Duarte, of the ruling Institutional Revolutionary Party, according to Article 19, a media rights group. In all, 41 journalists have been killed since 2010.\n\nAdvertisement Continue reading the main story\n\nTo those at the rally, the killing was a reminder of the violence journalists face here. “I can’t put responsibility for his death on the government directly, but we can hold this government responsible for the climate of harassment and impunity that prevails in Veracruz,” said Jenaro Villamil, an investigative journalist.',
 'title': 'Huge Mexico City Rally Over Killing of Journalist',
 'top_img': 'https://static01.nyt.com/images/2015/08/03/world/03mexico-web/03mexico-web-facebookJumbo.jpg'}

Each article is a complex dict object, and for it to be useful we have to "flatten" it and make its values first-class columns of our DataFrame. We do that by adding an article_ prefix to each dict key that becomes a column.

In [11]:
from pypocketexplore.parser import PocketArticleDownloader

for article_attribute in PocketArticleDownloader.ARTICLE_ATTRIBUTES_TO_KEEP:
    data.loc[~data.article.isnull(), 'article_{}'.format(article_attribute)] = data.loc[~data.article.isnull(), 'article'].map(lambda x: x.get(article_attribute))

Let's take a look at the columns we just created

In [12]:
data.loc[:, data.columns.str.startswith('article_')].iloc[0]
Out[12]:
article_title               Huge Mexico City Rally Over Killing of Journalist
article_text                Photo\n\nMEXICO CITY — A crowd of several thou...
article_top_img             https://static01.nyt.com/images/2015/08/03/wor...
article_meta_keywords       [Murders  Attempted Murders and Homicides, Esp...
article_summary             PhotoMEXICO CITY — A crowd of several thousand...
article_additional_data                                                    {}
article_source_url                                    https://www.nytimes.com
article_keywords            [death, mexico, killing, huge, veracruz, rally...
article_meta_img            https://static01.nyt.com/images/2015/08/03/wor...
article_publish_date                                              1.43855e+09
article_meta_favicon                     https://static01.nyt.com/favicon.ico
article_movies                                                             []
article_tags                                                               []
article_authors                                [Azam Ahmed, Paulina Villegas]
article_images              [https://static01.nyt.com/images/2015/08/03/wo...
article_meta_description    Ruben Espinosa, who worked for the prominent m...
Name: 1000004597, dtype: object

We can now drop the original article column

In [13]:
data = data.drop('article', axis=1)

Now, to make things clearer, let's add another column that explicitly identifies the publisher of each item (e.g. nytimes.com, theatlantic.com etc.)

In [14]:
from urllib.parse import urlparse
data['website'] = data['url'].map(lambda x: urlparse(x).netloc).str.replace('www.', '')

Wondering which publishers appear most often in our dataset? Here is the top 20. nytimes.com, as expected (by me at least), is in 1st place. The interesting thing to note, though, is that youtube.com is in 2nd place. Interesting in the sense that people save items they cannot really consume offline, which reinforces my impression that people use Pocket largely as a bookmarking service rather than a "consume offline" service.

In [15]:
ax = data.website.value_counts().head(20).plot.barh(figsize=(10,12))
ax.invert_yaxis()
ax.set_title('Most Saved Publishers (Top 20)', fontsize=12, fontweight='bold')
Out[15]:
<matplotlib.text.Text at 0x7fac1da761d0>

I guess many people save Hacker News posts due to the quality of the comments there. Let's see what the most popular HN items are. Notice that the HN front page itself is among them.

In [16]:
ax = data.loc[data.website == 'news.ycombinator.com', 
         ['title', 'saves_count']].sort_values('saves_count', 
                                               ascending=False).set_index('title').head(20).plot.barh(figsize=(10,15), title='Most Saved Hacker News Posts')
ax.invert_yaxis()
ax.set_title('Most Saved Hacker News Items (Top 20)', fontsize=12, fontweight='bold')
Out[16]:
<matplotlib.text.Text at 0x7fac1d9dd3c8>

Let's see what the most frequent tags are (as extracted by Newspaper)

In [17]:
all_tags = list(chain.from_iterable(data.article_tags.dropna())) 
ax = pd.Series(all_tags).value_counts().head(20).plot.barh(figsize=(10,12))
ax.invert_yaxis()
ax.set_title('Most Popular Tags (Top 20)', fontsize=12, fontweight='bold')
Out[17]:
<matplotlib.text.Text at 0x7fac1c152e10>
In [18]:
all_keywords = list(chain.from_iterable(data.article_keywords.dropna())) 
ax = pd.Series(all_keywords).value_counts().head(20).plot.barh(figsize=(10,12))
ax.invert_yaxis()
ax.set_title('Most Popular Keywords (Top 20)', fontsize=12, fontweight='bold')
Out[18]:
<matplotlib.text.Text at 0x7fac141375c0>

Let's now create some features that can be useful for building a predictive model. These should be enough to whet one's appetite for building more complex stuff. I think their semantics are pretty straightforward.

In [19]:
general_features = {
    'feature_excerpt_len_words': data.excerpt.str.split(' ').str.len(),
    'feature_excerpt_len_chars': data.excerpt.str.len(),
    'feature_has_image': ~data.image.isnull(),
    'feature_n_topics': data.topics.apply(len),
    'feature_has_top_image': ~data.article_top_img.isnull(),
    'feature_title_n_words': data.title.fillna(value='').str.split(' ').str.len(),
    'feature_text_n_words': data.article_text.fillna(value='').str.split(' ').str.len(),
    'feature_summary_n_words': data.article_summary.fillna(value='').str.split(' ').str.len(),
    'feature_article_n_keywords': data.article_keywords.fillna(value='').apply(len),
    'feature_article_n_videos': data.article_movies.fillna(value='').apply(len),
    'feature_article_n_tags': data.article_tags.fillna(value='').apply(len),
    'feature_article_n_images': data.article_images.fillna(value='').apply(len),
}

data = data.assign(**general_features)

Here are the features we created

In [20]:
data.loc[:, data.columns.str.startswith('feature_')].head()
Out[20]:
feature_article_n_images feature_article_n_keywords feature_article_n_tags feature_article_n_videos feature_excerpt_len_chars feature_excerpt_len_words feature_has_image feature_has_top_image feature_n_topics feature_summary_n_words feature_text_n_words feature_title_n_words
item_id
1000004597 4 13 0 0 210 35 True True 1 127 200 8
1000025547 0 0 0 0 222 45 False False 1 1 1 11
1000026486 87 14 0 0 115 22 True True 5 80 87 8
1000055792 3 13 7 0 236 35 True True 16 114 8753 2
1000056752 5 11 4 0 185 29 True True 4 135 3109 4

Let's check the distributions for some of those features

In [21]:
import seaborn as sns
ax = sns.distplot(data.feature_text_n_words)
ax.set_title('Distribution: Number of words per item')
plt.gcf().set_size_inches((8,7))
In [22]:
ax = sns.distplot(data.feature_article_n_images)
ax.set_title('Distribution: Number of images per item')
plt.gcf().set_size_inches((8,7))

Let's ignore the outliers (limit to feature_article_n_images < 100 )

In [23]:
ax = sns.distplot(data[data.feature_article_n_images < 100].feature_article_n_images)
ax.set_title('Distribution: Number of images per item `feature_article_n_images < 100`')
plt.gcf().set_size_inches((8,7))

Let's train a simple random forest with these features

In [24]:
X, y = data.loc[:, data.columns.str.startswith('feature_')], data.saves_count
In [25]:
X.shape, y.shape
Out[25]:
((332661, 12), (332661,))
In [26]:
from sklearn.ensemble import RandomForestRegressor
In [27]:
reg = RandomForestRegressor(n_estimators=100, n_jobs=-1)
In [28]:
reg.fit(X, y)
Out[28]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

What about feature importance?

In [29]:
feature_importance = pd.Series(dict(zip(X.columns, reg.feature_importances_)))
ax = feature_importance.sort_values().plot.barh(figsize=(8,5))
ax.set_title('Feature Importance')
Out[29]:
<matplotlib.text.Text at 0x7fabfda73208>

So, the number of topics a particular web page is matched with makes a pretty good predictor of a page's saves_count.
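Keep in mind that the forest above was fit on the full dataset, so these importances say more about what the model leans on than about its out-of-sample accuracy. A quick sanity check of the predictive power would be cross-validation, along these lines (a minimal sketch):

# Minimal sketch: estimate out-of-sample performance instead of fitting on everything.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

reg = RandomForestRegressor(n_estimators=100, n_jobs=-1)
scores = cross_val_score(reg, X, y, cv=3, scoring='r2')
print(scores.mean(), scores.std())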

This looks interesting enough, but we'll pause here for now and continue a more in-depth analysis in the next post.

But I think everyone gets the picture, so feel free to get the data yourself and work on building more sophisticated models.


Tags #python #pandas #nltk #nlp #machine learning #pocket #scraping #projects