Using Python to Create and Explore a Reddit Map

techgroom.com 3 November 2023



The aim of this notebook is to create and analyse a map of the 10,000 most popular subreddits on Reddit. To do this we need a way to measure how similar two subreddits are. In a fantastic piece on FiveThirtyEight, Trevor Martin carried out a thorough analysis of subreddits by considering the overlap of users commenting on two different subreddits. That piece focused on applying vector algebra to representative vectors to examine, for example, what remains when r/politics is removed from r/The_Donald. Our interest is a little more general: we want to map and visualise the space of subreddits and try to arrange subreddits into meaningful groups. Once that is done, we can explore a few of the groups and find interesting stories to tell.

The first step in all of this is getting the relevant subreddit overlap data. The BigQuery code, in a slightly modified form, is available on GitHub. The result is a file, which can be found here, containing more than 15 million pairwise commenter overlap counts between subreddits.

Getting ready

To create a subreddit map, we need to gather the relevant data and transform it into a form that can be used to build a map. We start by loading all the Python modules we need.
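As a rough illustration of this preparation step, the sketch below turns pairwise overlap counts into one normalised vector per subreddit. The column names (`subreddit_a`, `subreddit_b`, `overlap`) and the toy counts are hypothetical stand-ins for the real overlap file, and the L1 row normalisation is an assumption about how the vectors are made comparable, not a confirmed detail of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix
from sklearn.preprocessing import normalize

# Hypothetical stand-in for the overlap file: one row per subreddit pair.
raw = pd.DataFrame({
    "subreddit_a": ["python", "python", "datascience", "aww"],
    "subreddit_b": ["datascience", "learnpython", "learnpython", "cats"],
    "overlap": [120, 300, 80, 500],
})

# Map subreddit names to integer row and column indices.
row_names = sorted(raw["subreddit_a"].unique())
col_names = sorted(raw["subreddit_b"].unique())
row_index = {name: i for i, name in enumerate(row_names)}
col_index = {name: j for j, name in enumerate(col_names)}

rows = raw["subreddit_a"].map(row_index).to_numpy()
cols = raw["subreddit_b"].map(col_index).to_numpy()
counts = raw["overlap"].to_numpy(dtype=float)

# Sparse count matrix: one row per subreddit, one column per overlapping subreddit.
count_matrix = coo_matrix(
    (counts, (rows, cols)), shape=(len(row_names), len(col_names))
).tocsr()

# Assumption: scale each row to sum to 1 so that large and small
# subreddits become directly comparable.
vectors = normalize(count_matrix, norm="l1", axis=1)
```

With the real data, `raw` would instead be read from the overlap file (for example with `pd.read_csv`), and each row of `vectors` would be the representative vector for one subreddit.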

Creating a map out of subreddit vectors

Now that we have comparable vectors, can we start examining subreddits? Not quite. Each vector has 56187 dimensions, which makes them a little unwieldy. There is also a lot of redundancy built into those 56187-dimensional vectors, because in reality we do not expect our data to be that high dimensional. To make things easier to work with, we are going to reduce them to 500-dimensional vectors using a truncated singular value decomposition (SVD). Although it sounds complicated, all it really means is that we find a matrix of 500-dimensional vectors from which the full set of 56187-dimensional vectors can be (almost) reconstructed. This is just information compression, albeit slightly lossy.
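The reduction described above can be sketched with scikit-learn's `TruncatedSVD`, which works directly on sparse matrices. The random matrix here is only a small stand-in (200 rows by 1,000 columns, reduced to 20 dimensions) so the example runs quickly; with the real data the input would have 56187 columns and `n_components` would be 500.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Stand-in data: a small sparse matrix playing the role of the subreddit vectors.
rng = np.random.RandomState(42)
subreddit_vectors = sparse_random(
    200, 1000, density=0.05, format="csr", random_state=rng
)

# Truncated SVD keeps only the strongest components of the decomposition.
# The article uses 500 components; 20 is enough for this toy matrix.
svd = TruncatedSVD(n_components=20, random_state=42)
reduced_vectors = svd.fit_transform(subreddit_vectors)

# Fraction of the variance retained: a rough measure of how lossy the compression is.
retained = svd.explained_variance_ratio_.sum()
print(reduced_vectors.shape, f"variance retained: {retained:.2%}")
```

`svd.inverse_transform(reduced_vectors)` gives the approximate reconstruction of the original vectors, which is the sense in which the decomposition is a lossy compression.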

Grouping the Map

We now have a map, but with 10,000 data points it will be hard to interpret. To help with that, we can cluster the subreddits on the map and then look at a small(ish) number of groups of communities that sit close together on it.

For clustering I’ve decided to use the HDBSCAN* clustering algorithm. As a density-based clustering algorithm, HDBSCAN* sees clusters as dense regions separated from one another by less dense regions. Notably, it supports the concept of “noise” (data points that are outliers and do not clearly belong to any cluster) and, unlike K-Means, can find clusters of varying shapes.

Visualisation

Now that the subreddits are mapped and clustered, we can visualise the data to see what is going on. Because we have 10,000 points, a basic scatterplot will be very busy and suffer from a lot of overplotting.
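One common way to soften overplotting in a plain scatterplot is to draw small, semi-transparent markers coloured by cluster. The sketch below uses Matplotlib with random stand-in coordinates and labels; the marker size, transparency and colour map are illustrative assumptions, and this is not necessarily the plotting approach the original notebook takes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Stand-in data: 10,000 random map coordinates with random cluster labels.
rng = np.random.default_rng(0)
coords = rng.normal(size=(10_000, 2))
labels = rng.integers(-1, 8, size=10_000)  # -1 plays the role of noise

fig, ax = plt.subplots(figsize=(8, 8))
ax.scatter(
    coords[:, 0],
    coords[:, 1],
    c=labels,
    cmap="tab10",
    s=2,           # small markers reduce overlap between neighbouring points
    alpha=0.3,     # transparency lets dense regions show up as darker areas
    linewidths=0,
)
ax.set_xticks([])
ax.set_yticks([])
ax.set_title("Subreddit map (stand-in data)")
```

Calling `fig.savefig("subreddit_map.png", dpi=150)` afterwards would write the figure to disk; the file name is only an example.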