Web scraping Reddit using Node JS and Puppeteer

In this article, we will learn how to quickly scrape Reddit posts using Puppeteer.
Puppeteer uses the Chromium browser behind the scenes to actually render HTML and JavaScript, and so is very useful for getting content that is loaded by JavaScript/AJAX functions.
For this, you will need to install Puppeteer inside a directory where you will write the scripts to scrape the data. For example, make a directory like this...
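```
mkdir reddit-scraper   # the directory name here is just an example
cd reddit-scraper
npm install puppeteer
```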
That will take a moment as it installs Puppeteer and Chromium.
Once done, let's start with a script like this...
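A minimal sketch of such a script (the subreddit URL and the exact User-Agent value are placeholders):

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // Launch headless Chromium; the --user-agent flag imitates a
  // Chrome browser on a Mac so Reddit is less likely to block us
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      '--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36',
    ],
  });
  const page = await browser.newPage();

  // Load the subreddit and wait until the page has fully loaded,
  // including content rendered by JavaScript
  await page.goto('https://www.reddit.com/r/node/', {
    waitUntil: 'networkidle2',
  });

  // evaluate() runs our function inside the page, where we can
  // query the rendered DOM with CSS selectors
  const title = await page.evaluate(() => document.title);
  console.log(title);

  await browser.close();
})();
```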
Even though it looks like a lot, it just loads up the Puppeteer browser, creates a new page, loads the URL we want, and waits for the full HTML to be loaded.
The evaluate function then gets into the page's content and allows you to query it with Puppeteer's query functions and CSS selectors.
The second line, where the launch happens, instructs Puppeteer to run in headless mode, so you don't see the browser but it's there behind the scenes. The --user-agent string imitates a Chrome browser on a Mac so you don't get blocked.
Save this file as get_reddit.js and if you run it, it should not return any errors.
Now let's see if we can scrape some data...
Open Chrome and navigate to the node subreddit (reddit.com/r/node).
We are going to scrape all the posts. Let's open the inspect tool to see what we are up against.
With some tinkering around, you can see that each post is encapsulated in a tag with the class name Post, amongst a lot of other gibberish.
Since every post is inside this one class, we are going to use a forEach loop over the matching elements and extract the individual pieces separately.
So the code will look like this...
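A sketch of that loop, building on the script above (the h3 and div tag names are assumptions about Reddit's markup, which changes frequently):

```javascript
// Runs inside the page context, so we can use the DOM API directly
const posts = await page.evaluate(() => {
  const results = [];
  document.querySelectorAll('.Post').forEach((el) => {
    // Assumption: the title sits in an <h3> inside each post
    const title = el.querySelector('h3');
    // The first tag after the element whose id starts with
    // "upvote-button" holds the vote count
    const votes = el.querySelector('[id^="upvote-button"] + div');
    results.push({
      title: title ? title.innerText : null,
      votes: votes ? votes.innerText : null,
    });
  });
  return results;
});
console.log(posts);
```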
You can see how this tag always holds the title, so we fetch that.
The query...
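```javascript
el.querySelector('[id^="upvote-button"] + div') // "div" here is an assumption about the adjacent tag
```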
Gets us the first tag after the element with the id starting with upvote-button.
Something similar is happening here as well...
If you run this, it should print all the posts like so...
If you want to use this in production and scale to thousands of links, you will find that you get IP blocked easily by Reddit. In this scenario, using a rotating proxy service to rotate IPs is almost a must.
Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.
Our rotating proxy server, Proxies API, provides a simple API that can solve all IP blocking problems instantly.
- With millions of high-speed rotating proxies located all over the world,
- With our automatic IP rotation
- With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
- With our automatic CAPTCHA solving technology,
Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.
The whole thing can be accessed by a simple API like below in any programming language.
In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes: you can just get the data and parse it in any language like Node or PHP, or using any framework like Scrapy or Nutch. In all these cases you can just call the URL with render support like so...
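(The parameter names below are illustrative; check the Proxies API documentation for the exact format.)

```
curl "http://api.proxiesapi.com/?auth_key=YOUR_API_KEY&render=true&url=https://www.reddit.com/r/node/"
```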
We have a running offer of 1000 API calls completely free. Register and get your free API Key here.
Scraping Reddit Using Python

The goal is to extract or “scrape” information from the posts on the front page of a subreddit, e.g. http://reddit.com/r/learnpython/new/
You should know that Reddit has an API and PRAW exists to make using it easier.
- You use it, taking the blue pill—the article ends.
- You take the red pill—you stay in Wonderland, and I show you how deep a JSON response goes.
Remember: all I’m offering is the truth. Nothing more.
Reddit allows you to add a .json extension to the end of your request and will give you back a JSON response instead of HTML.
We’ll be using requests as our “HTTP client”, which you can install using pip install requests --user if you have not already.
We’re setting the User-Agent header to Mozilla/5.0 as the default requests value is blocked.
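Putting that together, using the example subreddit from above:

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.reddit.com/r/learnpython/new/.json'

r = requests.get(url, headers=headers)
```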
We know that we’re receiving a JSON response from this request, so we use the .json() method on a Response object, which turns a JSON “string” into a Python structure (also see json.loads()).
To see a pretty-printed version of the JSON data we can use json.dumps() with its indent argument.
The output generated for this particular response is quite large, so it makes sense to write the output to a file for further inspection.
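For example:

```python
import json

# Pretty-print the JSON and write it to a file for inspection
with open('output.json', 'w') as f:
    print(json.dumps(r.json(), indent=4), file=f)
```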
Note if you’re using Python 2 you’ll need from __future__ import print_function to have access to the print() function that has the file argument (or you could just use json.dump()).
Upon further inspection we can see that r.json()['data']['children'] is a list of dicts and each dict represents a submission or “post”.
There is also some “subreddit” information available.
These before and after values are used for result page navigation, just like when you click on the next and prev buttons. To get to the next page we can pass after=t3_64o6gh as a GET param.
When making multiple requests however, you will usually want to use a session object.
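A sketch of fetching the next page with a session, reusing the after value from above:

```python
import requests

# A session reuses the underlying connection and keeps shared headers
s = requests.Session()
s.headers.update({'User-Agent': 'Mozilla/5.0'})

url = 'https://www.reddit.com/r/learnpython/new/.json'

# The "after" value from the previous response takes us to the next page
r = s.get(url, params={'after': 't3_64o6gh'})
```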
So as mentioned, each submission is a dict and the important information is available inside the data key:
I’ve truncated the output here but important values include author, selftext, title and url.
It’s pretty annoying having to use ['data'] all the time, so we could have instead declared posts using a list comprehension.
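For example:

```python
# Index into "data" once up front instead of on every access
posts = [post['data'] for post in r.json()['data']['children']]
```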
One example of why you may want to do this is to “scrape” the links from one of the “image posting” subreddits to access the images.
r/aww
One such subreddit is r/aww, home of “teh cuddlez”.
Some of these URLs would require further processing though as not all of them are direct links to images and not all of them are images.
In the case of the direct image links we could fetch them and save the result to disk.
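A minimal sketch, assuming the posts list and session s from above and filtering for a few common image extensions:

```python
import os

for post in posts:
    url = post['url']
    # Skip anything that isn't a direct image link
    if not url.endswith(('.jpg', '.png', '.gif')):
        continue
    img = s.get(url)
    # Use the last part of the URL as the filename
    filename = os.path.basename(url)
    with open(filename, 'wb') as f:
        f.write(img.content)
```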
BeautifulSoup
You could of course just request the regular URL and process the HTML with BeautifulSoup and html5lib, which you can install using pip install beautifulsoup4 html5lib --user if you do not already have them.
BeautifulSoup’s select() method locates items using CSS Selectors, and div.thing here matches <div> tags that contain thing as a class name, e.g. class='thing'.
We can then use dict indexing on a BeautifulSoup Tag object to extract the value of a specific tag attribute. In this case the URL is contained in the data-url='...' attribute of the <div> tag.
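Put together, it might look like this (old.reddit.com is used here on the assumption that it still serves the div.thing markup):

```python
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://old.reddit.com/r/aww/new/', headers=headers)

soup = BeautifulSoup(r.text, 'html5lib')

# div.thing matches <div> tags with "thing" in their class attribute
for thing in soup.select('div.thing'):
    # Dict indexing on a Tag extracts an attribute's value
    print(thing['data-url'])
```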
As already mentioned, Reddit does have an API with rules/guidelines, and if you want to do any type of “large-scale” interaction with Reddit you should probably use it via the PRAW library.