Web Scraping Reddit Using Node JS and Puppeteer



In this article, we will learn how to quickly scrape Reddit posts using Puppeteer.


Puppeteer uses the Chromium browser behind the scenes to actually render HTML and JavaScript, so it is very useful for getting content that is loaded by JavaScript/AJAX functions.

For this, you will need to install Puppeteer inside a directory where you will write the scripts to scrape the data. For example, make a directory like this...
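Something like this should work (the directory name here is just an example):

```bash
mkdir reddit-scraper
cd reddit-scraper
npm install puppeteer
```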


That will take a moment to install puppeteer and Chromium.

Once done, let's start with a script like this...
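The original snippet was not preserved in this copy, so here is a minimal sketch of the kind of starter script described (the subreddit URL and the exact user-agent string are placeholders):

```javascript
const puppeteer = require('puppeteer');

(async () => {
  // launch Chromium in headless mode; the --user-agent flag makes the
  // requests look like they come from Chrome on a Mac
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      '--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
    ],
  });

  const page = await browser.newPage();

  // wait until network activity settles so JavaScript-rendered content
  // has made it into the DOM
  await page.goto('https://www.reddit.com/r/node/', { waitUntil: 'networkidle2' });

  await browser.close();
})();
```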

Even though it looks like a lot, it just loads up the Puppeteer browser, creates a new page, loads the URL we want, and waits for the full HTML to be loaded.

The evaluate function gets into the page's content and allows you to query it with the browser's own DOM query functions and CSS selectors.

The line where the launch happens instructs Puppeteer to load in headless mode, so you don't see the browser, but it's there behind the scenes. The --user-agent string imitates a Chrome browser on a Mac so you don't get blocked.


Save this file as get_reddit.js and, if you run it with node get_reddit.js, it should not return any errors.

Now let's see if we can scrape some data...

Open Chrome and navigate to the node subreddit at https://www.reddit.com/r/node/

We are going to scrape all the posts. Let's open the inspect tool to see what we are up against.

You can see with some tinkering around that each post is encapsulated in a tag with a class name Post amongst a lot of other gibberish.

Since everything is inside this one class, we are going to use a forEach loop to get the data inside them and get all the individual pieces separately.


So the code will look like this...
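The original code block was lost in this copy, so this is a plausible reconstruction that slots in after the page.goto() call above; the .Post class and the upvote-button sibling selector come from the surrounding text, while the heading selector is an assumption:

```javascript
// page.evaluate() runs in the page context, so document.querySelectorAll
// here is the browser's own API
const posts = await page.evaluate(() => {
  const results = [];
  // each post card carries the class name "Post"
  document.querySelectorAll('.Post').forEach((item) => {
    const title = item.querySelector('h3'); // the post title lives in a heading tag
    const votes = item.querySelector("[id^='upvote-button'] + div"); // element right after the upvote button
    results.push({
      title: title ? title.innerText : null,
      votes: votes ? votes.innerText : null,
    });
  });
  return results;
});

console.log(posts);
```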

You can see how the heading tag inside each post always has the title, so we fetch that.

The query...
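(a plausible reconstruction; the exact selector in the original may have differed)

```javascript
item.querySelector("[id^='upvote-button'] + div")
```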

Gets us the first element after the one whose id starts with upvote-button.

Something similar is happening here as well...
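That snippet wasn't preserved either; a sibling-style query in the same spirit might pull the comment count, though this selector is a guess rather than what the article necessarily used:

```javascript
// guess: on the redesigned Reddit layout, the comments link carries a
// data-click-id attribute
const comments = item.querySelector("a[data-click-id='comments']");
const commentCount = comments ? comments.innerText : null;
```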


If you run this, it should print all the posts like so...

If you want to use this in production and want to scale to thousands of links, you will find that you get IP blocked easily by Reddit. In this scenario, using a rotating proxy service to rotate IPs is almost a must.

Otherwise, you tend to get IP blocked a lot by automatic location, usage, and bot detection algorithms.

Our rotating proxy server Proxies API provides a simple API that can solve all IP Blocking problems instantly.

  • With millions of high-speed rotating proxies located all over the world
  • With our automatic IP rotation
  • With our automatic User-Agent-String rotation (which simulates requests from different, valid web browsers and web browser versions)
  • With our automatic CAPTCHA-solving technology

Hundreds of our customers have successfully solved the headache of IP blocks with a simple API.

The whole thing can be accessed by a simple API like below in any programming language.

In fact, you don't even have to take the pain of loading Puppeteer, as we render JavaScript behind the scenes, and you can just get the data and parse it in any language like Node or PHP, or with any framework like Scrapy or Nutch. In all these cases, you can just call the URL with render support like so...
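For illustration only, a call might look something like this; the parameter names (auth_key, render) are placeholders, so check the Proxies API documentation for the real ones:

```bash
curl "http://api.proxiesapi.com/?auth_key=YOUR_API_KEY&render=true&url=https://www.reddit.com/r/node/"
```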


We have a running offer of 1000 API calls completely free. Register and get your free API Key here.

Scraping Reddit Using Python

The goal is to extract or “scrape” information from the posts on the front page of a subreddit, e.g. http://reddit.com/r/learnpython/new/

You should know that Reddit has an API and PRAW exists to make using it easier.

  • You use it, taking the blue pill—the article ends.
  • You take the red pill—you stay in Wonderland, and I show you how deep a JSON response goes.

Remember: all I’m offering is the truth. Nothing more.

Reddit allows you to add a .json extension to the end of your request and will give you back a JSON response instead of HTML.

We’ll be using requests as our “HTTP client” which you can install using pip install requests --user if you have not already.

We’re setting the User-Agent header to Mozilla/5.0 as the default requests value is blocked.
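Put together, the request looks something like this (a minimal sketch, using the subreddit from earlier):

```python
import requests

url = 'https://www.reddit.com/r/learnpython/new/.json'
headers = {'User-Agent': 'Mozilla/5.0'}  # the default requests value is blocked

r = requests.get(url, headers=headers)
```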


r.json()

We know that we’re receiving a JSON response from this request, so we use the .json() method on a Response object, which turns a JSON “string” into a Python structure (also see json.loads())

To see a pretty-printed version of the JSON data we can use json.dumps() with its indent argument.

The output generated for this particular response is quite large, so it makes sense to write the output to a file for further inspection.

Note: if you’re using Python 2 you’ll need from __future__ import print_function to have access to the print() function that has the file argument (or you could just use json.dump()).
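For example (output.json is an arbitrary filename):

```python
import json

# pretty-print the JSON and write it to a file for inspection
with open('output.json', 'w') as f:
    print(json.dumps(r.json(), indent=4), file=f)
```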

Upon further inspection we can see that r.json()['data']['children'] is a list of dicts and each dict represents a submission or “post”.
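For example:

```python
# each item in 'children' is a dict describing one submission
posts = r.json()['data']['children']
print(len(posts))                 # number of submissions on this page
print(posts[0]['data']['title'])  # title of the first submission
```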

There is also some “subreddit” information available.

These before and after values are used for result page navigation, just like when you click on the next and prev buttons.

To get to the next page we can pass after=t3_64o6gh as a GET param.
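Something like this, reusing the url and headers from earlier (t3_64o6gh is the after value taken from the previous response):

```python
r = requests.get(url, params={'after': 't3_64o6gh'}, headers=headers)
```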

When making multiple requests, however, you will usually want to use a session object.
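A session carries headers (and cookies) across requests, so the User-Agent only has to be set once; a minimal sketch:

```python
s = requests.Session()
s.headers.update({'User-Agent': 'Mozilla/5.0'})  # sent with every request from now on
r = s.get(url, params={'after': 't3_64o6gh'})
```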

So, as mentioned, each submission is a dict and the important information is available inside the data key:

I’ve truncated the output here, but important values include author, selftext, title and url

It’s pretty annoying having to use ['data'] all the time, so we could have instead declared posts using a list comprehension.
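That is, something like:

```python
# unwrap the 'data' key once, up front
posts = [post['data'] for post in r.json()['data']['children']]
print(posts[0]['title'])  # no more ['data'] at every access
```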

One example of why you might want to do this is to “scrape” the links from one of the “image posting” subreddits to access the images.

r/aww

One such subreddit is r/aww, home of “teh cuddlez”.

Some of these URLs would require further processing though as not all of them are direct links to images and not all of them are images.

In the case of the direct image links, we could fetch them and save the result to disk.
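A minimal sketch, assuming posts was built as above from the r/aww listing (https://www.reddit.com/r/aww/.json) and filtering on file extension to keep only direct image links:

```python
import os

for post in posts:
    url = post['url']
    # only fetch direct image links; anything else needs further processing
    if url.endswith(('.jpg', '.jpeg', '.png', '.gif')):
        image = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        with open(os.path.basename(url), 'wb') as f:
            f.write(image.content)
```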

BeautifulSoup


You could of course just request the regular URL, processing the HTML with BeautifulSoup and html5lib, which you can install using pip install beautifulsoup4 html5lib --user if you do not already have them.

BeautifulSoup’s select() method locates items using CSS selectors, and div.thing here matches <div> tags that contain thing as a class name, e.g. class='thing'

We can then use dict indexing on a BeautifulSoup Tag object to extract the value of a specific tag attribute.


In this case the URL is contained in the data-url='...' attribute of the <div> tag.
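Putting the pieces together (a sketch; note that the div.thing / data-url markup belongs to the old Reddit layout, so on the current site you may need to request https://old.reddit.com/ instead):

```python
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://old.reddit.com/r/aww/', headers=headers)

soup = BeautifulSoup(r.text, 'html5lib')

# div.thing matches <div> tags with 'thing' in their class attribute;
# dict indexing on the Tag pulls out its data-url attribute
for thing in soup.select('div.thing'):
    print(thing['data-url'])
```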


As already mentioned, Reddit does have an API with rules / guidelines, and if you’re wanting to do any type of “large-scale” interaction with Reddit, you should probably use it via the PRAW library.