Web Scraping For Research

Web-scraping tools provide a regular supply of quantitative and qualitative information, helping you analyze customer needs and interests and tailor your products accordingly. They also let you keep pace with the market by surveying the latest trends. A web crawler is a bot that searches the web for your keywords and finds relevant sources of data for your research. It has the enormous job of chasing down every use of your keyword on the internet. That is a tall order for a human to complete!

Web scraping is ranked as the #1 source for capturing alternative data. Finance companies are trying to adapt to the demands of gathering these 'big data' alternative data sets, but many still resort to legacy web scraping methods, which leave them with unreliable and slow data-gathering processes. Web scraping, also known as web extraction or harvesting, is a technique for extracting data from the World Wide Web (WWW) and saving it to a file system or database for later retrieval or analysis.

Hosting websites costs money, and scraping takes up bandwidth. If you are familiar with denial-of-service attacks, scraping a website or sending bots to it can load the server in a similar way. Write responsible programs that limit bandwidth use: wait a few seconds between requests, and try to scrape during off-peak hours. Scrape only what you need. Finally, respect the law.
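The advice about waiting between requests can be sketched in Python as follows. This is a minimal illustration using only the standard library, not a complete scraper: the function names `fetch_url` and `polite_fetch`, the three-second default delay, and the User-Agent string are all assumptions made for this example.

```python
import time
import urllib.request
from typing import Callable, Iterable, Iterator, Tuple


def fetch_url(url: str, timeout: float = 30.0) -> bytes:
    """Download one page. The User-Agent value is a placeholder."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "research-scraper-example/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def polite_fetch(
    urls: Iterable[str],
    delay_seconds: float = 3.0,
    fetch: Callable[[str], bytes] = fetch_url,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (url, body) pairs, pausing between consecutive requests."""
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        yield url, fetch(url)
```

Passing `fetch` and `sleep` as parameters is a design choice for this sketch: it lets you try the pacing logic with stand-in functions before pointing it at a real site.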

The Research Computing team recognizes the ever-growing need for researchers to be able to harvest data from the web and is constantly on the lookout for the best tools for your scraping needs. We currently partner with Mozenda to provide web scraping services for Wharton researchers. We also suggest some frameworks for building custom scrapers when Mozenda just won’t do.

What is Mozenda?

Mozenda (http://www.mozenda.com) is a hosted, WYSIWYG suite of tools that allow you to create and run scraping agents from their cloud. The data is stored on their servers and is downloadable in a variety of standard formats.

How does it work?

Using the Agent Builder (Windows 7/8/10 only!), you can quickly and easily “teach” an agent to perform certain actions on any website, then test and launch the agent to carry out those actions automatically. These actions can be scheduled to run repeatedly at a set interval. Once the agent is finished, you can have it send you a notification. You may then download the data in CSV, TSV or XML formats.
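Once you have downloaded a CSV or TSV export, it can be loaded for analysis with a few lines of Python. The sketch below uses only the standard library; the function name `read_export` and the column names in the usage example are hypothetical and do not reflect the actual layout of a Mozenda export.

```python
import csv
import io
from typing import Dict, List


def read_export(text: str, delimiter: str = ",") -> List[Dict[str, str]]:
    """Parse delimited text with a header row into a list of row dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


# Hypothetical TSV content; a real export will have its own columns.
sample = "title\tprice\nWidget\t9.99\n"
rows = read_export(sample, delimiter="\t")
```

For a file on disk, you would read its contents first (for example with `open(path, newline="", encoding="utf-8")`) and pass the text to the function.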

How do I get access?

E-mail us at research-computing@wharton.upenn.edu for access, and provide the following details:

What data will you be scraping? Please include URLs to all sources.
How many pages will you be scraping (a rough estimate is fine)?
How long will the project last?
What budget code should be charged for the pages scraped?
Cost is $1 per 1000 pages scraped, in $1 increments, chargeable via journal to any Wharton budget code.
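For budgeting, the pricing rule can be worked through with a short calculation. The sketch below assumes that a partial block of 1,000 pages is rounded up to the next whole dollar; that rounding is one reading of "in $1 increments" made for illustration, and the function name `estimated_cost_dollars` is hypothetical.

```python
def estimated_cost_dollars(pages: int, dollars_per_thousand: int = 1) -> int:
    """Estimate the charge, rounding partial thousands up (an assumption)."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    # Integer ceiling division: 2,500 pages counts as 3 blocks of 1,000.
    blocks = (pages + 999) // 1000
    return blocks * dollars_per_thousand
```

Under that assumption, an estimate of 2,500 pages works out to $3, and exactly 1,000 pages works out to $1.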

What if I have problems getting started?

Please contact Mozenda support. They are extremely helpful and have successfully walked a number of Wharton researchers through various scraping scenarios. They also have a wealth of demos and walkthroughs in their Help Center (http://www.mozenda.com/help/).

Custom Scraping Options

While Mozenda is a powerful, easy-to-use tool, there are times when your scraping needs are more complex and might require custom programming. In this case, we say good luck and godspeed! Seriously, while we can’t provide the programming for you, we can suggest some tools to try once you have found the right person to work with. Below is a brief list of open-source tools and frameworks:

Scrapy – open source scraping framework for Python.
scrape.py – Python module for scraping content from webpages.
ScraperWiki – Techniques and tips for scraping.
BeautifulSoup – Python library for quickly building out web scraping projects.
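To give a rough idea of what a custom scraper does once a page has been downloaded, here is a minimal sketch that collects the link targets from an HTML string. It deliberately uses only Python's standard-library `html.parser` rather than any of the tools listed above, which offer richer and more forgiving APIs; the names `LinkCollector` and `extract_links` are made up for this example.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(
        self, tag: str, attrs: List[Tuple[str, Optional[str]]]
    ) -> None:
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> List[str]:
    """Return all link targets found in the given HTML text."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

A real project would typically combine a step like this with paced downloading and would store the results in a file or database.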