Run Your Own Scraping API with PhearJS

By Tom Aizenberg on December 14, 2015

So-called 'client-side dynamic rendering' gives clients cool experiences, but makes it harder for machines to comprehend. In case you want to do data mining, scrape websites or send static versions of your slick single-page application to Altavista, you essentially need a browser in the loop. This is especially important given the amount of sites that use React, Angular, jQuery or some other fancy Javascript framework.

PhearJS is an open-source software that exposes the power of the PhantomJS headless browser through an HTTP API. You make HTTP-requests to your PhearJS API to fetch a web page and get a nice JSON, containing the rendered HTML and relevant meta data.

In this tutorial we'll check out how you can have this.

Setting up

PhearJS at least runs on popular, recent Linux distros and Mac OS X. First we need some dependencies:

Memcached, do: brew install memcached. Replace brew with something like apt-get depending on your OS.
NodeJS, you probably have it, but if not, get it.
PhantomJS 2+, installation for version 2+ currently differs quite a bit between OS's, so it's best to follow their installation instructions.

Woo! Dependencies down, now get PhearJS:

git clone https://github.com/Tomtomgo/phearjs.git
cd phearjs
npm install

Boom, that's it! You can verify PhearJS is well by running it, you should see some info on the terminal:

node phear.js

If you open your browser and go to http://localhost:8100/status it should show you some stats on the server.

Making requests

Okay, so by now we have PhearJS running. Rendering a web page is simple, I'll use cUrl here, but you can also use your browser with a JSON viewer plugin:

# URL is URL-encoded, like you'd do with encodeURIComponent()
curl "http://localhost:8100/" \
      "?fetch_url=https%3A%2F%2Fdavidwalsh.name%2F"

In about five seconds you will see a response JSON with the rendered HTML and meta data, like request headers. Try it again and you will get it in an instant.

But wait, why does it take five seconds the first time? Well, these five seconds are a delay that we use on purpose. It allows PhearJS some time for fetching AJAX requests and rendering. Subsequent requests are served from cache and hence quick.

Now if you are on a slow connection or know that you will be scraping heavy pages you could increase this delay:

curl "http://localhost:8100/" \
      "?fetch_url=https%3A%2F%2Fdavidwalsh.name%2F" \
      "&parse_delay=10000" \ # milliseconds
      "&force=true" # force a cache refresh

This is the simplest usage of PhearJS. There are many more configuration and run-time options that are documented on Github.

Scraping

Let's look at a common use case for PhearJS: scraping. Say we want to get images from a blog page that are not visible without Javascript enabled, e.g. https://davidwalsh.name/.

Dependencies

We will use Cheerio and Request for parsing and making requests:

npm install cheerio requests

Writing scrape.js

Once that's done we can go through some simple steps to retrieve all images on this page:

// 1. load dependencies
var cheerio = require('cheerio'),
    request = require('request'),
    url = require('url');

var page_url = 'https://davidwalsh.name';
var results = [];

// 2. encode the URL and add to PhearJS endpoint
var target = 'http://localhost:8100?fetch_url=' + encodeURIComponent(page_url);

// 3. use request to GET the page
request.get(target, function(error, response, body) {

    // 4. load the DOM from the response JSON
    var $ = cheerio.load(JSON.parse(body).content);

    // 5. use cheerio's jQuery-style selectors to get all images
    $("img").each(function(i, image) {

        // 6. resolve absolute URL and add to our results array
        results.push(url.resolve(page_url, $(image).attr('src')));
    });

    // 7. and boom! there's our images
    console.log(results);
});

Run it!

Running this script will give you a list of all the images on the page:

# run PhearJS
node phear.js

# in another shell run the script
node scrape.js
[ <url>, ..., <url> ]

This is a very trivial of scraping with PhearJS. It's up to you to apply it to different scenarios, like crawling or automating for batch scraping, whatever. I'd be interested to hear what you've used PhearJS for!

Conclusion

PhearJS is open-source software that allows you to run your own scraping or prerendering "microservice". It renders web pages and returns them as JSON over HTTP.

Here we focussed on how to set up PhearJS for a very simple scraping task. SEO is another important one, for which the phearjs-express middleware might be relevant.

About Tom Aizenberg

Tom Aizenberg is software developer from Amsterdam, Holland. He is the author of PhearJS. Besides programming he enjoys playing the bass, surfing and fishing.

github.com Posts

Run Your Own Scraping API with PhearJS

Setting up

Making requests

Scraping

Dependencies

Writing scrape.js

Run it!

Next

Conclusion

About Tom Aizenberg

Recent Features

Page Visibility API

Regular Expressions for the Rest of Us

Incredible Demos

JavaScript Canvas Image Conversion

SmoothScroll Using MooTools 1.2

Discussion