Run Your Own Scraping API with PhearJS


So-called 'client-side dynamic rendering' gives clients cool experiences, but makes it harder for machines to comprehend. If you want to do data mining, scrape websites or send static versions of your slick single-page application to Altavista, you essentially need a browser in the loop. This is especially important given the number of sites that use React, Angular, jQuery or some other fancy JavaScript framework.

PhearJS is open-source software that exposes the power of the PhantomJS headless browser through an HTTP API. You make HTTP requests to your PhearJS API to fetch a web page, and you get back nice JSON containing the rendered HTML and relevant metadata.

In this tutorial we'll check out how you can set this up yourself.

Setting up

PhearJS runs at least on popular, recent Linux distros and on Mac OS X. First we need some dependencies:

  • Memcached, do: brew install memcached. Replace brew with something like apt-get depending on your OS.
  • NodeJS, you probably have it, but if not, get it.
  • PhantomJS 2+, installation for version 2+ currently differs quite a bit between OS's, so it's best to follow their installation instructions.

Woo! Dependencies down, now get PhearJS:

git clone https://github.com/Tomtomgo/phearjs.git
cd phearjs
npm install

Boom, that's it! You can verify PhearJS is working by running it; you should see some info in the terminal:

node phear.js

If you open your browser and go to http://localhost:8100/status it should show you some stats on the server.

Making requests

Okay, so by now we have PhearJS running. Rendering a web page is simple. I'll use cURL here, but you can also use your browser with a JSON viewer plugin:

# URL is URL-encoded, like you'd do with encodeURIComponent()
curl "http://localhost:8100/?fetch_url=https%3A%2F%2Fdavidwalsh.name%2F"

In about five seconds you will see a JSON response with the rendered HTML and metadata, like request headers. Try it again and you will get a response in an instant.
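Roughly, the response looks like the fragment below. Apart from content (the rendered HTML, which we'll use later in this tutorial), the field names here are illustrative assumptions; check the GitHub docs for the exact schema:

```json
{
  "content": "<!DOCTYPE html><html>... the fully rendered markup ...</html>",
  "headers": { "...": "... request headers and other metadata ..." }
}
```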

But wait, why does it take five seconds the first time? Those five seconds are a deliberate delay: they give PhearJS time to finish AJAX requests and render the page. Subsequent requests are served from cache and hence quick.

Now if you are on a slow connection, or know that you will be scraping heavy pages, you can increase this delay:

# parse_delay is in milliseconds; force=true forces a refresh of the cache
curl "http://localhost:8100/?fetch_url=https%3A%2F%2Fdavidwalsh.name%2F&parse_delay=10000&force=true"

This is the simplest usage of PhearJS. There are many more configuration and run-time options that are documented on Github.
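If you are going to call PhearJS from Node anyway, a tiny helper keeps these query strings readable. This is just a sketch: fetch_url, parse_delay and force are the parameters shown above, and the full list lives in the GitHub docs.

```javascript
// Compose a PhearJS request URL from a page URL plus optional parameters.
// Only fetch_url, parse_delay and force appear in this tutorial; see the
// PhearJS README for the complete set of options.
function phearUrl(pageUrl, options) {
  var target = 'http://localhost:8100/?fetch_url=' + encodeURIComponent(pageUrl);
  Object.keys(options || {}).forEach(function(key) {
    target += '&' + key + '=' + encodeURIComponent(options[key]);
  });
  return target;
}

console.log(phearUrl('https://davidwalsh.name/', { parse_delay: 10000, force: true }));
// http://localhost:8100/?fetch_url=https%3A%2F%2Fdavidwalsh.name%2F&parse_delay=10000&force=true
```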

Scraping

Let's look at a common use case for PhearJS: scraping. Say we want to get images from a blog page that are not visible without JavaScript enabled, e.g. https://davidwalsh.name/.

Dependencies

We will use Cheerio and Request for parsing and making requests:

npm install cheerio request

Writing scrape.js

Once that's done we can go through some simple steps to retrieve all images on this page:

// 1. load dependencies
var cheerio = require('cheerio'),
    request = require('request'),
    url = require('url');

var page_url = 'https://davidwalsh.name';
var results = [];

// 2. encode the URL and add to PhearJS endpoint
var target = 'http://localhost:8100/?fetch_url=' + encodeURIComponent(page_url);

// 3. use request to GET the page
request.get(target, function(error, response, body) {

    // 4. load the DOM from the response JSON
    var $ = cheerio.load(JSON.parse(body).content);

    // 5. use cheerio's jQuery-style selectors to get all images
    $("img").each(function(i, image) {

        // 6. resolve absolute URL and add to our results array
        // (skip images without a src attribute; url.resolve needs a string)
        var src = $(image).attr('src');
        if (src) results.push(url.resolve(page_url, src));
    });

    // 7. and boom! there's our images
    console.log(results);
});

Run it!

Running this script will give you a list of all the images on the page:

# run PhearJS
node phear.js

# in another shell run the script
node scrape.js
[ <url>, ..., <url> ]

Next

This is a very trivial example of scraping with PhearJS. It's up to you to apply it to different scenarios, like crawling or automated batch scraping. I'd be interested to hear what you've used PhearJS for!

Conclusion

PhearJS is open-source software that allows you to run your own scraping or prerendering "microservice". It renders web pages and returns them as JSON over HTTP.

Here we focused on how to set up PhearJS for a very simple scraping task. SEO is another important use case, for which the phearjs-express middleware might be relevant.

About Tom Aizenberg

Tom Aizenberg is a software developer from Amsterdam, Holland. He is the author of PhearJS. Besides programming he enjoys playing the bass, surfing and fishing.

Discussion

  1. Thank dog someone has finally addressed the issue of how to send my slick SPA to Altavista.

  2. A new way to scrape, woot! :D

  3. Wow. This is really great :D It saves time re-inventing the wheel!
