Run Your Own Scraping API with PhearJS

By Tom Aizenberg

So-called 'client-side dynamic rendering' gives clients cool experiences, but makes pages harder for machines to comprehend. If you want to do data mining, scrape websites or send static versions of your slick single-page application to Altavista, you essentially need a browser in the loop. This is especially important given the number of sites that use React, Angular, jQuery or some other fancy JavaScript framework.

PhearJS is open-source software that exposes the power of the PhantomJS headless browser through an HTTP API. You make HTTP requests to your PhearJS API to fetch a web page and get back a nice JSON document containing the rendered HTML and relevant metadata.

In this tutorial we'll look at how to set this up and use it.

Setting up

PhearJS at least runs on popular, recent Linux distros and Mac OS X. First we need some dependencies:

  • Memcached: run brew install memcached. Replace brew with something like apt-get depending on your OS.
  • NodeJS: you probably have it, but if not, get it.
  • PhantomJS 2+: installation for version 2+ currently differs quite a bit between operating systems, so it's best to follow their installation instructions.

Woo! Dependencies down, now get PhearJS:

git clone https://github.com/Tomtomgo/phearjs.git
cd phearjs
npm install

Boom, that's it! You can verify that PhearJS works by running it. You should see some info in the terminal:

node phear.js

If you open your browser and go to http://localhost:8100/status it should show you some stats on the server.

Making requests

Okay, so by now we have PhearJS running. Rendering a web page is simple. I'll use cURL here, but you can also use your browser with a JSON viewer plugin:

# URL is URL-encoded, like you'd do with encodeURIComponent()
curl "http://localhost:8100/?fetch_url=https%3A%2F%2Fdavidwalsh.name%2F"

In about five seconds you will see a JSON response with the rendered HTML and metadata, such as request headers. Try it again and you will get it in an instant.

But wait, why does it take five seconds the first time? Well, these five seconds are a delay that we use on purpose. It allows PhearJS some time for fetching AJAX requests and rendering. Subsequent requests are served from cache and hence quick.
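As a rough sketch of what you might do with that JSON response in Node: the rendered HTML sits in the content field, which is the same field scrape.js reads later in this tutorial. The helper name parsePhearResponse and the headers field in the sample body are made up for illustration, so check the PhearJS documentation for the actual metadata fields.

```javascript
// Hypothetical helper: split a PhearJS response body into the rendered HTML
// and everything else. Only the `content` field is taken from this tutorial;
// any other field names are assumptions.
function parsePhearResponse(body) {
    var data = JSON.parse(body); // throws a SyntaxError on invalid JSON
    if (typeof data.content !== 'string') {
        throw new Error('Response has no "content" field');
    }
    var meta = {};
    Object.keys(data).forEach(function (key) {
        if (key !== 'content') {
            meta[key] = data[key];
        }
    });
    return { html: data.content, meta: meta };
}

// Example with a made-up response body (no PhearJS server needed):
var sample = JSON.stringify({
    content: '<html><body><img src="/logo.png"></body></html>',
    headers: { 'content-type': 'text/html' } // illustrative field only
});
var parsed = parsePhearResponse(sample);
console.log(parsed.html.length, Object.keys(parsed.meta));
```

Keeping the parsing in one place means a malformed or unexpected response fails loudly instead of handing undefined to an HTML parser.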

If you are on a slow connection, or know that you will be scraping heavy pages, you can increase the rendering delay:

# parse_delay is in milliseconds; force=true forces a cache refresh
curl "http://localhost:8100/?fetch_url=https%3A%2F%2Fdavidwalsh.name%2F&parse_delay=10000&force=true"

This is the simplest usage of PhearJS. There are many more configuration and run-time options, which are documented on GitHub.
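The request URLs used with cURL above can also be assembled in Node. This is a minimal sketch: the helper name buildPhearUrl is hypothetical, the endpoint is the default http://localhost:8100/ used in this tutorial, and parse_delay and force are the two options shown above. Every value is passed through encodeURIComponent() so the target URL survives as a single query parameter.

```javascript
// Hypothetical helper: build a PhearJS request URL for a page.
// `options` holds extra query parameters such as parse_delay or force.
function buildPhearUrl(pageUrl, options) {
    var endpoint = 'http://localhost:8100/'; // default used in this tutorial
    var params = ['fetch_url=' + encodeURIComponent(pageUrl)];

    Object.keys(options || {}).forEach(function (key) {
        params.push(
            encodeURIComponent(key) + '=' + encodeURIComponent(String(options[key]))
        );
    });

    return endpoint + '?' + params.join('&');
}

var plainUrl = buildPhearUrl('https://davidwalsh.name/');
var slowUrl = buildPhearUrl('https://davidwalsh.name/', {
    parse_delay: 10000, // milliseconds
    force: true         // force a cache refresh
});

console.log(plainUrl);
console.log(slowUrl);
```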

Scraping

Let's look at a common use case for PhearJS: scraping. Say we want to get images from a blog page that are not visible without JavaScript enabled, e.g. https://davidwalsh.name/.

Dependencies

We will use Cheerio for parsing the HTML and Request for making HTTP requests:

npm install cheerio request

Writing scrape.js

Once that's done we can go through some simple steps to retrieve all images on this page:

// 1. load dependencies
var cheerio = require('cheerio'),
    request = require('request'),
    url = require('url');

var page_url = 'https://davidwalsh.name';
var results = [];

// 2. encode the URL and add to PhearJS endpoint
var target = 'http://localhost:8100?fetch_url=' + encodeURIComponent(page_url);

// 3. use request to GET the page
request.get(target, function(error, response, body) {

    // stop early if PhearJS could not be reached
    if (error) {
        return console.error('Request to PhearJS failed:', error.message);
    }

    // 4. load the DOM from the response JSON
    var $ = cheerio.load(JSON.parse(body).content);

    // 5. use cheerio's jQuery-style selectors to get all images
    $("img").each(function(i, image) {
        var src = $(image).attr('src');

        // 6. resolve absolute URL and add to our results array
        //    (images without a src attribute are skipped)
        if (src) {
            results.push(url.resolve(page_url, src));
        }
    });

    // 7. and boom! there are our images
    console.log(results);
});
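Step 6 resolves relative image paths against the page URL. As an alternative sketch, recent Node versions expose a global URL class that can do the same resolution. The helper name resolveImageUrls and the sample paths below are made up for illustration; this is not part of PhearJS or scrape.js.

```javascript
// Hypothetical helper: turn a list of src attribute values into absolute URLs.
function resolveImageUrls(pageUrl, sources) {
    var resolved = [];

    sources.forEach(function (src) {
        if (!src) {
            return; // skip missing or empty src attributes
        }
        try {
            resolved.push(new URL(src, pageUrl).href);
        } catch (e) {
            // ignore values that cannot be parsed as a URL
        }
    });

    return resolved;
}

var imageUrls = resolveImageUrls('https://davidwalsh.name', [
    '/demo/logo.png',                    // root-relative path
    'https://cdn.example.com/photo.jpg', // already absolute
    undefined,                           // <img> without a src
    '//cdn.example.com/icon.gif'         // protocol-relative URL
]);

console.log(imageUrls);
```

Relative, root-relative and protocol-relative values all end up as absolute URLs, while entries without a usable src are dropped.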

Run it!

Running this script will give you a list of all the images on the page:

# run PhearJS
node phear.js

# in another shell run the script
node scrape.js
[ <url>, ..., <url> ]

Next

This is a very trivial example of scraping with PhearJS. It's up to you to apply it to different scenarios, such as crawling or automated batch scraping. I'd be interested to hear what you've used PhearJS for!
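One possible shape for batch scraping, as a sketch only: walk through a list of page URLs one at a time and collect the result for each. The names scrapeInBatch and fakeFetchPage are hypothetical. The fetcher is passed in as a function so the example runs without a PhearJS server; in real use it would wrap a request.get call to your PhearJS endpoint like the one in scrape.js. Processing pages sequentially is my assumption, chosen so that only one page is being rendered at any moment.

```javascript
// Hypothetical helper: fetch pages one after another and collect the results.
// fetchPage(pageUrl, callback) must call callback(error, html).
function scrapeInBatch(pageUrls, fetchPage, done) {
    var results = {};
    var index = 0;

    function next() {
        if (index >= pageUrls.length) {
            return done(results);
        }
        var pageUrl = pageUrls[index++];
        fetchPage(pageUrl, function (error, html) {
            // record failures per page instead of aborting the whole batch
            results[pageUrl] = error ? { error: error.message } : { html: html };
            next();
        });
    }

    next();
}

// Stand-in fetcher so the sketch runs without a PhearJS server.
function fakeFetchPage(pageUrl, callback) {
    if (pageUrl.indexOf('broken') !== -1) {
        return callback(new Error('could not render ' + pageUrl));
    }
    callback(null, '<html><!-- rendered ' + pageUrl + ' --></html>');
}

var batchResults;
scrapeInBatch(
    ['https://davidwalsh.name/', 'https://example.com/broken'],
    fakeFetchPage,
    function (results) {
        batchResults = results;
        console.log(results);
    }
);
```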

Conclusion

PhearJS is open-source software that allows you to run your own scraping or prerendering "microservice". It renders web pages and returns them as JSON over HTTP.

Here we focussed on how to set up PhearJS for a very simple scraping task. SEO is another important use case, for which the phearjs-express middleware might be relevant.

Tom Aizenberg

About Tom Aizenberg

Tom Aizenberg is a software developer from Amsterdam, Holland. He is the author of PhearJS. Besides programming he enjoys playing the bass, surfing and fishing.

Discussion

  1. Thank dog someone has finally addressed the issue of how to send my slick SPA to Altavista.

  2. A new way to scrape, woot! :D

  3. Wow. This is really great :D It saves time re-inventing the wheel!
