Run Your Own Scraping API with PhearJS


So-called 'client-side dynamic rendering' gives users cool experiences, but makes pages harder for machines to comprehend. If you want to do data mining, scrape websites, or send static versions of your slick single-page application to Altavista, you essentially need a browser in the loop. This is especially important given the number of sites that use React, Angular, jQuery, or some other fancy JavaScript framework.

PhearJS is open-source software that exposes the power of the PhantomJS headless browser through an HTTP API. You make HTTP requests to your PhearJS API to fetch a web page and get back nice JSON containing the rendered HTML and relevant metadata.

In this tutorial we'll check out how you can set this up for yourself.

Setting up

PhearJS runs on popular, recent Linux distros and Mac OS X. First we need some dependencies:

  • Memcached, do: brew install memcached. Replace brew with your OS's package manager, e.g. apt-get (a Debian/Ubuntu sketch follows this list).
  • NodeJS, you probably have it, but if not, get it.
  • PhantomJS 2+, installation for version 2+ currently differs quite a bit between OS's, so it's best to follow their installation instructions.
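
On Debian/Ubuntu, for example, the Memcached and NodeJS parts of the list above might look like the sketch below; package names vary per distro, and PhantomJS 2+ usually still needs a manual download.

# Debian/Ubuntu equivalents (package names may differ per distro/version)
sudo apt-get update
sudo apt-get install memcached nodejs npm

# PhantomJS 2+ usually ships as a prebuilt binary;
# follow the official installation instructions for your platform.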

Woo! Dependencies down, now get PhearJS:

git clone https://github.com/Tomtomgo/phearjs.git
cd phearjs
npm install

Boom, that's it! You can verify that PhearJS is running properly by starting it; you should see some info in the terminal:

node phear.js

If you open your browser and go to http://localhost:8100/status it should show you some stats on the server.
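
You can also check it from the command line; the exact fields in the status output depend on your PhearJS version.

# quick health check from the shell
curl http://localhost:8100/status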

Making requests

Okay, so by now we have PhearJS running. Rendering a web page is simple. I'll use cURL here, but you can also use your browser with a JSON viewer plugin:

# URL is URL-encoded, like you'd do with encodeURIComponent()
curl "http://localhost:8100/?fetch_url=https%3A%2F%2Fdavidwalsh.name%2F"

In about five seconds you will see a JSON response with the rendered HTML and metadata, such as request headers. Try it again and you will get it in an instant.

But wait, why does it take five seconds the first time? Well, those five seconds are a deliberate delay. It gives PhearJS some time to fetch AJAX requests and render the page. Subsequent requests are served from cache, and hence quick.

Now if you are on a slow connection or know that you will be scraping heavy pages, you can increase this delay:

curl "http://localhost:8100/" \
      "?fetch_url=https%3A%2F%2Fdavidwalsh.name%2F" \
      "&parse_delay=10000" \ # milliseconds
      "&force=true" # force a cache refresh

This is the simplest usage of PhearJS. There are many more configuration and run-time options, which are documented on GitHub.
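
For reference, the response body is plain JSON. Its exact fields depend on your PhearJS version, so treat this abridged sketch as illustrative; the one field the scraping example below relies on is content, which holds the rendered HTML.

{
  "success": true,
  "input_url": "https://davidwalsh.name/",
  "request_headers": { "...": "..." },
  "content": "<!DOCTYPE html><html>...the fully rendered HTML...</html>"
}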

Scraping

Let's look at a common use case for PhearJS: scraping. Say we want to get images from a blog page that are not visible without JavaScript enabled, e.g. https://davidwalsh.name/.

Dependencies

We will use Cheerio and Request for parsing and making requests:

npm install cheerio request

Writing scrape.js

Once that's done we can go through some simple steps to retrieve all images on this page:

// 1. load dependencies
var cheerio = require('cheerio'),
    request = require('request'),
    url = require('url');

var page_url = 'https://davidwalsh.name';
var results = [];

// 2. encode the URL and add to PhearJS endpoint
var target = 'http://localhost:8100?fetch_url=' + encodeURIComponent(page_url);

// 3. use request to GET the page
request.get(target, function(error, response, body) {

    // 4. load the DOM from the response JSON
    var $ = cheerio.load(JSON.parse(body).content);

    // 5. use cheerio's jQuery-style selectors to get all images
    $("img").each(function(i, image) {

        // 6. resolve absolute URL and add to our results array
        results.push(url.resolve(page_url, $(image).attr('src')));
    });

    // 7. and boom! there's our images
    console.log(results);
});
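
The example above ignores the error argument for brevity. In a real scraper you would probably check for transport errors and a non-200 status before parsing; here's a minimal sketch using the same request callback:

// minimal error-handling sketch (not part of the original example)
request.get(target, function(error, response, body) {
    if (error || response.statusCode !== 200) {
        return console.error('PhearJS request failed:', error || response.statusCode);
    }

    var parsed = JSON.parse(body);
    // ...continue with cheerio.load(parsed.content) as above
});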

Run it!

Running this script will give you a list of all the images on the page:

# run PhearJS
node phear.js

# in another shell run the script
node scrape.js
[ <url>, ..., <url> ]

Next

This is a very trivial example of scraping with PhearJS. It's up to you to apply it to different scenarios, like crawling or automated batch scraping. I'd be interested to hear what you've used PhearJS for!

Conclusion

PhearJS is open-source software that allows you to run your own scraping or prerendering "microservice". It renders web pages and returns them as JSON over HTTP.

Here we focused on how to set up PhearJS for a very simple scraping task. SEO is another important use case, for which the phearjs-express middleware might be relevant.

About Tom Aizenberg

Tom Aizenberg is a software developer from Amsterdam, Holland. He is the author of PhearJS. Besides programming he enjoys playing the bass, surfing, and fishing.

Discussion

  1. Thank dog someone has finally addressed the issue of how to send my slick SPA to Altavista.

  2. A new way to scrape, woot! :D

  3. Wow. This is really great :D It saves time re-inventing the wheel!
