Run Your Own Scraping API with PhearJS

By Tom Aizenberg

So-called 'client-side dynamic rendering' gives users cool experiences, but makes pages harder for machines to comprehend. If you want to do data mining, scrape websites or send static versions of your slick single-page application to Altavista, you essentially need a browser in the loop. This is especially important given the number of sites that use React, Angular, jQuery or some other fancy JavaScript framework or library.

PhearJS is open-source software that exposes the power of the PhantomJS headless browser through an HTTP API. You make HTTP requests to your PhearJS API to fetch a web page and get back JSON containing the rendered HTML and relevant metadata.

In this tutorial we'll look at how to set up PhearJS and use it for a simple scraping task.

Setting up

PhearJS runs on at least the popular, recent Linux distros and Mac OS X. First we need some dependencies:

  • Memcached: run brew install memcached. Replace brew with something like apt-get depending on your OS.
  • NodeJS: you probably have it, but if not, get it.
  • PhantomJS 2+: installation for version 2+ currently differs quite a bit between OSes, so it's best to follow their installation instructions.

Woo! Dependencies down, now get PhearJS:

git clone https://github.com/Tomtomgo/phearjs.git
cd phearjs
npm install

Boom, that's it! You can verify that PhearJS works by running it. You should see some info on the terminal:

node phear.js

If you open your browser and go to http://localhost:8100/status it should show you some stats on the server.

Making requests

Okay, so by now we have PhearJS running. Rendering a web page is simple. I'll use cURL here, but you can also use your browser with a JSON viewer plugin:

# fetch_url is URL-encoded, like you'd do with encodeURIComponent()
curl "http://localhost:8100/?fetch_url=https%3A%2F%2Fdavidwalsh.name%2F"

In about five seconds you will see a JSON response with the rendered HTML and metadata, such as request headers. Try it again and you will get it in an instant.

But wait, why does it take five seconds the first time? Well, these five seconds are a deliberate delay. It gives PhearJS time to let AJAX requests complete and to render the page. Subsequent requests are served from cache and are therefore quick.
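Whether the response is freshly rendered or served from cache, the body is a JSON document. The following is a minimal sketch of pulling the rendered HTML out of it. It assumes the HTML lives in the `content` field, which is the field scrape.js reads later in this tutorial. The helper name `extractRenderedHtml` is hypothetical and the sample body is made up; real responses also carry metadata.

```javascript
// Hypothetical helper: extract the rendered HTML from a PhearJS response body.
// Assumes the HTML is stored in the `content` field, as read in scrape.js.
function extractRenderedHtml(body) {
    var parsed;
    try {
        parsed = JSON.parse(body);
    } catch (e) {
        throw new Error('Response is not valid JSON: ' + e.message);
    }
    if (!parsed || typeof parsed.content !== 'string') {
        throw new Error('Response has no "content" field');
    }
    return parsed.content;
}

// Made-up, trimmed-down response body for illustration only.
var sampleBody = JSON.stringify({
    content: '<html><body><h1>Hello</h1></body></html>'
});

console.log(extractRenderedHtml(sampleBody));
```

Failing loudly on a malformed body makes it easier to tell a rendering problem apart from a parsing problem further down the line.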

Now if you are on a slow connection or know that you will be scraping heavy pages, you could increase this delay:

# parse_delay is in milliseconds; force=true forces a cache refresh
curl "http://localhost:8100/?fetch_url=https%3A%2F%2Fdavidwalsh.name%2F&parse_delay=10000&force=true"

This is the simplest usage of PhearJS. There are many more configuration and run-time options, which are documented on GitHub.
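The query strings in the cURL examples can also be assembled in Node instead of by hand. This is a minimal sketch, not part of PhearJS itself: the helper name `buildPhearUrl` and its `options` argument are hypothetical, while `fetch_url`, `parse_delay` and `force` are the parameters from the cURL examples. The default endpoint assumes PhearJS is listening on localhost:8100, as in this tutorial.

```javascript
// Hypothetical helper (not part of PhearJS): builds a request URL for a
// PhearJS instance, assumed to listen on localhost:8100 as in this tutorial.
function buildPhearUrl(pageUrl, options) {
    options = options || {};
    var endpoint = options.endpoint || 'http://localhost:8100/';

    // fetch_url must be URL-encoded, exactly like in the cURL examples
    var query = ['fetch_url=' + encodeURIComponent(pageUrl)];

    // optional: rendering delay in milliseconds
    if (options.parseDelay !== undefined) {
        query.push('parse_delay=' + encodeURIComponent(options.parseDelay));
    }

    // optional: skip the cache and render the page again
    if (options.force) {
        query.push('force=true');
    }

    return endpoint + '?' + query.join('&');
}

console.log(buildPhearUrl('https://davidwalsh.name/'));
console.log(buildPhearUrl('https://davidwalsh.name/', { parseDelay: 10000, force: true }));
```

Keeping the encoding in one place avoids the most common mistake with this kind of API, which is passing the target URL unencoded so that its own query string gets mixed up with the PhearJS parameters.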

Scraping

Let's look at a common use case for PhearJS: scraping. Say we want to get images from a blog page that are not visible without JavaScript enabled, e.g. https://davidwalsh.name/.

Dependencies

We will use Cheerio for parsing the HTML and Request for making HTTP requests:

npm install cheerio request

Writing scrape.js

Once that's done we can go through some simple steps to retrieve all images on this page:

// 1. load dependencies
var cheerio = require('cheerio'),
    request = require('request'),
    url = require('url');

var page_url = 'https://davidwalsh.name';
var results = [];

// 2. encode the URL and add to PhearJS endpoint
var target = 'http://localhost:8100/?fetch_url=' + encodeURIComponent(page_url);

// 3. use request to GET the page
request.get(target, function(error, response, body) {

    // stop early if PhearJS could not be reached
    if (error) {
        console.error('Request to PhearJS failed:', error.message);
        return;
    }

    // 4. load the DOM from the response JSON
    var $ = cheerio.load(JSON.parse(body).content);

    // 5. use cheerio's jQuery-style selectors to get all images
    $("img").each(function(i, image) {
        var src = $(image).attr('src');

        // 6. resolve absolute URL and add to our results array
        //    (images without a src attribute are skipped)
        if (src) {
            results.push(url.resolve(page_url, src));
        }
    });

    // 7. and boom! there are our images
    console.log(results);
});

Run it!

Running this script will give you a list of all the images on the page:

# run PhearJS
node phear.js

# in another shell run the script
node scrape.js
[ <url>, ..., <url> ]
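Step 6 of scrape.js is where relative `src` values become absolute URLs. Here is that step in isolation, as a minimal sketch that uses Node's built-in WHATWG `URL` class rather than `url.resolve()` and also drops duplicates. The function name `resolveImageUrls` is hypothetical and the `src` values are made up for illustration.

```javascript
// Sketch of step 6 in isolation: turn raw src attributes into absolute URLs.
// Uses Node's built-in WHATWG URL class instead of url.resolve().
function resolveImageUrls(pageUrl, sources) {
    var seen = new Set();
    var resolved = [];

    sources.forEach(function (src) {
        // images without a usable src attribute are skipped
        if (typeof src !== 'string' || src.trim() === '') { return; }

        var absolute;
        try {
            absolute = new URL(src.trim(), pageUrl).href;
        } catch (e) {
            return; // not resolvable against the page URL
        }

        // keep each image only once, in order of first appearance
        if (!seen.has(absolute)) {
            seen.add(absolute);
            resolved.push(absolute);
        }
    });

    return resolved;
}

// made-up src values for illustration
console.log(resolveImageUrls('https://davidwalsh.name/', [
    '/demo/logo.png',
    '//cdn.example.com/photo.jpg',
    'https://example.com/banner.gif',
    undefined,
    '/demo/logo.png'
]));
```

Root-relative, protocol-relative and already absolute values all come out as full URLs, which is what you want before handing the list to a downloader.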

Next

This is a very trivial example of scraping with PhearJS. It's up to you to apply it to different scenarios, like crawling or automated batch scraping. I'd be interested to hear what you've used PhearJS for!
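As one possible starting point for batch scraping, the sketch below walks through a list of pages one at a time, so a single PhearJS instance is not asked to render many pages at once. Everything here is hypothetical: `scrapeSequentially` is not part of PhearJS, and `fetchPage` is an injected function that, in practice, would wrap the `request.get()` call from scrape.js. A stand-in fetcher with made-up data keeps the example runnable without a server.

```javascript
// Hypothetical sketch: scrape a list of pages one after another.
// fetchPage(pageUrl, callback) is injected; in practice it would wrap the
// request.get() call from scrape.js and call back with the image list.
function scrapeSequentially(pageUrls, fetchPage, done) {
    var results = {};
    var index = 0;

    function next() {
        if (index >= pageUrls.length) { return done(results); }

        var pageUrl = pageUrls[index++];
        fetchPage(pageUrl, function (error, images) {
            // record failures instead of aborting the whole batch
            results[pageUrl] = error ? { error: error.message } : { images: images };
            next();
        });
    }

    next();
}

// Stand-in fetcher with made-up data so the sketch runs without PhearJS.
function fakeFetchPage(pageUrl, callback) {
    if (pageUrl.indexOf('broken') !== -1) {
        return callback(new Error('could not render page'));
    }
    callback(null, [pageUrl + 'logo.png']);
}

var batchResults;
scrapeSequentially(
    ['https://example.com/', 'https://example.com/broken/'],
    fakeFetchPage,
    function (results) {
        batchResults = results;
        console.log(results);
    }
);
```

Processing pages sequentially is the conservative choice. If your machine can handle it, the same structure can be extended to run a small, fixed number of requests in parallel.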

Conclusion

PhearJS is open-source software that allows you to run your own scraping or prerendering "microservice". It renders web pages and returns them as JSON over HTTP.

Here we focused on how to set up PhearJS for a very simple scraping task. SEO is another important use case, for which the phearjs-express middleware might be relevant.

Tom Aizenberg

About Tom Aizenberg

Tom Aizenberg is a software developer from Amsterdam, Holland. He is the author of PhearJS. Besides programming, he enjoys playing the bass, surfing and fishing.
