
Scraping iTunes Charts Using Scrapy Python

By Virendra Rajput

Hacking is more fun when you have some data to play with, but where do you get data when you are just hacking for fun? I use web scraping to make my hacks interesting and cool, and I have learned a lot in the process. In this post, I will show you how to get started with web scraping using Scrapy.

Everyone is talking about Apple this week, so let’s begin by scraping iTunes. We will scrape the iTunes Charts and get the list of the top free apps (along with their category, iTunes link, and image URL).

I am assuming that you are familiar with the basics of Python, as that is required to get the most out of Scrapy. If not, I recommend you take a look at this list of Python learning resources.

What is Scrapy?

Scrapy is a high-level screen scraping and web crawling framework, used for data mining and automated data extraction. It is written in pure Python.

So, let’s start by setting up Scrapy on your machine. I’m assuming you have Python installed (2.7 is required; as of now, Scrapy is not compatible with Python 3). If you do not have Python installed, you can download it here, then set up `pip` for installing Scrapy.

Scrapy can be installed with:

$ pip install Scrapy

Or you can use easy_install:

$ easy_install Scrapy

Creating a project:

You can create a Scrapy project using:

$ scrapy startproject apple

Since I’m writing a scraper for Apple iTunes, I created a project called `apple`. This will create an `apple` directory with the following contents:

apple/

	scrapy.cfg # the project configuration file

	apple/ # project module

		__init__.py

		items.py # items file

		pipelines.py # pipelines file

		settings.py # settings file

		spiders/ # all your spiders will be stored in this directory

			__init__.py

Well, Scrapy did create a lot of files for us, but you don’t have to worry about most of them.

The only files we are concerned with are `items.py` and the spiders. The `spiders` directory will store all the spiders for our project. In this case we will create `apple_spider.py`, which will hold the logic for extracting items from iTunes pages.

Define Items to be stored:

Items act as containers for the scraped data. They behave like simple Python dicts, but they raise an error when you populate an undeclared field, which protects you against typos.

You define the attributes of an Item by extending `scrapy.item.Item`.

Here is how `items.py` will look for our project. For each app we will store the `app_name`, the `category` of the app, the `appstore_link`, and the `img_src` for the icon of the app:

(items.py)

from scrapy.item import Item, Field

class AppItem(Item):

	# define the fields for your item here like:
	app_name = Field()
	category = Field()
	appstore_link = Field()
	img_src = Field()
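To make the “catches typos” point concrete, here is the guarantee Scrapy’s Item gives you, sketched with a minimal stand-in class (this is an illustration of the behavior, not Scrapy’s actual implementation):

```python
# Sketch of what Scrapy's Item enforces: dict-like access, but
# assignment to an undeclared field fails instead of silently
# creating a new key (which is how typos slip into scraped data).
class StrictItem(dict):
    fields = ('app_name', 'category', 'appstore_link', 'img_src')

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError('%r is not a declared field' % key)
        dict.__setitem__(self, key, value)

item = StrictItem()
item['app_name'] = 'Vine'        # fine: declared in fields

try:
    item['app_nmae'] = 'Vine'    # typo: raises KeyError
except KeyError:
    print('typo caught')
```

With a plain dict, the misspelled key would have been stored silently and the real field left empty.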

Writing the Spider:

Now we will add the logic to extract data from webpages in our spider. Spiders are the classes used to extract data from webpages, and you create one by extending `scrapy.spider.BaseSpider`.

The spider is where you provide the initial list of URLs to start scraping from, and define how to extract data (Items) from the downloaded pages.

While creating a spider, you need to define three required attributes:

name: the unique identifier of the spider (string)
start_urls: the list of URLs where the crawling starts (list)
parse(): the method that receives the response object once each URL is downloaded

The parse() method is where we add the logic to extract the data (items) from webpages and follow more URLs if specified. Here we’ve used XPath to select the elements.

Scrapy also provides an awesome command-line tool, the interactive shell, that you can use to play with the `Response` object using XPath selectors. So you don’t have to run a spider just to test your XPath expressions.
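If you want to dry-run an XPath expression without the Scrapy shell (or without a network connection), you can test the same idea against a saved snippet of markup with the standard library. Note that the markup below is a hand-written stand-in for one chart entry, not the real iTunes page source, and `xml.etree.ElementTree` only supports a subset of XPath:

```python
# Dry-run XPath-style selection against a small, well-formed snippet
# that mimics the structure of one <li> on the iTunes chart page.
import xml.etree.ElementTree as ET

snippet = """
<section>
  <ul>
    <li>
      <a href="https://itunes.apple.com/app/id1"><img src="icon1.png"/></a>
      <h3><a href="https://itunes.apple.com/app/id1">Vine</a></h3>
      <h4><a href="https://itunes.apple.com/genre/photo">Photo &amp; Video</a></h4>
    </li>
  </ul>
</section>
"""

root = ET.fromstring(snippet)
for app in root.findall('.//ul/li'):
    name = app.find('.//h3/a').text        # app title
    link = app.find('.//h3/a').get('href')  # App Store URL
    print(name + ' -> ' + link)
```

The real Scrapy shell works the same way conceptually: fetch a page with `scrapy shell <url>`, then try selectors against the response until they return what you expect.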

This is the spider for extracting the list of apps from the iTunes charts.

(apple_spider.py in the spiders directory)

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from apple.items import AppItem

class AppleSpider(BaseSpider):
    name = "apple"
    allowed_domains = ["apple.com"]
    start_urls = ["http://www.apple.com/itunes/charts/free-apps/"]


    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        apps = hxs.select('//*[@id="content"]/section/ul/li')
        count = 0
        items = []

        for app in apps:
            item = AppItem()
            # these XPath expressions are absolute, so they match every
            # app on the page; index with `count` to pick out the values
            # belonging to the current app
            item['app_name'] = app.select('//h3/a/text()')[count].extract()
            item['appstore_link'] = app.select('//h3/a/@href')[count].extract()
            item['category'] = app.select('//h4/a/text()')[count].extract()
            item['img_src'] = app.select('//a/img/@src')[count].extract()

            items.append(item)
            count += 1

        return items

The command to start crawling:

$ scrapy crawl apple -o apps.json -t json

This starts the crawl, and the extracted items are stored in apps.json using the JSON feed exporter.
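Once the crawl finishes, apps.json is ordinary JSON: a list of objects, one per scraped item, so you can load it with the standard `json` module. The records below are made-up placeholders just to show the shape of the file:

```python
# Read the exported feed back in. `sample` stands in for the contents
# of apps.json; the values are illustrative, not real chart data.
import json

sample = '''[
  {"app_name": "Vine",
   "category": "Photo & Video",
   "appstore_link": "https://itunes.apple.com/app/id1",
   "img_src": "icon1.png"}
]'''

apps = json.loads(sample)
for app in apps:
    print(app['app_name'] + ' (' + app['category'] + ')')
```

In a real script you would replace `sample` with `open('apps.json').read()`.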

The Scrapy docs are available here.

I hope you find this useful and do some cool things with Scrapy. Be careful about how you use the data you scrape from other websites, though; you might be violating their terms of service.

The code for this post is available on GitHub.

Virendra Rajput

About Virendra Rajput

Virendra Rajput is a self-taught Python hacker. He is a Co-Founder of Markitty, a marketing recommendations and reminders tool for small businesses. He is obsessed with web technologies and the world of startups! His current love is Django.


Discussion

  1. Chetan Dhembre

    very good post !! keep it up !!

  2. Hey Thanks Chetan! Glad you liked it :)

  3. Why in the world would you scrape for this data when iTunes provides an API for all the data you are looking for and more?

    Say you have a link to an app already:
    https://itunes.apple.com/us/app/vine/id592447445?mt=8

    Take the id from the URL and append it to the lookup API:
    http://itunes.apple.com/lookup?id=592447445

    If you want the App Store Charts – try using the RSS feeds that iTunes provides:
    https://rss.itunes.apple.com

    Top Paid Applications in the US (300 limit) in both XML or JSON
    https://itunes.apple.com/us/rss/toppaidapplications/limit=300/xml
    https://itunes.apple.com/us/rss/toppaidapplications/limit=300/json

    • The post is more about scraping in general — iTunes is just the example.

    • Tia

      Is the lookup API considered scraping, in any form? Can this be blocked or barred? Thanks!

  4. shaun

    Hey awesome tutorial mate!

    gonna play with it for a while

Hi Virendra. Great post! After doing all this, the data will be stored in the JSON file. Now what I would like to do is display the contents of the JSON file on a web page, like the logo and its download link. How can I do that? So what should happen is: when a user clicks on the link, he will be redirected to the iTunes page for that app to download it.

  6. oluwaseun

The apps variable in your code is in a local context, hence the way you wrote your code would raise an error. I tried to clean the code up.

    Here is my revised snippet below.

    class AppleSpider(BaseSpider):
        name = "apple"
        allowed_domains = ["apple.com"]
        start_urls = ["http://www.apple.com/itunes/charts/free-apps/"]
    
    
        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            apps = hxs.select('//*[@id="content"]/section/ul/li')
            return extractData(apps)
    
        
    def extractData(apps):
        items = []
        count = 0
        for app in apps:
            item = AppItem()
            item['app_name'] = app.select('//h3/a/text()')[count].extract()
            item['appstore_link'] = app.select('//h3/a/@href')[count].extract()
            item['category'] = app.select('//h4/a/text()')[count].extract()
            item['img_src'] = app.select('//a/img/@src')[count].extract()
    
            items.append(item)
            count += 1
        return items
    
  7. Pete

    For anyone coming across this now simply change “content” in

    apps = hxs.select('//*[@id="content"]/section/ul/li')
    

    to “main” and it works.

    Thanks for the tutorial!

  8. will

    Hi, How I can use cookies in scrapy?

  9. GeneRickyShaw

    Doesn’t work for me; my json file only has “[” in it; if I choose to make a .csv file, it’s completely blank.

  10. GeneRickyShaw

    My json file only contains one character and if I use csv files they’re totally blank – what am I doing wrong?

    I also got a lot of deprecated warnings.

  11. Hi Virendra. I added it to my website’s list of great Python- and Scrapy-based website crawler tutorials. Thank you for the wonderful resource!

