Scraping iTunes Charts Using Scrapy Python
Hacking is more fun when you have some data to play with, but where do you get data when you are just hacking for fun? I use web scraping to make my hacks interesting and cool, and I have learned a lot in the process. In this post, I will show you how to get started with web scraping using Scrapy.
Everyone is talking about Apple this week, so let's begin by scraping iTunes. We will scrape the iTunes Charts and get the list of the top free apps, along with each app's category, iTunes link, and image URL.
What is Scrapy?
Scrapy is a high-level screen scraping and web crawling framework, written in pure Python. It is used for data mining and web crawling.
So, let's start by setting up Scrapy on your machine. I'm assuming that you have Python installed (2.7 or later is required; as of now, Scrapy is not compatible with Python 3). If you do not have Python installed, you can download it here. Then set up `pip` for installing Scrapy:
Scrapy can be installed with:
$ pip install Scrapy
Or you can use easy_install:
$ easy_install Scrapy
Creating a project:
You can create a scrapy project using:
$ scrapy startproject apple
Since I'm writing a scraper for Apple iTunes, I created a project named `apple`. This will create an `apple` directory with the following contents:
```
apple/
    scrapy.cfg        # the project configuration file
    apple/            # project module
        __init__.py
        items.py      # items file
        pipelines.py  # pipelines file
        settings.py   # settings file
        spiders/      # all your spiders go in this directory
            __init__.py
```
Well, Scrapy did create a lot of files for us there, but you don't have to be worried looking at them. The only files we are concerned with are `items.py` and the spiders. The `spiders` directory will store all the spiders for our project. In this case we will create `apple_spider.py`, which will contain the logic for extracting items from the iTunes pages.
Define Items to be stored:
Items act as storage for the scraped data. They behave like simple Python dicts, but they refuse to populate undeclared fields, which protects against typos. You define the attributes of an Item by extending `scrapy.item.Item`.
Here is how `items.py` will look for our project. For each app we will store the `app_name`, the `category` of the app, the `appstore_link`, and the `img_src` for the icon of the app:
```python
# items.py
from scrapy.item import Item, Field

class AppItem(Item):
    # define the fields for your item here:
    app_name = Field()
    category = Field()
    appstore_link = Field()
    img_src = Field()
```
Writing the Spider:
Now we will add the logic to extract data from the webpages in our spider. Spiders are the classes used to extract data from webpages; a spider is created by extending `scrapy.spider.BaseSpider`. It is where you provide the initial list of URLs to start scraping from, and the logic for extracting data (Items) from the webpages.
While creating a spider you need to define three required attributes:

- `name`: a unique string that identifies the spider
- `start_urls`: the list of URLs where the crawling starts
- `parse()`: the method that receives the `Response` object for each downloaded URL
The `parse()` method is where we add the logic to extract the data (Items) from the webpages and follow more URLs if needed. Here we've used XPath to select the elements.
Scrapy provides an awesome command line tool, the Scrapy shell (`scrapy shell <url>`), that you can use to play with the `Response` object using XPath selectors. So you don't have to run a full spider just to test your XPath expressions.
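If you want to experiment with the XPath expressions themselves outside of Scrapy, you can try them against a static snippet with `lxml`, the library Scrapy's selectors are built on. The markup below is a hypothetical, simplified version of a single chart entry, not the real iTunes page:

```python
from lxml import html

# Hypothetical, simplified markup modeled on one chart entry.
snippet = """
<ul>
  <li>
    <a href="#"><img src="http://example.com/icon.png"/></a>
    <h3><a href="http://example.com/app">Some App</a></h3>
    <h4><a href="#">Games</a></h4>
  </li>
</ul>
"""

tree = html.fromstring(snippet)
for li in tree.xpath('//li'):
    # Relative XPath (note the leading dot) is scoped to this <li>.
    name = li.xpath('.//h3/a/text()')[0]
    category = li.xpath('.//h4/a/text()')[0]
    link = li.xpath('.//h3/a/@href')[0]
    img = li.xpath('.//a/img/@src')[0]
    print(name, category, link, img)
```

Scoping each expression to the current `<li>` with a relative path is the same pattern the spider below uses for each app in the chart.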
This is the spider for extracting the list of the apps from the iTunes charts:
```python
# apple_spider.py (in the spiders directory)
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from apple.items import AppItem

class AppleSpider(BaseSpider):
    name = "apple"
    allowed_domains = ["apple.com"]
    start_urls = ["http://www.apple.com/itunes/charts/free-apps/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        apps = hxs.select('//*[@id="content"]/section/ul/li')
        items = []
        for app in apps:
            item = AppItem()
            # Relative XPath (leading dot) keeps each select scoped to
            # the current <li>, so no manual index bookkeeping is needed.
            item['app_name'] = app.select('.//h3/a/text()').extract()[0]
            item['appstore_link'] = app.select('.//h3/a/@href').extract()[0]
            item['category'] = app.select('.//h4/a/text()').extract()[0]
            item['img_src'] = app.select('.//a/img/@src').extract()[0]
            items.append(item)
        return items
```
The command to start crawling:
$ scrapy crawl apple -o apps.json -t json
This will start the crawling, and the extracted items will be stored in `apps.json` using the JSON feed exporter.
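Each record in `apps.json` has one key per field declared in `AppItem`. The values below are hypothetical placeholders to show the shape of the output, not real chart data:

```json
[
  {
    "app_name": "Some App",
    "category": "Games",
    "appstore_link": "http://itunes.apple.com/...",
    "img_src": "http://example.com/icon.png"
  }
]
```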
The Scrapy docs are available here.
The code for this post is available on GitHub.