Scrape Images with wget
The desire to download every image or video on a page has been around since the beginning of the internet. Twenty years ago I would accomplish this task with a Python script I downloaded. I then moved on to browser extensions, and later to PhearJS, a Node.js utility for scraping images. All of these solutions are nice, but I wanted to know how I could accomplish the task from the command line.
To scrape images (or files with any specific extensions) from the command line, you can use wget:
wget -nd -H -p -A jpg,jpeg,png,gif -e robots=off http://boards.4chan.org/sp/
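A quick breakdown of what each flag does, per the wget man page:

# -nd            no directories: save every file into the current directory
# -H             span hosts, so assets served from a CDN or subdomain are fetched
# -p             download page requisites (the images and other media the page embeds)
# -A jpg,...     accept list: only keep files with these extensions
# -e robots=off  execute "robots = off" as a wgetrc command, ignoring robots.txt
wget -nd -H -p -A jpg,jpeg,png,gif -e robots=off http://boards.4chan.org/sp/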
This command downloads images across hosts (i.e. from a CDN or other subdomain) to the directory the command is run from. You'll see downloaded media as they come down:
Reusing existing connection to s.4cdn.org:80.
HTTP request sent, awaiting response... 200 OK
Length: 1505 (1.5K) [image/jpeg]
Saving to: '1490571194319s.jpg'

1490571194319s.jpg  100%[=====================>]   1.47K  --.-KB/s    in 0s

2017-03-26 18:33:26 (205 MB/s) - '1490571194319s.jpg' saved [1505/1505]

FINISHED --2017-03-26 18:33:26--
Total wall clock time: 2.7s
Downloaded: 66 files, 412K in 0.2s (2.10 MB/s)
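One caveat: -p only grabs media embedded in the page itself, which on an image board means thumbnails. If you also want the full-size images those thumbnails link to, a shallow recursive crawl is one way to get them. This is a sketch rather than a tested one-liner, and the --wait and --limit-rate values are arbitrary examples to keep the crawl polite:

# -r -l 1 follows links one level deep, so linked full-size images are fetched;
# intermediate HTML pages are parsed for links and then deleted, since they
# don't match the -A accept list
wget -nd -H -r -l 1 -A jpg,jpeg,png,gif -e robots=off --wait=1 --limit-rate=200k http://boards.4chan.org/sp/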
Everyone loves cURL, which is another awesome resource, but don't forget about wget, which is arguably easier to use!