Scrape the Wayback machine with this little script
Here’s a little script I use to scrape archived pages from the Alexa Wayback Machine . Basically, it works like this:
- Query Alexa for an old URL you’re looking for and the Years you’re interested in
- Use Hpricot to look in the results for links to archived pages. The pattern is http://web.archive.org/web/200301../url. Where the number is the timestamp and the url on the end is the old page you’re looking for. Return and array of successful matches
- Loop over the results of above and download the pages locally using curl (you could also use wget)
- Save the pages with the name “archive_timestamp.html”
Here’s the code:
require 'hpricot'
require 'open-uri'
urls = %w[http://sample.com http://sample2.com ...]
years = %w[2002 2003 2004]
# Search Alexa for the following URLS and Years
# extract the relevent links from the search result pages
def extract_links_from_search(search_urls=[],years=[])
results = []
search_urls.each do |u|
years.each do |y|
search_alexa = "http://web.archive.org/web/#{y}*/#{u}"
doc = Hpricot(open(search_alexa))
(doc/:a).each do |link|
ul = link.attributes['href']
# Search result pages have the following url, followed
# by the timestamp (20030313094512)
# followed by the search url
if ul =~ /http://web.archive.org/web/d+/http:/
results << ul
end
end
end
end
results
end
def download_and_store_pages(results=[])
results.each do |url|
#Create a file name based on the Timestamp
fn = "archive_#{$&}.html" if url =~ /d+/
puts "Saving as: #{fn}"
`curl #{url} -o #{fn}`
end
end
outp = extract_links_from_search(urls,years)
puts "Getting the data"
download_and_store_pages(outp)
This is quick and dirty and took about 10 minutes to write. It could probably be simplified, but it does the job for me.
Trackbacks
Unfortunately, due to spammers I've had to close both trackbacks and comments.