Scrape the Wayback machine with this little script
Here’s a little script I use to scrape archived pages from the Alexa Wayback Machine . Basically, it works like this:
- Query Alexa for an old URL you’re looking for and the Years you’re interested in
- Use Hpricot to look in the results for links to archived pages. The pattern is http://web.archive.org/web/200301../url. Where the number is the timestamp and the url on the end is the old page you’re looking for. Return and array of successful matches
- Loop over the results of above and download the pages locally using curl (you could also use wget)
- Save the pages with the name “archive_timestamp.html”
Here’s the code:
require 'hpricot'
require 'open-uri'
urls = %w[http://sample.com http://sample2.com ...]
years = %w[2002 2003 2004]
# Search Alexa for the following URLS and Years
# extract the relevent links from the search result pages
def extract_links_from_search(search_urls=[],years=[])
results = []
search_urls.each do |u|
years.each do |y|
search_alexa = "http://web.archive.org/web/#{y}*/#{u}"
doc = Hpricot(open(search_alexa))
(doc/:a).each do |link|
ul = link.attributes['href']
# Search result pages have the following url, followed
# by the timestamp (20030313094512)
# followed by the search url
if ul =~ /http://web.archive.org/web/d+/http:/
results << ul
end
end
end
end
results
end
def download_and_store_pages(results=[])
results.each do |url|
#Create a file name based on the Timestamp
fn = "archive_#{$&}.html" if url =~ /d+/
puts "Saving as: #{fn}"
`curl #{url} -o #{fn}`
end
end
outp = extract_links_from_search(urls,years)
puts "Getting the data"
download_and_store_pages(outp)
This is quick and dirty and took about 10 minutes to write. It could probably be simplified, but it does the job for me.