Scrape the Wayback machine with this little script

Posted by Dave Bryson on August 17, 2007

Here’s a little script I use to scrape archived pages from the Alexa Wayback Machine . Basically, it works like this:

  1. Query Alexa for an old URL you’re looking for and the Years you’re interested in
  2. Use Hpricot to look in the results for links to archived pages. The pattern is http://web.archive.org/web/200301../url. Where the number is the timestamp and the url on the end is the old page you’re looking for. Return and array of successful matches
  3. Loop over the results of above and download the pages locally using curl (you could also use wget)
  4. Save the pages with the name “archive_timestamp.html”

Here’s the code:

require 'hpricot'
require 'open-uri'

urls = %w[http://sample.com http://sample2.com ...]
years = %w[2002 2003 2004]

# Search Alexa for the following URLS and Years
# extract the relevent links from the search result pages
def extract_links_from_search(search_urls=[],years=[])
  results = []
  search_urls.each do |u|
    years.each do |y|
      search_alexa = "http://web.archive.org/web/#{y}*/#{u}"
      doc = Hpricot(open(search_alexa))
      (doc/:a).each do |link|
        ul = link.attributes['href']
        # Search result pages have the following url, followed
        # by the timestamp (20030313094512)
        # followed by the search url
        if ul =~ /http://web.archive.org/web/d+/http:/
          results << ul
        end
      end
    end
  end
  results
end

def download_and_store_pages(results=[])
  results.each do |url|
    #Create a file name based on the Timestamp
    fn =  "archive_#{$&}.html" if url =~ /d+/
    puts "Saving as: #{fn}"
    `curl #{url} -o #{fn}`
  end
end

outp = extract_links_from_search(urls,years)
puts "Getting the data"
download_and_store_pages(outp)

This is quick and dirty and took about 10 minutes to write. It could probably be simplified, but it does the job for me.