Scrape the Wayback Machine with this little script

August 17, 2007

Here’s a little script I use to scrape archived pages from the Alexa Wayback Machine. Basically, it works like this:

  1. Query Alexa for an old URL you’re looking for and the years you’re interested in
  2. Use Hpricot to look in the results for links to archived pages. The pattern is http://web.archive.org/web/200301../url, where the number is the timestamp and the url on the end is the old page you’re looking for. Return an array of successful matches
  3. Loop over the results of above and download the pages locally using curl (you could also use wget)
  4. Save the pages with the name “archive_timestamp.html”

Here’s the code:

require 'hpricot'
require 'open-uri'

urls = %w[http://sample.com http://sample2.com ...]
years = %w[2002 2003 2004]

# Search Alexa for the following URLs and years and
# extract the relevant links from the search result pages
def extract_links_from_search(search_urls=[],years=[])
  results = []
  search_urls.each do |u|
    years.each do |y|
      search_alexa = "http://web.archive.org/web/#{y}*/#{u}"
      doc = Hpricot(open(search_alexa))
      (doc/:a).each do |link|
        ul = link.attributes['href']
        # Search result pages have the following url, followed
        # by the timestamp (20030313094512)
        # followed by the search url
        if ul =~ %r{http://web\.archive\.org/web/\d+/http:}
          results << ul
        end
      end
    end
  end
  results
end

def download_and_store_pages(results=[])
  results.each do |url|
    #Create a file name based on the Timestamp
    fn = "archive_#{$&}.html" if url =~ /\d+/
    puts "Saving as: #{fn}"
    `curl #{url} -o #{fn}`
  end
end

outp = extract_links_from_search(urls,years)
puts "Getting the data"
download_and_store_pages(outp)

This is quick and dirty and took about 10 minutes to write. It could probably be simplified, but it does the job for me.
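
If you want to check the link pattern from step 2 without hitting the network, here's a rough sketch. The helper names parse_archive_link and archive_filename are made up for illustration and are not part of the script above:

```ruby
# Archived-page links look like:
#   http://web.archive.org/web/<timestamp>/<original url>
ARCHIVE_LINK = %r{\Ahttp://web\.archive\.org/web/(\d+)/(http:.+)\z}

# Hypothetical helper: returns [timestamp, original_url], or nil when the
# link is not an archived-page link.
def parse_archive_link(link)
  match = ARCHIVE_LINK.match(link)
  match && [match[1], match[2]]
end

# Hypothetical helper: builds the "archive_timestamp.html" name from step 4.
def archive_filename(link)
  timestamp, _original = parse_archive_link(link)
  timestamp && "archive_#{timestamp}.html"
end

puts archive_filename("http://web.archive.org/web/20030313094512/http://sample.com/")
# prints archive_20030313094512.html
puts archive_filename("http://web.archive.org/about/").inspect
# prints nil
```

Anchoring the pattern with \A and \z means stray navigation links on the results page get rejected instead of being downloaded.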

Modify the XML output from your Model

August 16, 2007

So your app needs to generate XML. No problem, ActiveRecord gives it to you for free. Simply call mymodel.to_xml and you’re done. But what happens if you need to generate more complicated, specialized XML? There are a few options:

  1. Don’t call to_xml, generate the XML using a template (.rxml)
  2. Override the to_xml method, as mentioned in the docs
  3. Create a separate method for generating the XML

To keep things simple, and for reasons we’ll see later, let’s use 3.

The example

Ok. I have a model, Car, with 3 attributes (year,make,model). Here’s what the default XML looks like:

car = Car.find(:first)
car.to_xml
=>

<car>
 <make>Nissan</make>
 <model>Pickup</model>
 <year>1995</year>
</car>

So let’s customize the XML to add a namespace for the elements and change the tag type. In the Car model we’ll create a new method called my_xml instead of overriding to_xml:

def my_xml(options={})
  options[:indent] ||= 2
  xml = options[:builder] ||= Builder::XmlMarkup.new(:indent => options[:indent])
  xml.instruct! unless options[:skip_instruct]
  xml.mycar(:Vehicle, "xmlns:mycar" => "http://crazystuff.org/car/ns") do
    xml.mycar(:make, self.make)
    xml.mycar(:model, self.model)
    xml.mycar(:year, self.year)
  end
end

The Model uses the Builder library for creating the XML. That was easy. Now when I call car.my_xml I get this:

<?xml version="1.0" encoding="UTF-8"?>
<mycar:Vehicle xmlns:mycar="http://crazystuff.org/car/ns">
  <mycar:make>Nissan</mycar:make>
  <mycar:model>Pickup</mycar:model>
  <mycar:year>1995</mycar:year>
</mycar:Vehicle>

Perfect! Now let’s try and query all Cars and see what we get:

all_cars = Car.find(:all)
all_cars.my_xml
=>
NoMethodError: undefined method 'my_xml' for #<Array:0x1379810>

What the *$%@! That’s not right. Calling Car.find(:all) returns an Array. Array doesn’t have a method my_xml.

But how does Rails do it? If “all_cars” is an Array, then Array within Rails must support the to_xml method. As it turns out Rails adds some tricks to some of the core pieces of the Ruby language. Of interest to us right now is the module ActiveSupport::CoreExtensions::Array::Conversions. It defines a to_xml method that is a mixin for the Array class.

We could open up the Module and change it. Or we could just create our own method and include it into Array. Let’s do something like that:

module MyConversion
  def my_xml(options = {})
    options[:indent]   ||= 2
    options[:builder]  ||= Builder::XmlMarkup.new(:indent => options[:indent])
    # TODO: Move the xmlns from Vehicle to here...
    options[:builder].tag!("mycar:AllVehicles") do
      # Here we loop over each entry (model) and call its my_xml
      each { |e| e.my_xml(options.merge!({ :skip_instruct => true })) }
    end
  end
end

# Don't forget to do this!
class Array
  include MyConversion
end

Ok. That’s it. Now when we call my_xml, regardless of whether it’s an Array or a single object, it works as expected.
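
If you want to try the mixin idea without Rails or Builder installed, here's a stripped-down sketch. Vehicle and MyCollectionConversion are hypothetical stand-ins for the Car model and the module above, and plain strings take the place of the Builder output:

```ruby
# Hypothetical stand-in for the Car model; no Rails or Builder required.
class Vehicle
  def initialize(make)
    @make = make
  end

  # Renders a single element, the way Car#my_xml renders one record.
  def my_xml
    "<mycar:make>#{@make}</mycar:make>"
  end
end

# Collection-level conversion that simply delegates to each element.
module MyCollectionConversion
  def my_xml
    inner = map { |e| "  #{e.my_xml}" }.join("\n")
    "<mycar:AllVehicles>\n#{inner}\n</mycar:AllVehicles>"
  end
end

# Mixing the module into Array makes my_xml available on every Array.
class Array
  include MyCollectionConversion
end

puts [Vehicle.new("Nissan"), Vehicle.new("Yugo")].my_xml
```

Keep in mind this touches every Array in the process, so an Array holding objects without my_xml will raise NoMethodError when you call it.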

Have a look around in the ActiveSupport Core Ext. There’s a lot to learn there.

Clean all .svn or CVS directories from your source

August 16, 2007

You need or want to import your code into a new svn or cvs repository, but the source tree is filled with CVS or .svn folders from an old repository. Here’s a quick way to remove them on Unix (replace CVS with .svn for Subversion):

find . -type d -name CVS | xargs rm -rf
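
If you'd rather do the cleanup from Ruby (say, on a box without find and xargs), here's a rough equivalent using only the standard library. The method name remove_vcs_dirs is made up for this sketch:

```ruby
require 'find'
require 'fileutils'

# Hypothetical helper: deletes every directory under root whose name is
# listed in names, and returns the paths it removed.
def remove_vcs_dirs(root, names = %w[CVS .svn])
  removed = []
  Find.find(root) do |path|
    next unless File.directory?(path) && names.include?(File.basename(path))
    FileUtils.rm_rf(path)
    removed << path
    Find.prune # don't descend into the directory we just deleted
  end
  removed
end

# Usage (destructive, so try it on a copy first):
#   remove_vcs_dirs(".").each { |dir| puts "Removed #{dir}" }
```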

Ruby OCI8 Library Unsupported Datatypes

April 17, 2007

If you find yourself in the nasty position of having to use Oracle with Ruby, watch out for a problem related to unsupported datatypes. Specifically these types:

  • SQLT_TIMESTAMP
  • BINARY_DOUBLE
  • BINARY_FLOAT

However, I found a little quick fix that seems to work. Add this to your code:


require 'oci8'

# handle the timestamp mapping
OCI8::BindType::Mapping[OCI8::SQLT_TIMESTAMP] = OCI8::BindType::OraDate

# handle the binary_float
OCI8::BindType::Mapping[100] = OCI8::BindType::Float

# handle the binary_double
OCI8::BindType::Mapping[101] = OCI8::BindType::Float

The good news is it looks like the fix will be in the next release of OCI8.

Mongrel based Gem Server

March 15, 2007

As you know, the default Gem server included with RubyGems runs with WEBrick. I wanted something a little quicker and more reliable. So, here it is: mongrel_gem_server. Nothing complicated. Just the original Gem server adapted to use Mongrel.

What is defined?

March 01, 2007

if defined? “what does it mean?!”

Plowing through various Rails and Ruby source code, I keep running across the method defined?. However, a quick look through the API turned up nothing on it. Where’s it coming from? I can’t find it in Module, Kernel, or Object…time to explore! A quick Google search shows RedHanded solved this mystery way back in 2004.

As it turns out, defined? lives in eval.c. The logic responsible for defined? actually goes by the name is_defined, starting around line 2249. As RedHanded explains: “defined? takes its argument and simply queries the symbol table to see if it is defined. If it is, you get a simple string of identification”:

$ x = 10
=> 10
$ defined? x
=> "local-variable"
$ def hello; puts "Hello" end
$ defined? hello
=> "method"
$ class Test; end
$ defined? Test
=> "constant"

Examples of use

Since I haven’t had the opportunity to use it yet, here are a couple of examples I’ve found:

In Rails, initializer.rb uses it to test for the existence of a constant:


   def initialize_logger
     return if defined?(RAILS_DEFAULT_LOGGER)

The method real_connect in mysql.rb uses it to check UNIXSocket:


 if (host == nil or host == "localhost") and defined? UNIXSocket then ...

Cliff notes version

  1. defined? is classified as an operator not a method
  2. defined? returns nil if the argument passed to it is NOT defined
  3. defined? returns a string identification of the argument passed as defined in the symbol table
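
Here's a small sketch you can paste into a file to see those three points in action. The names x and no_such_thing are just placeholders:

```ruby
x = 10

puts defined?(x).inspect              # prints "local-variable"
puts defined?(String).inspect         # prints "constant"
puts defined?(puts).inspect           # prints "method"
puts defined?(no_such_thing).inspect  # prints nil

# Since nil is falsy, defined? drops straight into a guard clause.
message = defined?(no_such_thing) ? "found it" : "not defined"
puts message                          # prints not defined
```

Because defined? is an operator, its argument is never evaluated, which is why asking about no_such_thing doesn't raise a NameError.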

Now, I wonder where and how the method is_defined is translated into the operator defined?

Serialize “stuff” in your tables with ActiveRecord

February 28, 2007

An Example

So, you’re building a Web application that allows users to customize a car before they purchase it.

Here’s what the current car table looks like (in Migration speak):


  create_table "cars" do |t|
    t.column "make",  :string
    t.column "model", :string
    t.column "new_smell", :boolean
  end
  

The tasking…

Your boss comes in and says “Hey! I got a great idea! Let’s give users the ability to manipulate/change the color of the car…” Immediately your brain snaps into design mode. Hmmm…each car may have many color choices…do I need to add another table just for colors? You ask your boss, “Do we care what colors people are choosing? For example, will we ever want to ask how many people chose a red car?” “No,” your boss replies, “keep it simple and have it done yesterday!”

ActiveRecord to the rescue

Ok, I like keeping it simple. I’m just going to stuff it all into the car table. Fortunately the Rails Gods have dealt with this before. So like many other things in Rails, it’s easy to do. I want to store the colors in a simple Array.
Let’s modify the table.


  create_table "cars" do |t|
    t.column "make",  :string
    t.column "model", :string
    t.column "new_smell", :boolean
    t.column "colors", :text
  end

Notice I added the colors field with a type of :text. That’s because Rails will serialize my Ruby Array (or most any other object I specify) into YAML. And because it’s in the YAML format, it should be accessible from any other language that can read YAML. Ok, almost done. We just need to add a line to our model:


  class Car < ActiveRecord::Base
    serialize :colors, Array
  end
 

serialize is a class method that tells ActiveRecord to move the colors field back and forth between YAML and an Array. In the example above, :colors names the field to serialize, and Array (optional) names the class ActiveRecord should expect to find there. So if I try to store a Hash in the colors field, an exception is thrown.
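
To get a feel for what that round trip looks like, here's a rough imitation using only Ruby's yaml library. This is not ActiveRecord's own code; dump_column and load_column are hypothetical helpers made up for the sketch:

```ruby
require 'yaml'

# Hypothetical helper: checks the class, then turns the value into the YAML
# text that would be stored in the :text column.
def dump_column(value, expected_class)
  unless value.is_a?(expected_class)
    raise TypeError, "expected #{expected_class}, got #{value.class}"
  end
  YAML.dump(value)
end

# Hypothetical helper: turns the stored YAML text back into a Ruby object.
def load_column(text)
  YAML.load(text)
end

stored = dump_column(%w[red yellow magenta], Array)
puts stored                        # the YAML text that lands in the table
puts load_column(stored).inspect   # prints ["red", "yellow", "magenta"]
```

Passing a Hash to dump_column with Array as the expected class raises a TypeError here, which mirrors the class check described above.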

Now I can use the Array on my model as normal:


 Car.create( :make => "Yugo", :model => "Hatchback",
             :colors => ['red','yellow','magenta'] )
 c = Car.find_by_make("Yugo")

 puts "Available colors:"
 c.colors.each { |color|
   puts "#{color} Yugo"
 }

And that’s it. Just two lines of code to add to your app! Check out the docs for more details.

RAILS_DEFAULT_LOGGER

February 27, 2007

Just a simple problem

It all started when I wanted to add the Rails default logger to some code I was working on. The code sits in the lib directory and I wanted to use the same logger the controllers and models use. Without thinking I threw a call to logger.info("whatever") in my code. Obviously that’s not going to work, and I quickly got an error. Fortunately a quick Google search turned up this. Basically, to use the logger in my code I simply needed to call the constant RAILS_DEFAULT_LOGGER. Ok, great. Problem solved. Time to move on. But something caught my curiosity. How is Rails setting this constant and making it available to my class?

The plot thickens

So I decide to dig in. Luckily I start by looking at the code in initializer.rb. Rails::Initializer is called when your app boots up. It does most of the heavy lifting configuring the environment for Rails. Right around line 258, I find the answer to the constant mystery:


  Object.const_set "RAILS_DEFAULT_LOGGER", logger

Great. Ok now I can move on…but what the heck is


  Object.const_set

An aha moment

As it turns out, the method const_set allows you to set a constant on the given object. In our case we’re setting the constant on Object - the parent of all classes in Ruby. So let’s see how this works…fire up IRB and follow along:

Let’s start by seeing what constants Object already has. Ruby makes that easy:


  > Object.constants

As you can see (in your IRB terminal) Object comes with a large list of constants already. Ok let’s add our own


  > Object.const_set("WHOS_THE_MAN", "DAVE_IS")
  > Object.constants

Wow, you just added a constant to the supreme Ruby Object. Next, let’s see how we can get to this in our own classes:


  class Test
    def whos_the_man?
      puts WHOS_THE_MAN
    end
  end
  t = Test.new
  t.whos_the_man?   # prints DAVE_IS

So there you have it. A clever way to add constants that are available throughout your Ruby app. Of course you’re not limited to just adding to Object. Your class can have its own constants that are scoped to it and its subclasses. You gotta love this language!
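
Here's a quick sketch of that last point, a constant scoped to a class and its subclasses. Garage, BigGarage and DEFAULT_BAYS are made-up names for illustration:

```ruby
class Garage
  # Same effect as writing DEFAULT_BAYS = 2 in the class body.
  const_set("DEFAULT_BAYS", 2)
end

class BigGarage < Garage
  def describe
    "bays: #{DEFAULT_BAYS}" # found through the superclass
  end
end

puts Garage.const_get("DEFAULT_BAYS")            # prints 2
puts BigGarage.new.describe                      # prints bays: 2
puts Object.const_defined?("DEFAULT_BAYS")       # prints false
```

The constant lives on Garage, so subclasses see it but the rest of the app has to ask for Garage::DEFAULT_BAYS explicitly.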

Who’d of thunk a ThingLink?

May 06, 2006

Thinglink.org provides a nice place for Makers to register and publicize their products and/or designs. When you register an item on the site, a unique ThingLink code is created (like a SKU). This code is registered in the ThingLink open database and is intended to “stick” to the item for life. This is a neat idea, and it appears they’ve put a bit of thought into the format of the code.

A ThingLink is a 6-character code that is intended to uniquely identify an object for life. The code follows a particular format: the first 3 characters are numbers (0-9) and the next 3 characters are letters (a-z); case doesn’t matter. So you may end up with a code like 345JrD. In fact, you can generate your own code here. The format - numbers then letters - seems to make it a bit easier to remember. And for some reason, I don’t know why, the URL for a ThingLink gives it a bit of cool factor: http://thinglink.org/thing:123Abc
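
The format is simple enough to check with a one-line pattern. This is just my reading of the format described above, not anything published by ThingLink, and the method names are made up:

```ruby
# Three digits followed by three letters; the /i flag ignores case.
THINGLINK_FORMAT = /\A[0-9]{3}[a-z]{3}\z/i

# Hypothetical helper: true when code matches the described format.
def valid_thinglink?(code)
  !!(code =~ THINGLINK_FORMAT)
end

# Hypothetical helper: two codes are the same thing regardless of case.
def same_thinglink?(a, b)
  valid_thinglink?(a) && a.downcase == b.downcase
end

puts valid_thinglink?("345JrD")            # prints true
puts valid_thinglink?("JrD345")            # prints false
puts same_thinglink?("123Abc", "123ABC")   # prints true
```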

What’s wrong with IM?

May 04, 2006

As the old saying goes - a picture’s worth a thousand words:

[Image: IM Problem]

No, the problem is not with Meebo. The problem is with four different networks and four different protocols. I have family that uses MSN, friends and co-workers on AOL, and I personally prefer Jabber (GTalk). Fortunately, applications like GAIM and Meebo make it possible to bridge the gap, but I still need an account for each. Why does it have to be so complicated?

Now, imagine if the Web were built this way. Hey, my website is on the AOL network, you have to create an account to access it and download their browser. Where’s your site and what browser does it require?

Unfortunately, this doesn’t appear to be causing much of a problem as millions of people are chatting each day. But then high gas prices have yet to stop people from driving SUV tanks either. Sooner or later, something is gonna give.

Personally, I think you’ll see more web-based chat systems like Campfire. Why? Mainly because recent advances on the web remove what kept web-based chat from being usable in the past: AJAX gives you a snappy interface, and Comet (event-driven server push) delivers chat messages to the browser. Add to that some of the benefits the web-based approach may have over normal IM and you have some compelling reasons to move away from proprietary networks.

Time will tell.