Scrape the Wayback Machine with this little script
Here’s a little script I use to scrape archived pages from the Alexa Wayback Machine. Basically, it works like this:
- Query Alexa for an old URL you’re looking for and the years you’re interested in
- Use Hpricot to look in the results for links to archived pages. The pattern is http://web.archive.org/web/200301../url, where the number is the timestamp and the URL on the end is the old page you’re looking for. Return an array of successful matches
- Loop over the results of above and download the pages locally using curl (you could also use wget)
- Save the pages with the name “archive_timestamp.html”
Here’s the code:
require 'hpricot'
require 'open-uri'

urls = %w[http://sample.com http://sample2.com ...]
years = %w[2002 2003 2004]

# Search Alexa for the following URLs and years and
# extract the relevant links from the search result pages
def extract_links_from_search(search_urls=[], years=[])
  results = []
  search_urls.each do |u|
    years.each do |y|
      search_alexa = "http://web.archive.org/web/#{y}*/#{u}"
      doc = Hpricot(open(search_alexa))
      (doc/:a).each do |link|
        ul = link.attributes['href']
        # Search result pages have the following url, followed
        # by the timestamp (20030313094512)
        # followed by the search url
        if ul =~ %r{http://web\.archive\.org/web/\d+/http:}
          results << ul
        end
      end
    end
  end
  results
end

def download_and_store_pages(results=[])
  results.each do |url|
    # Create a file name based on the timestamp
    fn = "archive_#{$&}.html" if url =~ /\d+/
    puts "Saving as: #{fn}"
    `curl "#{url}" -o "#{fn}"`
  end
end

outp = extract_links_from_search(urls, years)
puts "Getting the data"
download_and_store_pages(outp)
This is quick and dirty and took about 10 minutes to write. It could probably be simplified, but it does the job for me.
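The link filter and the file naming can be tried without touching the network. This is only a minimal sketch, assuming archived links follow the pattern described above; the constant ARCHIVE_LINK and both helper names are made up for this example and are not part of the script.

```ruby
# Hypothetical helpers for illustration; not part of the original script.
# The capture group grabs the snapshot timestamp.
ARCHIVE_LINK = %r{\Ahttp://web\.archive\.org/web/(\d+)/http:}

# Keep only hrefs that point at an archived snapshot.
def archived_links(hrefs)
  hrefs.select { |href| href =~ ARCHIVE_LINK }
end

# Build the local file name from the snapshot timestamp,
# or return nil when the URL is not an archive link.
def archive_filename(url)
  match = ARCHIVE_LINK.match(url)
  match && "archive_#{match[1]}.html"
end

hrefs = [
  "http://web.archive.org/web/20030313094512/http://sample.com",
  "http://web.archive.org/about/faq.php"
]

archived_links(hrefs).each { |u| puts archive_filename(u) }
```

Capturing the timestamp in the same regex that filters the links avoids the second, looser match on the first run of digits in the URL.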
Modify the XML output from your Model
So your app needs to generate XML. No problem, ActiveRecord gives it to you for free. Simply call mymodel.to_xml and you’re done. But what happens if you need to generate more complicated…specialized XML? There are a few options:
- Don’t call to_xml, generate the XML using a template (.rxml)
- Override the to_xml method, as mentioned in the docs
- Create a separate method for generating the XML
To keep things simple, and for reasons we’ll see later, let’s use the third option.
The example
Ok. I have a model, Car, with 3 attributes (year,make,model). Here’s what the default XML looks like:
car = Car.find(:first)
car.to_xml
=> <car>
     <make>Nissan</make>
     <model>Pickup</model>
     <year>1995</year>
   </car>
So let’s customize the XML to add a namespace for the elements and change the tag type. In the Car model we’ll create a new method called my_xml instead of overriding to_xml:
def my_xml(options={})
  options[:indent] ||= 2
  xml = options[:builder] ||= Builder::XmlMarkup.new(:indent => options[:indent])
  xml.instruct! unless options[:skip_instruct]
  xml.mycar(:Vehicle, "xmlns:mycar" => "http://crazystuff.org/car/ns") do
    xml.mycar(:make, self.make)
    xml.mycar(:model, self.model)
    xml.mycar(:year, self.year)
  end
end
The Model uses the Builder library for creating the XML. That was easy. Now when I call car.my_xml I get this:
<?xml version="1.0" encoding="UTF-8"?>
<mycar:Vehicle xmlns:mycar="http://crazystuff.org/car/ns">
  <mycar:make>Nissan</mycar:make>
  <mycar:model>Pickup</mycar:model>
  <mycar:year>1995</mycar:year>
</mycar:Vehicle>
Perfect! Now let’s try and query all Cars and see what we get:
all_cars = Car.find(:all)
all_cars.my_xml
=> NoMethodError: undefined method 'my_xml' for #<Array:0x1379810>
What the *$%@! That’s not right. Calling Car.find(:all) returns an Array. Array doesn’t have a method my_xml.
But how does Rails do it? If “all_cars” is an Array, then Array within Rails must support the to_xml method. As it turns out Rails adds some tricks to some of the core pieces of the Ruby language. Of interest to us right now is the module ActiveSupport::CoreExtensions::Array::Conversions. It defines a to_xml method that is a mixin for the Array class.
We could open up the Module and change it. Or we could just create our own method and include it into Array. Let’s do something like that:
module MyConversion
  # Accept an options hash, just like the my_xml method on the model
  def my_xml(options={})
    options[:indent] ||= 2
    options[:builder] ||= Builder::XmlMarkup.new(:indent => options[:indent])
    # TODO: Move the xmlns from Vehicle to here...
    options[:builder].tag!("mycar:AllVehicles") do
      # Here we loop over each entry (model) and call its my_xml
      each { |e| e.my_xml(options.merge!({ :skip_instruct => true })) }
    end
  end
end

# Don't forget to do this!
class Array
  include MyConversion
end
Ok. That’s it. Now when we call my_xml, it works as expected whether the receiver is an Array or a single object.
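The mixin idea itself doesn’t depend on Rails or Builder. Here is a stripped-down sketch in plain Ruby; the module TaggedList, the method to_tagged, and the Struct-based Car are stand-ins invented for illustration, and the markup is built by hand without any escaping, so treat it as a demonstration of the technique only.

```ruby
# Hypothetical stand-in for a model: each object renders itself.
Car = Struct.new(:make, :model) do
  def to_tagged
    "<car><make>#{make}</make><model>#{model}</model></car>"
  end
end

# Collection-level rendering lives in a module...
module TaggedList
  def to_tagged(wrapper = "cars")
    inner = map { |element| element.to_tagged }.join
    "<#{wrapper}>#{inner}</#{wrapper}>"
  end
end

# ...which is mixed into Array, so an Array of cars gains the same method.
class Array
  include TaggedList
end

cars = [Car.new("Nissan", "Pickup"), Car.new("Yugo", "Hatchback")]
puts cars.first.to_tagged  # a single object
puts cars.to_tagged        # the whole collection
```

Because the method has the same name on the element and on the collection, calling code doesn’t need to care which one it has.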
Have a look around in the ActiveSupport Core Ext. There’s a lot to learn there.
Clean all .svn or CVS directories from your source
You need or want to import your code into a new svn or cvs repository. But the source code is filled with CVS or .svn folders from an old repository. Here’s a quick way to remove them (Unix). For Subversion leftovers, use -name .svn instead:
find . -type d -name CVS | xargs rm -rf
Ruby OCI8 Library Unsupported Datatypes
If you find yourself in the nasty position of having to use Oracle with Ruby, watch out for a problem related to unsupported datatypes. Specifically these types:
- SQLT_TIMESTAMP
- BINARY_DOUBLE
- BINARY_FLOAT
However, I found a little quick fix that seems to work. Add this to your code:
require 'oci8'
# handle the timestamp mapping
OCI8::BindType::Mapping[OCI8::SQLT_TIMESTAMP] =
OCI8::BindType::OraDate
# handle the binary_float
OCI8::BindType::Mapping[100] = OCI8::BindType::Float
# handle the binary_double
OCI8::BindType::Mapping[101] = OCI8::BindType::Float
The good news is it looks like the fix will be in the next release of OCI8.
Mongrel based Gem Server
As you know, the default Gem server included with RubyGems runs with WEBrick. I wanted something a little quicker and more reliable. So, here it is: mongrel_gem_server. Nothing complicated. Just the original Gem server adapted to use Mongrel.
What is defined?
if defined? “what does it mean?!”
Plowing through various Rails and Ruby source code, I keep running across the method defined?. However, a quick look through the API turned up nothing on it. Where’s it coming from? I can’t find it in Module, Kernel, or Object…time to explore! A quick Google search shows RedHanded solved this mystery way back in 2004.
As it turns out, defined? lives in eval.c. The logic responsible for defined? actually goes by the name is_defined, starting around line 2249. As RedHanded explains: “defined? takes its argument and simply queries the symbol table to see if it is defined. If it is, you get a simple string of identification”:
$ x = 10
=> 10
$ defined? x
=> "local-variable"
$ def hello; puts "Hello" end
$ defined? hello
=> "method"
$ class Test; end
$ defined? Test
=> "constant"
Examples of use
Since I haven’t had the opportunity to use it yet, here are a couple of examples I’ve found:
In Rails, initializer.rb uses it to test for the existence of a constant:
def initialize_logger
return if defined?(RAILS_DEFAULT_LOGGER)
The method real_connect in mysql.rb uses it to check UNIXSocket:
if (host == nil or host == "localhost")
and defined? UNIXSocket then ...
Cliff notes version
- defined? is classified as an operator not a method
- defined? returns nil if the argument passed to it is NOT defined
- defined? returns a string identification of the argument passed as defined in the symbol table
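The points above are easy to check in a standalone script. This is a small sketch; the names x, hello, Sample, @missing, NoSuchThing, and LOGGER_NAME exist only for this example.

```ruby
x = 10

def hello
  "Hello"
end

class Sample; end

# Collect what defined? reports for a few kinds of expressions.
checks = [
  defined?(x),           # a local variable
  defined?(hello),       # a method
  defined?(Sample),      # a constant
  defined?(@missing),    # an instance variable that was never set
  defined?(NoSuchThing)  # a constant that does not exist
]
p checks

# Guarding on a constant, in the spirit of the Rails and MySQL snippets:
# only assign when nothing has defined it yet.
LOGGER_NAME = "default" unless defined?(LOGGER_NAME)
puts LOGGER_NAME
```

Note that defined? never evaluates its argument, which is why asking about NoSuchThing returns nil instead of raising a NameError.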
Now, I wonder where and how the method is_defined is translated into the operator defined?
Serialize “stuff” in your tables with ActiveRecord
An Example
So, you’re building a Web application that allows users to customize a car before they purchase it.
Here’s what the current car table looks like ( in Migration speak ):
create_table "cars" do |t|
t.column "make", :string
t.column "model", :string
t.column "new_smell", :boolean
end
The tasking…
Your boss comes in and says “Hey! I got a great idea! Let’s give users the ability to change the color of the car…” Immediately your brain snaps into design mode. Hmmm…each car may have many color choices…do I need to add another table just for colors? You ask your boss, “Do we care what colors people are choosing? For example, will we ever want to ask how many people chose a red car?” “No,” your boss replies, “keep it simple and have it done yesterday!”
ActiveRecord to the rescue
Ok, I like keeping it simple. I’m just going to stuff it all into the car table. Fortunately the Rails Gods have dealt with this before. So like many other things in Rails, it’s easy to do. I want to store the colors in a simple Array.
Let’s modify the table.
create_table "cars" do |t|
t.column "make", :string
t.column "model", :string
t.column "new_smell", :boolean
t.column "colors", :text
end
Notice I added the colors field with a type of :text. That’s because Rails will serialize my Ruby Array (or most any other object I specify) into YAML. And because it’s in the YAML format, it should be accessible from any other language that can read YAML. Ok, almost done. We just need to add a line to our model:
class Car < ActiveRecord::Base
serialize :colors, Array
end
serialize is a class method that tells ActiveRecord to move the colors field back and forth between YAML and an Array. So in our example above, 1.) :colors tells it to use the colors field and 2.) Array (optional) tells it to check what I’m trying to store. So in our example if I try to pass a Hash to the colors field an exception is thrown.
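To see roughly what happens to the value on its way in and out of the text column, here is a plain-Ruby sketch using the standard YAML library. It is not ActiveRecord’s implementation: dump_column and load_column are hypothetical helpers, and TypeError stands in for whatever exception ActiveRecord itself raises on a type mismatch.

```ruby
require 'yaml'

# Hypothetical helper: check the type, then turn the value into YAML text.
def dump_column(value, expected_class = Object)
  unless value.is_a?(expected_class)
    raise TypeError, "expected #{expected_class}, got #{value.class}"
  end
  YAML.dump(value) # this string is what would sit in the :text column
end

# Hypothetical helper: turn the stored YAML text back into a Ruby object.
def load_column(text)
  YAML.load(text)
end

stored = dump_column(%w[red yellow magenta], Array)
puts stored

colors = load_column(stored)
p colors

# Passing a Hash where an Array is expected is rejected.
begin
  dump_column({ "paint" => "red" }, Array)
rescue TypeError => e
  puts "Rejected: #{e.message}"
end
```

The stored text is ordinary YAML, which is why other languages with a YAML parser can read the column too.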
Now I can use the Array on my model as normal:
Car.create( :make => "Yugo", :model => "Hatchback",
:colors => ['red','yellow','magenta'] )
c = Car.find_by_make("Yugo")
puts "Available colors:"
c.colors.each{ |color|
puts "#{color} Yugo"
}
And that’s it. Just two lines of code to add to your app! Check out the docs for more details.
RAILS_DEFAULT_LOGGER
Just a simple problem
It all started when I wanted to add the Rails default logger to some code I was working on. The code sits in the lib directory and I wanted to use the same logger the controllers and models use. Without thinking I threw a call to logger.info("whatever") in my code. That didn’t work, and I quickly got an error. Fortunately a quick Google search turned up this. Basically, to use the logger in my code I simply needed to call the constant RAILS_DEFAULT_LOGGER. Ok, great. Problem solved. Time to move on. But something caught my curiosity. How is Rails setting this constant and making it available to my class?
The plot thickens
So I decide to dig in. Luckily I start by looking at the code in initializer.rb. Rails::Initializer is called when your app boots up. It does most of the heavy lifting configuring the environment for Rails. Right around line 258, I find the answer to the constant mystery:
Object.const_set "RAILS_DEFAULT_LOGGER", logger
Great. Ok now I can move on…but what the heck is Object.const_set?
An aha moment
As it turns out the method const_set allows you to set a constant on the given object. In our case we’re setting the constant on Object - the parent of all classes in Ruby. So let’s see how this works…fire-up IRB and follow along:
Let’s start by seeing what constants Object already has. Ruby makes that easy:
> Object.constants
As you can see (in your IRB terminal) Object comes with a large list of constants already. Ok let’s add our own
> Object.const_set("WHOS_THE_MAN", "DAVE_IS")
> Object.constants
Wow, you just added a constant to the supreme Ruby Object. Next, let’s see how we can get to this in our own classes:
class Test
  def whos_the_man?
    puts WHOS_THE_MAN
  end
end
t = Test.new
t.whos_the_man? => DAVE_IS
So there you have it. A clever way to add constants that are available throughout your Ruby app. Of course you’re not limited to just adding to Object. Your class can have its own constants that are scoped to it and its subclasses. You gotta love this language!
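Both kinds of scoping can be seen side by side in a short script. This is just a sketch; Garage, BigGarage, and SLOTS are names invented for the example.

```ruby
# A constant set on Object is visible from every class.
Object.const_set("WHOS_THE_MAN", "DAVE_IS")

class Garage
  SLOTS = 2 # scoped to Garage and its subclasses

  def whos_the_man?
    WHOS_THE_MAN # found on Object
  end
end

class BigGarage < Garage
  def slots
    SLOTS # found through the superclass
  end
end

puts Garage.new.whos_the_man?
puts BigGarage.new.slots

# SLOTS belongs to Garage, not to Object.
puts Object.const_defined?(:SLOTS)
puts Garage.const_get(:SLOTS)
```

Keeping a constant on the class that owns it avoids crowding the global namespace, so setting constants on Object is best saved for things that really are application-wide, like the logger.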
Who’d of thunk a ThingLink?
Thinglink.org provides a nice place for Makers to register and publicize their products and/or designs. When you register an item on the site, a unique ThingLink code is created (like a SKU). This code is registered in the ThingLink open database and is intended to “stick” to the item for life. This is a neat idea, and it appears they’ve put a bit of thought into the format of the code.
A ThingLink is a 6 character code that is intended to uniquely identify an object for life. The code follows a particular format: the first 3 characters are numbers (0-9) and the next 3 characters are letters (a-z); case doesn’t matter. So you may end up with a code like 345JrD. In fact, you can generate your own code here. The format, numbers then letters, seems to make it a bit easier to remember. And for some reason, I don’t know why, the URL for a ThingLink gives it a bit of cool factor: http://thinglink.org/thing:123Abc
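The format is simple enough to check with a regular expression. The sketch below is based only on the description above (three digits, then three letters, case-insensitive); the real service may apply rules that aren’t covered here, and the constant and method names are made up for the example.

```ruby
# Assumed format: exactly 3 digits followed by exactly 3 letters, any case.
THINGLINK_CODE = /\A\d{3}[a-z]{3}\z/i

def thinglink_code?(code)
  !!(code =~ THINGLINK_CODE)
end

# Case doesn't matter, so normalise before comparing two codes.
def same_thing?(a, b)
  thinglink_code?(a) && thinglink_code?(b) && a.downcase == b.downcase
end

puts thinglink_code?("345JrD")
puts thinglink_code?("JrD345")
puts same_thing?("123Abc", "123ABC")
```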
What’s wrong with IM?
As the old saying goes - a picture’s worth a thousand words:

No, the problem is not with Meebo. The problem is with four different networks and four different protocols. I have family that uses MSN, friends and co-workers on AOL, and I personally prefer Jabber (GTalk). Fortunately, applications like GAIM and Meebo make it possible to bridge the gap, but I still need an account for each. Why does it have to be so complicated?
Now imagine if the Web were built this way. Hey, my website is on the AOL network, you have to create an account to access it and download their browser. Where’s your site and what browser does it require?
Unfortunately, this doesn’t appear to be causing much of a problem as millions of people are chatting each day. But then high gas prices have yet to stop people from driving SUV tanks either. Sooner or later, something is gonna give.
Personally, I think you’ll see more web-based chat systems like Campfire. Why? Because the web now has what usable web-based chat was missing in the past: AJAX for a snappy interface and Comet (event-driven server push) to deliver chat messages to the browser. Add to that some of the benefits the web-based approach may have over normal IM and you have some compelling reasons to move away from proprietary networks.
Time will tell.