Importing Data with Rails
8:54 AM EDT Saturday, April 19 2008
Often when working with Rails applications, you need to import data from other sources. A common source is an excel spreadsheet. A simple import consists of reading each line in the spreadsheet and creating a record in the database for each line. You could do this is a small Ruby script with SQL, you wouldn't need Rails. But sometimes the import is more complicated. For example, you may want to run your application validation logic on each record. Also, maybe you need to create associated record for each row.
To handle this kind of thing, it can be helpful to use your ActiveRecord data model. To do that, you can simply create a Ruby script and add these few lines at the top:
require File.join(File.dirname(__FILE__), "..", "..", "config", "boot")
require File.join(File.dirname(__FILE__), "..", "..", "config", "environment")
This will boot up the Rails environment when your script starts, and then you have full access to your Rails models. You could write a procedural script to handle that, but I've found that creating an object-oriented class gives you a little bit cleaner, more re-usable framework. So let's just get right to the code. Here is the code for a base class for your data import:
class DataImport
attr_reader :file, :fields, :row_map, :default_e
#Create DSL methods for subclasses
class << self
def default_environment(env)
self.send(:define_method, :default_environment) do
env
end
end
def default_file(file_name)
self.send(:define_method, :default_file) do
file_name
end
end
end
def initialize(env, file)
load_rails(env || respond_to?(:default_environment) ?
default_environment : "development")
@file = file || default_file
raise "You must specify a file" unless @file
end
def self.run(env, file)
new(file, env).run
end
def run
open(file).each_with_index do |line, i|
initialize_row!(line, i)
end
end
def initialize_row!(line, i)
tokenize_row!(line)
if i < 1
initialize_fields!
else
initialize_row_map!
process_row
end
end
def process_row
puts row_map.inspect
end
private
def tokenize_row!(line)
@row = line.split('|')
end
def initialize_fields!
@fields = @row.map{|e| e.chomp.to_sym}
end
def initialize_row_map!
@row_map = {}
@row.each_with_index do |c, i|
@row_map[fields[i]] = c.blank? ? nil : c.strip
end
end
def load_rails(env)
ENV['RAILS_ENV'] = env
require File.join(File.dirname(__FILE__), "..", "..", "config", "boot")
require File.join(File.dirname(__FILE__), "..", "..", "config", "environment")
end
end
Alright, that's a pretty big chunk of code, but this is the implementation of a base class that you will reuse. Don't worry, your actual import class will be much shorter. In other words, you can copy and paste this right into your app and use it as is, but if you are interested to find out how it works, read the next few paragraphs.
So the first interesting thing you'll encounter in this code is the DSL-ish methods. To understand how this works, you really need to read Why The Lucky Stiff's Seeing Metaclasses Clearly. The talk Dave Thomas gave just the other day at the NovaRUG would help too. But basically what it does is define 2 class methods that are intended to be used by subclasses during class definition. When called, they will define methods that the base class can then use. This the concept I blogged about the other day in action. They are conceptually the same thing as the definition of the belongs_to and has_many methods in ActiveRecord. It will make more sense when you see an implementation.
Next up is the constructor which handles setting the file instance variable for our data import class, as well as loading up rails with the right environment specified. After that are class and instance methods both called run. The idea here is that we want to work with an instance of the data import class, but it will be convenient to just call OurDataImport.run.
The work happens in the run instance method. This opens up the file and starts processing it line by line. In this method I'm trying to employ a technique, or more of a style I guess, that Marcel Molina spoke about at the DC Ruby Users Group. The idea is that you should strive to as much as possible have all of the code within a method be at the same level of abstraction. If you look at this whole method:
def initialize_row!(line, i)
tokenize_row!(line)
if i < 1
initialize_fields!
else
initialize_row_map!
process_row
end
end
It's easy to read it and understand what it is going to do. First we are going to tokenize the row, then if it is the first row, we will initialize the fields, otherwise, we will initialize the row map and process the row. For example, this method could be written like this:
def initialize_row!(line, i)
tokenize_row!(line)
if i < 1
initialize_fields!
else
@row_map = {}
@row.each_with_index do |c, i|
@row_map[fields[i]] = c.blank? ? nil : c.strip
end
process_row
end
end
But there is an abstraction-level switching that you have to go through mentally once you get to the first line after the else. The rest of the method is composed of intent-revealing methods, but then we just have this lower-level chunk of code that deals with setting instance variables. So don't do that, the other implementation is cleaner, leads to code that is composed well and is easier to test and extend.
So the meat of what happens here is that the run method reads in the file row by row. It assumes the data will be pipe-separated (that is, records separated with the "|" character), because I find that to be easiest to parse. It's trival to convert an excel spreadsheet to a pipe-separated text file using OpenOffice. If your data is not pipe-separated, you could override tokenize_row to split up the row some other way. It assumes the first row contains the field names that each column will map to, so if we are on the first row, it just stores away the field names. Then, on each subsequent row it constructs a map (a.k.a hash) containing the column name and values. Then it calls the process_row method. The implementation of the process_row doesn't do anything interesting in this base class because the intent is for you to override that in your subclass.
Ok, so now let's put this to use. Create a rails app with a simple user model:
$ rails myapp
$ cd myapp
$ script/generate model user name:string email:string
$ rake db:migrate
Now copy the whole base DataImport class from above into db/data/data_import.rb. Then create a data file at db/data/users.txt with something like this:
name|email
Paul Barry|mail@paulbarry.com
Someone Else|someone_else@example.com
And then finally we'll create an implementation of our data import at db/data/user_data_import.rb
require 'data_import'
class UserDataImport < DataImport
default_file "users.txt"
def process_row
user = User.create!(row_map)
puts "Created => #{user.inspect}"
end
end
UserDataImport.run(ARGV[0], ARGV[1])
So now we have a pretty clear, concise file that explains what we are doing. You can see the call to default_file that allows us to set our default file name using a clean, DSL-ish syntax. We could also call default_environment there as well if we wanted to, but we don't have to. This is a very simple import where we just create a user for each row. The last line of the script runs the import, passing in the command line arguments. If you pass no arguments, it will work, using "development" for the environment and "users.txt" for the file name. A real data import is likely to do some more interesting work with the data, but at least this gets all the plumbing of processing the data file out of the way for you and allows you to focus on the logic of what you need to do with the data. All that's left to do is simply run the db/data/user_data_import.rb script.
Sidenote: I've found that if your want to run the script from textmate, you need to add this line top of your script, due to a conflict in the ruby libraries provides with TextMate.
$:.reject! { |e| e.include? 'TextMate' }