data ruby taskrabbit
2016-04-22T22:45:50.164Z - Originally posted at https://tech.taskrabbit.com/blog/2013/04/12/forklift/
↞ See all posts
Forklift is a ruby gem that can help you collect, augment, and save copies of your mySQL databases. This is often called an "ETL" tool as the steps in this process mirror the actions of "Extracting the data," "Transforming the data," and finally "Loading the data" into its final place.
With Forklift, you create a Plan which describes how to manipulate your data. The process for this involves (at least) three databases:
The "Live Set" is first loaded into the "Working Set" to create a copy of your production data we can manipulate without fear of breaking replication. Then, any transformations/manipulations are run on the data in the working set. This might include normalizing or cleaning up data which was great for production but hard for analysts to use. Finally, when all of your transformations are complete, that data is loaded into the final database.
Forklift is appropriate to use by itself or integrated within a larger project. Forklift aims to be as fast as can be by using native mySQL copy commands and eschewing all ORMs and other RAM hogs.
At TaskRabbit, the website you see at www.taskrabbit.com is actually made up of many smaller rails applications. When analyzing our site, we need to collect all of this data into one place so we can easily join across it.
We replicate all of our databases into one server in our office, and then use Forklift to extract the data we want into a common place. This gives us the option to both look at live data and to have a more accessible transformed set which we create on a rolling basis. Our "Forklift Loop" also git-pulls to check for any new transformations before each run.
In Forklift, you build a plan. You can add any action to the plan in any order before you run it. You can have 0 or many actions of each type.
1require 'rubygems' 2require 'bundler' 3Bundler.require(:default) 4require 'forklift/forklift' # Be sure to have installed the gem! 5 6######### 7# SETUP # 8######### 9 10forklift = Forklift::Plan.new({ 11 12 :local_connection => { 13 :host => "localhost", 14 :username => "root", 15 :password => nil, 16 }, 17 18 :remote_connections => [ 19 { 20 :name => "remote_connection_a", 21 :host => "192.168.0.0", 22 :username => "XXX", 23 :password => "XXX", 24 }, 25 { 26 :name => "remote_connection_b", 27 :host => "192.168.0.1", 28 :username => "XXX", 29 :password => "XXX", 30 }, 31 ], 32 33 :final_database => "FINAL", 34 :working_database => "WORKING", 35 36 :do_dump? => true, 37 :dump_file => "/data/backups/dump-#{Time.new}.sql.gz", 38 39 :do_email? => true, 40 :email_to => ['XXX'], 41 :email_options => { 42 :via => :smtp, 43 :via_options => { 44 :address => 'smtp.gmail.com', 45 :port => '587', 46 :enable_starttls_auto => true, 47 :user_name => "XXX", 48 :password => "XXX", 49 :authentication => :plain, 50 } 51 } 52 53}) 54 55########## 56# CHECKS # 57########## 58 59forklift.check_local_source({ 60 :name => 'CHECK_FOR_NEW_DATA', 61 :database => 'test', 62 :query => 'select (select max(created_at) from new_table) > (select date_sub(NOW(), interval 1 day))', 63 :expected => '1' 64}) 65 66forklift.check_remote_source({ 67 :connection_name => "remote_connection_b", 68 :name => 'ANOTHER_CHECK', 69 :database => 'stuff', 70 :query => 'select count(1) from people', 71 :expected => '100' 72}) 73 74########### 75# EXTRACT # 76########### 77 78forklift.import_local_database({ 79 :database => "database_1", 80 :prefix => false, 81 :frequency => 24 * 60 * 60, 82}) 83 84forklift.import_local_database({ 85 :database => "database_2", 86 :prefix => false, 87 :only => ['table_1', 'table_2'], 88}) 89 90forklift.import_remote_database({ 91 :connection_name => 'remote_connection_a', 92 :database => "database_3", 93 :prefix => true, 94 :skip => ['schema_migrations'] 95}) 96 97############# 98# TRANSFORM # 99############# 100 101transformation_base = File.dirname(__FILE__) + "/transformations" 102 103forklift.transform_sql({ 104 :file => "#{transformation_base}/calendars/create_calendars.sql", 105 :frequency => 24 * 60 * 60, 106}) 107 108forklift.transform_ruby({ 109 :file => "#{transformation_base}/test/test.rb", 110}) 111 112####### 113# RUN # 114####### 115 116forklift.run 117
1def run 2 lock_pidfile # Ensure that only one instance of Forklift is running 3 rebuild_working_database # Ensure that the working database exists 4 ensure_forklift_data_table # Ensure that the metadata table for forklift exists (used for frequency calculations) 5 6 run_checks # Preform any data integrity checks 7 run_extractions # Extact data from the life databases into the working database 8 run_transformations # Perform any transformations 9 run_load # Load the manipulated data into the final database 10 11 save_dump # mySQLdump the new final database for safe keeping 12 send_email # Email folks the status of this forklift 13 unlock_pidfile # Clean up the pidfile so I can run next time 14end 15
Forklift allows you to create both Ruby transformations and SQL transformations
1class Test 2 def transform(connection, database, logger) 3 logger.log "Running on DB: #{database}" 4 logger.log "Counting users..." 5 connection.q("USE `#{database}`") 6 users_count = connection.q("count(1) as 'users_count' from `users`") 7 logger.log("There were #{users_count} users") 8 end 9end
SQL Transformations can be used to generate new tables like this as well
The defaults for a new Forklift::Plan are:
1 1 { 2 2 :project_root => Dir.pwd, 3 3 :lock_with_pid? => true, 4 4 5 5 :final_database => {}, 6 6 :local_database => {}, 7 7 :forklift_data_table => '_forklift', 8 8 9 9 :verbose? => true, 1010 1111 :do_checks? => true, 1212 :do_extract? => true, 1313 :do_transform? => true, 1414 :do_load? => true, 1515 :do_email? => false, 1616 :do_dump? => false, 1717 }
1 1 forklift.check_local_source({ 2 2 :name => STRING, # A name for the test 3 3 :database => STRING, # The Database to test 4 4 :query => STRING, # The Query to Run. Needs to return only 1 row with 1 value 5 5 :expected => STRING # The response to compare against 6 6 }) 7 7 8 8 forklift.check_remote_source({ 9 9 :connection_name => STRING, # The name of the remote_connection 1010 :name => STRING, # A name for the test 1111 :database => STRING, # The Database to test 1212 :query => STRING, # The Query to Run. Needs to return only 1 row with 1 value 1313 :expected => STRING # The response to compare against 1414 })
1 1 forklift.import_local_database({ 2 2 :database => STRING, # The Database to Extract 3 3 :prefix => BOOLEAN, # Should we prefix the names of all tables in this database when imported wight the database? 4 4 :frequency => INTEGER (seconds), # How often should we import this database? 5 5 :skip => ARRAY OR STRINGS # A list of tables to ignore and not import 6 6 :only => ARRAY OR STRINGS # A list of tables to ignore and not import (use :only or :skip, not both) 7 7 }) 8 8 9 9 forklift.import_remote_database({ 1010 :connection_name => STRING, # The name of the remote_connection 1111 :database => STRING, # The Database to Extract 1212 :prefix => BOOLEAN, # Should we prefix the names of all tables in this database when imported wight the database? 1313 :frequency => INTEGER (seconds), # How often should we import this database? 1414 :skip => ARRAY OR STRINGS # A list of tables to ignore and not import 1515 :only => ARRAY OR STRINGS # A list of tables to ignore and not import (use :only or :skip, not both) 1616 }) 17
11 forklift.transform_sql({ 22 :file => STRING, # The transformation file to run 33 :frequency => INTEGER (seconds), # How often should we run this transformation? 44 }) 55 66 forklift.transform_ruby({ 77 :file => STRING, # The transformation file to run 88 :frequency => INTEGER (seconds), # How often should we run this transformation? 99 })
You can launch forklift in "debug mode" with — debug (we check ARGV[" — debug"] and ARGV["-debug"]). In debug mode the following will happen: — verbose = true — no SQL will be run (extract, load) — no transforms will be run — no email will be sent — no mySQL dumps will be created
I write about Technology, Software, and Startups. I use my Product Management, Software Engineering, and Leadership skills to build teams that create world-class digital products.
Get in touch