The CSV (“Comma Separated Values”) file format is often used to exchange data between differently similar applications. For programmer, parsing CSV files isn’t a rare case. If CSV file contains not more than one hundred lines it’s not problem to parse it with built in programming language tools or libraries. But when you are using very high-level language and file size increases you need something more powerful.
For example, we have file file.csv which size is 496M. Can you suggests how much time it will take to parse this file on my MacBookPro11,2 with default Ruby csv library?
Ooops, some info about file
Ok, now, let’s try to just read it with ruby
1:22:57? Isn’t good. But if you trying to think about CSV format you will find that it’s very similar to relational database with columns. Could PostgreSQL help us in this question?
Let’s write script that creates temp table and saves each file record in this table. We will use PostgreSQL COPY command to parse CSV file.
Just 44 seconds and we have all this stuff in DB. Pretty cool. But how you can use it in real world issues?
Let’s assume, you have list of abbreviations, list of doctors and you need to save all this stuff into table that contains full doctor name and doctor’s high school. Users table has next structure:
We have file with abbreviations (I used this) that separated with tabs and looks like
And list of doctors that looks like
Let’s put it all together. PostgreSQL uses tabs as default delimetr for text format that used by default in COPY comand. Starting from parsing list of abbreviations:
In PostgreSQL commas is default delimeter for csv format and we can replace WITH DELIMITER ',' with WITH (FORMAT "csv"). Then retrieve doctors from file:
And try to generate query that retrive data for us:
Oh no, there can be more than one definition of abbreviation. In my case I will use first abbreviation definition if there is more that one possible variant. DISTINCT ON keyword will help:
Much better. You need just add INSERT to put all this data into users table. For more complex example you can combine all this stuff with Ruby ActiveRecord and it will save a lot of time. But you need understand that time spent on coding should be commensurate with time that you save by using it. If pure ruby version works 1 hour and you need to parse one document per week seems not so reasonable make all this overhead. This XKCD comics can be used like landmark and should help you.