11/09/2008

Parsing CSV files and thinking in objects

I have a Google spreadsheet that contains the answers to a questionnaire. I wanted to do some simple processing on the responses, so I thought it would be a good idea to use Squeak to munge the data around. However, that raised the question of how to get the data into the system. CSV to the rescue!

Google can export the file as CSV, so I thought it should be reasonably easy to convert it to some sort of OrderedCollection. The question was how to do it. I played around with some methods for a while, and then decided to ask on the beginners' list. The reason I wanted help was that I had the feeling I was still thinking in far too procedural a manner.

Zulq was kind enough to offer this solution:
I usually do something like this:

(((FileStream readOnlyFileNamed: 'file.csv')
contentsOfEntireFile " read and close "
findTokens: String crlf) " split into lines "
reject: [:e | e isEmpty]) " lose empty lines "
collect: [:e | e findTokens: $,] " split into fields "

Regards,
Zulq.


Squeak is so powerful, it is just amazing. It also highlights that I am still nowhere near getting the OO thinking right. Ah well, time enough. However, when I dug into the CSV format a bit more I realised that the solution wasn't complete.

The CSV spec covers the situations where data fields:
  1. don't contain embedded commas
  2. do contain embedded commas
  3. contain double quotes
This is important because, basically, if a field contains a comma then that field must be enclosed in double quotes, e.g.:
this is data,"so, is this"
So the findTokens: solution wouldn't be quite correct, because it would split the line at every comma, including the one inside the quoted field.
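
To make the problem concrete, here is a minimal workspace sketch of quote-aware splitting. It is only an illustration under some assumptions: it handles a single line, it assumes no line breaks inside quoted fields, and it treats a doubled double quote inside a quoted field as one literal quote (the usual CSV convention). The variable names are mine and purely illustrative.

```smalltalk
| line naive fields field inQuotes stream ch |
line := 'this is data,"so, is this"'.

"The naive split cuts at every comma, so this line yields three tokens instead of two"
naive := line findTokens: $,.

fields := OrderedCollection new.
field := WriteStream on: String new.
inQuotes := false.
stream := ReadStream on: line.
[stream atEnd] whileFalse: [
    ch := stream next.
    inQuotes
        ifTrue: [
            ch = $"
                ifTrue: [
                    "A doubled quote inside a quoted field stands for one literal quote"
                    stream peek = $"
                        ifTrue: [field nextPut: stream next]
                        ifFalse: [inQuotes := false]]
                ifFalse: [field nextPut: ch]]
        ifFalse: [
            ch = $"
                ifTrue: [inQuotes := true]
                ifFalse: [
                    ch = $,
                        ifTrue: [
                            "A comma outside quotes ends the current field"
                            fields add: field contents.
                            field := WriteStream on: String new]
                        ifFalse: [field nextPut: ch]]]].
"Whatever is left after the last comma is the final field"
fields add: field contents.
```

Inspecting fields afterwards should show two entries, with the comma preserved inside the second one. Unlike findTokens:, which treats several delimiters in a row as one separation, this loop also keeps empty fields, which probably matters for questionnaire data where answers can be blank.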

I pondered this for a while, mostly looking for the most elegant fix, when it dawned on me that I should probably find the point in the existing solution where a change would produce the new behaviour with the least disruption. Maybe that is a new heuristic? It seemed to me that if I could subclass (or perhaps replace) the findTokens: routine, I would change the least amount of code and still solve the problem. I am not really using the Test Driven Development approach yet, as I don't understand how to do it at a practical level, but that should be my next move.

Time to write some code

3 comments:

randy said...

I had this same problem not too long ago, that I detailed here. I wound up using a CSV parser that's available on SqueakSource.

Andy Burnett said...

Doh! I was looking in the wrong place. I went to SqueakMap. SqueakSource seems much better. Thanks very much for your suggestion.

Andy Burnett said...

Avi Bryant's CSV Parser seems to work very well. I got some way with my code, and learnt a lot about string manipulation. However, I have now realised that parsing CSV is quite fiddly, so I will put that part to one side, and get on with the rest of the project - now that I can use a canned solution.