CSV Primer
Parsing certain data formats can be more challenging than others. One of the granddaddies of challenging file formats is the humble CSV, or "comma-separated values" file (I've also heard "character-separated values"). The lack of a strong specification, and the spotty adherence to what conventions do exist, is what makes parsing these files particularly challenging at times.
Given the experiences I've had over the past few months parsing over 40 different source file formats, here are a few ways to keep your own (or other people's) developers from pulling all their hair out.
For the record, I've seen all of these in the wild.
One record per row
Seriously. There's a reason why I'm putting this first. I don't care if the row has 2 columns or 500, but if you're spreading records over multiple rows, everyone involved is going to have a bad time.
Bad:
id,title,image,image_position
1,dining room table,image1,1
1,,image2,3
1,,image3,2
Better:
id,title,images
1,dining room table,image1;image3;image2
Also Better:
id,title,images
1,dining room table,"image1,image3,image2"
Also also better:
id,title,image1,image2,image3,image4,image5
1,dining room table,image1,image3,image2,,
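To see why this matters to the consumer, here's a minimal Python sketch (using the standard library's csv module) showing that a one-record-per-row file parses directly into one record per dict, with no cross-row stitching. The data is the "Also also better" example from above:

import csv, io

data = """id,title,image1,image2,image3,image4,image5
1,dining room table,image1,image3,image2,,
"""

# One row in, one complete record out. No need to track partial
# records across rows or guess which row "belongs" to which record.
for row in csv.DictReader(io.StringIO(data)):
    print(row["id"], row["title"])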
One file
Trust me, it's much easier to parse records when they're all in the same file and not spread out over, say, a "data" file, "price" file, and "inventory level" file. All the data in one place.
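For a sense of the extra work split files create, here's a hypothetical sketch of stitching a "data" file and a "price" file back together by id. The file names and columns are made up for illustration; the point is that every consumer has to re-implement this join:

import csv

# Hypothetical file names and columns, purely for illustration.
with open("prices.csv", newline="") as f:
    price_by_id = {row["id"]: row["price"] for row in csv.DictReader(f)}

with open("data.csv", newline="") as f:
    for row in csv.DictReader(f):
        # The consumer now owns the join, plus deciding what to do
        # with ids that appear in one file but not the other.
        row["price"] = price_by_id.get(row["id"])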
Have a header
If there's no header, we don't know what your data is.
Have the header be the first line
If it's not the first line, it can usually be worked around, but that's additional customization, and customization introduces fragility.
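The workaround usually looks something like this sketch, which assumes (fragilely) a known number of preamble lines to skip:

import csv

with open("feed.csv", newline="") as f:  # hypothetical file name
    # Assume we somehow know there are exactly 3 preamble lines.
    # If the provider ever changes that count, this silently breaks.
    for _ in range(3):
        next(f)
    for row in csv.DictReader(f):
        print(row)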
Each row should have the same number of fields as the header
Hopefully this one is self-explanatory.
Bad:
id,title,image
1,stuffed bear
Also Bad:
id,title,image
1,stuffed bear,image1,image2
Best:
id,title,image
1,stuffed bear,
2,table,image1
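On the consumer side, this is cheap to check defensively. A minimal sketch:

import csv

with open("feed.csv", newline="") as f:  # hypothetical file name
    reader = csv.reader(f)
    header = next(reader)
    # row_num counts records, starting after the header. (Physical
    # line numbers can differ if quoted fields contain newlines.)
    for row_num, row in enumerate(reader, start=2):
        if len(row) != len(header):
            # Better to reject loudly than to guess which fields shifted.
            raise ValueError(
                f"row {row_num}: expected {len(header)} fields, got {len(row)}"
            )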
Delimiters
Long story short, I don't care which one you use, but use one that's not used in your data. For instance, if your data has commas, maybe don't use them as a field separator?
Bad:
testing,1,2,3,what are we having, adam, for lunch?
Best:
testing, 1, 2, 3|what are we having, adam, for lunch?
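Most parsers let you declare the delimiter up front. For example, with Python's csv module:

import csv, io

data = "testing, 1, 2, 3|what are we having, adam, for lunch?\n"

# With '|' as the delimiter, the embedded commas are just data.
for row in csv.reader(io.StringIO(data), delimiter="|"):
    print(row)  # ['testing, 1, 2, 3', 'what are we having, adam, for lunch?']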
Quoting
If you must use a delimiter that your data also contains, quote the fields and be consistent about it. Yes, I realize most CSV parsers can handle some fields being quoted and others not, but it makes for a headache if something goes wrong.
For the love of everything that is holy, do NOT use smart ("curly") quotes as your quote character!!! Especially if you use smart quotes in your data!
Bad:
testing,1,2,3,what are we having, adam, for lunch?
So, so, so bad:
“testing,1,2”,3,“what are we having, adam, for lunch?”
Better:
"testing,1,2",3,"what are we having, adam, for lunch?"
Best:
"testing,1,2","3","what are we having, adam, for lunch?"
Line Endings
This is what determines where a new line/record starts.
Please, please, please use just \n for a line ending. \r\n works too. If you must, use just \r. Finally, don't mix them, please!
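On the Python side, the csv module's documentation asks you to open files with newline="" so the library can sort out \n vs. \r\n itself. A sketch of reading either, then writing a single consistent ending:

import csv

# Reading: newline="" lets the csv module handle the line endings.
with open("feed.csv", newline="") as f:  # hypothetical file name
    rows = list(csv.reader(f))

# Writing: pick one line ending and stick with it. Python's csv
# writer defaults to \r\n; this forces plain \n.
with open("out.csv", "w", newline="") as f:
    csv.writer(f, lineterminator="\n").writerows(rows)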
Encoding
It seems like the world is standardizing on UTF-8, which means you should probably do the same. Supporting multiple encodings means the parser has to try the most common one first, fall back if that fails, and so on. This has both data quality and performance implications, neither of which you want.
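Here's a sketch of the fallback dance a consumer ends up doing when the encoding isn't guaranteed (the fallback list is an arbitrary example), which is exactly the cost you avoid by standardizing on UTF-8:

def read_text(path):
    # Try the most common encoding first, then fall back. Each retry
    # re-reads the file, and a wrong guess can silently mangle data.
    for encoding in ("utf-8", "cp1252"):
        try:
            with open(path, encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    raise ValueError(f"could not decode {path}")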
Entire Data Set vs. Update
This gets into the weeds a bit, but providing all the data all the time is the easiest way to both provide and parse data. It allows the consumer of the data to reprocess the entire dataset if something goes wrong.
That being said, if the data is truly immense (gigabytes or terabytes), I'd recommend an interval setup: every so often, send the entire data set, and in the meantime send only the records that have changed. BUT send the entire record, not just the fields that changed.
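The "send the whole record" rule keeps the consumer's merge trivially simple. A sketch, assuming records are keyed by an id column:

import csv

def apply_update(current, update_path):
    # 'current' maps id -> full record. Because each changed record
    # arrives whole, the merge is a plain replace, not a field-level
    # patch that has to reason about which columns changed.
    with open(update_path, newline="") as f:
        for row in csv.DictReader(f):
            current[row["id"]] = row
    return current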
Deletes
This is probably the hardest bit to handle. I've seen two ways to handle it, and both have their advantages and drawbacks. In my experience, the "Absence Indicates Delete" approach tends to be less error prone, as long as the consumer handles unexpected source files appropriately.
For the engineers reading this, it's the difference between transferring the entire state of a system (single file) vs. state built from multiple inputs that can get out of sync (delete file).
Absence Indicates Delete
Basically, if a record that already exists on the consumer's side isn't included in the provided CSV, it's deleted (or marked for delete, or hidden; I'm treating these conditions the same for this post).
This has the advantage of not having to parse an additional file.
The disadvantage is that if the provided CSV file isn't what's expected, or something goes wrong on the consumer side, an entire dataset can be deleted.
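Here's a sketch of the consumer side, including the kind of sanity check that guards against the "delete everything" failure mode (the 50% threshold is an arbitrary example):

def ids_to_delete(existing_ids, provided_ids, max_delete_ratio=0.5):
    # Anything we have that the feed no longer mentions is a delete...
    missing = set(existing_ids) - set(provided_ids)
    # ...unless the feed looks truncated or broken, in which case
    # stop loudly rather than wipe out the dataset.
    if existing_ids and len(missing) / len(existing_ids) > max_delete_ratio:
        raise ValueError("feed would delete too many records; refusing")
    return missing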
Delete-only File
Essentially this file would be a list of the records to be deleted.
This has the advantage of being less error prone if the source CSV is bad in some way.
The disadvantage here is twofold. First, you need to provide and parse a second file with a different format than the standard data file. Second, timing: if the consumer sees both the data file and the delete file at the same time and a record is included in both, which one takes precedence, data or delete? And if a record was deleted in the past and now shows up in the data file, should it be created or ignored?
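Whatever you decide, make the precedence rule explicit in code. A sketch that (arbitrarily) lets the delete file win when a record appears in both:

import csv

def apply_files(current, data_path, delete_path):
    with open(data_path, newline="") as f:
        for row in csv.DictReader(f):
            current[row["id"]] = row
    # Applying deletes last encodes the precedence decision:
    # a record present in both files ends up deleted.
    with open(delete_path, newline="") as f:
        for row in csv.DictReader(f):
            current.pop(row["id"], None)
    return current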
Summary
So, what's the upshot of all this? Similar to dealing with timezones, don't build your own CSV generator/parser. Use one from someone who's already gone through all these headaches. Stand on the shoulders of those giants.
If you're a product person reading this, my only advice to you is: listen to your engineers. They know how hard dealing with CSV files is, and they're just trying to get the business result you want given the constraints they're operating under.