Great article. It reminds me of another dilemma I came across earlier this week.
I had a 280MB xml file containing about 120,000 records. I needed a way to parse out a subset of the records (about 15,000) and then put their data into a db.
I was developing on a VPS with only 256MB of RAM, so I wanted to avoid memory-intensive operations. I'm using ruby, so I started looking into how to 'stream' the xml with ruby in such a way that I could check whether each record was the type I wanted to keep, manipulate and store its data, and move on to the next record. The more I looked into that strategy, the more complex and ominous it seemed. I just really wanted to find a simpler way that wouldn't require all the apparent tediousness of streaming the xml file (it may seem simple enough, but there are dependencies you have to install, APIs you have to read and learn, etc.). There were just too many moving pieces, and it made me uneasy about the outcome. Another aspect of it was just pure laziness: I don't care how I get that data into the db - I just want to get it in there.
Fortunately, the glorious wonderfulness of Linux utilities saved me from a long, tedious solution. Behold:
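(Reconstructing this from memory - 'records.xml' is a stand-in for the real filename, and -n 5 is my best guess at the option that produced the five-digit suffixes:)

csplit -n 5 records.xml '/<title_index_item>/' '{*}'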
This splits the large xml file into sub-xml files whose names start with 'xx' by default, followed by an incrementing number - xx10004, for example. It splits the file on the <title_index_item> tag, which is the tag for the items I want. See `man csplit` for more info...
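The second line weeds out the pieces I don't want (the grep pattern below is just a placeholder for whatever string actually identifies the records I'm after):

find . -name 'xx*' | xargs grep -L 'marker-for-records-I-want' | xargs rm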
The `find` lists all the sub-xml files, then we use grep to pick out the files that do not contain what I'm looking for, and delete those. I'm left with a directory of 15,000+ xml files with just the type of data I'm interested in.
find . | wc -l
This command is just so I can track the progress of the operations. Plain `ls` would dump too much output to be useful, so we pipe the file list to `wc -l`, which gives us the number of lines - which in this case is the number of files in the dir.
So: two lines of code on the command line instead of a far more complex ruby/streaming-xml solution. Now I can have a ruby script process each individual file and add the data to the db - a much simpler problem to deal with.
My point is that you can accumulate code debt by going with the seemingly 'right' solution sometimes. There are probably coders that will cringe that I just used a couple of shell commands to do this rather than write up a long, well-documented, properly OOP, TDD, etc "right" solution. However, the "right" solution in that case would have accumulated code debt - more code to maintain, more moving pieces, more things that can go wrong. I'd trade maintaining two lines of shell code over tens or hundreds of lines of ruby code and its dependencies any day of the week.
>There are probably coders that will cringe that I just used a couple of shell commands to do this rather than write up a long, well-documented, properly OOP, TDD, etc "right" solution.
No, I don't think so. This is not technical debt. This is simply the right way to do it. More complicated is NOT better! A lot of people - for some reason, especially people who like classes and MVP and templates - think that more complicated is better, and will write insane amounts of code to do a very simple thing. I think that's wrong.
The reply is usually something along the lines of "it'll make it easier to do this and that later." But they are not doing this and that; they are doing something simple. Add the complexity when you need it - not in advance.
In similar situations I just use a SAX parser, that's what it's for. I don't know about Ruby, but it's quick and easy in Python. And once you've written one SAX application, you can easily use it as the basis for the next.
Totally. If I'd had to do anything much more complex, I might have had to resort to SAX. I was just too lazy to mess with it for this - and I'm a bit OCD sometimes about writing the least amount of code possible ;)
Given that you are programming in Ruby, you can make your shell script part of a Ruby script that then calls the code to process each individual file, no? Ruby gives you the best of both worlds in this case.