And why do you think that? Awk is simply the best tool for many tasks.

moe · on Aug 27, 2011

Name one?

4ad · on Aug 27, 2011

Hacking together something to parse text?

I used it to find a house[1]. I needed to rent two apartments close to each other. I wrote some shell/awk scripts that take several real estate web sites as input, parse them, extract all data about apartments including URL, address and price, pipe the addresses into Google maps API, find the exact coordinates of all apartments, calculate distances between them and produce a list of pairs of apartments sorted by the distance between them. It works incredibly well, it routinely finds houses that are on the same street or apartments that are in the same building.

The first version of the code was written when I was in Vienna, looking for a house. It took me an evening.

The only thing that comes even close to this in power is perl, but I don't like perl, I think it's overkill when you only need to parse text, not do some other computation, and I did all this work under Plan9 which doesn't have perl anyway.

Oh, and what was most useful after all this was creating one liners for doing filtering, like finding pairs that had total cost under some value, total number of rooms above some value and distance between them in some interval. I found this one liners to be easier in awk then perl.

Your argument against awk because is 2011 is the same as saying let's drop C because it's 2011. Both C and awk solve some things well, for other things there are other languages.

[1] http://code.google.com/p/operation-housefinder/

burgerbrain · on Aug 28, 2011

I find that any argument along the lines of "Oh goodness, it's [YEAR] for crying out loud" is generally rubbish. Not always, but often enough that I've noticed the correlation.

I know Perl, I know Ruby, I know Brainfuck and I know Awk. I'm also not a masochist. I don't use Brainfuck but I use the hell out of Awk. Myself and other Awk users like 4ad here aren't using Awk because we're crazy, we've calculated effort/rewards/tradeoffs and come upon a solution.

I find people suggesting that others should or should not use various tools to be incredibly condescending. In doing so you are effectively refusing to recognize others as your peers. If you're Dijkstra and everyone around you has a hardon for GOTO, then be my guest, but you are not.

telemachos · on Aug 27, 2011

I find it quicker and easier to create one-off text filters in awk than Perl or Ruby. That is, I may have a set of csv files for teachers, administrators and students by grade. I want to extract fields x, y and z from all these files, transform the data in some specific ways and then create a single output file (maybe to upload somewhere else or to use as part of a larger program). I certainly could write a script in an interpreted language to do this, but I often find it faster to do it in awk. (There's a sweet spot here: if the job is too complex or I'm going to do it over and over, a script may become the better choice.) I use awk for this kind of thing 3-5 times a week, easily.

jrockway · on Aug 28, 2011

By the time you've dealt with all the special cases of the CSV format, your awk script isn't going to look very clean.

A few lines of Perl with Text::CSV is going to be much more robust, run faster, and take less time to get working.

telemachos · on Aug 28, 2011

(I know your work, and I respect you. I mean the following as a compliment - truly, no sarcasm or snark.)

Sounds to me like you're being a professional programmer. I'm not. I'm a self-taught amateur. I program a little for myself, a little for the school I work for and a bunch to do sysadmin tasks (for myself and the school). I don't worry all that much about the special cases of CSV. I look at the data, massage it a little before if I need to (usually that with awk, sed, etc. as well), then run it through awk and do the job. If the result isn't all perfect, I clean it up quickly by hand using Vim. In this kind of context, 'robust' doesn't mean anything to me. I'm not expecting to do the exact same task ever again. As for running faster, I call bullshit and remind you of this[1]. awk runs plenty fast - seconds or less for such cases. As for less time to get it working, I doubt it. I'm pretty quick with a hackish little awk script.

The bottom line for me is that I like awk. It's defaults make sense to me, it's fast, it's very flexible and powerful. Finding the right Perl module or Ruby gem, learning its API, opening my editor, etc. - that all takes time. For small, one-off jobs, it's not worth it.

[1] http://news.ycombinator.com/item?id=1116224

chromatic · on Aug 28, 2011

As for less time to get it working, I doubt it. I'm pretty quick with a hackish little awk script.

I'm a lot faster not writing code than I am writing code, and when I have to write code, I write code much faster when I let the computer massage and clean up data.

telemachos · on Aug 28, 2011

This is getting pretty silly. The computer doesn't magically massage data itself. You're saying you prefer to use Perl and a module from CPAN to help do that. Ok. I'm saying that in many cases, I prefer to use some combination of standard *nix tools, an editor and awk. The cases I'm thinking of take (total) between 30 second and 5 minutes, start to finish. So I'm just not seeing any argument from "write code much faster".

Again, this isn't robust. It's not professional. No tests in advance. But it works for me day in, day out, every week.

Last thought from me - this is beginning to remind me of this slide from Mark Jason Dominus: http://perl.plover.com/yak/12views/samples/notes.html#sl-3,

"I mentioned this approach on the #perl IRC channel once, and I was immediately set upon by several people who said I was using the wrong approach, that I should not be shelling out for such a simple operation. Some said I should use the Perl File::Compare module; most others said I should maintain a database of MD5 checksums of the text files, and regenerate HTML for files whose checksums did not match those in the database.

I think the greatest contribution of the Extreme Programming movement may be the saying "Do the simplest thing that could possibly work." Programmers are mostly very clever people, and they love to do clever things. I think programmers need to try to be less clever, and to show more restraint. Using system("cmp -s $file1 $file2") is in fact the simplest thing that could possibly work. It was trivial to write, it's efficient, and it works. MD5 checksums are not necessary. I said as much on IRC.

People have trouble understanding the archaic language of "sufficient unto the day is the evil thereof," so here's a modern rendering, from the New American Standard Bible: "Do not worry about tomorrow; for tomorrow will care for itself. Each day has enough trouble of its own." (Matthew 6:34)

People on IRC then argued that calling cmp on each file was wasteful, and the MD5 approach would be more efficient. I said that I didn't care, because the typical class contains about 100 slides, and running cmp 100 times takes about four seconds. The MD5 thing might be more efficient, but it can't possibly save me more than four seconds per run. So who cares?"

chromatic · on Aug 28, 2011

The cases I'm thinking of take (total) between 30 second and 5 minutes, start to finish.

I underestimate how much time it'll take me to do a job manually when I already have great tools to do it the right way all the time.

Maybe you're just that much better a programmer than I am--but I know when I need to parse something like CSV, Text::xSV will get it right without me having to think about whether there are any edge cases in the data at all. (If Text::xSV can't get it right, then there's little chance I will get it right in an ad hoc fashion.)

In the same way, I could write my own simple web server which listens on a socket, parses headers, and dumps files by resolving paths, or I could spend three minutes extending a piece of Plack middleware once and not worrying about the details of HTTP.

Again, maybe I'm a stickler about laziness and false laziness, but I tend to get caught up in the same cleverness MJD rightfully skewers. Maybe you're different, but part of my strategy for avoiding my own unnecessary cleverness is writing as little parsing code as I can get away with.

burgerbrain · on Aug 28, 2011

I could take a few more minutes to use something that makes me not have to worry about CSV edge cases, or I could just recognize that those cases don't apply to me, throw together an awk line in 20 seconds, and get to the pub early and get a pint or two down before my buds show up.

Who cares about what is better on paper? I have a life to live.

berntb · on Aug 28, 2011

Personally, unless I knew the data, I'd have problems enjoying that pint without worrying about data fields containing e.g. ",":s which my naive script would fail on...

burgerbrain · on Aug 28, 2011

In my experience, files have mixed formats far less often than you might suspect. Generally they were kicked out by another script some other bored bloke who just wants to get home wrote himself. If I don't get the results I'm expecting then I'll investigate but the time I save myself by assuming things will work more than offsets that.

Usually the issues I find are things that invalidate the entire file, and indicate a problem further up the stream. In those cases I'm actually glad my script was not robust.

berntb · on Aug 30, 2011

Sorry for coming in late, but... that argument can motivate the use of simpler tools (here, awk instead of perl) in any situation. :-)

(I didn't mention "mixed formats", I mentioned the specific problems of e.g. ",":s (and possibly '"':s) in CSV. You will find such characters even in e.g. names, especially if entered by hand/ocr.)

brendano · on Aug 28, 2011

Alternative: first pipe through a CSV to strict TSV converter, such that "awk -F'\t'" is correct. I do this all the time, because awk is far superior to standalone perl/python scripts for quick queries, tests, and reporting.

(My converter script: https://github.com/brendano/tsvutils/blob/master/csv2tsv )

veyron · on Aug 27, 2011

!x[$1]++

Use that at least 10 times a day

Also, Larry Wall (perl) said "I still say awk '{print $1}' a lot."

burgerbrain · on Aug 28, 2011

Don't forget about `cut -f1` though!

Of course anything more complex than that, but not requiring the complexity of Perl/Ruby is generally best still done with Awk. In my world, that is a lot.

a3_nm · on Aug 28, 2011

cut -f1 and awk '{print $1}' are not equivalent, though: awk does not count empty fields. To be precise:

    pi:~$ echo 'foo  bar' | cut -d ' ' -f 2

    pi:~$ echo 'foo  bar' | awk '{print $2}'
    bar

jsrn · on Aug 28, 2011

Just for the interested reader: when the first command line is changed into

    pi:~$ echo 'foo  bar' | tr -s ' ' | cut -d ' ' -f 2

it also outputs 'bar'. The -s (for "squeeze") option of tr turns every sequence of the specified character (space in this case) into one instance of this character.

Of course, the awk solution is more succint and elegant in this case - I just think that tr -s / cut -d is handy to know from time to time, too.

a3_nm · on Aug 28, 2011

The interested reader should not assume hastily, however, that tr -s is enough to make cut behave exactly like awk. Hint: leading spaces.

    pi:~$ echo ' foo' | tr -s ' ' | cut -d ' ' -f 1 
    
    pi:~$ echo ' foo' | awk '{print $1}'
    foo

dbbo · on Aug 28, 2011

Because it's shorter than `perl -lane 'print $F[0]'`.

CJefferson · on Aug 28, 2011

It's certainly easier to remember.. what's the lane?

jsrn · on Aug 28, 2011

"what's the lane?"

four options squeezed together:

-l: 1. "chomps" the input (which here means: removes newlines (if present) from each line, more general explanation: http://perldoc.perl.org/functions/chomp.html 2. and automatically adds a newline to each output newline (see below how to achieve this in a shorter way with a relatively new Perl feature).

-a: turns on auto-split-mode: this splits the input (like awk) into the @F (for fields) array. The default separator is one (or many) spaces, i.e. it behaves like AWK.

-n: makes Perl process the input in an implicit while loop around the input, which is processed line-wise: "as long as there is another line, process it!"

-e: execute the next command line argument as a Perl program (argument in this case beeing 'print $F[0]').

Note that the example can be shortened if you use -E instead of -e. -E enables all extensions and new features of Perl (which aren't enabled with -e because of backwards compatibility). This allows you to use 'say' instead of 'print' which adds a trailing newline automatically and lets you drop the -l option (if you don't need the 'chomp' behaviour explained above):

    $ perl -anE 'say $F[0]'

Of course, the AWK command line is still shorter - and that's expected, because AWK is more specialized than Perl.

Still, Perl one liners are often very useful and can do other things better / shorter than AWK - especially if the one-liner uses one of the many libraries that are available for Perl.

A thorough reference of all Perl command line options is available at:

    http://perldoc.perl.org/perlrun.html

or just:

    $ man perlrun

CJefferson · on Aug 29, 2011

Thank you, that tempts me to go and look at perl. At the moment I tend to use simple awk, or a full python script. I find python really doesn't lend itself to "one line", or even particularly short, programs however. I keep meaning to go back and look at perl. I was tempted to wait for Perl 6, but I think the time has come to just look at Perl 5 :)

telemachos · on Aug 28, 2011

If you're genuinely curious about -lane, my favorite site for Perl one liners has gone the way of all flesh, but the Wayback Machine can still get you a copy[1].

[1] http://web.archive.org/web/20090602215912/http://sial.org/ho...