Wednesday, May 22, 2019

Awk, Awk!

No, not the sound of a large bird.  I am trying to separate the columns of a csv file.  It does okay until I hit a line with comas inside the quotes.  Here is the command:
awk -F"\"*,\"*" '{print $1,$4,$5,$9}' USATodayList.csv
 Here is a line that it does not behave on:
58,Apr  3 2009 12:00AM,13,Binghamton,NY,N.Y.,42.0986867,-75.9179738,"Wearing a bulletproof vest, Jiverly Wong, 41, entered a citizenship class and fired 98 shots from two handguns. Wong killed 13 people and injured several more before shooting himself.",Shooting,Public Killing,Y,Y,,,,
Field 9 stops after
Wearing a bulletproof vest
Would it make sense to replace all the "," strings with | and use that as the field separator?  But that does not replace commas not in quotes.  How do I replace all commas not in quotes with |?  If I can do this operation first, I think that awk command will work with -F|

This gets less and less fun with time.
sed -e 's/\\"([[:print:]]\),\([[:print:]]\"\)/\1|\2/g' temp.csv

replaces all the commas with | including the ones inside the the quotes. But only from the shell.  From a bash script, it insists that the file is not there, but I can ls it just fine.

Is sed already beyond my abilities?  Should I write a C program instead?

The problem is that I only want to replace commas not in quotes, or all commas not followed by a space.

I seem to be getting there:
sed -e 's/\(\"[[:print:]]\)*.*\(\"[[:print:]]\)/\1|\2/g' filename
Back in the days of the DEC-10, it was claimed that many TECO commands (a text editor for the DEC-10 were indistinguishable from line noise.  (And if you are the period when modems were strictly devices for converting ASCII values to weird sounds, you know what line noise was.)

This still does nothing for commas not in quotes, but I think I see the way to do that next.

Nope, but the good hearted soul that extracted the data as a CSV did it again as a |separated file.

Next step: extract proper nouns (schools, business names, names of victims and killers) then feed them into a search engine automatically and retrieve URLs.  This is going to be harder.

4 comments:

  1. There's no guaranteed solution with awk, but for this specific case, it should be sufficient to replace all commas followed by a space with something unique, then run it through awk, then reverse the transformation:

    sed -e 's/, /__/g' | awk -F, '{print $1,$4,$5,$9}' | sed -e 's/__/, /g'

    -j

    ReplyDelete
  2. I have not tried this myself, but you might want to check out this "advanced feature" of gawk: Defining Fields by Content.

    (Or use Text::CSV in Perl. But that's me.)

    ReplyDelete
  3. 1) sed?

    2) "Some people, when faced with a Unix problem, think 'aha! I know! I'll use sed!' ... now they have TWO problems."

    3) I would probably honestly write a parser.

    ReplyDelete
  4. What is blowing things up--as I'm sure you've figured out--is that you've got commas embedded inside the fields.

    If you know Python, you can use the CSV module to simplify this for you--it (should) notice when a field begins ," and ends ", and treat what is in the middle as a single string.

    Otherwise you're going to need a simple state machine to find commas between " and turn them into something else, then turn them back.

    ReplyDelete