Tuesday, October 2, 2018

Feeling Very sed Stupid Today

Trying to replace all four-digit strings at the start of a line (yes, that's a year with no month and day):
sed -e "s\[:digit][:digit][:digit][:digit]\0/0/&\" massmurder.csv
Using the cygwin sed in a Windows 7 terminal window:
sed -e expression #1, char 57: unterminated 's' command
The s command terminates with the last backslash.  What am I missing from my brain that I still had in 2014?

One reader suggested using \ as a delimiter was the problem because it was read as an escape character.  So I replaced them with =.
I tried a very simple sed command: 

Does cygwin sed just not work?  & should substitute the matching pattern, but does not.  It does not work under Linux either, until I replace [:digit] with [0-9].  Weird.  Still fails under Windows 7.  Beginning to suspect some weird shell issue.  Use " not ' and the problem goes away.

Even under Linux, sed does not work as advertised:
sed -E -i "s!\([0-9]\)\([0-9][0-9][0-9][0-9]\)!\1--\2!"
should grab the first digit into one logical region and the next four digits into another logical region and then output region 1--region 2.  Instead:
invalid reference \2 on 's' command's RHS. 
Do I need to write my own sed?
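
The "invalid reference \2" message is consistent with how GNU sed treats -E: in extended syntax, groups are written with bare parentheses, and \( \) match literal parenthesis characters, so the pattern above defines no groups at all. A minimal sketch of both spellings, assuming GNU sed (or another sed that accepts -E); the sample input is made up for illustration:

```shell
# Sample input invented for illustration: one digit followed by four digits.
input='12018'

# Extended syntax (-E): groups use bare parentheses.
ere_out=$(printf '%s\n' "$input" | sed -E 's!([0-9])([0-9]{4})!\1--\2!')

# Basic syntax (no -E): groups use backslashed parentheses.
bre_out=$(printf '%s\n' "$input" | sed 's!\([0-9]\)\([0-9]\{4\}\)!\1--\2!')

printf '%s\n%s\n' "$ere_out" "$bre_out"
```

Both commands should print 1--2018 for this input; the only difference is which style of parentheses matches the presence or absence of -E.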

Of course, not an entire sed.  A C program to read the CSV file and do the various needed transformations.  Use strtok to break the cells up on the commas, but there are internal commas in some cells, so start out using strrep to replace all commas in the line with a character never used, such as logical or ("|"), then strtok.  When it is time to export the line, use strrep again to put the commas back in.  There is enough work that needs to be done in other parts of the spreadsheet, such as converting lists of firearms types (pistol, shotgun) into booleans, that I may just continue entering data in Access manually; I do find errors along the way.

Just started coding.  Cells that contain commas are enclosed in quotes in a CSV file, so translating all commas to | won't work.  I will have to write something a bit more difficult, looking for opening and closing quotes and commas instead of using strtok.  If inside quotes, then everything inside is a cell, including commas.  No opening quotes: everything until the next comma is a cell.
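
The plan is a C program, but the quote-tracking rule itself is small enough to sketch with awk from the shell, which keeps it next to the sed commands above. This is only a sketch under assumptions: the function name split_csv_line is hypothetical, the sample row is invented, straight " quotes are assumed, and doubled "" escapes or quoted cells spanning several lines are not handled.

```shell
# split_csv_line: print each cell of one CSV line on its own line.
# Hypothetical helper; a real CSV reader must also handle "" escapes.
split_csv_line() {
    printf '%s\n' "$1" | awk '{
        cell = ""; in_quotes = 0
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            if (c == "\"") {
                in_quotes = !in_quotes      # opening or closing quote
            } else if (c == "," && !in_quotes) {
                print cell; cell = ""       # comma outside quotes ends the cell
            } else {
                cell = cell c               # anything else belongs to the cell
            }
        }
        print cell                          # last cell has no trailing comma
    }'
}

# Sample row invented for illustration.
cells=$(split_csv_line '2018,"Some Town, ST",pistol')
printf '%s\n' "$cells"
```

For the sample row this prints three cells, with the comma inside the quoted cell left intact. The same flag-and-accumulate loop carries over to C more or less directly.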

I also have two columns that need work: there is a firearms column and a type of firearm if known column, which may contain several types separated by commas.  The database has a firearm (type unknown) column and rifle, pistol and shotgun columns.  So I have to split the known firearms column into the three boolean columns, and set the firearm unknown column if the types column is empty.  The header line needs similar adjustment.  Of course, a CSV uses left and right quotes instead of ", so I have to be wary of automatic translation.
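
A minimal sketch of that column split, again in shell. Everything named here is an assumption for illustration: the function name firearm_flags, the lower-case spellings rifle, pistol and shotgun, the output order (rifle, pistol, shotgun, unknown), and the idea that the function is only called for rows where a firearm was involved.

```shell
# firearm_flags: turn a types cell such as "pistol, shotgun" into four 0/1 flags.
# Hypothetical helper; output order is rifle,pistol,shotgun,unknown (assumed).
firearm_flags() {
    types=$1
    rifle=0; pistol=0; shotgun=0; unknown=0
    case $types in *rifle*)   rifle=1 ;;   esac
    case $types in *pistol*)  pistol=1 ;;  esac
    case $types in *shotgun*) shotgun=1 ;; esac
    # An empty types cell means the firearm type is not known.
    if [ -z "$types" ]; then
        unknown=1
    fi
    printf '%s,%s,%s,%s\n' "$rifle" "$pistol" "$shotgun" "$unknown"
}

known=$(firearm_flags 'pistol, shotgun')
empty=$(firearm_flags '')
printf '%s\n%s\n' "$known" "$empty"
```

With these inputs the first call yields 0,1,1,0 and the second 0,0,0,1.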


C. Petro said...

ortep@Motte ~: cat t | sed 's\[[:digit:]][[:digit:]][[:digit:]][[:digit:]]\0/0/&\'
Cygwin on Windows 10.

ortep@Motte ~: cat t

ortep@Motte ~: cat t | sed 's\[[:digit:]][[:digit:]][[:digit:]][[:digit:]]\0/0/&\'

Note that the character class is [[:digit:]] not [:digit].

The -e is not required in this use case.

The glyph after the s can be anything:
ortep@Motte ~: cat t | sed 's|[[:digit:]][[:digit:]][[:digit:]][[:digit:]]|0/0/&|'

and you can use \{\} to match a given number of repetitions of the previous match:

ortep@Motte ~: cat t | sed 's|[[:digit:]]\{4\}|0/0/&|'
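
None of the commands above restricts the match to the start of the line, which is what the post set out to do; a ^ anchor at the front of the pattern would do that. A minimal sketch with two invented sample lines, only the first of which begins with a year:

```shell
# Sample lines invented for illustration; only the first starts with four digits.
anchored=$(printf '%s\n' '2018,first row' 'id 2018,second row' |
    sed 's|^[[:digit:]]\{4\}|0/0/&|')
printf '%s\n' "$anchored"
```

The first line becomes 0/0/2018,first row and the second passes through unchanged.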

Clayton Cramer said...

C.Petro: How dare you point out such a simple fix. :-)