Saturday, June 15, 2019

Almost Done With This Script

#!/bin/bash
# extract columns 33 and 34 (sources and URL)
awk -F'\t' '{print $33"\t"$34}' MassMurderCurrent.txt >SourcesURLs.txt
# remove all lines with URLs
sed '/https:/d' SourcesURLs.txt | sed '/http:/d' > Sources.txt
# remove all lines with Roth and Dayton
sed '/Roth and Dayton/d' Sources.txt >SourcesMinusDandR
# replace DOS characters with proper forms (" instead of inward quotes).
# tr reads only standard input, so redirect the previous step's output into it
tr '\221\222\223\224\226\227' '\047\047""--' <SourcesMinusDandR >Dewindowed.txt
sed 's| \([0-9]\),| 0\1,|g' Dewindowed.txt >TwoDigitDates.txt
# convert month, date, year to mm/dd/yyyy
sed 's| Jan. \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 01/\1/\2, |g' TwoDigitDates.txt |\
sed 's| Feb. \([0-9][0-9]\)[,\."] \([0-9][0-9][0-9][0-9]\)[,\."]| 02/\1/\2, |g' |\
sed 's| Mar. \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 03/\1/\2, |g' |\
sed 's| Apr. \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 04/\1/\2, |g' |\
sed 's| May \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 05/\1/\2, |g' |\
sed 's| Jun. \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 06/\1/\2, |g' |\
sed 's| Jul. \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 07/\1/\2, |g' |\
sed 's| July \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 07/\1/\2, |g' |\
sed 's| Aug. \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 08/\1/\2, |g' |\
sed 's| Sep. \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 09/\1/\2, |g' |\
sed 's| Oct. \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 10/\1/\2, |g' |\
sed 's| Nov. \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 11/\1/\2, |g' |\
sed 's| Dec. \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 12/\1/\2, |g' >mmddyyyy.txt
# extract article title
sed 's|sources\tURL|title\tjournal and date|g' mmddyyyy.txt |\
sed 's|"\([A-Za-z0-9 ]*\),"\([A-Za-z0-9 ]*\), \([0-9/]*\), \([0-9]*\)|"\1," \t\2,\t\3,\t\4|g' | sed 's/""/"/g' >articleColumn1.txt

(Yes, I could pipe all these together, but seeing the intermediate files is useful for debugging.)  The only remaining problem is that I cannot figure out how to convert the Windows ellipsis character to three periods.  Windows has several special characters (e.g., left quote, right quote, right apostrophe), which is why the above script has:

# replace DOS characters with proper forms (" instead of inward quotes).
tr '\221\222\223\224\226\227' '\047\047""--'

But I cannot find any description of how to enter the ellipsis character (\205).  I read that:

sed -e 's|`$(echo "\0205")`|...|g'

should do it.  But it does not.  \0205 is not matching it.
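
One likely reason: inside single quotes the shell performs no command substitution, so sed receives the backticks and $(echo ...) as literal text rather than the byte.  tr cannot help either, since it maps one byte to exactly one byte.  A minimal sketch of an alternative, assuming bash (for its $'...' quoting) and a sed that processes raw bytes under LC_ALL=C; fix_ellipsis is a hypothetical helper name, not part of the script above:

```shell
#!/bin/bash
# Hypothetical helper: replace the Windows-1252 ellipsis byte (octal 205)
# with three ASCII periods.
fix_ellipsis() {
    # $'...' is bash ANSI-C quoting: \205 becomes the actual byte before
    # sed ever sees the expression.  LC_ALL=C makes sed treat its input
    # as raw bytes instead of rejecting \205 as invalid UTF-8.
    LC_ALL=C sed $'s/\205/.../g'
}

# Sample line containing the ellipsis byte.
printf 'Wait\205 what?\n' | fix_ellipsis
```

In the script this would slot in as one more pipeline stage after the tr step, reading from and writing to whichever intermediate files are convenient.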

Solution: in Excel, Replace ALT-0133 (numeric keypad) with three periods.  It then saves as periods.

So why does this not work?
$ echo '"Horrid Barbarity,"' | sed 's|"\(A-z0-9]*\),"|\1$|g'

"Horrid Barbarity,"
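
A possible explanation: the bracket expression is missing its opening [, so sed looks for the literal text A-z0-9 followed by any number of ] characters, and nothing matches.  Even with the bracket restored, the class would need a space in it to get past the gap in "Horrid Barbarity".  A sketch of a corrected version, assuming the goal is to capture the title without its quotes (A-Za-z is swapped in for A-z, which in the C locale also matches a few punctuation characters):

```shell
# Opening [ restored, and a space added inside the bracket expression
# so that multi-word titles match.
result=$(echo '"Horrid Barbarity,"' | sed 's|"\([A-Za-z0-9 ]*\),"|\1$|g')
echo "$result"   # prints: Horrid Barbarity$
```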


1 comment:

  1. I am not a programmer, I just notice when something seems out of pattern.

    sed 's| Feb. \([0-9][0-9]\)[,\."] \([0-9][0-9][0-9][0-9]\)[,\."]| 02/\1/\2, |g' |\
    sed 's| Mar. \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 03/\1/\2, |g' |\
    sed 's| Apr. \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 04/\1/\2, |g' |\
    sed 's| May \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 05/\1/\2, |g' |\ (doesn't MAY need a period after?)
    sed 's| Jun. \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 06/\1/\2, |g' |\
    sed 's| Jul. \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 07/\1/\2, |g' |\
    sed 's| July \([0-9][0-9]\)[,\.] \([0-9][0-9][0-9][0-9]\)[,\."]| 07/\1/\2, |g' |\ (isn't this just the line before??)
