There is:
http://www.google.com/search?as_q=Arturo+Ibarra+Raul+Segura-Rodriguez
This explains the codes.
Now I just need to finish the awk that extracts all capitalized words out of the line.
This should print every line containing a word that starts with a capital letter followed by lower-case letters, hyphens or apostrophes (\y is gawk's word-boundary operator):
awk -F'\t' '/\y[A-Z]+[-'\''a-z]*/ {print $0}' USATodayList.txt
Use sed:
sed 's/\(\b[-a-z0-9[:punct:]]*\)//g' USATodayList.txt | sed 's/\bAM//g' | sed 's/\bPM//g'
I still need to delete capitalized words after quotes. These are the starts of sentences.
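For the extraction step itself, here is a minimal sketch that prints the capitalized words one per line instead of deleting everything else. The function name capitalized_words is made up for illustration, and "capitalized word" is assumed to mean a token that starts with a capital and contains at least one lower-case letter, so AM and PM drop out on their own. It does nothing about sentence-initial words.

```shell
capitalized_words() {
    # one token per line: turn every run of blanks/tabs into a single newline
    tr -s '[:blank:]' '\n' |
    # strip non-letters (quotes, periods, digits, $) from both ends of each token
    LC_ALL=C sed 's/^[^A-Za-z]*//; s/[^A-Za-z]*$//' |
    # keep tokens that start with a capital and contain a lower-case letter;
    # hyphens and apostrophes are allowed inside the word
    LC_ALL=C grep -E "^[A-Z][-'A-Za-z]*[a-z][-'A-Za-z]*\$"
}
```

Something like cut -f9 USATodayList.txt | capitalized_words would then list the candidates from the description field, assuming field 9 is the description as in the awk commands here.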
I have a script that does the required filtering:
#!/bin/bash
# extract city, and description fields
awk -F'\t' '{print $4,"\011",$9}' USATodayList.txt | \
# remove all words entirely lower case
sed 's/\(\b[-a-z0-9[:punct:]]*\)//g' |\
# remove all AM and PM strings
sed 's/AM//g' | sed 's/PM//g' |\
# remove all $ and quotes
sed 's/$[[:print:]]//g'| tr -d '"'|\
# remove certain high frequency words not likely to improve matching
sed 's/\b\".\b//g' | sed 's/Police//g' | sed 's/He//g' | sed 's/Him//g' |\
sed 's/The//g' | sed 's/Three//g' | sed 's/Four//g' | sed 's/Five//g' | sed 's/Suspects//g' | sed 's/Prosecutors//g' |\
sed 's/Deputies//g' |\
# remove hyphens
tr -d '-' |\
# remove all \ following a blank
sed 's/ \\//g'
Now for what should be easy (I should not say that): combining all those keywords into search queries.
#!/bin/bash
# extract city, and description fields
# skip first line
tail -n +2 USATodayList.txt |\
awk -F'\t' '{print $4,"\011",$9}' |\
sed 's/\"\\"/"/g' |\
# remove all words entirely lower case
sed 's/\b[-a-z0-9]*//g' |tee nolower.txt|\
# remove all AM and PM strings
sed 's/AM\|PM//g' |\
# remove all $ commas and quotes
sed 's/$[[:print:]]//g'| tr -d '",.' |tee nopunct.txt|\
# remove certain high frequency words not likely to improve matching
sed 's/\b\".\b//g' | sed 's/Police \|He \|Him \|His \|She \|Her \|Hers//g' |\
sed 's/The \|Two \|Three \|Four \|Five \|Suspects \|Prosecutors \|Deputies \|A \|There \|Before \|An \|When \|In \|Investigators \|Just \|But \|According \| - //g' |\
# remove double hyphens and double periods
sed -e 's/--//g' | sed 's/\.\.//g'|\
# remove all \ following a blank
sed 's/ \\//g' |\
# and all apostrophes or commas remaining
tr -d "',;" |\
# replace all superfluous blanks and tabs
sed -e ':loop' -e 's/\t\t/+/g' -e 't loop' | sed -e ':loop' -e 's/[[:blank:]][[:blank:]]/+/g' -e 't loop' | tee superfluous.txt|\
# remove superfluous +s
sed -e ':loop' -e 's/++/+/g' -e 't loop' |tee superpluses|\
# and the mysterious +blank
sed -e 's/+[[:blank:]]/+/g' |\
# and the trailing blank
sed -e 's/+$//g' | tee trail.txt|\
sed 's/ /+/g' |tee final.txt|\
# now build the URLs
awk '{print "http://www.google.com/search?as_q="$0}' >>urls.txt
And it works! I now have search URLs for every entry in the USA Today database of mass murders.
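The blank-collapsing and URL-building steps at the end of the script could probably be condensed. This is only a sketch, and the function name keywords_to_url is invented here: tr -s turns every run of blanks and tabs into a single +, which stands in for the sed loops, and the awk step skips lines that end up empty. It does not URL-encode characters such as & or #, so it assumes the keywords are already plain words.

```shell
keywords_to_url() {
    # translate blanks/tabs to + and squeeze repeated + into one
    tr -s '[:blank:]' '+' |
    # trim any + left at the start or end of the line
    sed 's/^+*//; s/+*$//' |
    # prefix the search URL; NF skips lines with no keywords left
    awk 'NF {print "http://www.google.com/search?as_q=" $0}'
}
```

For example, printf 'Arturo  Ibarra\tRaul Segura-Rodriguez \n' | keywords_to_url should give the same kind of URL as the one quoted at the top.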
The more I look at it, the more fearful I am of changing anything. James Gosling, author of Gosling Emacs, once described PostScript as a write-only language. I largely agree with him, but sed isn't much better.
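One thing that might make the stop-word step less scary to touch is keeping the words in a single list and anchoring them at word boundaries. A rough sketch, assuming GNU sed (for -E and \b): the STOPWORDS variable and the remove_stopwords name are hypothetical, and the list is just a sample of the words used above.

```shell
# hypothetical stop-word list; add or remove words here only
STOPWORDS='Police|He|Him|His|She|Her|Hers|The|Two|Three'

remove_stopwords() {
    # \b on both sides means "He" is removed but "Helen" is left alone;
    # the blanks after a removed word go with it
    sed -E "s/\b(${STOPWORDS})\b[[:blank:]]*//g"
}
```

With this, printf 'Police said Helen saw Him near The Hague\n' | remove_stopwords leaves Helen and Hague intact, which the plain s/He//g form would not.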
I just sent you an email with my results of creating the URLs. Hopefully this will still be of some use...