Tuesday, June 25, 2019

Today's Obscure Question

Say that I have several keywords such as "Lake Charles" and "Lee Roy Williams" that I want to do a search for, without having to enter each keyword by hand.  I will have a list of keywords produced by an awk script.  I would like to produce a series of searches (using whatever search engine is most familiar to you), so that I will double click a URL in a list of URLs, and have it take me to the list of matches.  There must be a way.

There is:
http://www.google.com/search?as_q=Arturo+Ibarra+Raul+Segura-Rodriguez
This explains the codes.
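Since the keywords will come out of an awk script anyway, the URL can be assembled in the same pass. Here is a minimal sketch, assuming one keyword phrase per input line and phrases made up only of letters, digits, blanks and hyphens (anything else would need proper percent-encoding). The function name build_urls is made up for illustration.

```shell
#!/bin/bash
# build_urls: hypothetical helper. Reads keyword phrases on stdin, one per
# line, and prints one Google search URL per non-empty phrase.
build_urls() {
    awk '{
        gsub(/[[:blank:]]+/, "+")       # each run of blanks becomes one +
        gsub(/^[+]+|[+]+$/, "")         # drop any leading or trailing +
        if ($0 != "") print "http://www.google.com/search?as_q=" $0
    }'
}

# Example run with the two keywords from the question
printf '%s\n' "Lake Charles" "Lee Roy Williams" | build_urls
```

Each output line is a URL that can be clicked or pasted into a browser.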

Now I just need to finish the awk script that extracts all capitalized words from each line.

This should print every line containing a word that starts with a capital letter, optionally followed by lower case letters, hyphens or apostrophes:
awk -F'\t' '/(^|[^[:alnum:]])[A-Z]+[-'\''a-z]*/ {print $0}' USATodayList.txt
Use sed:
sed 's/\(\b[-a-z0-9[:punct:]]*\)//g' USATodayList.txt |sed 's/\bAM//g' | sed 's/\bPM//g'
I still need to delete capitalized words that come right after a quote mark.  These are just the start of a sentence.
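One possible way to handle that remaining step, sketched under assumptions: GNU grep's -o option prints each match on its own line, so the capitalized words can be listed directly, and a capitalized word that directly follows a double quote can be stripped first with sed. The function name capitalized_words and the sample sentence are invented for illustration, and \b is a GNU extension.

```shell
#!/bin/bash
# capitalized_words: hypothetical helper. Prints each capitalized word found
# on stdin, one per line, after dropping any capitalized word that directly
# follows a double quote (treated here as the start of a quoted sentence).
# LC_ALL=C keeps [A-Z] and [a-z] limited to ASCII letters.
capitalized_words() {
    LC_ALL=C sed -E "s/\"[A-Z][-'a-z]*/\"/g" |
    LC_ALL=C grep -oE "\b[A-Z][-'a-z]+"
}

# Example run on an invented sentence
printf '%s\n' 'Police said "The suspect fled Lake Charles," according to Lee Roy Williams.' |
    capitalized_words
```

Note that this still keeps the first word of an unquoted sentence (Police in the sample), so a stop-word list is needed as well.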

I have a script that does the required filtering:
#!/bin/bash
#  extract city, and description fields
awk -F'\t' '{print $4,"\011",$9}' USATodayList.txt | \
# remove all words entirely lower case
sed 's/\(\b[-a-z0-9[:punct:]]*\)//g' |\
# remove all AM and PM strings 
sed 's/AM//g' | sed 's/PM//g' |\

# remove all $ and quotes
sed 's/$[[:print:]]//g'| tr -d '"'|\
# remove certain high frequency words not likely to improve matching
sed 's/\b\".\b//g' | sed 's/Police//g'| sed 's/He//g' | sed 's/Him//g' |\
sed 's/The//g' | sed 's/Three//g' |sed 's/Four//g' | sed 's/Five//g' | sed 's/Suspects//g' | sed 's/Prosecutors//g' |\
sed 's/Deputies//g' |\
# remove hyphens
tr -d '-'|\
# remove all \ following a blank
sed 's/ \\//g' 
Now for what should be the easy part (I should not say that): combining all those keywords into search queries.
#!/bin/bash
#  extract city, and description fields
# skip first line
tail -n +2 USATodayList.txt |\
awk -F'\t' '{print $4,"\011",$9}' | \
sed 's/\"\\"/"/g' |\
# remove all words entirely lower case
sed 's/\b[-a-z0-9]*//g' |tee nolower.txt|\
# remove all AM and PM strings
sed 's/AM\|PM//g' |\
# remove all $ commas and quotes
sed 's/$[[:print:]]//g'| tr -d '",.' |tee nopunct.txt|\
# remove certain high frequency words not likely to improve matching
sed 's/\b\".\b//g' | sed 's/Police \|He \|Him \|His \|She \|Her \|Hers//g' |\
sed 's/The \|Two \|Three \|Four \|Five \|Suspects \|Prosecutors \|Deputies \|A \|There \|Before \|An \|When \|In \|Investigators \|Just \|But \|According \| - //g' |\
# remove double hyphens and double periods
sed -e 's/--//g' | sed 's/\.\.//g'|\
# remove all \ following a blank

sed 's/ \\//g' |\
# and all apostrophes or commas remaining
tr -d "',;" |\
# replace all superfluous blanks and tabs 
sed -e ':loop' -e 's/\t\t/+/g' -e 't loop' | sed -e ':loop' -e 's/[[:blank:]][[:blank:]]/+/g' -e 't loop' | tee superfluous.txt|\
# remove superfluous +s
sed -e ':loop' -e 's/++/+/g' -e 't loop' |tee superpluses|\
# and the mysterious +blank
sed -e 's/+[[:blank:]]/+/g' |\
# and the trailing blank
sed -e 's/+$//g' | tee trail.txt|\
sed 's/ /+/g' |tee final.txt|\
# now build the URLs
awk '{print "http://www.google.com/search?as_q="$0}' >>urls.txt

And it works!  I now have search URLs for every entry in the USA Today database of mass murders.
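The steps near the end of the script that replace superfluous blanks and tabs and then strip repeated +s can be tried in isolation. A rough sketch, assuming a sed that accepts -E (GNU and BSD sed both do); it uses a single bracket expression instead of the loops, and the name join_with_plus is hypothetical.

```shell
#!/bin/bash
# join_with_plus: hypothetical helper. Turns every run of blanks, tabs and
# + signs into a single +, then trims one leading or trailing +.
join_with_plus() {
    sed -E -e 's/[[:blank:]+]+/+/g' -e 's/^[+]//' -e 's/[+]$//'
}

# Example run on an invented line with stray tabs, blanks and a loose +
printf 'Lake Charles \t\t Lee  Roy + Williams \n' | join_with_plus
```

Because the + quantifier already matches a whole run, no :loop / t loop pair is needed for this step.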
The more I look at it, the more fearful I am of changing anything.  James Gosling, author of Gosling Emacs, once described PostScript as a write-only language.  I largely agree with him, but sed isn't much better.
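For comparison, much of the filtering could be sketched as a single awk program, which might be less frightening to change later. This is not the script above and will not produce identical output. It assumes the same tab-separated layout (city in field 4, description in field 9, one header line), the stop-word list is only a subset, and the name keyword_urls is made up.

```shell
#!/bin/bash
# keyword_urls: hypothetical single-pass variant. Reads the tab-separated
# list on stdin and prints one search URL per data row.
keyword_urls() {
    LC_ALL=C awk -F'\t' '
    BEGIN {
        # stop words: capitalized, but unlikely to improve matching
        n = split("Police He Him His She Her Hers The Two Three Four Five AM PM", w, " ")
        for (i = 1; i <= n; i++) stop[w[i]] = 1
    }
    NR > 1 {                                  # skip the header line
        text = $4 " " $9
        gsub(/[^A-Za-z \t-]/, "", text)       # drop digits and punctuation
        q = ""
        m = split(text, word, /[ \t]+/)
        for (i = 1; i <= m; i++)
            if (word[i] ~ /^[A-Z]/ && !(word[i] in stop))
                q = q (q == "" ? "" : "+") word[i]
        if (q != "") print "http://www.google.com/search?as_q=" q
    }'
}

# Example: keyword_urls < USATodayList.txt >> urls.txt
```

Keeping the stop words in one string means adding a word is a one-word edit rather than another sed stage.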








1 comment:

Unknown said...

I just sent you an email with my results of creating the URLs. Hopefully this will still be of some use...