Clayton Cramer.: Today's Obscure Question

Tuesday, June 25, 2019

Today's Obscure Question

Say that I have several keywords such as "Lake Charles" and "Lee Roy Williams" that I want to do a search for, without having to enter each keyword by hand. I will have a list of keywords produced by an awk script. I would like to produce a series of searches (using whatever search engine is most familiar to you), so that I will double click a URL in a list of URLs, and have it take me to the list of matches. There must be a way.

There is:
http://www.google.com/search?as_q=Arturo+Ibarra+Raul+Segura-Rodriguez
This explains the codes.
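A minimal sketch of how such a URL could be assembled from a list of keywords, assuming the keywords contain only letters, digits, hyphens and blanks (anything else would need proper percent-encoding). The function name make_search_url is hypothetical, chosen only for this illustration:

```shell
#!/bin/bash
# Hypothetical helper: turn one or more keyword phrases into a Google search URL.
# Assumption: keywords hold only letters, digits, hyphens and blanks.
make_search_url() {
    local query="$*"
    # squeeze runs of blanks, then turn each remaining blank into +
    query=$(printf '%s' "$query" | tr -s ' ' | tr ' ' '+')
    printf 'http://www.google.com/search?as_q=%s\n' "$query"
}

make_search_url "Lake Charles" "Lee Roy Williams"
```

Each phrase is passed as a separate argument; the function joins them with blanks before converting the blanks to plus signs.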

Now I just need to finish the awk script that extracts all capitalized words from each line.

This should print every line containing a word that starts with a capital letter followed by lower case letters, hyphens or apostrophes:

awk -F'\t' '/[A-Z]+[-'\''a-z]*/ {print $0}' USATodayList.txt
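Since {print $0} emits the whole matching line, here is a rough sketch of one way to print only the capitalized words themselves. It assumes a "capitalized word" means a token that starts with a capital letter and contains at least one lower case letter (so AM and PM are skipped); the function name extract_capitalized is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: print only the capitalized words found on each input line.
extract_capitalized() {
    awk '{
        # split the line on anything that is not a letter, apostrophe or hyphen
        n = split($0, tok, /[^A-Za-z'\''-]+/)
        out = ""
        for (i = 1; i <= n; i++)
            # keep tokens that start with a capital and contain a lower case letter
            if (tok[i] ~ /^[A-Z]/ && tok[i] ~ /[a-z]/)
                out = out (out == "" ? "" : " ") tok[i]
        if (out != "") print out
    }'
}

printf 'police said Lee Roy Williams was arrested in Lake Charles at 9 PM\n' | extract_capitalized
```

The '\'' sequence is the usual way to place an apostrophe inside a single-quoted awk program.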

Use sed to strip the all-lowercase words, then the AM and PM markers:

sed 's/\b[-a-z0-9[:punct:]]*//g' USATodayList.txt | sed 's/\bAM\b//g' | sed 's/\bPM\b//g'

I still need to delete capitalized words that come right after a quotation mark. These are capitalized only because they start a sentence.
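One possible way to do that, sketched under the assumption that the quotes are plain double quotes and that the word directly after a quote is never a name worth keeping (it would drop a proper name there too). The function name drop_word_after_quote is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: delete a capitalized word that directly follows a double quote.
drop_word_after_quote() {
    # match a quote, one capital letter and any lower case letters; keep only the quote
    sed 's/"[A-Z][a-z]*/"/g'
}

printf '%s\n' 'Witnesses said "The gunman fled toward Lake Charles"' | drop_word_after_quote
```

A closing quote is left alone because it is normally followed by a blank or the end of the line rather than a capital letter.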

I have a script that does the required filtering:

#!/bin/bash

# extract city and description fields
awk -F'\t' '{print $4,"\011",$9}' USATodayList.txt |
# remove all words entirely lower case
sed 's/\b[-a-z0-9[:punct:]]*//g' |
# remove all AM and PM strings
sed 's/\bAM\b//g' | sed 's/\bPM\b//g' |
# remove all $ and quotes
sed 's/\$[[:print:]]//g' | tr -d '"' |
# remove certain high frequency words not likely to improve matching
# (word boundaries keep "He" from eating the start of names like "Helen")
sed 's/\b\".\b//g' |
sed 's/\b\(Police\|He\|Him\|The\|Three\|Four\|Five\|Suspects\|Prosecutors\|Deputies\)\b//g' |
# remove hyphens
tr -d '-' |
# remove all \ following a blank
sed 's/ \\//g'

Now for what should be easy (I should not say that): combining all those keywords into search queries.

#!/bin/bash

# skip first line
tail -n +2 USATodayList.txt |
# extract city and description fields
awk -F'\t' '{print $4,"\011",$9}' |
sed 's/\"\\"/"/g' |
# remove all words entirely lower case
sed 's/\b[-a-z0-9]*//g' | tee nolower.txt |
# remove all AM and PM strings
sed 's/AM\|PM//g' |
# remove all $, commas, periods and quotes
sed 's/\$[[:print:]]//g' | tr -d '",.' | tee nopunct.txt |
# remove certain high frequency words not likely to improve matching
sed 's/\b\".\b//g' | sed 's/Police \|He \|Him \|His \|She \|Her \|Hers//g' |
sed 's/The \|Two \|Three \|Four \|Five \|Suspects \|Prosecutors \|Deputies \|A \|There \|Before \|An \|When \|In \|Investigators \|Just \|But \|According \| - //g' |
# remove double hyphens and double periods
sed -e 's/--//g' | sed 's/\.\.//g' |
# remove all \ following a blank
sed 's/ \\//g' |
# and all apostrophes, commas or semicolons remaining
tr -d "',;" |
# replace all superfluous blanks and tabs
sed -e ':loop' -e 's/\t\t/+/g' -e 't loop' | sed -e ':loop' -e 's/[[:blank:]][[:blank:]]/+/g' -e 't loop' | tee superfluous.txt |
# remove superfluous +s
sed -e ':loop' -e 's/++/+/g' -e 't loop' | tee superpluses |
# and the mysterious +blank
sed -e 's/+[[:blank:]]/+/g' |
# and the trailing +
sed -e 's/+$//g' | tee trail.txt |
# turn any remaining blanks into +
sed 's/ /+/g' | tee final.txt |
# now build the URLs
awk '{print "http://www.google.com/search?as_q="$0}' >>urls.txt

And it works! I now have search URLs for every entry in the USA Today database of mass murders.
The more I look at it, the more fearful I am of changing anything. James Gosling, author of Gosling Emacs, once described PostScript as a write-only language. I largely agree with him, but sed isn't much better.
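For what it is worth, the stages that collapse blanks, tabs and repeated plus signs could probably be expressed more compactly with tr -s. This is only a sketch, not a tested drop-in replacement for the pipeline above; it assumes each input line holds just the surviving keywords separated by arbitrary runs of blanks and tabs, and the function name join_keywords is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: join the keywords on each line with single + signs.
join_keywords() {
    # map blanks and tabs to +, squeezing each run of them into a single +
    tr -s ' \t' '++' |
    # drop a leading or trailing + left over from blanks at the line edges
    sed 's/^+//; s/+$//'
}

printf 'Lake  Charles \t Lee Roy  \n' | join_keywords
```

Because tr -s squeezes repeats after translating, no loop is needed to collapse the runs.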

1 comment:

wb, June 27, 2019 at 1:02 PM
I just sent you an email with my results of creating the URLs. Hopefully this will still be of some use...