#374299 - 23/03/2024 00:32
Help with super weird Bash bug?
|
carpal tunnel
Registered: 20/12/1999
Posts: 31600
Loc: Seattle, WA
|
I run some Bash scripts on my Synology to maintain my birdwatching camera. I'm having a weird problem with one of them, and the kind folks on this BBS have taught me so much about shell scripting in Linux before. I hope y'all can help here? I've tried to pare down the problem to the simplest and clearest possible example, included in the code below.
Basic description:
The script uses curl to retrieve this web page: http://www.google.com/search?q=sunrise - Google uses the calling browser's IP address to locate the user on the globe, then returns today's sunrise time at the top of the search results, in big bold text. The script parses that resulting web page for the first thing that looks like a time-of-day and prints it out. I actually have more complex code that retrieves both sunrise and sunset, but I've simplified this code to just look for sunrise, so I can show the example clearly.
This all used to work fine. But recently, Google changed their output so that instead of outputting a simple "7:05 AM" for example, they had to get fancy and change the space in the middle to a unicode nonbreaking space (code 0x202f). This made me change the parser a little bit so that it has to parse for any single character in between the digits and the AM/PM. The change to the Google results, and subsequently, the change to the parser, is when the trouble started.
The weird thing is that the parser works in every test attempt that I make by hand. Every attempt I make to debug the thing in an easy/simple way, it never fails. It only fails in this one situation that's hard to debug.
The problem:
- When I run the script at the Mac shell prompt, or when I SSH into the Synology and run it from the shell prompt there, it works perfectly.
- Only when I run the script as a task in the Synology Task Scheduler, only then, it fails.
- It fails in the weirdest way.
Failure details (when it fails):
- In the Task Scheduler setup, I pipe the script's output into a log file, to see what's going wrong. The "Run Command" is a user-defined script that looks like this:
bash "/volume1/homes/admin/CrowCam/TimeTest.sh" > "/volume1/homes/admin/CrowCam/TimeTest.log" 2>&1
- When I run that Task in the Synology Task Scheduler, it fails, and then I grab TimeTest.log and load it up in my Sublime Text editor.
- I see the log: The first grep statement in the script has returned nothing, a null string. (But only when running the script from the Synology Task Scheduler.)
- The string it's grepping into is fine. The entire web page is indeed printed there in the log, including the string it's grepping for, with 0x202f in the middle as expected.
- When I use Sublime's regex search feature, using the same regex search that the grep statement uses, it succeeds and finds the thing that the grep couldn't find. So I know that the google part of the script is correctly returning the expected data, and that the data is greppable.
The weird part is that when I SSH into the Synology and run the same script unchanged (regardless of whether I "sudo" or I launch it with Bash or with Sh), it all always works perfectly, no failure.
Any ideas what could cause that Grep to fail? Here's the test code:
#!/bin/bash
# ---------------------------------------------------------------------------
# TimeTest.sh - Retrieve the time of the sunrise from Google.
#
# Google uses the IP address of the user to find the location and then prints
# the sunrise time on the screen for the user's location, at the top of its
# search results.
# ---------------------------------------------------------------------------
# The User Agent string is required in order for Google to return a result.
# Without the User Agent string, Google prints an error message.
userAgent="Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) \
Chrome/51.0.2704.103 Safari/537.36"
googleQueryUrl="http://www.google.com/search?q=sunrise"
echo ""
echo "Performing Google query: $googleQueryUrl"
googleQueryResult=
googleQueryResult=$( curl -L -A "$userAgent" -s "$googleQueryUrl" )
# Print the HTML page result to prove we got it (sorry it's so big).
echo ""
echo "Google Query Result: $googleQueryResult"
echo ""
# Sunrise time is the first thing in the result which looks like a time string.
# For example "7:05 AM" appears in the results near the top. However, the
# "7" might sometimes be two digits, and the space between the "05" and the "AM"
# is a unicode nonbreaking space, code 0x202f. So the regex below searches for
# one or more digits, a colon, two more digits, any single character such as the
# nonbreaking space or a regular space, then AM or PM.
timeWithWeirdSpaceInTheMiddle=$( echo $googleQueryResult | \
grep -o -m 1 '[0-9][0-9]*:[0-9][0-9].[AP]M' )
# Convert this back into a more usable time string by parsing out the time
# digits and the AM/PM, and then re-inserting a regular space between them.
firstTimeSection=$(echo $timeWithWeirdSpaceInTheMiddle | grep -o '[0-9][0-9]*:[0-9][0-9]')
secondTimeSection=$(echo $timeWithWeirdSpaceInTheMiddle | grep -o '[AP]M')
finalSunriseString="$firstTimeSection $secondTimeSection"
# Output all of the things that were retrieved from the parsing.
echo "timeWithWeirdSpaceInTheMiddle: $timeWithWeirdSpaceInTheMiddle"
echo "firstTimeSection: $firstTimeSection"
echo "secondTimeSection: $secondTimeSection"
# Exit with an error if either of the time sections came up empty.
if [ -z "$firstTimeSection" ] || [ -z "$secondTimeSection" ]
then
echo ""
echo "ERROR: Time was not correctly retrieved"
echo ""
exit 1
else
echo ""
echo "finalSunriseString: $finalSunriseString"
echo ""
exit 0
fi
|
#374300 - 23/03/2024 00:56
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31600
Loc: Seattle, WA
|
It always helps one debug one's own code by trying to explain the code to somebody else. I have a workaround now. I'm not sure WHY this workaround succeeds, or more specifically, why it's needed in that specific situation, but here it is. I had to change the grep command to this:
grep -o -E -m 1 '[0-9][0-9]*:[0-9][0-9].{1,3}[AP]M'
What I did was turn on Extended regex ("-E"), then allow for 1-3 of any character between the time digits and the AM/PM part (".{1,3}"). What this tells me is that, when running under the Synology Task Scheduler, that weird unicode space is acting like multiple characters to the regex. Which sort of makes sense. But then why didn't it also fail at the shell when I ran the same file from the same location after SSH'ing into the Synology?
|
#374301 - 23/03/2024 13:58
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14496
Loc: Canada
|
When I do the same Google access here, the sunrise time shows up in a different format. It looks like this: 6:58 a.m.
Sunset comes out like this: 7:16 pm
You may or may not also have to cope with those someday.
|
#374302 - 23/03/2024 19:55
Re: Help with super weird Bash bug?
[Re: mlord]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31600
Loc: Seattle, WA
|
Very interesting! Indeed when I spoof my location in the browser and point it to Montreal, I get the other format that you describe. Seems to be a localization thing; the Canadian version of the formatting prefers the lower case with the dots. Strange that pm is different from a.m. in that regard. Thanks for pointing that out!
|
#374303 - 25/03/2024 07:42
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 13/07/2000
Posts: 4180
Loc: Cambridge, England
|
Get the script to echo the value of $LC_ALL. Looks like when it runs as you, it's in a Unicode locale (e.g. "en_us.UTF-8" with UTF-8 support) but when it runs under the scheduler it's in a legacy locale (probably "C") where U+202F (i.e., the bytes 0xE2 0x80 0xAF) appears as three unrelated top-bit-set characters.
If that's the issue, then you can force grep to use a proper locale by setting LC_ALL=en_us.UTF-8 yourself in your script.
Peter
Edited by peter (25/03/2024 07:43)
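A minimal sketch of the effect described above, using a made-up sample string instead of the real Google page and assuming a reasonably recent GNU grep. The byte-oriented C locale is forced explicitly, so each "." in the pattern has to match exactly one byte:

```shell
# U+202F is encoded in UTF-8 as the three bytes E2 80 AF (octal 342 200 257).
# The sample string is hypothetical, not actual Google output.
sample=$(printf '7:05\342\200\257AM')

# In the C locale a single "." cannot span all three bytes, so the
# original pattern finds nothing.
single=$(printf '%s\n' "$sample" | LC_ALL=C grep -o '[0-9][0-9]*:[0-9][0-9].[AP]M' || true)

# Allowing one to three bytes between the digits and AM/PM matches again,
# which mirrors the {1,3} workaround. -c prints the number of matching lines.
multi=$(printf '%s\n' "$sample" | LC_ALL=C grep -c -E '[0-9]{1,2}:[0-9]{2}.{1,3}[AP]M' || true)

echo "single-dot match: [$single]"
echo "lines matched with .{1,3}: $multi"
```

Under a UTF-8 locale the same three bytes decode to one character, so the single-dot pattern would be expected to match; the exact locale name to use differs between systems.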
|
#374304 - 25/03/2024 17:24
Re: Help with super weird Bash bug?
[Re: peter]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31600
Loc: Seattle, WA
|
Interesting! Indeed the value of $LC_ALL is different, depending on whether I'm running it from the shell or running it from task manager.
Running it from the shell, I get "en_US.utf8" and running it from task manager I get a blank/null value.
|
#374305 - 25/03/2024 17:38
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31600
Loc: Seattle, WA
|
If that's the issue, then you can force grep to use a proper locale by setting LC_ALL=en_us.UTF-8 yourself in your script.
I tried all of these:
LC_ALL=en_us.UTF-8
LC_ALL="en_us.UTF-8"
LC_ALL=en_us.UTF8
LC_ALL="en_us.UTF8"
LC_ALL=en_us.utf8
LC_ALL="en_us.utf8"
Those all produced an error message:
warning: /volume1/homes/admin/CrowCam/TimeTest.sh: line 30: setlocale: LC_ALL: cannot change locale (en_us.utf8): No such file or directory
This one seemed to work (at least it did not produce an error message): ...
However it did not solve the problem. The Grep command (without my multi-character workaround) still produces no results. It fails the same way, whether I set LC_ALL before scraping the HTML or before issuing the grep command. For clarity, this is the code that I added to the top of my example script:
# Force the value of the locale variable:
echo ""
echo "Previous Locale:"
echo $LC_ALL
echo "Forcing locale."
LC_ALL=en_US.utf8
echo ""
|
#374306 - 25/03/2024 17:55
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14496
Loc: Canada
|
You'll want to put the export keyword in front of the assignment. Eg.
export LC_ALL=en_US.utf8
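A small sketch of why the export keyword matters, assuming Bash or another POSIX shell. A plain assignment creates a shell variable that child processes do not inherit unless the variable was already in the environment, which it apparently was not under the Task Scheduler. The `sh -c` child below is only a stand-in for grep, and the value C is used because it exists everywhere:

```shell
# Start from the situation seen under the scheduler: LC_ALL not set at all.
unset LC_ALL

# Plain assignment: visible to this shell, but not passed to child processes.
LC_ALL=C
child_before=$(sh -c 'echo "${LC_ALL:-unset}"')

# After export, the variable becomes part of the environment of every child.
export LC_ALL
child_after=$(sh -c 'echo "${LC_ALL:-unset}"')

echo "child sees before export: $child_before"
echo "child sees after export:  $child_after"
```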
|
#374307 - 25/03/2024 18:14
Re: Help with super weird Bash bug?
[Re: mlord]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31600
Loc: Seattle, WA
|
You'll want to put the export keyword in front of the assignment. Eg.
export LC_ALL=en_US.utf8
BINGO. That command fixed the bug and now the grep statement works as expected again.
|
#374308 - 25/03/2024 18:17
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31600
Loc: Seattle, WA
|
Mark and Peter, thanks so much for your help. I'm going to fix the production code with that locale fix, and also fix the parser so that it gets the alternate forms of "a.m." and "pm" as Mark pointed out.
Thanks again!
|
#374309 - 25/03/2024 19:41
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31600
Loc: Seattle, WA
|
Turns out it's more complicated than that. I want to be able to debug the code on Macs, PCs, and on the Synology, and the locale name can differ depending on what system you're running it on. For example, on the Synology it's:
en_US.utf8
And on Mac, it's
en_US.UTF-8
And if you use the wrong one, everything fails. My first idea is to only change the variable if it's currently null, and if it's null, then parse the output results of the command:
locale -a
and then try to find a match that's close to one of the two above and then force it to that.
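One way that idea could be sketched, not necessarily how the production code ended up doing it. The function name pick_utf8_locale is made up for illustration; it accepts either spelling mentioned above (en_US.utf8 or en_US.UTF-8) and only touches LC_ALL when it is currently empty:

```shell
# Hypothetical helper: reads locale names on stdin, one per line (the format
# "locale -a" prints), and outputs the first en_US UTF-8 variant it finds.
pick_utf8_locale() {
    grep -i -E -m 1 '^en_US\.utf-?8$'
}

# Only force a locale when none is set, as described in the post.
if [ -z "${LC_ALL:-}" ]; then
    chosen=$(locale -a 2>/dev/null | pick_utf8_locale || true)
    if [ -n "$chosen" ]; then
        export LC_ALL="$chosen"
    fi
fi

echo "LC_ALL is now: ${LC_ALL:-unset}"
```

If no matching locale is installed, LC_ALL is left alone, so a wider fallback pattern in the grep would still be needed on such a system.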
|
#374310 - 25/03/2024 20:26
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 13/07/2000
Posts: 4180
Loc: Cambridge, England
|
You know, you might be better off matching on anything between the digits and the am/pm; that would be more robust in the face of any future changes to the website:
grep -o -i '[0-9]+[^0-9]+[0-9]+[^AP]*[AP][^M]*M'
Peter
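A side note on the pattern above, sketched with a made-up sample line and assuming GNU grep: without -E, grep uses basic regular expressions, where an unescaped "+" is an ordinary character rather than a repetition operator. That may be one reason a pattern like this appears to find nothing until -E is added:

```shell
# Hypothetical sample text, not real Google output.
sample='Sunrise 6:58 a.m. today'

# Basic regex (no -E): "+" is taken literally, so nothing matches.
bre=$(printf '%s\n' "$sample" | grep -o -i '[0-9]+[^0-9]+[0-9]+[^AP]*[AP][^M]*M' || true)

# Extended regex (-E): "+" means "one or more", so the time is found.
ere=$(printf '%s\n' "$sample" | grep -o -i -E '[0-9]+[^0-9]+[0-9]+[^AP]*[AP][^M]*M' || true)

echo "without -E: [$bre]"
echo "with -E:    [$ere]"
```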
|
#374311 - 25/03/2024 20:28
Re: Help with super weird Bash bug?
[Re: peter]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14496
Loc: Canada
|
grep -o -i '[0-9]+[^0-9]+[0-9]+[^AP]*[AP][^M]*M'
Also handle upper/lowercase:
grep -o -i '[0-9]+[^0-9]+[0-9]+[^aApP]*[aApP][^mM]*[mM]'
|
#374312 - 25/03/2024 20:36
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 13/07/2000
Posts: 4180
Loc: Cambridge, England
|
I think that's what the "-i" does.
Peter
|
#374313 - 25/03/2024 20:46
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14496
Loc: Canada
|
|
#374314 - 25/03/2024 20:52
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31600
Loc: Seattle, WA
|
There can be a lot of text in the results, and though the proper result is usually the first one in the search results, I want to avoid greedily grepping for "*" as much as I can avoid it. It might return a bunch of stuff I don't want. I certainly want to search for only a single character between the digits and the AM/PM part if I can. So that way it doesn't return any results like "It was about 3:30 in the AM when the flying saucer abducted me".
I'm trying to find ways that I can get a simple yet reliable search. Oddly, the following things are behaving weirdly:
If I try to use "-i" then it doesn't work at all when I'm debugging things at my mac shell prompt.
If I try to more greedily search for other types of things, it finds all the timestrings in the entire page rather than just the first one, even though I'm using "-m 1" in my grep. So I have to use "| head -n 1" at the end of my search to only return the first one.
Though something like [AP][M] works for upper case, as soon as I try to do something like [aApP][mM] it stops working and gets no results.
I'm experimenting and I'll see what I can come up with.
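One possible explanation for the "-m 1" behavior, sketched with a made-up two-line page: "-m 1" stops after the first matching line, not the first match, and an unquoted `echo $variable` joins the whole page into a single line before grep sees it. With "-o", every match on that one line is then printed:

```shell
# Hypothetical two-line stand-in for the downloaded page.
page=$(printf 'Sunrise 7:05 AM\nSunset 7:16 PM\n')

# Unquoted: word splitting collapses the newlines, so grep gets one long line
# and -o prints every match found on it.
flat=$(echo $page | grep -o -m 1 '[0-9][0-9]*:[0-9][0-9].[AP]M')

# Quoted: line breaks survive, and -m 1 stops after the first matching line.
kept=$(printf '%s\n' "$page" | grep -o -m 1 '[0-9][0-9]*:[0-9][0-9].[AP]M')

echo "unquoted echo gives:"
echo "$flat"
echo "quoted input gives:"
echo "$kept"
```

Whether this applies to the real page depends on how the HTML is broken into lines, so piping through "head -n 1" remains a reasonable safeguard either way.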
|
#374315 - 25/03/2024 20:54
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 29/08/2000
Posts: 14496
Loc: Canada
|
I mashed this up using some of the original code. It doesn't have the latest/best matching for AM/PM, but it does know how to find the correct strings:
#!/bin/bash
userAgent="Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
googleQueryUrl="http://www.google.com/search?q="
## Sunrise:
googleQueryResult=$( curl -L -A "$userAgent" -s "${googleQueryUrl}sunrise" )
echo "$googleQueryResult" | sed -n -e 's/.*Sunrise..span.*\([01]*[0-9]:[0-5][0-9]\).\([aApP][.]*[mM]\).*/\1 \2/p' | sed -e 's/[aA][.][mM]/am/'
## Sunset:
googleQueryResult=$( curl -L -A "$userAgent" -s "${googleQueryUrl}sunset" )
echo "$googleQueryResult" | sed -n -e 's/.*Sunset..span.*\([01]*[0-9]:[0-5][0-9]\).\([aApP][.]*[mM]\).*/\1 \2/p' | sed -e 's/[pP][.][mM]/pm/'
|
#374316 - 25/03/2024 22:25
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31600
Loc: Seattle, WA
|
Mark and Peter, thanks so much for your help and suggestions! My final fix was a bit different from your suggestions for a few reasons: I wanted it to be less fancy and more readable, I wanted it to be less greedy about getting results, I needed it to fit in with the way my current code worked, and I wanted to future proof it as much as possible. Your suggestions were very helpful because they gave me ideas of what to look for and what to try. I think I like the current version I've got, which is:
timeWithWeirdSpaceInTheMiddle=$(echo $googleQueryResult | \
grep -o -E -m 1 '[0-9]{1,2}:[0-5][0-9].(AM|PM|am|pm|A\.M\.|P\.M\.|a\.m\.|p\.m\.)' | \
head -n 1 )
I know that the am/pm section could be written with fewer characters, but that version is the most readable. Full code of my changes (including the somewhat involved fix for LC_ALL) is checked in here. Thanks again!
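A quick way to exercise that pattern offline, using made-up sample lines rather than real search results. The function name extract_time is hypothetical, and the samples use regular spaces so the check does not depend on the locale:

```shell
# Hypothetical wrapper around the pattern from the post: prints the first
# time-of-day string found on stdin, or nothing if there is none.
extract_time() {
    grep -o -E '[0-9]{1,2}:[0-5][0-9].(AM|PM|am|pm|A\.M\.|P\.M\.|a\.m\.|p\.m\.)' | head -n 1
}

us_style=$(printf 'Sunrise 7:05 AM Sunset 7:16 PM\n' | extract_time)
ca_style=$(printf 'Sunrise 6:58 a.m. Sunset 7:16 pm\n' | extract_time)
# Only a single character is allowed before AM/PM, so this should not match.
saucer=$(printf 'It was about 3:30 in the AM when the flying saucer abducted me\n' | extract_time)

echo "US style: [$us_style]"
echo "CA style: [$ca_style]"
echo "saucer:   [$saucer]"
```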
|
#374317 - 26/03/2024 04:08
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 08/07/1999
Posts: 5549
Loc: Ajijic, Mexico
|
I once wrote a program in Q-Basic that displayed right there on the screen: "Hello, world." If I recall correctly, it only took me a couple hours to debug it and get it working. So if you need any further help with this project, be sure and let me know. I mean, how hard can it be?
_________________________
"There Ain't No Such Thing As A Free Lunch"
|
#374318 - 26/03/2024 09:26
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 18/01/2000
Posts: 5683
Loc: London, UK
|
Frame challenge: why is Google giving you differently-formatted results? Can you force it to a predictable format by using an Accept-Language header?
_________________________
-- roger
|
#374319 - 26/03/2024 14:01
Re: Help with super weird Bash bug?
[Re: tfabris]
|
veteran
Registered: 25/04/2000
Posts: 1529
Loc: Arizona
|
At some point, it has to be easier to just give up on using Google. I used this API for something I did: https://sunrisesunset.io/api/
|
#374320 - 26/03/2024 17:43
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31600
Loc: Seattle, WA
|
Can you force it to a predictable format by using an Accept-Language header? Great idea! I hadn't thought of that. Worth experimenting. Didn't even know that existed! Also worth trying! (Ah, just noticed, it didn't exist until 2022, a couple years after my first implementation of that code.) Great suggestions!
|
#374321 - 26/03/2024 19:06
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31600
Loc: Seattle, WA
|
Based on my initial experimentation, sunrisesunset.io requires entering a lat/long; you can't use a named location, nor can you omit the location. Which is odd: it will use the internet location to determine the timezone, so why can't it use the same information to determine location? Agreed that the location would be imprecise, but it would be close enough for most usages, and you could have a warning field in the returned JSON which says that it'll be imprecise. Maybe the same site offers a separate API call to return the lat/long of a location, maybe there's another easier way to do that from a public API, or maybe there's a simple way to do that in code at the browser level without having to call any API, and I just don't know those methods.
Anyway, I wanted my implementation to be able to either omit the location information, or just use a zip code or city name, and Googling and scraping the result was easy at the time (until it got more complicated recently).
Along the lines of Roger's suggestion, I wonder if there's a Google API for returning those kinds of results in a JSON, so that I don't have to scrape the HTML? That would be the best solution. Heck, that same code is already using the Google API to control YouTube videos, why didn't I think to look for a Google API for web searches? Silly me.
|
#374322 - 26/03/2024 19:19
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31600
Loc: Seattle, WA
|
Along the lines of Roger's suggestion, I wonder if there's a Google API for returning those kinds of results in a JSON, so that I don't have to scrape the HTML? Yes there is, but it's not free. After 100 queries per day, it starts to cost money. A very small amount of money, but I don't want to have to mess with that in my code's setup instructions.
|
#374323 - 27/03/2024 15:18
Re: Help with super weird Bash bug?
[Re: tfabris]
|
veteran
Registered: 25/04/2000
Posts: 1529
Loc: Arizona
|
You could use something like https://ipstack.com/ to get the lat/lon from your IP. You are limited to 100 calls per month for free, but that should be easy to stay under if you only grab it when the IP changes, or limit it even further to only check when anything except the last octet of the IP changes (in case a renewed lease gets a new IP).
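The "anything except the last octet" check could be sketched like this, assuming plain dotted IPv4 addresses. The function names are made up and the addresses are documentation examples; how the current and previous IP are obtained and cached is left out:

```shell
# Hypothetical helper: strip the last octet from a dotted IPv4 address.
network_prefix() {
    printf '%s\n' "${1%.*}"
}

# Hypothetical helper: succeeds (exit status 0) when the first three octets
# differ, meaning a fresh geolocation lookup is probably warranted.
needs_lookup() {
    [ "$(network_prefix "$1")" != "$(network_prefix "$2")" ]
}

if needs_lookup "203.0.113.10" "203.0.113.77"; then
    echo "prefix changed: look up the location again"
else
    echo "same prefix: reuse the cached lat/long"
fi
```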
|
#374324 - 27/03/2024 18:12
Re: Help with super weird Bash bug?
[Re: tfabris]
|
carpal tunnel
Registered: 20/12/1999
Posts: 31600
Loc: Seattle, WA
|
Can you force it to a predictable format by using an Accept-Language header?
Did a quick experiment. Adding the header "Accept-Language: en-CA" didn't produce the Canadian results that I got when spoofing my location in the browser. So from my initial experiment, looks like no.
And yeah, anything where I have to do more complicated steps, like getting the location first, or paying money (or having to limit the calls to 100 a day to avoid paying), is more complex and difficult than my (now working) Google scrape, so for now I'm just sticking with the Google scrape.
Tim, your suggestions are great because using a true API is usually preferable to scraping HTML that might change out from under me (like it already did once with that nonbreaking space thing). On the other hand, I wonder which site is going to last longer: ipstack, sunrisesunset.io, or google.com? I'm sure it's the latter; the only issue is, how long before I have to change the scraper again?
|