marc.aronson
Posted: Fri Jul 28, 2006 11:16 pm
Joined: Tue Jan 18, 2005 2:07 am | Posts: 1532 | Location: California
I've developed a script that runs in the background, detects certain back-end failures, and automatically restarts the backend when it does. It detects and recovers from the following conditions:
1. "Waiting for a thread" errors.
2. Conditions that result in the backend logfile growing to more than 1 megabyte.
3. Backend process terminating for any reason.
I've developed this script because these conditions happen to my system occasionally, and the automated recovery ensures I don't lose recordings. If anyone is interested, let me know and I'll post it with usage instructions. It is tested with R5A30.2, but I suspect it should work fine with R5B7 & R5C7.
Marc
marc.aronson
Posted: Sat Jul 29, 2006 1:46 pm
Joined: Tue Jan 18, 2005 2:07 am | Posts: 1532 | Location: California
I've finished adding some comments to the script, so here it is in case anyone is interested.
Note: Script updated on July 30 to minimize odds of runaway restarts
Note: Script updated on August 2 to also restart gdm anytime backend is restarted.
Note: Script updated on August 5 to make timestamp on captured mythbackend log file match timestamp in mbemon.log.
Note: Script updated on August 23 to periodically probe port 6544 to ensure that the backend isn't hung.
Place the following in /usr/local/bin/mbemon.sh and don't forget to "chmod a+x /usr/local/bin/mbemon.sh".
Code:
#! /bin/sh
#
# This script monitors the backend and automatically restarts
# the backend if certain error conditions are detected. Those
# conditions are documented in the various comments below.
# Each of those comments contains the string "Check for", so you
# can find them quickly by searching for that pattern.
#
# If this script finds the backend has hit a problem more than 4
# times out of the last 15 checks, it assumes that your backend
# has hit a persistent error that requires manual intervention
# to correct, and it will exit after doing a final restart of the
# backend.
#
# Note that if this script is running when you do a normal "stop"
# of the backend, it will eventually restart the backend for you,
# even if that is not what you wanted. So if you deliberately want
# to stop the backend, you should stop this script first.
#
# Started by executing: /etc/init.d/mbemon start
# Halted by executing:  /etc/init.d/mbemon stop
#
# Script variables that you might choose to alter:
#
# mbelog: Location of the myth backend log file.
#         Defaults to the location used by KnoppMyth.
# stime:  Number of seconds to sleep between scans.
#         Defaults to 300 seconds (5 minutes).
# logdir: Path to the directory where this script will
#         place its log file and various other diagnostic
#         information.
# log:    Path to the logfile where this script will log
#         various status messages.
#
mbelog=/var/log/mythtv/mythbackend.log
stime=300
logdir=/var/log/mythtv/mbemon
log=$logdir/mbemon.log
hist=$logdir/history.log
#
#
#
mkdir -p $logdir
echo "mbemon started `date`" > $log
rm -f $hist
while true
do
    sleep $stime
    status="ok"
    #
    # Check for backend not running.
    #
    if [ "$status" == "ok" ]
    then
        xxx=`pgrep -l mythbackend`
        if [ "$xxx" == "" ]
        then
            status="Myth backend process is missing"
        fi
    fi
    #
    # Check for excessively large backend log
    #
    if [ "$status" == "ok" -a -e "$mbelog" ]
    then
        xxx=0
        xxx=`ls -s $mbelog | cut -d" " -f1`
        if [ "$xxx" -gt 1100 ]
        then
            status="Log file size of $xxx blocks is too large -- probable runaway"
        fi
    fi
    #
    # Check for "waiting for a thread..." in the log file
    #
    if [ "$status" == "ok" -a -e "$mbelog" ]
    then
        grep -q "waiting for a thread" $mbelog
        if [ $? -eq 0 ]
        then
            status="Log file contains 'waiting for a thread' error message"
        fi
    fi
    #
    # Check to see if backend is responding to queries on port 6544
    #
    if [ "$status" == "ok" ]
    then
        date > $logdir/status.txt
        lynx -dump http://localhost:6544/ >> $logdir/status.txt
        lstat=$?
        grep -q "Schedule" $logdir/status.txt
        gstat=$?
        if [ $lstat -ne 0 ] || [ $gstat -ne 0 ] ; then
            status="Unable to query port 6544 lstat=$lstat gstat=$gstat"
        fi
    fi
    #
    # Check results and restart backend if there is a problem.
    #
    histline="OK `date`"
    if [ "$status" != "ok" ]
    then
        histline="ERROR `date`"
        echo "" >> $log
        echo Problem encountered on `date` >> $log
        echo " $status" >> $log
        echo " Halting backend" >> $log
        /etc/init.d/mythtv-backend stop | sed -e "s/^/ /" >> $log
        if [ -e $mbelog ]
        then
            echo " Capturing back end log file" >> $log
            [ ! -e $logdir/mythbackend.log-2 ] || \
                mv -f $logdir/mythbackend.log-2 $logdir/mythbackend.log-3
            [ ! -e $logdir/mythbackend.log-1 ] || \
                mv -f $logdir/mythbackend.log-1 $logdir/mythbackend.log-2
            [ ! -e $logdir/mythbackend.log ] || \
                mv -f $logdir/mythbackend.log $logdir/mythbackend.log-1
            mv -f $mbelog $logdir/mythbackend.log | sed -e "s/^/ /" >> $log
            touch $logdir/mythbackend.log
        fi
        echo " Restarting backend" >> $log
        /etc/init.d/mythtv-backend start | sed -e "s/^/ /" >> $log
        sleep 1
        echo " Restarting gdm" >> $log
        /etc/init.d/gdm restart
    fi
    echo $histline >> $hist
    tail -15 $hist > $hist.2
    mv -f $hist.2 $hist
    cnt=`grep "^ERROR" $hist | wc -l`
    if [ $cnt -gt 4 ]
    then
        echo " More than 4 of the last 15 checks have resulted in a restart" >> $log
        echo " It appears that there is a persistent error that cannot be" >> $log
        echo " recovered from. mbemon is giving up and exiting." >> $log
        exit
    fi
done
Place the following in /etc/init.d/mbemon and don't forget to "chmod a+x /etc/init.d/mbemon".
Code:
#!/bin/sh
#
# Start/stops the myth backend monitor
#
#
case "$1" in
  start)
    echo "Starting mbemon daemon"
    killall -q mbemon.sh
    start-stop-daemon -S --exec /usr/local/bin/mbemon.sh \
        --name marcmbemon --background --nicelevel 19
    ;;
  stop)
    echo -n "Stopping mbemon daemon"
    killall -q mbemon.sh || echo -n " -- mbemon not already running"
    echo "."
    ;;
  restart|force-reload)
    $0 stop
    sleep 1
    $0 start
    ;;
  *)
    echo "Usage: /etc/init.d/mbemon {start|stop|restart|force-reload}"
    exit 1
    ;;
esac
exit 0
Last edited by marc.aronson on Wed Aug 23, 2006 1:14 pm, edited 4 times in total.
randomhtpcguy
Posted: Sun Jul 30, 2006 10:00 am
Joined: Mon Nov 07, 2005 10:09 am | Posts: 153
I was having a lot of problems before with my backend crashing.
I installed monit following these instructions
http://www.mythtv.org/wiki/index.php/St ... ing_How_To
It seems to work great and has a nice HTML status page.
Monit is running on mythtvslave with uptime 4d 13h 38m and monitoring:
mythtvslave    [2.00] [1.97] [1.80]    0.0%us, 1.1%sy, 98.2%wa    76.3% [392124 kB]
mythbackend    running    4d 13h 35m    0.0%    4.7% [24452 kB]
mysql          running    4d 13h 35m    0.0%    1.2% [6524 kB]
It has log files too. Scripts could be added to warn of low disk space with an email and to monitor the size of log files. I can't get it to work with mythfrontend because there is no pid file, and I still need to set up the outgoing email.
Your script looks nice too. I think it would be nice to incorporate this type of technology into KnoppMyth because it's very helpful. An alert to the user of a suddenly rapidly growing log file would probably detect many of the big problems.
cesman
Posted: Sun Jul 30, 2006 11:16 am
Joined: Fri Sep 19, 2003 7:05 pm | Posts: 5088 | Location: Fontana, Ca
randomhtpcguy wrote: I think it would be nice to incorporate this type of technology into KnoppMyth because it's very helpful. An alert to the user of a suddenly rapidly growing log file would probably detect many of the big problems.
My general feeling toward this is no. If the backend crashes, one should investigate to find out why it crashed. Potentially restarting the backend over and over isn't going to do anything but grow the log.
If you expect to see something like a log alert, then I challenge you to come up with something and submit it for peer review.
_________________ cesman
When the source is open, the possibilities are endless!
marc.aronson
Posted: Sun Jul 30, 2006 11:16 am
Joined: Tue Jan 18, 2005 2:07 am | Posts: 1532 | Location: California
randomhtpcguy, I saw the post on monit. I went this route because I wanted to cover some other conditions I've encountered and I didn't want to learn another scripting language. Both approaches work. The script above will capture diagnostic information, including the backend log file, for subsequent diagnosis of the root cause.
In terms of the front-end -- have you considered linking one of your remote control buttons to the command "/etc/init.d/gdm restart"? This would let you restart the front end with your remote control when you run into problems. In my case, I linked the restart to the remote control's power button. Just a thought...
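Roughly, the idea is to have irexec run a tiny wrapper script when the button is pressed. This is only a sketch -- the script path and button name below are placeholders, not my actual setup:
Code:
#!/bin/sh
# Hypothetical /usr/local/bin/frontend-restart.sh, invoked by irexec.
# A matching ~/.lircrc entry (the button name is just an example) would be:
#   begin
#       prog = irexec
#       button = power
#       config = /usr/local/bin/frontend-restart.sh
#   end
# Restarting gdm tears down the X session and brings the frontend back up.
/etc/init.d/gdm restart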
Marc
tkoster
Posted: Sun Jul 30, 2006 1:00 pm
Joined: Mon Apr 04, 2005 10:50 am | Posts: 120
What happens when you stop the backend on purpose to tweak card settings or repair the database? Does it stay stopped or automatically restart?
marc.aronson
Posted: Sun Jul 30, 2006 6:04 pm
Joined: Tue Jan 18, 2005 2:07 am | Posts: 1532 | Location: California
tkoster, the script I've written does not distinguish a deliberate stop of the backend from an unintended halt. The way I deal with this is to execute "/etc/init.d/mbemon stop" before stopping the backend.
For my purposes, I've created a single stop script that stops apache, mbemon, mythbackend & mysql with a single command. I've also created the corresponding start script...
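As a rough outline (the init script names may differ on your system, so treat this as a sketch rather than my exact script):
Code:
#!/bin/sh
# Sketch of a combined stop script: stop the monitor first so it doesn't
# restart the backend behind your back, then stop the services themselves.
/etc/init.d/mbemon stop
/etc/init.d/apache stop
/etc/init.d/mythtv-backend stop
/etc/init.d/mysql stop
The start script is the same list in reverse order with "start" instead of "stop".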
Cesman, I agree that the ultimate goal is to diagnose back-end failures and fix the underlying bugs so that over time the backend becomes 100% stable. That is the reason the script captures diagnostic information before doing a restart. My problem has been that the failures, while very infrequent, tend to happen while I am out of the country, and I wind up talking my wife through the reset process from 10,000 miles away, after she has been unable to use the system for some period of time. I felt this was a better solution for me, given my circumstances...
Marc
marc.aronson
Posted: Sun Jul 30, 2006 6:21 pm
Joined: Tue Jan 18, 2005 2:07 am | Posts: 1532 | Location: California
Cesman, I've been thinking some more about your post. It seems like another one of your concerns is the risk of entering a "restart loop" if the cause of the halt is a persistent problem. Is this correct?
Marc
tjc
Posted: Sun Jul 30, 2006 6:28 pm
Joined: Thu Mar 25, 2004 11:00 am | Posts: 9551 | Location: Arlington, MA
Yes. It's a classic systems programming problem. Automatic restart needs to be handled very carefully lest it make the problem far worse. There are various ways you can try to mitigate the risk, like restart counts and doubling restart delays, but even with those it can bite you hard.
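For illustration only, a doubling restart delay takes just a few lines of shell. This sketch is not part of the script posted above, and the thresholds are arbitrary:
Code:
#!/bin/sh
# Sketch of a doubling restart delay: each consecutive failure doubles the
# wait before the next restart attempt, capped at one hour.
delay=30
while true
do
    if pgrep mythbackend > /dev/null
    then
        delay=30                      # backend is healthy, reset the backoff
    else
        sleep $delay                  # wait before attempting the restart
        /etc/init.d/mythtv-backend start
        delay=`expr $delay \* 2`      # double the delay for the next failure
        [ $delay -gt 3600 ] && delay=3600
    fi
    sleep 60
done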
marc.aronson
Posted: Sun Jul 30, 2006 6:49 pm
Joined: Tue Jan 18, 2005 2:07 am | Posts: 1532 | Location: California
TJC, agreed. This is a good point. While it's not a bullet-proof approach, I've modified the script to track restarts. In this new version, if it detects a need to restart the backend more than 4 times over the previous 15 checks, it will log a message in its log file and exit, so that it doesn't trigger a permanent restart loop.
I've tested this condition and it seems to work properly, but I want to let it run for a bit before I post the updated version.
Marc
cesman
Posted: Sun Jul 30, 2006 7:54 pm
Joined: Fri Sep 19, 2003 7:05 pm | Posts: 5088 | Location: Fontana, Ca
Yes, that is my concern.
_________________ cesman
When the source is open, the possibilities are endless!
marc.aronson
Posted: Sun Jul 30, 2006 8:22 pm
Joined: Tue Jan 18, 2005 2:07 am | Posts: 1532 | Location: California
I've updated the script in my original post to reduce the risk of this happening. In this new version, if it detects a need to restart the backend more than 4 times over the previous 15 checks, it will log a message in its log file and exit, so that it doesn't trigger a permanent restart loop.
Marc
randomhtpcguy
Posted: Wed Aug 02, 2006 7:55 am
Joined: Mon Nov 07, 2005 10:09 am | Posts: 153
Marc, looks like you're making good progress. To answer your question about deliberate restarts: I do /etc/init.d/monit stop first and then the others you mentioned.
The frontend needs to be started from a wrapper which creates a pid file, like the backend, mysql and gdm, so I think it will be more complicated to set up using monit (they give a suggested wrapper in the FAQ). The backend and mysql already start this way, so they are very easy: you just add the pid file to the config and you're done.
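For reference, the wrapper only needs to record the child's pid somewhere monit can find it -- roughly like this sketch (the pid file path is just an example, and this is not the exact wrapper from the FAQ):
Code:
#!/bin/sh
# Sketch of a start wrapper that records mythfrontend's pid for monit.
# Assumes DISPLAY is already set for the running X session.
mythfrontend > /dev/null 2>&1 &
echo $! > /var/run/mythfrontend.pid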
Auto-restarting the frontend might be more trouble than it's worth. It's obvious when it has crashed, and nothing is lost if it crashes while you're away. A gdm restart from the remote would be sufficient when the remote is working.
I still have periodic freezes of the frontend in MythMusic when I navigate quickly between songs, but the remote stops responding (no keyboard), so the gdm restart doesn't work. I would like to have the box's power button do a low-level kill, a magic SysRq, or some safe reboot, but so far I've had to do a hard reset, which forces me to run fsck manually on reboot. That's dangerous, but luckily it appears to fix things without moving stuff to lost+found.
Obviously, I wouldn't want this type of crash (freeze) to be automatically reset and ignored, as cesman points out. The only reason I bring this up is ... could mysql or the backend be crashing and restarting in a loop, causing the frontend to crash? The restart limit with increasing delays would likely help. I think any time monit or your script has to take action, the admin should be notified. When I'm home I periodically check the uptimes on the monit web page. It's in the logs, but as you say this is designed for when you're away, so maybe an email would be nice.
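Something as simple as the following, dropped into the restart branch of the script, would cover the email part. It assumes the box can actually send mail, and the address is only a placeholder:
Code:
# Hypothetical addition to mbemon.sh at the point where the restart happens.
echo "mbemon restarted the backend on `hostname` at `date`: $status" \
    | mail -s "mbemon restart on `hostname`" admin@example.com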
Cesman -- an "unusual increased rate of logging" detection and alert is something I should try before I suggest it. I know it would be complicated, as some normal processes, like building a DVD or transcoding, probably grow the log. I will look around, though. Usually the small size of root solves this problem, as it fills up and prevents logging in. So I guess in principle this forces the user to fix problems early and often. I guess it's more clever than I realized.
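Just to sketch the idea (the threshold is arbitrary and this is untested), a growth-rate check only needs to remember the size from the previous pass:
Code:
#!/bin/sh
# Sketch of a log growth-rate alert: compare the current size of the backend
# log against the size recorded on the previous pass and warn if it grew by
# more than an arbitrary threshold.
mbelog=/var/log/mythtv/mythbackend.log
state=/var/log/mythtv/mbemon/lastsize
threshold=500                          # blocks per pass -- tune to taste
cur=`ls -s $mbelog | cut -d" " -f1`
prev=`cat $state 2>/dev/null || echo 0`
if [ `expr $cur - $prev` -gt $threshold ]
then
    echo "Backend log grew from $prev to $cur blocks -- possible runaway"
fi
echo $cur > $state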
cesman
Posted: Wed Aug 02, 2006 8:39 am
Joined: Fri Sep 19, 2003 7:05 pm | Posts: 5088 | Location: Fontana, Ca
The idea is to find the issue and resolve it permanently.
_________________ cesman
When the source is open, the possibilities are endless!
marc.aronson
Posted: Wed Aug 02, 2006 10:56 pm
Joined: Tue Jan 18, 2005 2:07 am | Posts: 1532 | Location: California
randomhtpcguy, it sounds like your system has a nasty problem with the kind of front-end lockup you are seeing. Have you checked /var/log/messages and /var/log/kern.log to see if there are any diagnostic messages that would help solve the problem? I was having problems with front-end lock ups on my old hardware. I would get messages that read something like this:
Code:
Mar 8 20:45:09 mythhd kernel: NVRM: Xid: 16, Head 00000000 Count 003ada32
Mar 8 20:45:13 mythhd kernel: NVRM: Xid: 8, Channel 0000001e
Based on what I read in various forums, that problem is sometimes caused by a motherboard problem, so I rebuilt my system with a new motherboard & CPU, and that problem has gone away. I'm not sure if this applies to you, but I thought I'd mention it.
Marc