marc.aronson
Posted: Fri Jul 28, 2006 11:16 pm
Joined: Tue Jan 18, 2005 2:07 am | Posts: 1532 | Location: California
I've developed a script that runs in the background, detects certain back-end failures, and automatically restarts the backend when it does. It detects and recovers from the following conditions:
1. "Waiting for a thread" errors.
2. Conditions that result in the backend logfile growing to more than 1 megabyte.
3. Backend process terminating for any reason.
I've developed this script because these conditions happen to my system occasionally, and the automated recovery ensures I don't lose recordings. If anyone is interested, let me know and I'll post it with usage instructions. It is tested with R5A30.2, but I suspect it should work fine with R5B7 & R5C7.
Marc
marc.aronson
Posted: Sat Jul 29, 2006 1:46 pm
Joined: Tue Jan 18, 2005 2:07 am | Posts: 1532 | Location: California
I've finished adding some comments to the script, so here it is in case anyone is interested.
Note: Script updated on July 30 to minimize odds of runaway restarts
Note: Script updated on August 2 to also restart gdm anytime backend is restarted.
Note: Script updated on August 5 to make timestamp on captured mythbackend log file match timestamp in mbemon.log.
Note: Script updated on August 23 to periodically probe port 6544 to ensure that the backend isn't hung.
Place the following in /usr/local/bin/mbemon.sh and don't forget to "chmod a+x /usr/local/bin/mbemon.sh".
Code:
#! /bin/sh
#
# This script monitors the backend and automatically restarts
# the backend if certain error conditions are detected. Those
# conditions are documented in the various comments below.
# Each of those comments contains the string "Check for", so you
# can find them quickly by searching for that pattern.
#
# If this script finds the backend has hit a problem more than 4
# times out of the last 15 checks, it assumes that your backend
# has hit a persistent error that requires manual intervention
# to correct, and it will exit after doing a final restart of the
# backend.
#
# Note that if this script is running when you do a normal "stop"
# of the backend, it will eventually restart the backend for you,
# even if that is not what you wanted. So if you deliberately want
# to stop the backend, you should stop this script first.
#
# Started by executing: /etc/init.d/mbemon start
# Halted by executing:  /etc/init.d/mbemon stop
#
# Script variables that you might choose to alter:
#
# mbelog: Location of the myth backend log file.
#         Defaults to the location used by KnoppMyth.
# stime:  Number of seconds to sleep between scans.
#         Defaults to 300 seconds (5 minutes).
# logdir: Path to the directory where this script will
#         place its log file and various other diagnostic
#         information.
# log:    Path to the logfile where this script will log
#         various status messages.
#
mbelog=/var/log/mythtv/mythbackend.log
stime=300
logdir=/var/log/mythtv/mbemon
log=$logdir/mbemon.log
hist=$logdir/history.log
#
#
#
mkdir -p $logdir
echo "mbemon started `date`" > $log
rm -f $hist
while true
do
    sleep $stime
    status="ok"
    #
    # Check for backend not running.
    #
    if [ "$status" == "ok" ]
    then
        xxx=`pgrep -l mythbackend`
        if [ "$xxx" == "" ]
        then
            status="Myth backend process is missing"
        fi
    fi
    #
    # Check for excessively large backend log
    #
    if [ "$status" == "ok" -a -e "$mbelog" ]
    then
        xxx=0
        xxx=`ls -s $mbelog | cut -d" " -f1`
        if [ "$xxx" -gt 1100 ]
        then
            status="Log file size of $xxx blocks is too large -- probable runaway"
        fi
    fi
    #
    # Check for "waiting for a thread..." in the log file
    #
    if [ "$status" == "ok" -a -e "$mbelog" ]
    then
        grep -q "waiting for a thread" $mbelog
        if [ $? -eq 0 ]
        then
            status="Log file contains 'waiting for a thread' error message"
        fi
    fi
    #
    # Check to see if backend is responding to queries on port 6544
    #
    if [ "$status" == "ok" ]
    then
        date > $logdir/status.txt
        lynx -dump http://localhost:6544/ >> $logdir/status.txt
        lstat=$?
        grep -q "Schedule" $logdir/status.txt
        gstat=$?
        if [ $lstat -ne 0 ] || [ $gstat -ne 0 ] ; then
            status="Unable to query port 6544 lstat=$lstat gstat=$gstat"
        fi
    fi
    #
    # Check results and restart backend if there is a problem.
    #
    histline="OK `date`"
    if [ "$status" != "ok" ]
    then
        histline="ERROR `date`"
        echo "" >> $log
        echo Problem encountered on `date` >> $log
        echo " $status" >> $log
        echo " Halting backend" >> $log
        /etc/init.d/mythtv-backend stop | sed -e "s/^/ /" >> $log
        if [ -e $mbelog ]
        then
            echo " Capturing back end log file" >> $log
            [ ! -e $logdir/mythbackend.log-2 ] || \
                mv -f $logdir/mythbackend.log-2 $logdir/mythbackend.log-3
            [ ! -e $logdir/mythbackend.log-1 ] || \
                mv -f $logdir/mythbackend.log-1 $logdir/mythbackend.log-2
            [ ! -e $logdir/mythbackend.log ] || \
                mv -f $logdir/mythbackend.log $logdir/mythbackend.log-1
            mv -f $mbelog $logdir/mythbackend.log | sed -e "s/^/ /" >> $log
            touch $logdir/mythbackend.log
        fi
        echo " Restarting backend" >> $log
        /etc/init.d/mythtv-backend start | sed -e "s/^/ /" >> $log
        sleep 1
        echo " Restarting gdm" >> $log
        /etc/init.d/gdm restart
    fi
    echo $histline >> $hist
    tail -15 $hist > $hist.2
    mv -f $hist.2 $hist
    cnt=`grep "^ERROR" $hist | wc -l`
    if [ $cnt -gt 4 ]
    then
        echo " More than 4 of the last 15 checks have resulted in a restart" >> $log
        echo " It appears that there is a persistent error that cannot be" >> $log
        echo " recovered from. mbemon is giving up and exiting." >> $log
        exit
    fi
done
Place the following in /etc/init.d/mbemon and don't forget to "chmod a+x /etc/init.d/mbemon".
Code:
#!/bin/sh
#
# Start/stops the myth backend monitor
#
#
case "$1" in
  start)
    echo "Starting mbemon daemon"
    killall -q mbemon.sh
    start-stop-daemon -S --exec /usr/local/bin/mbemon.sh \
        --name marcmbemon --background --nicelevel 19
    ;;
  stop)
    echo -n "Stopping mbemon daemon"
    killall -q mbemon.sh || echo -n " -- mbemon not already running"
    echo "."
    ;;
  restart|force-reload)
    $0 stop
    sleep 1
    $0 start
    ;;
  *)
    echo "Usage: /etc/init.d/mbemon {start|stop|restart|force-reload}"
    exit 1
    ;;
esac
exit 0
Last edited by marc.aronson on Wed Aug 23, 2006 1:14 pm, edited 4 times in total.
randomhtpcguy
Posted: Sun Jul 30, 2006 10:00 am
Joined: Mon Nov 07, 2005 10:09 am | Posts: 153
I was having a lot of problems before with my backend crashing.
I installed monit following these instructions
http://www.mythtv.org/wiki/index.php/St ... ing_How_To
It seems to work great and has a nice HTML status page.
Monit is running on mythtvslave with uptime 4d 13h 38m and monitoring:
mythtvslave    [2.00] [1.97] [1.80]    0.0%us, 1.1%sy, 98.2%wa    76.3% [392124 kB]
mythbackend    running    4d 13h 35m    0.0%    4.7% [24452 kB]
mysql          running    4d 13h 35m    0.0%    1.2% [6524 kB]
It has log files too. Scripts could be added to warn of low disk space with an email and to monitor the size of log files. I can't get it to work with mythfrontend because there is no pid file, and I still need to set up the outgoing email.
Your script looks nice too. I think it would be nice to incorporate this type of technology into KnoppMyth because it's very helpful. An alert to the user of a suddenly rapidly growing log file would probably detect many of the big problems.
cesman
Posted: Sun Jul 30, 2006 11:16 am
Joined: Fri Sep 19, 2003 7:05 pm | Posts: 5088 | Location: Fontana, Ca
randomhtpcguy wrote: I think it would be nice to incorporate this type of technology into KnoppMyth because it's very helpful. An alert to the user of a suddenly rapidly growing log file would probably detect many of the big problems.
My general feeling toward this is no. If the backend crashes, one should investigate to find out why it crashed. Potentially restarting the backend over and over isn't going to do anything but grow the log.
If you expect to see something like a log alert, then I challenge you to come up with something and submit it for peer review.
_________________ cesman
When the source is open, the possibilities are endless!
marc.aronson
Posted: Sun Jul 30, 2006 11:16 am
Joined: Tue Jan 18, 2005 2:07 am | Posts: 1532 | Location: California
randomhtpcguy, I saw the post on monit. I went this route because I wanted to cover some other conditions I've encountered and I didn't want to learn another scripting language. Both approaches work. The script above will capture diagnostic information, including the backend log file, for subsequent diagnosis of the root cause.
In terms of the front-end -- have you considered linking one of your remote control buttons to the command "/etc/init.d/gdm restart"? This would let you restart the front end with your remote control when you run into problems. In my case, I linked the restart to the remote control's power button. Just a thought...
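Roughly, the idea is to have irexec run a tiny wrapper script when the button is pressed. This is only a sketch -- the script path and button name below are placeholders, not my actual setup:
Code:
#!/bin/sh
# Hypothetical /usr/local/bin/frontend-restart.sh, invoked by irexec.
# A matching ~/.lircrc entry (the button name is just an example) would be:
#   begin
#       prog = irexec
#       button = power
#       config = /usr/local/bin/frontend-restart.sh
#   end
# Restarting gdm tears down the X session and brings the frontend back up.
/etc/init.d/gdm restart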
Marc
tkoster
Posted: Sun Jul 30, 2006 1:00 pm
Joined: Mon Apr 04, 2005 10:50 am | Posts: 120
What happens when you stop the backend on purpose to tweak card settings or repair the database? Does it stay stopped or automatically restart?
marc.aronson
Posted: Sun Jul 30, 2006 6:04 pm
Joined: Tue Jan 18, 2005 2:07 am | Posts: 1532 | Location: California
tkoster, the script I've written does not distinguish a deliberate stop of the backend from an unintended halt. The way I deal with this is to execute "/etc/init.d/mbemon stop" before stopping the backend.
For my purposes, I've created a single stop script that stops apache, mbemon, mythbackend & mysql with a single command. I've also created the corresponding start script...
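As a rough outline (the init script names may differ on your system, so treat this as a sketch rather than my exact script):
Code:
#!/bin/sh
# Sketch of a combined stop script: stop the monitor first so it doesn't
# restart the backend behind your back, then stop the services themselves.
/etc/init.d/mbemon stop
/etc/init.d/apache stop
/etc/init.d/mythtv-backend stop
/etc/init.d/mysql stop
The start script is the same list in reverse order with "start" instead of "stop".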
Cesman, I agree that the ultimate goal is to diagnose back-end failures and fix the underlying bugs so that over time the backend becomes 100% stable. That is the reason the script captures diagnostic information before doing a restart. My problem has been that the failures, while very infrequent, tend to happen while I am out of the country, and I wind up talking my wife through the reset process from 10,000 miles away, after she has been unable to use the system for some period of time. I felt this was a better solution for me, given my circumstances...
Marc
marc.aronson
Posted: Sun Jul 30, 2006 6:21 pm
Joined: Tue Jan 18, 2005 2:07 am | Posts: 1532 | Location: California
Cesman, I've been thinking some more about your post. It seems like another one of your concerns is the risk of entering a "restart loop" if the cause of the halt is a persistent problem. Is this correct?
Marc
tjc
Posted: Sun Jul 30, 2006 6:28 pm
Joined: Thu Mar 25, 2004 11:00 am | Posts: 9551 | Location: Arlington, MA
Yes. It's a classic systems programming problem. Automatic restart needs to be handled very carefully lest it make the problem far worse. There are various ways you can try to mitigate the risk, like restart counts and doubling restart delays, but even with those it can bite you hard.
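For illustration only, a doubling restart delay takes just a few lines of shell. This sketch is not part of the script posted above, and the thresholds are arbitrary:
Code:
#!/bin/sh
# Sketch of a doubling restart delay: each consecutive failure doubles the
# wait before the next restart attempt, capped at one hour.
delay=30
while true
do
    if pgrep mythbackend > /dev/null
    then
        delay=30                      # backend is healthy, reset the backoff
    else
        sleep $delay                  # wait before attempting the restart
        /etc/init.d/mythtv-backend start
        delay=`expr $delay \* 2`      # double the delay for the next failure
        [ $delay -gt 3600 ] && delay=3600
    fi
    sleep 60
done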
marc.aronson
Posted: Sun Jul 30, 2006 6:49 pm
Joined: Tue Jan 18, 2005 2:07 am | Posts: 1532 | Location: California
TJC, agreed. This is a good point. While it's not a bullet-proof approach, I've modified the script to track restarts. In this new version, if it detects a need to restart the backend more than 4 times over the previous 15 checks, it will log a message in its log file and exit, so that it doesn't trigger a permanent restart loop.
I've tested this condition and it seems to work properly, but I want to let it run for a bit before I post the updated version.
Marc
cesman
Posted: Sun Jul 30, 2006 7:54 pm
Joined: Fri Sep 19, 2003 7:05 pm | Posts: 5088 | Location: Fontana, Ca
Yes, that is my concern.
_________________ cesman
When the source is open, the possibilities are endless!
marc.aronson
Posted: Sun Jul 30, 2006 8:22 pm
Joined: Tue Jan 18, 2005 2:07 am | Posts: 1532 | Location: California
I've updated the script in my original post to reduce the risk of this happening. In this new version, if it detects a need to restart the backend more than 4 times over the previous 15 checks, it will log a message in its log file and exit, so that it doesn't trigger a permanent restart loop.
Marc
randomhtpcguy
Posted: Wed Aug 02, 2006 7:55 am
Joined: Mon Nov 07, 2005 10:09 am | Posts: 153
Marc, looks like you're making good progress. To answer your question about deliberate restarts: I do /etc/init.d/monit stop first and then the others you mentioned.
The frontend needs to be started from a wrapper which creates a pid file, like the backend, mysql and gdm, so I think it will be more complicated to set up using monit (they give a suggested wrapper in the FAQ). The backend and mysql already start this way, so they are very easy: you just add the pid file to the config and you're done.
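For reference, the wrapper only needs to record the child's pid somewhere monit can find it -- roughly like this sketch (the pid file path is just an example, and this is not the exact wrapper from the FAQ):
Code:
#!/bin/sh
# Sketch of a start wrapper that records mythfrontend's pid for monit.
# Assumes DISPLAY is already set for the running X session.
mythfrontend > /dev/null 2>&1 &
echo $! > /var/run/mythfrontend.pid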
Auto-restarting the frontend might be more trouble than it's worth. It's obvious when it has crashed, and nothing is lost if it crashes while you're away. A gdm restart from the remote would be sufficient when the remote is working.
I still have periodic freezes of the frontend in MythMusic when I navigate quickly between songs, but the remote stops responding (no keyboard), so the gdm restart doesn't work. I would like to have the box's power button do a low-level kill, a magic SysRq, or some safe reboot, but so far I've had to do a hard reset, which forces me to run fsck manually on reboot. That's dangerous, but luckily it appears to fix things without moving stuff to lost+found.
Obviously, I wouldn't want this type of crash (freeze) to be automatically reset and ignored, as cesman points out. The only reason I bring this up is ... could mysql or the backend be crashing and restarting in a loop, causing the frontend to crash? The restart limit with increasing delays would likely help. I think any time monit or your script has to take action, the admin should be notified. When I'm home I periodically check the uptimes on the monit web page. It's in the logs, but as you say this is designed for when you're away, so maybe an email would be nice.
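Something as simple as the following, dropped into the restart branch of the script, would cover the email part. It assumes the box can actually send mail, and the address is only a placeholder:
Code:
# Hypothetical addition to mbemon.sh at the point where the restart happens.
echo "mbemon restarted the backend on `hostname` at `date`: $status" \
    | mail -s "mbemon restart on `hostname`" admin@example.com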
Cesman -- an "unusual increased rate of logging" detection and alert is something I should try before I suggest it. I know it would be complicated, as some normal processes, like building a DVD or transcoding, probably grow the log. I will look around, though. Usually the small size of root solves this problem, as it fills up and prevents logging in. So I guess in principle this forces the user to fix problems early and often. I guess it's more clever than I realized.
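Just to sketch the idea (the threshold is arbitrary and this is untested), a growth-rate check only needs to remember the size from the previous pass:
Code:
#!/bin/sh
# Sketch of a log growth-rate alert: compare the current size of the backend
# log against the size recorded on the previous pass and warn if it grew by
# more than an arbitrary threshold.
mbelog=/var/log/mythtv/mythbackend.log
state=/var/log/mythtv/mbemon/lastsize
threshold=500                          # blocks per pass -- tune to taste
cur=`ls -s $mbelog | cut -d" " -f1`
prev=`cat $state 2>/dev/null || echo 0`
if [ `expr $cur - $prev` -gt $threshold ]
then
    echo "Backend log grew from $prev to $cur blocks -- possible runaway"
fi
echo $cur > $state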
cesman
Posted: Wed Aug 02, 2006 8:39 am
Joined: Fri Sep 19, 2003 7:05 pm | Posts: 5088 | Location: Fontana, Ca
The idea is to find the issue and resolve it permanently.
_________________ cesman
When the source is open, the possibilities are endless!
marc.aronson
Posted: Wed Aug 02, 2006 10:56 pm
Joined: Tue Jan 18, 2005 2:07 am | Posts: 1532 | Location: California
randomhtpcguy, it sounds like your system has a nasty problem with the kind of front-end lockup you are seeing. Have you checked /var/log/messages and /var/log/kern.log to see if there are any diagnostic messages that would help solve the problem? I was having problems with front-end lock ups on my old hardware. I would get messages that read something like this:
Code:
Mar 8 20:45:09 mythhd kernel: NVRM: Xid: 16, Head 00000000 Count 003ada32
Mar 8 20:45:13 mythhd kernel: NVRM: Xid: 8, Channel 0000001e
Based on what I read in various forums, that problem is sometimes caused by a motherboard problem, so I rebuilt my system with a new motherboard & CPU, and that problem has gone away. I'm not sure if this applies to you, but I thought I'd mention it.
Marc