crash page from Section 8 of the UNIX 7th Edition manual


     CRASH(8)                                                 CRASH(8)

     NAME
          crash - what to do when the system crashes

     DESCRIPTION
          This section gives at least a few clues about how to proceed
          if the system crashes.  It can't pretend to be complete.

          Bringing it back up. If the reason for the crash is not
          evident (see below for guidance on `evident') you may want to
          try to dump the system if you feel up to debugging.  At the
          moment a dump can be taken only on magtape.  With a tape
          mounted and ready, stop the machine, load address 44, and
          start.  This should write a copy of all of core on the tape
          with an EOF mark.  Caution: Any error is taken to mean the
          end of core has been reached.  This means that you must be
          sure the ring is in, the tape is ready, and the tape is
          clean and new.  If the dump fails, you can try again, but
          some of the registers will be lost.  See below for what to
          do with the tape.

          In restarting after a crash, always bring up the system
          single-user.  This is accomplished by following the
          directions in boot(8) as modified for your particular
          installation; a single-user system is indicated by having a
          particular value in the switches (173030 unless you've
          changed init) as the system starts executing.  When it is
          running, perform a dcheck(1) and icheck(1) on all file
          systems which could have been in use at the time of the
          crash.  If any serious file system problems are found, they
          should be repaired.  When you are satisfied with the health
          of your disks, check and set the date if necessary, then
          come up multi-user.  This is most easily accomplished by
          changing the single-user value in the switches to something
          else, then logging out by typing an EOT.

          For UNIX to boot at all, three files (and the directories
          leading to them) must be intact.  First, the initialization
          program /etc/init must be present and executable.  If it is
          not, the CPU will loop in user mode at location 6.  For init
          to work correctly, /dev/tty8 and /bin/sh must be present.
          If either does not exist, the symptom is best described as
          thrashing.  Init will go into a fork/exec loop trying to
          create a Shell with proper standard input and output.

          If you cannot get the system to boot, a runnable system must
          be obtained from a backup medium.  The root file system may
          then be doctored as a mounted file system as described
          below.  If there are any problems with the root file system,
          it is probably prudent to go to a backup system to avoid
          working on a mounted file system.

          Repairing disks. The first rule to keep in mind is that an
          addled disk should be treated gently; it shouldn't be
          mounted unless necessary, and if it is very valuable yet in
          quite bad shape, perhaps it should be dumped before trying
          surgery on it.  This is an area where experience and
          informed courage count for much.

          The problems reported by icheck typically fall into two
          kinds.  There can be problems with the free list: duplicates
          in the free list, or free blocks also in files.  These can
          be cured easily with an icheck -s. If the same block appears
          in more than one file or if a file contains bad blocks, the
          files should be deleted, and the free list reconstructed.
          The best way to delete such a file is to use clri(1), then
          remove its directory entries.  If any of the affected files
          is really precious, you can try to copy it to another device
          first.

          Dcheck may report files which have more directory entries
          than links.  Such situations are potentially dangerous; clri
          discusses a special case of the problem.  All the directory
          entries for the file should be removed.  If on the other
          hand there are more links than directory entries, there is
          no danger of spreading infection, but merely some disk space
          that is lost for use.  It is sufficient to copy the file (if
          it has any entries and is useful) then use clri on its inode
          and remove any directory entries that do exist.

          Finally, there may be inodes reported by dcheck that have 0
          links and 0 entries.  These occur on the root device when
          the system is stopped with pipes open, and on other file
          systems when the system stops with files that have been
          deleted while still open.  A clri will free the inode, and
          an icheck -s will recover any missing blocks.
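
          The dcheck cases above can be summarized as a small decision
          function.  This is only a sketch in Python: the name
          dcheck_action and the wording of the returned advice are
          invented here for illustration and are not part of dcheck.

```python
def dcheck_action(links, entries):
    """Suggest a repair for one inode, given its link count and the
    number of directory entries found for it (both hypothetical inputs
    read off a dcheck report by hand)."""
    if links == 0 and entries == 0:
        # Left over from open pipes or files deleted while still open.
        return "clri the inode, then icheck -s"
    if entries > links:
        # Potentially dangerous case.
        return "dangerous: remove all directory entries (see clri)"
    if links > entries:
        # No danger, only disk space lost for use.
        return "lost space: copy if useful, clri the inode, remove entries"
    return "consistent: nothing to do"
```

          For example, an inode with 1 link and 2 directory entries
          falls in the dangerous case.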

          Why did it crash? UNIX types a message on the console
          typewriter when it voluntarily crashes.  Here is the current
          list of such messages, with enough information to provide
          at least a hope of a remedy.  The message has the form
          `panic: ...', possibly accompanied by other information.
          Left unstated in all cases is the possibility that hardware
          or software error produced the message in some unexpected
          way.

          blkdev
               The getblk routine was called with a nonexistent major
               device as argument.  Definitely hardware or software
               error.

          devtab
               Null device table entry for the major device used as
               argument to getblk. Definitely hardware or software
               error.

          iinit
               An I/O error reading the super-block for the root file
               system during initialization.

          out of inodes
               A mounted file system has no more i-nodes when creating
               a file.  Sorry, the message does not name the device;
               icheck should tell you which one it is.

          no fs
               A device has disappeared from the mounted-device table.
               Definitely hardware or software error.

          no imt
               Like `no fs', but produced elsewhere.

          no inodes
               The in-core inode table is full.  Try increasing NINODE
               in param.h.  Shouldn't be a panic, just a user error.

          no clock
               During initialization, neither the line clock nor the
               programmable clock was found to exist.

          swap error
               An unrecoverable I/O error during a swap.  Really
               shouldn't be a panic, but it is hard to fix.

          unlink - iget
               The directory containing a file being deleted can't be
               found.  Hardware or software.

          out of swap space
               A program needs to be swapped out, and there is no more
               swap space.  It has to be increased.  This really
               shouldn't be a panic, but there is no easy fix.

          out of text
               A pure procedure program is being executed, and the
               table for such things is full.  This shouldn't be a
               panic.

          trap
               An unexpected trap has occurred within the system.
               This is accompanied by three numbers: a `ka6', which is
               the contents of the segmentation register for the area
               in which the system's stack is kept; `aps', which is
               the location where the hardware stored the program
               status word during the trap; and a `trap type' which
               encodes which trap occurred.  The trap types are:

          0         bus error
          1         illegal instruction
          2         BPT/trace
          3         IOT
          4         power fail
          5         EMT
          6         recursive system call (TRAP instruction)
          7         11/70 cache parity, or programmed interrupt
          10        floating point trap
          11        segmentation violation

          In some of these cases it is possible for octal 20 to be
          added into the trap type; this indicates that the processor
          was in user mode when the trap occurred.  If you wish to
          examine the stack after such a trap, either dump the system,
          or use the console switches to examine core; the required
          address mapping is described below.
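
          As an illustration, the trap type can be decoded
          mechanically.  The Python sketch below assumes the number is
          read off the console as octal digits (the table above runs
          7, 10, 11) and treats the added octal 20 as a flag bit; the
          names TRAP_TYPES, USER_MODE and decode_trap are made up for
          this example.

```python
# Trap types from the table above, keyed by their octal value.
TRAP_TYPES = {
    0o0: "bus error",
    0o1: "illegal instruction",
    0o2: "BPT/trace",
    0o3: "IOT",
    0o4: "power fail",
    0o5: "EMT",
    0o6: "recursive system call (TRAP instruction)",
    0o7: "11/70 cache parity, or programmed interrupt",
    0o10: "floating point trap",
    0o11: "segmentation violation",
}

USER_MODE = 0o20  # added in when the processor was in user mode


def decode_trap(text):
    """Decode a trap type given as a string of octal digits.

    Returns (description, was_in_user_mode)."""
    value = int(text, 8)
    user = bool(value & USER_MODE)
    name = TRAP_TYPES.get(value & ~USER_MODE, "unknown")
    return name, user
```

          Under these assumptions, decode_trap("31") gives a
          segmentation violation taken in user mode.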

          Interpreting dumps. All file system problems should be taken
          care of before attempting to look at dumps.  The dump should
          be read into the file /usr/sys/core; cp(1) will do.  At this
          point, you should execute ps -alxk and who to print the
          process table and the users who were on at the time of the
          crash.  You should dump (od(1)) the first 30 bytes of
          /usr/sys/core. Starting at location 4, the registers R0, R1,
          R2, R3, R4, R5, SP and KDSA6 (KISA6 for 11/40s) are stored.
          If the dump had to be restarted, R0 will not be correct.
          Next, take the value of KA6 (location 022(8) in the dump)
          multiplied by 0100(8) and dump 01000(8) words starting from
          there.  This is the per-process data associated with the
          process running at the time of the crash.  Relabel the
          addresses 140000 to 141776.  R5 is C's frame or display
          pointer.  Stored at (R5) is the old R5 pointing to the
          previous stack frame.  At (R5)+2 is the saved PC of the calling
          procedure.  Trace this calling chain until you obtain an R5
          value of 141756, which is where the user's R5 is stored.  If
          the chain is broken, you have to look for a plausible R5, PC
          pair and continue from there.  Each PC should be looked up
          in the system's name list using adb(1) and its `:' command,
          to get a reverse calling order.  In most cases this
          procedure will give an idea of what is wrong.  A more complete
          discussion of system debugging is impossible here.
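
          The arithmetic above can be sketched in Python as a rough
          aid, not a replacement for od and adb.  The sketch assumes
          the dump is a raw image of core in which the file offset
          equals the physical address, that words are 16 bits and
          little-endian, and that older stack frames lie at higher
          addresses.  All function and constant names are invented
          for this example.

```python
import struct

U_BASE = 0o140000    # address the per-process data is relabeled to
U_TOP = 0o141776     # last word address of that area
USER_R5 = 0o141756   # where the user's R5 is stored; the chain ends here
REG_NAMES = ["R0", "R1", "R2", "R3", "R4", "R5", "SP", "KA6"]


def read_word(core, offset):
    """Return the 16-bit little-endian word at a byte offset in the dump."""
    return struct.unpack_from("<H", core, offset)[0]


def saved_registers(core):
    """Registers stored from location 4 on, one word each (KA6 at 022)."""
    return {name: read_word(core, 4 + 2 * i)
            for i, name in enumerate(REG_NAMES)}


def u_offset(ka6):
    """Byte offset of the per-process data: KA6 multiplied by 0100(8)."""
    return ka6 * 0o100


def trace_pcs(core):
    """Follow the saved R5 links; return saved PCs, innermost call first.

    Stops at USER_R5, or early if the chain looks broken, in which case
    a plausible R5, PC pair still has to be found by hand."""
    regs = saved_registers(core)
    base = u_offset(regs["KA6"])
    r5 = regs["R5"]
    pcs = []
    while r5 != USER_R5:
        if not (U_BASE <= r5 <= U_TOP - 2) or r5 % 2:
            break                      # R5 points outside the area
        where = base + (r5 - U_BASE)   # undo the relabeling
        old_r5 = read_word(core, where)
        pcs.append(read_word(core, where + 2))
        if old_r5 <= r5:
            break                      # chain does not move up the stack
        r5 = old_r5
    return pcs
```

          With the dump read into /usr/sys/core, the bytes of that
          file would be passed as core, and each PC returned would
          then be looked up in the system's name list with adb(1).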

     SEE ALSO
          clri(1), icheck(1), dcheck(1), boot(8)