NAME
     crash - what to do when the system crashes

DESCRIPTION
     This section gives at least a few clues about how to proceed
     if the system crashes.  It can't pretend to be complete.

     How to bring it back up.   If the reason for the crash is
     not evident (see below for guidance on `evident') you may
     want to try to dump the system if you feel up to debugging.
     At the moment a dump can be taken only on magtape.  With a
     tape mounted and ready, stop the machine, load address 44,
     and start.  This should write a copy of all of core on the
     tape with an EOF mark.  Caution: Any error is taken to mean
     the end of core has been reached.  This means that you must
     be sure the ring is in, the tape is ready, and the tape is
     clean and new.  If the dump fails, you can try again, but
     some of the registers will be lost.  See below for what to
     do with the tape.

     In restarting after a crash, always bring up the system
     single-user.  This is accomplished by following the
     directions in boot procedures (VIII) as modified for your
     particular installation; a single-user system is indicated
     by having a particular value in the switches (173030 unless
     you've changed init) as the system starts executing.  When
     it is running, perform a dcheck and icheck (VIII) on all
     file systems which could have been in use at the time of
     the crash.
     If any serious file system problems are found, they should
     be repaired.  When you are satisfied with the health of your
     disks, check and set the date if necessary, then come up
     multi-user.  This is most easily accomplished by changing
     the single-user value in the switches to something else,
     then logging out by typing an EOT.

     For UNIX to boot at all, three files (and the directories
     leading to them) must be intact.  First, the initialization
     program /etc/init must be present and executable.  If it is
     not, the CPU will loop in user mode at location 6.  For init
     to work correctly, /dev/tty8 and /bin/sh must be present.
     If either does not exist, the symptom is best described as
     thrashing.  Init will go into a fork/exec loop trying to
     create a Shell with proper standard input and output.
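
     The three-file requirement above can be checked
     mechanically when the suspect root file system is mounted
     under some other running system.  The following is only a
     sketch under assumptions: it is written in Python for a
     modern host, and the function name missing_boot_files and
     the mount point passed to it are hypothetical, not part of
     UNIX.

```python
import os

def missing_boot_files(root):
    """List problems with the three files needed to boot.

    `root` is a hypothetical mount point of the root file system
    being inspected (an assumption for illustration).
    """
    problems = []

    # /etc/init must be present and executable.
    init = os.path.join(root, "etc/init")
    if not os.path.isfile(init):
        problems.append("/etc/init missing")
    elif not os.access(init, os.X_OK):
        problems.append("/etc/init not executable")

    # init needs /dev/tty8 and /bin/sh to create a Shell.
    for name in ("dev/tty8", "bin/sh"):
        if not os.path.exists(os.path.join(root, name)):
            problems.append("/" + name + " missing")

    return problems
```

     An empty list means all three files were found; it says
     nothing about whether their contents are sound.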

     If you cannot get the system to boot, a runnable system must
     be obtained from a backup medium.  The root file system may
     then be doctored as a mounted file system as described
     below.  If there are any problems with the root file system,
     it is probably prudent to go to a backup system to avoid
     working on a mounted file system.

     Repairing disks.   The first rule to keep in mind is that an
     addled disk should be treated gently; it shouldn't be
     mounted unless necessary, and if it is very valuable yet in
     quite bad shape, perhaps it should be dumped before trying
     surgery on it.  This is an area where experience and
     informed courage count for much.

     The problems reported by icheck are typically of two
     kinds.  There can be problems with the free list: duplicates
     in the free list, or free blocks also in files.  These can
     be cured easily with an icheck -s.  If the same block
     appears in more than one file or if a file contains bad
     blocks, the files should be deleted, and the free list
     reconstructed.  The best way to delete such a file is to use
     clri (VIII), then remove its directory entries.  If any of
     the affected files is really precious, you can try to copy
     it to another device first.

     Dcheck may report files which have more directory entries
     than links.  Such situations are potentially dangerous; clri
     discusses a special case of the problem.  All the directory
     entries for the file should be removed.  If on the other
     hand there are more links than directory entries, there is
     no danger of spreading infection, but merely some disk space
     that is lost for use.  It is sufficient to copy the file (if
     it has any entries and is useful) then use clri on its inode
     and remove any directory entries that do exist.

     Finally, there may be inodes reported by dcheck that have 0
     links and 0 entries.  These occur on the root device when
     the system is stopped with pipes open, and on other file
     systems when the system stops with files that have been
     deleted while still open.  A clri will free the inode, and
     an icheck -s will recover any missing blocks.
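
     The three dcheck cases described above amount to a small
     decision table.  As a rough summary, and not as a
     replacement for judgment, they can be sketched in Python;
     the function name dcheck_action and its return strings are
     illustrative assumptions.

```python
def dcheck_action(links, entries):
    """Suggest a repair from an inode's link count and the number
    of directory entries found for it, following the text above."""
    if links == 0 and entries == 0:
        # Left over from open pipes or files deleted while open.
        return "clri the inode, then icheck -s to recover missing blocks"
    if entries > links:
        # Potentially dangerous case.
        return "remove all directory entries for the file"
    if links > entries:
        # Only lost disk space.
        return "copy the file if useful, clri its inode, remove remaining entries"
    return "consistent; nothing to do"
```

     For example, dcheck_action(1, 2) selects the dangerous
     case, in which every directory entry should be removed.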

     Why did it crash?   UNIX types a message on the console
     typewriter when it voluntarily crashes.  Here is the current
     list of such messages, with enough information to provide a
     hope at least of the remedy.  The message has the form
     `panic: ...', possibly accompanied by other information.
     Left unstated in all cases is the possibility that hardware
     or software error produced the message in some unexpected
     way.

     blkdev
          The getblk routine was called with a nonexistent major
          device as argument.  Definitely hardware or software
          error.

     devtab
          Null device table entry for the major device used as
          argument to getblk.  Definitely hardware or software
          error.

     iinit
          An I/O error reading the super-block for the root file
          system during initialization.

     out of inodes
          A mounted file system has no more i-nodes when creating
          a file.  Unfortunately the message does not identify
          the device; icheck should tell you which one it is.

     no fs
          A device has disappeared from the mounted-device table.
          Definitely hardware or software error.

     no imt
          Like `no fs', but produced elsewhere.

     no inodes
          The in-core inode table is full.  Try increasing NINODE
          in param.h.  Shouldn't be a panic, just a user error.

     no clock
          During initialization, neither the line clock nor the
          programmable clock was found to exist.

     swap error
          An unrecoverable I/O error during a swap.  Really
          shouldn't be a panic, but it is hard to fix.

     unlink - iget
          The directory containing a file being deleted can't be
          found.  Hardware or software.

     out of swap space
          A program needs to be swapped out, and there is no more
          swap space.  It has to be increased.  This really
          shouldn't be a panic, but there is no easy fix.

     out of text
          A pure procedure program is being executed, and the
          table for such things is full.  This shouldn't be a
          panic.

     trap
          An unexpected trap has occurred within the system.
          This is accompanied by three numbers: a `ka6', which is
          the contents of the segmentation register for the area
          in which the system's stack is kept; `aps', which is
          the location where the hardware stored the program
          status word during the trap; and a `trap type' which
          encodes which trap occurred.  The trap types (in octal)
          are:

          0    bus error
          1    illegal instruction
          2    BPT/trace
          3    IOT
          4    power fail
          5    EMT
          6    recursive system call (TRAP instruction)
          7    11/70 cache parity, or programmed interrupt
          10   floating point trap
          11   segmentation violation

     In some of these cases it is possible for octal 20 to be
     added into the trap type; this indicates that the processor
     was in user mode when the trap occurred.  If you wish to
     examine the stack after such a trap, either dump the system,
     or use the console switches to examine core; the required
     address mapping is described below.
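
     The trap type printed with the panic can be decoded from
     the table above together with the octal 20 user-mode flag.
     A minimal sketch in Python follows; the names TRAP_TYPES,
     USER_MODE and decode_trap are assumptions made for
     illustration, and the value is taken as an integer already
     converted from the octal digits typed on the console.

```python
# Trap types from the table above, keyed by their octal values.
TRAP_TYPES = {
    0o0: "bus error",
    0o1: "illegal instruction",
    0o2: "BPT/trace",
    0o3: "IOT",
    0o4: "power fail",
    0o5: "EMT",
    0o6: "recursive system call (TRAP instruction)",
    0o7: "11/70 cache parity, or programmed interrupt",
    0o10: "floating point trap",
    0o11: "segmentation violation",
}

USER_MODE = 0o20  # added in when the processor was in user mode

def decode_trap(trap_type):
    """Return (description, was_user_mode) for a trap type value."""
    user = bool(trap_type & USER_MODE)
    name = TRAP_TYPES.get(trap_type & ~USER_MODE, "unknown")
    return name, user
```

     A console value of 31 would be passed as
     decode_trap(int("31", 8)), which yields a segmentation
     violation taken in user mode.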

     Interpreting dumps.   All file system problems should be
     taken care of before attempting to look at dumps.  The dump
     should be read into the file /usr/sys/core; cp (I) will do.
     At this point, you should execute ps -alxk and who to print
     the process table and the users who were on at the time of
     the crash.  You should dump (od (I)) the first 30 bytes of
     /usr/sys/core.  Starting at location 4, the registers R0,
     R1, R2, R3, R4, R5, SP and KDSA6 (KISA6 for 11/40s) are
     stored.  If the dump had to be restarted, R0 will not be
     correct.  Next, take the value of KA6 (location 22(8) in the
     dump) multiplied by 100(8) and dump 1000(8) bytes starting
     from there.  This is the per-process data associated with
     the process running at the time of the crash.  Relabel the
     addresses 140000 to 141776.  R5 is C's frame or display
     pointer.  Stored at (R5) is the old R5 pointing to the
     previous stack frame.  At (R5)+2 is the saved PC of the
     calling procedure.  Trace this calling chain until you
     obtain an R5 value of 141756, which is where the user's R5
     is stored.  If the chain is broken, you have to look for a
     plausible R5, PC pair and continue from there.  Each PC
     should be looked up in the system's name list using db (I)
     and its `:' command, to get a reverse calling order.  In
     most cases this procedure will give an idea of what is
     wrong.  A more complete discussion of system debugging is
     impossible here.
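
     The register layout and the R5 chain described above can
     also be followed by program.  The sketch below, in Python,
     rests on several assumptions: the dump is available as a
     byte string, words are stored low byte first as on the
     PDP-11, and the names read_registers, per_process_offset
     and walk_frames are invented for illustration.  It stops
     at a broken chain rather than guessing a plausible R5, PC
     pair.

```python
import struct

REG_NAMES = ("R0", "R1", "R2", "R3", "R4", "R5", "SP", "KA6")
U_BASE = 0o140000    # label given to the start of the per-process data
U_END = 0o141776     # last word address of the relabeled area
USER_R5 = 0o141756   # where the user's R5 is stored; ends the chain

def read_registers(core):
    """Return the eight words saved from location 4 of the dump."""
    words = struct.unpack_from("<8H", core, 4)
    return dict(zip(REG_NAMES, words))

def per_process_offset(ka6):
    """Byte offset in the dump of the per-process data: KA6 * 100(8)."""
    return ka6 * 0o100

def walk_frames(core, regs, limit=64):
    """Follow the R5 chain and return the saved PCs, innermost first."""
    base = per_process_offset(regs["KA6"])

    def word(vaddr):
        # Map a relabeled address (140000 and up) back into the dump.
        return struct.unpack_from("<H", core, base + (vaddr - U_BASE))[0]

    pcs = []
    r5 = regs["R5"]
    while r5 != USER_R5 and len(pcs) < limit:
        if not (U_BASE <= r5 < U_END) or r5 % 2:
            break                    # chain is broken; stop here
        pcs.append(word(r5 + 2))     # saved PC of the calling procedure
        r5 = word(r5)                # old R5, the previous stack frame
    return pcs
```

     Each PC returned would still have to be looked up in the
     system's name list by hand, as described above.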

SEE ALSO
     clri, icheck, dcheck, boot procedures (VIII)

BUGS
