     DOC2TXT(1)                                             DOC2TXT(1)

     NAME
          doc2txt, doc2ps, wdoc2txt, xls2txt, olefs, mswordstrings,
          msexceltables - extract printable text from Microsoft
          documents

     SYNOPSIS
          doc2txt [ file.doc ]
          doc2ps [ file.doc ]
          wdoc2txt [ file.doc ]
          xls2txt [ file.xls ]
          aux/olefs [ -m mtpt ] file.doc
          aux/mswordstrings mtpt/WordDocument
          aux/msexceltables [ -qaDnt ] [ -d delim ] [ -c column-range
          ] [ -w worksheet-range ] mtpt/Workbook

     DESCRIPTION
          Doc2txt is an rc(1) script that uses olefs and mswordstrings
          to extract the printable text from the body of a Microsoft
          Word document and write it on the standard output.  Doc2ps
          is similar, but emits PostScript corresponding to the docu-
          ment.  Wdoc2txt is similar to doc2txt, but uses plumb(1) to
          send the output to a new acme(1) window instead.  Xls2txt
          performs a similar function for Microsoft Excel documents.

          Microsoft Office documents are stored in OLE (Object Linking
          and Embedding) format, which is a scaled-down version of
          Microsoft's FAT file system.  Olefs presents the contents of
          an MS Office document as a file system on mtpt, which
          defaults to /mnt/doc.  Mswordstrings or msexceltables may
          then be used to parse the files inside, extracting a text
          stream.  Msexceltables may be given options to control the
          formatting of its output.

          -a        Attempt conversion of non-tabular sheets in the
                    workbook (charts).
          -d delim  Sets the inter-field delimiter to the string
                    delim, by default a single space.
          -D        Enables debugging output.
          -c range  Range is a comma-separated list of column numbers
                    and ranges; the two ends of a range are separated
                    by a dash.  Processing is limited to the columns
                    named; by default all columns are output.
          -n        Disables field padding to column width.
          -q        Disable quoting of textual fields (see quote(2)).
          -t        Truncate fields to the column width.
          -w range  Range is a comma-separated list of worksheet
                    numbers and ranges, in the same syntax as the -c
                    option above; only the sheets named are output.
                    Suppressed chart pages are always included in the
                    sheet count.

     EXAMPLE
          Extract pieces of an MS Excel spreadsheet.
               aux/olefs report.xls
               aux/msexceltables -q -w 1,7,9-14 -c 3-5 -n -d '@' /mnt/doc/Workbook > rpt.txt
               unmount /mnt/doc
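
          Extract the text of an MS Word document by hand, using
          the programs that doc2txt is built on.  This is a sketch:
          the file name report.doc is hypothetical, and doc2txt
          itself may do more than these three steps.
               aux/olefs report.doc
               aux/mswordstrings /mnt/doc/WordDocument > report.txt
               unmount /mnt/doc

          Mount a spreadsheet somewhere other than /mnt/doc and
          print its first worksheet as unpadded, comma-delimited
          fields.  The mount point /n/xls is an assumption;
          substitute any directory you can mount on.
               aux/olefs -m /n/xls report.xls
               aux/msexceltables -n -d , -w 1 /n/xls/Workbook
               unmount /n/xls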

     SOURCE
          /rc/bin              doc2txt, doc2ps, wdoc2txt, and xls2txt
          /sys/src/cmd/aux     the others

     SEE ALSO
          strings(1)
          ``Microsoft Word 97 Binary File Format'', at Microsoft's
          developer (MSDN) home page.
          ``LAOLA Binary Structures'',
          http://user.cs.tu-berlin.de/~schwartz/pmh
          ``OpenOffice.Org's Excel Documentation'',
          http://sc.openoffice.org/excelfileformat.pdf