This file contains a write-up of the more technical aspects of uuencoding
and uudecoding.  First read the file UUSER.DOC, then read this for more
details.

Documentation for UUENCODE/DECODE 5.32

UU-encoding is a way to code a file which may contain any characters into
a standard character set that can be reliably sent over diverse networks.


THE CHARACTER ENCODING:

The basic scheme is to break groups of 3 eight bit characters (24 bits)
into 4 six bit characters and then add 32 (a space) to each six bit
character which maps it into a readily transmittable character.
Another way of phrasing this is to say that the encoded 6 bit characters
are mapped into the set:
        `!"#$%&'()*+,-./0123456789:;<=>?@ABC...XYZ[\]^_
for transmission over communications lines.

As some transmission mechanisms compress or remove spaces, spaces are
changed into back-quote characters (a 96).  (A better scheme might be to
use a bias of 33 so the space is not created, but this is not done.)
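
The basic scheme can be sketched in a few lines of Python.  This is only
an illustration of the mapping described above, not the code used by
these programs; the function name uu_encode_group is made up for the
example.

```python
def uu_encode_group(group: bytes) -> str:
    """Encode exactly three 8-bit characters as four UU characters."""
    if len(group) != 3:
        raise ValueError("expected exactly 3 bytes")
    # Pack the 24 bits into one integer, then peel off four 6-bit values.
    bits = (group[0] << 16) | (group[1] << 8) | group[2]
    sixes = [(bits >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    # Add the bias of 32; a resulting space (value 0) becomes a back-quote.
    return "".join("`" if v == 0 else chr(v + 32) for v in sixes)


print(uu_encode_group(b"Cat"))  # 0V%T
```

The three characters "Cat" split into the six bit values 16, 54, 5 and
52, which become "0V%T" after the bias is added.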

A newer, less popular, encoding method, called XX-encoding uses the set:
        +-01..89ABC...XYZabc...xyz

In my opinion, XX-encoding is superior to UU-encoding because it uses
more "normal" characters that are less likely to get corrupted.  In fact
several of the special characters in the UU set do not get through an
EBCDIC to ASCII translation correctly.  Conversely, an advantage of the
UU set is that it does not use lower case characters.  Nowadays both
upper and lower case are sent with no problems; maybe in the
communications dark ages there was a problem with lower case.
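
To make the two character sets concrete, here is a small Python sketch
that builds both as 64 character strings and maps the same six bit values
through each.  The names UU_SET, XX_SET and encode_values are invented
for the example; the XX set is assumed to be exactly the sequence listed
above (plus, minus, digits, upper case, lower case).

```python
import string

# UU set: six-bit value v maps to chr(v + 32), except 0 which becomes '`'.
UU_SET = "`" + "".join(chr(v + 32) for v in range(1, 64))
# XX set: '+', '-', the digits, upper case, then lower case.
XX_SET = "+-" + string.digits + string.ascii_uppercase + string.ascii_lowercase


def encode_values(values, charset):
    """Map a sequence of six-bit values (0..63) through a 64-character set."""
    return "".join(charset[v] for v in values)


values = [16, 54, 5, 52]  # the four six-bit values of the bytes "Cat"
print(encode_values(values, UU_SET))  # 0V%T
print(encode_values(values, XX_SET))  # Eq3o
```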

This "UU" encode/decode pair can handle either XX or UU encoding.  The
encode program defaults to creating a UU encoded file; but can be run
with a "-x" option to create an XX encoding.

The decode program defaults to autodetect.  However the program can get
confused by comment lines preceding the actual encoded data.  The decode
mode can be forced to UU or XX with the "-u" or "-x" parameter.

Another option is for the character mapping table to be inserted at the
front of the file.  The format for this is discussed later.  The table
parameters are detected and used by this decode program.  (A table will
override the "-x" or "-u" parameters.)  The encode program can be run
with a "-t" option which tells it to put the table into the encoded file.

A third encode mapping is the one used by Brad Templeton's ABE program.
This is not handled by these programs as the check and control
information surrounding the actual encoded data is in a different form.

From a theoretical view, this encoding is breaking down 24 bits modulo 64.
Note that 64**4 = 2**24.  The result is 24 bits in for 32 bits out, a
33% size increase.  Note that 85**5 > 2**32.  Also note that there are 94
transmittable ASCII characters (from 0x21 through 0x7e).  Thus modulo 85
encoding (the atob encoder) transforms 32 bits to 5 ASCII chars or 40
bits for a 25% size increase.
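
The arithmetic is easy to check.  The following Python lines are just a
sanity check of the capacities and growth figures; the variable names are
made up for the example.

```python
# Capacity checks for the two encodings discussed above.
assert 64**4 == 2**24          # four 6-bit characters hold exactly 3 bytes
assert 85**4 < 2**32 < 85**5   # five base-85 digits are needed for 4 bytes

printable = 0x7E - 0x21 + 1    # transmittable ASCII characters
uu_growth = 32 / 24 - 1        # 24 bits in, 32 bits out
atob_growth = 40 / 32 - 1      # 32 bits in, 40 bits out

print(printable, f"{uu_growth:.0%}", f"{atob_growth:.0%}")  # 94 33% 25%
```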

The trade off in the modulo 85 encoding is that many communications
systems do not reliably transmit 85 ASCII characters.  The tilde, caret,
brackets, and sometimes upper or lower case frequently get corrupted.

There are two other popular encoding techniques.  One is BinHex used on
Apple Computers. The current version is BinHex 4.0.  BinHex uses another
mapping into 64 characters.  The first encoded line in a BinHex file is
an encoded structure that contains the file name, size, checksum, date
and time.  The remaining lines are encoded data.

The other technique that I have seen is BinMail used on Unisys A-series.


COMPOSING A LINE OF ENCODED CHARACTERS:

A small number of eight bit characters are encoded into a single line and a
count is put at the start of the line.  (Most lines in an encoded file
hold 45 input characters, which encode to 60 characters after the count.
When you look at a UU-encoded file note that most lines start with the
letter "M".  "M" is decimal 77 which, minus the 32 bias, is 45.)
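
A line can be composed as in the following Python sketch.  It is an
illustration only: the names uu_char and uu_encode_line are invented, and
padding a short final group with zero bytes is an assumption (it is the
usual practice, but is not spelled out above).

```python
def uu_char(value: int) -> str:
    """Add the bias of 32; a space (value 0) is sent as a back-quote."""
    return "`" if value == 0 else chr(value + 32)


def uu_encode_line(chunk: bytes) -> str:
    """Encode up to 45 input characters as one line with a leading count."""
    if len(chunk) > 45:
        raise ValueError("at most 45 input characters per line")
    # Assumption: a short final group is padded with zero bytes.
    padded = chunk + b"\0" * (-len(chunk) % 3)
    out = [uu_char(len(chunk))]            # the count character
    for i in range(0, len(padded), 3):
        bits = int.from_bytes(padded[i:i + 3], "big")
        out.extend(uu_char((bits >> s) & 0x3F) for s in (18, 12, 6, 0))
    return "".join(out)


print(uu_encode_line(b"Cat"))              # #0V%T
print(uu_encode_line(bytes(45))[0])        # M
```

A full line is therefore 61 characters long: the "M" plus 60 encoded
characters.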

BinHex does not use this count character; every encoded line contains 64
characters, except the last, which is limited by the size obtained from
the first line.

This encode program optionally puts a check character at the end of each
line. The check is the sum of all the encoded characters, before adding
the mapping, modulo 64.

Note: Horton 9/1/87 UUENCODE has a bug in the line check algorithm; it
uses the sum of the original, not the encoded characters.  This decode
program accepts either form of line check character.
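
The two forms of line check can be sketched in Python as follows.  This
is a sketch under stated assumptions, not the actual code: the function
names are invented, the count character is assumed to be left out of the
sum, and a short final group is assumed to be padded with zero bytes.

```python
def six_bit_values(chunk: bytes) -> list:
    """Split the input into six-bit values (before the bias is added)."""
    padded = chunk + b"\0" * (-len(chunk) % 3)   # assumption: zero padding
    values = []
    for i in range(0, len(padded), 3):
        bits = int.from_bytes(padded[i:i + 3], "big")
        values.extend((bits >> s) & 0x3F for s in (18, 12, 6, 0))
    return values


def line_check_encoded(chunk: bytes) -> int:
    """Sum of the encoded six-bit values, modulo 64."""
    return sum(six_bit_values(chunk)) % 64


def line_check_original(chunk: bytes) -> int:
    """Horton-style variant: sum of the original characters, modulo 64."""
    return sum(chunk) % 64


print(line_check_encoded(b"Cat"), line_check_original(b"Cat"))  # 63 24
```

A decoder that wants to accept either form can compute both values and
compare the received check character against each.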

In previous versions (4.13 and lower) the line check character was
generated by default by this encode program and was suppressed with the
"-L" option.  One reason to suppress them is if they will be decoded by
one of the old Horton decoders.  Most decoders either accept this form of
check or simply stop looking after the line length is exhausted.  My
feelings are mixed about the line checksums because errors of this type
essentially never occur.

Given modern, error-free communications systems and the CRC checks on the
entire file (see below) I have made the default for uuencoding to have NO
line level check characters effective version 4.21.  The "-L" option on
uuencode turns on generation of line checksums.  If you have a really bad
communications system and you want to isolate a problem, turn them on.

Uudecode automatically checks for the presence of line checksums, so the
default for uudecode is to leave line level checks on.  If there are
problems, the "-L" option for uudecode turns them off.  Sometimes there is
junk at the end of the line which causes spurious line checksum errors.

I have encountered various other ways that encoders end lines.  One
encoder put an "M" at both the start and end of the line.  Another used a
line count character.  This decode program checks all of these.  I would
not be surprised if some encoder out there ends lines with sequential
astrological symbols.  If you encounter some other weird form of encoded
file, let me know. (The -L option turns line level checking off.)


PACKAGING THE LINES INTO FILES:

The lines of encoded data can be preceded by comments and by network
addressing information.  The encoded data is directly preceded by a line
containing:

             begin <file-mode> <file-name>

This line is created by the encoding program.  The decode program scans
the file looking for "begin" in column 1.  The line following it is the
first line of encoded data.

Some encoders put file time and date information on the begin line:

             begin <file-mode> <file-name> <date> <time>

My UUdecoder will accept this form of begin line, but does not use the
time and date information.

The final end of encoded data is an encoded line with zero encoded
characters (a back-quote), followed by a line containing "end".
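
The packaging just described (skip comments, find "begin" in column 1,
decode lines until the zero length line, then expect "end") can be
sketched in Python.  This is a minimal illustration, not this decoder:
the names six_bit and uu_decode_text are invented, file names are assumed
to contain no blanks, anything after the file name on the begin line is
ignored, and no checksums or section lines are handled.

```python
def six_bit(ch: str) -> int:
    # Remove the bias of 32; the mask maps both space and back-quote to 0.
    return (ord(ch) - 32) & 0x3F


def uu_decode_text(text: str):
    """Return (mode, name, data) for the first encoded file in text."""
    lines = iter(text.splitlines())
    for line in lines:
        fields = line.split()
        # "begin" must start in column 1, followed by an octal mode and a name.
        if (line.startswith("begin ") and len(fields) >= 3
                and all(c in "01234567" for c in fields[1])):
            mode, name = int(fields[1], 8), fields[2]
            break
    else:
        raise ValueError("no begin line found")

    data = bytearray()
    for line in lines:
        count = six_bit(line[0]) if line else 0
        if count == 0:                       # zero length line ends the data
            break
        needed = (count + 2) // 3 * 4        # encoded characters expected
        body = line[1:1 + needed].ljust(needed, "`")
        chunk = bytearray()
        for i in range(0, needed, 4):
            a, b, c, d = (six_bit(ch) for ch in body[i:i + 4])
            chunk += ((a << 18) | (b << 12) | (c << 6) | d).to_bytes(3, "big")
        data += chunk[:count]
    if (next(lines, "") or "").strip() != "end":
        raise ValueError("missing end line")
    return mode, name, bytes(data)


sample = "some mail header\nbegin 644 cat.txt\n#0V%T\n`\nend\n"
print(uu_decode_text(sample))  # (420, 'cat.txt', b'Cat')
```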

For integrity checking, some encode programs insert checksums for the
entire file.  This decode tries to check for all known types of file
checksums.  This is discussed in more detail below.

This encode program puts a header line, containing the section number and
file name, in front of every section:

         "section <number> of uuencode of file <file name>"

At the end of a section the encode program inserts a line containing
checksum and file size information.  This can be suppressed with the "-c"
option.

Other encoders use a variety of section lines:

        The format of the Archive-name line is:
                "Archive-name: <name>/part<number>"
        for example:
                Archive-name: diskutl/part02

        The format of the part line is:
                <name> part<number>/<max-number>
        for example:
                diskutl part02/03

        WinCode uses:
                [ section: 1/2 file: diskutl.exe . . . .

        enuu uses:
                section 001/002  diskutl.exe  . . . .

This program checks for consistency of these names and numbers as of
release 5.0.  The problem is distinguishing random text from valid lines.

For each line that uudecode thinks is a "section" line, tests are made to
validate the current section number, the maximum section number and the
file name.  The program is conservative and may sometimes erroneously
give an "invalid section line" type of error.  Inspect the file; if
uudecode made a mistake, edit or delete the indicated line; and continue.
If the problem appears to be a uudecode problem, not just some random
comment lines that caused a one time problem, please contact me.

All the "integrity fields" (the checksum, the line check, and the section
header line) are inserted in a way that they will be ignored by other
UUDECODE programs that cannot handle them.  This decode program does not
require any of these fields; if not present, integrity checking is not
done.  This program pair is 100% downward compatible!


FILE NAMES:

See UUSER.DOC for a discussion of file naming conventions.



DECODE and VALID LINES:

The below information is to help you solve infrequent problems when
decoding files.  Normally you do not have to be concerned with any of
this stuff.

UUdecode sometimes gets confused and thinks header lines are encoded data.
Sometimes this is because the separator line between sections (the "cut"
line) is indistinguishable from valid decodable data.  (An example is the
line "---" used as a cut line on several DOS BBS systems.)  You can tell
UUdecode that a specific line is a cut line and not a decodable line with
the -Z option:

    uudecode -Z "---" myfile

Other times there is not a cut line between file sections or there is
some other problem with the file.  If so, edit the file and try again.

When decode encounters a premature end-of-file or some data which is not
decodable, it assumes the end of a file section.  Decode is conservative
when it encounters data it cannot decode (better an error than a bad file).

Usually this undecodable data is valid "trailer" data put at the end of
file for data transmission purposes.  However the file may be bad.  So
decode continues to scan the file; if it then encounters a line which is
decodable, it assumes the file is bad.

When decode encounters a valid end of file section it must get the next
file in sequence.  If the file name ends with a number, decode tries the
next file name in numeric sequence.  Otherwise decode asks for a file
name.  If this file does not contain decodable data, decode asks for
another file to try.
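
Stepping to the next file name in numeric sequence might look like the
following Python sketch.  The name next_section_name is invented, and the
details (keeping the number of digits, stepping the number just in front
of the extension) are assumptions made for the example rather than a
description of what decode does.

```python
import re


def next_section_name(name: str):
    """Return the next name in numeric sequence, or None if name has no number."""
    # Prefix, trailing digits, and an optional extension such as ".UUE".
    match = re.match(r"^(.*?)(\d+)(\.[^.\\/]*)?$", name)
    if not match:
        return None                       # caller has to ask for a file name
    prefix, digits, ext = match.group(1), match.group(2), match.group(3) or ""
    # Keep leading zeros so that PRG09 is followed by PRG10.
    bumped = str(int(digits) + 1).zfill(len(digits))
    return prefix + bumped + ext


print(next_section_name("PRG1.UUE"))   # PRG2.UUE
print(next_section_name("PRG.UUE"))    # None
```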

If multiple sections are saved in a single file, each section must have
some type of section line for validation.  Decode builds a table of
section information so it can go back and reread if sections are not
saved in order.

The "SECTION" line inserted by the UUENCODE program is used for validity
checking only.  If not present, decode will accept any file containing
encoded lines.


OTHER FILE FORMS:

Sometimes files are wrapped in shell archives that automatically check
sequencing and call uudecode for you on Unix systems.  If you prefer to
download the raw files to MS-DOS, uudecode 5.32 will filter simple shell
scripts, that use the Unix 'sed' command, and decode the data
automatically.

There is one more rarely used feature of ENCODE: many input files can be
encoded into one large encode file.  (I have never seen this used.) The
end of an input file is a zero length encoded line, followed by another
"begin" line instead of by an "end" line.  This decode program will
decode this sort of file; but the encode will only handle a single
input file.


FILE LEVEL CHECKSUMS:

There are three types of file checksums found in encoded files:

       UUENCODE 2.14 and below inserted lines that gave the section
       size and the original input file size.  This is supplanted
       by a better technique in 3.07; but 3.07 UUDECODE still checks
       and validates the old form.

       UUENCODE 3.07 and Rahul Dhesi's encode scripts compute a Unix
       "sum -r" on the encoded sections and on the original input file.
       A difference is that UUENCODE 3.07 puts the expected "sum -r"
       and size at the end of a file while Rahul's scripts put them at
       the beginning.  This UUDECODE analyzes either.

       The third form of checksum is a full 32 bit CRC that Rahul's
       script inserts.  My code does not handle this.  Rahul has written
       the BRIK program to check them.  If there is a "sum -r" failure,
       BRIK analysis should be considered.
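
For reference, the checksum printed by Unix "sum -r" is the 16 bit
rotating sum sketched below in Python.  The function name sum_r is
invented.  Exactly which bytes the encoders feed into it (for example,
how line ends in the encoded sections are counted) is not covered here
and would have to be matched to the encoder in question.

```python
def sum_r(data: bytes) -> int:
    """16-bit rotating checksum as printed by Unix 'sum -r'."""
    checksum = 0
    for byte in data:
        # Rotate the 16-bit value right by one bit, then add the next byte.
        checksum = (checksum >> 1) | ((checksum & 1) << 15)
        checksum = (checksum + byte) & 0xFFFF
    return checksum


print(sum_r(b"ab"))  # 32914
```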

Several encoders put in a line containing just the original file size.
My uudecode checks these.


TABLE LINES:

Some encoded files put the mapping used at the front of the encoded file,
just in front of the "begin" line.  The format for this is:

                   table
                   first 32 characters
                   second 32 characters

All this starts in column 1.

If decode encounters a table specification, it uses it and overrides any
command line parameters.  Encode will create the table lines if run with
the "-t" parameter.
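
Reading a table specification can be sketched in Python as follows.  The
name read_table is invented, and the sanity check (64 distinct
characters) is an assumption about what a reasonable decoder would
verify, not a statement of what this one does.

```python
def read_table(lines):
    """Return the 64-character mapping from a table spec, or None."""
    for i, line in enumerate(lines):
        if line.rstrip() == "table" and i + 2 < len(lines):
            # The next two lines carry 32 mapping characters each.
            table = lines[i + 1][:32] + lines[i + 2][:32]
            if len(table) == 64 and len(set(table)) == 64:
                return table
    return None


xx = "+-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
text = ["some comment", "table", xx[:32], xx[32:], "begin 644 prog.zip"]
table = read_table(text)
decode_map = {ch: value for value, ch in enumerate(table)}
print(decode_map["E"], decode_map["q"])  # 16 54
```

With the mapping in hand, decoding a character is a table lookup instead
of subtracting the bias of 32.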


COMPLETION CODES:

On successful completion, UUDECODE sets ERRORLEVEL to 0.  If there are
any problems, ERRORLEVEL is set to non-zero.

The purpose of "-e" is to automatically run an un-archiver (like PKZIP or
ARJ) when UUDECODE successfully completes.  If the "-e" option is given,
UUDECODE calls BAT file UNARCUUE on successful completion; UNARCUUE is
passed five parameters:

                the filename decoded (with path but no extension),
                the file extension,
                the input file name  (with path but no extension),
                the input file extension that is used,
                the number of sections.

Normally the file extension tells which un-archiver to call.
The UNARCUUE BAT file can test these parameters and call the necessary
un-archiver.  If UNARCUUE is called, the return code from UUDECODE is the
return code passed back from UNARCUUE.  Note: one user had a problem in
that the routines called by UNARCUUE set the errorlevel to 1 which was
passed back to be the return code from UUDECODE.

The "-E" (upper case) option is like "-e" but you can supply the name of
the file to execute.


BUGS and PROBLEMS:

I try to make this program as good as possible.  If you find a problem,
please send me a diskette.  You can mail a 3 1/2" diskette in a regular
envelope, with no special protection, with a single stamp.


CONCLUSION:

This works well for me.  On UNIX I find a program I want in 3 sections:
              PRG1, PRG2, PRG3.
I copy the three files down to my PC as PRG1.UUE, PRG2.UUE, and PRG3.UUE.  I
then just enter UUDECODE PRG and the thing decodes.


Done privately and not for profit (freeware).  Suggestions appreciated.
The programs are written in Turbo Pascal 5.5 with about 5% TASM for speed.
The source is not public domain.  If included in your for-profit
product, please contact me.


Richard Marks
931 Sulgrave Lane
Bryn Mawr, PA 19010

Copyright Richard E. Marks, Bryn Mawr, PA, 1992, 1993, 1994

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Change log (started with 5.13):
5.32
Analysis of size lines used by other encoders. Analysis of uupost's
END lines.
Better 'sed' handling to work with Sun posting routines
Validation of file name on 'begin' line

5.28
Split to .001, .002.
Better analysis of begin line - line following must be valid encoded data.
Improved (again) part line analysis.
Better file name parameter analysis.

5.25
Fix memory overflow bug in 5.24.

5.24
Ignore pad characters inserted by some comm systems.

5.22 & .23
Improve analysis of "part" lines to accept form used in bin.pictures
group and by mail server on Garbo.
Fixes problem with encoded files that use blanks rather than backquotes.

5.20
Z command line option to specify a "cut" line between multiple sections.
Needed if cut line is a valid decodable line (of low probability) which
the user chooses to be interpreted as a "cut" line.  Plus other
improvements in detecting end of section.

5.16
Encode will split to a minimum of 75 (was 150) lines.
Passes more info to UNARCUUE

5.15
Fixes a problem with trailing blanks on lines.

5.14:
Fixes a minor bug in which a redundant error message was produced when
decoding single section files.


5.13 VERSUS 5.10:

5.13 decode has a command line option that disables all interactive
responses to make it more useable from some BBS systems.  Examine the
"y" and "Y" options.

5.13 can increment the number on files up to five digits.  The prior
limit was two digits.  You can now save files with names based on news
article numbers.

5.13 can decode files encoded into 100 or more parts.  A restriction is
that if there are more than 100 parts, the sections MUST be in order.
