=============================================================================
Section 4: Making backups...
=============================================================================

From: rickf@pmafire.inel.gov (Rick Furniss)
Organization: WINCO

Murphy's law #??: preventive maintenance doesn't.

try this one:   /etc/dump /dev/rmt/0m /dev/dsk/0s1
          Or:   tar cvf /dev/root /dev/rmt0

Backup commands can be among the most dangerous commands on Unix, even
though their whole purpose is to prevent problems rather than cause them.
If any Unix utility were a candidate for a warning message or error
checking, this would be it.

Just in case you didn't catch the HORROR above, the parameters are backwards,
causing a TOTAL wipeout of the root file system.

More systems have been wiped out by admins than any hacker could manage in
a lifetime.

-----------------------------------------------------------------------------

From: grant@unisys.co.nz (Grant McLean)
Organization: Unisys New Zealand

One of my customers (who shall remain nameless) was having a problem with
insufficient swap space.  I recommended that he back up the system, boot
off the OS tape, repartition the disk, remake the filesystems and restore
the data (any idiot could do this, right? :-) ).  I also suggested that if
he wasn't confident of achieving all this, we could provide a skilled
person for a modest fee.  Of course he was fully confident so I left him
to it.

Next day I get a call from the guy to say he'd been there all night and
he'd had all sorts of funny messages when restoring from tape.

Eventually we tracked his problem down to the backup script he'd been
using.  It was a simple one liner:

  find / -print | cpio -oc | dd obs=100k of=/dev/rmt0 2>/dev/null

This was a problem because:

  1) His system had two 300MB drives
  2) He only had a 150MB tape drive
  3) The same script was being run every night by a cron job
  4) All his backups were created by this script

(In case you haven't worked it out: the dd is there to speed up writes to
tape, but it has the unfortunate side effect that cpio never finds out about
the end of tape.  Because the errors were going to the bit bucket, he
never knew his backups were incomplete until he came to restore from
them.)

I would have loved to be a fly on the wall when he explained to his boss
that the data was gone and there was no way of getting it back.
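
One way to catch this kind of silent truncation is to read the archive back
after every run and compare it with the source tree.  What follows is only a
minimal sketch, not the script from the story: it assumes tar rather than
cpio, file names without embedded newlines, and an archive made with
"tar -cf ARCHIVE -C SRCDIR .".  The function name verify_backup is made up
for illustration.

```shell
#!/bin/sh
# Sketch only: verify_backup is a hypothetical helper, not a standard tool.
# Assumes a tar archive created with: tar -cf ARCHIVE -C SRCDIR .

# verify_backup ARCHIVE SRCDIR
# Succeeds only if tar can read ARCHIVE to the end and the archive holds as
# many entries as find sees under SRCDIR.
verify_backup() {
    archive=$1
    srcdir=$2

    # Number of files and directories that should be in the archive.
    expected=$( (cd "$srcdir" && find . -print) | wc -l | tr -d ' ')

    # stderr is deliberately NOT sent to /dev/null: a truncated archive
    # announces itself there.
    if ! listing=$(tar -tf "$archive"); then
        echo "verify_backup: cannot read $archive to the end" >&2
        return 1
    fi
    actual=$(printf '%s\n' "$listing" | wc -l | tr -d ' ')

    if [ "$actual" -ne "$expected" ]; then
        echo "verify_backup: $archive has $actual entries, expected $expected" >&2
        return 1
    fi
    echo "verify_backup: $archive looks complete ($actual entries)"
}
```

Run from the same cron job right after the backup, a check like this turns
"the tape was too small" into a nightly complaint instead of a surprise at
restore time.  On a busy filesystem the two counts can differ slightly
because files come and go between the passes.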

-----------------------------------------------------------------------------

From: ravi@usv.com (Ravi Ramachandran)

Live 24-hour online system.  Does backup over the ethernet to a SCSI tape.
Unfortunately, no SCSI on this system to recover if root/ethernet dies.
This was a Compaq Systempro running SCO Unix.  Slated a downtime of 4-6am.
I thought that it would take me only 30 minutes, as I had installed a
similar (Adaptec) SCSI board on similar hardware running SCO.  Only difference
was that this machine was running MPX (multiprocessor extension) and you had
to deinstall it, install the SCSI, and then reinstall MPX (proper procedure).
I had made all my slot/IRQ charts the previous day, and so got busy removing
MPX.  Then said "mkdev tape", went through the IDs, and was almost at home
base.  Then... "link kit not installed, use floppy X1" when I tried to remake
the kernel.  For some reason, when I removed the multiprocessor extension,
the single processor files were not moved to their right location.  And if
I reinstalled the single, all my changes would be lost.  Finally, restored the
OS (from backup) on the remote machine, and then rcp-ed the files over to
bring back the MPX version.  Unfortunately, rcp does not maintain the dates,
permissions, etc.  Got a limping version of the machine back on-line about
45 minutes after its slated time, and spent the rest of the day fixing
vagrant files.  The next week, I moved the online programs to another machine
(a headache), and reinstalled this machine from scratch.

-----------------------------------------------------------------------------

From: keith@ksmith.uucp (Keith Smith)
Organization: Keith's Computer, Hope Mills, NC

My dumbest move ever.  Client in Charlotte, NC (3 hours + away) has
Xenix box with like 15 users running single app.  They have a tape
backup of course.  Anyway they ran slam out of space on the 70MB disk
drive so I upgraded them from an MFM to a SCSI 150MB disk.  Restored
their app & data files, and they were off and running.  Anyway they did
an application directories backup (tar) on a daily basis and backed the
rest of the system up with tar on Monday morning.

Being a nice guy I built a menu system and installed the backups on the
menu so they could do it with a push of the button.  Swell,  It's Monday.
Call if anything else comes up.  1 week later I get a call.  Console is
scrolling messages, App seems to be missing yesterday's orders, etc.
Call in, and cannot log in.  'w' doesn't work.  Crazy stuff.  Really
strange.

Grab old drive/controller, fly to Charlotte replace drive, install
app backup tape.  They re-key missing stuff, etc.  Bring new disk back.
Won't boot, won't do anything.  Boot emergency floppy set.  Looking
around.  Can't figure but have backup tape from that morning that
"completed successfully".  tar tvf /dev/rct0.  Hmm, why all these
files look very OLD.  Uh, Where, Uh.  Look at menu command for the
"backup" is 'tar xvf /dev/rct0 /'

Anyway, I owned up to the mistake, re-loaded the SCSI drivers and
changed the command to 'tar cvf ..'

Hehehe,  Now I DOUBLE check what I put on a menu, and try not to be in a
*HURRY* when I do this stuff.

-----------------------------------------------------------------------------

From: mike@pacsoft.com (Mike Stefanik)
Organization: Pacific Software Group, Riverside, CA

One of the more interesting problems that I ran into was a customer that
was having problems with their SCSI tape drive on a XENIX box. Around midnight,
every night, the system would automatically back up and verify their data. One
day, the customer needed to restore some data files from the last night's
backup. She called because, although the restore worked just fine, she didn't
see the busy light on the drive come on, and it didn't sound like the tape was
moving. I dialed up the system, had her put a tape in and did a retension --
the drive started winding the tape back and forth, and we both concluded that
she was mistaken. After all, the tape was retensioning, and she wasn't getting
any backup or verify errors at all. I just chalked this one up to user
confusion.

A few days later, she called back saying that there really is something wrong
with the tape. She needed to restore some data from a few days ago, and like
before, the busy light on the drive didn't come on, but files did restore.
However when she started the application program, the data hadn't changed. I
dialed up the system again, and just on a fluke, issued a "df" -- it showed
their rather large root filesystem to be nearly full. Confused, I did a "find",
searching for files over 1MB. Of course, what I found was this huge file named
/dev/rct0. As I later discovered, their system had crashed a few weeks ago,
and she had simply answered "yes" to a bunch of questions that it asked when
she brought it back up. The /dev/rct0 device was removed (but /dev/xct0 was
still there, which allowed me to retension the tape) and the backup script
never checked to make sure that it was actually writing to a character device.

Needless to say, I modified the backup program to make sure that it was really
writing to a device, and I made her promise to call me whenever the system
crashed or asked "funny questions" when it was booting.
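
The check described above can be sketched in a few lines of shell.  This is
an illustration under assumptions, not the actual modification made to that
customer's backup program: the function name require_char_device is made up,
and the device path in the usage comment is simply the one from the story.

```shell
#!/bin/sh
# Sketch only: require_char_device is a hypothetical helper.

# require_char_device PATH
# Succeeds only if PATH exists and is a character special file, so a backup
# cannot quietly create (or keep appending to) a plain file under /dev.
require_char_device() {
    dev=$1
    if [ ! -e "$dev" ]; then
        echo "backup: $dev does not exist; refusing to create it as a file" >&2
        return 1
    fi
    if [ ! -c "$dev" ]; then
        echo "backup: $dev is not a character device; refusing to write" >&2
        return 1
    fi
    return 0
}

# Example use in a backup script (device name taken from the story):
#   require_char_device /dev/rct0 || exit 1
#   tar cvf /dev/rct0 /
```

The same test would also have flagged a mistyped tape device name, where tar
happily creates a large regular file in /dev and the root filesystem fills
up.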

------------------------------------------------------------------------------

From: Nick Sayer 

And then there was the time the / disk was full but nobody knew where
the space was going. 'Course this was on an Ultrix box and everyone's
used to using Suns, so they were tarring to /dev/rst*. Sure enough,
/dev/rst8 was a 20M file in a 25M partition.

=============================================================================
Section 5: Blaming it on the hardware...
=============================================================================

From: kelley@epg.nist.gov (Mike Kelley)
Organization: NIST

We have a cluster of HP workstations and, once upon a time, were using
1/4-inch tape as the backup medium.  This was very slow and cumbersome, as
we were forever increasing the amount of disk space on our system, and
we decided to purchase HP's optical jukebox to use both as large
removable media and as the primary backup device.

We had been experiencing occasional problems with the 1/4-inch tape
backups, but HP's hardware service engineer convinced us that the
problems were resolved.  A complete backup was performed prior to
installation (by the HP engineer) of the jukebox.  Two unfortunate
things happened.  First, the problems on our backup tapes were due to
intermittent hardware problems on the tape drive which were not
discovered by the extensive diagnostics performed on the tape drive.
Second, the engineer installed the jukebox with the same hardware SCSI
address as our root file system.

As you may have anticipated, the attempt to mediainit the first
optical cartridge resulted in a rather ungraceful failure of the root
file system.  This was compounded by the fact that much of the data on
the backup tapes was not recoverable.

-----------------------------------------------------------------------------

From: robjohn@ocdis01.UUCP (Contractor Bob Johnson)
Organization: Tinker Air Force Base, Oklahoma

We had an operator lay a book on the console keyboard, throwing the console
into system monitor mode.  This stops the system clock, which locks every
session dead in its tracks. At that time we had over 100 user sessions
running.  Most of our inbound lines are essentially modem lines on a very
large "rotor".  After their session hung for a minute or so, many users
disconnected and called back.  They got connected, but received no login
prompt (the system was in a sort of suspended animation).  Little did they
know that they were now on a different port than the one they just abandoned.

A call to the computer room soon identified the problem, and the operator was
given the commands to resume normal system operation.  As near as we can
figure, somewhere around half of the users had disconnected but the system
didn't notice because it never saw carrier drop on those ports (being dead).
New, different users had now connected to those ports.  We received several
semi-confused user calls, realized what had happened and invoked the magic
"/etc/shutdown NOW" command.  The procedure (should this ever happen again)
will be to manually panic the system and reboot.  I also surgically removed
the keycap from that particular key on our terminal - you have to work to
press it now!

-----------------------------------------------------------------------------

From: stehman%citron.cs.clemson.edu@hubcap.clemson.edu (Jeff Stehman)
Organization: Clemson University

Many years ago a tiny little college in the middle of nowhere purchased an
NCR tower, then a newfangled contraption.  A half-dozen of us were using it
for an assembly class.  The prof should have made his warnings about TRAP a
little more clear.  One student ran his program and it suddenly began
spawning processes, rapidly filling the machine.  The prof came in, amused,
logged on as superuser, and killed a process.  Another process was
immediately spawned.  The prof tried again.  He was ignored.  He was also no
longer amused.  After several minutes he gave up and turned off the box.
The tower didn't even flinch.  He pulled the plug.  Nothing.  He ripped the
back off the box and dug around.  Finally he found the fuse and pulled it,
killing the machine.  Some of us later claimed we heard laughter as it went
down.

Many times since then I have wished other computers came with a backup
battery as standard issue.

-----------------------------------------------------------------------------

From: pinard@IRO.UMontreal.CA (Francois Pinard)
Organization: Universite' de Montre'al

Many things have happened in the many years I've been with computers.
The most horrifying story I've seen is not UNIX related, but it is
certainly worth telling.  Here it goes.

This big (:-) CDC 6600 system was bootable from tape drive 0, using
those 12-inch reels of 1/2" tape.  The *whole* system was
reloaded anew from the tape each time we restarted the machine,
because there was no permanent file system yet; the disks were not
meant to retain files through computer restarts (unbelievable today, I
know :-).  The deadstart tapes (as they were called) were quite
valuable, and we were keeping at least a dozen backups of those, going
back maybe one or two years in development.

The problem was that the two vacuum capstans which were driving the
tape on drive 0, near the magnetic heads, were not perfectly synchronized,
due to a hardware misadjustment.  So they were stretching the tape while
they were reading it, wearing it in a way invisible to the eye, but
nevertheless making the tape unrecoverable.  Besides that, everything
looked normal in the physical and electrical operation of the tape.  Of
course, nobody knew about this problem when it suddenly appeared.

All this happened while the whole system administration team went on
vacation at the same time.  Not being a traveler, I just stayed
available `on call'.  The knowledgeable operators were able to solve many
situations, and being kind to me (as I was to them :-), they would
not disturb me just for a non-working deadstart tape.  Further, they
had a full list of all deadstart backup tapes.  So, they first tried
(and destroyed) half a dozen backups before turning the machine over to the
hardware guys, who destroyed a few more themselves.

The technicians had their own systems for diagnostics, all bootable
from tape drive 0, of course.  They had far fewer backups than we did.
They destroyed almost all of them before calling me in.  Once told what
had happened, my only suggestion was to alter the deadstart sequence so as
to be able to boot from another tape drive.  Strangely enough, nobody
had thought of it yet.  In those old times, software guys were always
suspecting hardware, and vice versa :-).

Happily enough, the few tapes left started, both for production and
for the technicians.  Tape drive 0 being quite suspect, the
technicians finally discovered the problem and repaired it.  My only
job left was to upgrade the system from almost one year back, before
turning it over to operations.  This was at the time, now seemingly lost,
when system teams were heavily modifying their operating system
sources.  This was also the time when everything not on big tapes was
all on punched Hollerith cards, the only interactive device being the
system console.  It took me many days, alone, having the machine in
standalone mode.  The crowd of users stopped regularly at the windows
of the computer room, taking bets, as they were used to doing, on how
fast I would get the machine back up (I got some of my supporters
losing their money, this time :-).

This was quite hard work for me, done under high pressure.  When the
remainder of the staff returned from their trips, and when I told them the
whole tale, we decided to never synchronize our holidays again.

-----------------------------------------------------------------------------

From: ravi@usv.com (Ravi Ramachandran)

At one time, there were three of us working on a unique Motorola-based
SVR3.2 machine, on an R&D project.  I took care of all the SysAdmin tasks,
I had a backup administrator, and the third person had been stuck into
my group (company politics).  The group project files were in /user and
the individual ones in /user2.  We had managed to get backup from the
operations department for /user only (not even /; security paranoia?).
Anyway, I had another SCSI hard disk that I used for making a disk copy
of the primary SCSI hard disk every Friday.  This disk was connected, but
not mounted, so that I could do the disk backup from my desk when I wanted
to.  This machine used to sometimes get a SCSI error such that you could
not log in, but the processes already running on the machine were not
affected.  If you were logged in on the console, you just powered off the
machine for a few minutes and rebooted it.  Around holiday time the other
Admin was off on a long vacation.  I had taken Monday off, and headed off
for a four-day weekend.  The machine does the same blurp.  The third person
decides to power off the machine & turn it back on immediately.  It does
not come up properly.  She decides to reinstall the machine using the
installation tape that I had unfortunately left in the open.  Reformats the
hard disk, installs the base system, and is stuck at that point when I come
back in on Tuesday.  I almost blow a blood vessel but try to keep calm
'cause I had made a disk copy about 10 days before (too anxious to get on
my holiday the previous week).  Try to mount the disk... hit vacuum.  Try
using dd to look at the disk... Seemed to be a large /dev/null :-?  When the
lady decided to reinstall the system, it asked her what SCSI disks she
wanted to reformat, and she said "y" for both 0 & 1!!  All my
sample/trial&error work for a year had bitten the dust.
My only (small) consolation was that I was not the only one affected.

-----------------------------------------------------------------------------

From: williams@nssdcs.gsfc.nasa.gov (Jim Williams)
Organization: NASA Goddard Space Flight Center, Greenbelt, Maryland

Story One is about The Sun 3/260 That Froze Solid.  One day a user
reported that the Sun 3/260 he was using was "dead".  On inspection, I
found the Sun at the console prompt and the keyboard totally
unresponsive.  The L1-A sequence did nothing.  So I power cycled it.
Nothing.  A blank screen, no activity.  I was ready to call service,
then decided to try rebooting with the normal/diag switch set to diag.
On looking at the back of the pedestal, I saw that the ethernet cable
had been pressed up against the reset switch!  ARGGGHHHH!  The user
had pushed the machine back just enough to press the switch and keep
it pressed.  (I don't recall if there was a "watchdog reset" message
on the console when I found it, but I was new enough to Suns that that
would not have been a dead giveaway.)

Story Two involved connecting an HP laserjet to a Sun 3/280.  This
sucker just would NOT do flow control correctly.  I put a dumb
terminal in place of the HP and manually typed ^S/^Q sequences to
prove that the serial port really was honoring X-ON/X-OFF.  But for
some reason the ^Ss from the HP didn't "taste right" to the Sun, which
ignored them.  Switching the HP serial port between RS422/RS232 had no
effect.  It eventually turned out to be some sort of flakiness with
the Sun ALM-II board.  Everything worked fine after I moved the
printer to one of the built-in Zilog ports.  Death to flaky hardware...

-----------------------------------------------------------------------------

From: ken@sugra.uucp (Kenneth Ng)
Organization: Private Computer, Totowa, NJ

In article <1992Oct16.152629.29804@nsisrv.gsfc.nasa.gov: williams@nssdcs.gsfc.na
[story about connecting HP LJ to a Sun 3/280 with an ALM-II board deleted]

ARRRGGGHHH!!!! DEATH TO ALM-II BOARDS!  Funny though, I do have an HPLJ-2
hooked up to a SUN 690MP through the ALM-2 boards without problems.  However
I also had Sun going up the wall with myself with an Okidata 320 printer
that would hang the port until we rebooted the machine  (not a nice thing to
do with a dozen stock brokers).  Funny thing is, we had ANOTHER Okidata 320
printer attached to the same Sun on another ALM-2 port, no problem with that
one.  Hm, switch the printers, no change.  Switch the cables, no change.
Switch the ports, no change.  Weird.  Finally discovered it was the DATA that
was being sent.  The printer with problems was a label printer, which was
sending a control-s every 10-20 characters or so to pause the Sun.  Apparently
the Sun ALM-2 drivers can not handle control-s'es too frequently.  No problem,
Sun said, just switch to hardware flow control.  Puzzled me, because my docs
said the ALM boards had no hardware flow control.  But his docs said they
were there.  Took the printer off line, started the lpd, data scope showed the
data going out.  Talked to Sun again, tried RTS-CTS, DTR, 'crtscts' in printcap,
'-crtscts' in printcap.  Tried all kinds of combinations.  Finally he asked me
which ALM-2 port I was using, 13 I responded.  Oh, ALM-2 ports only have the
hardware flow control in the first four ports.  Whoops :-).  Both docs were
true: my docs said there was no hardware flow control, which was right, on
the last 12 ports.  His docs said that there was hw flow control, but he
missed the 'on the first four ports' part.  Now it works, and I hope Sun
now has this better documented.

-------------------------------------------------------------------------------

From: gary@resumix.portal.com (Gary M. Lin)
Organization: Resumix Inc.

My company markets turnkey solutions for resume-processing, so most of our
customers are non-technical HR recruiters.  We contract third-party field
service to a fairly recognizable name in the industry.

I received a call from an irate user who noticed intolerable delays after
some upgrades were done to the customer's branch offices.  His ELC would use
dial-up to establish a link before running software off the server in a
different site.

He attributed the delay to slow dial-up links and software changes, but then
the customer mentioned that quitting WordPerfect and switching to our
application took over an hour.  I asked what the system was doing during that
hour.  He replied the disk was constantly spinning.  Puzzled, I checked his
swap, which was more than sufficient.  Then finally I noticed his ELC booted
with only 4 meg of memory.

Think the field technician swapped their CPU board a month ago and forgot to
move the SIMMs over.  The worst part of it was the customer went on with this
situation for a month before bringing it to our attention!

Moral of the story:  Check that the service guy puts everything back in.

-------------------------------------------------------------------------------

From: greep@Speech.SRI.COM (Steven Tepper)
Organization: SRI International

I once had problems with files that mysteriously refused to stay
changed for very long.  It was a PDP-11 Unix system that had crashed,
and I brought it up single-user.  I would change some file and it
would stay changed for a minute or so but then revert to its earlier
state (contents, protection mode, etc).  What happened was that the
write-protect switch on the disk drive had gotten bumped into the "on"
position but the device driver failed to report any write errors.  As
long as the data stayed in kernel buffers the changes "took", but they
would disappear once the buffers were reused and the system had to
reread the disk.


=============================================================================
Section 6: Partitioning the drives...
=============================================================================

From: hirai@cc.swarthmore.edu (Eiji Hirai)
Organization: Information Services, Swarthmore College, Swarthmore, PA, USA

I wanted to create a second swap partition on another disk and made the
partition start at sector 0 of the disk! (which sounded ok at the time since
all other regular 'a' partitions started on sector 0) Every time I rebooted,
fsck would complain about missing partition tables - I initially suspected
that the disk was bad but I later realized that swapping was overwriting the
partition table.  I had lost an unknown percentage of the financial data for
the institution that I was working for at the time, right when they were
being audited!  Yikes!  Anyway, we were able to recover the data and life
returned to normal but I did wonder at the time whether I could still keep
my job there.

-----------------------------------------------------------------------------

From: matthews@oberon.umd.edu (Mike Matthews)
Organization: /etc/organization

We had just gotten a 1.2G disk drive for our Sun (which direly needed it) so
we felt we'd repartition everything.

All went well, except... on reboot, one of the partitions that was newly
restored from backup got a fsck error.  Fixed it, it rebooted, then another
one got an error.  fscked that one, rebooted it, and doggone it, the first
error was back!

We had a one cylinder overlap.  Sheesh.  At least Ultrix WARNS you of that.

-----------------------------------------------------------------------------

From: mt00@eurotherm.co.uk (Martin Tomes)
Organization: Eurotherm Limited

We had something really weird happen one day.  I copied a file to
/usr/local on someone else's machine and all seemed to be OK.  A bit
later the user of the machine noticed that the files and directories they
were using on another disk partition were corrupted.  There were 2
gigabyte files on a 650MB disk - and lots of them with weird names and
permissions.  At first I did not connect the two events.  This disk
had given trouble when the power failed a week before, so I fsck'ed
it.  Now I have run fsck more times than I can begin to imagine and
seen plenty of errors, some needing 'manual intervention' but I had
never seen anything like this before!  It was spectacular.  And what
was more, when I ran it a second time things got worse.  Then I tried
to backup the /usr/local partition before restoring this corrupt data
and lo, that was corrupt too.  It turned out that our sysadmin had
created the /usr/local disk partition in the wrong place on the disk
and put it over the top of the alternate sectors partition.  By
writing to the /usr/local disk I had written all over the alts which
were mapped into the user's partition.  Oh dear, what a mess.

Solution: rebuild all the partitions so they don't overlap and
restore; also buy the sysadmin a calculator.

Moral: always do your sums on the /etc/partitions file very carefully
before using mkpart.

-----------------------------------------------------------------------------

From: caa@Unify.Com (Chris A. Anderson)
Organization: Unify Corporation, Sacramento, California

At a company that I used to work for, the CEO's brother  was  the
"system  operator".   It was his job to do backups,  maintenance,
etc.  Problem was, he didn't have a clue about Unix.  We were re-
quired to go through him to do anything, though.

Well,   I   was   setting   up   a   Plexus   P-95   to   be    a
news/mail/communications machine and needed to wipe the disks and
install a new OS.  El CEO requested that his brother do  the  in-
stallation  and disk partitioning.  He had done this before, so I
gave him the partition maps and let him at it.  When he was done,
everything  seemed to be ok.  Great, on with the install and set-
up.

Things went fine until I started  compiling  the  news  and  mail
software.   All  of  a sudden, the machine panicked.  I brought it
back up and the root file system was  amazingly  corrupt.   After
rebuilding  things,  it  all seemed to be fine -- diagnostics all
ran fine, etc.  So I started again -- this time keeping an eye on
things.  Sure enough, the root file system became corrupted again
when the system started to load.

This time I brought it down and checked everything.  The problem?
Swap space started at block zero and so did the root file system.
ARRRGGGHHHHH!!

Oh yes, the brother still works there.

-----------------------------------------------------------------------------

From: obi@gumby.ocs.com (Obi Thomas)
Organization: Online Computer Systems, Inc.

I once mistakenly partitioned my Sun's boot disk so that the swap
partition overlapped the usr partition. The machine ran fine for a long
time (many months), presumably because the swap space was always nearly
empty. Then, one day there was a memory parity error and the system crash
dumped at the *end* of the swap partition. What should have been a simple
reboot after the crash dump turned into a long and painful re-install of
the entire system (Suns cannot boot without a /usr partition).

Now when I partition a disk I sit there with a calculator and make sure
all the numbers add up correctly (offsets, number of cylinders, number of
blocks, and so on).
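
The calculator work can also be handed to a few lines of awk.  The sketch
below is an assumption-laden illustration, not a tool from any of these
systems: check_overlaps is a made-up name, and the input format (one
"name start size" line per partition, all in the same unit, such as
cylinders or sectors) is invented for the example.  Partitions that overlap
on purpose, such as the whole-disk 'c' partition on a Sun, should be left
out of the input.

```shell
#!/bin/sh
# Sketch only: check_overlaps is a hypothetical helper.
# Input on stdin: one partition per line as "name start size", with start and
# size in the same unit.  Lines beginning with '#' are ignored.

# Prints every overlapping pair and returns nonzero if any overlap exists.
check_overlaps() {
    awk '
        BEGIN { n = 0 }
        NF >= 3 && $1 !~ /^#/ {
            name[n] = $1
            lo[n] = $2 + 0          # first unit used by the partition
            hi[n] = $2 + $3         # one past the last unit used
            n++
        }
        END {
            bad = 0
            for (i = 0; i < n; i++)
                for (j = i + 1; j < n; j++)
                    # Two half-open ranges overlap if each starts before
                    # the other one ends.
                    if (lo[i] < hi[j] && lo[j] < hi[i]) {
                        printf "OVERLAP: %s [%d,%d) and %s [%d,%d)\n",
                            name[i], lo[i], hi[i], name[j], lo[j], hi[j]
                        bad = 1
                    }
            exit bad
        }
    '
}

# Example: a one-cylinder overlap between b and g.
#   printf '%s\n' 'a 0 50' 'b 50 100' 'g 149 300' | check_overlaps
```

For the example in the comment, the function reports that b, covering
[50,150), collides with g, which starts at 149.  Two partitions that both
start at 0, as in the swap-versus-root stories above, are reported the same
way.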

-----------------------------------------------------------------------------

From: dp@world.std.com (Jeff DelPapa)
Organization: The World Public Access UNIX, Brookline, MA

obi@gumby.ocs.com (Obi Thomas) writes:
[story about overlapping partitions deleted]

I remember a similar thing once - on a Symbolics machine, a customer
declared a file in the FEP filesystem as a paging file, and as part of
the file system (it was one way to solve their disk space crunch).  It
was caught before damage was done - we weren't sure if it was because
they hadn't done anything real yet, or simply the machine knew not to
mess with the IRS (the customer).

-----------------------------------------------------------------------------

From: kevin@sherman.pas.rochester.edu (kevin mcfadden)
Organization: University of Rochester

Me and my co-system admin were in the process of repartitioning a drive
so that we could allocate more space for incoming mail.  We had
just finished backing up our Data directory, from which we were going
to take 10MB.  Next step was to actually repartition it, which
includes formatting.  Anyway, it comes time to give a device name
and we do a df to see which one.  To make a short story long, there
was a /dev/sd2g and a /dev/sd3g, one of which was 300MB of stuff we
could delete and the other was 600MB of applications.  We confused the
two and accidentally formatted the 600MB of applications, which
of course had been backed up......a month ago.  It could have been
worse.

        BUT WAIT!!! It did.  Turns out it took 3 or 4 tries to get
the partition size correct (what the hell is it with telling it
how long it is in hex or whatever?).  It was at this point where
I started to cover my eyes and wander around the building because
we only found out the partition didn't work after spending 3 hours
restoring the applications.  4 * 3 = 12 hours to repartition!
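
A small guard in front of the destructive step makes mixing up two
similar device names a little harder.  This is only a sketch of the idea:
confirm_device is a made-up name, it relies on df printing the device in its
first column, and the command that would follow it is left to the reader
because it differs from system to system.

```shell
#!/bin/sh
# Sketch only: confirm_device is a hypothetical helper.

# confirm_device DEVICE
# Shows what df knows about DEVICE, then succeeds only if the operator
# retypes the device name exactly.
confirm_device() {
    dev=$1
    echo "About to repartition and format: $dev" >&2
    # Print the df line for this device (size, usage, mount point), if any,
    # so a 300MB scratch area can be told apart from 600MB of applications.
    df 2>/dev/null | awk -v d="$dev" '$1 == d' >&2
    printf 'Retype the device name to continue: ' >&2
    read -r answer || return 1
    [ "$answer" = "$dev" ]
}

# Example use (device name taken from the story):
#   confirm_device /dev/sd2g || exit 1
#   ... the destructive command goes here ...
```

Retyping the name, rather than answering "y", forces a second look at the
digit that distinguishes sd2g from sd3g.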

---------------------------------------------------------------------------

From: Nick Sayer 

I had to swap out a 327M disk on a Sun with a 669.  So I partitioned the
669, then newfs'd a /, /usr and /home filesystem on partitions a, g and
h respectively.  I then copied the / and /usr partition from the 327 over
to the 669.

First, I forgot to run installboot on the new boot partition.  Whoops.
Get out the tape and boot miniroot (5 minutes), then mount / and
use installboot.  Fine.  Now it finds /vmunix correctly.

But on the 327, /usr was on the h partition, not g.  So when
I rebooted with the 669 in place, it mounted the home partition
on /usr.  fsck not found, reboot failed.  Well, that's simple, I'll just
edit /etc/fstab and reboot.  But vi is on /usr. And home is mounted
on /usr.  No problem, I'll just mount usr on /mnt or something and
do it that way.  Nope. vi is dynamically linked, and there's no
/usr/lib/ld.so.  Ok, so I'll go back to single user and try it there.
But how to reboot gracefully?  sync, shutdown, reboot... all in /usr,
(mounted on /mnt) and dynamically linked.  So I gave it the vulcan neck
pinch and booted into miniroot (5 minutes).  So miniroot is up.
Fine.  Mount the / partition and use ed on /a/etc/fstab. Panic,
dup ialloc.  The vulcan neck pinch had introduced a slight corruption
in the filesystem.  But how to preen it?  fsck is in /usr, and it's
dynamically linked.  Sigh.

The solution was to mount the usr partition as /usr right on top
of the home partition, run fsck to preen the root partition, reboot,
mount /usr again, then remount / read-write, change /etc/fstab
and reboot again.  So all was ok after an hour of fussing.


=============================================================================
Section 7: Configuring the system...
=============================================================================

From: peter@NeoSoft.com (Peter da Silva)
Organization: NeoSoft Communications Services

Well, we had one system on which you couldn't log in on the console for a
while after rebooting, but it'd start working sometimes.  What was happening
was that the manufacturer had, for some idiot reason, hardcoded the names
of the terminals they wanted to support into getty (this manufacturer's own
terminals, that I can understand, but also a handful of common types like
adm3a) so getty could clear the screen properly (I guess hacking that into
gettydefs was too obvious or something).  If getty couldn't recognise the
terminal type on the command line, it'd display a message on the console
reading "Unknown terminal type pc100".  We ignored this flamage, which was
a pity.  'Cos that was the problem.

It did this *before* opening the terminal, so if it happened to run between
the time rc completed and the getty on the console started the console got
attached to some random terminal somewhere, so when login attempted to open
/dev/tty to prompt for a password it failed.

Moral: always deal with error messages even when you *know* they're bogus.
Moral: never cry wolf.

-----------------------------------------------------------------------------

From: hirai@cc.swarthmore.edu (Eiji Hirai)
Organization: Information Services, Swarthmore College, Swarthmore, PA, USA

rik.harris@fcit.monash.edu.au writes:
> I'll mount it in /tmp

Though this may strike most sane sysadmins as bad practice, SunOS (3.4 or so
- my memory is vague) shipped a command called "on".  If you were logged on
machine A and wanted to execute a command on machine B, you said "on B
command", sort of like rsh.

However, A would mount B's disks under some invocations of "on" and it would
mount it in /tmp!  Of course, lots of folks got bitten by this stupid
command and it was taken out after a long delay by Sun.

Anyone remember the details?  I've blocked out my memory of pre-4.0 SunOS.
Am I just hallucinating?

-----------------------------------------------------------------------------

From: robjohn@ocdis01.UUCP (Contractor Bob Johnson)
Organization: Tinker Air Force Base, Oklahoma

After changing my /etc/inittab file, I was going to kick init by sending
it a HUP signal to tell it the file had changed.  Unfortunately, I missed
and the 1 became a Q... kill -q 1.  Large systems die in interesting ways
when you lose init!


=============================================================================
Section 8: Upgrading the system...
=============================================================================

From: rsj@wa4mei (Randy Jarrett)
Organization: Amateur Radio Gateway WA4MEI, Chamblee, GA

Here's one that will show that you shouldn't work on a system
that you don't thoroughly understand.

At my "previous" employer I was instructed to install a new
(larger) disk drive in a RS/6000 system.  Since a full backup
of the system was done the previous day I just looked at the file
systems via a df to see which were on the drive that I was replacing.
After this I did a tape backup of these filesystems, ran smit and
did a remove of these filesystems.  I then installed the new disk
and brought the system back up.  When I ran smit and was able to
install the new drive and set up the file systems, I figured that
this was going to be an easy one.  WRONG!!  I was
aware that you could expand filesystems under AIX but was not aware
that it would expand them 'across physical drives'!!!  I first
realized that I was in trouble when I went to read in the backup tape
and cpio was not found.  I did an ls of the /usr/bin directory and it
said that the file was there but when I tried to run it it was not
found.  And of course when I went looking for the original install tape
it was not to be found....

-----------------------------------------------------------------------------

From: matthews@oberon.umd.edu (Mike Matthews)
Organization: /etc/organization

When I had first gotten my NeXTstation, it had the lil' 105M hard drive in
it.  I had a 330M external, but alas, no cable for it.  (Life was not fun
when I was essentially netbooting off a "test" machine.... ".. um, guys, did
you just reboot is-next?")

Finally got the cable, just in time for the winter holiday (read: no
network).  Brought the machine home, and I figured I'd just copy the
configuration files over from the internal to the external (as a nice gesture
to my users so they wouldn't have to change their passwords and everything).

The external was a brand new BuildDisk'd disk (had stock NeXTstep on it).
NeXT keeps the private information of each machine (/dev, /etc, stuff like
that) in a /private directory to make netbooting easier.

Hey, I'll just move /private from the 105M to /private on the external.  So I
deleted the external's /private and tried to move it via the workspace.

/dev is in /private.

/dev contains device files.  Can't move them.

BUT.  The workspace happily deleted all the files it DID copy, so the
internal couldn't boot (no /etc) and the external couldn't boot (no /dev).
This is before the advent of boot floppies so I was stuck for about a week at
home with $5000 of NeXT computer that I couldn't boot.

The moral?  *NEVER* move something important.  Copy, VERIFY, and THEN delete.
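
The copy, verify, delete order can be made mechanical.  What follows is a
minimal sketch, not anything from the original post: safe_move is a
hypothetical helper, and it is only meaningful for ordinary files and
directories.  Device nodes such as those in /dev cannot be checked this
way and would need to be recreated instead.

```shell
# safe_move SRC DST
# Copy SRC to DST, verify the copy, and only then delete SRC.
# Hypothetical helper; intended for regular files and directories only.
safe_move() {
    src=$1
    dst=$2
    # Refuse to clobber an existing destination.
    if [ -e "$dst" ]; then
        echo "safe_move: $dst already exists" >&2
        return 1
    fi
    # Step 1: copy, preserving modes and timestamps where possible.
    cp -Rp "$src" "$dst" || return 1
    # Step 2: verify.  diff -r exits non-zero if anything differs or is missing.
    diff -r "$src" "$dst" >/dev/null || {
        echo "safe_move: verification failed, $src left untouched" >&2
        return 1
    }
    # Step 3: delete the original only after the copy checked out.
    rm -rf "$src"
}
```

If the copy or the comparison fails, the source is left alone and the
partial copy stays behind for inspection.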

-----------------------------------------------------------------------------

From: grog@lemis.uucp (Greg Lehey)
Organization: LEMIS, W-6324 Feldatal, Germany

I'm currently trying to work out how ISC Unix/386 handles COFF files, and
discovered the /shlib directory, which I suspected wasn't really used
(*wrong*). So, to try it out, I did:

+ root adagio:/ 819 -> mv shlib slob
+ root adagio:/ 820 -> xterm
+ /usr/bin/X11/xterm: Can not access a needed shared library

So far, so good. So, put it back:

+ root adagio:/ 821 -> mv slob shlib
+ /bin/mv: Can not access a needed shared library

Oops! So, tried it from a different system, but didn't have
permission, so:

+ root adagio:/ 822 -> chmod 777 slob
+ /bin/chmod: Can not access a needed shared library

OK, so let's just cp them across.

+ root adagio:/ 823 -> cd slob
+ root adagio:/slob 824 -> mkdir /shlib
+ /bin/mkdir: Can not access a needed shared library
+ root adagio:/slob 825 ->

Then I wrote a program which just did a link(2) of the directories.
Yes, gcc and ld didn't have any problems, but even after the link was
in place, it still didn't work. I had to reboot (but nothing else),
after which it did work. No idea why that made any difference.

-----------------------------------------------------------------------------

From: erik@src4src.linet.org (Erik VanRiper)
Organization: The Source for Source

I run on a 386/25.  Small system, 4 inbound lines, etc.  I was installing a
new SCSI drive to complement my 2 MFM's.  Took me forever to get everything
just right.  Things finally worked, so I figured I would shutdown and play
with the jumper settings to see what this thing could do.  What did I do?
Well, I just turned off the power, that's all.

erk.  Just rebuilt the kernel, did not do a haltsys, or a shutdown, or anything.
Just shut the power off.  ARGH!  Took me 3 weeks to clean up the mess.

You tend to get in this cycle of "try" "haltsys" "power off" "change jumpers"
"power on" "try".  Well, once everything worked, I guess I was a wee bit
excited and forgot a step.  :-)

-----------------------------------------------------------------------------

From: almquist@chopin.udel.edu (Squish)
Organization: Human Interface Technology Lab (on vacation)

Two miserable flubs:

1) /etc/rc cleans tmp but it wasn't cleaning up directories so I changed
the line:
  (cd /tmp; rm -f - *)
to
  (cd /tmp; rm -f -r - *; rm -f -r - .*)

About 15 minutes later I had wiped out the hard drive.

2) One of the user discs got filled so I needed to move everyone over to
the new disc partition.  So, I used the tar to tar command and flubbed:

cd /user1; tar cf - . | (cd /user1; tar xfBp - )

Next thing I know /user1 is coming up with lots of weird consistency errors and
other such nonsense.  I meant to type /user2 not /user1.  OOOPS!
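
On the first flub: in many shells the pattern .* also matches "..", so
"rm -f -r .*" run in /tmp climbs up into /.  One way to clear a directory,
dot files included, without ever naming "." or ".." is to let find pick
the entries.  This is a sketch only; clean_dir is a hypothetical helper.

```shell
# clean_dir DIR
# Remove everything inside DIR, including dot files and subdirectories,
# while leaving DIR itself (and its parent) alone.  Hypothetical helper.
clean_dir() {
    dir=$1
    # Refuse an empty argument or the root directory outright.
    case $dir in
        ''|/) echo "clean_dir: refusing to clean '$dir'" >&2; return 1 ;;
    esac
    [ -d "$dir" ] || return 1
    # "! -name ." skips the starting point, -prune stops find from
    # descending, so rm only ever sees the top-level entries of DIR.
    (cd "$dir" && find . ! -name . -prune -exec rm -rf {} +)
}
```

In an rc script this would be called as "clean_dir /tmp".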

My moral of the story is when you are doing something BIG, type the command
and reread what you've typed about 100 times to make sure it's sunk in (:
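
Rereading helps, but the tar-to-tar copy can also check itself.  The
sketch below wraps the same pipe in a hypothetical function, copy_tree,
that refuses to run when both arguments resolve to the same directory.
The B flag from the original command is left out here on the assumption
that the local tar does not need it when reading from a pipe.

```shell
# copy_tree SRC DST
# Tar-pipe copy of SRC into DST that refuses to run when both names
# resolve to the same directory.  Hypothetical helper.
copy_tree() {
    # pwd -P resolves symlinks, so two spellings of one directory compare equal.
    src=$(cd "$1" && pwd -P) || return 1
    dst=$(cd "$2" && pwd -P) || return 1
    if [ "$src" = "$dst" ]; then
        echo "copy_tree: source and destination are both $src" >&2
        return 1
    fi
    (cd "$src" && tar cf - .) | (cd "$dst" && tar xpf -)
}
```

With this, "copy_tree /user1 /user1" stops with an error instead of
reading and writing the same files at once.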

-----------------------------------------------------------------------------

From: anne@maxwell.concordia.ca (Anne Bennett)
Organization: Concordia University, Montreal, Canada

After about four months as a Unix sysadm, and still feeling rather like a
novice, I was asked to "upgrade" a Sun lab (3/280 server and ten 3/50
diskless clients) from SunOS 4.0.3 to 4.1 -- of course, this "upgrade" was
actually a complete re-install.

Well, the server had no tape drive, not even any SCSI controller.  There
were no other machines on its subnet other than the clients, so I had no
boothost (at that time, I did not know that the routers could be
reconfigured to pass the appropriate rarp packets, nor do I think our
network people would have taken kindly to such a hack!).  The clients did
have SCSI controllers, but I had no portable tape drive.  Luckily, I had
a portable disk.

So, with great trepidation (remember, I was still a novice), I set up
one of the clients, with the spare disk, to be a boothost.  I booted
the server off the client and read the miniroot from a tape on a remote
machine, and copied it to the server's swap partition.  Then I manually
booted the miniroot on the server by booting off the temporary boothost
with the appropriate options, and specified the server's swap partition
as containing the kernel to be loaded.  Once in the miniroot, I started
up routed to permit me to reach the tapehost, and finally invoked
suninstall.  From then on, it worked like a charm.

Needless to say, I was extremely pleased with myself for figuring all of
this out.  I then settled down to do the "easy stuff", and got around to
configuring NIS (Yellow Pages).  I decided to get rid of everything I
didn't need, under the assumption that a smaller system is easier to
understand and keep track of.  The Sun System and Network Administration
Manual, which is in many ways an admirable tome, had on page 476 a
section on "Preparing Files on NIS Clients", which said:

   "Note that the files networks, protocols, ethers, and services need
    not be present on any NIS clients.  However, if a client will on
    occasion not run NIS, make sure that the above mentioned files do
    have valid data in them."

So I removed them.  Several hours later, when I had finished configuring
the server to my satisfaction, reloading the user files, etc., I finally
got around to booting up the clients.  Well, I *tried* to boot up the
clients, but got the strangest errors: the clients loaded their
kernels and mounted /, but failed trying to mount /usr with the message
"server not responding. RPC: Unknown protocol".  I was mystified. I tried
putting back the generic kernels on server and clients, several different
ifconfig values for the ethernet interfaces, enabling mountd and rexd on
server's inetd.conf, removing the clients' /etc/hostname.le0 (which I had
added)... all to no avail.  'Twas the last work day before the Christmas
break, and I was flummoxed.

Of course, I finally connected the error message "unknown protocol"
with the removed /etc/protocols (and other) files, restored these
files, after which everything was fine again.  I was pretty mad, since
I had wasted a whole day on this problem, but *technically*, the Sun
manual above is correct.

It just neglected to mention that of course, *no* machine is running
NIS at boot time, therefore *every* machine needs valid data in the
networks, services, protocols, and ethers files *at boot time*. Grrr!
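
A quick test before booting the clients might look like the sketch below.
check_boot_files is a hypothetical helper, and "present and non-empty" is
only a rough stand-in for the "valid data" the manual asks for.

```shell
# check_boot_files [ETC_DIR]
# Report any of the four files that is missing or empty.
# ETC_DIR defaults to /etc.  Hypothetical helper.
check_boot_files() {
    etc=${1:-/etc}
    status=0
    for f in networks protocols ethers services; do
        # -s is true only for a file that exists and has non-zero size.
        if [ ! -s "$etc/$f" ]; then
            echo "missing or empty: $etc/$f" >&2
            status=1
        fi
    done
    return $status
}
```

Pointed at each client's root on the server, it returns non-zero if any
of the four files would be unusable at boot time.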

----------------------------------------------------------------------------

From: yared@anteros.enst.fr (Nadim Yared)
Organization: Telecom Paris, France

My story happened on a Sun Sparcstation 2

I once wanted to update the libc.so.1.7 to libc.so.1.8 by myself, so
I got root, and then ftp'd /lib/libc.so.1.8 to my /lib. Unfortunately
there was not enough room on this partition. So all I got was a file
with zero length.

The problem is that I ran /usr/etc/ldconfig in the directory /lib,
and that was all. No command could be executed, 'cause ld.so
checked for /libc.so.1.8, being the newest one. All I needed was a
statically linked mv, but Sun does not usually provide the source.
Even going single user didn't do anything. So I had to install a
miniroot on the swap partition, cp /bin/mv from the CD-ROM,
and execute it.
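
One way to avoid a zero-length library under the real name is to fetch
the file somewhere with room to spare and install it only after checking
it.  The sketch below assumes that approach.  install_checked and the
".tmp." prefix are hypothetical, and nothing here runs ldconfig.

```shell
# install_checked NEW DEST_DIR
# Copy NEW into DEST_DIR only if NEW is a non-empty file, and give it its
# real name only after the copy compares equal.  Hypothetical helper.
install_checked() {
    new=$1
    destdir=$2
    if [ ! -s "$new" ]; then
        echo "install_checked: $new is missing or zero length" >&2
        return 1
    fi
    base=$(basename "$new")
    tmp=$destdir/.tmp.$base
    # Copy under a temporary name first, so a failed copy (disk full, say)
    # never leaves a truncated file under the real name.
    cp "$new" "$tmp" && cmp -s "$new" "$tmp" || {
        rm -f "$tmp"
        echo "install_checked: copy to $destdir failed" >&2
        return 1
    }
    mv "$tmp" "$destdir/$base"
}
```

For the case above that would be something like
"install_checked /var/tmp/libc.so.1.8 /lib", with /var/tmp standing in
for any directory that has enough free space.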

----------------------------------------------------------------------------
*NEW*

From: TRIEMER@EAGLE.WESLEYAN.EDU
Organization: Wesleyan College

I have been trying to put an AT&T 3b2/310 machine on the net for a
while, I'll skip the unbelievable hardware problems.  I'll skip the
paranoid system admins that forced me to build a temporary net to show
them that the ethernet board worked.  Anyway, I get it up and running
on the temp net - it works fine - a little slow, but hey.  Ok, so I'm
ready to stick it on the net - you need to power down to do that, right?
So, I powered down.  Bad, bad bad mistake.  I had been running a sysadm
shell script - I needed to change a password so that I could get into an
account.  Well, would you believe that the script, despite the fact that
I wasn't in the passwd option anymore held onto the passwd file!  Stupid
machine, stupid script.  Anyway... what that means is that when I boot
up the machine, it passes diagnostics (A small miracle) runs unix and
doesn't let anyone log in!  I almost freaked.  Anyway, so...

There's an undocumented option on the installation disks called
'magic mode'.  At one point it offers 4 options (none of which is magic).
If you type magic mode at that point, you can get it... believe it or not
some AT&T person had the nerve, and bizarre sense of humor, to add one
extra line to magic mode - you see, when you type 'magic mode' it says

  Poof!

That was just about the last thing I wanted to see... the rest was in a
sense trivial... ran an fsck... it fixed it all for me.  So the moral of
the story... never ever assume that some prepackaged script that you are
running does anything right.