Recovering a Debian System after running rm /*

The Oxford English Dictionary[foot] I cannot believe that this is actually defined in the OED. I assumed it was a fake site or something. I also cannot believe that I actually used “The OED defines […]”. I didn’t even use that in the best man speech at my brother’s wedding.[/foot] defines an ohnosecond as:

a moment in which one realizes that one has made an error, typically by pressing the wrong button.

It’s more commonly referred to in Operations Management parlance as:

OHGODFUARGHFLKJAFWHATIDIDIDOARGHNO

It is unfortunately something that will happen to everyone during their systems administration career, and the variations are almost endless. Some notable occurrences include:

  • Copying SSL libraries over from a Debian host to a RHEL host
  • Setting a new root password and immediately losing it
  • Copying over an out of date backup CMS to a production system
  • Running one of the many variations of ‘rm’ at the wrong level

Unfortunately in a recent scenario, a poor hypothetical sysadmin managed to issue:

rm /*

instead of:

 rm ./*

This removed every non-directory at the / level, including symlinks. The impact of this varies between operating systems, and even between Linux distributions. We’re lucky that in this scenario there was no ‘-rf’ specified, or it’d be ‘recover from backup’ time. However, this situation did (hypothetically) pose an interesting conundrum.
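
To make the blast radius concrete, here is a minimal dry-run sketch (not part of the original incident) that lists what a bare rm with a glob would remove from a directory, without deleting anything. The function name list_rm_victims is made up for illustration.

```shell
# List what `rm DIR/*` (no -r) would delete: every non-directory entry
# directly inside DIR. Dotfiles are skipped, just as the shell glob skips them.
list_rm_victims() {
    dir=${1:-/}
    # ${dir%/} turns "/" into "" so the glob becomes /* rather than //*
    for entry in "${dir%/}"/*; do
        # Symlinks are removed by rm even when they point at a directory
        # (this is exactly what happened to /lib64).
        if [ -L "$entry" ] || { [ -e "$entry" ] && [ ! -d "$entry" ]; }; then
            printf '%s\n' "$entry"
        fi
    done
}

# Read-only: prints the candidates at the top level of the filesystem.
list_rm_victims /
```

Running something like this (or simply echo ./*) before the real rm costs a second and saves an ohnosecond.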

The Problem

On Debian x86_64 systems, /lib64 is a symlink to /lib, and you’ll find most applications are linked against paths in /lib64, as ‘ldd’ shows (note the last line):

linux-vdso.so.1 => (0x00007fff251ff000)
libc.so.6 => /lib/libc.so.6 (0x00007f278eb38000)
/lib64/ld-linux-x86-64.so.2 (0x00007f278eea2000)
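
The last line of that output is the one that matters here: it is the loader path baked into the binary. A small sketch for checking whether that path is still reachable follows; loader_status is a hypothetical helper name, and the default path is simply the one shown in the ldd output above.

```shell
# Report whether the dynamic loader path that binaries expect still resolves.
loader_status() {
    loader=${1:-/lib64/ld-linux-x86-64.so.2}
    if [ -e "$loader" ]; then
        # -e follows symlinks, so a dangling or missing /lib64 lands in the else branch
        printf 'ok: %s resolves to %s\n' "$loader" "$(readlink -f "$loader")"
    else
        printf 'missing: %s\n' "$loader"
    fi
}

loader_status
```

Note that on the broken box this check could only be run from a shell whose builtins cover it, since readlink is itself a dynamically linked binary.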

In the event of /lib64 not being accessible, most applications will fail to run because they’ll be missing a myriad of dependencies and won’t be able to find them. A brief investigation was followed by some furious attempts to revive the box, all with frankly disappointing results, including:

  • Using a statically compiled symlinker such as sln (Available by default on CentOS and RHEL, not on the affected Debian box)
  • Copying over sln via netcat and writing it out (proved surprisingly difficult)
  • Trying to copy over a symlink via rsync (couldn’t rsync/scp/sftp as they need to exec another process – which they can’t because of missing libraries)
  • Using BusyBox (Needs dynamic linking)
  • Writing a linker in C, compiling it, getting it over there via a mixture of cat, echo \x{..}\x{..}, and other incantations (I lost the will to live around this point)
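
For the curious, the "echo incantation" approach relies on the fact that shell builtins keep working when nothing dynamically linked will start. The following is a rough sketch of the idea rather than the exact commands used at the time; write_bytes is a made-up helper, and it assumes printf is a builtin in the running shell (true of bash and dash).

```shell
# Write arbitrary bytes to a file using only shell builtins and redirection.
# Usage: write_bytes FILE OCTAL [OCTAL...]   (each OCTAL is one byte, e.g. 177)
write_bytes() {
    out=$1
    shift
    : > "$out"                       # truncate the file without calling any binary
    for octal in "$@"; do
        printf "\\$octal" >> "$out"  # \NNN octal escapes are portable printf syntax
    done
}
```

For example, write_bytes ./demo.bin 177 105 114 106 writes the four-byte ELF magic number (0x7f 'E' 'L' 'F'). Doing this for an entire executable, one byte at a time, is about as much fun as it sounds.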

The Epiphany

I eventually remembered a slideshow, chmod -x chmod, which was surprisingly relevant. You see, the more eagle-eyed among you may have noticed we would end up missing one important dependency: ld-linux-x86-64.so.2.

ld-linux and ld-linux-x86-64 find and load the shared libraries used by other applications, preparing programs to run and then actually executing them. Most Linux binaries require dynamic linking, meaning the libraries the application depends on are loaded from a shared source at runtime rather than compiled into the executable, unless the -static option was used during compilation. As static linking is quite unlikely (with most modern distributions), if you cannot access ld-linux.so, you’re in trouble. Luckily, the loader itself still exists under /lib (only the /lib64 symlink that pointed at it was removed), so you can invoke it directly by its real path and pass it the program to run as an argument. It’ll resolve that program’s dependencies through the usual library search path, including your LD_LIBRARY_PATH at that point. A simple:

/lib/ld-2.11.1.so /bin/ln -s /lib /lib64

Restored the symlink and allowed normal execution of binaries again, leaving our hypothetical sysadmin off the hook, except for having to write a mildly humiliating email to the rest of the operations team who, understandably, responded a bit like this.
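
The same trick generalises to any dynamically linked binary. Below is a hedged sketch of it; find_loader and run_via_loader are illustrative names, and the glob patterns are guesses at where a loader might live on a given distribution rather than an authoritative list.

```shell
# Print the first plausible dynamic loader found under the real library directories.
find_loader() {
    for candidate in /lib/ld-*.so* /lib/*/ld-*.so* /lib64/ld-*.so*; do
        if [ -f "$candidate" ] && [ -x "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Run a program by handing it to the loader explicitly, bypassing the
# interpreter path stored inside the binary.
# Usage: run_via_loader LOADER PROGRAM [ARGS...]
run_via_loader() {
    ld=$1
    shift
    "$ld" "$@"
}

if found=$(find_loader); then
    echo "loader found at: $found"
fi
```

With the loader path in hand, run_via_loader "$found" /bin/ls / behaves like a plain /bin/ls /, provided the loader matches the binary's architecture. glibc's loader also accepts --list, which prints a program's dependencies in much the same way ldd does.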

Photo Flying by felixtsao (CC)

fail2ban time offset issues

While trying to set up fail2ban, I found that even though my regex and logs matched up, nothing was being banned or caught by fail2ban.

After a bit of investigation, it seems that the auth.log time was being written in GMT, whereas fail2ban was expecting it in BST:

==> /var/log/auth.log <==
Oct 11 20:52:21 ns2 sshd[18119]: Invalid user test from 1.2.3.4
Oct 11 20:52:21 ns2 sshd[18119]: Failed none for invalid user test from 1.2.3.4 port 47862 ssh2
Oct 11 20:52:28 ns2 sshd[18119]: Failed password for invalid user test from 1.2.3.4 port 47862 ssh2
==> /var/log/fail2ban.log <==
2010-10-11 21:52:04,017 fail2ban.filter: DEBUG  /var/log/auth.log has been modified
2010-10-11 21:52:04,029 fail2ban.filter.datedetector: DEBUG  Sorting the template list
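
The one-hour gap between the two sets of timestamps is the giveaway. A quick way to compare the offsets in play is sketched below; utc_offset is a hypothetical helper, and it assumes the Europe/London zone data is installed.

```shell
# Print the UTC offset for the system default zone, or for a named zone.
# Usage: utc_offset [ZONE]
utc_offset() {
    if [ -n "${1:-}" ]; then
        TZ=$1 date +%z
    else
        date +%z
    fi
}

echo "system default: $(utc_offset)"
echo "UTC:            $(utc_offset UTC)"
echo "Europe/London:  $(utc_offset Europe/London)"
```

If the system default matches UTC while the daemon reading the log assumes local time (or the other way round), the timestamps will disagree by exactly the offset printed for the local zone.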

Fairly simple fix of:

rm /etc/localtime
ln -s /usr/share/zoneinfo/Europe/London /etc/localtime

and I am now successfully banning myself from accessing my server.

LVM Stale NFS File Handles (Part 1)

So, here’s an interesting issue:

(initramfs) mount
rootfs on / type rootfs (rw)
none on /sys type sysfs (rw,nosuid,nodev,noexec)
none on /proc type proc (rw,nosuid,nodev,noexec)
udev on /dev type tmpfs (rw,size=10240k,mode=755)
/dev/pudding/root on /mnt type ext3 (rw,errors=continue,data=ordered)

 

So I’m using BusyBox, with an LVM volume mounted on /mnt. Happy?

(initramfs) ls /mnt
ls: /mnt/initrd.img.old: Stale NFS file handle
ls: /mnt/vmlinuz: Stale NFS file handle
ls: /mnt/vmlinuz.old: Stale NFS file handle

Only one directory was ever exported over NFS (a while ago), and it isn’t one of those affected; the box itself has never mounted anything over NFS. It seems the error can be caused when a file is open and the disk falls out from underneath it: an ambiguous error code is sent back, which is interpreted as a stale file handle. Either way, the superblock on this particular FS is corrupted, so the next step would be to attempt to recover using one of the backup superblocks. I’ll attempt this later and let you know how it goes. I’m sure you’ll be on the edge of your seats.
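
As a rough guide to where those backups live, here is a sketch that computes candidate backup superblock locations. It assumes the common ext3 defaults of 4 KiB blocks (32768 blocks per group, first data block 0) and the sparse_super feature, under which backups sit in group 1 and in groups that are powers of 3, 5 and 7. The function names are made up, and the real layout of any given filesystem should be confirmed with the e2fsprogs tools rather than trusted from this arithmetic.

```shell
# Succeeds if N is a power of BASE (including BASE^0 = 1).
is_power_of() {
    n=$1
    base=$2
    while [ "$n" -gt 1 ]; do
        [ $((n % base)) -eq 0 ] || return 1
        n=$((n / base))
    done
    [ "$n" -eq 1 ]
}

# Print backup superblock block numbers for a filesystem with NGROUPS block groups.
# Usage: backup_superblocks NGROUPS [BLOCKS_PER_GROUP] [FIRST_DATA_BLOCK]
backup_superblocks() {
    ngroups=$1
    bpg=${2:-32768}
    first=${3:-0}
    g=1
    while [ "$g" -lt "$ngroups" ]; do
        if [ "$g" -eq 1 ] || is_power_of "$g" 3 || is_power_of "$g" 5 || is_power_of "$g" 7; then
            echo $((g * bpg + first))
        fi
        g=$((g + 1))
    done
}

backup_superblocks 10
```

A candidate from that list would then typically be handed to e2fsck via its -b option (for example, e2fsck -b 32768 /dev/pudding/root), ideally against a copy of the volume rather than the original.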

MegaRaid Lies

Dell PowerEdge 1850. I’ve never seen it in the flesh, but believe it has a MegaRAID card.

# lsscsi
[0:0:6:0]    process PE/PV    1x2 SCSI BP      1.0   -
[0:1:0:0]    disk    MegaRAID LD 0 RAID1   69G 516A  /dev/sda
# grep -i raid /var/log/dmesg
[   20.251664] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
[   20.690899] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
[   20.690929] megaraid: probe new device 0x1028:0x0013:0x1028:0x016c: bus 2:slot 14:func 0
[   20.690964] megaraid 0000:02:0e.0: PCI INT A -> GSI 46 (level, low) -> IRQ 46
[   21.324054] megaraid: fw version:[516A] bios version:[H418]
[   21.332182] scsi0 : LSI Logic MegaRAID driver
[   21.332598] scsi[0]: scanning scsi channel 0 [Phy 0] for non-raid devices
[   24.821907] scsi 0:1:0:0: Direct-Access     MegaRAID LD 0 RAID1   69G 516A PQ: 0 ANSI: 2

Seems fine, right?

# ./MegaCli64 -adpCount
Controller Count: 0.
Exit Code: 0x00

Hrm.

/opt/MegaRAID/MegaCli # omreport storage controller
No controllers found

Starting to get tweaky.

Update

Thanks to jtopper for the help so far. Getting a bit further, but still:

wget http://www.lsi.com/DistributionSystem/AssetDocument/files/support/rsa/utilities/megaconf/ut_linux_megarc_1.11.zip
unzip ut_linux_megarc_1.11.zip
sudo ./megarc.bin -AllAdpInfo
        Failed to get driver version
        No Adapters Found

And…

$ grep MAJOR megarc
MAJOR=`grep megadev /proc/devices|awk '{print $1}'`
$ grep -ci mega /proc/devices
0

Further Update

I’ve finally managed to get to the bottom of this. It looks like any app that creates the /dev/megadev0 device node does so with the wrong major number. To fix this, based on some brilliant info, I created the node by hand with a major of 10 (now that 252 is used for usbmon) and the minor listed in /proc/misc.

# mknod /dev/megadev0 c 10 59
# ./megarc -dispCfg -a0
        **********************************************************************
              MEGARC MegaRAID Configuration Utility(LINUX)-1.11(12-07-2004)
              By LSI Logic Corp.,USA
        **********************************************************************
          [Note: For SATA-2, 4 and 6 channel controllers, please specify
          Ch=0 Id=0..15 for specifying physical drive(Ch=channel, Id=Target)]
        Type ? as command line arg for help
        Finding Devices On Each MegaRAID Adapter...
        Scanning Ha 0, Chnl 0 Target 15
        **********************************************************************
              Existing Logical Drive Information
              By LSI Logic Corp.,USA
        **********************************************************************
          [Note: For SATA-2, 4 and 6 channel controllers, please specify
          Ch=0 Id=0..15 for specifying physical drive(Ch=channel, Id=Target)]
          Logical Drive : 0( Adapter: 0 ):  Status: OPTIMAL
        ----------------------------------------------------------------------------
        SpanDepth :01     RaidLevel: 1  RdAhead : Adaptive  Cache: DirectIo
        StripSz   :064KB   Stripes  : 2  WrPolicy: WriteBack
        Logical Drive 0 : SpanLevel_0 Disks
        Chnl  Target  StartBlock   Blocks      Physical Target Status
        ----  ------  ----------   ------      ----------------------
        0      00    0x00000000   0x0887c000   ONLINE
        0      01    0x00000000   0x0887c000   ONLINE
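
The minor number passed to mknod above does not have to be guessed. A small sketch for looking it up follows; misc_minor is a hypothetical helper, and it assumes the driver registers its control device under the name megadev0, as the node name suggests.

```shell
# Look up the minor number of a misc device by name.
# /proc/misc lists one "MINOR NAME" pair per line; misc devices share major 10.
# Usage: misc_minor NAME [FILE]
misc_minor() {
    name=$1
    file=${2:-/proc/misc}
    [ -r "$file" ] || return 1
    awk -v name="$name" '$2 == name { print $1 }' "$file"
}

# Print the mknod command rather than running it, so nothing is created by accident.
if minor=$(misc_minor megadev0) && [ -n "$minor" ]; then
    echo "mknod /dev/megadev0 c 10 $minor"
else
    echo "megadev0 is not registered in /proc/misc"
fi
```

This also explains the earlier empty grep: the device shows up in /proc/misc under major 10 rather than as its own entry in /proc/devices.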

Hope this helps someone, and many thanks to these guys.