I am Alex Smith.
I can currently be found working with KITD, Channel 4, mySociety, the Open Rights Group, no2id.
Celery is a distributed task queue for Python. It’s pretty useful, and a lot of apps I’m involved in deploying seem to be using it lately.
Something it seems to struggle with is stability; in the event of a database disappearing, being unable to resolve a database’s hostname, or a single connection to a database failing, it just shuts down.
I needed this not to happen. When running things in “the cloud” (sorry) you’re very much at the mercy of other people controlling your networking/tin/everything, so you need applications that can tolerate a little bit of failure (even if the application was originally written to shut down like this to avoid split brain or similar). To get around this, we set up monit to restart celeryd. I am definitely not a fan of apps automatically restarting, but it was the only trivial resolution in this situation. Just append this to your monit config and you should be sorted. My understanding is that there isn’t a better solution yet, but I would be interested to know if anyone has seen one.
check process celeryd with pidfile /var/run/celeryd.pid
  start program = "/etc/init.d/celeryd start" with timeout 10 seconds
  stop program = "/etc/init.d/celeryd stop"
  if changed pid then restart
  if 5 restarts within 5 cycles then timeout
  alert youremailaddresshere
(I appreciate this is especially tedious, but this is for my reference)
When using nginx as a caching proxy, I found myself needing to ignore particular parameters for both the cache key and the values being passed to the backend. In this particular situation the value I wanted to ignore was ‘uid’. An example URI being:
or
To ignore this, in the top of my site configuration I put:
proxy_cache_key "$scheme$host$uri$is_args$args";
in the server stanza:
if ($args ~ (.*?)(?:^|(&))uid=[^&]*(?:(\2.*)|&(.*))?) {
set $args $1$3$4;
}
if ($args ~ (^\w)) {
set $args ?$args;
}
and the location stanza:
proxy_pass http://appservers$uri$args;
So now my backend servers see:
GET /foo.ext?env=bar&node=qux
or
GET /bar.ext
and very few hits get through to the backend anyway, as the cache key no longer varies with uid.
Easy.
EDIT: The ‘easy’ bit is a lie, it seems. Thanks to @davidgl for pulling me out of regex hell. Several revisions here helped by him.
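For anyone who finds the regex hard to follow, here is a rough Python sketch of what the rewrite is meant to achieve: drop every `uid` parameter from the query string and only emit a `?` when something is left. It is not what nginx runs; the function names and the `uid=123` value are made up for illustration, and `urlencode` may re-escape values differently from the regex approach.

```python
from urllib.parse import parse_qsl, urlencode


def strip_param(query: str, name: str = "uid") -> str:
    """Return the query string with every occurrence of `name` removed."""
    pairs = parse_qsl(query, keep_blank_values=True)
    kept = [(key, value) for key, value in pairs if key != name]
    return urlencode(kept)


def backend_path(uri: str, query: str) -> str:
    """Build the path sent to the backend, adding '?' only if args remain."""
    stripped = strip_param(query)
    return f"{uri}?{stripped}" if stripped else uri


# Hypothetical requests; the uid value is invented for the example.
print(backend_path("/foo.ext", "env=bar&uid=123&node=qux"))
print(backend_path("/bar.ext", "uid=123"))
```

Handy as a reference implementation when testing further revisions of the regex against awkward cases (uid first, uid last, uid on its own).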
While trying to set up fail2ban, I found that even though my regex and logs matched up, nothing was being caught or banned by fail2ban.
After a bit of investigation it seems that the auth.log time was being written in GMT whereas fail2ban was expecting it in BST:
==> /var/log/auth.log <==
Oct 11 20:52:21 ns2 sshd[18119]: Invalid user test from 1.2.3.4
Oct 11 20:52:21 ns2 sshd[18119]: Failed none for invalid user test from 1.2.3.4 port 47862 ssh2
Oct 11 20:52:28 ns2 sshd[18119]: Failed password for invalid user test from 1.2.3.4 port 47862 ssh2
==> /var/log/fail2ban.log <==
2010-10-11 21:52:04,017 fail2ban.filter : DEBUG /var/log/auth.log has been modified
2010-10-11 21:52:04,029 fail2ban.filter.datedetector: DEBUG Sorting the template list
Fairly trivial fix of:
rm /etc/localtime
ln -s /usr/share/zoneinfo/Europe/London /etc/localtime
and I am now successfully banning myself from accessing my server. Vunderbar.
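A quick way to confirm this kind of mismatch is to compare a timestamp from each log. The sketch below uses the two timestamps quoted above; the year for the syslog line is an assumption (syslog doesn't record one), and the helper names are made up for the example.

```python
from datetime import datetime


def parse_syslog(stamp: str, year: int) -> datetime:
    """Parse a syslog-style timestamp; the year is supplied by the caller."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def parse_fail2ban(stamp: str) -> datetime:
    """Parse a fail2ban.log timestamp such as '2010-10-11 21:52:04,017'."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


def skew_hours(log_time: datetime, daemon_time: datetime) -> int:
    """Offset between the two clocks, rounded to the nearest whole hour."""
    return round((daemon_time - log_time).total_seconds() / 3600)


auth = parse_syslog("Oct 11 20:52:21", 2010)   # year assumed from fail2ban.log
f2b = parse_fail2ban("2010-10-11 21:52:04,017")
print(skew_hours(auth, f2b))  # prints 1
```

A non-zero result for entries written at roughly the same moment suggests the two processes disagree about the timezone.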
450 Requested action aborted [7.2] 20412, please visit www.messagelabs.com/support for more details about this error message.
It took a remarkably large amount of searching to find out what ‘[7.2]’ meant in this error message, and why we kept getting a mailserver’s IP blacklisted, but if this happens to you, hopefully this will help resolve it.
When MessageLabs returns a [7.2], this seems to mean that they’ve checked the IP address of the host which is connecting to their MX against the CBL. Connections will be dropped immediately, rather than mail being rejected, as such:
# telnet cluster8a.eu.messagelabs.com 25
Trying 85.158.143.51…
Connected to cluster8a.eu.messagelabs.com (85.158.143.51).
Escape character is '^]'.
450 Requested action aborted [7.2] 20412, please visit www.messagelabs.com/support for more details about this error message.
Connection closed by foreign host.
The easiest way to get around this is to fix your mail server, then request delisting from the CBL.
On a completely unrelated note (ahem), it seems that you may be added to the CBL if you send an email to a gmail address from a domain whose SPF records explicitly disallow the sending mail server (such as -all with no matching include). Google will automatically (?) submit the IP address to the CBL and your problems will begin (again).
I highly recommend robtex as a lazy way to check your hosts against blacklists.
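If you'd rather check a single list by hand, DNS blacklists are conventionally queried by reversing the IP's octets and looking the result up under the list's zone. A minimal sketch follows, assuming the CBL zone is `cbl.abuseat.org` (check the list's current documentation, as zones change); the function names are made up for the example.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str = "cbl.abuseat.org") -> str:
    """Build the DNSBL lookup name: reversed IPv4 octets under the zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str = "cbl.abuseat.org") -> bool:
    """True if the lookup name resolves; a failed lookup is read as not listed."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


# 192.0.2.1 is a documentation address, used here only to show the name format.
print(dnsbl_query_name("192.0.2.1"))
```

Note that a failed lookup can also mean your resolver is broken or refused by the list, so treat a "not listed" answer with some suspicion.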
Hypothetical situation. You installed VMware ESX, possibly upgraded from 3.5 to 4, went with the embedded SQL Server, and Many Years Later the VirtualCenter server no longer starts. You look through the event logs and the best you can find is:
Faulting application vpxd.exe, version 4.0.10021.0, faulting module kernel32.dll, version 5.2.3790.4480, fault address 0x0000bef7.
So you decide to look at general application eventlog events rather than just for VMware:
Could not allocate space for object 'dbo.VPX_EVENT'.'PK_VPX_EVENT' in database 'VIM_VCDB' because the 'PRIMARY' filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
“Great”, you think. I can just pass this over to a DBA to get them to increase the filegroup size. Then you dig a bit deeper and look at the event log for SQLServer:
CREATE DATABASE or ALTER DATABASE failed because the resulting cumulative database size would exceed your licensed limit of 4096 MB per database.
“Oh no!” you sob. You really don’t want to try migrating to an enterprise database right now. Worry not, there’s a VMware solution. This easy process is:
SET @DELETE_DATA = 0
to
SET @DELETE_DATA = 1
****************** SUMMARY *******************
Deleted 8400 rows from VPX_TASK table.
Deleted 2585209 rows from VPX_EVENT_ARG table.
Deleted 1662120 rows from VPX_EVENT table.
Deleted 0 rows from VPX_HIST_STAT1 table.
Deleted 0 rows from VPX_SAMPLE_TIME1 table.
Deleted 0 rows from VPX_HIST_STAT2 table.
Deleted 0 rows from VPX_SAMPLE_TIME2 table.
Deleted 0 rows from VPX_HIST_STAT3 table.
Deleted 0 rows from VPX_SAMPLE_TIME3 table.
Deleted 105331 rows from VPX_HIST_STAT4 table.
Deleted 373 rows from VPX_SAMPLE_TIME4 table.
Hopefully this will save someone a bit of googling.
Well, it’s one louder, isn’t it? It’s not ten. You see, most blokes, you know, will be playing at ten. You’re on ten here, all the way up, all the way up, all the way up, you’re on ten on your guitar. Where can you go from there? Where?
Imposing ridiculously over the top security policies? Want to make sure any SSH private keys on your jump-off/administration server have a passphrase?
Don’t waste time trying to get expect working…
expect <<EOF
spawn ssh-keygen -f file -y
expect -timeout 1 "Enter passphrase:" {exit 1}
EOF
Just look at the damn file (thanks @ealexhudson and @Azquelt) and check if it’s got ‘Proc-Type: 4,ENCRYPTED’ in it.
Without
root@a-server ~ # find /home/*/.ssh/ -name "id_*sa" -exec grep -L ENCRYPTED {} \; | wc -l
19
With
root@a-server ~ # find /home/*/.ssh/ -name "id_*sa" -exec grep -l ENCRYPTED {} \; | wc -l
1
Lovely. This of course doesn’t solve the issue of checking, from the SSH public keys, whether the private keys have passphrases or not.
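The same check can be sketched in Python if you want to report which keys are unprotected rather than just count them. This is only a sketch: the function names are invented, and it relies on the same legacy PEM `Proc-Type` header as the grep, so keys in the newer OpenSSH private key format (which has no such header) would always be reported as unprotected.

```python
import glob


def looks_encrypted(pem_text: str) -> bool:
    """True if a legacy PEM header marks the key as passphrase-protected."""
    return any(
        line.startswith("Proc-Type:") and "ENCRYPTED" in line
        for line in pem_text.splitlines()
    )


def unprotected_keys(pattern: str = "/home/*/.ssh/id_*sa") -> list:
    """Return paths matching the glob whose contents lack the header."""
    found = []
    for path in sorted(glob.glob(pattern)):
        with open(path, "r", errors="replace") as handle:
            if not looks_encrypted(handle.read()):
                found.append(path)
    return found


if __name__ == "__main__":
    for path in unprotected_keys():
        print(path)
```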
So, here’s an interesting issue:
(initramfs) mount
rootfs on / type rootfs (rw)
none on /sys type sysfs (rw,nosuid,nodev,noexec)
none on /proc type proc (rw,nosuid,nodev,noexec)
udev on /dev type tmpfs (rw,size=10240k,mode=755)
/dev/pudding/root on /mnt type ext3 (rw,errors=continue,data=ordered)
So I’m using BusyBox, with an LVM volume mounted on /mnt. Happy?
(initramfs) ls /mnt
ls: /mnt/initrd.img.old: Stale NFS file handle
ls: /mnt/vmlinuz: Stale NFS file handle
ls: /mnt/vmlinuz.old: Stale NFS file handle
Only one directory on this box was ever exported over NFS (a while ago), and it isn’t one of those affected; the box has never mounted anything over NFS. It seems the error can be caused when a file is open and the disk falls out from underneath it: an ambiguous error code is sent back, which is interpreted as a stale file handle. Either way, the superblock on this particular FS is corrupted, so the next step would be to attempt to recover using one of the backup superblocks. I’ll attempt this later and let you know how it goes. I’m sure you’ll be on the edge of your seats.
If you’ve recently noticed that your renamed users in bitlbee have changed back to UIDs, it’s probably because Facebook have changed their UID string from being ‘uNNNNNN’ to ‘-NNNNNN’. Not a huge problem, just change the script:
% diff bitlbee_rename.pl.old bitlbee_rename.pl
22c22
< if($channel == $bitlbeeChannel && $nick == $username && $nick =~ m/^u\d+/ && $host == "chat.facebook.com")
---
> if($channel == $bitlbeeChannel && $nick == $username && $nick =~ m/^-\d+/ && $host == "chat.facebook.com")
25c25
< $server->command("whois $nick");
---
> $server->command("whois \"$nick\"");
(My) updated version here. Hope this helps at least one person.
Update
@TheSamoth has pointed out that newer bitlbee (>= 1.2.5) doesn’t actually require the rename script, as it has the functionality built in. Thanks! It can be enabled with:
account set facebook/nick_source full_name
Dell PowerEdge 1850. I’ve never seen it in the flesh, but believe it has a MegaRAID card.
# lsscsi
[0:0:6:0] process PE/PV 1x2 SCSI BP 1.0 -
[0:1:0:0] disk MegaRAID LD 0 RAID1 69G 516A /dev/sda
# grep -i raid /var/log/dmesg
[ 20.251664] megaraid cmm: 2.20.2.7 (Release Date: Sun Jul 16 00:01:03 EST 2006)
[ 20.690899] megaraid: 2.20.5.1 (Release Date: Thu Nov 16 15:32:35 EST 2006)
[ 20.690929] megaraid: probe new device 0x1028:0x0013:0x1028:0x016c: bus 2:slot 14:func 0
[ 20.690964] megaraid 0000:02:0e.0: PCI INT A -> GSI 46 (level, low) -> IRQ 46
[ 21.324054] megaraid: fw version:[516A] bios version:[H418]
[ 21.332182] scsi0 : LSI Logic MegaRAID driver
[ 21.332598] scsi[0]: scanning scsi channel 0 [Phy 0] for non-raid devices
[ 24.821907] scsi 0:1:0:0: Direct-Access MegaRAID LD 0 RAID1 69G 516A PQ: 0 ANSI: 2
Seems fine, right?
# ./MegaCli64 -adpCount
Controller Count: 0.
Exit Code: 0x00
Hrm.
/opt/MegaRAID/MegaCli # omreport storage controller
No controllers found
Starting to get tweaky.
Update
Thanks to jtopper for the help so far. Getting a bit further, but still:
unzip ut_linux_megarc_1.11.zip
sudo ./megarc.bin -AllAdpInfo
Failed to get driver version
No Adapters Found
And…
$ grep MAJOR megarc
MAJOR=`grep megadev /proc/devices|awk '{print $1}'`
$ grep -ci mega /proc/devices
0
Further Update
I’ve finally managed to get to the bottom of this. Looks like any app which creates the /dev/megadev0 device does it with the wrong major. To fix this, based on some brilliant info, I used a major of 10 (now that 252 is used for usbmon), and a minor from /proc/misc.
# mknod /dev/megadev0 c 10 59
$ sudo ./megarc -dispCfg -a0
**********************************************************************
MEGARC MegaRAID Configuration Utility(LINUX)-1.11(12-07-2004)
By LSI Logic Corp.,USA
**********************************************************************
[Note: For SATA-2, 4 and 6 channel controllers, please specify
Ch=0 Id=0..15 for specifying physical drive(Ch=channel, Id=Target)]
Type ? as command line arg for help
Finding Devices On Each MegaRAID Adapter…
Scanning Ha 0, Chnl 0 Target 15
**********************************************************************
Existing Logical Drive Information
By LSI Logic Corp.,USA
**********************************************************************
[Note: For SATA-2, 4 and 6 channel controllers, please specify
Ch=0 Id=0..15 for specifying physical drive(Ch=channel, Id=Target)]
Logical Drive : 0( Adapter: 0 ): Status: OPTIMAL
—————————————————————————-
SpanDepth :01 RaidLevel: 1 RdAhead : Adaptive Cache: DirectIo
StripSz :064KB Stripes : 2 WrPolicy: WriteBack
Logical Drive 0 : SpanLevel_0 Disks
Chnl Target StartBlock Blocks Physical Target Status
—— ——— ————— ——— ———————————
0 00 0x00000000 0x0887c000 ONLINE
0 01 0x00000000 0x0887c000 ONLINE
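The minor number lookup mentioned above can be scripted rather than read by eye. A minimal sketch, assuming the driver registers itself in /proc/misc as `megadev0` (the function name and the sample text in the comment are illustrative; 59 is simply the value used on this box):

```python
def misc_minor(proc_misc_text: str, name: str = "megadev0"):
    """Return the minor number listed for `name` in /proc/misc, or None."""
    for line in proc_misc_text.splitlines():
        fields = line.split()
        # Each /proc/misc line is "<minor> <device name>".
        if len(fields) == 2 and fields[1] == name:
            return int(fields[0])
    return None


if __name__ == "__main__":
    with open("/proc/misc") as handle:
        minor = misc_minor(handle.read())
    if minor is None:
        print("megadev0 not registered; is the megaraid driver loaded?")
    else:
        # Major 10 is the misc character device major.
        print(f"mknod /dev/megadev0 c 10 {minor}")
```

It only prints the mknod command rather than running it, so you can sanity-check the numbers first.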
Hope this helps someone, and many thanks to these guys.
I’ve been having some issues with Firefox of late; Flash has been causing it to crash more than usual. People have been very quick to recommend Chrome, but time and time again I’ve had to say no, for one very simple reason: the address bar search just gets on my nerves. I’m very used to Firefox’s address bar matching on both the URI and the <title> element. I think Chrome matches on the page body instead, and it just gets to me.
If I eventually manage to figure it out, I’m sure Chrome will be my browser of choice as I hear that a FoxyProxy-a-like is now available, too!