Celery is a distributed task queue for Python. It’s pretty useful, and a lot of apps I’m involved in deploying seem to be using it lately.
Something it seems to struggle with is stability: if a database disappears, a database's hostname can't be resolved, or a single connection to a database fails, it just shuts down.
I needed this to not happen. When running things in “the cloud” (sorry) you’re very much at the mercy of other people controlling your networking/tin/everything, so you need to write applications that can tolerate a little bit of failure (even if the application was originally written to shut down like this in order to avoid split brain or similar).
To get around this, we used monit to watch the Celery process and start it again when it dies. I am definitely not a fan of apps automatically restarting, but it was the only trivial resolution in this situation. Just append this to your monit config and you should be sorted. My understanding is that there isn’t a better solution yet, but I would be interested to know if anyone has seen one.
check process celeryd with pidfile /var/run/celeryd.pid
    start program = "/etc/init.d/celeryd start" with timeout 10 seconds
    stop program = "/etc/init.d/celeryd stop"
    if changed pid then restart
    if 5 restarts within 5 cycles then timeout
    alert youremailaddresshere
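If you want to see roughly what a pidfile-based check like this boils down to, here is a minimal Python sketch. It is not how monit is implemented, just an illustration of the idea: read the PID from the file and ask the kernel whether that process still exists. The function name `pid_is_running` is made up for this example, and it assumes a POSIX system where sending signal 0 only tests for the process's existence.

```python
import os


def pid_is_running(pidfile):
    """Return True if the PID recorded in `pidfile` belongs to a live process.

    Hypothetical helper for illustration; assumes POSIX signal semantics.
    """
    try:
        with open(pidfile) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or malformed pidfile: treat as not running.
        return False
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        # Signal 0 delivers nothing; it only checks existence and permissions.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but is owned by another user.
        return True
    return True
```

Calling `pid_is_running("/var/run/celeryd.pid")` with the path from the monit config above would tell you whether the recorded process is still alive. Note that a stale pidfile whose PID has been reused by some other process will also report as running, which is one reason to leave this job to monit rather than a hand-rolled script.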
(I appreciate this is especially tedious, but this is for my reference)