How to write a reliable background job

Background jobs are essential to getting scheduled and periodic tasks done. Because they run behind the scenes, they often fail in silent and confusing ways. The following ideas will help you write jobs that are more likely to succeed, that isolate failures, and that cope well when your job runner automatically re-runs them.

I’ve included some pseudocode to illustrate each point.

Only do one thing per job

The more things you do in a background process, the higher your chance of failure. Be careful of running an entire workflow across multiple clients or accounts in a single job. It might be better to spawn a job per client or even a job per action.

Instead of running a single job that iterates through all your clients and generates an invoice for each one, consider a parent job that iterates through the clients and queues one job per client to perform the invoice generation. When you build your jobs like this, a failure to generate a single invoice won’t prevent every other client from receiving their invoice.

If you email each client their invoice, you might even consider that as a separate job. If the invoice is successfully generated but the email fails to go out (because your SMTP server or sending API is down) you will want to be able to restart the email sending without regenerating the invoice.

function invoice_job() {
  var clients = get_all_clients()
  for client in clients {
    queue("invoice_client_job", client.id)
  }
}

function invoice_client_job(client_id) {
  var client = get_client(client_id)
  var invoice = generate_invoice(client)
  queue("email_invoice_job", invoice.id)
}

function email_invoice_job(invoice_id) {
  // send invoice email
}

When each job is small and focused, it has fewer points of failure, and you can re-run failed jobs without unintended consequences.
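To make the isolation concrete, here is a minimal Python sketch of the same fan-out. The in-memory queue, the `CLIENTS` table and the simulated failure for client 2 are all stand-ins I made up for illustration; a real job runner would supply its own queue and retry handling.

```python
from collections import deque

# Hypothetical in-memory stand-ins for a real job queue and client store.
job_queue = deque()
CLIENTS = {1: "Acme", 2: "Globex", 3: "Initech"}
invoices = {}  # client_id -> invoice text
failed = []    # (job name, args, error message) for every job that raised


def invoice_job():
    """Parent job: only enqueues one child job per client."""
    for client_id in CLIENTS:
        job_queue.append((invoice_client_job, (client_id,)))


def invoice_client_job(client_id):
    """Child job: generates exactly one invoice."""
    if client_id == 2:
        raise RuntimeError("simulated failure for client 2")
    invoices[client_id] = f"Invoice for {CLIENTS[client_id]}"


def run_worker():
    """Run queued jobs one at a time; a failing job does not stop the rest."""
    while job_queue:
        job, args = job_queue.popleft()
        try:
            job(*args)
        except Exception as exc:
            failed.append((job.__name__, args, str(exc)))


invoice_job()
run_worker()
print(sorted(invoices))  # clients 1 and 3 were invoiced
print(failed)            # only the job for client 2 failed
```

Only the job for client 2 ends up in `failed`, so it is the only one that would need a retry.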

Make your jobs idempotent

Most job runners will re-run a job if it fails. What you don’t want is for a re-run to repeat work that has already been completed. Even if you’ve followed the advice in the previous section, you might find that a partially run job needs to re-run (because it timed out), and it might have to step over already completed steps before continuing on.

Let’s revisit the invoicing job. The job that invoices all clients might time out for some reason before it has queued every invoice job. When the job restarts, it should only iterate over those clients who don’t have an invoice yet. Because some queued invoice jobs may not have completed yet, this check alone won’t be entirely reliable, so each invoice job should also generate an invoice only if one hasn’t been generated yet.

function invoice_job() {
  var clients = get_uninvoiced_clients()
  for client in clients {
    queue("invoice_client_job", client.id)
  }
}

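function invoice_client_job(client_id) {
  var client = get_client(client_id)
  var invoice
  if(client_needs_invoice(client)) {
    invoice = generate_invoice(client)
  } else {
    // an earlier run already generated the invoice, so look it up
    invoice = get_invoice_for(client)
  }
  queue("email_invoice_job", invoice.id)
}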

function email_invoice_job(invoice_id) {
  if(invoice_has_not_been_emailed(invoice_id)) {
    // send invoice email
  }
}

By including a check at the beginning of each job, you avoid unwanted duplication.
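Here is a rough Python sketch of those checks. The dictionaries and the `INV-` id format are assumptions standing in for real database tables, and for brevity the email job is called directly instead of being queued.

```python
# Hypothetical in-memory stores standing in for database tables.
invoices = {}    # client_id -> invoice id
emailed = set()  # invoice ids that have already been emailed
sent_log = []    # every email "sent", so duplicates would be visible


def invoice_client_job(client_id):
    # Reuse the existing invoice if a previous run already created it.
    invoice_id = invoices.get(client_id)
    if invoice_id is None:
        invoice_id = f"INV-{client_id}"
        invoices[client_id] = invoice_id
    email_invoice_job(invoice_id)


def email_invoice_job(invoice_id):
    # Skip the send if this invoice has already gone out.
    if invoice_id in emailed:
        return
    sent_log.append(invoice_id)
    emailed.add(invoice_id)


invoice_client_job(7)
invoice_client_job(7)  # simulated retry of the same job
print(sent_log)
```

Running the job twice still produces one invoice and one email. Note that a check followed by a write is not atomic, so with several workers running at once you would also want something like a unique constraint in the database to back it up.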

Include timestamps in your job arguments

Don’t rely on the current time when running a job. Always include a date or timestamp in your job arguments. If your job fails overnight and you or your job runner re-runs it the next day, you want to know that the job will process the correct data.

If your job to generate invoices for the end of the month fails, and you have to re-run it the next day, you don’t want it generating invoices for the wrong month.

function invoice_job(date) {
  var clients = get_uninvoiced_clients_for(date)
  for client in clients {
    queue("invoice_client_job", client.id, date)
  }
}

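function invoice_client_job(client_id, date) {
  var client = get_client(client_id)
  var invoice
  if(client_needs_invoice_for(client, date)) {
    invoice = generate_invoice_for(client, date)
  } else {
    // an earlier run already generated the invoice, so look it up
    invoice = get_invoice_for(client, date)
  }
  queue("email_invoice_job", invoice.id)
}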

function email_invoice_job(invoice_id) {
  if(invoice_has_not_been_emailed(invoice_id)) {
    // send invoice email
  }
}

Including the date or time in your job arguments guarantees that the job processes the correct data no matter when it runs.
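As a small Python sketch of the idea, the job below derives its billing period purely from the date it was scheduled for. Passing the date as an ISO string and the `billing_period` helper are my own assumptions for illustration.

```python
from datetime import date, timedelta


def billing_period(run_date):
    """Return (first_day, last_day) of the month containing run_date."""
    first_day = run_date.replace(day=1)
    # Jump safely into the next month, then step back one day.
    next_month = (first_day + timedelta(days=32)).replace(day=1)
    return first_day, next_month - timedelta(days=1)


def invoice_job(run_date_iso):
    # The period depends only on the argument, never on date.today().
    run_date = date.fromisoformat(run_date_iso)
    return billing_period(run_date)


# The job was scheduled for 31 January; a retry on 1 February
# passes the same argument and so bills the same month.
print(invoice_job("2024-01-31"))
```

Because the argument travels with the job, it makes no difference whether the retry happens a minute later or a week later.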

Log generously

Because background jobs run on a schedule and outside of normal app processes, they can be very hard to debug. By including ample log statements you make it much easier for yourself to diagnose problems.

Sometimes your system will be running multiple jobs of the same kind. Because of this it is helpful to include a unique indicator per job so you can filter the logs down to a specific job and work through only that one job’s logs.

In this example I include the client id and date in every logging line. If your logging system supports it, you could also tag the log entries.

function invoice_job(date) {
  log("Generating invoices for ${date}")
  var clients = get_uninvoiced_clients_for(date)
  for client in clients {
    log("Queuing invoice job for client ${client.name} (${client.id}, ${date})")
    queue("invoice_client_job", client.id, date)
  }
}

function invoice_client_job(client_id, date) {
  var client = get_client(client_id)
  var invoice
  if(client_needs_invoice_for(client, date)) {
    log("Generating invoice for ${client.name} (${client.id}, ${date})")
    invoice = generate_invoice_for(client, date)
  } else {
    log("No new invoice needed for ${client.name} (${client.id}, ${date})")
    // an earlier run already generated the invoice, so look it up
    invoice = get_invoice_for(client, date)
  }
  log("Queuing email job for client ${client.name} (${client.id}, ${date})")
  queue("email_invoice_job", invoice.id)
}

function email_invoice_job(invoice_id) {
  if(invoice_has_not_been_emailed(invoice_id)) {
    var invoice = get_invoice(invoice_id)
    log("Sending email for invoice ${invoice.number} (${invoice.client_id}, ${invoice.date})")
    // send invoice email
  }
}
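
If you are working in Python, one way to get a unique indicator onto every line is the standard library's `logging.LoggerAdapter`. This is only a sketch: the logger name, the `job_id` field and the short random id are my own choices, and the in-memory stream is there purely to keep the example self-contained.

```python
import io
import logging
import uuid

# Log to an in-memory stream so the example is self-contained;
# a real job would log to stderr, a file or a log service.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s [job=%(job_id)s] %(message)s")
)

base_logger = logging.getLogger("invoice_jobs")
base_logger.setLevel(logging.INFO)
base_logger.addHandler(handler)
base_logger.propagate = False


def invoice_client_job(client_id, date, job_id=None):
    # One id per job run, attached to every line this run logs.
    job_id = job_id or uuid.uuid4().hex[:8]
    log = logging.LoggerAdapter(base_logger, {"job_id": job_id})
    log.info("Generating invoice for client %s (%s)", client_id, date)
    log.info("Queuing email job for client %s (%s)", client_id, date)


invoice_client_job(42, "2024-01-31", job_id="a1b2c3d4")
print(stream.getvalue())
```

Every line from that run carries `[job=a1b2c3d4]`, so filtering the logs for one job becomes a simple text search.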

Use a monitoring service

While logging is important, it is only helpful in diagnosing problems after the fact. Because jobs run in the background, they often fail silently. You don’t want to find out that your invoicing job failed only after your bank balance drops.

You might think that emailing yourself a report after each job is enough (e.g. cron will send an email on each run). But it’s hard to filter through the noise, and harder still to notice when an expected email never arrives. It is far better to receive a proactive notification when a job fails or doesn’t run at all.

0 0 1 * * /usr/local/bin/invoice_clients.sh; curl -d "s=$?" https://notify.do/my-monitor

If you don’t have a monitoring service, consider Sitesure monitoring. Sitesure will listen for jobs on a schedule and notify you if they don’t check in or if they report a failure. Sitesure supports both scheduled jobs and ad-hoc jobs that tend to run within expected time windows (heartbeat monitoring).
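
If your jobs are written in Python rather than launched from a shell script, the same check-in can happen inside the job. The sketch below is an assumption-laden illustration: `run_and_report` and `post_status` are names I made up, and the `s=<status>` parameter simply mirrors the cron line above. The demo swaps in a fake reporter so nothing is sent over the network.

```python
import urllib.parse
import urllib.request


def post_status(url, status):
    """Default reporter: POST s=<status> to the monitor, like curl -d "s=$?"."""
    data = urllib.parse.urlencode({"s": status}).encode()
    urllib.request.urlopen(url, data=data, timeout=10)


def run_and_report(job, monitor_url, report=post_status):
    """Run job(), then report 0 on success or 1 on failure."""
    status = 0
    try:
        job()
    except Exception:
        status = 1
        raise  # let the job runner see the failure too
    finally:
        report(monitor_url, status)


# Demo with a fake reporter that records calls instead of sending them.
MONITOR_URL = "https://notify.do/my-monitor"
calls = []


def fake_report(url, status):
    calls.append((url, status))


def failing_job():
    raise RuntimeError("simulated failure")


run_and_report(lambda: None, MONITOR_URL, report=fake_report)
try:
    run_and_report(failing_job, MONITOR_URL, report=fake_report)
except RuntimeError:
    pass
print(calls)
```

The monitor hears about both the success and the failure, and a job that never starts is caught by the missing check-in.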

Conclusion

With some upfront planning you can write background jobs that are more reliable and serve you well.
What problems have you encountered with your background jobs, and what tips do you have for making them more reliable? Let me know.