Worker

The Worker Protocol #

Workers interact with Rota through the Worker Protocol. Note that not all jobs are completed this way: if you configure your jobs to be triggered through a push-based mechanism like AWS Lambda, then the Worker Protocol will not come into play. This protocol is specific to traditional worker processes that request work from Rota, instead of Rota pushing work to the worker.

The API at its core is based around the /work WebSocket endpoint (TODO : link to API docs). Each worker connects to this WebSocket and begins sending Heartbeat Messages. When the worker indicates it is ready for work, the server sends through Assignment Messages with information about work to be done.

In proper operation, the worker should not have to interact with any API other than this WebSocket unless the job implementation itself needs to.

Heartbeat Messages #

These messages serve multiple purposes depending on what data is filled out in them, namely Registration, Status, and Results. They are sent upon initial connection, on important changes in state, and whenever more than a second has passed since the last heartbeat, to assure Rota that the worker is still online.
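The timing rule above can be sketched as a small helper. This is only an illustration: the class name `HeartbeatTimer`, its methods, and the injectable `clock` are hypothetical and not part of Rota; only the one-second interval comes from the text.

```python
import time

HEARTBEAT_INTERVAL = 1.0  # seconds, per the description above


class HeartbeatTimer:
    """Decides when a worker should send its next heartbeat (hypothetical helper)."""

    def __init__(self, interval=HEARTBEAT_INTERVAL, clock=time.monotonic):
        self._interval = interval
        self._clock = clock
        self._last_sent = None  # nothing sent yet, so the first heartbeat is always due

    def is_due(self, state_changed=False):
        # Initial connection: the first heartbeat doubles as registration.
        if self._last_sent is None:
            return True
        # Important state changes are reported right away.
        if state_changed:
            return True
        # Otherwise send once more than `interval` seconds have elapsed.
        return self._clock() - self._last_sent > self._interval

    def mark_sent(self):
        self._last_sent = self._clock()
```

A worker loop would call `is_due()` on each iteration and `mark_sent()` after writing a heartbeat to the WebSocket.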

Registration #

The receipt of the first heartbeat message registers the worker with the system. This message is expected to include worker metadata, such as its hostname and operating system, which is used for monitoring and diagnostics. It also includes a list of queues the worker is interested in grabbing work from and the job specs that the worker implements.
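As a rough sketch, a registration heartbeat might be assembled like this. The field names (`worker`, `queues`, `job_specs`, `checksum`) and the JSON encoding are assumptions made for illustration; the actual message schema is defined by the API docs, not by this example.

```python
import json
import platform
import socket


def build_registration(queues, job_specs):
    """Build a hypothetical first heartbeat.

    `job_specs` maps a job spec identifier to the checksum of its specification.
    """
    return {
        # Metadata used for monitoring and diagnostics.
        "worker": {
            "hostname": socket.gethostname(),
            "os": platform.system(),
        },
        # Queues this worker wants to pull work from.
        "queues": list(queues),
        # Job specs this worker implements.
        "job_specs": [
            {"id": spec_id, "checksum": checksum}
            for spec_id, checksum in job_specs.items()
        ],
    }


message = build_registration(["default", "email"], {"send_email": "abc123"})
print(json.dumps(message, indent=2))
```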

Status #

The worker will regularly notify Rota of its current status, including changes in variable worker metadata (like CPU and RAM usage), changes in job status (such as progress indicators), logs (if requested when the job was assigned and supported by the worker), and alerts.

Workers generally batch up updates until they reach a certain size or it's time to assure Rota that they are still online. Some events, such as alerts, may be reported sooner.
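The batching behavior could look something like the following. `UpdateBatcher`, the `urgent` flag, and the default batch size of 50 are all hypothetical choices for this sketch; the text does not specify a size threshold.

```python
class UpdateBatcher:
    """Collects status updates until a heartbeat should carry them (illustrative only)."""

    def __init__(self, max_items=50):
        self.max_items = max_items  # assumed threshold, not a documented value
        self._pending = []

    def add(self, update, urgent=False):
        """Queue an update and return True if a heartbeat should be sent now."""
        self._pending.append(update)
        # Alerts and similar events flush early; otherwise wait for a full batch.
        return urgent or len(self._pending) >= self.max_items

    def drain(self):
        """Return the queued updates and reset the batch."""
        batch, self._pending = self._pending, []
        return batch
```

The periodic keep-alive heartbeat would call `drain()` as well, so small batches never wait longer than the heartbeat interval.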

Results #

Whenever a job fails or succeeds, the result is reported in the next heartbeat message, which is usually sent immediately. The result includes any relevant return data or error details.

The worker should indicate that it is ready for additional work as part of the result heartbeat, unless it is entering a warm shutdown and does not want to take on any more jobs (or is in some similar condition).
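A result heartbeat under these rules might be built as follows. The function name and the fields `results`, `status`, `return_data`, `error`, and `jobs_wanted` are hypothetical and chosen only to mirror the prose.

```python
def build_result_heartbeat(job_id, succeeded, payload, free_slots, shutting_down=False):
    """Build a hypothetical heartbeat reporting one finished job."""
    result = {
        "job_id": job_id,
        "status": "succeeded" if succeeded else "failed",
    }
    # Attach return data on success, error details on failure.
    result["return_data" if succeeded else "error"] = payload
    return {
        "results": [result],
        # During a warm shutdown the worker asks for no further jobs.
        "jobs_wanted": 0 if shutting_down else free_slots,
    }
```

For example, `build_result_heartbeat("job-1", True, {"rows": 3}, free_slots=1)` reports a success and asks for one more job, while passing `shutting_down=True` reports the result without requesting further work.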

Assignment Messages #

Heartbeat messages include an indicator of how many jobs the worker wants assigned to it. In general, a worker should only request as many jobs as it will run in parallel (e.g. one per ready thread).

When Rota wants to assign a job to the worker, it sends an assignment message with a copy of the job. Rota assumes the worker is processing the job and will only be concerned if a heartbeat is not received within five seconds.
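The two rules in this section can be expressed as a pair of small functions. This is a sketch of the logic only, not Rota's implementation: the function names are hypothetical, and only the five-second window comes from the text.

```python
HEARTBEAT_TIMEOUT = 5.0  # seconds, per the description above


def jobs_wanted(max_parallel, running):
    """Number of jobs a worker should request: one per idle execution slot."""
    return max(0, max_parallel - running)


def is_overdue(last_heartbeat_at, now, timeout=HEARTBEAT_TIMEOUT):
    """True if no heartbeat has arrived within the timeout window.

    Both timestamps are in seconds on the same clock.
    """
    return now - last_heartbeat_at > timeout
```

A worker with four threads and three running jobs would request `jobs_wanted(4, 3)` jobs, that is, one.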

Handling Job Spec Versions #

It is not uncommon for there to be multiple versions of a worker running at the same time with incompatible job specifications and implementations. In order for a job specification to be available in Rota, it must be explicitly registered, which is usually done during a deployment process. These explicitly registered job specifications are considered the status quo.

When a worker notifies Rota of what jobs it can do, it includes the identifiers of the job specs it implements plus checksums of those job specifications. If a checksum does not match that of the status quo, the worker is considered incapable of doing that job and Rota will not assign it.

This is not all or nothing. A worker may have some mismatching specs and some matching specs, and Rota will assign jobs for the matching specs.
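The per-spec matching described above can be illustrated as follows. The use of SHA-256 and the function names are assumptions for this sketch; the text does not say which checksum algorithm Rota uses or how specifications are serialized.

```python
import hashlib


def checksum(spec_text):
    """Checksum of a job specification (SHA-256 is an assumed choice)."""
    return hashlib.sha256(spec_text.encode("utf-8")).hexdigest()


def assignable_specs(registered, reported):
    """Return the spec identifiers whose reported checksum matches the status quo.

    `registered` holds the explicitly registered specs (id -> checksum);
    `reported` holds what the worker sent in its heartbeat (id -> checksum).
    """
    return {
        spec_id
        for spec_id, digest in reported.items()
        if registered.get(spec_id) == digest
    }


registered = {"send_email": checksum("v2"), "resize": checksum("v1")}
reported = {"send_email": checksum("v1"), "resize": checksum("v1")}
print(assignable_specs(registered, reported))
```

Here the worker runs an outdated `send_email` spec but a current `resize` spec, so only `resize` jobs would be assigned to it.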