Job processing (introduction)

Failures and retries

A job can fail on a BOINC worker node for a variety of reasons:

The application crashes.
The user on that node aborts the job.
The job exceeds its memory or disk space limits.
The job times out.

In some cases the job would succeed on a different node. So BOINC provides a 'retry' mechanism: if a job fails on a node, a second copy (or 'instance') of the job is sent to a different node. This is repeated until an instance succeeds, or until a limit on the number of instances is reached, in which case the job is marked as failing and no further instances are created.
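The retry logic described above can be sketched as follows. This is only an illustration, not BOINC's server code: `run_with_retries`, `run_instance`, and `max_instances` are hypothetical names, and a failed instance is modeled as a `RuntimeError`.

```python
def run_with_retries(job, nodes, run_instance, max_instances=3):
    """Send instances of `job` to distinct nodes until one succeeds
    or the instance limit is reached (all names are hypothetical)."""
    instances = 0
    for node in nodes:                  # each instance goes to a different node
        if instances >= max_instances:
            break                       # limit reached: create no further instances
        instances += 1
        try:
            output = run_instance(job, node)
        except RuntimeError:
            continue                    # this instance failed; try another node
        return {"status": "success", "node": node,
                "output": output, "instances": instances}
    # No instance succeeded: the job is marked as failing.
    return {"status": "failed", "node": None,
            "output": None, "instances": instances}
```

A caller would pass its own `run_instance` function; the cap on `instances` is what keeps a job that fails everywhere from being retried indefinitely.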

Errors and replication

A job instance can complete but still produce incorrect output files, for example because:

The host's CPU or GPU malfunctions. This is rare, but it can happen on hosts that are overclocked and/or overheated.
The user 'cheats' and uses a program that, masquerading as a BOINC client, returns job results without doing any computation. This can happen in a system (like Gridcoin) that gives monetary rewards for computing.

In some cases it may be possible to detect incorrect results by examining the outputs of a single job instance, perhaps by:

checking the syntax of the output files
checking that numerical values lie in a plausible range
checking that the results plausibly correspond to the inputs; e.g. in physical simulations, that total energy is about the same.
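The three single-instance checks listed above can be sketched as follows. The output format (a JSON object with `temperature` and `energy` fields), the plausible range, and the 1% tolerance are assumptions made up for this example; a real application would substitute its own format and limits.

```python
import json

def check_output(text, input_energy, rel_tol=0.01):
    """Return a list of problems found in one instance's output.
    Field names and limits are hypothetical, for illustration only."""
    # 1. Syntax check: the output must parse as a JSON object.
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    problems = []
    # 2. Range check: numerical values must lie in a plausible range.
    temperature = data.get("temperature")
    if not isinstance(temperature, (int, float)) or not 0.0 <= temperature <= 1.0e4:
        problems.append("temperature missing or out of range")

    # 3. Consistency with inputs: total energy should be about the same.
    energy = data.get("energy")
    if not isinstance(energy, (int, float)):
        problems.append("energy missing or not a number")
    elif abs(energy - input_energy) > rel_tol * abs(input_energy):
        problems.append("total energy drifted too far from the input")
    return problems
```

An empty list means the output passed every check. It does not prove the result is correct.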

But cheaters can potentially evade such checks. So BOINC provides another (optional) mechanism: replication. When this is used, each job is run on two different worker nodes. If the results agree, they are deemed to be correct, and one of the instances is marked as 'canonical'. If they don't agree, a third instance is created and sent to a different worker node. This continues until either a pair of agreeing instances is found, or a threshold on the number of instances is reached (in which case the job is marked as failing).
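The replication loop can be sketched roughly as follows. This is a simplified model, not BOINC's actual validator: `validate_replicated`, `run_instance`, `results_agree`, and `max_instances` are hypothetical names, instances run one after another rather than in parallel, and the earlier instance of the agreeing pair is arbitrarily chosen as canonical.

```python
def validate_replicated(job, nodes, run_instance, results_agree, max_instances=5):
    """Run instances on distinct nodes until two outputs agree
    or the instance limit is reached (all names are hypothetical)."""
    results = []  # (node, output) for every instance completed so far
    for node in nodes[:max_instances]:
        output = run_instance(job, node)
        # Compare the new output with every earlier one.
        for other_node, other_output in results:
            if results_agree(output, other_output):
                # A pair of agreeing instances: mark one as canonical.
                return {"status": "valid", "canonical": other_node,
                        "instances": len(results) + 1}
        results.append((node, output))
    # Threshold reached without agreement: the job is marked as failing.
    return {"status": "failed", "canonical": None,
            "instances": len(results)}
```

Passing the comparison in as `results_agree` keeps the loop independent of how an application decides that two outputs match.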

Different types of CPUs and GPUs, and different math libraries, can produce slightly different floating-point results. These differences can compound, as in the 'butterfly effect', so two equally correct results can contain different numbers. The comparison of replicated jobs for such applications must therefore be 'fuzzy'. Typically this means that corresponding numbers are allowed to differ by some (application-specific) amount.
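A minimal sketch of such a fuzzy comparison, assuming each result is a flat list of floats. The function name `fuzzy_equal` and the default tolerances are illustrative assumptions; the right tolerance depends on the application.

```python
import math

def fuzzy_equal(a, b, rel_tol=1e-6, abs_tol=1e-12):
    """Return True if two result vectors have the same length and
    corresponding numbers differ by no more than the tolerances."""
    if len(a) != len(b):
        return False
    return all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol)
        for x, y in zip(a, b)
    )

# Exact comparison rejects results that differ only by rounding:
print([0.1 + 0.2] == [0.3])             # False
print(fuzzy_equal([0.1 + 0.2], [0.3]))  # True
```

The absolute tolerance matters for values near zero, where a purely relative test would reject almost any difference.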

Staging input and output files

Batches

Pipeline components

work generator
validator
assimilator

Job submission (and file management)

local
remote via RPC
Python bindings
remote via web interface

