J. Roeleveld <joost@××××××××.org> wrote:
>
> These schedules then also can't be restarted from the beginning
> when they stop halfway through without risking massive consistency
> problems in the final data.

So you have a command which might break due to a hardware error
and which cannot be rerun. I cannot see how any general-purpose scheduler
could help you here: You either need to be able to split your command
into several (sequential) commands, or you need something adapted
to your particular command.
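
As a rough illustration of the first option, the following Python sketch
runs a list of sequential commands and records each completed step, so that
a restart continues with the first unfinished step instead of starting from
the beginning. The function name `run_steps`, the `.done` marker files and
the state directory are assumptions made for this example only; they are not
part of any scheduler discussed here.

```python
import subprocess
from pathlib import Path


def run_steps(steps, state_dir):
    """Run (name, argv) steps in order, skipping steps already marked done.

    Returns the names of the steps that were executed in this call.
    """
    state = Path(state_dir)
    state.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, argv in steps:
        marker = state / (name + ".done")
        if marker.exists():
            continue  # finished in an earlier run; do not repeat it
        # check=True raises CalledProcessError on a non-zero exit status,
        # so later steps never run after a failed one and the failed step
        # gets no marker.
        subprocess.run(argv, check=True)
        marker.touch()
        executed.append(name)
    return executed
```

A crash between the end of a step and the creation of its marker would still
cause that one step to be repeated, so this only helps if each individual
step is small enough to be rerun safely.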

> And then multiple of those starting at random times with
> occasionally a whole bunch of the same schedule put into the
> queue with dependencies to the previous run.

That's not a problem. It only becomes a problem if the granularity
of one command is not fine enough.

> If, during that time, one of the machines has a hardware failure
> or the scheduling process crashes on one or more of the servers,
> the last state needs to be recoverable.

One must distinguish two cases:

1. The machine running "schedule-server" has a hardware failure.
(Let us assume that "schedule-server" does not have a software failure -
otherwise, you have problems anyway.)
2. Some other machine has a hardware failure.

Case 2 is not bad (as far as the scheduling is concerned): Of course, the
machine will not report that it completed the job, and you will
have to think about how to complete the job. But it is clear that in
such exceptional cases you have to intervene manually in some way.

In order to deal with case 1, you can regularly (e.g. every minute)
dump the output of "schedule list" (possibly suppressing unimportant
data through the options to keep it short).
One could add a logging option to shrink the possible race window of
1 minute, but in case of a hardware failure a race cannot be excluded anyway.
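
The periodic dump could be sketched in Python roughly as follows. This is
only an illustration under stated assumptions: the function names, the
snapshot path and the 60-second default are made up for the example, and
"schedule list" is called without options because the options that shorten
its output are not spelled out here. The snapshot is written to a temporary
file and then renamed, so a crash in the middle of a dump leaves the previous
snapshot intact.

```python
import os
import subprocess
import tempfile
import time


def dump_snapshot(argv, path):
    """Run argv and atomically replace path with its standard output."""
    out = subprocess.run(argv, check=True, capture_output=True, text=True).stdout
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            f.write(out)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before the rename
        os.replace(tmp, path)  # atomic rename within the same filesystem
    except BaseException:
        os.unlink(tmp)  # do not leave a half-written temporary file behind
        raise
    return out


def dump_forever(argv, path, interval=60.0):
    """Take a snapshot every `interval` seconds until interrupted."""
    while True:
        try:
            dump_snapshot(argv, path)
        except subprocess.CalledProcessError:
            pass  # the command failed; keep the previous snapshot
        time.sleep(interval)
```

A call such as `dump_forever(["schedule", "list"], "/var/tmp/schedule-list.txt")`
(the path is hypothetical) would then keep the latest queue listing on disk.
For case 1 the snapshot is presumably only useful if it is stored somewhere
that survives the failure of the machine running "schedule-server".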

In case 1 you have to re-queue the jobs manually and think about what to do
with the jobs that had already started. However, I cannot imagine that this
occurs so frequently that this exceptional case becomes something
one should seriously think about.