Software faults such as aging-related bugs are known to degrade the performance of service systems built on software components. Despite their undesirable impact, these bugs are difficult to remove completely during development, because they manifest only after the software has run continuously in a particular execution environment. Software rejuvenation is a technique for restoring service systems that suffer from software aging, and it does so by rebooting the execution environment. However, restarting the system interrupts service and introduces additional costs due to data loss, dropped jobs, or the termination of system processes. Software rejuvenation therefore needs to be planned carefully.
In a recent paper published in Reliability Engineering and System Safety, Professor Naoto Miyoshi and his PhD student Fumio Machida considered a condition-based software rejuvenation strategy for a deteriorating job processing system. They analytically derived the optimal policy for deciding when to trigger software rejuvenation with respect to job completion time performance. The system is modelled as an M/M/1 queue, and the rejuvenation decision is formulated as an optimal stopping problem, an approach that differs from conventional software rejuvenation models.
The authors defined the states of the system by the number of queued jobs. A transition between states occurs either when the current job is completed or when a new job arrives. Job arrivals follow a Poisson process, while job service times are exponentially distributed with a given service rate.
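The state transitions described above can be sketched as a small event-driven simulation. This is only an illustrative sketch of a plain M/M/1 queue, not the authors' model: the names `step` and `simulate` are hypothetical, and the arrival and service rates are assumed constant, whereas the paper considers a system whose performance deteriorates.

```python
import random


def step(n_jobs, arrival_rate, service_rate, rng):
    """Advance the queue by one event and return (new_n_jobs, elapsed_time).

    Assumption: both rates are constant (no deterioration is modelled here).
    """
    if n_jobs == 0:
        # With no job in the system, the only possible event is an arrival.
        return 1, rng.expovariate(arrival_rate)
    total_rate = arrival_rate + service_rate
    # The time to the next event is exponential with the combined rate.
    elapsed = rng.expovariate(total_rate)
    # The event is an arrival with probability arrival_rate / total_rate,
    # otherwise it is the completion of the current job.
    if rng.random() < arrival_rate / total_rate:
        return n_jobs + 1, elapsed
    return n_jobs - 1, elapsed


def simulate(n_events, arrival_rate, service_rate, seed=0):
    """Return a list of (time, n_jobs) pairs, starting from an empty system."""
    rng = random.Random(seed)
    n_jobs, now = 0, 0.0
    path = [(now, n_jobs)]
    for _ in range(n_events):
        n_jobs, elapsed = step(n_jobs, arrival_rate, service_rate, rng)
        now += elapsed
        path.append((now, n_jobs))
    return path


if __name__ == "__main__":
    # Example rates chosen arbitrarily for illustration.
    for time, jobs in simulate(10, arrival_rate=1.0, service_rate=2.0):
        print(f"t={time:.3f}  jobs={jobs}")
```

Each event changes the number of jobs by exactly one, which is the state structure on which the rejuvenation decision is defined.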
Depending on the number of remaining jobs, the system decides either to execute software rejuvenation immediately or to continue operating until the next job is completed. If the rejuvenation action is chosen at a decision point, all jobs in the system are dropped, which incurs a rejuvenation cost proportional to the number of dropped jobs. Conversely, if the waiting action is chosen, the rejuvenation decision is deferred until the next job is completed, which incurs a cost for the delayed jobs.
In their analysis, the authors considered the case where the cost of a delayed job is smaller than the cost of a job dropped due to rejuvenation. The decision process ends once the rejuvenation action is taken. The authors therefore formulated the problem as an optimal stopping problem, in which the optimal policy is obtained by solving the optimality equation. The interrelations among infinitely many states are an obstacle to solving the optimality equation, and the authors took an analytical approach to overcome this.
By proving a series of propositions, Machida and Miyoshi show that the optimal policy is determined by the relation among the cost of a delayed job, the cost of a dropped job, and the deteriorating traffic intensity. When the cost of a delayed job is less than the cost of a dropped job multiplied by one minus the deteriorating traffic intensity, the waiting action is chosen irrespective of the number of queued jobs. If the reverse is true, the rejuvenation action should be chosen. The derived policy provides a reasonable and simple guide for determining when to trigger software rejuvenation in a deteriorating job processing system.
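The decision rule summarized above can be written as a single comparison. The sketch below follows the wording of this summary rather than the exact statement of the paper's propositions: the function and parameter names are hypothetical, and treating the boundary case (equality) as rejuvenation is an assumption.

```python
def choose_action(delay_cost, drop_cost, traffic_intensity):
    """Return "wait" or "rejuvenate" under the rule summarized in the text.

    delay_cost        -- cost of a delayed job
    drop_cost         -- cost of a job dropped by rejuvenation
    traffic_intensity -- deteriorating traffic intensity

    Assumption: the equality case is treated as "rejuvenate"; the summary
    does not say how ties are resolved.
    """
    if delay_cost < drop_cost * (1.0 - traffic_intensity):
        # Waiting is chosen irrespective of the number of queued jobs.
        return "wait"
    return "rejuvenate"


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    print(choose_action(1.0, 10.0, 0.5))   # prints "wait"
    print(choose_action(1.0, 10.0, 0.95))  # prints "rejuvenate"
```

In the first call the threshold is 10.0 × (1 − 0.5) = 5.0, which exceeds the delay cost of 1.0, so waiting is chosen. In the second call the threshold falls to roughly 0.5, below the delay cost, so rejuvenation is chosen.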
Fumio Machida, Naoto Miyoshi. Analysis of an optimal stopping problem for software rejuvenation in a deteriorating job processing system. Reliability Engineering & System Safety, Volume 168, December 2017, Pages 128-135.