Journal of Mathematical ScienCes, Vol. 73, No: 5, 1995
A SLIDING BACKUP SCHEME ROBUST CONTROL SYSTEMS
FOR ORGANIZING
R. N. Bulatov, S. A. Zaraiskii, and V. R. Fazylov
UDC 519.8
In systems of automated control of technological processes wider and wider application is being made of decentralized control systems using networks of microcomputers and microcontrollers [1]. A serious obstacle to the development of decentralized multi-machine systems of automated control of technological processes is the insufficient reliability of Soviet computing hardware [2]. In complex highly automated production the failure of an individual processor or peripheral device can lead to undesirable consequences (accident, spoilage, equipment stoppage, and so forth). In setting up complex and expensive systems of automated control of technological processes one..faces the problem of guaranteeing their robustness, i.e., functioning with acceptable efficiency during breakdowns and failures of the computing hardware. The concept of robustness is often identified with resistance to failure. For example, in the standard work Reliability in Engineering [3] resistance to failure is defined as the property of maintaining working capability in the event of failure of one or more elements. It is also pointed out in [3] that the functioning capability of complex objects can have certain intermediate states with decreased levels of quality of functioning. In the works [4, 5] resistance to failure means the ability of a system to recover all its functions automatically after a failure, and robustness is the ability to reorganize itself automatically after a failure in such a way as to recover or maintain its basic functioning (possibly with some reduction in quality) by redistributing the functions of the parts that have failed among those that are functioning. Therefore we can regard resistance to failure as a particular case of robustness. The publications on resistance to failure of control systems axe concerned with primary control systems, and resistance to failure is guaranteed by automatic transition to backups or dominant backup (auxiliary computational complexes, systems for control of nuclear power station units, multi-processor systems, and the like). Failure-tolerant control systems are applied in those systems where there is no grace time and the defense against failure can start to function or the transition to the backup can be effected in a short time without losses in the control system, while the expense of guaranteeing resistance to failure is justified. The widespread introduction of failure-tolerant high-performance equipment into automated systems of controlling technological processes by using double and triple computational complexes is hindered by the limited production and the high cost of these systems. The effort to increase the reliability of high-performance equipment using the SM computer by keeping backup machines can lead to a decreased reliability in the functioning of the high-performance equipment [6] due to overload on the common diskette. Many publications have been devoted to the problem of constructing robust computational systems [7-13], but the proposed approaches do not take account of the fact that systems of automated control of technological processes contain distributed data bases and the control functions in an automated system of control of technological processes are connected informationally. The first two authors [14] have given requirements on the design of failure-tolerant multi-machine high-performance equipment for data processing systems. In the present article we propose a mathematical model of the problem of organizing a sliding backup scheme according to robustness criteria and a way of increasing the robustness of multi-machine high-performance equipment in automated control systems by using the grace time. The mechanical application of a sliding backup scheme [I5] is impossible, since the accumulated information on the course of the process is essential to maintain continuous control, i.e., it is impossible to reassign control functions from failed high-performance equipment to backups without an adequate data base in the failed and backup devices. In implementing a sliding backup method it is necessary that all the data needed for sustaining the most important control functions of the failed device be duplicated in the backup so as to maintain the control process during the time required for the failed device to recover. It is Translated from Issledovaniya po Prikladnoi Matematike, No. 18, 1992, pp. 16-26. 1072-3374/95/7305-0529 $12.50 9
Plenum Publishing Corporation
529
not possible to transfer the hard drive to the backup device by the application of hard drives of "Winchester" type. Therefore after the failed device has recovered it is necessary to copy all the data from the backup device (or all the updates in operational information made during the recovery), in order for the recovered high-performance device to take over its control functions once again. We note that the lowest level of control of technological processes is based on the microcontroUer (often built into the equipment) and frequently any failure in the microcontroller can damage the normal functioning of the equipment. The robustness of the lowest level of control can be guaranteed only by application of duplicated microcontrollers (as in the "Lomikont 112" and "Lomikont 122" microcontroUers) or by increasing the reliability of the microcontroller by reserving individual blocks of them (as in the SM9107). The lowest levels of integration of the automated control system, namely automated control of production at the factory, control of storage and delivery of goods, and control of technological processes of primary units, are based on a local network of homogeneous high-performance equipment [1]. Usually the intervals between data exchange sessions between the microcontroUer and the hig-performance equipment range from several minutes to tens of minutes (the intervals between exchange sessions are determined by the length of the technological cycles of the equipment). Delays of a few minutes in carrying out exchanges between the microcontroller and the device (arising, for example, as a result of the reconfiguration of multi-machine devices ) may lead only to stoppage of the equipment, and as a rule are acceptable. S t a t e m e n t o f t h e P r o b l e m . We shall construct a model of a sliding backup scheme under the following assumptions: 1) the software can be divided into groups of components of special software having no common informational blocks; the more such groups there are, the greater the possibility of mobile transfer of software and the smaller the cash expenditures on backup functions; 2) the software can be divided into three classes: - software not admitting even brief delays in execution (including all software to maintain the functioning of the work of the equipment and controlling the CPU) - software of first class; - software that maintains control of expensive equipment and can admit a delay of several minutes in exchange with the equipment without danger of accident - software of second class; - software for which a delay in execution (due to failure of high-performance equipment) causes relatively small losses in the technological mode (for example, by transition to manual control) and software that does not participate directly in the control of the technological process and can be recovered without any large losses after the equipment has been restored (service routines, processing of statistics, and the like) software of third class; 3) the functioning of software of first and second classes is independent of the functioning of software of third class; 4) the routines can be implemented independently of one another; 5) software of first and second classes constitutes a relatively small portion of the total software; 6) software of third class is endowed with relative coefficients of importance [9] (since in practical situations it is impossible to estimate the losses due to nonperformance of a specific partial control function
[16]); 7) the bus of the local network is absolutely reliable. Software of first class is characteristic for primary control of continuous technological processes (chemical and petrochemical production). For them a backup of primary transformers (transmitters), microcontrollers, connection channels and equipment is characteristic. When software of this class has high relative weight in a system of control of technological processes, it is necessary to use failure-tolerant equipment or execute a routine on two or more machines simultaneously controlling the results of the solution [17-19]. Software of second class is characteristic for discrete production with executive technological units (transportation, warehouses, and the like). The largest volume of software in a control system for technological processes of continuous-discrete production is made up of third class software. 530
We now state the problem of a sliding backup scheme with a robustness criterion: It is required to distribute programs over a network of machines and organize an operational backup of the data base (operational information) on a backup machine in such a way that when any device fails the transition to the backup machines saves the software of first and second classes unconditionally, as well as the most valuable software of third class. When there is a breakdown in the control system, it is necessary to carry out a dump in the control system using copies of the operational d a t a [11]. In reconfiguring multi-machine equipment one must not allow the machines or the bus to be loaded above a certain limiting level, since otherwise a sharp increase in the intensity of equipment failures may occur [6, 20]. Therefore in the proposed model less important tasks of the third class can be temporarily flushed from the control system. A Mathematical Model of the Problem. introduce the following notation:
To describe a mathematical model of this problem, we
N is a set of indices of homogeneous devices connected via bus to a local network of a distributed automated system; I is a set of indices of programs t~ving no common informational blocks; Ik is the set of indices of the kth class (UIk = I, Ik N I 1 = O for all k r j). V is the available amount of operational memory (RAM) of the backup machines under the RAMresident software. V, is the available amount of RAM of the nth machine under the RAM-resident software. V/, is amount of RAM of the n t h machine needed for running the ith program. V/is the amount of RAM of the backup machines needed for running the ith program (under breakdown conditions a simplified version of the program may be run). Wi is a parameter assuming the value "1" if the ith program is RAM-resident and "0" otherwise. C is the maximum admissible level of use of the capacity of the bus of the local network. Ci() ) are the bus capacities needed for communication between the ith and j t h programs when they are located on different machines. C} 2) are the additional bus capacities needed to back up the ith program. P is the maximal admissible level of use of the capacities of the processors of the backup machines. Pn is the maximal admissible level of use of the capacities of the processor of the nth machine. P(2) are the processor resources of the nth device needed to run the ith program. P(n2) are the additional capacities of the processor of the nth machine needed to maintain copies of the ith program on the backup machine when the ith program is located on the n t h machine. p(3) are the capacities of the processor of the backup machine needed for running the ith program (under breakdown conditions a simplified version of the program may be run). p(4) are the capacities of the processors of the backup machine needed to back up the data base from the ith program. R}a,) is a criterion for the possibility of locating the ith program on the n t h machine (equal to 1 if the location is possible, otherwise to 0). R}~) is a criterion for connection of failures of the ith and j t h programs (if failure of the ith program causes failure of the j t h , it is 1, otherwise 0). Oi is the coefficient of importance of a program (i E/3). xin is the assignment variable of the ith program on the n t h machine (equal to 1 when the program is assigned to that machine, otherwise to 0). yin is the assignment variable of the copy of the ith program on the n t h machine (equal to 1 when the copy is assigned, otherwise to 0). 531
The condition for assigning a program to a machine is:
iEI,
(1)
i e I, n E N.
(2)
y ] x., = 1, h EN
< /;?!1)
tin
The state condition for assigning a copy of a program on the machines is:
~-':. v~. = I,
(3)
ielaUI2,
hEN
EYi,
(4)
<_l, iEIa,
hEN Yin
~ 1-
i E I, n E N.
Xi.,
(5)
The restrictions on the load level of a machine under normal functioning are:
~ ( P } l ) x , . + P}~.)v,.) <_ P.,
(6)
n e N.
iEI
The restrictions on the load level of the bus under normal functioning are:
EE iEI j E J
"t, ,,,~ lcf? x,..,) k2--'J + hE'c 2 EN
-< c.
(7)
The restriction on the load level of the processor of the backup machine is:
P?) + ~ M%,, < P. JEll
(s)
iEl hEN
Since SM-based computers are equipped with hard drives of "Winchester" type of large capacity, we can assume that the restrictions on the R O M capacity are insignificant. Since virtual memory is used in the machine, the restrictions on R A M can be considered insignificant, and the restrictions on RAM for the RAM-resident software can be represented as
E ~ " W i x i n
nEN.
(9)
iEI
The restrictions on the load level for the processor of a backup machine when any machine (say the kth) fails are: iEI~
iEl hEN n#k
iEI~
iEIz
The restrictions on the R A M of a backup machine when one machine fails for the RAM-resident software are:
E Yi + E Vimik + E Wiyii _
iEl
(11)
iEI
When one machine fails, a reconfiguration of the multimachine equipment takes place. Due to the restrictions on the bus, the RAM, and the load on the processor of the backup machines, some of the third-class software cannot be run (we call such software inoperative), i.e., a degradation of the control system occurs. Moreover, besides the inoperative programs, the programs that exchange information with them also cannot be run. To estimate the degree of degradation of multimachine equipment we introduce, 532
in analogy with [8], a coefficient of robustness of the control system under failure of one machine as the sum of the coefficients of importance of the remaining third-class software:
Gk = ~ Qiylk + ~ ~ ( 1 iEz i~z/ez
Rl2')Qixik + ~ ~ Rij(2,OiYik, k E N.
(12)
iEz jEz
As a whole the coefficient of robustness of the control system can be taken as
G = 1 ~ Gk n
(13)
kEN
or
G = rain Gk. kEN
(14)
Thus the problem of the optimal sliding backup scheme has been stated as the problem of maximizing
(13) or (14) under restmctlons " " "" which " is " a Boolean programrmng " problem. (1)-(12), A p p r o a c h e s t o t h e S o l u t i o n o f t h e P r o b l e m . We remark that all the restrictions except (7) axe linear. If in addition the intensity of data exchange operations is not high, i.e., the condition 9
(c,j") +
)) < c
ieI jeI holds, then restriction (7) can be ignored, the problem becomes linear, and can be solved by Balash' method, for example (see [21]). Otherwise one can develop a modification of Balash' method that takes account of the nonlinear restriction (7), or one can develop heuristic methods 9 In particular the initial problem can be solved in two stages: first, distribute the software over the local network so as to minimize t h e load on the bus [22]; then solve the problem by Balash' method. We remark that t h e total number of problems, for example a u t o m a t e d control of the operation of a factory, may amount to several hundred, which can be combined in a hundred routines, the main portion being located uniquely on the machines, which significantly reduces the size of the problem. When there is a high bus load, it is advisable to divide the network into two local networks connected by a computer and maximize the robustness of each network independently or additionally unite the machines by high-speed buses of MVS type. When there is a shortage of backup machines but adequate bus capacity, the machines in the network can be divided into groups and the problem of maximizing robustness can be solved separately for each group. An estimate of the load can be carried out by the well-known regional m e t h o d or one can obtain statistical data on the organizational/design polygon of the automated control system by running software on the machines. The problems can be broken into software groups in accordance with the recommendations of the paper [23]. Literature Cited 1. V, N. Chukhman, "Problems of hardware maintenance of integrated automated control systems," Izmereniya, Kontrol', Avtomatizatsiya: Nauch.-Tekhn. Sb. Obzorov. TsNIITEIpriborostroi, No. 1(65), 63-73 (1988). . E. K. Maslovskii and O.L. Lebedev, "Properties of the construction of integrated systems of production control: a survey of approaches and results," Izmereniya, Kon~rol" Avr Nauch.-Tekhn. Sb. Obzorov. TsNIITEIpriborostroi, No. 1(65), 86-95 (1988). GOST 27.002-83. Reliability in Technology. Terms and Definitions [in Russian], Bureau of Standards, Moscow (1983). .
533
4. A. Avizhenis, "Resistance to failure--a property maintaining continuous functioning of digital systems," Tr. Inzh. po Elektrotekhn. i Radiotekh., 60, No. 10, 5-26 (1978). 5. I. A. Mamzelev, M. Yu. Rusakov, E. D. Chasovnikov, and E, D. Nikolaenko, "Failure-tolerant computing systems: a survey," garubezh. Radioelektr., No. 11, 3-28 (1983). 6. G. A. Egorov, K. V. Peselev, and V. V. Rodionov, The SM Computer: Installation and Application [in Russian], N. L. Prokhorov, ed., Finansy i Statistika, Moscow (1986). 7. A. S. Gil'man, "An estimate of the robustness of multifunction systems," Vopr. Promyshl. Kibern., TsNIIKA, Moscow, No. 29, 10-14 (1971). 8. Yu. N. Mel'nikov, "On the principles of maintaining the robustness of multifunction automated control systems," in: Hardware and Software for Automated Real-Time Control [in Russian], MDNTI, Moscow (1978), pp. 74-76. 9. 0 ' P. Vasil'ev and Yu. N. Mel'nikov, "On an approach to estimating the robustness of multifunctional territorially distributed information/computing systems," Avtomat. i Telemekh., No, 12, 133-137
(1981). 10. O. P. Vasil'ev, "On the problem of synthesis of territorially distributed automated control systems," Tr. Mosk. Energ. Inst., No. 544, 101-106 (1981). 11. A.-A. Aliev and A. I. Nikitin. "Dumping in distributed real-time systems," in: Mathematical Programming and Technical Maintenance of Computing Systems in Russian], Inst. Kibern. Akad. Nauk Ukr. SSR, Kiev (1987), pp. 28-33. 12. G. N. Cherkesov, Methods and Models for Estimating the Robustness of Complex Systems [in Russian], Znanie, Moscow (1987). 13. N. T. Berezyuk, A. Ya. Galunin, and N. I. Podlesnyi, Robustness of Microprocessor Control Systems [in Russian], Tekhnika, Kiev (1989). 14. R. N. Bulatov and S. A. Zaraiskii, "Requirements on the design of failure-tolerant distributed dataprocessing systems," in: Problems of Building and Installing Flexible Production and Robotic Complexes in the Machine-Tool Industry [in Russian] (Proceedings of the All-Union Practical Science Conference, Odessa, 9"13 Oct. 1989), VNIITEMP, Moscow (1989), p. 248. 15. B. A. Kozlov and I. A. Ushakov, Handbook for Computing the Reliability of Radioelectronic and Automation Hardware [in Russian], Sov, Radio, Moscow (1975). 16. I. M. Shenbrot, M. V. Antropov, and V. M. Alley, "Properties of design of distributed automated control systems," in: Applications of Microprocessors and Decentralization of Structures in an Automated Control System [in Russian], TsNIIKA (1983), pp. 15-19. 17. V. V. Bragin, "Organization of multiple processors in a local control network of GAP," in: Mathematical Maintenance of Integrated Systems of Automated Control/GAP [in Russian], KPTI, Kuibyshev (1985), pp. 27-30. 18. V. V. Bragin and B. N. Chetverikov, "Methods and means of maintaining failure'tolerant functioning of control processes for bandwidth shift generators," in: Flexible Systems of Production [in Russian], MDNTP, Moscow (1986), pp. 122-126. 19. V. N. Chetverikov, "Means of guaranteeing failure resistance in an operational ERA system," in: Distributed Data Processing and Local Computer Networks [in Russian], MDNTP, Moscow (1987), pp. 15-19. 20. R. K. Ieyr and D. J. Rosseti, "A measurement-based model for workload dependence of CPU errors," IEEE Trans. Comput., 35, No. 6, 511-519 (1986). 21. A. A. Korbut and Yu, Yu. Finkel'shtein, Discrete Programming [in Russian], Nauka, Moscow (1969). 22. V. M. Aliev, "Optimal distribution of operations, files, and programs in local automated control networks," Avtomat. i Telemekh., No. 4, 42-46 (1987). 23. A. I. Sbitnev, "Degradation of the software of automated control systems in multi,machine computational complexes," in: Mathematical Programming and Technical Maintenance of Computing Systems [in Russian], Inst. Kibem. Akad. Nauk Ukr. SSR, Kiev (1987), pp. 85-90.
534