[Technical Field]
[0001]
The present invention relates to a technique for managing a computer system
including management target apparatuses such as a host computer, a
network apparatus, and a storage apparatus, for instance.
[Background Art]
[0002]
For a management of a computer system, by utilizing a technique for
specifying a failure cause on an event base such as an Event Correlation
technique, a manager of a computer system can detect a cause of a failure
that has occurred in the computer system (see Patent Literature 1).
[0003]
For instance, Patent Literature 2 discloses a technique in which an analysis
engine, which analyzes a cause-and-effect relationship of events such as a
plurality of failures that have occurred in management target apparatuses,
applies a general rule composed of a condition statement and a conclusion
statement that have been defined in advance to an event related to a
management target apparatus, for instance an event in which a performance
value exceeds a predetermined threshold value. An expansion rule including
a cause event that is a cause of a performance degradation and a condition
event group caused by the cause event is thereby created, and a cause of a
failure is specified based on the expansion rule that has been created.
[0004]
A recent computer system involves a lot of useful measures that can be
executed as a recovery measure to a failure (a measure to implement a
restoration from a failure, that is, a failure recovery), such as a measure to
implement a restoration from a failure by executing a suitable data
migration for a placement of a system resource (such as a virtual machine
and data). As a technique for executing a data migration, for instance, in an
environment in which a plurality of virtual host computers (that are virtual
machines, hereafter referred to as VM) are operated on a physical host
computer, a technique for taking over an operation of a VM from a certain
physical host computer to another physical host computer (a first VM
migration) and a technique for migrating a VM that has been stored into a
certain storage area to another storage area (a second VM migration) in
accordance with information indicating a performance of a VM and use
information of a resource are known. Here, the VM is a kind of data that is
stored into a storage area, and the VM migration (a first VM migration and a
second VM migration) is a kind of data migration between storage areas.
Moreover, as a technique for executing a data migration between data
storage areas (volumes) of a storage apparatus, a volume migration is known
(see Patent Literature 3).
[0005]
Non Patent Literature 1 discloses a technique for checking whether or not a
failure has been improved by a recovery measure after the recovery measure
to a failure is executed and for automatically executing another recovery
measure that has been defined in advance in the case in which a failure has
not been improved.
[0006]
Patent Literature 4 discloses a technique for recording details of a recovery
measure that was executed to a failure and for utilizing the recorded
information in the case in which a recovery measure is selected.
[Citation List]
[Patent Literature]
[0007]
[PTL 1]
US Patent Application No. 7107185
[PTL 2]
Japanese Patent Application Laid-Open Publication No. 2010-86115
[PTL 3]
US Patent Application No. 6108748
[PTL 4]
International Publication No. 2011/007394 pamphlet
[Non Patent Literature]
[0008]
[NPL 1]
"A Policy Description and its Execution Scheduling for Automated IT
Systems Management" (Yutaka Kudo, Tomohiro Morimura, Yoshimasa
Masuoka, and Norihisa Komoda), the C Society transactions of the Institute
of Electrical Engineers of Japan, Vol. 131, No. 10, 2011
[Summary of Invention]
[Technical Problem]
[0009]
In the case in which a failure that is specified by the Event Correlation
technique that is disclosed in Patent Literature 1 or Patent Literature 2 is
tackled, a manager does not know what kind of recovery measure should
specifically be executed for a failure recovery, and a restoration from the
failure is unfortunately costly. Even in the case in which a mapping between
a failure cause and a recovery measure to the failure cause is held and a
recovery measure to the failure cause can be created based on the mapping, a
manager does not know what kind of recovery measure should be preferentially
selected in order to execute the work along an intention of a manager who
carries out a recovery work from a failure on an actual operational
management site. In other words, in the case in which a failure cause and
recovery measures to the failure cause are presented to a manager, even when
only a recovery measure that is limited to some extent would be selected due
to an intention of the manager (such as a personal cost or an economic cost
that is required for a failure recovery, and a priority judgment based on an
importance of an apparatus that is a target of a recovery work), a number of
inferable recovery measures are presented to the manager, and it is therefore
difficult for the manager to select a recovery measure.
[0010]
In the case in which a technique that is disclosed in Non Patent Literature 1
is utilized, whether or not a failure has been improved by executing a
selected recovery measure is checked and another recovery measure that has
been defined in advance can be automatically executed in the case in which
the failure has not been improved. By this technique, in the case in which a
problem point remains after an execution of the recovery measure, another
recovery measure can be further executed. However, since it is not
considered what kind of recovery measure was executed by a manager in the
case in which a similar failure occurred in past times, a recovery measure
that is not intended by a manager is preferentially presented or executed in
some cases, whereby a cost may be increased in the case in which a manager
selects a recovery measure.
[Solution to Problem]
[0011]
A management system in accordance with the first aspect manages a
computer system provided with a plurality of management target devices. A
storage device of the management system stores one or more rules indicating
a correspondence relationship between a cause event related to any one of
the plurality of management target devices and one or more condition events
related to any one of the plurality of management target devices that is a
condition under which the cause event is a cause, plan information indicating
a correspondence relationship between the rule and a plan that is a recovery
measure that can be executed in the case in which a cause event of the rule is
a cause, and plan history information indicating the success or failure of a
failure recovery by an execution of the plan each time the plan is executed.
A control device of the management system executes a cause analysis of an
event that has occurred in any one of the plurality of management target
devices and specifies a first cause event that is a candidate of a cause of the
event that has occurred based on the one or more rules, specifies a plurality
of first plans that can be executed in the case in which the first cause event is
a cause based on the plan information, calculates an index value indicating a
possibility of succeeding in a failure recovery in the case in which the plan is
executed for each of the plurality of first plans based on the plan history
information, and displays data indicating any one or more plans of the
plurality of first plans according to a display mode decided based on the
index value. "Displaying data" can mean that a management system
displays data on a display device that is included in a management system or
can mean that data to be displayed is transmitted to a remote computer that
is coupled to a management system and that is provided with a display
device.
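The calculation of an index value from the plan history information can be
illustrated by a minimal sketch. The sketch assumes that the index value is
simply the ratio of past executions of a plan that succeeded in a failure
recovery, and that plans are displayed in descending order of that ratio. The
names PlanExecution, success_index, and rank_plans, as well as the plan IDs,
are hypothetical and are not defined in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanExecution:
    """One row of (hypothetical) plan history information."""
    plan_id: str
    recovered: bool  # True if this execution recovered the failure


def success_index(history, plan_id):
    """Return the fraction of past executions of plan_id that succeeded.

    Returns None when the plan has never been executed, so that "no
    evidence" can be distinguished from "always failed".
    """
    runs = [h for h in history if h.plan_id == plan_id]
    if not runs:
        return None
    return sum(h.recovered for h in runs) / len(runs)


def rank_plans(history, candidate_plan_ids):
    """Order candidate plans by descending index value; untried plans last."""
    scored = [(pid, success_index(history, pid)) for pid in candidate_plan_ids]
    return sorted(scored, key=lambda item: (item[1] is None, -(item[1] or 0.0)))


# Assumed sample history: Plan1 succeeded once out of twice, Plan2 twice
# out of twice, and Plan3 has never been executed.
history = [
    PlanExecution("Plan1", True),
    PlanExecution("Plan1", False),
    PlanExecution("Plan2", True),
    PlanExecution("Plan2", True),
]
ranking = rank_plans(history, ["Plan1", "Plan2", "Plan3"])
```

In this sketch a plan without any history is placed last rather than being
given an index value of zero. This is one possible design choice; a display
mode that highlights or hides such plans could equally be adopted.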
[Advantageous Effects of Invention]
[0012]
The present invention can provide a technique for supporting a manager who
copes with a failure.
[Brief Description of Drawings]
[0013]
[Fig. 1]
Fig. 1 is a block diagram showing an example of a computer system in
accordance with a first embodiment.
[Fig. 2]
Fig. 2 is a block diagram showing an example of a host computer in
accordance with the first embodiment.
[Fig. 3]
Fig. 3 is a block diagram showing an example of a storage apparatus in
accordance with the first embodiment.
[Fig. 4]
Fig. 4 is a block diagram showing an example of a management server in
accordance with the first embodiment.
[Fig. 5]
Fig. 5 is a block diagram showing an example of an apparatus performance
management table in accordance with the first embodiment.
[Fig. 6]
Fig. 6 is a block diagram showing an example of a volume topology
management table in accordance with the first embodiment.
[Fig. 7]
Fig. 7 is a block diagram showing an example of an event management table
in accordance with the first embodiment.
[Fig. 8]
Fig. 8 is a block diagram showing an example of a general rule in accordance
with the first embodiment.
[Fig. 9A]
Fig. 9A is a view showing a first example of an expansion rule in accordance
with the first embodiment.
[Fig. 9B]
Fig. 9B is a view showing a second example of an expansion rule in
accordance with the first embodiment.
[Fig. 9C]
Fig. 9C is a view showing a third example of an expansion rule in accordance
with the first embodiment.
[Fig. 9D]
Fig. 9D is a view showing a fourth example of an expansion rule in
accordance with the first embodiment.
[Fig. 10]
Fig. 10 is a block diagram showing an example of an analysis result
management table in accordance with the first embodiment.
[Fig. 11]
Fig. 11 is a block diagram showing an example of a general plan table in
accordance with the first embodiment.
[Fig. 12]
Fig. 12 is a block diagram showing an example of an expansion plan table in
accordance with the first embodiment.
[Fig. 13]
Fig. 13 is a block diagram showing an example of a rule plan correspondence
management table in accordance with the first embodiment.
[Fig. 14]
Fig. 14 is a block diagram showing an example of a plan execution history
management table in accordance with the first embodiment.
[Fig. 15]
Fig. 15 is a flowchart of a performance information acquisition processing in
accordance with the first embodiment.
[Fig. 16]
Fig. 16 is a flowchart of a failure cause analysis processing in accordance
with the first embodiment.
[Fig. 17]
Fig. 17 is a flowchart of a plan expansion processing in accordance with the
first embodiment.
[Fig. 18]
Fig. 18 is a flowchart of a plan post-execution risk extraction processing in
accordance with the first embodiment.
[Fig. 19]
Fig. 19 is a flowchart of a plan presentation processing in accordance with
the first embodiment.
[Fig. 20]
Fig. 20 is a block diagram showing an example of a plan presentation screen
in accordance with the first embodiment.
[Fig. 21]
Fig. 21 is a flowchart of a plan execution processing in accordance with the
first embodiment.
[Fig. 22]
Fig. 22 is a block diagram showing an example of a management server in
accordance with a second embodiment.
[Fig. 23]
Fig. 23 is a block diagram showing an example of a test case repository in
accordance with the second embodiment.
[Fig. 24]
Fig. 24 is a flowchart of a test case extraction processing in accordance with
the second embodiment.
[Fig. 25]
Fig. 25 is a block diagram showing an example of a computer system in
accordance with a third embodiment.
[Fig. 26]
Fig. 26 is a block diagram showing an example of a management server in
accordance with the third embodiment.
[Fig. 27]
Fig. 27 is a block diagram showing an example of a plan execution history
management table in accordance with the third embodiment.
[Fig. 28]
Fig. 28 is a block diagram showing an example of a management server list
in accordance with the third embodiment.
[Fig. 29]
Fig. 29 is a flowchart of a plan execution history exchange processing in
accordance with the third embodiment.
[Fig. 30]
Fig. 30 is a block diagram showing an example of a plan presentation screen
in accordance with the third embodiment.
[Description of Embodiments]
[0014]
The embodiments of the present invention will be described below with
reference to the drawings. The embodiments that will be described in the
following do not restrict the present invention in accordance with the claims,
and all of elements and combinations thereof that will be described in the
embodiment are not necessarily essential for means for solving the problems
of the invention. In the drawings, the same reference symbols indicate the
same composition elements through a plurality of drawings. In the following
descriptions, while the information in accordance with the present invention
will be described in the expression such as "aaa table", the information can
also be represented by other than the data structure such as a table. In
order to indicate that the information does not depend on a data structure,
the expression of "aaa table" can also be referred to as "aaa information" or
"aaa data" in some cases. Moreover, in the case in which the contents of the
information are described, the expressions of "identification information",
"identifier", "name", and "ID" are used. The expressions can be substituted
for each other.
[0015]
In the following descriptions, a "program" or a "module" is handled as the
subject of a description in some cases. In the case in which the
program (the module) is executed by a processor, the predetermined
processing is executed while using a memory and a communication port (such
as a management port and an I/O port). Consequently, a processor can also
be handled as a subject in the descriptions. The processing that is disclosed
while a program is handled as a subject can also be a processing that is
executed by a computer or an information processing apparatus such as a
management server. Moreover, a part or a whole of a program can also be
implemented by dedicated hardware. A device including a processor or a
processor and such dedicated hardware can also be referred to as a "control
device". A variety of programs can be installed to each of the computers by a
program distribution server or a storage medium that can be read by a
computer.
[0016]
An aggregate of one or more computers that are configured to manage a
computer system and to display the display information in accordance with
the present invention is referred to as a management system in some cases.
In the case in which a management server displays the display information,
the management server is a management system. Moreover, a combination
of the management server and a display computer (such as a WEB browser
start-up server) is also a management system. A processing that is
equivalent to the management server can also be implemented by using a
plurality of computers to speed up a management processing and to increase
reliability of a management processing. In this case, the plurality of
computers is a management system (in the case in which a display is
executed by the display computer, the display computer is included in the
plurality of computers).
[0017]
(1) First embodiment
[0018]
The first embodiment is related to a display processing of a candidate of a
failure cause by management software (such as a program in a management
server).
[0019]
[0020]
Fig. 1 is a block diagram showing an example of a computer system in
accordance with a first embodiment.
[0021]
A computer system is provided with one or more storage apparatuses 20000,
one or more host computers 10000, a management server 30000, and a WEB
browser start-up server 35000, which are coupled to each other by one or
more network apparatuses, such as a communication network 45000
configured by an IP switch 40000 and a router not shown.
[0022]
The host computer 10000 receives an I/O (input/output) request of a file from
a client computer not shown and executes an access to the storage apparatus
20000 based on the received I/O request for instance. Moreover, the
management server 30000 manages an operation of the entire computer
system.
[0023]
The WEB browser start-up server 35000 communicates with a GUI display
processing module of the management server 30000 via the communication
network 45000 and displays a variety of information on a browser screen that
is displayed by the WEB browser. A manager refers to the information that
is displayed on a browser screen of the WEB browser start-up server 35000 to
manage each apparatus in the computer system. However, the management
server 30000 and the WEB browser start-up server 35000 can also be
configured by one server.
[0024]
For apparatuses included in the computer system, an apparatus that is a
target of a management of the management server 30000 is referred to as a
management target apparatus in the following. In the present embodiment,
a management target apparatus is the host computer 10000, the storage
apparatus 20000, and a network apparatus such as the IP switch 40000.
However, other apparatuses such as a NAS (Network Attached Storage) and
a printer can also be included as a management target apparatus. Moreover,
for devices included in the management target apparatus, a device that is a
target of a management of the management server 30000 is referred to as a
management target device.
[0025]
[0026]
Fig. 2 is a block diagram showing an example of a host computer in
accordance with a first embodiment.
[0027]
The host computer 10000 is provided with a port 11000 for being coupled to
the communication network 45000, a processor 12000, and a memory 13000,
which are coupled to each other via a circuit such as an internal bus. The
host computer 10000 can also include a secondary storage device such as a
disk (a magnetic disk).
[0028]
The memory 13000 stores a work application 13100 and an operating system
(OS) 13200. The work application 13100 uses a storage area that has been
provided from the operating system 13200 to execute an input/output (I/O) of
data to the storage area. The operating system 13200 executes a processing
for causing the work application 13100 to recognize a logical volume on the
storage apparatus 20000 coupled to the host computer 10000 via the
communication network 45000 as a storage area.
[0029]
In the example of Fig. 2, the port 11000 is shown as a single port that serves
both as an I/O port for communicating with the storage apparatus 20000 by
iSCSI (Internet Small Computer System Interface) and as a management port
through which the management server 30000 acquires management information
in the host computer 10000. However, the I/O port and the management port
can also be separated as different ports.
[0030]
[0031]
Fig. 3 is a block diagram showing an example of a storage apparatus in
accordance with a first embodiment.
[0032]
A storage apparatus 20000 is provided with an I/O port 21000 for being
coupled to the host computer 10000 via the communication network 45000, a
management port 21100 for being coupled to the management server 30000
via the communication network 45000, a management memory 23000 for
storing a variety of management information, a RAID (Redundant Arrays of
Inexpensive Disks) group 24000 for storing user data, and a controller 25000
for controlling user data and management information in the management
memory, which are coupled to each other via a circuit such as an internal
bus. In the present embodiment, the statement that the RAID group 24000 is
coupled to another device means that a disk 24200 that configures the RAID
group 24000 is coupled to the other device.
[0033]
The management memory 23000 stores a management program 23100 for
managing the storage apparatus 20000. The management program 23100
communicates with the management server 30000 via the management port
21100 and provides the configuration information of the storage apparatus
20000 to the management server 30000.
[0034]
The RAID group 24000 is configured by one or more disks 24200. In the case
in which the RAID group 24000 is configured by a plurality of disks 24200,
the plurality of disks 24200 can make a RAID configuration. For the storage
apparatus 20000, one or more logical volumes 24100 are formed based on a
storage area in the RAID group 24000.
[0035]
As long as the logical volume 24100 is configured by using a storage area of
one or more disks 24200, it is not necessary to make a RAID configuration.
Moreover, as a device that provides a storage area corresponding to the logical
volume 24100, a storage medium of another kind such as a flash memory can
also be adopted as a substitute for the disk 24200.
[0036]
The controller 25000 is provided inside with a processor for controlling the
storage apparatus 20000 and a cache memory for temporarily storing data
that is transmitted to and received from the host computer 10000. The
controller 25000 is disposed between the I/O port 21000 and the RAID group
24000, and transmits and receives data between the I/O port 21000 and the
RAID group 24000.
[0037]
As long as the storage apparatus 20000 provides the logical volume 24100 to
any host computer 10000, receives an I/O request, and is provided with a
storage controller (the controller 25000 in the present embodiment) that
executes a read and a write to a storage device (the disk 24200 in the present
embodiment) according to the received I/O request and with a storage device
that provides a storage area, a configuration other than that of Fig. 3 can
also be adopted. For instance, a storage controller and a storage device that
provides a storage area can also exist in separate enclosures. Although
the management memory 23000 and the controller 25000 are configured as
separate devices in the example of Fig. 3, a configuration in which the
controller 25000 includes the management memory 23000 can also be
adopted. Moreover, the expression "storage apparatus" can also be replaced
with "storage system" both in the case in which a storage controller and a
storage device exist in the same enclosure and in the case in which they
exist in separate enclosures.
[0038]
[0039]
Fig. 4 is a block diagram showing an example of a management server in
accordance with a first embodiment.
[0040]
The management server 30000 is provided with a management port 31000
for being coupled to the communication network 45000, a processor 31100, a
memory 32000 such as a cache memory that is one type of a storage device, a
secondary storage device 33000 such as an HDD (hard disk drive) that is one
type of a storage device, an output device 31200 such as a display for
outputting a processing result, and an input device 31300 such as a keyboard
for inputting an indication by a manager, which are coupled to each other via
a circuit such as an internal bus.
[0041]
The memory 32000 stores the computer programs of a program control
module 32100, a configuration management information acquisition module
32200, an apparatus performance acquisition module 32300, a GUI display
processing module 32400, an event analysis processing module 32500, a rule
expansion module 32600, a plan expansion module 32700, a plan post-execution
risk extraction module 32800, a plan presentation module 32900, a
plan execution module 32910, a plan execution result confirmation module
32920, a plan execution history extraction module 32930, and a plan
evaluation module 32940. In the present embodiment, each module is
provided as a software module of the memory 32000. However, each module
can also be provided as a hardware module. Moreover, a processing that is
executed by each module can be provided as one or more program codes, and it
is not necessary that a clear boundary between modules exists. A module can
also be called a program.
[0042]
The secondary storage device 33000 stores an apparatus performance
management table 33100, a volume topology management table 33200, an
event management table 33300, a general rule repository 33400, an
expansion rule repository 33500, an analysis result management table 33600,
a general plan table 33700, one or more expansion plan tables 33800, a rule
plan correspondence management table 33900, and a plan execution history
management table 33950. The general rule repository 33400 stores one or
more general rules. The expansion rule repository 33500 stores one or more
expansion rules. The general rule and the expansion rule are information
indicating a correspondence relationship between a combination of one or
more condition events that may occur in a management target device that
configures a computer system and a cause event that is a cause of a failure for
the combination of one or more condition events. The secondary storage
device 33000 is configured by a semiconductor memory and a disk, or any one
of a semiconductor memory and a disk for instance.
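The correspondence between condition events and a cause event that a rule
holds can be pictured with the following minimal sketch. It is an
illustration under assumptions and not the format of the rule repositories:
the class names Event and ExpansionRule, the helper matched_conditions, the
rule ID "ExRule1", and the choice of events are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """An event identified by apparatus, device, and metric (assumed key)."""
    apparatus_id: str
    device_id: str
    metric: str


@dataclass(frozen=True)
class ExpansionRule:
    """A rule that maps a combination of condition events to a cause event."""
    rule_id: str
    condition_events: frozenset  # events expected when the cause event is a cause
    cause_event: Event


def matched_conditions(rule, observed):
    """Return the condition events of the rule that have actually occurred."""
    return rule.condition_events & frozenset(observed)


# Hypothetical rule: an overloaded controller is the cause of both its own
# operation rate event and a response time event on a host drive.
ctl = Event("SYS1", "CTL1", "operation rate")
drive = Event("HOST1", "/var", "response time")
rule = ExpansionRule("ExRule1", frozenset({ctl, drive}), cause_event=ctl)

hits = matched_conditions(rule, [ctl])
```

Keeping the condition events as a set makes it straightforward to compare the
number of matched condition events with the total number for a rule, which is
one way a degree of confidence in a cause candidate could be derived.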
[0043]
The GUI display processing module 32400 displays the acquired
configuration management information via the output device 31200 in
response to a request via the input device 31300 from a manager. The input
device 31300 and the output device 31200 can be separate devices, or can be
one unified device.
[0044]
The management server 30000 is provided with a keyboard or a pointer
device as the input device 31300, and a display or a printer as the output
device 31200 for instance. However, the management server 30000 can also
be provided with other apparatuses. Moreover, it is also possible that a
serial interface or an Ethernet interface is used as a substitute for an
input/output device and a display computer provided with a display, a
keyboard, or a pointer device is coupled to the interface. In this case, a
display is executed by the display computer by transmitting the display
information to the display computer, the input information is received from
the display computer, and the display computer thereby substitutes for the
input and the output of the input/output device.
[0045]
[0046]
Fig. 5 is a block diagram showing an example of an apparatus performance
management table in accordance with a first embodiment.
[0047]
An apparatus performance management table 33100 includes an apparatus
ID 33110 that is a field for storing an identifier of a management target
apparatus (hereafter referred to as an apparatus ID), a device ID 33120 that
is a field for storing an identifier of a management target device (hereafter
referred to as a device ID), a metric 33130 that is a field for storing a metric
name that indicates a kind of a performance value related to a management
target device, an apparatus OS 33140 that is a field for storing data that
indicates a type of an OS of a management target apparatus in which a
threshold value abnormality of a performance value has been detected, a
performance value 33150 that is a field for storing a performance value of
a management target device that has been acquired from a management target
apparatus including the device, an alert execution
threshold value 33160 that is a field for storing a threshold value of an upper
limit or a lower limit of a normal range of a performance value of a
management target device (hereafter referred to as an alert execution
threshold value) when an input is received from a user, a threshold value
type 33170 that is a field for storing data that indicates whether the alert
execution threshold value is an upper limit or a lower limit of a normal
range, and a status 33180 that is a field for storing data that indicates
whether a performance value is a normal value or an abnormal value.
[0048]
For instance, the first entry from above in Fig. 5 is an entry related to a
controller CTL1 (that is, a controller 25000 of which the device ID is CTL1;
a management target device specified by a device ID is referred to similarly
hereafter) in a storage apparatus SYS1 (that is, a storage apparatus 20000 of
which the apparatus ID is SYS1; a management target apparatus specified by
an apparatus ID is referred to similarly hereafter). From the entry, it is
known that in the case in which an operation rate of a processor exceeds 20%,
the management server 30000 determines that the controller CTL1 is
overloaded, that is, the alert execution threshold value for the controller
CTL1 is 20%. Moreover, from the entry, it is known that the operation rate of
the processor for the controller CTL1 is 40% at the present moment and that
the present performance value is determined to be an abnormal value.
[0049]
In Fig. 5, as a performance value of a management target device, an
operation rate of a processor (simply referred to as an operation rate in the
figure), an I/O amount per unit time, and a response time are mentioned as
an example. However, other kinds of a performance value can also be
adopted.
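The judgment described for the first entry of Fig. 5 (a performance value of
40% against an upper-limit alert execution threshold value of 20%) can be
sketched as follows. The class PerformanceRow, the function status_of, and
the string encodings "upper", "lower", "normal", and "abnormal" are
assumptions made for illustration; the actual encoding of the table fields is
not specified here.

```python
from dataclasses import dataclass


@dataclass
class PerformanceRow:
    """A simplified row of the apparatus performance management table."""
    apparatus_id: str
    device_id: str
    metric: str
    value: float
    threshold: float
    threshold_type: str  # "upper" or "lower" (hypothetical encoding)


def status_of(row):
    """Return "abnormal" if the value is outside the normal range, else "normal"."""
    if row.threshold_type == "upper":
        abnormal = row.value > row.threshold   # value exceeds the upper limit
    elif row.threshold_type == "lower":
        abnormal = row.value < row.threshold   # value falls below the lower limit
    else:
        raise ValueError(f"unknown threshold type: {row.threshold_type!r}")
    return "abnormal" if abnormal else "normal"


# First entry of Fig. 5: controller CTL1 of storage apparatus SYS1.
row = PerformanceRow("SYS1", "CTL1", "operation rate", 40.0, 20.0, "upper")
status = status_of(row)
```

A value that is exactly equal to the threshold value is treated as normal in
this sketch, following the wording "exceeds"; whether the boundary is
inclusive is an assumption.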
[0050]
[0051]
Fig. 6 is a block diagram showing an example of a volume topology
management table in accordance with a first embodiment.
[0052]
A volume topology management table 33200 is information (connection
information) for managing a connection relationship among a plurality of
management target devices in the computer system. The volume topology
management table 33200 includes an apparatus ID 33210 that is a field for
storing an apparatus ID of the storage apparatus 20000, a volume ID 33220
that is a field for storing an identifier of the logical volume 24100
(hereafter referred to as a volume ID) that is used in the storage apparatus
20000, an LU number 33230 that is a field for storing an identifier
(hereafter referred to as an LU number) of the logical volume 24100 by which
the host computer 10000 recognizes the logical volume 24100, a controller
name 33240 that is a field for storing a device ID of the controller 25000
that is used in the case in which the host computer
10000 accesses the logical volume 24100, a connection destination host ID
33250 that is a field for storing an apparatus ID of the host computer 10000
that accesses the logical volume 24100, and a connection destination drive
name 33260 that is a field for storing a device ID of a volume (a drive) in the
host computer 10000 of which the logical volume 24100 is the substance.
[0053]
For instance, from the first entry from above in Fig. 6, it is known that a
logical volume VOL1 of a storage apparatus SYS1 is provided to a host
computer HOST1 as a logical unit (LU) that is indicated by an LU number of
LU1, the host computer HOST1 accesses the logical volume VOL1 via a
controller CTL1, and the logical volume VOL1 is recognized as a drive "/var"
on the host computer HOST1. In the present embodiment, as a device ID of
the logical volume 24100, there are two cases in which a volume ID is used
and in which an LU number is used. For instance, the logical volume VOL1
is referred to as a logical volume LU1 in some cases. However, the logical
volume VOL1 and the logical volume LU1 indicate the same logical volume
24100.
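How the connection information of Fig. 6 can be traversed from a drive on a
host computer to the devices of the storage apparatus behind it is sketched
below. The class TopologyRow and the function devices_behind_drive are
hypothetical names; only the values of the first entry (SYS1, VOL1, LU1,
CTL1, HOST1, "/var") are taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopologyRow:
    """A simplified row of the volume topology management table."""
    apparatus_id: str
    volume_id: str
    lu_number: str
    controller_name: str
    host_id: str
    drive_name: str


def devices_behind_drive(rows, host_id, drive_name):
    """Return (apparatus ID, controller, volume ID) triples backing a host drive."""
    return [
        (r.apparatus_id, r.controller_name, r.volume_id)
        for r in rows
        if r.host_id == host_id and r.drive_name == drive_name
    ]


# First entry of Fig. 6.
table = [TopologyRow("SYS1", "VOL1", "LU1", "CTL1", "HOST1", "/var")]
path = devices_behind_drive(table, "HOST1", "/var")
```

A lookup of this kind is what allows an event on a drive of a host computer to
be related to events on the controller and the logical volume that the drive
depends on.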
[0054]
[0055]
Fig. 7 is a block diagram showing an example of an event management table
in accordance with a first embodiment. An event management table 33300 is
referred to from time to time in a failure cause analysis processing (Fig. 16)
described later.
[0056]
An event management table 33300 includes an event ID 33310 that is a field
for storing an identifier imparted to an event of a failure or the like
(hereafter referred to as an event ID), an apparatus ID 33320 that is a field
for storing an apparatus ID of a management target apparatus in which an
event has occurred, an apparatus region ID 33330 that is a field for storing a
device ID of a management target device in which an event has occurred, a
metric 33340 that is a field for storing a metric name related to a
performance value of which a threshold value abnormality has been detected,
an apparatus OS 33350 that is a field for storing data that indicates a type of
an OS of a management target apparatus in which a threshold value
abnormality has been detected, a status 33360 that is a field for storing data
that indicates a state in an event occurrence for a management target device
in which an event has occurred, an analyzed flag 33370 that is a field for
storing data that indicates whether or not an event has already been
analyzed by an event analysis processing module 32500, and an occurrence
date and time 33380 that is a field for storing data that indicates the date
and time when an event occurred.
[0057]
For instance, from the first entry from the top in Fig. 7, it is known that the
management server 30000 detects a threshold value abnormality of an
operation rate of a processor for the controller CTL1 of the storage apparatus
SYS1 and an event ID of an event corresponding to the threshold value
abnormality is "EV1".
[0058]
[0059]
Fig. 8 is a block diagram showing an example of a general rule in accordance
with a first embodiment.
[0060]
A general rule is a rule indicating a correspondence relationship between a
cause event related to any one of a plurality of management target devices
and one or more condition events related to any one of a plurality of
management target devices that are conditions in which a cause event is a
cause of a failure and is a rule in which a management target device related
to a cause event and a condition event is represented by a type of the
management target device. In general, for an event propagation model for
specifying a cause in a failure analysis, a combination of events predicted to
occur due to a certain failure (cause) and the cause are described by the
IF-THEN form. The general rule is not restricted to one mentioned in Fig. 8,
and more rules can also be adopted.
[0061]
A general rule includes a general rule ID 33430 that is a field for storing an
identifier of a general rule (hereafter referred to as a general rule ID), a
condition part 33410 that is a field for storing an observation event
equivalent to an IF part of a general rule described by the IF-THEN form,
that is, data indicating each of one or more condition events, a conclusion
part 33420 that is a field for storing a cause event equivalent to a THEN part
of a general rule described by the IF-THEN form, that is, data indicating a
cause event, and an application topology 33440 that is a field for storing data
indicating topology information (connection information) that is referred to in
the case in which a general rule is expanded to a real system and an
expansion rule is created. Moreover, the condition part 33410 includes a
field 33450 for storing a number imparted to a condition event (hereafter
referred to as a condition event number) for every condition event. In the
case in which one or more condition events indicated by the condition part
33410 are detected, it is determined that a cause event indicated by the
conclusion part 33420 is a cause of a failure. In the case in which a status of
the conclusion part 33420 becomes normal, it is expected that a problem of
the condition part 33410 is also solved. In the example of Fig. 8, two
condition events are described in the condition part 33410. However, the
number of condition events is not restricted.
[0062]
For instance, a general rule shown by an example of Fig. 8, that is, a general
rule Rule1 (that is a general rule of which a general rule ID is Rule1, and
described similarly in the case in which a rule is specified by using an
identifier in the following) indicates that it is concluded that a threshold
value abnormality of an I/O amount in a unit time for the logical volume
24100 of the storage apparatus 20000 is a cause in the case in which a
threshold value abnormality of a response time for a drive of the host
computer 10000 and a threshold value abnormality of an I/O amount in a
unit time for the logical volume 24100 (LU) of the storage apparatus 20000
are detected as an observation event. In the case in which an expansion rule
is created based on the general rule, the volume topology management table
33200 is referred to as the topology information. As a condition event
included in the observation event, it can also be defined that a certain
condition is normal.
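As an illustration of the IF-THEN form, the general rule Rule1 described above could be represented as follows. This is a minimal sketch under stated assumptions: the class names, the type and metric strings, and the `conclude` helper are hypothetical, and the strict "all condition events observed" test is a simplification (the certainty factor described later also handles partial matches).

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class TypedEvent:
    apparatus_type: str  # type of the management target apparatus
    device_type: str     # type of the management target device
    metric: str
    status: str


@dataclass(frozen=True)
class GeneralRule:
    rule_id: str                         # general rule ID 33430
    conditions: Tuple[TypedEvent, ...]   # condition part 33410 (IF), in condition event number order
    conclusion: TypedEvent               # conclusion part 33420 (THEN)
    topology: str                        # application topology 33440


# Rule1 as described in the text; the string labels are assumed names.
RULE1 = GeneralRule(
    rule_id="Rule1",
    conditions=(
        TypedEvent("Host", "Drive", "ResponseTime", "ThresholdAbnormality"),
        TypedEvent("Storage", "LogicalVolume", "IOPerUnitTime", "ThresholdAbnormality"),
    ),
    conclusion=TypedEvent("Storage", "LogicalVolume", "IOPerUnitTime", "ThresholdAbnormality"),
    topology="VolumeTopology",
)


def conclude(rule: GeneralRule, observed: Set[TypedEvent]) -> Optional[TypedEvent]:
    """Return the cause event if every condition event was observed, else None."""
    if all(condition in observed for condition in rule.conditions):
        return rule.conclusion
    return None
```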
[0063]
[0064]
Fig. 9A is a view showing a first example of an expansion rule in accordance
with a first embodiment. Fig. 9B is a view showing a second example of an
expansion rule in accordance with a first embodiment. Fig. 9C is a view
showing a third example of an expansion rule in accordance with a first
embodiment. Fig. 9D is a view showing a fourth example of an expansion
rule in accordance with a first embodiment.
[0065]
An expansion rule is a rule in which a general rule is expanded in a form
depending on a real configuration of the computer system. In other words,
the expansion rule is a rule indicating a correspondence relationship between
a cause event related to any one of a plurality of management target devices
and one or more condition events related to any one of a plurality of
management target devices that are conditions in which a cause event is a
cause of a failure and is a rule in which a management target device related
to a cause event and a condition event is represented by data indicating the
specific management target device. The expansion rule is created by
replacing a type of a management target apparatus and a type of a
management target device related to each of a condition event and a cause
event for a general rule by an apparatus ID of the specific management
target apparatus and a device ID of the specific management target device
that are defined by the volume topology management table 33200.
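The replacement of types by concrete IDs can be sketched as below. The dictionary layout, the `bind` and `expand_rule` helpers, and the metric labels are assumptions made for illustration. The ID scheme "Ex" + general rule ID + "-n" is inferred from the identifier ExRule1-1 used later in the description and is not stated as a rule.

```python
# General rule Rule1 with typed events: (apparatus type, device type, metric).
GENERAL_RULE = {
    "id": "Rule1",
    "conditions": [("Host", "Drive", "ResponseTime"),
                   ("Storage", "LogicalVolume", "IOPerUnitTime")],
    "conclusion": ("Storage", "LogicalVolume", "IOPerUnitTime"),
}

# One row of the volume topology management table (first entry of Fig. 6).
TOPOLOGY_ROWS = [
    {"apparatus_id": "SYS1", "lu_number": "LU1", "host_id": "HOST1", "drive_name": "/var"},
]


def bind(event, row):
    """Replace the apparatus/device types of one event by the IDs found in a topology row."""
    apparatus_type, _device_type, metric = event
    if apparatus_type == "Host":
        return (row["host_id"], row["drive_name"], metric)
    return (row["apparatus_id"], row["lu_number"], metric)


def expand_rule(rule, rows):
    """Create one expansion rule per topology row."""
    return [
        {
            "id": f"Ex{rule['id']}-{n}",          # assumed naming scheme
            "general_rule_id": rule["id"],
            "conditions": [bind(ev, row) for ev in rule["conditions"]],
            "conclusion": bind(rule["conclusion"], row),
        }
        for n, row in enumerate(rows, start=1)
    ]
```

With the single topology row above, `expand_rule(GENERAL_RULE, TOPOLOGY_ROWS)` yields one expansion rule whose condition events refer to the drive "/var" of HOST1 and the logical volume LU1 of SYS1.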
[0066]
The configuration of an expansion rule will be described with reference to
Fig. 9A in the following. An expansion rule includes an expansion rule ID
33530 that is a field for storing an identifier of an expansion rule (hereafter
referred to as an expansion rule ID), a pre-expansion general rule ID 33540
that is a field for storing a general rule ID of a general rule that is a basis of
an expansion rule, a condition part 33510 that is a field for storing an
observation event equivalent to an IF part of an expansion rule described by
the IF-THEN form, that is, data indicating each of one or more condition
events, and a conclusion part 33520 that is a field for storing a cause event
equivalent to a THEN part of an expansion rule described by the IF-THEN
form, that is, data indicating a cause event. Moreover, the condition part
33510 includes a field 33550 for storing a condition event number imparted
to a condition event for every condition event.
[0067]
For instance, an expansion rule shown by an example of Fig. 9A is created by
replacing a type of a management target apparatus and a type of a
management target device related to each of a condition event and a cause
event for the general rule Rule1 shown in Fig. 8 by an apparatus ID of the
[0069]
Fig. 10 is a block diagram showing an example of an analysis result
management table in accordance with a first embodiment.
[0070]
An analysis result management table 33600 includes a cause apparatus ID
33610 that is a field for storing an apparatus ID of a management target
apparatus related to an event that has been determined as a candidate of a
failure cause in a failure cause analysis processing (hereafter referred to as a
cause candidate event) (a first cause event), a cause region ID 33620 that is
a field for storing a device ID of a management target device related to a
cause candidate event, a metric 33630 that is a field for storing a metric
name related to a performance value related to a cause candidate event, a
certainty factor 33640 that is a field for storing a value (a certainty factor)
indicating the certainty of the cause event being the root cause, an expansion
rule 33650 that is a field for storing an expansion rule ID of an expansion
rule including a cause candidate event as a cause event, that is, an expansion
rule that is a reason of determining a cause candidate event as a candidate of
a failure cause, a reception event ID 33660 that is a field for storing an event
ID of an event that has actually occurred for one or more condition events of
an expansion rule including a cause candidate event as a cause event, a
corresponded flag 33670 that is a field for storing data that indicates whether
or not a manager has actually executed a failure correspondence based on the
analysis result, and an analysis execution date and time 33680 that is a field
for storing data that indicates the date and time when a failure analysis
processing associated with an occurrence of an event was started. In the
present embodiment, a certainty factor is an occurrence rate of a condition
event in the past certain period of time.
[0071]
For instance, from the first entry from the top in Fig. 10, it is known that the
management server 30000 has determined that a threshold value
abnormality of an I/O amount in a unit time for the logical volume LU1 of the
storage apparatus SYS1 is a candidate of a failure cause, an occurrence of an
event of which an event ID is indicated by "EV3" or "EV6" is a reason of the
determination, and a certainty, that is, an occurrence rate of a condition
event is 100% (2/2 x 100).
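How an entry like the first row of Fig. 10 could be derived is sketched below. The dictionary keys and the `build_analysis_entry` helper are hypothetical; the event IDs EV3 and EV6, the volume LU1 of SYS1, and the 2/2 x 100 calculation come from the description. The sketch assumes that the supplied events are already limited to the past certain period of time.

```python
def build_analysis_entry(expansion_rule, events):
    """Build one analysis result entry for the cause event of an expansion rule."""
    received_ids = []
    hits = 0
    for condition in expansion_rule["conditions"]:
        ids = [e["event_id"] for e in events if e["key"] == condition]
        if ids:                      # this condition event has actually occurred
            hits += 1
            received_ids.extend(ids)
    apparatus_id, device_id, metric = expansion_rule["conclusion"]
    return {
        "cause_apparatus_id": apparatus_id,                                   # 33610
        "cause_region_id": device_id,                                         # 33620
        "metric": metric,                                                     # 33630
        "certainty_factor": hits / len(expansion_rule["conditions"]) * 100,   # 33640
        "expansion_rule_id": expansion_rule["id"],                            # 33650
        "reception_event_ids": received_ids,                                  # 33660
    }


# Example data; the metric labels are assumed names.
EX_RULE = {
    "id": "ExRule1-1",
    "conditions": [("HOST1", "/var", "ResponseTime"), ("SYS1", "LU1", "IOPerUnitTime")],
    "conclusion": ("SYS1", "LU1", "IOPerUnitTime"),
}
EVENTS = [
    {"event_id": "EV3", "key": ("SYS1", "LU1", "IOPerUnitTime")},
    {"event_id": "EV6", "key": ("HOST1", "/var", "ResponseTime")},
]
ENTRY = build_analysis_entry(EX_RULE, EVENTS)
```

Here both condition events have occurred, so the certainty factor is 2/2 x 100.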
[0072]
[0073]
Fig. 11 is a block diagram showing an example of a general plan table in
accordance with a first embodiment.
[0074]
A general plan table 33700 is information indicating a list of a general plan.
The general plan is a recovery measure to a failure that can be executed in
the computer system (hereafter referred to as a plan), and is a plan
represented in the form independent of an actual configuration of the
computer system. The general plan table 33700 includes the fields of a
general plan ID 33710 and a plan 33720. The general plan ID 33710 stores
an identifier of a general plan (hereafter referred to as a general plan ID).
The plan 33720 stores data that indicates a general plan that can be
executed in the computer system, for instance, a name of a general plan. As
a general plan, there can be mentioned for instance a reboot of the host
computer 10000, a configuration modification of the IP switch 4000, and a
volume migration and a VM migration of the storage apparatus 20000. The
general plan is not restricted to one shown in Fig. 11.
[0075]
[0076]
Fig. 12 is a block diagram showing an example of an expansion plan table in
accordance with a first embodiment.
[0077]
An expansion plan table 33800 is information for managing one or more
expansion plans. The expansion plan is a plan in which a general plan has
been expanded in the form dependent on an actual configuration of the
computer system. The expansion plan table 33800 is created based on an
expansion rule, the general plan table 33700, the volume topology
management table 33200, and the apparatus performance management table
33100 by the plan expansion module 32700.
[0078]
The expansion plan table 33800 includes the fields of a plan detail 33810, a
general plan ID 33820, an expansion rule ID 33823, and a general rule ID
33825. The general plan ID 33820 stores a general plan ID of a general plan
that is a basis of an expansion plan. The expansion rule ID 33823 stores an
expansion rule ID of an expansion rule corresponding to an expansion plan
as information for recognizing a failure cause to which the expanded plan is
corresponded. That is, each expansion plan in the expansion plan table
33800 is a plan that can be executed in the case in which a cause event of an
expansion rule that is indicated by an expansion rule ID of the expansion
rule ID 33823 is a failure cause (a plan to the failure cause). In other words,
the expansion plan table 33800 is information for managing a correspondence
relationship between an expansion rule and one or more expansion plans
that are corresponded to the expansion rule. In the present embodiment, the
expansion plan table 33800 is created for every combination of an expansion
rule and a general plan. However, the expansion plan table 33800 can also
be created for every expansion rule, and other modes can also be adopted.
The expansion plan table 33800 is corresponded to information (plan
information) that indicates a correspondence relationship between a rule and
a plan that can be executed in the case in which a cause event of the rule is a
cause. The general rule ID 33825 stores a general rule ID of a general rule
that is a basis of an expansion rule corresponded to an expansion plan.
[0079]
The plan detail 33810 stores the concrete processing contents about each of
one or more expansion plans that have been expanded and the state
information after the execution of the expansion plan. The plan detail 33810
includes the fields of an expansion plan ID 33830, a plan target 33840, and a
risk point 33890. The expansion plan ID 33830 stores an identifier of an
expansion plan (hereafter referred to as an expansion plan ID). The plan
target 33840 stores information that indicates a composition element (device)
related to an expansion plan and the information after the execution of the
plan or the like. The risk point 33890 stores data that indicates a problem
point that potentially remains after the execution of the plan (hereafter
referred to as a risk point).
[0080]
The expansion plan table 33800 shown by an example of Fig. 12 manages an
expansion plan based on a general plan of which a general plan ID is Plan1,
that is, an expansion plan related to a volume migration. In the case of the
expansion plan related to a volume migration, the plan target 33840 includes
the fields of a migration target volume 33850, a migration source apparatus
33860, and a migration destination apparatus 33870 for instance. The
migration target volume 33850 includes a volume ID 33850A that is a field
for storing a device ID of a logical volume 24100 that is a target of a volume
migration (hereafter referred to as a migration target volume) and an I/O
Response Time prediction 33850B that is a field for storing a predicted value
of a response time of an I/O to a migration target volume after the execution
of a volume migration. The migration source apparatus 33860 includes an
apparatus ID 33860A that is a field for storing an apparatus ID of the
storage apparatus 20000 that is provided with a migration target volume
(hereafter referred to as a migration source apparatus) and an I/O Response
Time prediction 33860B that is a field for storing a predicted value of a
response time of an I/O to a migration source apparatus after the execution of
a volume migration. The migration destination apparatus 33870 includes an
apparatus ID 33870A that is a field for storing an apparatus ID of the
storage apparatus 20000 that is a migration destination of data of a
migration target volume (hereafter referred to as a migration destination
apparatus) and an I/O Response Time prediction 33870B that is a field for
storing a predicted value of a response time of an I/O to a migration
destination apparatus after the execution of a volume migration.
[0081]
For the information of the volume ID 33850A, the apparatus ID 33860A, and
the apparatus ID 33870A, the plan expansion module 32700 acquires the
information from the volume topology management table 33200 and stores
the information. As a calculation method of a value that is stored into each
of the I/O Response Time prediction 33850B, I/O Response Time prediction
33860B, and I/O Response Time prediction 33870B, an arbitrary method can
be adopted. For instance, a value of each of the I/O Response Time prediction
33850B, I/O Response Time prediction 33860B, and I/O Response Time prediction
33870B can be a value (a response time of an I/O) that is obtained by the plan
expansion module 32700 that acquires an I/O amount in a unit time of a
migration target volume, a migration source apparatus, and a migration
destination apparatus from the apparatus performance management table
33100, subtracts a value of an I/O amount in a unit time of a migration target
volume from an I/O amount in a unit time of a migration source apparatus,
adds a value of an I/O amount in a unit time of a migration target volume to
an I/O amount in a unit time of a migration destination apparatus, predicts
an I/O amount of a migration source apparatus and a migration destination
apparatus after the execution of a volume migration, and takes a reciprocal
thereof. In an example of Fig. 12, an example in which the performance
information is stored is described as the contents of the plan detail 33810.
However, the cost information related to a plan and the down time
information of a system due to a failure in the case in which a plan is
executed can also be stored for instance.
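The example calculation above can be sketched as follows for the migration source apparatus and the migration destination apparatus. The function names and the returned keys are hypothetical, and returning `None` when the predicted I/O amount is not positive is an assumption added here, since the description does not say how that case is handled. Any scaling from the reciprocal to an actual time unit is likewise left open.

```python
from typing import Dict, Optional


def reciprocal(io_per_unit_time: float) -> Optional[float]:
    """Reciprocal of a predicted I/O amount; None if the amount is not positive (assumed)."""
    return 1.0 / io_per_unit_time if io_per_unit_time > 0 else None


def predict_after_migration(volume_io: float, source_io: float,
                            destination_io: float) -> Dict[str, Optional[float]]:
    """Predict response times after moving a volume from the source to the destination."""
    source_after = source_io - volume_io            # the volume's load leaves the source
    destination_after = destination_io + volume_io  # and is added to the destination
    return {
        "source": reciprocal(source_after),
        "destination": reciprocal(destination_after),
    }
```

For instance, with an I/O amount of 100 for the migration target volume, 500 for the source, and 100 for the destination, the predicted amounts after the migration are 400 and 200, so the values stored would be 1/400 and 1/200.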
[0082]
Fig. 12 shows an example of an expansion plan related to a volume
migration. However, an expansion plan that is corresponded to another general
plan that is included in the general plan table 33700 can also be created
similarly. Even in the case in which another general plan is expanded to an
expansion plan, the plan expansion module 32700 refers to the volume
topology management table 33200, enumerates devices related to the plan,
refers to the apparatus performance management table 33100, simulates the
state information after an execution of the plan such as the performance
information, the capacity information, the cost information, and the down
time information, and calculates a predicted value of a performance value
after the execution of the plan for a device related to the plan.
[0083]
[0084]
Fig. 13 is a block diagram showing an example of a rule plan correspondence
management table in accordance with a first embodiment.
[0085]
A rule plan correspondence management table 33900 is information for
managing a correspondence relationship between a general rule and one or
more general plans corresponded to the general rule, that is, one or more
general plans that can be executed in the case in which a cause event of the
general rule is a cause. The rule plan correspondence management table
33900 is corresponded to information (plan information) that indicates a
correspondence relationship between a rule and a plan that can be executed
in the case in which a cause event of the rule is a cause. The rule plan
correspondence management table 33900 indicates a correspondence
relationship among a general rule, a list of a general plan that can be
executed in the case in which a cause of a failure is specified by applying the
general rule, and an event that remains in an unsolved state in the case in which
each general plan is executed (hereafter referred to as an unsolved event).
[0086]
The rule plan correspondence management table 33900 includes the fields of
a general rule ID 33910, a general plan ID 33920, and an unsolved event ID
33930. The general rule ID 33910 stores a general rule ID of a general rule.
The general plan ID 33920 stores a general plan ID of a general plan. The
unsolved event ID 33930 stores an identifier of an event (an unsolved event)
that remains in an unsolved state in the case in which each general plan is
executed (hereafter referred to as an unsolved event ID). The unsolved event
ID is corresponded to a condition event number that is stored into the field
33450 of the condition part 33410 of the general rule. For instance, the
unsolved event ID 33930 stores "NONE" in the case in which an unsolved
event does not exist, and stores "ALL" in the case in which all of condition
events and cause events remain as an unsolved event.
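A lookup over the rule plan correspondence management table might look like the sketch below. The rows are hypothetical examples and are not taken from Fig. 13; the helper names are also assumptions. Only the meaning of "NONE", "ALL", and a condition event number follows the description above.

```python
# Hypothetical rows: general rule ID 33910, general plan ID 33920, unsolved event ID 33930.
RULE_PLAN_TABLE = [
    {"general_rule_id": "Rule1", "general_plan_id": "Plan1", "unsolved_event_id": "NONE"},
    {"general_rule_id": "Rule1", "general_plan_id": "Plan2", "unsolved_event_id": "1"},
    {"general_rule_id": "Rule2", "general_plan_id": "Plan3", "unsolved_event_id": "ALL"},
]


def plans_for_rule(table, general_rule_id):
    """List the general plans that can be executed when the rule's cause event is the cause."""
    return [row["general_plan_id"] for row in table
            if row["general_rule_id"] == general_rule_id]


def unsolved_condition_numbers(row, condition_numbers):
    """Condition event numbers that remain unsolved after the plan of this row is executed."""
    marker = row["unsolved_event_id"]
    if marker == "NONE":
        return []
    if marker == "ALL":
        # "ALL" also means the cause event remains; only condition numbers are returned here.
        return list(condition_numbers)
    return [marker]
```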
[0087]
[0088]
Fig. 14 is a block diagram showing an example of a plan execution history
management table in accordance with a first embodiment.
[0089]
A plan execution history management table 33950 is information (plan
history information) for managing an execution result (the success or failure
of a failure recovery) for an expansion plan that has been executed for the
computer system, for instance an expansion plan that has been executed by
the plan execution module 32910. The plan execution history management
table 33950 includes the fields of an expansion rule ID 33960, an expansion
plan ID 33970, an execution success or failure 33980, and an execution date
and time 33990. The expansion rule ID 33960 stores an expansion rule ID of
an expansion rule. The expansion plan ID 33970 stores an expansion plan ID
of an expansion plan. The execution success or failure 33980 stores data that
indicates the success or failure of a failure recovery by an execution of an
expansion plan, that is, data that indicates whether or not a recovery of a
failure in which a cause event of an expansion rule that is indicated by an
expansion rule ID of the expansion rule ID 33960 is a failure cause has
succeeded by an execution of an expansion plan that is indicated by an
expansion plan ID of the expansion plan ID 33970. For instance, the
execution success or failure 33980 stores "OK" in the case in which a failure
recovery has succeeded, and stores "NG" in the case in which a failure recovery
has failed. The execution date and time 33990 stores data that indicates the
date and time when an expansion plan is executed.
[0090]
In the example of Fig. 14, each entry (a history element) of the plan execution
history management table 33950 indicates an expansion rule including a
cause event specified as a failure cause (more specifically, a candidate of a
failure cause), an expansion plan that has been executed to the failure cause,
and the success or failure of the failure recovery by an execution of the
expansion plan by an association with each other. However, a configuration
of the plan execution history management table 33950 is not restricted to the
above configuration. Other configuration can also be adopted as long as each
entry can indicate a failure cause, an expansion plan that has been executed
to the failure cause, and the success or failure of the failure recovery by an
execution of the expansion plan by an association with each other. For
instance, each entry can indicate a cause event that has been specified as a
failure cause, an expansion plan that has been executed to the failure cause,
and the success or failure of the failure recovery by an execution of the
expansion plan by an association with each other.
[0091]
In the next place, each processing that is executed by the management server
30000 will be described.
[0092]
[0096]
Fig. 15 is a flowchart of a performance information acquisition processing in
accordance with a first embodiment.
[0097]
The program control module 32100 instructs an execution of a performance
information acquisition processing to the apparatus performance acquisition
module 32300 at a start-up of a program or for every elapse of a certain
period of time from the previous performance information acquisition
processing. In the case in which the execution indication is issued
repeatedly, it is not necessary to indicate the execution at strictly regular
time intervals as long as the execution indication is repeated.
[0098]
The apparatus performance acquisition module 32300 repeats the following
sequence of processing to each management target apparatus.
[0099]
In the first place, the apparatus performance acquisition module 32300
instructs a transmission of performance information to each management
target apparatus (step 61010).
[0100]
The apparatus performance acquisition module 32300 determines whether or
not there is a response from a management target apparatus (step 61020).
In the case in which there is a response from a management target
apparatus, that is, performance information has been received from a
management target apparatus (step 61020: Yes), the apparatus performance
acquisition module 32300 updates a value of a performance value 33160 of
the apparatus performance management table 33100 based on the received
performance information (step 61030). On the other hand, in the case in
which there is not a response from a management target apparatus (step 61020: No), the
apparatus performance acquisition module 32300 terminates the
performance information acquisition processing.
[0101]
In the next place, the apparatus performance acquisition module 32300
refers to a performance value of each management target device that has
been stored into the apparatus performance management table 33100, and
repeats the processing from the step 61050 to the step 61070 for each
performance value (step 61040).
[0102]
The apparatus performance acquisition module 32300 confirms whether or
not a performance value exceeds an alert execution threshold value, and
updates a value of the status 33180 of the apparatus performance
management table 33100 based on the confirmation result (step 61050). The
apparatus performance acquisition module 32300 then determines whether
or not a status of a performance value has been changed, that is, a
performance value has been changed from a normal value to an abnormal
value or from an abnormal value to a normal value (step 61060). In the case
in which a performance value has been changed (step 61060: Yes), the
apparatus performance acquisition module 32300 registers an entry related
to an event corresponded to a change of a status of the performance value to
the event management table 33300 (step 61070). On the other hand, in the
case in which a performance value has not been changed (step 61060: No),
the apparatus performance acquisition module 32300 goes ahead with the
processing to the step 61040 if a state confirmation processing to all
performance values (processing from the step 61050 to the step 61070) has
not been terminated.
[0103]
After the state confirmation processing to all performance values is
terminated, the apparatus performance acquisition module 32300 determines
whether or not there is an event (an entry related to an event) that has been
newly registered to the event management table 33300 (step 61080). In the
case in which there is an event that has been newly registered (step 61080:
Yes), the apparatus performance acquisition module 32300 instructs an
execution of the failure cause analysis processing (see Fig. 16) to the event
analysis processing module 32500 (step 61090). On the other hand, in the
case in which there is not an event that has been newly registered (step
61080: No), the apparatus performance acquisition module 32300 terminates
the performance information acquisition processing.
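The state confirmation loop of the steps 61040 to 61080 can be sketched as follows. The row layout, the status strings, the `confirm_states` name, and the way an event ID is generated are assumptions for illustration and are not taken from the figures.

```python
def confirm_states(performance_rows, event_table, now):
    """Update each status and register an event whenever a status changes.

    Returns the newly registered events; a non-empty result corresponds to
    step 61080: Yes, after which the failure cause analysis would be instructed.
    """
    new_events = []
    for row in performance_rows:                                   # step 61040
        abnormal = row["value"] > row["threshold"]                 # step 61050
        new_status = "Threshold abnormality" if abnormal else "Normal"
        changed = new_status != row["status"]                      # step 61060
        row["status"] = new_status
        if changed:                                                # step 61070
            event = {
                "event_id": f"EV{len(event_table) + 1}",           # assumed ID scheme
                "apparatus_id": row["apparatus_id"],
                "device_id": row["device_id"],
                "metric": row["metric"],
                "status": new_status,
                "analyzed": False,
                "occurred_at": now,
            }
            event_table.append(event)
            new_events.append(event)
    return new_events
```

Note that an event is registered both when a value becomes abnormal and when it returns to normal, matching the description of a status change in either direction.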
[0104]
[0105]
Fig. 16 is a flowchart of a failure cause analysis processing in accordance
with a first embodiment. The failure cause analysis processing is
corresponded to a processing of the step 61090 of Fig. 15.
[0106]
An event analysis processing module 32500 acquires an entry related to an
event in which a value of an analyzed flag 33370 has not been set to be "Yes"
from the event management table 33300 (step 62010).
[0107]
In the next place, the event analysis processing module 32500 repeats the
processing of the step 62030 to each expansion rule in the expansion rule
repository 33500 (step 62020). The event analysis processing module 32500
calculates a certainty factor for an expansion rule of a processing target (a
certainty factor for a cause event of an expansion rule of a processing target),
that is, an occurrence rate in the past certain period of time of one or more
condition events that are included in an expansion rule of a processing target
(step 62030).
[0108]
The event analysis processing module 32500 subsequently sets the analyzed
flag 33370 of an entry that has been acquired in the step 62010 for the event
management table 33300 to be "Yes" (step 62050). The event analysis
processing module 32500 then creates an entry of the analysis result
management table 33600 in which a cause event of the expansion rule has
been specified as a candidate of a failure cause (a first cause event) for each
of the expansion rules in which a certainty factor that has been calculated in
the step 62030 is not 0 among the expansion rules in the expansion rule
repository 33500, and registers the created entry to the analysis result
management table 33600 (step 62060).
[0109]
In the next place, the event analysis processing module 32500 repeats the
processing from the step 62070 to the step 62100 to each expansion rule in
the expansion rule repository 33500 (step 62070). The event analysis
processing module 32500 determines whether or not a certainty factor that
has been calculated in the step 62030 for an expansion rule of a processing
target exceeds a certain value (step 62080).
[0110]
In the case in which a certainty factor exceeds a certain value (step 62080:
Yes), the event analysis processing module 32500 instructs an execution of
the plan expansion processing for an expansion rule of a processing target to
the plan expansion module 32700 (step 62090). By this plan expansion
processing, an expansion plan corresponded to an expansion rule of a
processing target is created, that is, in the case in which a cause event of an
expansion rule of a processing target is a failure cause, an expansion plan to
the failure cause is created.
[0111]
On the other hand, in the case in which a certainty factor does not exceed a
certain value (step 62080: No), the event analysis processing module 32500
does not execute the processing of the step 62090 for an expansion rule of a
processing target.
[0112]
After terminating the processing from the step 62070 to the step 62100 to
each expansion rule in the expansion rule repository 33500, the event
analysis processing module 32500 terminates the failure cause analysis
processing.
[0113]
For instance, the condition events of the expansion rule shown in Fig. 9A are
two events of an event corresponded to a threshold value abnormality of a
response time for the drive "/var" of the host computer HOST1 (hereafter
referred to as an event A) and an event corresponded to a threshold value
abnormality of an I/O amount in a unit time for the logical volume LU1 of the
storage apparatus SYS1 (hereafter referred to as an event B).
[0114]
In the case in which an entry related to the event B (an event provided with
an event ID of "EV3" in the example of Fig. 7) is registered to the event
management table 33300, the event analysis processing module 32500 refers
to the event management table 33300 after waiting for a certain period of
time and specifies an event that has occurred in the past certain period of
time.
[0115]
In the next place, the event analysis processing module 32500 calculates a
certainty factor (an occurrence rate of a condition event in the past certain
period of time) for an expansion rule ExRule1-1. As a result, since the event
A (an event provided with an event ID of EV6 in the example of Fig. 7) has
also occurred in the past certain period of time, a certainty factor for an
expansion rule ExRule1-1 is 100% (2/2 x 100).
[0116]
In the case in which a certainty factor that has been calculated as described
above exceeds a certain value, the event analysis processing module 32500
instructs an execution of the plan expansion processing to the plan expansion
module 32700 and makes the plan expansion module 32700 create an
expansion plan for a failure recovery. For instance, in the case in which the
above certain value is 30%, since a certainty factor for an expansion rule
ExRule1-1 is 100% and exceeds 30%, an expansion plan corresponded to the
expansion rule ExRule1-1 is created.
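For instance, the calculation of a certainty factor and the comparison of the step 62080 can be sketched as follows. This is only a minimal sketch: the function name, the string labels of the events, and the representation of an expansion rule as a list of condition events are assumptions made for illustration and are not defined in the embodiment.

```python
def certainty_factor(condition_events, occurred_events):
    """Occurrence rate (in percent) of a rule's condition events."""
    if not condition_events:
        return 0.0
    hits = sum(1 for ev in condition_events if ev in occurred_events)
    return hits / len(condition_events) * 100


# Hypothetical encoding of the expansion rule ExRule1-1 of Fig. 9A:
# its two condition events are the event A and the event B.
ex_rule_1_1 = ["event A", "event B"]

# Events found in the event management table for the past certain period.
recent_events = {"event A", "event B"}

THRESHOLD = 30.0  # the "certain value" of the step 62080 (30% in the example)

cf = certainty_factor(ex_rule_1_1, recent_events)
needs_plan_expansion = cf > THRESHOLD  # step 62080: Yes -> step 62090
```

With both condition events present, `cf` is 100.0 (2/2 x 100) and the plan expansion processing is instructed; if only the event A had occurred, the same function would give 50.0.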
[0117]
[0118]
Fig. 17 is a flowchart of a plan expansion processing in accordance with a
first embodiment. The plan expansion processing is corresponded to a
processing of the step 62090 of Fig. 16.
[0119]
A plan expansion processing module 32700 acquires an entry that has been
newly registered for the analysis result management table 33600 (hereafter
referred to as a newly registered entry) from the analysis result management
table 33600 (step 63010). The plan expansion processing module 32700
executes the processing from the following steps 63030 to 63090 to each of
the newly registered entries that have been acquired (step 63020).
[0120]
The plan expansion processing module 32700 acquires an expansion rule ID
that has been stored into the expansion rule ID 33650 from the newly
registered entry of a processing target of the analysis result management
table 33600. In the following, an expansion rule that is indicated by the
expansion rule ID that has been acquired here is referred to as an expansion
rule of a processing target. The plan expansion processing module 32700
then acquires a general rule ID that has been stored into the pre-expansion
general rule ID 33540 of the expansion rule of a processing target (step
63030). A general rule that is indicated by the general rule ID that has been
acquired here is a general rule that is a basis of an expansion rule of a
processing target.
[0121]
In the next place, the plan expansion processing module 32700 refers to the
rule plan correspondence management table 33900 and specifies one or more
general plans corresponded to a general rule that is a basis of an expansion
rule of a processing target. Moreover, the plan expansion processing module
32700 refers to the rule plan correspondence management table 33900 and
specifies an unsolved event corresponded to a combination of a general rule
that is a basis of an expansion rule of a processing target and the specified
general plan (step 63040).
[0122]
In the next place, the plan expansion processing module 32700 refers to the
volume topology management table 33200, creates one or more expansion
plans corresponded to an expansion rule of a processing target based on the
general plan that has been specified in the step 63040, and adds information
related to the created expansion plan to the expansion plan table 33800 (step
63050). For instance, in the case in which a general plan of a volume
migration is expanded, the plan expansion processing module 32700 specifies
all of the storage apparatuses 20000 that can be a migration destination
apparatus by referring to the volume topology management table 33200.
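The expansion of a general plan of a volume migration in the step 63050 can be sketched as follows. This is a minimal sketch under stated assumptions: the `ExpansionPlan` class, the `expand_volume_migration` function, the dictionary that stands in for the volume topology management table 33200, and the storage apparatus SYS3 are all hypothetical and appear only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpansionPlan:
    general_plan: str
    volume: str
    source: str
    destination: str


def expand_volume_migration(volume, source, topology):
    """Create one expansion plan per possible migration destination.

    `topology` is a hypothetical stand-in for the volume topology
    management table: it maps a storage apparatus to the apparatuses
    that can be a migration destination apparatus.
    """
    return [
        ExpansionPlan("volume migration", volume, source, dest)
        for dest in topology.get(source, [])
        if dest != source
    ]


# Assumed topology: SYS1 can migrate volumes to SYS2 or SYS3.
topology = {"SYS1": ["SYS2", "SYS3"]}
plans = expand_volume_migration("LU2", "SYS1", topology)
```

Each element of `plans` corresponds to one entry that would be added to the expansion plan table 33800.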
[0123]
In the next place, the plan expansion processing module 32700 repeatedly
executes the processing of the step 63070 and the step 63080 to each
expansion plan that has been created in the step 63050 (step 63060). The
plan expansion processing module 32700 refers to the apparatus performance
management table 33100, calculates a predicted value of a performance value
after the execution of the plan by simulating the situation after the execution
of the plan, and updates a value of the plan target 33840 of an expansion
plan of a processing target based on the result information of the simulation
(step 63070). In the next place, the plan expansion processing module 32700
instructs an execution of a plan post-execution risk extraction processing (see
Fig. 18) to the plan post-execution risk extraction module 32800 (step 63080).
At this time, the plan expansion processing module 32700 inputs an unsolved
event ID of an unsolved event related to an expansion plan of a processing
target, that is, an unsolved event corresponded to a combination of a general
rule that is a basis of an expansion rule of a processing target and a general
plan that is a basis of an expansion plan of a processing target to the plan
post-execution risk extraction module 32800.
[0124]
After terminating the processing from the step 63030 to the step 63090 to all
of the newly registered entries that have been acquired, the plan expansion
processing module 32700 instructs an execution of a plan presentation
processing (see Fig. 19) to the plan presentation module 32900 (step 63110).
After that, the plan expansion processing module 32700 terminates the plan
expansion processing.
[0125]
In the present embodiment, performance information, in particular a
response time of an I/O, is taken as an example, a predicted value of a
response time of an I/O is calculated by executing a simulation, and the
predicted value that has been obtained by the simulation is stored into the
plan target 33840 of the expansion plan table 33800. For instance, in the
case in which the expansion plan ExPlan1-1 is executed, data of the logical
volume LU2 is migrated from the storage apparatus SYS1 to the storage
apparatus SYS2. However, the predicted value is calculated based on a
response time of an I/O of each of the current migration target volume (the
logical volume LU2), a migration source apparatus (the storage apparatus
SYS1), and a migration destination apparatus (the storage apparatus SYS2)
that can be obtained from the apparatus performance management table
33100. Here, an example of a simulation method is described. A value that
is stored into the expansion plan table 33800 can also be other than a
performance value as long as the value can be an index representing the
characteristics of the plan. The management server 30000 can execute a
simulation similar to that of a performance value by storing information of a
cost taken for a plan execution and information of time required for a plan
execution into the volume topology management table 33200 or the
apparatus performance management table 33100.
[0126]
[0127]
Fig. 18 is a flowchart of a plan post-execution risk extraction processing in
accordance with a first embodiment. The plan post-execution risk extraction
processing is corresponded to a processing of the step 63080 of Fig. 17.
[0128]
A plan post-execution risk extraction module 32800 uses an unsolved event
ID that has been received from the plan expansion module 32700 to extract an
unresolved event from the actually occurring condition events that have been
registered to the reception event ID 33660 of the newly registered entry of
the analysis result management table 33600 (step 64010). Here, an
unresolved event is an event corresponded to a condition event that is
indicated by an unsolved event ID among condition events that have actually
occurred.
[0129]
In the next place, the plan post-execution risk extraction module 32800 refers
to the event management table 33300 and an expansion rule of a processing
target, and specifies an occurrence point (an apparatus and a device of an
occurrence source) of an unresolved event that has been extracted in the step
64010 (step 64020). In the next place, the plan post-execution risk extraction
module 32800 refers to the volume topology management table 33200, and
extracts, as a risk point, any one or more of an occurrence point of an
unresolved event and a point (an apparatus and a device) on an I/O path
related to the occurrence point (step 64030).
[0130]
In the case in which a risk point has been extracted in the step 64030 (step
64040: Yes), the plan post-execution risk extraction module 32800 stores data
that indicates the extracted risk point into the risk point 33890 of an
expansion plan of a processing target of the expansion plan table 33800 (step
64040), and terminates the plan post-execution risk extraction processing.
On the other hand, in the case in which a risk point has not been extracted in
the step 64030 (step 64040: No), the plan post-execution risk extraction
module 32800 terminates the plan post-execution risk extraction processing.
[0131]
The risk point 33890 of the expansion plan table 33800 of Fig. 12 has not
stored data that indicates a risk point since a risk point has not been
extracted. As a risk point, points on an I/O path that is indicated by an entry
of the volume topology management table 33200, such as a drive of the host
computer 10000, a controller 25000 of the storage apparatus 20000, and a
logical volume 24100 of the storage apparatus 20000, can be extracted for
instance.
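The processing from the step 64010 to the step 64030 can be sketched as follows. This is a minimal sketch, assuming that an occurrence point is represented as a pair of an apparatus and a device and that an I/O path is a list of such pairs; the function name, the argument names, the event IDs "cond1" and "cond2", and the controller name "CTL1" are hypothetical.

```python
def extract_risk_points(unsolved_ids, received_events, io_paths):
    """Return the risk points of one expansion plan.

    unsolved_ids    -- IDs of condition events the plan leaves unsolved
    received_events -- {condition event ID: (apparatus, device)} for the
                       condition events that have actually occurred
    io_paths        -- I/O paths, each a list of (apparatus, device) pairs
    """
    # Step 64010: keep only actually occurring events that are unsolved.
    unresolved = {
        eid: point
        for eid, point in received_events.items()
        if eid in unsolved_ids
    }
    risk_points = []
    for point in unresolved.values():
        # Step 64020: the occurrence point itself is a risk point.
        if point not in risk_points:
            risk_points.append(point)
        # Step 64030: so is every point on an I/O path through it.
        for path in io_paths:
            if point in path:
                for related in path:
                    if related not in risk_points:
                        risk_points.append(related)
    return risk_points


io_paths = [[("HOST1", "/var"), ("SYS1", "CTL1"), ("SYS1", "LU1")]]
received = {"cond1": ("HOST1", "/var"), "cond2": ("SYS1", "LU1")}

risk = extract_risk_points({"cond2"}, received, io_paths)
no_risk = extract_risk_points(set(), received, io_paths)
```

When no unsolved event ID is passed, the returned list is empty, which corresponds to the case in which nothing is stored into the risk point 33890.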
[0132]
[0133]
Fig. 19 is a flowchart of a plan presentation processing in accordance with a
first embodiment. The plan presentation processing is corresponded to a
processing of the step 63110 of Fig. 17.
[0134]
A plan presentation module 32900 acquires the information that indicates a
candidate of a failure cause and a certainty factor for a candidate of a failure
cause, that is, a cause apparatus ID 33610, a cause region ID 33620, a metric
33630, and a certainty factor 33640 from the analysis result management
table 33600 (step 65010).
[0135]
In the next place, the plan presentation module 32900 executes a processing
of the step 65030 to each newly registered entry of the analysis result
management table 33600. The plan presentation module 32900 acquires the
information related to one or more expansion plans to a failure cause that is
indicated by the newly registered entry of a processing target (exactly, a
candidate of a failure cause), that is, one or more expansion plans
corresponded to an expansion rule that is indicated by the newly registered
entry of a processing target (an expansion rule that is a candidate for a
failure recovery) (a first plan) from the expansion plan table 33800 (step
65030). The expansion rule that is indicated by the newly registered entry is
an expansion rule that is indicated by an expansion rule ID that has been
stored into the expansion rule ID 33650 of the newly registered entry.
[0136]
After terminating the processing of the step 65030 to all of the newly
registered entries, the plan presentation module 32900 executes the
processing from the step 65060 to the step 65080 to each newly registered
entry of the analysis result management table 33600. The plan presentation
module 32900 executes the processing of the step 65070 to each of one or
more expansion plans to a failure cause that is indicated by the newly
registered entry of a processing target (a failure cause of a processing target).
[0137]
In the step 65070, the plan presentation module 32900 calculates a score
value for an expansion plan of a processing target to a failure cause of a
processing target based on the execution result of an expansion plan that was
executed in the past and that is indicated by the plan execution history
management table 33950. Here, the score value is an index value that
indicates a possibility of succeeding in a failure recovery in the case in which
an expansion plan is executed, that is, a potential value of improving a
failure. For instance, the plan presentation module 32900 acquires all of the
entries corresponded to a combination of an expansion rule that is indicated
by the newly registered entry of a processing target and an expansion plan of
a processing target from the plan execution history management table 33950.
The plan presentation module 32900 then calculates a success rate in the
case in which an expansion plan of a processing target is executed to a failure
cause of a processing target based on data that indicates the success or
failure of a failure recovery of each of one or more entries that have been
acquired, more specifically, a rate of the number of entries in which "OK" has
been stored into the execution success or failure 33980 among the acquired
entries to the total number of the acquired entries as a score value.
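The calculation of a success rate in the step 65070 can be sketched as follows. This is a minimal sketch: the dictionary keys, the value "NG" for an unsuccessful execution, the expansion plan ExPlan1-2, and the history entries themselves are assumptions made for illustration and are not data of the embodiment.

```python
def success_rate(history, rule_id, plan_id):
    """Rate of "OK" entries among past executions of one rule/plan pair."""
    results = [
        entry["result"]
        for entry in history
        if entry["rule"] == rule_id and entry["plan"] == plan_id
    ]
    if not results:
        return None  # assumption: no history means no score in this sketch
    return results.count("OK") / len(results)


# Illustrative stand-in for the plan execution history management table.
history = [
    {"rule": "ExRule1-1", "plan": "ExPlan1-1", "result": "OK"},
    {"rule": "ExRule1-1", "plan": "ExPlan1-1", "result": "NG"},
    {"rule": "ExRule1-1", "plan": "ExPlan1-1", "result": "OK"},
    {"rule": "ExRule1-1", "plan": "ExPlan1-2", "result": "OK"},
]

score = success_rate(history, "ExRule1-1", "ExPlan1-1")
```

Here two of the three acquired entries store "OK", so the score value is 2/3.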
[0138]
In the present embodiment, a success rate is used as a score value. However,
for instance, a value (s) that is obtained by the expression 1 can also be a
score value. The expression 1 is an expression for dividing the execution
results in the plan execution history management table 33950 for every
predetermined period of time, weighting a success rate ($R_i$) that has been
calculated for every period of time with a weighted value ($1/2^i$) based on the
period of time, and obtaining the total sum of a success rate ($R_i/2^i$) after
weighting as a score value. In the expression 1, a success rate of a more
recent period of time is weighted more, and a score value is calculated in
such a manner that a value is higher to a more recent success. In the
expression 1, $R_i$ represents a success rate of a period of time from i hours ago
to (i+n) hours ago (n is a predetermined value, for instance 1).
$s = \sum_{i} (R_i / 2^i)$ . . . . (Expression 1)
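The weighting of the expression 1 can be sketched as follows. This is a minimal sketch, assuming that the success rates of the periods are given as a dictionary that maps the index i to the success rate of the period; the function name and the sample rates are hypothetical, and the range of i is left to the caller because the embodiment does not fix it here.

```python
def weighted_score(rates):
    """Sum of R_i / 2**i over the periods i, following the expression 1.

    `rates` maps the period index i (hours ago) to the success rate R_i
    of that period.
    """
    return sum(rate / 2 ** i for i, rate in rates.items())


# A success in the most recent period counts more than an old one.
recent_success = weighted_score({1: 1.0, 2: 0.0, 3: 0.0})
old_success = weighted_score({1: 0.0, 2: 0.0, 3: 1.0})
```

With these sample rates a success one hour ago contributes 1/2, whereas the same success three hours ago contributes only 1/8.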
[0139]
A score value is not restricted to a success rate or a success rate after
weighting, and can also be a value other than the success rates. For instance,
a value that considers the number of executions of an expansion plan in
addition to a success rate, that is, the number of the execution results in the
plan execution history management table 33950, can also be used as a score
value. Moreover, the number of executions of an expansion plan itself can
also be used as a score value. As an example of a case in which the
number of executions of an expansion plan in addition to a success rate is
considered, a score value can be decided in such a manner that a value is
higher when the number of executions is larger in the case in which success
rates are identical or similar. Moreover, a score value can be
decided in such a manner that a value is higher in the case in which a period
of time from when an expansion plan was executed and a failure was
improved to the present time is longer and a failure has not occurred again in
the period of time. Furthermore, the management server 30000
can prepare a plurality of kinds of calculation methods of a score value in
advance and switch a calculation method of a score value depending on a
state in an execution based on a predetermined policy.
[0140]
After terminating the processing from the step 65060 to the step 65080 to all
of the newly registered entries, the plan presentation module 32900 extracts
a combination of a failure cause and an expansion plan that are executed the
number of times equal to or larger than the predetermined number of times
in the past and in which a score value is equal to or larger than a
predetermined value from combinations of a failure cause and an expansion
plan that are a target of the processing of the step 65070 (a calculation
processing of a score value) (step 65100). In this case, the plan presentation
module 32900 can also extract, for instance, a combination of a failure cause
and an expansion plan in which the number of execution results in the plan
execution history management table 33950 is obviously large. An extraction
method is not restricted as long as the method can indicate the
characteristics of an expansion plan to a manager.
[0141]
In the next place, the plan presentation module 32900 determines whether or
not a combination in which a certainty factor for the failure cause is 100%
exists in combinations of a failure cause and an expansion plan that have
been extracted (step 65110).
[0142]
In the case in which a combination in which a certainty factor is 100% does
not exist (step 65110: No), the plan presentation module 32900 creates a plan
presentation screen (see Fig. 20) based on the information that indicates a
candidate of a failure cause that has been acquired in the step 65010, a
certainty factor for a candidate of a failure cause, the information related to
an expansion plan that is a candidate that has been acquired in the step
65030, and a score value for each expansion plan that has been calculated in
the step 65070, and displays the created plan presentation screen on the
output device 31200 (step 65120). For instance, in the plan presentation
screen, one or more expansion plans of expansion plans that are a candidate
(hereafter referred to as a presentation plan) are arranged and displayed in
an order from higher score value. A presentation plan is an expansion plan
in which a score value is equal to or larger than a predetermined value
among expansion plans that are a candidate for instance. After that, the
plan presentation module 32900 terminates the plan presentation processing.
[0143]
On the other hand, in the case in which a combination in which a certainty
factor is 100% exists (step 65110: Yes), the plan presentation module 32900
specifies an expansion plan that is included in a combination in which a score
value is highest in combinations in which a certainty factor is 100%, that is,
an expansion plan in which a score value is highest in expansion plans to a
failure cause in which a certainty factor is 100%. The plan presentation
module 32900 then instructs an execution of a plan execution processing (see
Fig. 21) for the specified expansion plan to the plan execution module 32910
(step 65130). By the plan execution processing, an expansion plan in which a
score value is highest in expansion plans to a failure cause in which a
certainty factor is 100% is automatically executed. After that, the plan
presentation module 32900 terminates the plan presentation processing.
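The branch from the step 65100 to the step 65130 can be sketched as follows. This is a minimal sketch: the dictionary keys, the function name, the threshold arguments, the plan IDs other than ExPlan1-1, and all sample values are assumptions made for illustration.

```python
def choose_action(candidates, min_runs, min_score):
    """Decide between an automatic execution and a presentation.

    Each candidate is a dict with the hypothetical keys 'plan',
    'certainty' (percent), 'score', and 'runs' (past executions).
    """
    # Step 65100: keep combinations executed often enough with a
    # score value equal to or larger than the predetermined value.
    extracted = [
        c for c in candidates
        if c["runs"] >= min_runs and c["score"] >= min_score
    ]
    # Step 65110: does a failure cause with a certainty factor of 100% exist?
    certain = [c for c in extracted if c["certainty"] == 100]
    if certain:
        # Step 65130: execute the plan with the highest score value.
        best = max(certain, key=lambda c: c["score"])
        return ("execute", [best["plan"]])
    # Step 65120: present the candidates in an order from higher score value.
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    return ("present", [c["plan"] for c in ranked if c["score"] >= min_score])


candidates = [
    {"plan": "ExPlan1-1", "certainty": 100, "score": 0.9, "runs": 12},
    {"plan": "ExPlan1-2", "certainty": 100, "score": 0.7, "runs": 15},
    {"plan": "ExPlan2-1", "certainty": 50, "score": 0.95, "runs": 20},
]
action = choose_action(candidates, min_runs=10, min_score=0.5)

uncertain = [dict(c, certainty=50) for c in candidates]
fallback = choose_action(uncertain, min_runs=10, min_score=0.5)
```

In the first call ExPlan1-1 is chosen for an automatic execution even though ExPlan2-1 has a higher score value, because only failure causes with a certainty factor of 100% are considered in that branch.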
[0144]
In the present embodiment, in the case in which a failure cause in which a
certainty factor is 100% exists, the management server 30000 automatically
executes an expansion plan in which a score value is highest to a failure
cause in which a certainty factor is 100%. However, a determination
standard of whether or not the automatic execution is done is not restricted
to that a certainty factor is 100%. For instance, in the case in which a
certainty factor is equal to or larger than a predetermined value (such as a
value close to 100%), the management server 30000 can automatically
execute an expansion plan (a second plan) in which a score value is highest to
a failure cause in which a certainty factor is equal to or larger than a
predetermined value. Moreover for instance, in the case in which a certainty
factor is equal to or larger than a predetermined value and the maximum
value of a score value (a score value for a second plan) for each of a plurality
of expansion plans to a failure cause in which a certainty factor is equal to or
larger than a predetermined value is equal to or larger than a predetermined
value, the management server 30000
from an expansion plan with a lower cost by clicking "Cost ($)" in the display
area 71010.
[0150]
The plan execution button 71020 is a button for instructing an execution of
an expansion plan that has been selected. In the case in which the button is
pressed, the management server 30000 issues an execution indication of an
expansion plan to a program that provides a function equivalent to an
expansion plan that has been selected. The program that has received the
execution indication of an expansion plan executes the expansion plan that
has been selected. Here, the program that executes an expansion plan is a
program in the memory 32000 of the management server 30000, such as a
volume migration program (not shown) and a VM migration program (not
shown).
[0151]
Moreover, the display area 71010 can also display the predicted value of a
performance value before an execution of an expansion plan and a
performance value after an execution of an expansion plan, which has been
stored into the plan target 33840 of the expansion plan table 33800 in
addition. Furthermore, a performance value and a predicted value of a
performance value can also be displayed in a graph form as trend
information.
[0152]
Fig. 20 is an example of a plan presentation screen. The display area 71010
can also display the information that indicates the characteristics of an
expansion plan other than a cost required for an execution of an expansion
plan and a time required for an execution of an expansion plan, for instance,
a score value calculated in the step 65070 in addition. Furthermore, other
display mode can also be adopted.
[0153]
[0154]
Fig. 21 is a flowchart of a plan execution processing in accordance with a first
embodiment.
[0155]
In the case in which one expansion plan is selected from the display area
71010 and the plan execution button 71020 is pressed in the plan
presentation screen, a plan execution module 32910 starts the execution of a
plan execution processing.
[0156]
In the first place, the plan execution module 32910 instructs an execution of
an expansion plan that has been selected to a program that provides a
function equivalent to the expansion plan that has been selected (step
67010). Here, a program that executes an expansion plan is a volume
migration program and a VM migration program for instance. A processing
that is executed by the program is identical or similar to a processing of the
conventional technique that is disclosed in cited literatures. Moreover, the
plan execution module 32910 can also avoid a competitive situation by using
a general mechanism for carrying out the execution sequence control and
competition avoidance in the case in which the processing is executed.
[0157]
In the next place, the plan execution module 32910 refers to an expansion
rule ID 33823 of the expansion plan table 33800, and specifies an expansion
rule corresponded to the expansion plan that has been selected (step 67020).
The plan execution module 32910 then extracts a condition event that is not
corresponded to an unsolved event related to the expansion plan that has
been selected from condition events of the specified expansion rule (step
67030). Here, the plan execution module 32910 refers to the rule plan
correspondence management table 33900 and specifies an unsolved event
corresponded to a combination of a general rule that is a basis of the specified
expansion rule and a general plan that is a basis of the selected expansion
plan as an unsolved event related to the selected expansion plan.
[0158]
The plan execution module 32910 executes a processing of the steps 67050
and 67060 to each condition event that has been extracted. In the first place,
the plan execution module 32910 instructs an execution of a confirmation
processing of whether or not a failure has been improved to the plan
execution result confirmation module 32920. The plan execution result
confirmation module 32920 that has received an instruction of an execution
of a confirmation processing asks whether or not a failure corresponded to a
condition event of a processing target has been improved, that is, whether or
not it is in a state in which a condition event of a processing target has not
occurred to a management target apparatus of an occurrence source of a
embodiment executes a test corresponded to the test case for each of the test
cases that have been registered to the test case repository 34100 before an
operation start for instance. The management server 30000 or a manager
then registers a test result, that is, an execution result of an expansion plan
to the plan execution history management table 33950. For instance, in the
case in which a test case that is indicated by a combination of an expansion
rule ExRule1-1 and an expansion plan ExPlan1-1 has been registered to the
test case repository 34100, the management server 30000 or a manager
creates a failure situation (a failure situation in which a cause event of the
expansion rule ExRule1-1 is a failure cause) in a pseudo fashion by
generating a condition event or a cause event of the expansion rule
ExRule1-1 in a pseudo fashion for instance, and executes the expansion plan
ExPlan1-1 under the situation. Moreover, the management server 30000 or a
manager registers data that indicates whether or not a recovery of a failure
in which a cause event of an expansion rule ExRule1-1 is a failure cause has
succeeded by an execution of the expansion plan ExPlan1-1 to the plan
execution history management table 33950. In the present embodiment,
an execution result of an expansion plan that has been obtained by the test is
also utilized in the case of a calculation of a score value.
[0180]
In the second embodiment, the management server 30000 adds a
combination of an expansion rule and an expansion plan in which the history
data is not sufficient as a test case to the test case repository 34100. The
management server 30000 or a manager then executes a test corresponded to
a test case that has been registered to the test case repository 34100 in an
introduction of the management server 30000 for instance, and registers a
test result to the plan execution history management table 33950. By this
configuration, sufficient history data can be ensured for all expansion plans,
and a bias in an execution history between expansion plans can be prevented.
Moreover, since a score value is calculated based on the sufficient history
data and the validity of the score value is ensured, the management server
30000 or a manager can select a more suitable expansion plan based on a score
value.
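The addition of test cases for combinations in which the history data is not sufficient can be sketched as follows. This is a minimal sketch, assuming that "not sufficient" means that the number of history entries of a combination is less than a fixed number; that criterion, the function name, the dictionary keys, the expansion plan ExPlan1-2, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def missing_test_cases(pairs, history, min_entries):
    """Return the rule/plan pairs whose history has too few entries."""
    counts = Counter((entry["rule"], entry["plan"]) for entry in history)
    return [pair for pair in pairs if counts[pair] < min_entries]


# All combinations of an expansion rule and an expansion plan (illustrative).
pairs = [("ExRule1-1", "ExPlan1-1"), ("ExRule1-1", "ExPlan1-2")]

# Illustrative stand-in for the plan execution history management table.
history = [
    {"rule": "ExRule1-1", "plan": "ExPlan1-1", "result": "OK"},
    {"rule": "ExRule1-1", "plan": "ExPlan1-1", "result": "OK"},
]

test_cases = missing_test_cases(pairs, history, min_entries=2)
```

The returned pairs correspond to the entries that would be added to the test case repository 34100, so that a bias in an execution history between expansion plans is reduced.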
[0181]
(3) Third embodiment
[0182]
In the next place, a third embodiment will be described. In the following
descriptions, a difference from the first embodiment will be described mainly,
and the descriptions of identical or similar composition elements, programs
provided with identical or similar functions, and tables provided with
identical or similar items will be omitted.
[0183]
As described in the second embodiment, in the case in which the history data
is insufficient, it is unclear whether or not the most suitable expansion plan
is selected based on a score value. Moreover, since the history data is less
likely to be increased for an expansion plan with low score value, an
expansion plan in which a high score value was calculated at first is likely to
be selected constantly after that. In the third embodiment, a computer
system is configured by a plurality of sub systems (a management unit of the
management server 30000, hereafter referred to as a domain), and the case
in which the management server 30000 is configured for every domain is
assumed. In the case in which a manager of another domain frequently
executes another expansion plan to a similar failure that has occurred for
another management target apparatus group that exists in that domain, it is
thought that the expansion plan is more suitable. In the present
embodiment, in the case in which a communication is executed between
management servers 30000 of a plurality of domains and the number of
histories of an expansion plan to the identical or similar failure is equal to or
larger than a certain number, a score value is calculated in consideration of
the configuration.
[0184]
Fig. 25 is a block diagram showing an example of a computer system in
accordance with a third embodiment.
[0185]
A computer system in accordance with the third embodiment is provided
with a plurality of management servers 30000 for managing each of a
plurality of domains and a plurality of WEB browser start-up servers 35000
that are display computers of each of a plurality of management servers
30000. The plurality of management servers 30000 is utilized by different
managers.
[0186]
Fig. 26 is a block diagram showing an example of a management server in
accordance with a third embodiment.
[0187]
The memory 32000 of the management server 30000 stores a computer
program of a history transmitter and receiver module 32950 additionally.
Moreover, the secondary storage device 33000 of the management server
30000 stores a management server list 34200 additionally.
[0188]
Fig. 27 is a block diagram showing an example of a plan execution history
management table 33950 in accordance with a third embodiment.
[0189]
The plan execution history management table 33950 in accordance with a
third embodiment includes two fields in addition to each field of the plan
execution history management table 33950 in accordance with the first
embodiment: an external reception 33995 that is a field for storing data that
indicates whether or not it is history data that has been received from the
management server 30000 of another domain, and a transmission source
server 33997 that is a field for storing data that indicates the management
server 30000 of a transmission source of the history data for the history data
that has been received from the management server 30000 of another
domain. For instance, in the case in which history data that is indicated
by an entry is history data that has been received from the management
server 30000 of another domain, that is, history data that has been obtained
by an execution of an expansion plan for another domain, "Yes" is stored into
the external reception 33995. In the case in which history data that is
indicated by an entry is not history data that has been received from the
management server 30000 of another domain, that is, in the case in which
history data that is indicated by an entry is history data that has been
obtained by an execution of an expansion plan for a domain (self-domain)
that is managed by the management server 30000 provided with the plan
execution history management table 33950, "NULL" is stored into the
external reception 33995.
[0190]
Fig. 28 is a block diagram showing an example of a management server list
in accordance with a third embodiment.
[0191]
A management server list 34200 includes a server ID 34210 that is a field for
storing data that indicates each of a plurality of management servers 30000
(hereafter referred to as a server ID) in the computer system and an IP
address 34200 that is a field for storing an IP address that has been allocated
to each of a plurality of management servers 30000 in the computer system.
[0192]
Fig. 29 is a flowchart of a plan execution history exchange processing in
accordance with a third embodiment.
[0193]
In Fig. 29, the processing from the step 69010 to the step 69060 is
corresponded to the processing of the history transmitter and receiver
module 32950 of the management server 30000 on a transmission side
(hereafter referred to as a transmission side module), and the processing
from the step 69070 to the step 69075 is corresponded to the processing of the
history transmitter and receiver module 32950 of the management server
30000 on a reception side (hereafter referred to as a reception side module).
[0194]
The transmission side module regularly or irregularly extracts one or more
entries in which an external reception field 33995 is not "Yes" from the plan
execution history management table 33950 of the management server 30000
on a transmission side (step 69010). The transmission side module then
classifies one or more extracted entries into one or more entry groups (step
69020). Here, an entry group is one or more entries in which a combination
of values of the expansion rule ID 33960 and the expansion plan ID 33970
corresponds with each other.
[0195]
The transmission side module executes the processing from the step 69030 to
the step 69060 to each of one or more entry groups.
[0196]
In the step 69040, the transmission side module determines whether or not
the number of entries that are included in an entry group of a processing
target is equal to or larger than a certain number. In the case in which the
number of entries that are included in an entry group of a processing target
is equal to or larger than a certain number (step 69040: Yes), the
transmission side module transmits data that includes all of data (history
data) that is indicated by each entry of an entry group of a processing target
(hereafter referred to as external history data) to all of other management
servers 30000 that have been registered to the management server list 34200
(step 69050).
[0197]
After terminating the processing from the step 69030 to the step 69060 to
each of one or more entry groups, the transmission side module terminates
the plan execution history exchange processing.
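The transmission side selection (steps 69010 to 69050) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the HistoryEntry class, its field names, the function select_groups_to_send and the parameter min_entries (standing in for the "certain number") are hypothetical, and the actual transmission to the servers in the management server list is left out.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HistoryEntry:
    """Hypothetical row of the plan execution history management table 33950."""
    expansion_rule_id: str                     # expansion rule ID 33960
    expansion_plan_id: str                     # expansion plan ID 33970
    succeeded: bool                            # success or failure of the recovery
    executed_at: str                           # execution date and time 33990
    external_reception: Optional[str] = None   # "Yes", or None for NULL (33995)


def select_groups_to_send(table, min_entries):
    """Return the entry groups that qualify for transmission."""
    # Step 69010: keep only entries that were not received from another domain.
    own = [e for e in table if e.external_reception != "Yes"]

    # Step 69020: group entries sharing the same (rule ID, plan ID) pair.
    groups = defaultdict(list)
    for entry in own:
        groups[(entry.expansion_rule_id, entry.expansion_plan_id)].append(entry)

    # Step 69040: a group qualifies only if it holds at least min_entries entries.
    # Step 69050 (sending each qualifying group to every other server) is omitted.
    return {key: rows for key, rows in groups.items() if len(rows) >= min_entries}
```

Excluding entries whose external reception is "Yes" keeps a server from forwarding history data that it has itself received from another domain.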
[0198]
The reception side module of each management server 30000 that has
received the external history data executes the processing from the step
69071 to the step 69075 to each entry that indicates history data included in
the external history data.
[0199]
In the first place, the reception side module extracts one or more entries in
which a combination of values of the expansion rule ID 33960 and the
expansion plan ID 33970 corresponds with each other from the plan
execution history management table 33950 of the management server 30000
on a reception side (hereafter referred to as a reception side history
management table) (step 69072).
[0200]
In the next place, the reception side module determines whether or not one or
more extracted entries include an entry in which a combination of a
transmission source server 33997 and the execution date and time 33990
corresponds with that of an entry of a processing target (step 69073). In the
case in which an entry that corresponds with an entry of a processing target
is not included (step 69073: No), the reception side module registers an entry
of a processing target to the reception side history management table (step
69074). In this case, the external reception 33995 of an entry that is
registered stores "Yes", and the transmission source server 33997 of an entry
that is registered stores a server ID of the management server 30000 on a
transmission side that is managed by the management server list 34200. On
the other hand, in the case in which an entry that corresponds with an entry
of a processing target is included (step 69073: Yes), the reception side module
does not execute a registration of an entry of a processing target to the
reception side history management table.
[0201]
After terminating the processing from the step 69071 to the step 69075 to
each entry that indicates history data included in the external history data,
the reception side module terminates the plan execution history exchange
processing.
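The reception side processing (steps 69071 to 69075) can be sketched in the same manner. This is an illustrative sketch only: the HistoryEntry class, its field names and the function register_external_history are hypothetical, and the table is modeled as a plain in-memory list.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class HistoryEntry:
    """Hypothetical row of the plan execution history management table 33950."""
    expansion_rule_id: str                     # expansion rule ID 33960
    expansion_plan_id: str                     # expansion plan ID 33970
    succeeded: bool                            # success or failure of the recovery
    executed_at: str                           # execution date and time 33990
    external_reception: Optional[str] = None   # "Yes", or None for NULL (33995)
    source_server: Optional[str] = None        # transmission source server 33997


def register_external_history(table, received, sender_id):
    """Register received entries that are not yet in the table; return the count."""
    added = 0
    for entry in received:
        # Step 69072: extract entries with the same (rule ID, plan ID) pair.
        candidates = [
            e for e in table
            if e.expansion_rule_id == entry.expansion_rule_id
            and e.expansion_plan_id == entry.expansion_plan_id
        ]
        # Step 69073: look for the same source server and execution date and time.
        duplicate = any(
            e.source_server == sender_id and e.executed_at == entry.executed_at
            for e in candidates
        )
        # Step 69074: register only entries that are not already present.
        if not duplicate:
            table.append(
                replace(entry, external_reception="Yes", source_server=sender_id)
            )
            added += 1
    return added
```

Because the duplicate check uses the combination of the transmission source server and the execution date and time, receiving the same external history data twice leaves the table unchanged.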
[0202]
In the case in which the management server 30000 in accordance with the
present embodiment calculates a score value in the step 65070 of Fig. 19, the
management server 30000 calculates a score value while also utilizing
history data that has been registered to the plan execution history
management table 33950 by the plan execution history exchange processing,
that is, history data that has been received from the management server
30000 of other domain in addition to history data that has been obtained for
a self-domain. The management server 30000 can also calculate a score
value while handling history data that has been received from the
management server 30000 of other domain similarly to history data that has
been obtained for a self-domain, or can calculate a score value while
distinguishing history data that has been received from the management
server 30000 of other domain from history data that has been obtained for a
self-domain. Moreover, it is also possible that the management server 30000
does not utilize history data that has been received from a specific
management server 30000 of a plurality of management servers 30000 of
other domain, such as a management server 30000 of a domain of a different
operation form, for a calculation of a score value.
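The three ways of handling external history data described above can be illustrated as follows. This is a sketch under assumptions: the score value is assumed here to be a (weighted) success ratio, which this section does not specify, and the names HistoryEntry, score, external_weight and excluded_servers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HistoryEntry:
    """Hypothetical history record for one execution of an expansion plan."""
    succeeded: bool
    source_server: Optional[str] = None   # None means self-domain history


def score(entries, external_weight=1.0, excluded_servers=frozenset()):
    """Assumed score: weighted ratio of successful executions.

    external_weight=1.0 treats external history like self-domain history;
    another value distinguishes the two; excluded_servers drops history
    received from specific management servers.
    """
    total = 0.0
    success = 0.0
    for entry in entries:
        if entry.source_server in excluded_servers:
            continue   # history from this server is not utilized
        weight = 1.0 if entry.source_server is None else external_weight
        total += weight
        if entry.succeeded:
            success += weight
    return success / total if total else 0.0
```

With three successes and one failure in the self-domain and four failures received from a server "B", this sketch yields 0.375 when both kinds of history are treated alike, 0.5 with external_weight=0.5, and 0.75 when "B" is excluded.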
[0203]
Fig. 30 is a block diagram showing an example of a plan presentation screen
in accordance with a third embodiment.
[0204]
A plan presentation screen in accordance with the third embodiment further
displays data related to an execution history about the expansion plan for
every expansion plan in a display area 71010 of a plan presentation screen in
accordance with the first embodiment (Fig. 20). The data related to an
execution history includes the total number of execution histories including
an execution history that has been obtained for a self-domain and an
execution history that has been received from the management server 30000
of other domain, the number of execution histories that have been received
from the management server 30000 of other domain among the total number
of execution histories, and the number of the management servers 30000 of
other domain that have transmitted an execution history for instance. From
data related to an execution history about the first expansion plan (an
expansion plan in which "#" is "1") for instance, it can be known that the
expansion plan has been executed 100 times in total and has been executed
20 times in number for three other domains. The data related to an
execution history can include the information that specifically indicates a
domain of a management server 30000 in which the presented expansion
plan has been executed, for instance. Fig. 30 is an example of a plan
presentation screen, and a display form is not restricted to one shown in Fig.
30 as long as a screen in which a manager can understand a degree of the
breakdown of an execution history is adopted.
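The breakdown displayed for each expansion plan can be derived from the history entries as sketched below. This is an illustration only: the HistoryEntry class and the function summarize_history are hypothetical, and a self-domain entry is assumed to carry no transmission source server.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HistoryEntry:
    """Hypothetical history record for one execution of an expansion plan."""
    succeeded: bool
    source_server: Optional[str] = None   # None means self-domain history


def summarize_history(entries):
    """Count total executions, external executions, and contributing domains."""
    external = [e for e in entries if e.source_server is not None]
    return {
        "total": len(entries),
        "external": len(external),
        "external_domains": len({e.source_server for e in external}),
    }
```

For 80 self-domain entries and 20 entries received from three other management servers, the summary is a total of 100, of which 20 are external from 3 domains, which matches the example given for the first expansion plan.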
[0205]
In accordance with the third embodiment, the management server 30000
scores an expansion plan while also utilizing history data that has been
received from the management server 30000 of other domain in addition to
history data that has been obtained for a self-domain. The management
server 30000 determines whether or not automatic coping is possible
depending on a certainty factor and a score value for a failure cause. In the
case in which automatic coping is possible, the management server 30000 can
carry out a failure recovery by automatically executing an expansion plan in
which a score value is highest. The management server 30000 can obtain an
approval of a manager before automatically executing an expansion plan. In
the case in which automatic coping is impossible, the management server
30000 arranges and displays data that indicates a plurality of expansion
plans to a failure cause in an order from an expansion plan with a higher
score value and presents the data to a manager. By this configuration, the
management server 30000 or a manager can rapidly select a suitable
expansion plan depending on the past actual achievement based on a score
value that has been calculated while utilizing not only history data that has
been obtained for a self-domain but also history data that has been obtained
for other domain, thereby reducing an operation management cost for a
failure recovery.
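The decision between automatic coping and presentation to a manager can be sketched as follows. This is a minimal sketch under assumptions: automatic coping is taken to be possible when both the certainty factor and the highest score value reach predetermined values, and the names ScoredPlan, decide, certainty_threshold and score_threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredPlan:
    """Hypothetical pairing of an expansion plan with its score value."""
    plan_id: str
    score: float


def decide(certainty, plans, certainty_threshold, score_threshold):
    """Return (plan to execute automatically or None, plans in display order)."""
    # Plans are arranged in order from the highest score value.
    ranked = sorted(plans, key=lambda p: p.score, reverse=True)
    # Assumed condition: both values must reach their predetermined thresholds.
    if (
        ranked
        and certainty >= certainty_threshold
        and ranked[0].score >= score_threshold
    ):
        return ranked[0], ranked
    # Automatic coping is impossible: only the ranked list is presented.
    return None, ranked
```

Returning the ranked list in both cases allows the same result to be used for obtaining the approval of a manager before execution.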
[0206]
The present invention is not restricted to the embodiments described
above, and it is obvious that various changes and modifications can be
made without departing from the scope of the present invention.
[Reference Signs List]
[0207]
10000: Host computer
20000: Storage apparatus
30000: Management server
35000: WEB browser start-up server
40000: IP switch
45000: Communication network
WE CLAIM:
[Claim 1]
A management program for causing a computer that configures a
management system configured to manage a computer system comprising a
plurality of management target devices, to execute the following:
executing a cause analysis of an event that has occurred in any one of
the plurality of management target devices and specifying a first cause event
that is a candidate of a cause of the event that has occurred based on one or
more rules indicating a correspondence relationship between a cause event
related to any one of the plurality of management target devices and one or
more condition events related to any one of the plurality of management
target devices that is a condition under which the cause event is a cause;
specifying a plurality of first plans that can be executed in the case in
which the first cause event is a cause, based on plan information indicating a
correspondence relationship between the rule and a plan that is a recovery
measure that can be executed in the case in which a cause event of the rule is
a cause;
calculating an index value indicating a possibility of succeeding in a
failure recovery in the case in which the plan is executed for each of the
plurality of first plans, based on plan history information indicating the
success or failure of a failure recovery by an execution of the plan, each
time the plan is executed; and
displaying data indicating any one or more plans of the plurality of
first plans according to a display mode decided based on the index value.
[Claim 2]
The management program according to claim 1, being configured to cause the
computer to execute the following:
extracting one or more plans in which the index value is equal to or
larger than a predetermined value from the plurality of first plans and
displaying data indicating the one or more extracted plans.
[Claim 3]
The management program according to claim 2, being configured to cause the
computer to execute the following:
arranging and displaying the one or more extracted plans in an order
from larger index value.
[Claim 4]
The management program according to claim 3, being configured to cause the
computer to execute the following:
calculating a certainty factor indicating the certainty of the cause
event being the cause for each cause event of one or more rules in the cause
analysis of an event that has occurred, and specifying the first cause event
based on the certainty factor; and
executing a second plan in which the index value is largest for the
plurality of first plans in the case in which the certainty factor of the first
cause event is equal to or larger than a predetermined value.
[Claim 5]
The management program according to claim 4, being configured to cause the
computer to execute the following:
executing the second plan in the case in which the certainty factor of
the first cause event is equal to or larger than a predetermined value and the
index value of the second plan is equal to or larger than a predetermined
value.
[Claim 6]
The management program according to claim 5, being configured to cause the
computer to execute the following:
after executing one plan of the plurality of first plans, adding data
indicating the success or failure of a failure recovery by an execution of the
one plan to the plan history information.
[Claim 7]
The management program according to claim 6,
wherein the plan history information includes a plurality of history
elements indicated by associating a rule including a cause event specified as
a candidate of a cause in the past, a plan executed in the case in which a
cause event of the rule is specified as a candidate of a cause, and the success
or failure of a failure recovery by an execution of the plan with each other,
and
wherein the management program is configured to cause the
computer to execute the following:
determining whether or not, for every combination of one rule
of one or more rules and one plan corresponded to the rule, history elements
related to a combination of which the number is equal to or larger than a
predetermined number are included in the plan history information based on
the plan information and the plan history information; and
under a failure situation in which a cause event of the rule
that configures the combination is a cause, for a combination in which history
elements of which the number is equal to or larger than the predetermined
number are not included, executing a test for executing a plan that
configures the combination, creating a history element related to the
combination based on the result of the test, and adding the created history
element to the plan history information.
[Claim 8]
The management program according to claim 7,
wherein the plan history information includes a plurality of history
elements indicated by associating a rule including a cause event specified as
a candidate of a cause in the past, a plan executed in the case in which a
cause event of the rule is specified as a candidate of a cause, and the success
or failure of a failure recovery by an execution of the plan with each other,
and
wherein the management program is configured to cause the
computer to execute the following:
in the case in which history elements, of which the number is
equal to or larger than a predetermined number, related to a combination of
a rule indicated by a first history element included in the plan history
information and a plan indicated by the first history element are included in
the plan history information, transmitting data including a history element
related to the combination to a management system configured to manage a
computer system different from the computer system; and
in the case in which the data including a history element is
received from the management system configured to manage a computer
system different from the computer system, adding a history element
included in the received data to the plan history information.
[Claim 9]
The management program according to claim 8,
wherein the rule has a general rule in which a management target
device related to the cause event and the condition event is represented by a
type of the management target device and an expansion rule in which a type
of a management target device related to the cause event and the condition
event is represented by data indicating a specific management target device,
wherein the plan has a general plan that is a recovery measure in the
form independent of an actual configuration of the computer system and an
expansion plan that is a recovery measure obtained by expanding the general
plan in consideration of an actual configuration of the computer system,
wherein the plan information indicates a correspondence relationship
between the general rule and the general plan that can be executed in the
case in which a cause event of the general rule is a cause,
wherein the plan history information indicates the success or failure
of a failure recovery by an execution of the expansion plan each time the
expansion plan is executed, and includes a plurality of history elements
indicated by associating an expansion rule including a cause event specified
as a candidate of a cause in the past, an expansion plan executed in the case
in which a cause event of the expansion rule is specified as a candidate of a
cause, and the success or failure of a failure recovery by an execution of the
expansion plan with each other, and
wherein the management program is configured to cause the
computer to execute the following:
creating a plurality of expansion rules based on connection
information indicating a connection relationship between the plurality of
management target devices and the general rule;
specifying the first cause event based on the certainty factor
calculated for each cause event of the plurality of created expansion rules in
the cause analysis of an event that has occurred; and
specifying a general plan corresponded to a general rule that
is a basis of an expansion rule including the first cause event based on the
plan information and specifying each of a plurality of expansion plans
created by expanding the specified general plan as the first plan.
[Claim 10]
A management system configured to manage a computer system provided
with a plurality of management target devices, comprising:
a storage device; and
a control device coupled to the storage device,
the storage device being configured to store
one or more rules indicating a correspondence relationship
between a cause event related to any one of the plurality of management
target devices and one or more condition events related to any one of the
plurality of management target devices that is a condition under which the
cause event is a cause;
plan information indicating a correspondence relationship
between the rule and a plan that is a recovery measure that can be executed
in the case in which a cause event of the rule is a cause; and
plan history information indicating the success or failure of a
failure recovery by an execution of the plan each time the plan is executed,
and
the control device being configured to:
execute a cause analysis of an event that has occurred in any
one of the plurality of management target devices and specify a first cause
event that is a candidate of a cause of the event that has occurred based on
the one or more rules;
specify a plurality of first plans that can be executed in the
case in which the first cause event is a cause based on the plan information;
calculate an index value indicating a possibility of succeeding
in a failure recovery in the case in which the plan is executed for each of the
plurality of first plans based on the plan history information; and
display data indicating any one or more plans of the plurality
of first plans according to a display mode decided based on the index value.
[Claim 11]
The management system according to claim 10,
wherein the control device is configured to arrange and display any
one or more plans of the plurality of first plans in an order from larger index
value.
[Claim 12]
The management system according to claim 10,
wherein the control device is configured to:
calculate a certainty factor indicating the certainty of the
cause event being the cause for each cause event of one or more rules in the
cause analysis of an event that has occurred and specify the first cause event
based on the certainty factor; and
execute a plan in which the index value is largest for the
plurality of first plans in the case in which the certainty factor of the first
cause event is equal to or larger than a predetermined value.
[Claim 13]
The management system according to claim 10,
wherein the plan history information includes a plurality of history
elements indicated by associating a rule including a cause event specified as
a candidate of a cause in the past, a plan executed in the case in which a
cause event of the rule is specified as a candidate of a cause, and the success
or failure of a failure recovery by an execution of the plan with each other,
and
wherein the control device is configured to:
determine whether or not, for every combination of one rule of
one or more rules and one plan corresponded to the rule, history elements
related to a combination of which the number is equal to or larger than a
predetermined number are included in the plan history information based on
the plan information and the plan history information; and
under a failure situation in which a cause event of the rule
that configures the combination is a cause, for a combination in which history
elements of which the number is equal to or larger than the predetermined
number are not included, execute a test for executing a plan that configures
the combination, create a history element related to the combination based
on the result of the test, and add the created history element to the plan
history information.
[Claim 14]
The management system according to claim 10,
wherein the plan history information includes a plurality of history
elements indicated by associating a rule including a cause event specified as
a candidate of a cause in the past, a plan executed in the case in which a
cause event of the rule is specified as a candidate of a cause, and the success
or failure of a failure recovery by an execution of the plan with each other,
and
wherein the control device is configured to:
in the case in which history elements, of which the number is
equal to or larger than a predetermined number, related to a combination of
a rule indicated by a first history element included in the plan history
information and a plan indicated by the first history element are included in
the plan history information, transmit data including a history element
related to the combination to a management system configured to manage a
computer system different from the computer system; and
in the case in which the data including a history element is
received from the management system configured to manage a computer
system different from the computer system, add a history element included
in the received data to the plan history information.
[Claim 15]
The management system according to claim 10,
wherein the rule is a general rule in which a management target
device related to the cause event and the condition event is represented by a
type of the management target device,
wherein the plan information indicates a correspondence relationship
between the general rule and a general plan that is a recovery measure that
can be executed in the case in which a cause event of the general rule is a
cause and that is a recovery measure in the form independent of an actual
configuration of the computer system,
wherein the plan history information indicates the success or failure
of a failure recovery by an execution of the expansion plan each time an
expansion plan that is a recovery measure obtained by expanding the general
plan in consideration of an actual configuration of the computer system is
executed,
wherein the storage device is further configured to store connection
information indicating a connection relationship between the plurality of
management target devices, and
wherein the control device is configured to:
create a plurality of expansion rules in which a type of a
management target device related to the cause event and the condition event
is represented by data indicating a specific management target device based
on the connection information and the general rule;
specify the first cause event based on the plurality of
expansion rules created based on one or more general rules in the cause
analysis of an event that has occurred; and
specify a general plan corresponded to a general rule that is a
basis of an expansion rule including the first cause event based on the plan
information and specify each of a plurality of expansion plans created by
expanding the specified general plan as the first plan.