STATUS

September 2008: Project Conclusion

The possibility of creating self-healing software in order to guarantee the reliability and availability of software systems and services is one of the main challenges for research. The WS-DIAMOND project is a first step towards self-healing software and specifically self-healing Web Services.

A self-healing Web Service is able to monitor itself, to diagnose the causes of a failure and to recover from it, where a failure can be either functional, such as the inability to provide a given service, or non-functional, such as a loss of service quality. Self-healing can be performed at the level of a single service, and at a more global level, with support to identify critical misbehaviour of groups of services and to provide Web Services with reaction mechanisms for global-level failures. The focus of WS-DIAMOND is on composite and conversationally complex Web Services, where composite means that a Web Service relies on the integration of various other services, while conversationally complex means that during service provision a Web Service needs to carry out a complex interaction with the consumer application, in which several conversational turns are exchanged between them.

In the project we tackled two main issues:

In order to achieve these goals, we carried out research in different areas such as Semantic Web languages (for describing service properties, i.e., models), service composition techniques (for describing service interaction) as well as model-based reasoning and diagnosis. As regards the latter, in particular, recent results and techniques for diagnosing and repairing complex physical devices (or recovering from their failures), and for designing easily diagnosable and reliable devices, are being adapted and modified in order to be applied to Web Service networks.


On-line support

During the project we achieved the following main results:


Off-line support

During the project we achieved the following main results:


The results we obtained go significantly beyond the state of the art: currently no platform for complex service execution supports any activity for diagnosis, repair planning and recovery, and thus service execution is not self-healable.

Project Results

Platform for supporting self healing execution

Self-Healing BPEL (SH-BPEL) is a runtime platform that can be used to test, run, and repair BPEL processes. With SH-BPEL, repair actions do not have to be defined at design time and included in the process; they are executed outside the process flow. This simplifies the design work, since no complex fault handling strategies have to be designed. With SH-BPEL it is also possible to test BPEL processes before their deployment. Testing is performed by executing the process in a protected environment in which it is possible to simulate faults and repair activities. Faults can be injected using the Fault Injector. The faults that can be injected are related to data quality and performance. In detail, typos, missing data, and misalignments between different data sources are data quality faults. With respect to QoS performance, the tool is able to inject delays. Different probability distributions can be set to characterize the selection of the data to be altered.
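
The kind of fault injection described above can be illustrated with a minimal Python sketch. It is not the SH-BPEL Fault Injector: the function names, the message format (a flat dictionary of strings), the uniform choice between typo and missing data, and the exponential delay distribution are all assumptions made for illustration only.

```python
import random

# Hypothetical sketch, not the SH-BPEL Fault Injector API.

def inject_typo(value: str, rng: random.Random) -> str:
    """Swap two adjacent characters to simulate a typo."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    chars = list(value)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def inject_faults(message: dict, fault_rate: float, rng: random.Random) -> dict:
    """Return a copy of `message` in which each field is, with probability
    `fault_rate`, either dropped (missing data) or altered (typo)."""
    faulty = {}
    for key, value in message.items():
        if rng.random() < fault_rate:
            if rng.random() < 0.5:
                continue  # missing data: the field is dropped
            value = inject_typo(str(value), rng)
        faulty[key] = value
    return faulty

def sample_delay(mean_seconds: float, rng: random.Random) -> float:
    """Draw an injected delay (assumed exponentially distributed)."""
    return rng.expovariate(1.0 / mean_seconds)

rng = random.Random(42)  # fixed seed so that a test run is repeatable
order = {"item": "pasta", "quantity": "2", "address": "Via Ponzio"}
print(inject_faults(order, fault_rate=0.5, rng=rng))
print(round(sample_delay(2.0, rng), 3), "seconds of injected delay")
```

Seeding the random generator keeps a test session reproducible, which is useful when the same fault has to be replayed against different repair actions.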


Contact details: Barbara Pernici, Dipartimento di Elettronica e Informazione, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, e-mail: barbara.pernici(_at_)polimi.it


Decentralized Diagnostic algorithm

The diagnostic algorithm developed in the project carries out model-based diagnosis of a system in a distributed context.

Performing model-based diagnosis means finding the causes of a malfunction starting from a model that describes the intended behavior of the system under analysis. The approach investigated in the project, and applied to the case of composite Web Services, starts from a description of the system in terms of finite-domain constraints. Such a description should describe the system in terms of components and relations (e.g. input/output connections) between them. It must include the expected behavior of each component, and state which of the component outputs would be affected by its failure.

The algorithm introduces as a novelty the possibility of carrying out a decentralized diagnosis with a distributed model. This means that there is no need to have a complete system description nor to have completed the system composition in order to deploy the diagnostic service. For a complex system composed of several interacting parts, each part can have its own diagnoser (called a Local Diagnoser) which is the sole owner of the subsystem description, without having to disclose information on the internal behavior to third parties. A diagnostic Supervisor is associated with the complex system and, having knowledge solely of the links between the interacting parts, can compare and merge the diagnostic hypotheses coming from the Local Diagnosers, querying them when necessary, in order to reach a global solution.

The different diagnosers involved in a diagnostic session can be activated at run-time only when necessary, and even the way in which the subsystems are composed, or the identity of such subsystems, can dynamically change from one diagnostic session to the other.

No persistence mechanism is needed, since in principle each invocation of a Local Diagnoser is independent of the others, even within the same diagnostic session, although persistence can improve the performance of the algorithm.
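
The division of labour between Local Diagnosers and Supervisor can be sketched as follows. This is a strongly simplified illustration and not the project's algorithm: the real diagnosers reason on finite-domain constraints, whereas here each Local Diagnoser is a hypothetical function that returns ready-made hypotheses, the composition is assumed to be acyclic, and all names are invented.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

# Hypothetical data model (an assumption of this sketch).
# A local hypothesis: (faulty local components, inputs blamed on other parts).
Hypothesis = Tuple[FrozenSet[str], FrozenSet[str]]
# A Local Diagnoser maps a faulty output name to the hypotheses explaining it.
LocalDiagnoser = Callable[[str], List[Hypothesis]]

def supervise(output: str,
              owner_of: Dict[str, str],
              diagnosers: Dict[str, LocalDiagnoser]) -> List[FrozenSet[str]]:
    """Explain a faulty `output` using only link knowledge (`owner_of`) and
    the answers of the Local Diagnosers; returns the minimal global diagnoses."""
    results: List[FrozenSet[str]] = []
    for faults, blamed in diagnosers[owner_of[output]](output):
        partial = [faults]
        for link in sorted(blamed):
            # The Supervisor queries the owner of the blamed link on demand.
            upstream = supervise(link, owner_of, diagnosers)
            partial = [p | u for p in partial for u in upstream]
        results.extend(partial)
    unique = set(results)
    minimal = [d for d in unique if not any(other < d for other in unique)]
    return sorted(minimal, key=sorted)

def shop(output: str) -> List[Hypothesis]:
    # Private model of the shop: either its billing is faulty,
    # or the "supply" input it received was already wrong.
    return [(frozenset({"shop.billing"}), frozenset()),
            (frozenset(), frozenset({"supply"}))]

def warehouse(output: str) -> List[Hypothesis]:
    return [(frozenset({"warehouse.picking"}), frozenset())]

owner_of = {"invoice": "shop", "supply": "warehouse"}
diagnosers = {"shop": shop, "warehouse": warehouse}
print(supervise("invoice", owner_of, diagnosers))
```

Note that the Supervisor never sees the internal models: it only knows which part produces which link, and each call to a Local Diagnoser is stateless, in line with the remark on persistence above.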


Contact details: Claudia Picardi, picardi(_at_)di.unito.it. c/o Dipartimento di Informatica, University of Torino. Corso Svizzera 189, 10145, Torino, Italy


Decentralized chronicle-based diagnostic algorithm

The prototype is dedicated to monitoring and diagnosing distributed systems.

This tool makes it possible to monitor a distributed system, to detect any faulty situation, to diagnose it and possibly to propose a repair action. The main idea is to have one CRS engine per component, computing a local diagnosis. The synchronization of these local diagnoses is ensured by a broker service in charge of global repair.
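
To give an idea of what recognizing a chronicle means, the following Python sketch matches a very restricted form of chronicle against a timestamped event log. It does not reproduce CRS: we assume here a purely sequential chronicle in which each event must follow the previous one within a given delay interval, whereas real chronicles allow partially ordered events and richer constraints. The function and event names are hypothetical.

```python
from typing import List, Tuple

Event = Tuple[str, float]        # (event name, timestamp)
Step = Tuple[str, float, float]  # (event name, min delay, max delay) w.r.t. the previous step

def recognize(chronicle: List[Step], log: List[Event]) -> bool:
    """Return True if the events of `chronicle` occur in `log` (assumed sorted
    by time) in order, each within its [min, max] delay after the previously
    matched event. The delays of the first step are ignored."""
    def match(step: int, last_time: float, start: int) -> bool:
        if step == len(chronicle):
            return True
        name, lo, hi = chronicle[step]
        for i in range(start, len(log)):
            event, time = log[i]
            if event == name and (step == 0 or lo <= time - last_time <= hi):
                if match(step + 1, time, i + 1):
                    return True
        return False
    return match(0, 0.0, 0)

log = [("order", 0.0), ("ack", 1.0), ("order", 5.0), ("timeout", 12.0)]
# Faulty situation: an order followed by a timeout 5 to 10 time units later.
late_delivery = [("order", 0.0, 0.0), ("timeout", 5.0, 10.0)]
# Nominal situation: an order acknowledged within half a time unit.
fast_ack = [("order", 0.0, 0.0), ("ack", 0.0, 0.5)]
print(recognize(late_delivery, log), recognize(fast_ack, log))
```

In the decentralized setting, one such recognizer would run per component on the local log only, and the recognized chronicles would be reported to the broker.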

CRS is among the most efficient tools to deal with complex real-time situations, involving temporal constraints and heterogeneous events. Industrial companies increasingly need to deal with applications that are inherently distributed, which CRS alone could hardly tackle: our distributed CRS platform will therefore emerge as a relevant tool to increase the competitiveness of such companies. Application areas are not limited to Web service development, but range from embedded systems to supply chain management and robotics applications.


Contact details: Sophie Robin, robin(_at_)irisa.fr and Marie-Odile Cordier, cordier(_at_)irisa.fr. IRISA, Campus de Beaulieu, 35042 Rennes Cedex, France


QoS analysis tool

Two different communication levels have been covered by the QoS demonstrators: the HTTP and the SOAP levels. Both solutions can be more generally exploited for different purposes including monitoring and diagnosis of SOA distributed systems, such as those that may constitute the next generation Network Management Services. Moreover, the HTTP proxy and the time-based status estimator (based on a Hidden Markov Model) may be useful for different diagnosis tools.
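
As an illustration of how a Hidden Markov Model can estimate the status of a service from a sequence of observations, the sketch below implements the standard forward (filtering) recursion. It is a generic textbook example, not the estimator of the QoS demonstrators: the states, the observation symbols and all probabilities are invented for the example.

```python
from typing import Dict, List

def filter_status(observations: List[str],
                  prior: Dict[str, float],
                  transition: Dict[str, Dict[str, float]],
                  emission: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Forward algorithm: probability of each hidden status given all
    observations seen so far (assumes every observation has non-zero
    probability in at least one reachable state)."""
    belief = dict(prior)
    for k, obs in enumerate(observations):
        if k > 0:
            # Prediction step: let the hidden status evolve for one time step.
            belief = {s: sum(belief[p] * transition[p][s] for p in belief)
                      for s in prior}
        # Update step: weigh each status by how well it explains the observation.
        belief = {s: belief[s] * emission[s][obs] for s in belief}
        total = sum(belief.values())
        belief = {s: v / total for s, v in belief.items()}
    return belief

# Invented example model: a service is either "ok" or "degraded", and each
# observed response is classified as "fast" or "slow".
prior = {"ok": 0.9, "degraded": 0.1}
transition = {"ok": {"ok": 0.9, "degraded": 0.1},
              "degraded": {"ok": 0.2, "degraded": 0.8}}
emission = {"ok": {"fast": 0.8, "slow": 0.2},
            "degraded": {"fast": 0.3, "slow": 0.7}}
print(filter_status(["slow", "slow", "slow"], prior, transition, emission))
```

With these numbers, each additional slow response raises the estimated probability that the service is degraded, which is the kind of signal a diagnoser could use as a QoS symptom.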


Contact details: Khalil DRIRA, LAAS-CNRS, 7, Av. Colonel Roche, 31077 Toulouse Cedex 04, France. e-mail: khalil(_at_)laas.fr, web: www.laas.fr/~khalil


Repair planner

The Workflow Repair Reasoner is a prototyped solution for exploring the possible strategies/plans for completing the faulty workflow instances successfully. It uses the workflow definition, the information about availability of repair handlers and their properties, data and flow dependencies within the Workflow and the workflow goals.

The prototype produces a plan for completing the workflow instance, which consists of repair actions applied to some of the initially executed activities and the execution of the activities that were not executed initially.

The proposed solution can be applied to both orchestrated and choreographed workflows. The reasoning about the choreographed workflows takes into account also the choreography protocol and is performed in a centralized manner.

The main idea of reasoning is presenting all the information about the workflow instance and possible repair actions as a logical program and finding a valid model which fulfils all the intended pre-conditions of actions and the workflow goals.

The existing workflow repair approaches focus mostly on specifying the repair handlers for the scopes of activities (at the workflow design stage) and executing these handlers if some fault is observed within the scope (at the run-time stage). The repairability in this case depends only on the number of repair handlers specified by the designer. The main advantage of the proposed approach is that it takes into account information about all activities, their dependencies within the workflow, their repairability properties and their influence on the workflow goals. The repair plan is therefore generated on the fly and contains only the repair actions that are really necessary to achieve the initial workflow goals. Such repair plans are more cost efficient and cover more faulty situations.
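
The idea of generating a repair plan on the fly from pre-conditions and goals can be illustrated with a small planner. Note that this sketch swaps in a different technique: the Workflow Repair Reasoner encodes the problem as a logic program and searches for a valid model, while the code below uses a plain breadth-first search over states. The actions, facts and names are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional, Tuple

# Hypothetical encoding: action name -> (pre-conditions, facts added, facts removed)
Action = Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str]]

def plan_repair(state: FrozenSet[str], goals: FrozenSet[str],
                actions: Dict[str, Action]) -> Optional[List[str]]:
    """Breadth-first search for a shortest sequence of actions whose
    pre-conditions hold when applied and after which all `goals` hold."""
    queue = deque([(state, [])])
    seen = {state}
    while queue:
        current, plan = queue.popleft()
        if goals <= current:
            return plan
        for name, (pre, add, remove) in sorted(actions.items()):
            if pre <= current:
                successor = (current - remove) | add
                if successor not in seen:
                    seen.add(successor)
                    queue.append((successor, plan + [name]))
    return None  # the instance cannot be repaired with the available actions

f = frozenset
actions = {
    "redo_payment": (f({"order_received"}), f({"payment_ok"}), f({"payment_faulty"})),
    "ship": (f({"payment_ok"}), f({"shipped"}), f()),
    "compensate_order": (f({"order_received"}), f(), f({"order_received"})),
}
faulty_instance = f({"order_received", "payment_faulty"})
print(plan_repair(faulty_instance, f({"payment_ok", "shipped"}), actions))
```

The plan found for this instance redoes the faulty payment and then executes the shipping activity that had not been executed yet; the compensation action is available but is not used because it does not help to reach the goals.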


Contact details: Prof. Dr. Gerhard Friedrich, Mag. Volodymyr Ivanchenko, AINF, University of Klagenfurt, 9020, Austria, Klagenfurt. Gerhard.Friedrich(_at_)ifit.uni-klu.ac.at, vladimir(_at_)ifit.uni-klu.ac.at


Diagnosability analysis

The exploitable result is in the form of two original algorithms that perform a complete automated diagnosability analysis.

Algorithm 1 takes as input a distributed state based model (in the form of constraints) and outputs the complete cartography of discriminable and undiscriminable pairs of fault modes (possible faulty situations represented by the status of each component).

Algorithm 2 takes as input an event based model (as a Time Petri net) representing the nominal behavior of the system and a set of chronicles associated with faulty or nominal situations. It checks the discriminability of the underlying situations by checking the exclusiveness of one chronicle w.r.t. the nominal behavior or of a pair of chronicles.

Diagnosability analysis is a computationally expensive task. The two algorithms are devised to circumvent this difficulty.
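
The output of a discriminability analysis in the style of Algorithm 1 can be pictured with the following naive sketch. It assumes, for illustration only, that each fault mode is already associated with the set of observations it may produce and that two fault modes are discriminable when these sets are disjoint. The naive pairwise enumeration shown here is precisely the expensive computation that the project algorithms are designed to avoid; all names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]

def discriminability_map(observables: Dict[str, Set[str]]
                         ) -> Tuple[List[Pair], List[Pair]]:
    """Split all pairs of fault modes into discriminable pairs (no common
    observation) and undiscriminable pairs (at least one common observation)."""
    discriminable: List[Pair] = []
    undiscriminable: List[Pair] = []
    for a, b in combinations(sorted(observables), 2):
        if observables[a].isdisjoint(observables[b]):
            discriminable.append((a, b))
        else:
            undiscriminable.append((a, b))
    return discriminable, undiscriminable

# Invented example: possible observations for three fault modes.
observables = {
    "all_ok": {"invoice_correct"},
    "billing_faulty": {"invoice_wrong"},
    "warehouse_faulty": {"invoice_wrong", "parcel_missing"},
}
yes, no = discriminability_map(observables)
print("discriminable:", yes)
print("undiscriminable:", no)
```

In this example a wrong invoice alone cannot tell a billing fault from a warehouse fault, which suggests to the designer that an additional observation point is needed.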


Contact details: Louise Travé-Massuyès, LAAS-CNRS, 7 avenue du Colonel-Roche, F-31077 Toulouse, France; louise(_at_)laas.fr.


Analysis and Design methodology

The aim of the implemented tool that supports the methodology is to provide guidelines for designing services in such a way that they can be easily recovered during their execution. The tool is organized in three main phases: process analysis, repair strategies evaluation and repair strategy selection. The Process Analysis is the module responsible for the definition of users’ requirements. Users can define for each operation and/or service their functional and non-functional requirements. By using the Analytic Hierarchy Process method, users express their requirements on the defined dimensions and users’ utilities are calculated. Users can also define which repair strategies are not feasible a priori for the analyzed operations/services. The Repair Strategies Evaluation module allows service providers to identify the influence of each repair strategy on the considered quality dimensions. Furthermore, providers can select the repair strategies to consider and the quality aggregation function to use in the definition of the requirements of a service starting from the operations that compose it. The Repair Strategy Selection module considers the process analysis and repair strategies evaluation outputs and defines the ranking of the most suitable repair strategies that can be adopted.

Such a methodology is innovative with respect to the Web service literature. It provides significant support to Web service designers by suggesting which repair strategies are the most suitable in a specified scenario.
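
The way pairwise judgements can be turned into weights and then into a ranking of repair strategies can be sketched as follows. The weights are approximated with the row geometric mean method, a common approximation of the Analytic Hierarchy Process priorities; the quality dimensions, the judgements, the impact values and the weighted-sum aggregation are assumptions made for the example and are not taken from the tool.

```python
import math
from typing import Dict, List

def ahp_weights(pairwise: List[List[float]]) -> List[float]:
    """Approximate AHP priorities: normalized geometric mean of each row of
    the pairwise comparison matrix (pairwise[i][j] = importance of i over j)."""
    means = [math.prod(row) ** (1.0 / len(row)) for row in pairwise]
    total = sum(means)
    return [m / total for m in means]

def rank_strategies(weights: List[float],
                    impact: Dict[str, List[float]]) -> List[str]:
    """Rank repair strategies by the weighted sum of their impact (0..1)
    on each quality dimension, best first."""
    scores = {name: sum(w * v for w, v in zip(weights, values))
              for name, values in impact.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Invented dimensions: availability, response time, cost.
# Availability is judged 3x as important as response time and 5x as cost.
pairwise = [[1.0, 3.0, 5.0],
            [1 / 3, 1.0, 3.0],
            [1 / 5, 1 / 3, 1.0]]
weights = ahp_weights(pairwise)
impact = {"retry": [0.4, 0.9, 0.9],
          "substitute": [0.9, 0.6, 0.4],
          "compensate": [0.6, 0.3, 0.2]}
print([round(w, 3) for w in weights])
print(rank_strategies(weights, impact))
```

A strategy declared not feasible a priori would simply be left out of the `impact` table before ranking.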


Contact details: Barbara Pernici, Dipartimento di Elettronica e Informazione, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, e-mail: barbara.pernici(_at_)polimi.it.


Repairability Analysis algorithm

The Wf Repairability Reasoner supports workflow designers by allowing them

Using this information designers can apply changes to those activities whose influence on the workflow repairability is the greatest. The input to the WfRR is comprised of

The output of the WfRR (provided in XML format) consists of

Additionally, the WfRR can return the list of repairable activities with their associated repairability results.


Contact details: Gaston Tagni (gtagni(_at_)cs.vu.nl). Department of Artificial Intelligence, Faculty of Science, Vrije Universiteit Amsterdam. De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands.


Temporal Conformance

The temporal conformance algorithm checks whether a Web service composition including a set of choreographies and orchestrations is temporally consistent. Temporal conformance checking is done offline at design time. Hence, the process designer can determine whether execution of the model leads to any temporal failure, e.g. the violation of an explicitly assigned deadline. If a temporal conflict is found, the process can be modified so that a temporally conflict-free execution can be guaranteed. In addition, temporal execution plans for all activities are computed; these can be monitored at run time to enable pro-active and predictive time management, i.e. to react early enough that counter-measures can still be taken to ensure the correct execution of the process. The algorithm works in a fully distributed manner.
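
A design-time deadline check of this kind can be sketched in a few lines. The sketch is centralized and works on a single acyclic process with fixed activity durations, whereas the project algorithm is distributed and covers choreographies and orchestrations; the activity names, durations and deadlines are invented. It uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

def earliest_finish(durations: Dict[str, float],
                    predecessors: Dict[str, Set[str]]) -> Dict[str, float]:
    """Earliest finish time of each activity, assuming an activity starts as
    soon as all of its predecessors have finished."""
    graph = {a: predecessors.get(a, set()) for a in durations}
    finish: Dict[str, float] = {}
    for activity in TopologicalSorter(graph).static_order():
        start = max((finish[p] for p in graph[activity]), default=0.0)
        finish[activity] = start + durations[activity]
    return finish

def deadline_violations(finish: Dict[str, float],
                        deadlines: Dict[str, float]) -> List[str]:
    """Activities whose earliest finish time already exceeds their deadline."""
    return sorted(a for a, d in deadlines.items() if finish[a] > d)

durations = {"receive_order": 1.0, "check_stock": 2.0, "bill": 3.0, "ship": 4.0}
predecessors = {"check_stock": {"receive_order"},
                "bill": {"receive_order"},
                "ship": {"check_stock", "bill"}}
plan = earliest_finish(durations, predecessors)
print(plan)
print(deadline_violations(plan, {"ship": 7.0}))
```

The computed times play the role of a (very simplified) temporal execution plan: at run time, an activity finishing later than planned is an early warning that a downstream deadline is at risk.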


Contact details: Amirreza Tahamtan, amirreza.tahamtan(_at_)univie.ac.at, University of Vienna, Department of Knowledge and Business Engineering, Rathausstrasse 19/9, A-1010 Vienna, Austria.


Model Compiler and Diagnostic Knowledge Base Generator

BPEL2PN is a software tool (written in Java) that takes as input a BPEL (Business Process Execution Language) 2.0 service package describing an orchestration of Web services, and produces as output a Petri net model of it as a PNML (Petri Net Markup Language) file. This model can be graphically displayed and simulated by the open source Platform Independent Petri net Editor (PIPE). Both control flow and data flow are represented.
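
To illustrate what simulating such a Petri net model means, the sketch below plays the token game on a hand-written net for a sequence of two activities. The mapping of a BPEL sequence to places and transitions shown here is an assumption made for the example and is not the translation implemented by BPEL2PN; no PNML is read or written.

```python
from typing import Dict, List, Tuple

Marking = Dict[str, int]                  # place name -> number of tokens
Transition = Tuple[List[str], List[str]]  # (input places, output places)

def enabled(marking: Marking, transition: Transition) -> bool:
    """A transition is enabled if every input place holds enough tokens."""
    inputs, _ = transition
    return all(marking.get(p, 0) >= inputs.count(p) for p in set(inputs))

def fire(marking: Marking, transition: Transition) -> Marking:
    """Consume one token per input arc and produce one per output arc."""
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled")
    inputs, outputs = transition
    new_marking = dict(marking)
    for p in inputs:
        new_marking[p] -= 1
    for p in outputs:
        new_marking[p] = new_marking.get(p, 0) + 1
    return new_marking

# Hypothetical net for a sequence of two activities: receive, then invoke.
net = {"receive": (["p_start"], ["p_received"]),
       "invoke": (["p_received"], ["p_done"])}
marking = {"p_start": 1}
for name in ("receive", "invoke"):
    marking = fire(marking, net[name])
    print(name, "->", marking)
```

Firing the two transitions in order moves the single token from the start place to the final place, which corresponds to one complete execution of the sequence.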


Contact details: Philippe Dague, LRI, Univ. Paris-Sud 11 and CNRS, Orsay, France, philippe.dague(_at_)lri.fr


September 2007: The project concludes its second year.

In the second year the project moved along three different directions:

  1. Prototyping the framework defined during the first year for phase-1 (support platform, diagnostic and repair algorithms), testing it on the guiding application.
  2. Starting phase-2, that is, reconsidering the framework and the characterization of diagnosis and repair after removing some of the assumptions that were made in phase-1
  3. Characterization of diagnosability, repairability and self-healability for Web Services

Prototype

The prototype consists of the following main subsystems: Self-Healing Layer, Diagnoser, Recovery Planner.

The Self-Healing Layer consists of a set of modules for supporting the self healing execution of services. In particular:

The Diagnoser adopts a decentralized approach with a Local Diagnoser associated with each individual service and a Supervisor associated with the composite service. The Local Diagnoser is invoked by the service that generates the alarm; it computes a local explanation and passes it to the Supervisor to integrate it in the global explanation. The Supervisor receives local explanations from Local Diagnosers and integrates them into a global explanation; to obtain a global explanation it may invoke other Local Diagnosers to explain events. The process ends with a computed global diagnosis that is sent to the Recovery Planner.

The Recovery Planner computes a plan made of basic recovery actions, and passes it to SH-BPEL for execution.

In order to overcome difficulties in the adoption of the approach we also implemented prototypes of two off-line modules: a Model Compiler and a Semantic Annotator.

The Model Compiler is used by the Diagnoser. It starts from the BPEL model and generates the diagnostic model. In such a way no ad-hoc knowledge acquisition is needed to build WS-DIAMOND enabled services.

The Semantic Annotator starts from a raw WSDL description and a log of service execution and suggests semantic (ontology-based) annotations of the service, that can be used as further information by the Diagnoser (it can be used to generate probes and observations) and for service retrieval for substitution.

The prototype has been demonstrated on a guiding application (FoodShop: an e-commerce service that provides food items).

Phase-2 Extensions

We extended the characterization of diagnosis and of self-healing services. In particular, at the composition level, in phase-2 we assume that the complex service can be choreographed (it was assumed to be orchestrated in phase-1). Moreover, we move from a "closed world" assumption (all the services participating in the composition are known) to an "open world" vision, in which some of the involved services can be "external".

Concerning diagnosis, while in phase-1 only functional faults were considered, in phase-2 also Quality of Service faults are going to be taken into account. Moreover, temporal information (not considered for diagnosis in phase-1) is exploited in the diagnostic process.

As far as repair is concerned, the set of repair actions is extended and, again, temporal aspects are considered (in particular for QoS fault repair). Also, while in phase-1 the optimality criterion was the length of the recovery plan, in phase-2 a cost function is introduced. Finally, in phase-2, we consider distributed planning for the choreographed case: if in phase-1 there was only one recovery planner for the main workflow of the orchestrator (monolithic planning), in phase-2 each service with a complex business process is able to have a recovery planner.

Self-Healability

We provided a definition of the notions of diagnosability, repairability and self-healability. We started from the definitions used in technical domains and adapted them to Web Services, where we introduced the new notion of self-healability. This is a property of a composite service and corresponds to the possibility of performing diagnosis and recovery of the service at run time. The definitions depend on the possibility of dealing with the alarms that the service may raise and of having sufficient information to determine the cause and to select the appropriate repair actions. We then started the work to define an extended methodology and framework for Web service design. Some algorithms for supporting the designer of a service have already been specified and are currently being prototyped.

September 2006: The project concludes its first year.

The project is on schedule and, in the first year, achieved the following results according to the workplan:

During this first phase we also:

September 2005: Project Kick-Off

In order to achieve the project goals, we plan to carry out research in different areas.

Research lines that are now separate will thus intertwine and converge towards the target of self-healing Web Services. In particular, we foresee research in at least three different directions:

Cooperative Information Systems and Web Services

In the field of Web Services, we expect to develop innovative solutions for service conversational and negotiation capabilities in order to accommodate all the types of information that are needed for reliable service execution, monitoring, diagnosis and recovery. Recovery after monitoring and diagnosis will require the definition of new strategies for flexible Web Services execution (adaptive services and self-modifying processes). The framework will also produce guidelines for service design and specification, as well as for service composition and adaptation.

Model-Based Diagnosis

In the field of model-based reasoning and diagnosis, the most important innovations we expect are the following:

Semantic Web Services

The objectives of WS-Diamond in the area of Semantic Web Services are twofold:

  1. Enrich existing proposals for service markup languages with features required to enable diagnosable and self-healing services.
  2. Develop methods for the semi-automatic acquisition of the markup with service description.

The first objective will be obtained by extending languages like OWL-S and WSMO with features to represent Quality of Service descriptions, monitoring information, repair options in case of failure, etc. The success of this objective will be measured by the extent to which, by exploiting these language extensions, we succeed in describing Web Services operating in the test-bed scenarios that will be defined during the project. The second objective will be obtained by building on existing work in this area (using techniques from Machine Learning and Natural Language Processing). Success of this objective will be measured by the extent to which our implementation of these methods is able to reconstruct the semantic markup that we initially construct by hand for the purposes of the case-study.