Enhancing software reliability of military and mission critical systems through technologies and reliability standards

Rajesh Uppal January 30, 2022 AI & IT, Manufacturing, Military

In 2019, a Boeing 737 Max jet en route to Nairobi crashed shortly after take-off from Addis Ababa, killing all 157 people on board. The tragedy resulted from an error in the aircraft’s flight-control software. Much of the software in use around the world today contains errors of one kind or another. The consequences of these errors range from financial loss and loss of communication to the loss of human life, as in the case of the Boeing 737 Max. Software errors are caused directly by the programmers or developers who leave them in the code. Software engineers are human, and so they make mistakes: typically 1 out of every 10 to 100 tasks goes wrong.

As computing devices become more pervasive, the software systems that control them have become increasingly complex and sophisticated. Consequently, despite the tremendous resources devoted to making software more robust and resilient, ensuring that programs are correct, especially at scale, remains a difficult and challenging endeavor.

Unfortunately, uncaught errors triggered during program execution can lead to potentially crippling security violations, unexpected runtime failure or unintended behavior, all of which can have profound negative consequences on economic productivity, reliability of mission-critical systems, and correct operation of important and sensitive cyber infrastructure. Insecure software can result from insufficient testing, inexperienced coders who lack cybersecurity training, or financial incentives that reward writing and distributing code quickly rather than eliminating security flaws.

Reliability is paramount in mission-critical systems, whose failure or disruption may lead to catastrophic loss in terms of cost, damage to the environment, or even human life. Mission-critical systems are adopted in a growing number of areas, ranging from banking to e-commerce, from avionics to railway, from automotive to health care scenarios. Military systems, such as command and control systems and aircraft control systems, are also an important class of mission-critical software systems. Poor reliability leads to higher sustainment costs for replacement spares, maintenance, repair parts, facilities, staff, etc. Poor reliability hinders warfighter effectiveness and can essentially render weapons useless.

Software reliability is an important attribute of software quality, together with functionality, usability, performance, serviceability, capability, installability, maintainability, and documentation. It is a statistical measure: according to ANSI, software reliability is defined as the probability of failure-free software operation for a specified period of time in a specified environment. Although software reliability is defined as a probabilistic function and comes with the notion of time, it differs from traditional hardware reliability.
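The ANSI definition above can be written compactly. As a sketch, let T denote the random time at which the software first fails in the specified environment; the symbols T, R and F are notation introduced here for illustration and are not taken from the standard.

```latex
R(t) = \Pr\{T > t\}, \qquad F(t) = 1 - R(t) = \Pr\{T \le t\}, \qquad t \ge 0
```

Here R(t) is the reliability over a period of length t and F(t) is the corresponding probability that at least one failure has occurred by then.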

Software failures may be due to errors, ambiguities, oversights or misinterpretation of the specification that the software is supposed to satisfy, carelessness or incompetence in writing code, inadequate testing, incorrect or unexpected usage of the software, or other unforeseen problems. Software reliability is becoming harder to achieve because software keeps growing in size and complexity, which makes it more prone to errors.

Software reliability is not a direct function of time. Electronic and mechanical parts may become “old” and wear out with time and usage, but software will not rust or wear out during its life cycle. Software will not change over time unless intentionally changed or upgraded. Hardware faults are mostly physical faults, while software faults are design faults, which are harder to visualize, classify, detect, and correct.

In May 2008, a Defense Science Board (DSB) report concluded that “High suitability (reliability) failure rates were caused by the lack of a disciplined systems engineering process, including a robust reliability growth program.” The most important reaction to this problem, according to that analysis, was to include reliability in system design at its onset: the “single most important step ... is to ... execute a viable systems engineering strategy from the beginning, including a robust reliability, availability, and maintainability (RAM) program.”

Different types of software systems pose different reliability and security concerns. Lance Fiondella from University of Massachusetts, North Dartmouth, MA and others have analysed the relationships between software reliability engineering and cybersecurity to develop more effective ways of assessing and improving system security.

Military requirements of Software reliability and security

Military systems are, and continue to become, increasingly software intensive. Software enables capabilities, but poses risks. For example, major automated information systems (MAIS) can provide efficient enterprise services such as global logistics and mission support. Unavailability of these systems can hinder the ability to perform missions. Similarly, weapons systems such as fixed wing, rotorcraft, and unmanned aerial systems (UAS) fuse sensor data and exhibit higher levels of autonomy than ever before.

Moreover, they must operate in hostile environments in which adversaries actively seek to deny software-enabled capabilities with techniques from electronic and cyber warfare. Thus, communication channels that enable system of systems (SoS) capability among land, sea, air, and space assets must actively counteract adversarial efforts to deny capabilities in order to assure the availability of assets to execute missions.

Military systems have become critically dependent on software reliability because of the growing number of software-enabled systems and components. In addition, software is now embedded in the cyberspace domain that enables military, intelligence, and business operations. Furthermore, embedded software has become an essential feature of virtually all hardware systems. This necessitates assessing system reliability through a holistic accounting of hardware, software, operators and their interdependencies.

The British destroyer Sheffield was sunk because the radar system identified an incoming missile as “friendly”. Software can also have small, unnoticeable errors or drifts that can culminate in a disaster. On February 25, 1991, during the Gulf War, a chopping (truncation) error that lost 0.000000095 second of precision in every tenth of a second, accumulated over 100 hours of operation, caused a Patriot missile to fail to intercept a Scud missile. 28 lives were lost. One of the major causes behind all these unfortunate events is the presence of unreliable software.
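To see how such a tiny per-tick error becomes significant, the arithmetic in the Patriot example can be worked through directly. The following is a minimal sketch that uses only the figures quoted above; the function and constant names are made up for illustration.

```python
from fractions import Fraction

# Figures quoted in the paragraph above.
ERROR_PER_TICK_S = Fraction(95, 10**9)  # 0.000000095 s lost on every tick
TICKS_PER_SECOND = 10                   # one tick every tenth of a second
UPTIME_HOURS = 100


def accumulated_drift_seconds(hours):
    """Total clock error after `hours` of continuous operation."""
    ticks = hours * 3600 * TICKS_PER_SECOND
    return ticks * ERROR_PER_TICK_S


drift = accumulated_drift_seconds(UPTIME_HOURS)
print(f"Clock drift after {UPTIME_HOURS} h: {float(drift):.3f} s")
# prints: Clock drift after 100 h: 0.342 s
```

Roughly a third of a second is a large timing error when tracking a fast-moving target, which is why an error that is invisible on any single tick mattered after days of uptime.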

America’s most expensive weapons system has been beset with numerous delays, many of them related to the discovery of software bugs and reliability issues. The F-35 Lightning II, Lockheed Martin’s fifth-generation fighter jet, has had trouble meeting a crucial deadline for successfully deploying its sixth and final software release, referred to as Block 3F. Block 3F is part of the 8 million lines of sophisticated software code that underpin the F-35. In short, if the code fails, the F-35 fails. As first reported by Aviation Week, the DoD report says “the rate of deficiency correction has not kept pace with the discovery rate,” meaning more problems than solutions are arising from the F-35 program. “Examples of well-known significant problems include the immaturity of the Autonomic Logistics Information System (aka the IT backbone of the F-35), Block 3F avionics instability, and several reliability and maintainability problems with the aircraft and engine.”

The design, acquisition, and operation of products and systems intended for defense applications must understandably meet some of the strictest reliability standards of any industry or sector. When it comes to a mission-critical apparatus, failure of any type can lead to unacceptable and devastating consequences. For this reason, the Department of Defense’s (DoD) Reliability, Availability, and Maintainability (RAM) guidelines and the U.S. Military’s reliability prediction handbook (MIL-HDBK-217) were established.

Software reliability enhancement methods

Many methods have been explored for assuring the reliability of software systems, but no single method has proved to be completely effective. As reliability is the prime concern for mission-critical software, a number of techniques such as peer reviews, code inspection, and dynamic testing are used throughout development to improve software reliability.

The traditional method to see if a program works is to test it. Coders submit their programs to a wide range of inputs (or unit tests) to ensure they behave as designed. By some estimates, testing can require over 100,000 years to achieve reasonable confidence in the correctness of safety-critical software. Even then, only a negligible fraction of the input space will have been exercised.
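The article does not say how a figure of this size is derived. One way it can arise, sketched here under assumptions that are not in the original, is the following: suppose the goal is to demonstrate a failure rate of at most 1e-9 per operating hour (a hypothetical target), that times to failure are exponentially distributed, and that testing must be failure-free. The required test time for a confidence level C is then -ln(1 - C) / rate.

```python
import math

HOURS_PER_YEAR = 24 * 365.25


def failure_free_test_hours(target_rate_per_hour, confidence):
    """Hours of failure-free testing needed to claim, with the given
    confidence, that the failure rate is at most `target_rate_per_hour`.
    Assumes exponentially distributed times to failure."""
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if target_rate_per_hour <= 0:
        raise ValueError("target rate must be positive")
    return -math.log(1.0 - confidence) / target_rate_per_hour


# Hypothetical target and confidence level, chosen only for illustration.
hours = failure_free_test_hours(1e-9, 0.90)
years = hours / HOURS_PER_YEAR
print(f"{hours:.3e} test hours, about {years:,.0f} years on a single test unit")
```

Under these assumptions the answer is far beyond 100,000 years on one test unit, which illustrates why testing alone cannot establish very high reliability levels.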

Redundancy and diversity are fundamental strategies for enhancing the dependability of any type of system. Redundancy means that spare capacity is included in a system that can be used if part of that system fails. Diversity means that redundant components of the system are of different types, thus increasing the chances that they will not fail in exactly the same way.

Software Redundancy

Redundant software components are part of software engineering best practice for improving software quality and making software fail-safe. According to the National Research Council (2015), redundancy exists when one or more of the parts of a system can fail and the system can still function with the parts that remain operational. Two common types of redundancy are active and standby.

In active redundancy, all of a system’s parts are energized during the operation of a system. In active redundancy, the parts will consume life at the same rate as the individual components. An active redundant system is a standard “parallel” system, which only fails when all components have failed. In standby redundancy, some parts are not energized during the operation of the system; they get switched on only when there are failures in the active parts. In a system with standby redundancy, ideally, the parts will last longer than the parts in a system with active redundancy. A standby system consists of an active unit or subsystem and one or more inactive units, which become active in the event of a failure of the functioning unit. The failures of active units are signalled by a sensing subsystem, and the standby unit is brought to action by a switching subsystem.
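The difference between the two schemes can be made concrete with textbook reliability formulas. This is a minimal sketch under simplifying assumptions that the article does not state: units fail independently, lifetimes are exponential with a constant failure rate, and in the standby case the sensing and switching subsystems never fail. All numbers are hypothetical.

```python
import math


def active_parallel_reliability(unit_reliabilities):
    """Active redundancy: the system fails only if every unit fails."""
    prob_all_fail = 1.0
    for r in unit_reliabilities:
        prob_all_fail *= (1.0 - r)
    return 1.0 - prob_all_fail


def standby_reliability(failure_rate, hours):
    """One active unit plus one identical unpowered standby unit,
    assuming perfect failure sensing and switching."""
    x = failure_rate * hours
    return math.exp(-x) * (1.0 + x)


# Hypothetical unit: 1 failure per 1000 hours, 1000-hour mission.
rate, mission = 1e-3, 1000.0
single = math.exp(-rate * mission)
active = active_parallel_reliability([single, single])
standby = standby_reliability(rate, mission)
print(f"single={single:.3f} active={active:.3f} standby={standby:.3f}")
# prints: single=0.368 active=0.600 standby=0.736
```

In this idealized setting the standby arrangement comes out ahead because the spare unit does not consume life while it waits, matching the observation above; real sensing and switching subsystems can fail and would reduce that advantage.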

Various techniques can be employed to make software fault-tolerant. Omar Anwer Abdul Hameed, Israa Abdulameer Resen, and Saif A. Abd (2019) describe two types of software fault tolerance techniques, namely single-version and multi-version.
Single-version techniques aim to improve the fault tolerance of a software component by adding to it mechanisms for fault detection, containment, and recovery. Redundancy may also be provided by including additional checking code, which is not strictly necessary for the system to function. This code can detect some kinds of faults before they cause failures, and it can invoke recovery mechanisms to ensure that the system continues to operate.
Multi-version techniques use redundant software components that are developed following design diversity rules. As in the hardware case, various choices have to be examined to determine at which level the redundancy has to be provided and which modules are to be made redundant.
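The two families can be illustrated with a short sketch. The first function wraps a single implementation with checking code and a recovery action; the second runs several independently written versions and takes a majority vote, which is one common multi-version arrangement. All function names and the toy absolute-value implementations are hypothetical and chosen only to keep the example small.

```python
from collections import Counter


def checked_call(func, arg, is_plausible, fallback):
    """Single-version: one implementation plus checking code and recovery."""
    try:
        result = func(arg)
    except Exception:
        return fallback  # fault detected through an exception: recover
    # Checking code: accept the result only if it passes a plausibility test.
    return result if is_plausible(arg, result) else fallback


def majority_vote(versions, arg):
    """Multi-version: run diverse versions and return the majority answer."""
    results = []
    for version in versions:
        try:
            results.append(version(arg))
        except Exception:
            continue  # a crashed version simply loses its vote
    if not results:
        raise RuntimeError("all versions failed")
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value


# Three deliberately different implementations of absolute value.
def abs_v1(x):
    return x if x >= 0 else -x


def abs_v2(x):
    return max(x, -x)


def abs_v3(x):
    return x  # faulty version: wrong for negative inputs


print(majority_vote([abs_v1, abs_v2, abs_v3], -5))                # prints 5
print(checked_call(abs_v3, -5, lambda a, r: r >= 0, fallback=0))  # prints 0
```

The voter masks the faulty third version because the other two agree, while the single-version wrapper can only detect the implausible result and fall back to a safe default.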

Sometimes, to ensure that attacks on a system cannot exploit a common vulnerability, redundant servers may be of different types and may run different operating systems. Using different operating systems is one example of software diversity and redundancy, where comparable functionality is provided in different ways.

AMSAA software reliability scorecard

Software development often is complex and expensive, while available resources and time are generally limited. This exacerbates the challenges of ensuring appropriate and effective practices are followed. To facilitate development of more reliable software, the Army Materiel Systems Analysis Agency (AMSAA) has developed a software reliability scorecard.

The scorecard is a structured and transparent instrument for assessing the health of an individual software development effort and is invaluable in isolating weak areas for further analysis and work. It enables scarce resources to be prioritized and, subsequently, more reliable software to be developed.

AMSAA’s new software reliability scorecard assesses seven key areas of software development and sustainment: Program Management, Requirements Management, Design Capabilities, System Design, Design for Reliability, (Customer) Test & Acceptance, and Fielding & Sustainment. Across the categories a total of 57 specific elements are examined. The scorecard evaluates the risk being taken in each of the key areas separately and also assesses the overall risk of the effort. The instrument captures the rationale for each assessment score and presents it along with recommendations for reducing individual risks.

Formal Methods

“The reason why NASA’s Curiosity rover is able to continue roaming Mars years after hitting the surface is that the engineers back at the Jet Propulsion Laboratory spent an obscene amount of time testing its components ahead of the launch. They employed a methodology known as formal verification to mathematically confirm the craft doesn’t suffer from any technical issues that might hinder its performance in the harsh conditions of the red planet,” says NASA.

Formal methods consist of using formal specification as a way of defining what, exactly, a computer program does and formal verification as the way of proving beyond a doubt that a program’s code perfectly achieves that specification. “You’re writing down a mathematical formula that describes the program’s behavior and using some sort of proof checker that’s going to check the correctness of that statement,” said Bryan Parno, who does research on formal verification and security at Microsoft Research.
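A toy example shows the shape of the idea. The sketch below uses Lean 4, chosen here purely for illustration; it assumes a recent Lean 4 toolchain in which the `omega` tactic is available. The function `double` and the theorem name are hypothetical.

```lean
-- Program under verification (a deliberately tiny example).
def double (n : Nat) : Nat := n + n

-- Formal specification: for every input, the result is twice the input.
-- The proof checker accepts this theorem only if the proof is valid.
theorem double_meets_spec (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```

Unlike a test, which checks a handful of inputs, the theorem covers every natural number at once. Real verification efforts apply the same pattern to far larger programs and specifications.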

“Recently researchers have made big advances in the technology that undergirds formal methods: improvements in proof-assistant programs like Coq and Isabelle that support formal methods; the development of new logical systems (called dependent-type theories) that provide a framework for computers to reason about code; and improvements in what’s called “operational semantics” — in essence, a language that has the right words to express what a program is supposed to do,” writes Kevin Hartnett. “Today formal software verification is being explored in well-funded academic collaborations, the U.S. military and technology companies such as Microsoft and Amazon.”

Software Reliability Prediction

One of the most important ways to increase software reliability is to predict software errors, which in turn reduces future software maintenance costs. A software reliability prediction is performed early in the software life cycle and provides an indication of the expected reliability of the software either at the start of system test or at the delivery date.

Software reliability models have been successfully used for estimation and prediction of the number of errors remaining in the software. The major difference between software reliability prediction and software reliability estimation is that predictions are based on historical data while estimations are based on collected data. Users can assess the current and future reliability through testing using these models, and can decide whether the product can be released in its present state or requires further testing to improve its quality.
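As an illustration of how such a model estimates the number of remaining errors, the sketch below uses the Goel-Okumoto reliability growth model, a well-known model picked here as an example; the article does not name a specific model. In it, the expected number of failures observed by test time t is a(1 - e^(-bt)), where a is the total expected number of faults and b is a per-fault detection rate. The parameter values are hypothetical; in practice they would be fitted to failure data.

```python
import math


def expected_failures(a, b, t):
    """Goel-Okumoto mean value function: expected failures found by time t."""
    return a * (1.0 - math.exp(-b * t))


def expected_remaining(a, b, t):
    """Expected number of faults still latent after testing for time t."""
    return a - expected_failures(a, b, t)


# Hypothetical parameters: 120 total faults, detection rate 0.02 per test hour.
a, b = 120.0, 0.02
for hours in (0, 50, 100, 200):
    remaining = expected_remaining(a, b, hours)
    print(f"after {hours:>3} h of test: about {remaining:.1f} faults remaining")
```

A release decision could then compare the estimated remaining faults against a threshold agreed for the project.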

Because testing failure data are often unavailable, researchers are turning to soft computing techniques. Soft computing is a family of computing methodologies whose principal members include fuzzy logic, chaos theory, neurocomputing, evolutionary computing and probabilistic computing. These techniques deal with problems that are imprecise, uncertain and difficult to categorize. Soft computing can be used for software fault diagnosis, reliability optimization and time series prediction during software reliability analysis.

Reliability testing

Reliability testing is one of the keys to better software quality. It helps discover many problems in software design and functionality. The main purpose of reliability testing is to check whether the software meets the customer’s reliability requirements.
Reliability testing is performed at several levels. Complex systems are tested at the unit, assembly, subsystem and system levels.

Reliability testing is done to test the software performance under the given conditions.

The objectives of reliability testing are:

To find the structure of repeating failures.
To find the number of failures occurring in a specified amount of time.
To discover the main cause of failure.
To conduct performance testing of the various modules of the software application after fixing defects.
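The second objective, counting failures in a specified amount of time, can be sketched in a few lines. The example also derives the mean time between failures from the same log. The failure log and function names are hypothetical.

```python
def failures_in_window(failure_times, start, end):
    """Count failures whose timestamp falls in the half-open window [start, end)."""
    return sum(1 for t in failure_times if start <= t < end)


def mean_time_between_failures(failure_times):
    """Average gap between consecutive failures (needs at least two)."""
    ordered = sorted(failure_times)
    if len(ordered) < 2:
        raise ValueError("need at least two failures")
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical log: cumulative test hours at which each failure occurred.
log = [4.0, 9.0, 17.0, 30.0, 52.0, 90.0]

print(failures_in_window(log, 0, 24))   # prints 3
print(mean_time_between_failures(log))  # prints 17.2
```

In this made-up log the gaps between failures grow over time, which is the pattern one hopes to see as defects are found and fixed.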

Even after the product is released, we can minimize the occurrence of defects and thereby improve software reliability. Some of the tools useful for this are trend analysis, orthogonal defect classification, and formal methods.
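Trend analysis can take several forms; the article does not say which one is meant. One widely used option, shown here only as an example, is the Laplace trend test, which compares the average failure time with the midpoint of the observation period. The failure times below are hypothetical.

```python
import math


def laplace_trend_factor(failure_times, observation_end):
    """Laplace trend statistic for failures observed on (0, observation_end].

    Clearly negative values suggest failures are thinning out over time
    (reliability growth); clearly positive values suggest they are
    clustering late in the period (reliability decay)."""
    n = len(failure_times)
    if n == 0:
        raise ValueError("need at least one failure")
    if observation_end <= 0:
        raise ValueError("observation period must be positive")
    mean_time = sum(failure_times) / n
    scale = observation_end * math.sqrt(1.0 / (12.0 * n))
    return (mean_time - observation_end / 2.0) / scale


# Hypothetical field data over a 100-day observation period.
early_failures = [1.0, 2.0, 3.0, 4.0, 5.0]
late_failures = [95.0, 96.0, 97.0, 98.0, 99.0]
print(laplace_trend_factor(early_failures, 100.0))  # negative: improving
print(laplace_trend_factor(late_failures, 100.0))   # positive: degrading
```

A value near zero indicates no clear trend in either direction.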

Types of reliability Testing

Software reliability testing includes feature testing, load testing and regression testing.

Feature Testing:-

Feature testing checks the features provided by the software and is conducted in the following steps:-

Each operation in the software is executed at least once.
Interaction between the two operations is reduced.
Each operation has to be checked for its proper execution.

Load Testing:-

Usually, the software will perform better at the beginning of the process and after that, it will start degrading. Load Testing is conducted to check the performance of the software under maximum work load.

Regression Testing:-

Regression testing is mainly used to check whether any new bugs have been introduced by the fixing of previous bugs. It is conducted after every change or update to the software’s features and functionality.

Software Failure and Reliability Assessment Tool (SFRAT)

A recent National Academies report on Enhancing Defense System Reliability recommends the use of reliability growth models to direct contractor design and test activities. Several tools are available that automatically apply reliability models and automate reliability test and evaluation. SFRAT is an open source application that allows users to answer the following questions about a software system during test:
1. Is the software ready to release (has it achieved a specified reliability goal)?
2. How much more time and test effort will be required to achieve a specified goal?
3. What will be the consequences to the system’s operational reliability if not enough testing resources are available?
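To give a feel for how the second question can be answered with a reliability growth model, here is a minimal sketch. It is not SFRAT code and does not use SFRAT’s interface; it assumes the Goel-Okumoto model with already fitted, hypothetical parameters, in which the failure intensity at test time t is a·b·e^(-bt), and solves for the test time at which that intensity drops to a target value.

```python
import math


def failure_intensity(a, b, t):
    """Goel-Okumoto failure intensity (expected failures per hour) at time t."""
    return a * b * math.exp(-b * t)


def additional_test_time(a, b, tested_so_far, target_intensity):
    """Extra test hours needed before the failure intensity reaches the target."""
    if target_intensity <= 0:
        raise ValueError("target intensity must be positive")
    time_at_target = math.log(a * b / target_intensity) / b
    return max(0.0, time_at_target - tested_so_far)


# Hypothetical fitted parameters and release goal.
a, b = 120.0, 0.02   # total expected faults, detection rate per hour
tested = 100.0       # hours of testing completed so far
goal = 0.05          # acceptable failures per hour at release

extra = additional_test_time(a, b, tested, goal)
print(f"current intensity: {failure_intensity(a, b, tested):.3f} failures/h")
print(f"additional test time needed: {extra:.1f} h")
```

If the returned value is zero, the goal has already been met under the model, which corresponds to the first question in the list.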

Reliability Standards Used in the Defense Industry

Several standards are used throughout the defense sector, all with the aim of achieving the highest reliability and quality goals:

MIL-HDBK-217: MIL-HDBK-217 is the worldwide accepted handbook for reliability prediction analysis. This handbook is the originator of the models for reliability prediction analysis, and is the basis for all prediction standards that have been developed since and are in use today.

MIL-HDBK-217 defines the failure rate calculation models for a broad range of electromechanical components, enabling you to effectively calculate and predict the failure rate and MTBF (Mean Time Between Failures) of your products. The most recent revision of MIL 217 is MIL-HDBK-217 F Notice 2. Sometimes, MIL-HDBK-217 is referred to as MIL-STD-217.
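The relationship between part failure rates and system MTBF can be sketched as follows. This is a simplified illustration of a parts-count style calculation for a series system, in which part failure rates are summed and MTBF is the reciprocal of the total. The handbook’s adjustment factors, such as quality and environment, are assumed to be already folded into each rate, and the part list is made up rather than taken from MIL-HDBK-217.

```python
def system_failure_rate(parts):
    """Total failure rate of a series system.

    `parts` is a list of (quantity, failures per million hours) pairs."""
    return sum(quantity * rate for quantity, rate in parts)


def mtbf_hours(failure_rate_per_million_hours):
    """MTBF in hours for a constant failure rate given per million hours."""
    if failure_rate_per_million_hours <= 0:
        raise ValueError("failure rate must be positive")
    return 1_000_000.0 / failure_rate_per_million_hours


# Hypothetical bill of materials: (quantity, failures per 10^6 hours).
parts = [(10, 0.02), (4, 0.15), (1, 1.2)]

total_rate = system_failure_rate(parts)
print(f"system failure rate: {total_rate:.2f} per million hours")
print(f"predicted MTBF: {mtbf_hours(total_rate):,.0f} hours")
```

Because the rates add, the least reliable part types tend to dominate the prediction, which makes this kind of tally useful for spotting where design effort pays off.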

NPRD/EPRD: The NPRD (Non-electronic Parts Reliability Data) and EPRD (Electronic Parts Reliability Data) include a library of components and their representative failure data. The data is extensive and represents a wide variety of devices and sources. Adding these components into your reliability prediction analyses allows you to build a complete system model and more accurately assess your product reliability. The most recent editions of these data libraries are NPRD-2016 and EPRD-2014.

ANSI/VITA 51.1: ANSI/VITA 51.1 is a collaborative industry standard that provides recommended modifications to the MIL-HDBK-217 F Notice 2 Reliability Prediction Handbook to reflect more updated failure rate assessments. The ANSI/VITA 51.1 rules, recommendations, and suggestions take into account changes in device technologies, improvements that have occurred over time since the MIL-HDBK-217 F Notice 2 standard was released, and updated data parameters to more accurately model current device quality and performance.

MIL-STD-1629: MIL-STD-1629 is the standard for FMECA (Failure Mode, Effects, and Criticality Analysis). FMECA is a methodology used to evaluate and assess all potential system failure modes, the resulting effects and possible causes of those failure modes, and ultimately eliminate, reduce, or mitigate the failures deemed most critical.

MIL-HDBK-472: MIL-HDBK-472 is the worldwide accepted standard for maintainability predictions. It defines all the elements required to perform a full maintainability prediction analysis including Tasks, Task Groups, FD&I (Fault Detection & Isolation) Outputs, and Maintenance Groups. MIL-HDBK-472 describes how to compute a comprehensive set of maintenance metrics, including mean time to repair (MTTR), mean corrective maintenance time (MCMT), mean preventive maintenance time (MPMT), percent isolation to a single replaceable item, Maintainability Index, and much more.
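As an illustration of one of these metrics, MTTR is commonly computed as a failure-rate-weighted average of the repair times of the replaceable items, so that items which fail more often count for more. The sketch below shows that calculation on hypothetical data; it is a simplified example, not a reproduction of any specific procedure in the handbook.

```python
def mean_time_to_repair(items):
    """Failure-rate-weighted mean repair time.

    `items` is a list of (failure rate, repair time in hours) pairs; the
    failure rates may be in any unit as long as it is the same for all."""
    total_rate = sum(rate for rate, _ in items)
    if total_rate <= 0:
        raise ValueError("total failure rate must be positive")
    return sum(rate * hours for rate, hours in items) / total_rate


# Hypothetical replaceable items: (failures per 10^6 hours, repair hours).
items = [(5.0, 0.5), (3.0, 2.0), (2.0, 4.0)]
print(f"MTTR: {mean_time_to_repair(items):.2f} hours")  # prints MTTR: 1.65 hours
```

Here the quick-to-repair item fails most often, so the MTTR sits closer to its repair time than a plain average of the three times would.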

MIL-STD-2155: MIL-STD-2155, entitled Failure Reporting, Analysis and Corrective Action System (FRACAS), establishes the criteria needed to comply with the FRACAS requirement portion of MIL-STD-785.

MIL-STD-785: MIL-STD-785, Reliability Program for Systems and Equipment Development and Production, is a broad standard that offers general guidelines as well as specifics for reliability programs that span the product lifecycle.

MIL-STD-882: MIL-STD-882, Standard Practice for System Safety, establishes an accepted baseline of expectations for system safety efforts. The most recent revision of MIL-STD-882 is MIL-STD-882E, though MIL-STD-882D is still often used.

MIL-HDBK-781: MIL-HDBK-781, Handbook for Reliability Test Methods, Plans, and Environments for Engineering Development, Qualification, and Production, includes test methods, plans, and environmental profiles that can be used in reliability testing during any stage of the product lifecycle, from development through qualification to production. MIL-HDBK-781A is the most current version of the handbook.

MIL-STD-499: MIL-STD-499, System Engineering Management, provides the basis for managing systems engineering activities such as negotiating and performance for product delivery. The most recent version is MIL-STD-499A Notice 2.

MIL-HDBK-344: MIL-HDBK-344, Environmental Stress Screening (ESS) of Electronic Equipment, includes procedures for planning and managing the cost effectiveness of ESS programs for electronic equipment. The most recent version is MIL-HDBK-344A.

MIL-HDBK-514: MIL-HDBK-514, Operational Safety, Suitability, and Effectiveness for the Aeronautical Enterprise, includes a framework for system management of aircraft and related items such as support equipment, ground systems, and simulators, to provide safe and acceptable performance during operation.
