Reliability, maintainability, and availability (RAM) are three system attributes that are of great interest to systems engineers, logisticians, and users. Collectively, they affect both the utility and the life-cycle costs of a product or system.
Although significant improvements have been made in increasing the reliability of basic components such as microelectronics, these have not always been accompanied by corresponding gains in the reliability of equipment or systems. In some cases, equipment and system complexity and functionality have progressed so rapidly that they negate, in part, the increased reliability expected from use of the higher reliability basic component. In other cases, the basic components have been misapplied or overstressed so that their potentially high reliability is not realized. In still other cases, program management has been reluctant or unable, due to program budget shortfalls or highly aggressive schedules, to devote the time and attention necessary to ensure that the potentially high reliability is achieved. However, in many areas of the commercial sector, such as the computer, electronic and automotive industries, increased system complexity has not negated system reliability. In fact, often products with increased system complexity are provided with increased system reliability. This is an area the defense sector must also strive to improve.
In defense, while the speed, range, firepower, and overall mission performance of weapons systems have improved dramatically over the years, RAM problems have persisted. RAM problems slow the development and fielding of systems, drive up the total ownership cost, and degrade operational readiness and mission accomplishment at the strategic, operational and tactical levels. New complex digital designs have increased software development and integration issues and the importance of integrated diagnostics.
As a result, DoD conducted a series of studies on these programs to determine the causes. They concluded that defense contractor reliability design practices may not routinely be consistent with best commercial practices for accelerated testing, simulation-guided testing, and process certification and control. Physics-of-failure approaches with physics-based computer-aided design tools may not have been used on a regular basis. A Failure Modes, Effects, and Criticality Analysis (FMECA) and a Failure Reporting, Analysis, and Corrective Action System (FRACAS) were generally not effective in correcting problem failure modes. A FRACAS generally is effective only if a technical Failure Analysis Program is funded and implemented. In addition, DoD found that inadequate testing was conducted at the component and system level. Testing time was limited, and sample sizes were too small. Component stress testing was frequently inadequate or not conducted. Proper accelerated life testing was rarely accomplished. Adequate Reliability Program Plans that provided a roadmap to realization of reliability program objectives and requirements were lacking as well.
Reliability is the probability that an item performs a required function under stated conditions for a specified period of time. During this correct operation, no repair is required or performed, and the system adequately follows the defined performance specifications. Reliability is further divided into mission reliability and logistics reliability. Reliability is often modeled with an exponential failure law, under which the probability of failure-free operation decreases as the time interval considered grows. In other words, the reliability of a system is highest at the start of operation and gradually declines as operating time accumulates.
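As a rough illustration of the exponential failure law mentioned above, the sketch below assumes a constant failure rate λ, so that R(t) = exp(-λt). The failure rate value and the function name are hypothetical, chosen only to show how reliability falls as the mission duration grows.

```python
import math


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability of failure-free operation for `hours`, assuming a
    constant failure rate (exponential failure law): R(t) = exp(-lambda * t)."""
    if failure_rate_per_hour < 0 or hours < 0:
        raise ValueError("failure rate and duration must be non-negative")
    return math.exp(-failure_rate_per_hour * hours)


# Hypothetical failure rate: one failure per 10,000 operating hours on average.
FAILURE_RATE = 1e-4

for t in (0, 100, 1_000, 10_000):
    print(f"t = {t:>6} h   R(t) = {reliability(FAILURE_RATE, t):.4f}")
```

Under this assumption reliability starts at 1.0 for a zero-length mission and decays toward zero; other failure distributions (for example Weibull) would give a different curve.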
Availability refers to the probability that a system performs correctly at a specific time instance (not duration). Availability is a measure of the degree to which an item is in an operable state and can be committed at the start of a mission when the mission is called for at an unknown (random) point in time. Interruptions may occur before or after the time instance for which the system’s availability is calculated. Availability as measured by the user is a function of how often failures occur and corrective maintenance is required, how often preventative maintenance is performed, how quickly indicated failures can be isolated and repaired, how quickly preventive maintenance tasks can be performed, and how long logistics support delays contribute to down time. Availability is measured at its steady state, accounting for potential downtime incidents that can (and will) render a service unavailable during its projected usage duration. For example, a 99.999% (Five-9’s) availability refers to 5 minutes and 15 seconds of downtime per year.
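The five-nines figure can be checked with a little arithmetic. The sketch below assumes a 365-day year and treats availability as the long-run fraction of time the system is up; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def annual_downtime_minutes(availability: float) -> float:
    """Expected downtime per year for a given steady-state availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, a in (("three nines", 0.999),
                 ("four nines", 0.9999),
                 ("five nines", 0.99999)):
    minutes = annual_downtime_minutes(a)
    whole, seconds = divmod(round(minutes * 60), 60)
    print(f"{label:>11}: {whole} min {seconds} s of downtime per year")
```

For 99.999% availability this gives roughly 5.26 minutes, which matches the 5 minutes and 15 seconds quoted above.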
Maintainability is the ability of an item to be retained in, or restored to, a specified condition when maintenance is performed by personnel having specified skill levels, using prescribed procedures and resources, at each prescribed level of maintenance and repair.
The purpose of Reliability and Maintainability (R&M) engineering (Maintainability includes Built-In-Test (BIT)) is to influence system design in order to increase mission capability and availability and decrease logistics burden and cost over a system’s life cycle. Properly planned, R&M engineering reduces cost and schedule risks by preventing or identifying R&M deficiencies early in development. This early action results in increased acquisition efficiency and higher success rates during operational testing, and can even occur in the development process as early as the Engineering and Manufacturing Development (EMD) phase.
The design, acquisition, and operation of products and systems intended for defense applications must understandably meet some of the strictest reliability standards of any industry or sector. When it comes to a mission-critical apparatus, failure of any type can lead to unacceptable and devastating consequences. For this reason, the Department of Defense’s (DoD) Reliability, Availability, and Maintainability (RAM) guidelines and the U.S. Military’s reliability prediction handbook (MIL-HDBK-217) were established.
RAM impact in Defense missions
Achieving specified levels of RAM for a system is important for many reasons, specifically the effect RAM has on readiness, system safety, mission success, total ownership cost, and logistics footprint.
Readiness
Readiness is the state of preparedness of forces or weapon system or systems to meet a mission, based on adequate and trained personnel, material condition, supplies/reserves of support system and ammunition, numbers of units available, etc. Poor RAM will cause readiness to fall below needed levels or increase the cost of achieving them. Effective diagnostics helps assure both system/mission readiness and efficient repair/return to ready status.
System Safety
Inadequate reliability or false failure indications of components deemed Critical Safety Items (CSI) may directly jeopardize the safety of the user(s) of that component’s system and result in a loss of life. The ability to safely complete a mission depends directly on the CSIs associated with the system reliably performing to design intent.
Mission success
Inadequate reliability of equipment directly jeopardizes mission success and may result in undesirable repetition of the mission. The ability to successfully complete a mission is directly affected by the extent to which equipment needed to perform a given mission is available and operating properly when needed. Mission aborts caused by false failure indications can have the same impact as hard failures.
Total Ownership Cost
The concept of Total Ownership Cost (TOC) is an attempt to capture the true cost of design, development, ownership and support of DoD weapons systems. At the individual program level, TOC is synonymous with the life cycle cost of the system. To the extent that new systems can be designed to be more reliable (fewer failures) and more maintainable (fewer resources needed) with no exorbitant increase in the cost of the system or spares, the TOC for these systems will be lower.
Logistics Footprint
The logistics footprint of a system consists of the number of logistics personnel and the materiel needed in a given theater of operations. The ability of a military force to deploy to meet a crisis or move quickly from one area to another is determined in large measure by the amount of logistics assets needed to support that force. Improved RAM reduces the size of the logistics footprint related to the number of required spares, maintenance personnel, and support equipment as well as the force size needed to successfully accomplish a mission.
Achieving RAM in Military Systems
Many factors are important to RAM: system design; manufacturing quality; the environment in which the system is transported, handled, stored, and operated; the design and development of the support system; the level of training and skills of the people operating and maintaining the system; the availability of materiel required to repair the system; and the diagnostic aids and tools (instrumentation) available to them. All these factors must be understood to achieve a system with a desired level of RAM. During pre-systems acquisition, the most important activity is to understand the users’ needs and constraints. During system development, the most important RAM activity is to identify potential failure mechanisms and to make design changes to remove them. During production, the most important RAM activity is to ensure quality in manufacturing so that the inherent RAM qualities of the design are not degraded. Finally, in operations and support, the most important RAM activity is to monitor performance in order to facilitate retention of RAM capability and to enable improvements in the design (if there is to be a new design increment) or in the support system (including the support concept, spare parts storage, etc.).
The key to developing and fielding military systems with satisfactory levels of RAM is to recognize it as an integral part of the Systems Engineering process and to systematically manage the elimination of failures and failure modes through identification, classification, analysis, and removal or mitigation. Additionally, strengthened integrated diagnostics (ID) design maturation tasks will enable RAM design attributes to be realized. These activities start in pre-systems acquisition and continue through development, production, and beyond into operations and support.
Role of the PM and SE
DoDI 5000.02, Enc 3, sec. 12 requires Program Managers (PMs) to implement a comprehensive R&M engineering program as an integral part of the systems engineering (SE) process. The Systems Engineer should understand that R&M parameters have an impact on the system’s performance, availability, logistics supportability, and total ownership cost. To ensure a successful R&M engineering program, the Systems Engineer should as a minimum integrate the following activities across the program’s engineering organization and processes:
- Providing adequate R&M staffing.
- Ensuring R&M engineering is fully integrated into SE activities, Integrated Product Teams and other stakeholder organizations (i.e., Logistics, Test & Evaluation (T&E), and System Safety).
- Ensuring specifications contain realistic quantitative R&M requirements traceable to the Initial Capabilities Document (ICD), Capability Development Document (CDD) and Capability Production Document (CPD).
- Ensuring that R&M engineering activities and deliverables in the Request for Proposal (RFP) are appropriate for the program phase and product type.
- Ensuring that R&M Data Item Descriptions (DIDs) that will be placed on contract are appropriately tailored.
- Integrating R&M engineering activities and reliability growth planning curve(s) in the Systems Engineering Plan (SEP) at Milestones A and B and at the Development RFP Release Decision Point.
- Planning verification methods for each R&M requirement.
- Ensuring the verification methods for each R&M requirement are described in the Test and Evaluation Master Plan (TEMP), along with a reliability growth planning curve beginning at Milestone B.
- Planning for system and system element reliability growth (e.g., Highly Accelerated Life Test, Accelerated Life Test or conventional reliability growth tests for newly developed equipment).
- Ensuring data from R&M analyses, demonstrations and tests are properly used to influence life-cycle product support planning, availability assessments, cost estimating and other related program analyses.
- Identifying and tracking R&M risks and Technical Performance Measures.
- Assessing R&M status during program technical reviews.
- Including consideration of R&M in all configuration changes and trade-off analyses.
Reliability and Maintainability Testing
Reliability testing can be performed at the component, subsystem, and system level throughout the product or system lifecycle. Examples of hardware-related categories of reliability testing include:
Reliability Life Tests: Reliability Life Tests are used to empirically assess the time to failure for non-repairable products and systems and the times between failure for repairable or restorable systems. Termination criteria for such tests can be based on a planned duration or planned number of failures. Methods to account for “censoring” of the failures or the surviving units enable a more accurate estimate of reliability.
Accelerated Life Tests: Accelerated life testing is performed by subjecting the items under test (usually electronic parts) to temperatures well above the expected operating temperature and extrapolating the results to normal conditions using an Arrhenius relation.
Highly Accelerated Life Testing/Highly Accelerated Stress Testing (HALT/HASS) subjects units under test (components or subassemblies) to extreme temperature and vibration tests with the objective of identifying failure modes, margins, and design weaknesses.
Parts Screening: Parts screening is not really a test but a procedure in which components are operated for a duration beyond the “infant mortality” period, during which the less durable items fail. The more durable parts that remain are then assembled into the final product or system.
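The Arrhenius extrapolation used in accelerated life testing can be sketched as follows. The acceleration factor is AF = exp((Ea / k) × (1/T_use − 1/T_stress)), with temperatures in kelvin and k the Boltzmann constant in eV/K. The activation energy and the two temperatures below are hypothetical; in practice Ea depends on the failure mechanism and has to be justified separately.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration_factor(activation_energy_ev: float,
                                  use_temp_c: float,
                                  stress_temp_c: float) -> float:
    """Ratio of the failure rate at stress temperature to that at use temperature."""
    t_use = use_temp_c + 273.15        # convert Celsius to kelvin
    t_stress = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use - 1.0 / t_stress)
    return math.exp(exponent)


# Hypothetical example: Ea = 0.7 eV, use at 55 C, stress test at 125 C.
af = arrhenius_acceleration_factor(0.7, use_temp_c=55.0, stress_temp_c=125.0)
test_hours = 1_000
print(f"Acceleration factor: {af:.1f}")
print(f"{test_hours} h at stress is roughly {af * test_hours:,.0f} h at use conditions")
```

The model only covers temperature-driven mechanisms; failures driven by vibration, humidity, or voltage need other acceleration models.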
Examples of system-level testing (including both hardware and software) are:
Stability tests: Stability tests are life tests for integrated hardware and software systems. The goal of such testing is to determine the integrated system failure rate and assess operational suitability. Test conditions must include accurate simulation of the operating environment (including workload) and a means of identifying and recording failures.
Reliability Growth Tests: Reliability growth testing is part of a reliability growth program in which items are tested throughout the development and early production cycle with the intent of assessing reliability increases due to improvements in the manufacturing process (for hardware) or software quality (for software).
Failure/recovery tests: Such testing assesses the fault tolerance of a system by measuring probability of switchover for redundant systems. Failures are simulated and the ability of the hardware and software to detect the condition and reconfigure the system to remain operational are tested.
Maintainability Tests: Such testing assesses the system diagnostics capabilities, physical accessibility, and maintainer training by simulating hardware or software failures that require maintainer action for restoration.
Because of its potential impact on cost and schedule, reliability testing should be coordinated with the overall system engineering effort. Test planning considerations include the number of test units, duration of the tests, environmental conditions, and the means of detecting failures.
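One common way to relate the number of test units to a reliability target is the zero-failure (success-run) relation n = ln(1 − C) / ln(R), where R is the reliability to be demonstrated and C the desired confidence. This is a standard textbook relation and not something the planning guidance above prescribes; the function name is hypothetical, and the sketch assumes independent pass/fail trials with no failures allowed.

```python
import math


def success_run_sample_size(reliability: float, confidence: float) -> int:
    """Units that must all pass to demonstrate `reliability` at `confidence`,
    assuming independent trials and zero allowed failures."""
    if not (0.0 < reliability < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("reliability and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))


for r in (0.90, 0.95, 0.99):
    n = success_run_sample_size(r, confidence=0.90)
    print(f"Demonstrating R = {r:.2f} at 90% confidence needs {n} failure-free units")
```

The steep growth in sample size as the target approaches 1 is one reason test time and sample size so often become the binding constraints in a test plan.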
Calculating Reliability
Failure rate is the frequency of component failure per unit time. It is usually denoted by the Greek letter λ (lambda) and is used to calculate the metrics specified later in this post. In reliability engineering calculations, failure rate is treated as the forecasted failure intensity, given that the component is fully operational in its initial condition.
Repair rate is the frequency of successful repair operations performed on a failed component per unit time. It is usually denoted by the Greek letter μ (mu). In its usual form it is the number of successful repairs divided by the total repair time, which is the reciprocal of the mean time to repair: μ = 1 / MTTR.
Mean time to failure (MTTF) is the average time before a non-repairable system component fails. It is commonly estimated as the total operating time divided by the number of failures; when the failure rate is constant, MTTF = 1 / λ.
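A minimal sketch of these quantities, assuming constant failure and repair rates estimated from field data. The counts and hours are hypothetical, and the steady-state availability formula A = μ / (λ + μ) is the standard two-state result for a repairable item rather than something stated in the text above.

```python
def rate(events: int, total_hours: float) -> float:
    """Events per hour: failures per operating hour, or repairs per repair hour."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return events / total_hours


def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Long-run fraction of time a repairable item is up: mu / (lambda + mu)."""
    return repair_rate / (failure_rate + repair_rate)


# Hypothetical field data: 4 failures in 2,000 operating hours,
# and 4 repairs that took 10 hours in total.
lam = rate(4, 2_000.0)   # failure rate, lambda
mu = rate(4, 10.0)       # repair rate, mu

print(f"lambda = {lam:.4f} per hour, mean time between failures = {1 / lam:.0f} h")
print(f"mu     = {mu:.4f} per hour, mean time to repair       = {1 / mu:.1f} h")
print(f"Steady-state availability = {steady_state_availability(lam, mu):.4f}")
```

For a non-repairable component only the first line applies, and 1/λ is then read as the MTTF.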
A variety of methods have been used to calculate system reliability, including trade studies, mathematical modeling, in-depth analyses and rigorous testing. A trade study is often the first step, with systematic analysis of the relationships between factors such as mission criticality, space, weight and cost in the context of a particular application. These factors form the elements of the trade study matrix and are weighted based on the system application. The trade study then assigns various potential design solutions a score based on how well they satisfy all the factors. A more detailed technique is Failure Modes, Effects, and Criticality Analysis (FMECA), which takes the assessment down to the individual components that make up the subassemblies and assemblies and so determine overall system reliability.
Application engineers can identify solution possibilities, weigh them against a number of variables (including mission criticality, space, weight and cost) and calculate reliability scores for each. Many materials, devices and solutions deliver 99.999 percent reliability or better.
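The trade study matrix described above can be sketched as a weighted sum. Everything in this example is hypothetical: the weights, the three candidate designs, and the 1 to 10 scores are invented only to show the mechanics.

```python
# Factor weights reflect the application and should sum to 1 (hypothetical values).
WEIGHTS = {"mission_criticality": 0.4, "space": 0.2, "weight": 0.2, "cost": 0.2}

# Each candidate design is scored 1 (poor) to 10 (excellent) on every factor.
CANDIDATES = {
    "Design A": {"mission_criticality": 9, "space": 5, "weight": 6, "cost": 4},
    "Design B": {"mission_criticality": 7, "space": 8, "weight": 7, "cost": 8},
    "Design C": {"mission_criticality": 6, "space": 6, "weight": 9, "cost": 9},
}


def weighted_score(scores: dict, weights: dict) -> float:
    """Sum of factor scores multiplied by their weights."""
    return sum(weights[factor] * scores[factor] for factor in weights)


ranked = sorted(CANDIDATES,
                key=lambda name: weighted_score(CANDIDATES[name], WEIGHTS),
                reverse=True)

for name in ranked:
    print(f"{name}: {weighted_score(CANDIDATES[name], WEIGHTS):.2f}")
```

Changing the weights changes the ranking, which is why the weighting should be agreed for the specific application before candidates are scored.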
Topic | Analysis | Meaning |
Safety | Failure Mode, Effects, and Criticality Analysis (FMECA) | Analyze the consequences of single failure modes (frequency, severity, and risk). |
Safety | Fault Tree Analysis (FTA) | Calculate the occurrence rate and probability of safety events that result from complex combinations of sub-events. |
Reliability and Availability | Reliability Block Diagram (RBD) | Calculate reliability, availability, Mean Time Between Failure (MTBF) and Mean Time To Restore (MTTR) of complex systems, depending on the minimal required functionalities that allow the system to operate. |
Reliability and Availability | Markov chains | Analyze complex systems by modelling each possible system state and the transition rates between the states. |
Maintainability | Spare parts availability at stock | Calculate the probability that a spare is available in the stock on demand. |
Maintainability | Spare parts effect on operational availability | Calculate system operational availability accounting for increased restoration time due to shortage of spare parts. |
Maintainability | Testability Analysis | Design a Built In Test (BIT) plan for high coverage of failure modes, and quick failure isolation. |
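As an illustration of the Reliability Block Diagram entry, the sketch below combines block reliabilities in series and in parallel, assuming the blocks fail independently. The block values and the pump/controller layout are hypothetical.

```python
from math import prod


def series(reliabilities) -> float:
    """All blocks must work: the product of the block reliabilities."""
    return prod(reliabilities)


def parallel(reliabilities) -> float:
    """At least one block must work: one minus the product of the unreliabilities."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


# Hypothetical system: two redundant pumps (0.90 each) feeding one controller (0.95).
pumps = parallel([0.90, 0.90])
system = series([pumps, 0.95])

print(f"Redundant pump pair: {pumps:.4f}")
print(f"Whole system:        {system:.4f}")
```

Redundancy lifts the pump stage from 0.90 to 0.99, but the single controller still caps the system figure, which is the kind of insight an RBD is meant to expose.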
FMECA deals with the effects of a single failure mode event, so this calculation is quite straightforward.
Other calculations can become quite complex because of inter-dependence between the states of components of the analyzed system.
Example:
A central stock provides spare parts for two helicopters. When one helicopter consumes a spare part, the availability of spare parts for the second helicopter is reduced.
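The helicopter example can be sketched with a simple Poisson demand model. This is an assumption made for illustration, not the method of any particular tool: demands arrive at a constant rate, the stock is not replenished during the interval, and the numbers are invented.

```python
import math


def prob_demand_covered(stock: int, mean_demand: float) -> float:
    """P(N <= stock) for Poisson-distributed demand N with the given mean,
    i.e. the chance that every demand in the interval finds a spare."""
    return sum(math.exp(-mean_demand) * mean_demand ** k / math.factorial(k)
               for k in range(stock + 1))


STOCK = 2                      # spares held in the central stock (hypothetical)
DEMAND_PER_HELICOPTER = 1.0    # expected spares consumed per resupply interval

one = prob_demand_covered(STOCK, DEMAND_PER_HELICOPTER)
two = prob_demand_covered(STOCK, 2 * DEMAND_PER_HELICOPTER)

print(f"One helicopter drawing on the stock:  {one:.4f}")
print(f"Two helicopters drawing on the stock: {two:.4f}")
```

Pooling the demand of both helicopters lowers the probability that the shared stock covers every request, which is the inter-dependence the example points to.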
There are two types of methods for calculating behavior of complex systems:
- Analytic – using equations and numeric integration
- Monte Carlo Simulation – simulating many possible scenarios triggered by random events
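The two approaches can be compared on the simplest repairable system: one item with constant failure rate λ and repair rate μ. The analytic steady-state availability is μ / (λ + μ); the simulation draws random up and down times and averages the fraction of time spent up. The rates, horizon, number of runs, and seed are all hypothetical choices for this sketch.

```python
import random


def analytic_availability(lam: float, mu: float) -> float:
    """Closed-form steady-state availability of a single repairable item."""
    return mu / (lam + mu)


def simulated_availability(lam: float, mu: float, horizon: float,
                           runs: int, seed: int = 0) -> float:
    """Monte Carlo estimate: average fraction of the horizon spent in the up state."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    total_up = 0.0
    for _ in range(runs):
        t = 0.0
        while t < horizon:
            time_to_failure = rng.expovariate(lam)
            total_up += min(time_to_failure, horizon - t)  # clip at the horizon
            t += time_to_failure
            if t >= horizon:
                break
            t += rng.expovariate(mu)  # item is down while being repaired
    return total_up / (runs * horizon)


LAM, MU = 0.01, 0.5  # hypothetical rates per hour
print(f"Analytic:   {analytic_availability(LAM, MU):.4f}")
print(f"Simulation: {simulated_availability(LAM, MU, horizon=20_000.0, runs=20):.4f}")
```

The analytic value is exact and instant, while the simulated value carries sampling noise that shrinks only as more runs are added; in exchange, the simulation loop is easy to extend with rules that have no closed form.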
Each method has advantages and disadvantages that dictate when each method should be used. The following table summarizes the advantages, deficiencies, and uses of each method:
Aspect | Analytic | Simulation |
Advantages | When the analytic algorithm is carefully designed, high accuracy can be obtained in a very short calculation time. For example, a requirement of failure probability lower than 10⁻⁹ per flight hour can be easily verified. | Simulation can be very flexible, allowing highly complex systems to be modelled with minimal assumptions. |
Disadvantages | Approximations often have to be employed to allow analytic calculation. For safety analysis, approximations have to be “worst case”, i.e. provide an upper bound on the failure probability. | To achieve high accuracy, many simulations have to be carried out and averaged. This may require a lot of computation resources and time. |
Uses | Safety: Fault Tree Analysis is often used for the occurrence probability of safety events. Analytic calculation allows for fast and accurate analysis. Spare optimization: The goal of spare optimization is to find the cheapest combination of spare parts that will provide the required system availability. Fast analytic calculations make it possible to scan many sparing options quickly. When coupled with a smart optimization engine, the optimal spare parts combination can be found. Availability: Steady state availability (system availability after a sufficiently long time, when correlations between system components decay) can be calculated quickly and accurately. Life Cycle Cost: An upper bound on the mean life cycle cost and the mean cost components for each life year can be quickly calculated. | Availability and Reliability: Monte Carlo simulations can provide the point availability and reliability (a curve of availability / reliability over time), accounting for correlations between the operational age of components. Life Cycle Cost: By attaching a cost to all events, the life cycle cost can be calculated, including a curve showing how the expenses accumula |