Maintenance and semiconductor fab efficiency – McKinsey
The global semiconductor industry is expanding at an unprecedented rate. Our analysis suggests that the industry will grow by 6 to 8 percent per year through 2030, when it will hit the $1 trillion annual revenue mark.
Demand from end markets (for uses such as computing, data storage, wireless communication, and AI) is responsible for about 60 percent of this growth. The rest comes from industries that require mature nodes (wafers of 200 millimeters [mm] or smaller), such as the automotive, industrial, and wired-communication segments. And while the dynamics in end markets that require leading-edge chips (such as in computing and AI) may create minor fluctuations in demand, the semiconductor industry will grow overall, at least for the next decade.
The industry’s growth has been accompanied by investment. For the past 20 years, we’ve seen the leading-edge 300 mm wafer receive generous amounts of investment in the form of R&D and operational best practices, which has made it relatively easy to increase manufacturing capacity. But for wafer platforms that are 200 mm or smaller—which are frequently used for automotive, industrial, and wired communications—the story is different.
Known as “mature nodes,” chips made on these platforms have historically responded to fixed demand and flat growth. But that changed when roaring end-market demand pushed mature node–manufacturing facilities—known as “fabs”—to their full capacity.
Many fabs need a capacity boost—and quickly. Others need a performance boost to minimize the cost of tools being down for maintenance. But most fabs with mature nodes are at least 20 years old, with the equipment, processes, and productivity to match.
In this context, fab efficiency takes on outsize importance for meeting market demand and cost expectations. The potential is significant. In our experience, improving equipment reliability can help a fab enhance its tool availability (the share of time that a piece of equipment is ready to process incoming work) by more than 15 percent. When the improvement is applied to bottlenecks, about 70 to 80 percent of it translates into the fab’s overall equipment effectiveness (OEE), a measure of a manufacturing operation’s utilization relative to its full potential. As a result, the fab can quickly tap into significant latent capacity, often more than 10 percent, without adding tools or expanding its footprint.
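The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, not a model from the article: it assumes the 15 percent availability gain is expressed in percentage points of tool time and that the 70 to 80 percent conversion applies directly at a bottleneck tool. The function name `latent_capacity_gain` is hypothetical.

```python
def latent_capacity_gain(availability_gain: float, conversion_to_oee: float) -> float:
    """Estimate the OEE gain at a bottleneck from an availability gain.

    Both arguments are fractions (0.15 means 15 percent). Treating the
    availability gain as percentage points is an assumption of this sketch.
    """
    if not 0.0 <= conversion_to_oee <= 1.0:
        raise ValueError("conversion_to_oee must be between 0 and 1")
    return availability_gain * conversion_to_oee


# Figures quoted in the text: 15 percent availability gain,
# of which 70 to 80 percent carries through to OEE at bottlenecks.
low = latent_capacity_gain(0.15, 0.70)
high = latent_capacity_gain(0.15, 0.80)
print(f"Estimated capacity gain: {low:.1%} to {high:.1%}")
```

With the quoted inputs, the estimate works out to roughly 10.5 to 12 percent, which is consistent with the latent capacity of more than 10 percent cited above.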
Achieving this requires a significant change in mindsets, processes, and systems away from reactivity and toward prevention and advance planning. Fab leaders would need to prioritize equipment recovery, conduct consistent planned maintenance, and manage parts efficiently. The insights in this article would bring the most benefit to wafer platforms of 200 mm or less, but the concepts are applicable to all semiconductor fabs.
Many fabs running mature nodes are not yet using the latest advancements in data management systems, lean manufacturing processes and practices, and Industry 4.0 tools. Cost reduction over a period of years and loss of institutional knowledge through talent attrition have weakened the capability of the production floor. Taken together, these factors make responding to maintenance issues difficult, and inefficient when a response does happen.
Over time, a lack of planned, structured actions creates a cycle of reactionary maintenance for many fabs, resulting in lower availability and lower wafer output for the fab overall. In our experience, some fabs’ maintenance ratios (M-ratios) have dipped below 1.0, which means unplanned-maintenance time surpasses scheduled-maintenance time. Getting mired in this kind of firefighting diverts attention and resources from planning and implementing a strategic maintenance program. If unplanned maintenance significantly surpasses planned maintenance, the tools are effectively controlling the maintenance organization, not the other way around. The need to hit quarterly shipment targets can heighten the pressure on fab teams, which is not always conducive to rigorous problem solving. The result is reactionary responses to issues, which lead to longer downtimes, recurring failures, and reduced tool availability, ultimately affecting wafer output.
Tool availability is one of the crucial factors that affect OEE, alongside tool performance, wafer loading, and quality. Tool availability sometimes dips, but decision makers’ reactions to those dips are pivotal (see sidebar “Critical factors in overall equipment effectiveness”).
Seven factors significantly affect fab overall equipment effectiveness (OEE).
Unscheduled downtime, as measured by the maintenance ratio (M-ratio). The M-ratio is the ratio of scheduled downtime (ideal because it’s generally preventative and planned) to unscheduled downtime for any equipment. A healthy M-ratio is greater than or equal to 4.0, meaning that 20 percent—or less—of production equipment downtime is unplanned.
Scheduled downtime. This includes a set of effective preventive actions aimed at avoiding potential tool failures, based on root-cause analysis of past failures or on OEMs’ prescribed actions.
Speed loss. This is when equipment runs more slowly than its target run rate.
Idle time. This is time in which equipment is available for production but has not been loaded, whether or not wafers are waiting to be processed.
Standby loss. This loss occurs due to loading and transporting wafers and parts and due to unclear alternative wafer flows when the primary path is blocked because a tool is down.
Qualification time. This is time spent after maintenance to test and ensure equipment is performing to specifications.
Engineering time. This is time spent by engineers to inspect, adjust, and improve equipment.
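The M-ratio described in the first factor above can be checked with a few lines of arithmetic. The sketch below follows the definition in the text (scheduled downtime divided by unscheduled downtime) and derives the unplanned share of total downtime as 1 / (1 + M). The function names and the sample hours are hypothetical.

```python
def m_ratio(scheduled_hours: float, unscheduled_hours: float) -> float:
    """Scheduled downtime divided by unscheduled downtime."""
    if scheduled_hours < 0 or unscheduled_hours <= 0:
        raise ValueError("scheduled_hours must be >= 0 and unscheduled_hours > 0")
    return scheduled_hours / unscheduled_hours


def unplanned_share(ratio: float) -> float:
    """Fraction of total downtime that is unplanned: 1 / (1 + M)."""
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    return 1.0 / (1.0 + ratio)


# Hypothetical tool with 80 hours of scheduled and 20 hours of unscheduled downtime.
healthy = m_ratio(scheduled_hours=80.0, unscheduled_hours=20.0)
print(f"M-ratio {healthy:.1f}, unplanned share {unplanned_share(healthy):.0%}")

# Hypothetical tool stuck in firefighting: unscheduled time exceeds scheduled time.
firefighting = m_ratio(scheduled_hours=30.0, unscheduled_hours=40.0)
print(f"M-ratio {firefighting:.2f}, below 1.0: {firefighting < 1.0}")
```

An M-ratio of 4.0 corresponds to 20 percent of downtime being unplanned, which matches the healthy threshold stated in the text. An M-ratio of 1.0 means half of all downtime is unplanned.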
In theory, decision makers and practitioners would prepare and implement an effective planned-maintenance schedule, tamping down the frequency of unexpected maintenance over time. In practice, doing so is difficult because the steps in a wafer flow (the path a wafer takes in the chip-making process) are interdependent and complex, as is the equipment. As a result, we’ve observed that managers focus on firefighting in 40 to 70 percent of the instances in which a tool goes down. This approach necessarily drives them to prioritize short-term wins. In the remaining cases, managers may try to fulfill a plan, but those efforts tend to be doomed because firefighting consumes a disproportionate amount of resources.
In our experience, tool availability can be improved in most cases by addressing unplanned maintenance. We found that an hour of planned maintenance can typically save three to four hours of unplanned maintenance.
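As a rough illustration of that trade-off, the sketch below computes the net downtime avoided when planned-maintenance hours are added. It assumes the three-to-four-hour saving scales linearly with each planned hour, which the article does not claim. The function name and the ten-hour input are hypothetical.

```python
def net_hours_saved(planned_hours: float, saved_per_planned_hour: float) -> float:
    """Unplanned hours avoided minus the planned hours invested."""
    if planned_hours < 0 or saved_per_planned_hour < 0:
        raise ValueError("inputs must be non-negative")
    return planned_hours * saved_per_planned_hour - planned_hours


# Hypothetical case: ten extra hours of planned maintenance,
# evaluated at both ends of the range quoted in the text.
for saving in (3.0, 4.0):
    net = net_hours_saved(planned_hours=10.0, saved_per_planned_hour=saving)
    print(f"{saving:.0f} h saved per planned hour -> net {net:.0f} h of downtime avoided")
```

Under this linear assumption, ten planned hours would avoid 30 to 40 unplanned hours, for a net reduction of 20 to 30 hours of downtime.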
A best-in-class maintenance program for semiconductor fabs consists of three overarching elements: resolve problems quickly with short-loop root cause analysis (RCA), build a strong planned-maintenance foundation, and manage parts efficiently.
Our discussion is focused on short-loop root cause analysis (RCA). However, there are three other effective approaches to solve problems and reduce unplanned downtime in semiconductor fabs.
Short-loop RCA, an agile problem-solving technique for breaking down causes and generating solutions, can help teams arrive at a shared understanding of the main causes of tool downtime and develop plans to address them. The approach can resolve moderately complex maintenance problems, which, in our experience, account for about half of the issues that arise in corrective maintenance (see sidebar “Short-loop root cause analysis in context”).
Semiconductor fab teams can use short-loop RCA to find the best way to resolve problems that arise with specific tools (exhibit).
For each piece of equipment targeted for root-cause analysis, teams would articulate the problem to be addressed and assess its frequency and impact. Teams would also identify the internal and external resources needed to investigate the problem.
The next step would be to collect historical data on the piece of equipment, reconfirm the nature of the problem with the relevant machine technician, and identify any adjacent issues.
To conduct the RCA, the working team would gather in a dedicated space to develop hypotheses, employ five-why analysis (a particular interrogative problem-solving technique used to identify the source of an issue), prioritize possible root causes, and brainstorm potential solutions. The outcome of this exercise could be a course of action that’s likely to succeed. If not, the team could go back to using historic data to generate insights that could translate into solutions.
As a final step, the team would implement their solutions and track their efficacy. Documentation here will be crucial, especially if teams have insights about whether a particular piece of equipment’s problems are systemic and whether they have implications for other areas of the business such as the fab’s strategy on spare parts.
Analytical frameworks such as fuzzy logic (which uses an imprecise set of equipment notes to identify key tool downtime issues) can help teams recognize the most common problems based on downtime hours and frequency. Ad hoc teams—including suppliers and tool vendors, if needed—can then be assembled for workshops focused on pinpointing the root causes of each issue (see sidebar “Applying fuzzy logic to maintenance in a semiconductor fab”).
Tools in fabs maintain a running log of tool downtime incidents (that is, when equipment experiences running stoppages) as they occur. Over time, the amount of raw data in logs tends to swell and suffer from a lack of structure and organization. As a result, decision makers on the floor struggle to prioritize issues to maximize the efficacy of tool recovery.
One fab used analytical techniques such as fuzzy logic (that is, using an imprecise set of equipment notes to identify key tool downtime issues) to comb the entire dataset—the text of the log entries—for keywords associated with different types of errors, including the abbreviations used by technicians and equipment engineers. Fuzzy logic (and similar analytical tools) can reduce the time it takes to analyze the notes from a few hours to a few minutes and provide additional insights through customized dashboards.
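One way to picture the keyword search is approximate string matching over the free-text notes. The sketch below uses Python's standard `difflib` module as a stand-in. The article does not say which technique or software the fab used, and the categories, keywords, cutoff, and log text here are all hypothetical.

```python
import difflib
import re

# Hypothetical error categories and keywords, including technician shorthand.
CATEGORY_KEYWORDS = {
    "vacuum": ["vacuum", "vac", "leak", "pump"],
    "robot": ["robot", "handler", "arm"],
    "rf_power": ["rf", "plasma", "generator"],
}


def categorize(note: str, cutoff: float = 0.8) -> set:
    """Return the categories whose keywords approximately match a word in the note."""
    tokens = re.findall(r"[a-z0-9]+", note.lower())
    found = set()
    for category, keywords in CATEGORY_KEYWORDS.items():
        for token in tokens:
            # get_close_matches tolerates small misspellings above the cutoff.
            if difflib.get_close_matches(token, keywords, n=1, cutoff=cutoff):
                found.add(category)
                break
    return found


# Hypothetical log entries with a typo and an abbreviation.
for entry in ["Vaccum leak at chamber B", "RF gen fault, robt arm stuck"]:
    print(entry, "->", sorted(categorize(entry)))
```

A higher cutoff reduces false matches on short tokens at the cost of missing heavier misspellings, so the value would need tuning against real notes.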
The fuzzy logic framework helps to plot the total amount of downtime caused by different incidents against the number of times those incidents occurred (exhibit).
The issues that were both high impact (as measured by the amount of downtime they caused) and high frequency were marked as a bigger priority and received the greatest level of resource allocation. The issues that happened infrequently but had high impact received attention from teams in workshops dedicated to understanding the root causes behind those issues. Technicians received training to address the remaining issues.
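The prioritization described above can be sketched as a simple aggregation followed by a rule. The thresholds for "high impact" and "high frequency" are assumptions, since the article gives none, and the function names, labels, and incident data are hypothetical.

```python
from collections import defaultdict


def summarize(incidents):
    """Total downtime hours and occurrence count per issue.

    incidents: iterable of (issue, downtime_hours) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for issue, hours in incidents:
        totals[issue][0] += hours
        totals[issue][1] += 1
    return {issue: (hours, count) for issue, (hours, count) in totals.items()}


def triage(total_hours, count, hours_threshold, count_threshold):
    """Map an issue to the response described in the text."""
    high_impact = total_hours >= hours_threshold
    high_frequency = count >= count_threshold
    if high_impact and high_frequency:
        return "top priority"          # greatest resource allocation
    if high_impact:
        return "root-cause workshop"   # infrequent but high impact
    return "technician training"       # remaining issues


# Hypothetical incident log: (issue, downtime hours).
log = [("vacuum", 5.0), ("vacuum", 7.0), ("robot", 1.0), ("rf_power", 15.0)]
for issue, (hours, count) in summarize(log).items():
    print(issue, hours, count, triage(hours, count, hours_threshold=10.0, count_threshold=2))
```

The same two numbers per issue (total hours and count) are what the exhibit plots against each other, so the summary can feed a scatter plot as well as the rule.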
Of course, implementing these habits requires the right expertise, resources, discipline, and enforcement. One benefit of having this consistent framework is that fab teams can use it with the talent they already have. It can also help offset the effects of attrition of experienced staff. Fab teams could allocate resources toward fully resolving any problems they identify so that each issue needs to be addressed only once. The knowledge captured from the experience can be used to train staff and resolve issues in the future.
In our experience, effective planned-maintenance programs increase fab availability by 5 to 7 percentage points. Although the rewards are significant and the steps (planning, implementation, and continuous improvement) are straightforward, planned-maintenance programs are difficult to establish consistently in many fabs. The status quo of resource-sapping firefighting is one hurdle, as are outdated maintenance procedures.
Commitment from the top down, accountability, and buy-in from the stakeholders leading the effort can help a planned-maintenance program take root. The foundation for a successful planned-maintenance program is an opportunity for the fab team to gradually adopt the right practices. Each milestone allows teams to recognize the impact of their effective preventive work. Standards and ratings, such as bronze, silver, and gold, for different tasks can motivate teams to improve and achieve higher levels of performance.
The most successful programs have weekly team meetings to review progress measured by KPIs and to gather feedback that helps teams plan for further improvements. Planning in these programs involves scheduling maintenance in advance and notifying the relevant stakeholders. The most effective teams schedule maintenance activities for the next two weeks and give stakeholders at least ten days of notice.
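A planning check along these lines could be sketched as below. It reads the two-week horizon and the ten-day notice as two separate tests on each activity, which is one interpretation of the text. The function name, tuple layout, and sample activities are hypothetical.

```python
from datetime import date, timedelta


def planning_gaps(today, activities, horizon_days=14, min_notice_days=10):
    """Return names of activities in the planning window that lack enough notice.

    activities: iterable of (name, maintenance_on, notified_on) tuples of dates.
    """
    window_end = today + timedelta(days=horizon_days)
    gaps = []
    for name, maintenance_on, notified_on in activities:
        in_window = today <= maintenance_on <= window_end
        notice = (maintenance_on - notified_on).days
        if in_window and notice < min_notice_days:
            gaps.append(name)
    return gaps


# Hypothetical schedule reviewed on 1 March 2024.
schedule = [
    ("etch PM", date(2024, 3, 12), date(2024, 3, 1)),      # 11 days of notice
    ("litho PM", date(2024, 3, 8), date(2024, 2, 28)),     # 9 days of notice
    ("implant PM", date(2024, 4, 20), date(2024, 4, 18)),  # outside the window
]
print(planning_gaps(date(2024, 3, 1), schedule))
```

A weekly review meeting could run a check like this against the maintenance calendar and treat any returned activity as an item to reschedule or escalate.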
The implementation stage is focused on checking in with stakeholders and removing hurdles to planned-maintenance activities. Best-in-class teams check in on at least two planned-maintenance activities on each shift and resolve roadblocks within 30 minutes.
From there, continuous improvement is an ongoing process of adopting—or rejecting—possible improvements. The most effective teams make adoption and rejection decisions within a week of receipt so that new suggestions are rapidly integrated into the way they work.
Parts management can be a silent killer of fab availability because replacement parts are not always associated with acute problems. But a solution—in this case, a part—that’s delayed or unavailable prolongs or worsens the initial problem. Procurement teams may need to tap alternate suppliers, parts may not be accounted for in inventory, and multiple teams or team members may work on the same issues without coordination or clear lines of responsibility.
In our experience, a missing or delayed part can create tool downtimes that are up to six times longer than the expected downtime if the part were in stock. Spread across an entire fab, parts management is an open-ended and critical challenge.
Because challenges related to parts management are variable and often open-ended, a focus on identifying the right issues is important. Disconnects often occur—for many different reasons. As a result, procurement teams may end up focusing on sourcing nonpriority parts while inventories of critical, work-stopping parts remain unfilled.
A control tower—a cross-functional team that uses real-time data to make decisions quickly—can help analyze problems and create and implement solutions regarding parts management. The control tower can transparently resolve immediate issues and provide systemic fixes that later become part of a strategic tool set to improve processes and manage suppliers. And because of the complexity and regional specifics of the semiconductor supply chain, a parts management risk team would be an important operator of the control tower.
Demand for semiconductors will continue to grow, and fabs are feeling the pressure. One way to relieve the pressure, particularly for fabs producing wafer sizes of 200 mm and smaller, is to shift from reactive firefighting to proactive management of equipment recovery, planned maintenance, and parts management. Entire fabs—not to mention the global semiconductor industry—stand to benefit.
Ryan Fletcher is a partner in McKinsey’s Southern California office, where Abhijit Mahindroo is a senior partner; Yorgos Friligos is an associate partner in the Miami office; and Joydeep Guha is a senior expert in the Bay Area office.