To err is human, but to really mess things up you need a statistician…
One of the most useful diagnostic tools for understanding what is going on in a process is the statistical process control (SPC) chart. This is also a frequently misunderstood tool, and these misunderstandings lead to misdirected effort during a Six Sigma project, resulting in lost time and money. All the questions related to these foiled efforts boil down to this: "I used my software to make a control chart, but the chart looks all messed up. Why doesn't SPC work?"
It does, if you avoid some common pitfalls. So today, I am kicking off a few articles about these pitfalls that I hope will make your projects less frustrating and more efficient.
No Measurement System Analysis
In my experience, the most common error in doing SPC is not performing a measurement system analysis first. It's the kind of error that calls for a real head-thumping on the desk (your head, or maybe that of the person who didn't do the study, if you can get away with it).
Sometimes teams don't do any type of study at all, which is especially common with discrete or judgment gauges, due to the difficulty of the statistics. More frequently they do something that they call a "gauge R&R," but that really does not test all the necessary aspects of the measurement system. As I discuss in my article "Don't judge a gauge by its sticker," I prefer a three-phase approach: a potential study that shows whether you even have a hope of using the gauge to measure at the level you intend, a short-term study to test the gauge prior to using it in production, and a long-term study that monitors the gauge in production in order to detect shifts through time. (Note that a calibration sticker alone does none of these things.) A good measurement system analysis tests the gauge for:
- Stability through time (control)—If the gauge measures the exact same things over and over again, do you get predictable results in terms of the average and variability? In the absence of stability, you have a fine random number generator, but it is not so useful as a process measure.
- Inherent variability (precision)—Is the variability of the measurement system as a whole (across measurement devices and/or operators) a relatively small proportion of the specification against which you are making decisions? If not, but the gauge is stable, you can still make a decision at the expense of taking repeated measures. (A rough numeric sketch of this check follows the list.)
- Bias or accuracy—How far off, on average, are the data generated by the gauge from the "real" value? This is actually less important for measurement systems that are inputs into a process, if that process is tuned based on outgoing measures.
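To make that precision check concrete, here is a minimal sketch in Python. The part values, operator biases, noise level, and tolerance are all invented, and the variance split is deliberately simplified (a real study would use a proper gauge R&R ANOVA), but it shows the basic idea of comparing gauge variation to the tolerance:

```python
import numpy as np

# Hypothetical study: 3 operators each measure the same 5 parts 3 times.
# All numbers (part values, operator biases, noise, tolerance) are made up.
rng = np.random.default_rng(1)
true_parts = np.array([10.2, 10.5, 9.8, 10.0, 10.4])
data = {
    op: true_parts[:, None] + bias + rng.normal(0, 0.05, (5, 3))
    for op, bias in {"A": 0.00, "B": 0.03, "C": -0.02}.items()
}

# Repeatability: pooled variation of repeats within each operator/part cell.
within = np.concatenate([(cell - cell.mean(axis=1, keepdims=True)).ravel()
                         for cell in data.values()])
sigma_repeat = within.std(ddof=len(data) * len(true_parts))

# Reproducibility (rough): spread of the operator averages.
op_means = np.array([cell.mean() for cell in data.values()])
sigma_repro = op_means.std(ddof=1)

sigma_gauge = np.sqrt(sigma_repeat**2 + sigma_repro**2)

# Precision-to-tolerance ratio: the share of the spec window the gauge
# itself consumes. Common rules of thumb want this well under ~30%.
tolerance = 1.0  # upper spec minus lower spec (assumed)
print(f"P/T = {6 * sigma_gauge / tolerance:.1%}")
```

If the gauge eats up most of the tolerance by itself, decisions made against that specification are mostly decisions about measurement noise.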
Common errors when doing gauge analyses are:
- Not testing for stability through time—You end up working on a process problem with D-M-A-I-C that is really a measurement problem through T-I-M-E.
- Not using the things that you are actually going to measure as the items you measure during the study—Measuring standard gage blocks “real good” tells you little about measuring your product, unless you make gage blocks.
- Not using samples that "exercise" the gauge across the full band of what it will be used to measure—These data are also used to determine linearity (changes in bias as the magnitude changes) as well as independence between the mean and the variance around the mean.
- Not testing the actual operators who will be using the gauge in the actual environment within which they will be using it—Something that works great for an engineer in the lab could be terrible in a production environment.
- Assuming a measurement system that doesn't destroy the sample is a nondestructive gauge for purposes of determining gauge stability—Destructive gauges I have encountered include hardness testers, micrometers, and calipers, all of which changed the measurement I was trying to repeat, as well as things where the real value is changing through time and can't truly be repeated.
If the measurement system analysis is not performed, or not performed correctly, you could be trying to solve a problem that exists only in the measurement system. There is nothing quite so agonizing as sitting down with a frustrated team only to find out that the entire reason they were commissioned was a gauge that was not in control or not capable of measuring to the specification... and that the previous 35 years of product were placed on SPC charts, scrapped, or passed based on that gauge.
Continuous Processes
Continuous processes are those that, while producing, run without interruption and are characterized by ongoing additions of inputs; they have some unique things to consider when applying SPC. Examples: a blast furnace continuously produces pig iron as long as it is fed limestone, coke, heat, oxygen, and iron ore; a chrome plating process electrolytically deposits chrome by using a large bath of acid that is constantly supplied with chrome, acid, heat, and electricity; a continuous heat-treatment furnace produces heat-treated sheet and is continuously supplied with heat and sheet.
What these processes have in common, as far as SPC is concerned, is that there is a component of the process (e.g., heat, raw material, or a consumable) that is used up and replenished on an ongoing basis. Such processes do not have meaningful subgroups, so the only control charts we can use for the continuous part of the process are individuals charts. In addition to the usual cautions associated with individuals charts, continuous processes can have a persistence from one sample to the next that can lead to problems in interpreting the control chart.
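As a reference point for those individuals charts, here is a minimal sketch in Python of the standard individuals and moving range (I-MR) limit calculation; the readings are made up, and the constants 2.66 and 3.267 are the conventional factors for moving ranges of size two:

```python
import numpy as np

def imr_limits(x):
    """Control limits for an individuals (I) chart and its moving range (MR)
    chart, using the conventional constants for moving ranges of size two."""
    x = np.asarray(x, dtype=float)
    mr = np.abs(np.diff(x))              # ranges between consecutive points
    x_bar, mr_bar = x.mean(), mr.mean()
    return {
        "I chart":  (x_bar - 2.66 * mr_bar, x_bar, x_bar + 2.66 * mr_bar),
        "MR chart": (0.0, mr_bar, 3.267 * mr_bar),
    }

# Hypothetical series of individual measurements:
readings = [68.1, 68.4, 67.9, 68.6, 68.2, 68.0, 68.5]
for chart, (lcl, center, ucl) in imr_limits(readings).items():
    print(f"{chart}: LCL={lcl:.2f}  center={center:.2f}  UCL={ucl:.2f}")
```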
For example, in any process that is measured on an ongoing basis, you can see data that are autocorrelated. All that means is that the current data point depends on the previous data point in some way. Let’s say you are control charting the temperature in the room where you are reading this. If you gather data every minute, you might see something like this:
(For data geeks: to generate these data, I used the previous point as the average for the next point, with some random normal noise thrown in.)
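If you want to generate that kind of series yourself, a minimal sketch in Python following the same recipe (each point is centered on the previous point, with normally distributed noise added; the starting temperature and noise level here are arbitrary) looks like this:

```python
import numpy as np

rng = np.random.default_rng(42)
n_minutes = 120
temps = np.empty(n_minutes)
temps[0] = 68.0                          # arbitrary starting temperature (°F)

# Random walk: each reading is centered on the previous reading,
# with some normally distributed noise added.
for i in range(1, n_minutes):
    temps[i] = temps[i - 1] + rng.normal(0.0, 0.3)

print(temps[:10].round(1))
```

Run those values through the imr_limits() helper sketched earlier and you get the wandering, limit-violating look described next.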
The same type of chart could have been generated for any of the continuous process examples, too. Clearly, the temperature is not going to change radically from one minute to the next, but it is going to change some amount from the previous number due to people walking in or out of the room, changes in sunlight, variations in the heat produced by your computer, maybe even taking off or putting on a light jacket. This kind of behavior is called a "random walk," or sometimes, in an unusually evocative term, a "drunken sailor" walk (he takes a drink, staggers in a random direction, takes a drink, and so on). In math, such a walk could theoretically wander from negative infinity to positive infinity, but in real life there is usually some constraint (in this case, the thermostat and the temperature outside your room) that prevents a true "random walk." But as you can see, this control chart looks weird.
Now, it turns out that there is actually a lot of useful information on that chart. The moving range chart shows that the point-to-point variation is fairly predictable, which means that the process is potentially controllable if we can react to the ongoing shifts in the mean. Based on these data, and given a way to adjust the mean (say, with a thermostat in the room), we can aspire to control the temperature to target ±2.16 degrees (which is where the red lines are around the average). This is a big improvement over the room temperature bouncing between 62.5 and 74.5 degrees.
If we can't control the variability we are tracking (as we couldn't if this were a chart of ore concentration coming into our plant), we may be able to use it as a feed-forward signal into our process in order to reduce the effect of the wild swings on our process output, making it more robust.
There may actually be a pattern to this autocorrelation, which is detectable by plotting the autocorrelation function for different lags. If so, this can be modeled (maybe with an ARIMA model) and you could plot the deviation from this model, if that meets the control needs of the process. If it does not, you would investigate what might be causing the pattern in order to reduce or eliminate it.
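As a rough illustration of that idea, the sketch below (Python, using statsmodels and the temps series from the earlier sketch; the ARIMA order here is just an example, not a recommendation) checks the lag-1 autocorrelation and then charts the residuals from a fitted model rather than the raw readings:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Lag-1 autocorrelation of the temps series from the earlier sketch.
x = temps - temps.mean()
lag1 = (x[:-1] * x[1:]).sum() / (x * x).sum()
print(f"lag-1 autocorrelation: {lag1:.2f}")

# Fit a simple time-series model and look at its residuals. If the model
# captures the autocorrelation, the residuals (deviations from the model)
# can be control-charted with ordinary individuals-chart limits.
fit = ARIMA(temps, order=(1, 1, 0)).fit()
residuals = fit.resid[1:]                # drop the first point (no predecessor)
print(imr_limits(residuals))             # helper from the earlier sketch
```

Whether charting deviations from a model is appropriate depends, as the paragraph above says, on what the process actually needs controlled.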
It might be that we are taking data more frequently than we need to for our process. In our temperature example, you can see how the process is bounded, and if we took a data point less frequently than we are now (say, once every 20 minutes instead of every minute), we might find a process that is, at a grosser level, in control. The flip side is that by doing so, you lose the ability to get the process variability down to ±2.16 degrees as above, so only do this if it is thoroughly justified by the process. Remember, the goal is not a control chart that is in control; it is a process that is economically controlled to the minimum level of variability around the target.
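A quick, hypothetical way to try that coarser sampling, reusing temps and the imr_limits() helper from the earlier sketches:

```python
# Keep only every 20th reading, i.e., sample once every 20 minutes instead of
# once a minute, then recompute the limits with the imr_limits() helper above.
coarse = temps[::20]
print(imr_limits(coarse))   # the average moving range, and so the limits, widen
```

The widened limits are exactly the tradeoff described above: the chart may settle down, but you give up the tight point-to-point view of the process.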
Next month I’ll continue with some common errors people make in SPC for batch processes.
So don’t make any SPC charts on a batch process until then, OK?