Gauging your conformance decisions

In the past couple of articles, we have been having fun together testing whether a measurement device is usable for the crazy purpose of determining if we are actually making product in or out of specification. Last month, we performed a measurement systems analysis (MSA) “potential study” using a snazzy MSA spreadsheet (if I do say so myself)*. We found that the Hard-A-Tron was not only pretty highly variable (compared to our spec), but that the material we were measuring actually might have been changing over time. But a potential study was not enough for you, was it? You asked, nay demanded, that we perform a short-term MSA, and I, your humble servant, gave you the data to do so. After the jump, we will perform the analysis, so unless you are the type of person who flips to the back of the book to see if you want to read it, finish up your analysis, and then click to read more.

OK, let’s review the scenario. We are weighing a plastic preform before placing it into a compression mold. The weight specification is 465 ±50 grams. We are assuming that the measurement is independent of operator, so we only have one operator do the test.

We have been using this scale for years—and it has a digital readout, so the plant manager likes it. On the other hand, we have had a lot of defective parts for years, too, and the area the scale is in is pretty contaminated with phenolic dust.

You perform the study by having the operator go through all 25 samples in a random order (while preventing him from seeing the ID number of each). You record these readings, then go through the same 25 samples in a different random order, and repeat this so you have five measurements on each of the 25 pieces. Each row contains the five measurements for that part.
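The randomization described above can be sketched in a few lines of Python. This is only an illustration: the function name `make_run_orders` and the seed are made up here, and the article does not say how the run orders were actually generated.

```python
import random

def make_run_orders(n_parts=25, n_passes=5, seed=None):
    """Return one independently shuffled list of part IDs per pass."""
    rng = random.Random(seed)
    part_ids = list(range(1, n_parts + 1))
    orders = []
    for _ in range(n_passes):
        order = part_ids[:]   # copy, so every pass starts from the full set
        rng.shuffle(order)    # a fresh random order for this pass
        orders.append(order)
    return orders

# Hypothetical usage: five passes through the same 25 parts.
orders = make_run_orders(seed=1)
for pass_number, order in enumerate(orders, start=1):
    print(f"Pass {pass_number}: {order}")
```

The person recording the data keeps this list; the operator only ever sees the part, not its ID.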

First off, what are we trying to accomplish by doing a short-term study?

In this case, we want to know if the gauge itself might be adding variability into the process. In other situations, we might want to learn more about how a gauge performs before plunking down $10,000 for it. By increasing the number of samples and repeated measures, we give the gauge a better opportunity for “stuff” to happen that would affect the measurement, and we give ourselves some more data to help us reach a conclusion. I mean, if something happens during the hour of this test, it is only going to perform worse once we get it into production, right?

So we need to look at:

  1. The repeatability, or how much variability the same operator and the same gauge produce on the same part. This comes from the range within operator within part and will be shown on a modified range chart.
  2. The reproducibility, or how much variability is due to differences in operator and gauge. (This is not applicable in this scenario.) This comes from the range across operators.
  3. The discrimination of the gauge, or the ability of the gauge to tell nominally different parts from each other. This will come from the total measurement error for each operator and be shown on a modified mean chart.
  4. The ability of the gauge to correctly classify product as conforming or nonconforming on a single measure. This will be indicated by the metric %R&R.

Once the data are in the spreadsheet, a whole bunch of things happen. Let’s go through each and see what we can learn about this gauge.

As with the potential study, we are using the range across each unit’s measurements to get an estimate of the measurement error within sample. The average range across all those samples ought to give us an estimate of that component of the measurement error. Now that we have 25 samples being measured five times each, we can use the concept of a control chart to help us determine if any of the ranges are unexpectedly high, instead of just eyeballing it. (We can’t do that with the potential study, because the limits on the range fluctuate with sampling error on those smaller sample sizes and lower repeated measures.)

So the spreadsheet creates the usual type of range chart, but the range is for each part, not across multiple parts. Be sure you understand what this range chart is doing; it is critical to making conclusions about your measurement system. Each dot is the range of the repeated measurements on the same part, and so represents measurement error. Measurement error is one of the few events you can usually count on being normally distributed and hopefully the same for each part (we will verify this two ways), so the average of the ranges of those repeated measurements can be used in the usual formula for calculating the control limits for the ranges:

$$\mathrm{UCL}_R = D_4\bar{R}, \qquad \mathrm{LCL}_R = D_3\bar{R}$$

where D3 and D4 are constants related to sample size. As it turns out, with a sample size of five, there is no lower limit on the range chart. Because each dot is the range across a part, the dots are not in time order, so we don’t connect the dots and we don’t use any of the time-based control rules like runs and trends. We just look for one or more points outside of the limits.
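As a rough sketch of that calculation (not the spreadsheet's actual implementation), the range for each part, the average range, and the upper limit could be computed as below. The constants are the standard tabulated values for subgroups of five; the function name and the three demo parts are invented for illustration and are not the study data.

```python
# Standard control chart constants for five repeated measures per part.
D3_N5 = 0.0     # no lower limit on the range chart at n = 5
D4_N5 = 2.114

def range_chart(readings_by_part, d3=D3_N5, d4=D4_N5):
    """Range per part, average range, limits, and indexes of parts outside the limits."""
    ranges = [max(r) - min(r) for r in readings_by_part]
    r_bar = sum(ranges) / len(ranges)
    lcl, ucl = d3 * r_bar, d4 * r_bar
    flagged = [i for i, r in enumerate(ranges) if r > ucl or r < lcl]
    return ranges, r_bar, lcl, ucl, flagged

# Made-up readings (grams) for three parts, five repeated measures each.
demo = [
    [460.0, 470.0, 455.0, 480.0, 465.0],
    [450.0, 475.0, 468.0, 459.0, 471.0],
    [472.0, 461.0, 466.0, 449.0, 478.0],
]
ranges, r_bar, lcl, ucl, flagged = range_chart(demo)
print(ranges, round(r_bar, 2), round(ucl, 2), flagged)
```

Only the "outside the limits" check is implemented, since run and trend rules do not apply when the points are not in time order.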

And here is what we see:

Figure 1: Range chart of repeated measures

This chart would detect if one or more parts had an unusually high range, which you might see if the part gets damaged during the MSA, or if there was something unique to that part that made getting a measurement difficult. Just like with regular statistical process control (SPC), we would investigate any point outside the limits to try to understand what happened. But with no points outside the limits, we can say that the within-part variability looks to be pretty stable across all of our parts.

Those of you who have ever used a mass balance are looking at that average range and going, “Whoa!” But we did learn something important here: There is nothing unusual about any part that is causing a large range in measurements, which implies that there is something inherent in the measurement process itself causing that. Whether that is due to operator error or measurement device we don’t know by looking at the graph (though the assumption in this case was that operator didn’t have an effect).

Because we have built a range chart it seems like we ought to take a look at a mean chart, too. But it is again important to understand what the mean chart is telling us.

Just like the range chart, the mean chart here is the mean of the repeated measures, not the mean of multiple samples, and the samples are selected from nominals across the entire range over which the gauge is expected to be used. These two facts have a profound effect on how we interpret this chart.

Remember how control limits are (by default) calculated for the location charts for continuous data? Except in very particular circumstances, we use the average dispersion metric multiplied by a constant, as with a range chart:

$$\mathrm{UCL}_{\bar{X}},\ \mathrm{LCL}_{\bar{X}} = \bar{\bar{X}} \pm A_2\bar{R}$$

We do this because the average range will give us a better idea of the true underlying variability. The way we usually make a control chart is to take, say, five sequential units for each sample. That way they are about as similar as we can make them, so what variability we see within-sample is hopefully due to just the total process variation (which includes inherent variability in the parts and measurement error). If there are no changes through time (out-of-control events), then the within-sample error is the same as the between-sample error, and the chart shows random, normally distributed means. And don’t forget, we are taking a sample of five, and due to the central limit theorem I know that the random sampling distribution of the means is going to be more narrowly distributed than the individuals. Remember this?

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

So I use the range to estimate the process standard deviation (σ), and then reduce that to account for the sample size used in generating the means. Otherwise, the limits would be too big for our averages.
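Spelled out, that two-step reasoning is where the usual mean chart constant comes from. The symbols below follow standard control chart notation (they are assumed here, not taken from the spreadsheet): the average range is converted to a standard deviation with the tabulated constant d2, and that is shrunk by the square root of the sample size.

$$\hat{\sigma} = \frac{\bar{R}}{d_2}, \qquad \hat{\sigma}_{\bar{X}} = \frac{\hat{\sigma}}{\sqrt{n}}, \qquad \bar{\bar{X}} \pm 3\,\hat{\sigma}_{\bar{X}} = \bar{\bar{X}} \pm \frac{3}{d_2\sqrt{n}}\,\bar{R} = \bar{\bar{X}} \pm A_2\bar{R}$$

For a sample size of five, d2 = 2.326, so A2 = 3 / (2.326 × √5) ≈ 0.577, which matches the tabulated value.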

But that is not what is happening here.

In an MSA we are trying to understand only the measurement error, and our ranges across the repeated measures represent only that error. Also, we should be choosing as samples for our MSA parts across the entire range over which I expect to use that gauge. If I am testing a 1- to 2-inch micrometer, I will knowingly choose parts that are to a 1-inch nominal, 2-inch nominal, and anything in between. So I don’t expect my means to be distributed in any particular way; I chose how they are distributed when I decided which parts I wanted for my sample. If my micrometer varies around a 100th of an inch on repeated measures, the range is going to predict a tiny control limit, but my parts are up to an inch different from each other. We have an additional component of variability here, and that is the nominal differences between parts.

So the mean chart is going to look weird. You might try to make a case that you don’t even need the mean chart, but we can still use a chart with the means to learn about our measurement system. Again we don’t draw lines from average to average (which implies time order) but now I add in points for the individual readings as well (you’ll see why in a moment). I have big blue dots for the mean and little black dots for each measurement. This will visually show the spread of the repeated measurements, which can help us figure out what is going on.

We are not doing process control here—control limits make no sense to put on this chart. We don’t care about the part-to-part variability, we care about the variation of the repeated measures. We can put some lines on the chart that show that variation. To do so, we use this good old formula—

$$\hat{\sigma}_e = \frac{\bar{R}}{d_2}$$

—to give us an estimate of the measurement error based on the range, which is about 14.69 g. We do not correct for sample size—we run this process by taking a single measure, so we want to see visually how much variation we can have on a single measurement. Let’s use ±3 measurement error standard deviations to generate the natural tolerance of repeated measurements on the exact same part time and time again. For convenience, we will place these lines (not control limits) around the mean of all the parts.
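Here is a minimal sketch of that arithmetic, assuming the standard constant d2 = 2.326 for five repeated measures. The function names are made up, and because the grand mean of the 25 parts is not quoted in the text, the 465 g nominal stands in for it below purely for illustration.

```python
D2_N5 = 2.326   # standard d2 constant for five repeated measures per part

def measurement_sigma(r_bar, d2=D2_N5):
    """Estimate the measurement error standard deviation from the average range."""
    return r_bar / d2

def single_reading_band(center, sigma_e, k=3.0):
    """Natural tolerance of a single reading: center +/- k sigma, no sqrt(n) correction."""
    return center - k * sigma_e, center + k * sigma_e

sigma_e = 14.69          # grams, the value quoted in the article
center = 465.0           # ASSUMPTION: nominal used in place of the unreported grand mean
low, high = single_reading_band(center, sigma_e)
print(f"dashed lines at {low:.2f} g and {high:.2f} g (width {high - low:.2f} g)")
```

Working backward, a measurement error of 14.69 g corresponds to an average range of roughly 34.2 g (14.69 × 2.326); that figure is back-calculated here rather than read from the spreadsheet.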

And we see this:

Figure 2: Mean plot of repeated measures with ±3σ lines and individual readings

Note that I made those lines dashed black, so as not to confuse anyone into thinking that they are control limits.

This chart shows a process with so much variation that whatever part-to-part differences might exist are swamped by the huge measurement error. See that blue dot pretty close to the green line? If I measured that part 100 times, I could get readings spanning the entire range between the black dashed lines. Yikes!

What I actually want to see on this graph is something like this:

Figure 3: Mean plot of repeated measures with ±3σ lines and individual readings for different gauge

The black dashed lines show the discrimination of the gauge on a single measurement. With the gauge in figure 3, we have a pretty good ability to discriminate one part from another. In figure 2, we can’t tell one part from another with a single measure on the mass balance—any real differences are small compared to the measurement variation.

Still, we chose a bunch of parts from the process—maybe they really are pretty much the same. If the specification is wide compared to this variation, I can still count on the gauge to correctly classify my parts as in or out of spec. Remember that goofy graph from last month showing the probability of incorrectly classifying a part? Here it is again for this gauge.

 

[Figure: probability of incorrectly classifying a part with this gauge]

So even if our preform masses are well within specification, our measurement system is going to be classifying a good chunk of them as out of spec. And we stand a pretty good chance of classifying parts that are out of spec as in spec. Hmmm….
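One way to get a feel for the numbers behind a graph like that is to compute, for a part of known true mass, the chance that a single reading lands outside the specification. The sketch below is not the spreadsheet's method. It assumes an unbiased gauge with normally distributed measurement error of 14.69 g and the 465 ±50 g specification; the function names and the list of true masses are illustrative.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_reads_out_of_spec(true_mass, sigma_e, lsl=415.0, usl=515.0):
    """Chance that one reading of a part with this true mass falls outside LSL..USL."""
    below = normal_cdf((lsl - true_mass) / sigma_e)
    above = 1.0 - normal_cdf((usl - true_mass) / sigma_e)
    return below + above

SIGMA_E = 14.69  # grams, from the article
for mass in (465, 490, 500, 510, 520, 530):   # hypothetical true masses
    p_out = prob_reads_out_of_spec(mass, SIGMA_E)
    verdict = "in spec" if 415 <= mass <= 515 else "out of spec"
    print(f"true mass {mass} g ({verdict}): P(reads out of spec) = {p_out:.3f}")
```

For a part that is truly in spec, the printed probability is the chance of a false reject; for a part that is truly out of spec, one minus that probability is the chance of a false accept.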

We don’t have to make that graph as part of an MSA; we just need a metric to show this inability of the gauge to make the right decision. That is the %R&R calculation, which will tell me what proportion of the spec is taken up solely by measurement error:

[Equation: %R&R, the spread of the measurement error expressed as a percentage of the specification width]

Again, %R&R should not be used as the only consideration for acceptability, but in this case the high measurement error relative to the spec, combined with the fact that we have had problems with this process for years, would seem to indicate that it is time to give that mass balance salesman a call and to think about how to protect the new balance from phenolic contamination.
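For readers who want to reproduce the metric by hand, here is a hedged sketch. It assumes %R&R is the ±3σ spread of the measurement error (six standard deviations) divided by the full specification width. That multiplier is an assumption on my part: some conventions use 5.15 standard deviations instead, so check which one your spreadsheet uses before comparing numbers.

```python
def percent_rr(sigma_e, lsl, usl, k=6.0):
    """Measurement error spread (k sigma) as a percentage of the spec width.

    k = 6.0 is an ASSUMED convention (+/-3 sigma); pass k=5.15 for the other common one.
    """
    return 100.0 * k * sigma_e / (usl - lsl)

# Values from the article: sigma_e = 14.69 g, spec 465 +/- 50 g.
value = percent_rr(14.69, lsl=415.0, usl=515.0)
print(f"%R&R = {value:.1f}%")
```

Under the six-sigma assumption this works out to about 88 percent of the specification being consumed by measurement error alone.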

One last thing to consider is to see if there is a relationship between the magnitude of the mass reading and its variation. It is not uncommon for a measurement system to have different variability on the high versus the low end of the scale. We can check that with a quick correlation, which the spreadsheet generated for us.

 

[Figure: correlation between the magnitude of the mass reading and its variation]

We test to see if that correlation is significant, using the correlation test for ρ = 0, and (no surprise) there is no significant correlation. If there were, the variation of the measurements would change with the magnitude of the reading, so the ability to correctly classify our preforms would change depending on how much they weigh, and we would have a different %R&R for different nominal masses. It is good to know if this is the case, especially with a gauge that measures a wide span of nominal values as part of production. (Yes, I am talking to you, QA/QC labs!)
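A minimal sketch of such a test, assuming the usual t statistic for a Pearson correlation and taking the part means and part ranges as the two variables (the spreadsheet may pair different quantities). The correlation value of 0.2 in the demo is hypothetical, not the result from this study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_for_zero_correlation(r, n):
    """t statistic for testing rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))

T_CRIT_DF23 = 2.069   # two-sided critical value, alpha = 0.05, 23 df (25 parts)

# Hypothetical example: a correlation of 0.2 observed across 25 parts.
t_stat = t_for_zero_correlation(0.2, 25)
print(f"t = {t_stat:.3f}, significant: {abs(t_stat) > T_CRIT_DF23}")
```

In practice you would pass the 25 part means and the 25 part ranges to `pearson_r` and feed the result into the t statistic.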

The short-term study gives us a lot of information on how a gauge performs during a snapshot in time, and the %R&R indicates if I can use that gauge to make conformance decisions. (And I hope by this point, you can see the total insufficiency of a calibration sticker to tell you that.) But once I start using a gauge, how do I know that it is still good today?

The long-term study monitors a gauge over time to ensure that a gauge that is acceptable today remains so tomorrow. To do that, we set aside five to eight samples from our usual production, spanning the range of what the gauge is to measure. Every day we remeasure these same samples, so again we are getting repeated measures. If there is a change in the readings we investigate to see what changed. And if it wasn’t the samples, then the gauge is giving different numbers today than it did yesterday.

So here is your mission, should you choose to accept it:

A statistical facilitator and an engineer wish to conduct a gauge capability analysis (long-term) for a particular Ignition Signal Processing Test on engine control modules. The test selected for study measures voltage which has the following specifications:

IGGND = 1.4100 ± 0.0984 Volts (Specification)

Eight control modules are randomly selected from the production line at the plant and run through the tester at one-hour intervals (in random order within each hour, of course). This sequence is repeated until 25 samples (j = 25) of size eight (n = 8) have been collected.

The data can be found here.

Play around with the data in the MSA spreadsheet, and let me know what you think about your voltage test fixture.

*By the way, I noticed that the MSA spreadsheet will give an error in Excel 2007 where it hasn’t before. The new version with this workaround is posted online.