## The Mystery Measurement Theatre

- Created: 14 January 2010
- Written by Steven Ouellette

If you can’t trust your measurement system, you can’t do anything with the data it generates. Last month, in “Letting You In On a Little Secret,” we talked about the purpose of measurement system analysis (MSA) and I gave you a neat spreadsheet that will do MSA for you, as well as some data (repeated after the jump) from the gauge you want to buy, the Hard-A-Tron. I also left you with a mysterious statement that this study was trickier than it appeared. This month I’ll start off answering a question I received, and then we will see how well the Hard-A-Tron did, and what mysterious thing was going on in the data. After that, if you are good, I’ll give you another set of data to further test a measurement device.

The question I received was, “Why don’t you use a standard hardness coupon instead of production material?” It boils down to two words: external validity. Let’s say I’m testing my Hard-A-Tron on these standard, NIST-traceable coupons, and the gauge passes. What does this tell me about my ability to measure the cylinders I fabricate?

Well, not much if anything, actually. The coupons are flat—I just drop them onto the measuring stage. The cylinders are, well, cylindrical, so I might have some sort of a fixture to place them in to hold them while I test with the device. Does the fixture have some springiness to it that might affect the reading? How about the extra height above the stage—could that change the reading variability? And the fact that I'm measuring a curved surface rather than a flat one—how might that change the as-measured hardness? Well, we don’t know—that was not part of the study using coupons. Using similar reasoning, I want to perform the MSA in the environment in which the device is actually used (perhaps dirty and poorly lit) and by the people who actually use it, instead of me alone in a nice clean, bright lab. My research question is not, “How well does my device measure coupons?” but rather, “How well does my device measure my cylinders in the production environment?” And to do that, I want to replicate as much of the actual process as I can.

OK, back to our Hard-A-Tron device. Recall that one of the main purposes of MSA is to determine if a measurement system will correctly classify something as conforming or nonconforming to your specification. If you used my spreadsheet, it does all the calculations for you so that you can answer this question. Recall that we have two operators (Jack and Jill) measuring the hardness on the same bars. First Jack measures all 10 in a random order. Then he measures all 10 in a different random order. Then Jill measures all in a third random order, and finally measures them again in a different random order. (Phew!)

Here were the results I showed you last time:

So let’s think through our sources of measurement variability here.

- Each of the samples we are measuring probably has some different “real” hardness, so we have part-to-part variability. We don’t want to count this as part of our gauge variability though—it is inherent in the parts.
- The same operator measuring the same part does not get exactly the same number, so we track that with the range within part within operator. We will calculate an average range within operator for an overall indicator of this. This range will allow us to estimate the variance due to repeatability. (We want to keep an eye on the ranges within part though—that might indicate a problem with the part—perhaps it got damaged during testing or there is something making the measurement difficult to do on that part.)
- Each operator might have a different bias off of the “real” number. Regardless of the fact that the different parts could be totally different hardnesses, the *average* of all the parts should be a particular number, right? So we will take an average across all the readings for each operator. If those numbers are statistically different, then the operators are biased compared to each other. (A repeated measures t-test would be the way to statistically test that, but that is a little beyond the scope of this article.) The difference between the two averages will give us an estimate of the reproducibility variability.
- … and there is one more “mystery component” that I am not going to tell you yet.

You might have memorized the formula to estimate the standard deviation from the average range when you learned about statistical process control. Here it is again:
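Written out with the usual SPC symbols (the notation is an assumption, chosen to match the σ and d_{2} used in the text), the estimate is:

$$
\hat{\sigma} = \frac{\bar{R}}{d_2}
$$

where $\bar{R}$ is the average range and $d_2$ is the tabulated constant for the subgroup size.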

Because we have two operators as well as two measurements, we use an expanded table to look up d_{2}. Thankfully, that table is built right into the spreadsheet. We have two different ranges—the average of the two operators’ ranges, and the range between the averages for each operator. Dividing by d_{2} gives us the estimate for the within-operator and between-operator variability, repeatability (σ_{RPT}) and reproducibility (σ_{RPD}) respectively.
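The two range calculations can be sketched in a few lines of Python. This is only an illustration: the readings are made up (they are not the Hard-A-Tron data), the function names are hypothetical, and the d_{2} constants are placeholders taken from commonly published tables (1.128 for ranges of two readings over many subgroups, 1.414 for a single range of two averages). The expanded table in the spreadsheet may give slightly different values.

```python
from statistics import mean

# HYPOTHETICAL readings, not the article's data.
# readings[operator] holds one (trial 1, trial 2) pair per part.
readings = {
    "Jack": [(68.0, 69.5), (71.0, 70.0), (66.5, 68.0), (72.0, 73.5), (69.0, 70.5)],
    "Jill": [(71.5, 72.0), (73.0, 74.5), (70.0, 71.0), (75.5, 76.0), (72.5, 74.0)],
}

# ASSUMED d2 constants; look up the right ones for your own study.
D2_WITHIN = 1.128   # ranges of two readings, many subgroups
D2_BETWEEN = 1.414  # one range between two operator averages


def average_range(trials):
    """Average of the within-part ranges for one operator."""
    return mean(max(pair) - min(pair) for pair in trials)


def repeatability_sigma(readings, d2=D2_WITHIN):
    """Average the operators' average ranges, then divide by d2."""
    r_bar = mean(average_range(trials) for trials in readings.values())
    return r_bar / d2


def reproducibility_sigma(readings, d2=D2_BETWEEN):
    """Range between the operators' overall averages, divided by d2."""
    operator_means = [
        mean(x for pair in trials for x in pair) for trials in readings.values()
    ]
    return (max(operator_means) - min(operator_means)) / d2


sigma_rpt = repeatability_sigma(readings)
sigma_rpd = reproducibility_sigma(readings)
print(f"repeatability sigma:   {sigma_rpt:.3f}")
print(f"reproducibility sigma: {sigma_rpd:.3f}")
```

Keeping the d_{2} values as parameters makes it easy to swap in the constants from whichever table you trust.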

To estimate the total variability (σ_{e}) including repeatability and reproducibility, we need to add together the variances (you know we aren’t allowed to add standard deviations) and take the square root.
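In symbols, using the subscripts already defined above:

$$
\sigma_e = \sqrt{\sigma_{RPT}^2 + \sigma_{RPD}^2}
$$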

Umm, why do we care about σ_{e} again? This is the overall variability that this measurement system will produce, across operator, for the same part measured again and again. This in turn will be essential in answering the question, “Can I use this measurement system to determine if the product is conforming?”

Here is the output from the spreadsheet:

So Jack averages a little lower than Jill (without a known “real” value for the parts, we don’t know if either Jack or Jill has the right hardness). Jack has a little bit more variability on average within part than Jill. (Note again that for our simple MSA we have not shown that these differences are statistically significant.) We have used those ranges to estimate the variability due to reproducibility and repeatability, and added the variances and taken the square root to get the overall variability of the Hard-A-Tron as used by Jack and Jill.

The reproducibility variability is higher, indicating that the major source of the variability is from operator to operator. This points us to how we can improve our measurement system—we need to figure out how to get Jack and Jill to agree with each other better. Maybe we look at the standard operating procedure used with the measurement device, maybe we watch how they set up the measurement. Or maybe… but that would be giving away the mystery; and I haven’t even assembled everyone in the drawing room.

I know what you are thinking, “That is all well and good, but I still need to answer my question—can I use the dang thing to determine conformance to spec?”

For such a simple question, there is a lot of confusion about how to answer it, so I am going to walk through this to make sure you understand why we use the metric we use. I need you to keep focused on the question we are trying to answer, though.

Determining if something is in or out of spec is an action near and dear to many of our hearts, I am sure. So let’s make a drawing.

Let’s assume for this discussion that the average between Jack and Jill is the real hardness. If Jack and Jill measured the part a whole lot of times, and the part’s real hardness was 70, we would see a distribution of measurements that looks something like this:

Figure 1: Hard-A-Tron measurement error on a part that really measures 70

Now, this is the *same part* with the real average right in the center of the spec. With this much variability, a part that is smack in the center of the spec would be scrapped or reworked about 12.754 percent of the time due only to our measurement variability. That is not so good.

Consider a part that is still really in spec, but not right in the center:

Figure 2: Hard-A-Tron measurement error on a part that really measures 67

This part really has a hardness of 67, so it is in spec, but 31.128 percent of the time we would say it was out of spec. Even more amusing, we would actually have said it is out of spec on the high side 0.5546 percent of the time.

Last one—what would happen if it were actually out of spec at 65?

Figure 3: Hard-A-Tron measurement error on a part that really measures 65

The good news is that we would actually recognize that it is out of spec about 56.7875 percent of the time. (Remarkably, 0.0648 percent of that is shown as over the upper spec.) The bad news is that 43.2125 percent of the time, we would think it was in spec, and send it off to our customer. Oh no! Can you spell “termination with cause”?
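The three percentages above can be reproduced with a normal-distribution sketch. Two things here are assumptions: that the measurement error is normal, and that the spec limits are 65.5 and 74.5. Those limits are inferred from a spec centered on 70 with the width of 9 used in the sample-size calculation later on, not stated directly. The sigma of 2.953 is the total measurement error quoted in that same calculation, and the function names are made up for illustration.

```python
from math import erf, sqrt

SIGMA_E = 2.953        # total measurement sigma quoted in the article
LSL, USL = 65.5, 74.5  # ASSUMED spec limits (center 70, width 9)


def norm_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


def reading_probabilities(true_value, sigma=SIGMA_E, lsl=LSL, usl=USL):
    """Chance that one reading lands below the LSL or above the USL."""
    below = norm_cdf(lsl, true_value, sigma)
    above = 1.0 - norm_cdf(usl, true_value, sigma)
    return below, above


for true_value in (70, 67, 65):
    below, above = reading_probabilities(true_value)
    print(
        f"true hardness {true_value}: "
        f"reads low {100 * below:.3f}%, "
        f"reads high {100 * above:.4f}%, "
        f"reads out of spec {100 * (below + above):.3f}%"
    )
```

For the part at 65, the chance of wrongly shipping it is simply one minus the out-of-spec probability.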

As a side note, we know what will happen if something is measured and is categorized as out of spec: we remeasure it, right? Maybe it will “go back in.” But as you can see in this case, due to the high measurement error, we should be remeasuring even if it is inside of the spec. “Quick, we are in spec; remeasure it.” How many times do you think *that* happens?

Anyway, let’s sum up and show a graph of the probability of incorrectly classifying a part of a given hardness:

Figure 4: Probability of the Hard-A-Tron incorrectly classifying a part as conforming or nonconforming

A crazy graph, I know, but for any given hardness, we can read off how likely it is we will come to the wrong conclusion about its conformance to spec. The worst case is when the part is really right on a spec limit, where the decision (for a symmetric distribution anyway) is always going to be 50–50. The further from the spec limit, the easier it is to make the right decision. But with this system, we are always running a pretty big chance of mischaracterizing our part.
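A curve like this can be tabulated with a short sketch. As before, the normal error model and the 65.5 to 74.5 limits are assumptions (a spec centered on 70 with a width of 9), the sigma is the 2.953 quoted in the article, and the function names are hypothetical. A "wrong decision" means calling an in-spec part out, or an out-of-spec part in, on a single reading.

```python
from math import erf, sqrt

SIGMA_E = 2.953        # total measurement sigma quoted in the article
LSL, USL = 65.5, 74.5  # ASSUMED spec limits (center 70, width 9)


def norm_cdf(x, mu, sigma):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


def prob_wrong_decision(true_value, sigma=SIGMA_E, lsl=LSL, usl=USL):
    """Probability that one reading misclassifies the part."""
    p_reads_out = norm_cdf(lsl, true_value, sigma) + (
        1.0 - norm_cdf(usl, true_value, sigma)
    )
    really_in_spec = lsl <= true_value <= usl
    # In-spec parts are misjudged when they read out, and vice versa.
    return p_reads_out if really_in_spec else 1.0 - p_reads_out


# Tabulate the curve from 62 to 78 in steps of 1 hardness unit.
for hardness in range(62, 79):
    print(f"{hardness:>3}  {100 * prob_wrong_decision(hardness):6.2f}%")
```

The two peaks sit at the spec limits, where the chance of a wrong call is close to a coin flip.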

So clearly, this is a measurement system that has real trouble accomplishing what we need it to do—correctly determine if a part is in or out of spec. Whatever metric we end up using has to reflect this inability.

There are two measurements that are commonly used for determining gauge capability, and they are both pretty similar. The one that has come into vogue is the %R&R metric:
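Written with assumed symbols (USL and LSL for the upper and lower specification limits, σ_{e} for the total measurement error from above), the metric takes this form:

$$
\%R\&R = \frac{5.15\,\sigma_e}{USL - LSL} \times 100\%
$$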

That 5.15 gives you the width of 99 percent of the measurement error.

The one I prefer is the P/T ratio:
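With the same assumed symbols (USL and LSL for the spec limits, σ_{e} for the total measurement error), it can be written as:

$$
P/T = \frac{6\,\sigma_e}{USL - LSL}
$$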

Here you are taking the natural tolerance of the measurement error instead of 99 percent of it. The difference is small, but this one makes more sense to me. Plus, 6 is easier to remember than 5.15, but maybe that is just me. On the other hand, everyone uses 5.15, so I have pretty much given up that fight.

Here are the spreadsheet’s calculations for %R&R.

What does this number tell us? Dividing by the width of the spec gives you the proportion of the spec eaten up by measurement error alone. The smaller this number, the less likely you are to categorize something that is in spec as out, or out of spec as in. Or, as in our case here, if your measurement error itself takes up more than the whole width of the spec, the %R&R clearly tells you so by being stupidly big.
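As a quick check, both ratios can be computed from the two numbers the article quotes in its sample-size calculation: a total sigma of 2.953 and a spec width of 9. The function names below are hypothetical, and the results come from this sketch rather than from the spreadsheet.

```python
SIGMA_E = 2.953   # total measurement sigma quoted in the article
SPEC_WIDTH = 9.0  # USL - LSL, as used in the sample-size calculation


def percent_rr(sigma_e, spec_width, multiplier=5.15):
    """Share of the spec width covered by 99 percent of measurement error."""
    return 100.0 * multiplier * sigma_e / spec_width


def pt_ratio(sigma_e, spec_width):
    """Same idea, using the 6-sigma natural tolerance, as a percentage."""
    return 100.0 * 6.0 * sigma_e / spec_width


print(f"%R&R: {percent_rr(SIGMA_E, SPEC_WIDTH):.1f}%")
print(f"P/T:  {pt_ratio(SIGMA_E, SPEC_WIDTH):.1f}%")
```

Both come out well above 100 percent, which is the "measurement error wider than the spec" situation described above.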

What would you want this number to be? A really nice %R&R would be 10 percent. Here is that goofy Twin Peaks graph with a %R&R of 10 percent:

Figure 5: Probability of a measurement system with a %R&R = 10 percent incorrectly classifying a part as conforming or nonconforming

Well, this is about as silly as before, but in a good way. Unless I am reeeeeeealy close to the spec limit, I am almost certainly going to make the right decision about the conformance to spec for that part.

The %R&R of 10 percent should not be an acceptance criterion, though. Whether a measurement system is acceptable or not is a function of how costly it is to mischaracterize a part and of how many times you are willing to remeasure. All the above graphs are showing you what you would get if you measured each part *one* *time* in order to make a conformance decision. If your %R&R is a little too high, you could decrease the effective gauge error by measuring more than once and taking an average, and then base your decision on that. How many do you need to measure? Remember this equation from basic stats when learning about the random sampling distribution of the mean? Of course you do:
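In standard notation, with $n$ the number of repeat measurements averaged together:

$$
\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}
$$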

You can back-calculate the standard deviation you need in order to get a %R&R of 10 percent. For our Hard-A-Tron, we would need a standard deviation of about (0.1 ÷ 5.15) × 9 = 0.1748. Putting that on the left side of the equation above, and using 2.953 as the sigma on the right, we solve for *n* and get 286.
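The back-calculation is short enough to sketch in code. The function name and its parameters are hypothetical; the inputs are the numbers from the paragraph above.

```python
from math import ceil


def repeats_needed(sigma_e, spec_width, target_rr=0.10, multiplier=5.15):
    """Smallest n whose averaged reading meets the target %R&R."""
    # Standard deviation the averaged reading must have.
    target_sigma = target_rr * spec_width / multiplier
    # sigma / sqrt(n) <= target_sigma  =>  n >= (sigma / target_sigma)^2
    return ceil((sigma_e / target_sigma) ** 2)


n = repeats_needed(sigma_e=2.953, spec_width=9.0)
print(f"measurements to average per part: {n}")
```

Because *n* grows with the square of the ratio, halving the measurement sigma would cut the required repeats by a factor of four.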

Umm, I don’t think we will be measuring our little bars that many times. Looks like we need to reduce that measurement error some other way.

Here is where I finally solve the mystery. Did I distract you with enough stuff to make you forget there even was one?

Go back and look at the data and follow the readings for a single part through Jack 1, Jack 2, Jill 1, and Jill 2. With a few exceptions, the numbers *increase through time*. The potential study we did gives us no direct information of stability through time, but this is a pretty gross change that is visible across each part. Could it be that the real values of hardness were changing with time and the operator-to-operator difference we saw was actually due to that? Well, as the metallurgists in the audience can tell you, that certainly can happen. Either the metal itself gets harder with time (it ages) or the force exerted on the sample for each of the hardness measurements actually hardens the material nearby for the next test.
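One simple way to look for this kind of drift is to line up each part's readings in the order they were taken and see how many rise from round to round. The sketch below uses made-up readings (not the Hard-A-Tron data) and hypothetical function names. It only flags a pattern; it is not a formal stability test.

```python
from statistics import mean

ROUNDS = ("Jack 1", "Jack 2", "Jill 1", "Jill 2")

# HYPOTHETICAL readings: one row per part, in chronological order.
parts = [
    (66.0, 67.5, 70.0, 71.5),
    (69.0, 68.5, 72.0, 73.0),
    (64.5, 66.0, 68.5, 69.0),
    (71.0, 72.5, 74.0, 76.5),
]


def round_means(parts):
    """Average reading for each measurement round."""
    return [mean(column) for column in zip(*parts)]


def count_steadily_increasing(parts):
    """Number of parts whose readings rise at every step."""
    return sum(
        all(later > earlier for earlier, later in zip(row, row[1:]))
        for row in parts
    )


for name, avg in zip(ROUNDS, round_means(parts)):
    print(f"{name}: {avg:.3f}")
print(f"{count_steadily_increasing(parts)} of {len(parts)} parts rise every round")
```

If the round averages climb steadily regardless of who is measuring, time is at least as plausible a culprit as the operator.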

That is what happened here, and that is the last source of variability—the hardness of the material itself was changing with time. In this case, it is impossible to isolate the change in hardness from the change in operator. (By the way, this is exactly why we do Jack 1, Jack 2, Jill 1, and Jill 2 in that order. If we had randomized across operator, we probably would not have noticed the time effect. As it was, this effect was confounded with operator, so while we might have spent time trying to figure out why Jill was different than Jack, we had a better chance of eventually catching the time effect.) It turns out that in this case, hardness measurement is a Class III destructive test—the true value changes with time.

And that is the end of the Mystery of the Maltese Hard-A-Tron.

Now if I get a %R&R of 10 percent in the real world for a potential study like this, I would be happy—but not “I’ll buy your gauge” happy, since I have yet to really put it through its paces. To do that, I go and gather 25 parts that span the expected range of measurement, and have one or more operators measure each part five times or more. This will give us a little bit of information on the stability of the gauge, at least for the short-term, and so this study is cleverly called a short-term study.

Here are some data from a short-term test. In this example, we are weighing a plastic preform before placing it into a compression mold. The weight specification is 465 ±50 grams. We are assuming that the measurement is independent of operator, so we only have one operator do the test. Each row is one of the 25 samples, measured in random order. Each column is the repeated measures, which were done in different random orders.

We have been using this scale for years—and it has a digital readout. On the other hand, we have had a lot of defective parts for years too, and the area the scale is in is pretty contaminated with phenolic dust.

What do you think? Is the mass balance capable of measuring that specification?

Tune in next month to Mystery Measurement Theatre to find out.