Don't read further until you have read Top Ten Stupid Six Sigma Tricks #3!


Question 1

This is a depressingly common error that can corrupt the results even before any data are gathered. Measurement level describes the relationship between what you are really interested in learning about (the dependent variable) and the measurement you actually are taking (the criterion measure). This relationship is described by the terms Nominal, Ordinal, Interval, Ratio, and Absolute. The difficulty comes because knowing what the data are (say temperature measurements, times, or lengths) does not imply the measurement level.

For example, if we are measuring the diameter of a disk, some might say, "Oh, that is continuous data, so I can use a t-test!" However, the statistics are based on the relationship back to the dependent variable, not the nature of the measurement itself. If you are interested in knowing about the diameter or circumference (2πR), then it is a ratio relationship. If you are interested in area (πR²), it is only ordinal. If you are measuring the deviation from target diameter, it is interval. And if you are using the diameter to define the type of part, it is nominal.

This would be pretty academic, except that all those lovely statistical tests you learned are only valid for certain levels of measurement, and again you can draw the wrong conclusion. If you were interested in comparing the disk areas of two groups using the diameter as the criterion measure and you assumed that it was ratio data, you might set up your sample size and select the test based on doing a t-test. The problem is that increasing the diameter by 1 mm going from 5 to 6 mm increases the area by 8.64 mm², but increasing the diameter by 1 mm going from 6 to 7 mm increases the area by 10.21 mm². That is why it is an ordinal relationship - the same increase in the criterion measure (diameter) gives you differing amounts of increase in the dependent variable (area) in which you are really interested.
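If you want to check that arithmetic yourself, here is a minimal sketch. The function name `disk_area` is mine, not from any particular stats package, and it only assumes the usual area formula written in terms of diameter (π·d²/4).

```python
import math

def disk_area(diameter_mm):
    """Area of a disk in mm^2, given its diameter in mm."""
    return math.pi * diameter_mm ** 2 / 4

# The same 1 mm step in the criterion measure (diameter)...
step_5_to_6 = disk_area(6) - disk_area(5)
step_6_to_7 = disk_area(7) - disk_area(6)

# ...gives different steps in the dependent variable (area).
print(f"5 -> 6 mm: +{step_5_to_6:.2f} mm^2")  # +8.64 mm^2
print(f"6 -> 7 mm: +{step_6_to_7:.2f} mm^2")  # +10.21 mm^2
```

The step sizes keep growing as the diameter grows, which is exactly what rules out treating diameter as interval or ratio data when area is the thing you care about.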

So you can't use a t-test on these data, since a t-test assumes ratio or interval level data, and can incorrectly indicate or miss an effect. Note that the probability of making these errors is totally different from alpha and beta error (discussed some in the next question). Alpha and beta only apply to sampling error - here we have an error in the test selection. The preferred solution would be to calculate the area and do a t-test on that, provided our assumption about the relationship between area and spin friction is true. It is probably even better to run the spin friction test before and after, so as not to make this assumption at all, if that is an economical choice. Otherwise you are stuck with less powerful ordinal tests like the Sign Test for Location.
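For the curious, a sign test is simple enough to sketch by hand. What follows is a one-sample version against a hypothesized median, using an exact binomial p-value. The function name, the diameters, and the 6.0 mm reference value are all made up for illustration, and your stats package may handle ties or two-group comparisons differently.

```python
from math import comb

def sign_test_p_value(values, hypothesized_median):
    """Two-sided exact sign test p-value against a hypothesized median."""
    above = sum(1 for v in values if v > hypothesized_median)
    below = sum(1 for v in values if v < hypothesized_median)
    n = above + below  # ties carry no sign information, so they are dropped
    if n == 0:
        return 1.0
    k = min(above, below)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical diameters in mm, tested against a hypothetical median of 6.0 mm
diameters = [6.2, 6.4, 6.1, 6.5, 6.3, 6.6, 6.2, 5.9, 6.4, 6.3]
p_value = sign_test_p_value(diameters, 6.0)
print(f"p = {p_value:.4f}")
```

Notice that the test only looks at whether each value is above or below the reference, never at how far. That is why it is safe for ordinal data, and also why it is less powerful than a t-test when a t-test is legitimate.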

Questions 2 and 3

See, the thing is that statistical tests are based on a pretty conservative assumption, and that is that we will stick with the status quo until we get sufficient data to convince us otherwise. The status quo (usually called the "null hypothesis" because statisticians only speak Greek, not Latin) always involves the state of "no change" or "no effect." You get to choose the alpha error (the probability of incorrectly concluding there has been a change due to sampling error when there really has not). Common probabilities are 0.01, 0.05, and 0.10. I default to 0.05, though if the cost of making the change is high compared to the benefit, I will go to 0.01, and if the cost of missing an opportunity is high compared to the cost of the change, I might go up to 0.10. Alpha is just the number you compare the p-value of your test statistic to in order to determine if there is evidence of a change. Most people "get" alpha.

As I said, we are going to stick with the status quo unless proven otherwise. But it is possible that there is a change of a magnitude that I am concerned about, but I just haven't collected enough data to see that yet. This means that I can also make a beta error, one where I say that there is no change or effect, when in fact there is. While you get to choose alpha, you have to "buy" a lower probability of making a beta error with your design choice, a larger effect, or with larger sample sizes.

That is what these two guys missed. So while they correctly failed to reject the null hypothesis ("There is not sufficient reason to say the process change had an effect"), this is not the same as saying, "There has been no effect of a magnitude that could drive us out of business."

In determining the sample size for means, you can choose any three of these four: alpha, beta, sample size, and the size of the effect you need to detect. You are probably stuck with the remaining input, the standard deviation, although you might be able to reduce this sometimes. In defense of the two guys at lunch, they probably took ten samples because that is what their boss told them they could take. But the chance of their making a beta error (missing a shift of the minimum amount they needed to detect when it was in fact there) was over 90%! They would have been better off saving the expense of the experiment and asking the Magic Eight Ball - at least that way they had a 50% chance of making the correct decision.
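To get a feel for how beta depends on the other inputs, here is a rough sketch using the normal (z) approximation for a two-sided, one-sample test of a mean. The function name and the numbers in the example are hypothetical, not the figures from the original problem. A proper t-based calculation will give a somewhat larger beta at small sample sizes.

```python
from statistics import NormalDist

def approx_beta(n, shift, sigma, alpha=0.05):
    """Approximate beta for a two-sided one-sample test of a mean.

    Uses the normal approximation, so treat the result as a rough guide.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # How many standard errors the true mean sits from the null value
    noncentrality = shift * n ** 0.5 / sigma
    # Probability the test statistic lands in either rejection region
    power = (1 - z.cdf(z_crit - noncentrality)) + z.cdf(-z_crit - noncentrality)
    return 1 - power

# Hypothetical case: the shift we care about is 0.2 standard deviations
for n in (10, 50, 100, 300):
    print(f"n = {n:3d}: beta is roughly {approx_beta(n, shift=0.2, sigma=1.0):.2f}")
```

Run it with your own shift and standard deviation before you take a single sample. If beta comes out anywhere near the Magic Eight Ball's, you need a different design or a bigger budget.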

Yet I have seen this time and time again: people spending a lot of money to run a test that has less than a 50% chance of detecting what it is they wanted to detect in the first place. I have even run into situations where people are wasting money running too many samples. Maybe they only needed three but, "Huh-yuck! I got ten toes!"

Or, in the case of this problem, 29 toes, since 29 is the correct answer. It is a one-sample test (since we are testing against a historical average) and we should use the t-distribution to get the sample size since we have no assurance that the new process will have a known standard deviation. We also need to use a two-tailed test since the difference can go in either direction.
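As a starting point for that kind of calculation, here is a sketch of the textbook normal-approximation sample size for a two-tailed, one-sample test. It is not the t-based calculation that gives the 29 above: the t-distribution version needs a few more samples and is normally done by iteration or in software. The function name and the inputs in the example are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def approx_sample_size(shift, sigma, alpha=0.05, beta=0.10):
    """Normal-approximation sample size for a two-tailed one-sample test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-tailed, so alpha is split
    z_beta = z.inv_cdf(1 - beta)
    return ceil(((z_alpha + z_beta) * sigma / shift) ** 2)

# Hypothetical inputs: detect a shift of one standard deviation
print(approx_sample_size(shift=1.0, sigma=1.0))  # 11
# Halving the shift you need to detect roughly quadruples the sample size
print(approx_sample_size(shift=0.5, sigma=1.0))  # 43
```

Treat the result as a floor, then let your software refine it with the t-distribution.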

Question 4

So if we are stuck testing for normality, it makes sense to test it well. As with all other statistical tests, we would like to be able to test data and determine whether those data could reasonably have come from a normal distribution, and, if they could not have, to have a reasonable chance of detecting that.

Which brings me to my favorite brand of vodka, the Kolmogorov-Smirnov Goodness-of-Fit test.

Most Six Sigma stat software I have seen defaults to this to test for normality (some I have seen also use the Shapiro-Wilk). The problem is that the K-S test is one of the weakest tests for this purpose. Not as bad as the chi-squared perhaps, but still pretty bad in that it tends to not reject the assumption of normality when it really should. Which, as we saw above, means that you can make wrong decisions leading to annoyingly frequent exit interviews.

What I recommend is jointly using the Anderson-Darling, Shapiro-Wilk, and Lin-Mudholkar statistics, and the Skewness and Kurtosis indices, along with their associated tests. Each of these tests has the power to detect certain departures from normality in certain circumstances - by reviewing the results from these tests you can make a good decision about the appropriateness of using the normal distribution.
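The skewness and kurtosis indices are easy enough to compute by hand, so here is a minimal sketch of the simple moment-based versions. It gives the indices only, not their associated significance tests, and it does not attempt the Anderson-Darling, Shapiro-Wilk, or Lin-Mudholkar statistics. The function name and sample data are hypothetical, and your software may apply small-sample corrections that shift the numbers a little.

```python
from statistics import fmean

def skewness_and_excess_kurtosis(data):
    """Moment-based skewness and excess kurtosis (both near 0 for normal data)."""
    n = len(data)
    mean = fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skew, excess_kurtosis

# Hypothetical sample with one long right tail
sample = [1.0, 1.1, 0.9, 1.2, 1.0, 1.1, 0.8, 3.5]
skew, kurt = skewness_and_excess_kurtosis(sample)
print(f"skewness = {skew:.2f}, excess kurtosis = {kurt:.2f}")
```

A skewness well away from zero points to a lopsided distribution, and an excess kurtosis well away from zero points to tails that are heavier or lighter than the normal's. Those are two of the departures the other tests can miss.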

Remember too that even if you decide that the normal distribution is a good fit, you still need to check the process for control and reasonableness before you can go on to the next step. If a process is not in a reasonable state of control, it doesn't matter if the data are fit by the normal distribution or not - you cannot make any statistical inferences since anything can happen and the past is no predictor of the future. In this case you need to figure out what is going on with that first. As for reasonableness, there was once a guy in procurement who complained to the vendor that too many of the batteries were out of specification, including the ones that were predicted to have negative voltage, based on the natural tolerance of the normal distribution. Obviously, even though the voltage distribution passed the normality tests, it could not be right if it was predicting negative voltages. Either that or maybe cold fusion or something was happening and we missed out on a source of infinite energy!
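A reasonableness check like that takes about five lines. The sketch below assumes the usual convention that the natural tolerance is the mean plus or minus three standard deviations. The function name and the battery numbers are invented for illustration, not taken from the procurement story.

```python
from statistics import NormalDist

def natural_tolerance_check(mean, sigma, physical_minimum=0.0):
    """Return the +/- 3 sigma limits and the fitted fraction below a physical floor."""
    fitted = NormalDist(mean, sigma)
    lower_limit = mean - 3 * sigma
    upper_limit = mean + 3 * sigma
    fraction_below_minimum = fitted.cdf(physical_minimum)
    return lower_limit, upper_limit, fraction_below_minimum

# Hypothetical battery data: mean 1.5 V, standard deviation 0.6 V
lower, upper, impossible = natural_tolerance_check(mean=1.5, sigma=0.6)
print(f"natural tolerance: {lower:.2f} V to {upper:.2f} V")
print(f"fraction the fit puts below 0 V: {impossible:.4f}")
```

If the lower limit lands below something physically impossible, the fitted normal is the wrong model for that tail, no matter what the normality tests said.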