Making Sense of Mann-Whitney Test for Median Comparison
When conducting the 2-sample t-test to compare the average of two groups, the data must be sampled from normally distributed populations. If that assumption does not hold, the nonparametric Mann-Whitney test is a better safeguard against drawing wrong conclusions.
By Arne Buthmann
The Mann-Whitney test compares the medians from two populations and works when the Y variable is continuous, discrete-ordinal or discrete-count, and the X variable is discrete with two attributes. Of course, the Mann-Whitney test can also be used for normally distributed data, but in that case it is less powerful than the 2-sample t-test.
Uses for the Mann-Whitney Test
Examples of uses for the Mann-Whitney test include:
Comparing the medians of manufacturing cycle times (Y = continuous) of two different production lines (X).
Comparing the medians of the satisfaction ratings (Y = discrete-ordinal) of customers before and after improving the quality of a product or service.
Comparing the medians of the number of injuries per month (Y = discrete count) at two different sites (X).
Project Example: Reducing Call Times
A team wants to find out whether a project to reduce the time to answer customer calls was successful. Time is measured before and after the improvement. A dot plot (Figure 1) of the data shows a lot of overlap between the call times – it is hard to tell whether there are significant differences.
Figure 1: Cycle Time Before and After Improvement Effort
Therefore, the team decides to use a hypothesis test to determine if there are “true differences” between before and after. Because the data is not normally distributed (p < 0.05) (Figure 2), the 2-sample t-test cannot be used. The practitioners will use the Mann-Whitney test instead.
Figure 2: Normality Test of Data Before and After Improvement Effort
For the test, the null hypothesis (H0) is: The samples come from the same distribution, or there is no difference between the medians of the call times before and after the improvement. The alternative hypothesis (Ha) is: The samples come from different distributions, or there is a difference.
Passing Mann-Whitney Test Assumptions
Although the Mann-Whitney test does not require normally distributed data, that does not mean it is assumption free. For the Mann-Whitney test, data from each population must be an independent random sample, and the population distributions must have equal variances and the same shape.
Equal variances can be tested. For non-normally distributed data, Levene's test is used to make a decision (Figure 3). Because the p-value for this test is 0.243, the variances of the before and after groups used in the customer call example can be assumed equal.
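As an illustration, an equal-variance check like the one in Figure 3 can be sketched with SciPy's Levene test. The article's raw call-time data is not reproduced here, so the two exponential samples below are hypothetical stand-ins for the before and after groups:

```python
# Sketch of an equal-variance check with Levene's test (SciPy).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
before = rng.exponential(scale=2.0, size=100)   # hypothetical "before" data
after = rng.exponential(scale=1.5, size=80)     # hypothetical "after" data

# center="median" is the Brown-Forsythe variant, robust for non-normal data.
stat, p = stats.levene(before, after, center="median")
print(f"Levene statistic = {stat:.3f}, p-value = {p:.3f}")
# If p > 0.05, there is no evidence against equal variances.
```

Centering on the median rather than the mean is the usual choice when, as here, the data is clearly non-normal.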
Figure 3: Test for Equal Variances on Before and After Improvement Effort Data
Ideally, a probability plot can be used to look for a similar distribution. In this case, the probability plot (Figure 4) shows that all the data follow an exponential distribution (p > 0.05).
Figure 4: Test for Exponential Distribution of Before and After Improvement Effort Data
If the probability plot does not identify a single distribution that matches all the groups, a visual check of the data may help. When examining the plot, a practitioner might ask: Do the distributions look similar? Are they all left- or right-skewed, with only some extreme values?
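The distribution-shape check behind Figure 4 can be approximated in code. The sketch below (again with hypothetical data, since the article's raw values are not listed) uses the probability-plot correlation from scipy.stats.probplot against an exponential distribution; the closer r is to 1, the better the fit:

```python
# Sketch of a distribution-shape check via probability-plot correlation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
before = rng.exponential(scale=2.0, size=100)   # hypothetical "before" data
after = rng.exponential(scale=1.5, size=80)     # hypothetical "after" data

for name, sample in [("before", before), ("after", after)]:
    # probplot returns the ordered quantile pairs and a least-squares fit;
    # r near 1 means the sample tracks the exponential quantiles closely.
    (osm, osr), (slope, intercept, r) = stats.probplot(sample, dist=stats.expon)
    print(f"{name}: probability-plot correlation r = {r:.3f}")
```

If both groups show a high correlation against the same candidate distribution, the "same shape" assumption is plausible.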
Completing the Test
Because the assumptions are now verified, the Mann-Whitney test can be conducted. If the p-value is below the commonly agreed alpha risk of 5 percent (0.05), the null hypothesis can be rejected and a significant difference can be assumed. For the call times, the p-value is 0.0459 – less than 0.05. The median call time of 1.15 minutes after the improvement is therefore significantly shorter than the 2-minute median before the improvement.
Mann-Whitney Test and Confidence Interval: Before; After

            N   Median
Before    100    2.000
After      80    1.150

Point estimate for ETA1 - ETA2 is 0.400
95.0 percent confidence interval for ETA1 - ETA2 is (0.000; 0.900)
W = 9,743.5
Test of ETA1 = ETA2 vs. ETA1 not = ETA2 is significant at 0.0460
The test is significant at 0.0459 (adjusted for ties)
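The output above comes from a statistics package; a comparable result can be produced with scipy.stats.mannwhitneyu (SciPy 1.7 or later). The data below is hypothetical, since the article's raw call times are not listed:

```python
# Sketch of the Mann-Whitney test in SciPy on stand-in data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
before = rng.exponential(scale=2.0, size=100)   # hypothetical "before" data
after = rng.exponential(scale=1.15, size=80)    # hypothetical "after" data

# Two-sided test with the normal approximation, matching the style of
# output shown above (SciPy applies a tie correction automatically).
res = stats.mannwhitneyu(before, after, alternative="two-sided",
                         method="asymptotic")
print(f"U = {res.statistic:.1f}, p-value = {res.pvalue:.4f}")
```

Note that SciPy reports the U statistic rather than the rank sum W; for the first sample of size n, the two are related by W = U + n(n + 1)/2.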
How the Mann-Whitney Test Works
Another name for the Mann-Whitney test is the 2-sample rank test, and that name indicates how the test works.
The Mann-Whitney test can be completed in four steps:
Combine the data from the two samples into one
Rank all the values, with the smallest observation given rank 1, the second smallest rank 2, etc.
Calculate and assign the average rank for the observations that are tied (the ones with the same value)
Calculate the sum of the ranks of the first sample (the W-value)
Table 1 shows Steps 1 through 4 for the call time example.
Table 1: Sum of the Ranks of the First Sample (the W-value)
Call time   Improvement   Rank   Rank for ties
0.1         Before           1       4
0.1         Before           2       4
0.1         After            3       4
0.1         After            4       4
0.1         After            5       4
0.1         After            6       4
0.1         After            7       4
0.2         Before           8      11
0.2         Before           9      11
0.2         Before          10      11
0.2         After           11      11
0.2         After           12      11
0.2         After           13      11
0.2         After           14      11
...         ...            ...     ...
7.5         Before         173     173
8           After          174     174
8.5         After          175     175
8.6         Before         176     176
10.3        Before         177     177
11.3        Before         178     178
11.9        After          179     179
18.7        Before         180     180

Sum of ranks (W-value) for before: 9,743.5
Because Ranks 1 through 7 are related to the same call time of 0.1 minutes, the average rank is calculated as (1 + 2 + 3 + 4 + 5 + 6 + 7) / 7 = 4. Other ranks for ties are determined in a similar fashion.
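The four ranking steps can be sketched with scipy.stats.rankdata, which averages tied ranks automatically. The small data set below is hypothetical but reproduces the article's tie pattern (seven values of 0.1 sharing average rank 4, seven values of 0.2 sharing average rank 11):

```python
# Steps 1-4 of the Mann-Whitney test on a small hypothetical data set.
import numpy as np
from scipy import stats

before = np.array([0.1, 0.1, 0.2, 0.2, 0.2, 7.5])
after = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.2, 0.2, 8.0])

combined = np.concatenate([before, after])   # Step 1: pool both samples
ranks = stats.rankdata(combined)             # Steps 2-3: rank, averaging ties
w = ranks[:len(before)].sum()                # Step 4: sum ranks of sample 1
print(f"W = {w}")
# prints: W = 56.0  (4 + 4 + 11 + 11 + 11 + 15 for the "before" sample)
```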
Based on the W-value, the Mann-Whitney test now determines the p-value of the test using a normal approximation, which is calculated as follows:

Z_W = \frac{W - \frac{n(n + m + 1)}{2}}{\sqrt{\frac{nm(n + m + 1)}{12}}}

where:
W = the Mann-Whitney test statistic, here 9,743.5
n = the size of sample 1 (Before), here 100
m = the size of sample 2 (After), here 80
The resulting Z_W value is 1.995, which translates for a two-sided test (+/- Z_W) under the normal approximation into a p-value of 0.046.
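Plugging the article's numbers into the unadjusted normal approximation can be sketched as follows; any small difference from the reported 1.995 comes from rounding:

```python
# Unadjusted normal approximation for the Mann-Whitney p-value,
# using the W, n and m values from the article's example.
import math

W, n, m = 9743.5, 100, 80

mean_w = n * (n + m + 1) / 2                  # expected W under H0
sd_w = math.sqrt(n * m * (n + m + 1) / 12)    # unadjusted standard deviation
z = (W - mean_w) / sd_w
p = math.erfc(z / math.sqrt(2))               # two-sided normal p-value
print(f"Z_W = {z:.3f}, p-value = {p:.4f}")
```

This gives a Z_W of about 1.996 and a p-value of about 0.046, in line with the article's figures.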
If there are ties in the data, as in this example, the p-value is adjusted by replacing the denominator of the above Z statistic with

\sqrt{\frac{nm}{12}\left(n + m + 1 - \frac{\sum_{i=1}^{l}(t_i^3 - t_i)}{(n + m)(n + m - 1)}\right)}

where:
l = the number of sets of ties
t_i = the number of tied values in the i-th set of ties, i = 1, 2, ..., l
The unadjusted p-value is conservative if ties are present; the adjusted p-value is usually closer to the correct value, but it is not always conservative.
In this example, the adjustment barely changes the p-value; it is 0.0459. This means the probability of observing such a Z_W value if there were actually no difference between the call times before and after the improvement is only 4.59 percent. With such a small risk of being wrong, a practitioner can conclude that the after results are significantly different.
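The tie adjustment can be sketched as a small helper that applies the corrected denominator. The tie counts passed in below are hypothetical, since the full tie structure of the 180 observations is not listed in the article:

```python
# Tie-adjusted denominator for the Mann-Whitney Z statistic.
import math

def tie_adjusted_sd(n, m, tie_counts):
    """Corrected standard deviation of W; tie_counts lists the size
    of each set of tied observations in the combined sample."""
    N = n + m
    correction = sum(t**3 - t for t in tie_counts) / (N * (N - 1))
    return math.sqrt(n * m / 12 * (N + 1 - correction))

# Hypothetical tie structure for illustration only.
print(f"{tie_adjusted_sd(100, 80, [7, 7, 5]):.3f}")
```

With an empty tie list the function reduces to the unadjusted denominator (about 347.4 for n = 100, m = 80); every tied set shrinks the denominator slightly, which is why the adjusted p-value here is slightly smaller than the unadjusted one.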
About the Author: Arne Buthmann is a senior consultant with Valeocon Management Consulting in Europe. He has a wide range of experience in consulting and training multi-national business enterprises such as Novartis, Johnson & Johnson, Merial, Danone, TRW, Siemens and Bosch. Buthmann helps clients to implement Six Sigma, Lean and Design for Six Sigma, and is the co-author of the book Produkt- und Prozessdesign für Six Sigma mit DFSS.