__FIRST LECTURE AFTER FACTORIAL ANOVA__

*DATA TRANSFORMATIONS*

Up to this time, we have said that if you do
**NOT** have homogeneity of variance **and** you do
**not** have equal n's, then you should not do an
**ANOVA** or **t-test**.

Instead, you should do a **nonparametric test**.

If you have 2 means - **Mann-Whitney U**.

If you have more than 2 means - **Kruskal-Wallis
H**.

**HOWEVER**, there is one more thing you can try first.

You can try **transforming the data**
by doing something mathematical to each score.

___________________________________________

**Sometimes** this works, but not always.

Another good thing is that if you find a transformation
that will bring about **homogeneity**, it will probably also
bring about **normality**.

In fact, if a transformation causes one unmet assumption to be met, it will usually cause us to meet the other unmet assumptions as well.

___________________________________________

Remember that normality is also an assumption
of **ANOVA**, **t-tests**, and other **parametric** tests.

If a transformation **can't**
be found to bring about **homogeneity** and **normality**, you will
have to fall back on a **nonparametric** test.

___________________________________________

You already know how to check for **homogeneity**
of variance.

We do **Levene's** test - if it is significant,
there is **no** homogeneity of variance. Even a significance level
of .10 is cause for concern.

But, how can we check for **normality**?

There are **statistical tests** and there
are **visual methods** involving **plots**.

There are several statistical tests.

You have probably already seen a couple of
simple ones - **skew** and **kurtosis**.

When a distribution is normal, the values for
**skewness** and **kurtosis** are both zero.

If a distribution is **positively skewed**
(cases cluster to the left and the right tail is extended with only a few cases),
then the statistic for skew will be **positive**.

If it has **negative skew**, skewness
will be **negative**.

If a distribution is too **peaked** (leptokurtic),
the kurtosis value will be **positive**.

If it is too flat, with many cases in the tails
(**platykurtic**), the kurtosis value will be **negative**.

There are tables of critical values for skew
and kurtosis; if you use these, do it at the **.01** or **.001** level for
small to moderate sized samples.

If not, a rule of thumb is that if both kurtosis
and skew are **between -1 and +1**, then we don't need to worry about normality.

If you have a really large sample, don't worry
too much - **ANOVA** is robust to violation if the sample is large.

SPSS will give you kurtosis and skew, and it will also give you some better statistics.

For normality, you can get the **Kolmogorov-Smirnov**
test and the **Shapiro-Wilk** test.

The **Kolmogorov-Smirnov** test will have
the **Lilliefors** modification with small samples.

**Shapiro-Wilk** is for distributions with fewer than 50 cases.

**It tests the null hypothesis that the population is
normally distributed.**

So, if the test is significant, it means the
variable is probably **not** normally distributed.
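The same test can be run outside SPSS. Here is a hedged sketch with Python's scipy (an assumption on my part - the lecture uses SPSS), applied to the 15 scores that are pooled and analyzed later in this handout; scipy's Shapiro-Wilk significance should come out close to the SPSS value:

```python
from scipy.stats import shapiro

# The 15 scores analyzed later in this handout, pooled across groups.
scores = [3, 0, 4, 2, 2,     # Group 1
          6, 4, 2, 4, 7,     # Group 2
          12, 6, 6, 10, 6]   # Group 3

w, p = shapiro(scores)
print(f"Shapiro-Wilk W = {w:.3f}, sig. = {p:.3f}")

# A non-significant p means we have no evidence against normality.
```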

___________________________________________

In SPSS we check both homogeneity and normality
using the **EXPLORE** routine followed by a **PLOT**.

Here are some scores:

** Group 1 Group 2 Group 3**

** 3 6 12**

** 0 4 6**

** 4 2 6**

** 2 4 10**

** 2 7 6**

**Mean 2.2 4.6 8.0**

**Var 2.2 4.4 8.0**

This is a **POISSON** distribution - we'll
talk more about that later.

Note that in each group, the **mean** is just about exactly
equal to the **variance** - typical in **POISSON** distributions.

We can check this for **normality** and
for **homogeneity** with the **EXPLORE** routine in **SPSS**:

_____ 1. Put in the scores as usual, with all in the same column and with another variable as a dummy variable to identify the group:

_____ 2. Now click on **ANALYZE** - **DESCRIPTIVE
STATISTICS** and **EXPLORE**. The **EXPLORE** box will open. Make it
look like this:

_____ 3. Make sure that under **DISPLAY** in the lower
left of the screen, **BOTH** is marked.

_____ 4. Click the **PLOTS** button at the
bottom of the box. The following **EXPLORE: PLOTS** box will open:

_____ 5. Make the box look like the one pictured
above. Then, click **Continue** and then **OK**. The **EXPLORE** analysis
will run. The output will appear. Here is a copy:

There are several things we can look at here - all having to do with the assumption of normality.

**With regard to normality, look at page 1.**

Both **skewness** and **kurtosis** for
the **SCORES** variable fall **between -1 and +1**.

So we suspect it is from a normal population.

They are both **positive**, suggesting that
it is slightly positively skewed and slightly leptokurtic.

_______________________________________

In the middle of page 1, you will find the statistical test for normality.

Look under **Kolmogorov-Smirnov** and you
will see that the significance for normality is .200, and under **Shapiro-Wilk** it is .392.

These are not **significant**,
so we can treat the variable as normal, although it is less so than ideal.

On **page 2** is the histogram, and you
will see the visual evidence for this - slightly skewed positively and slightly
peaked, but not too far from normal.

______________________________________

Look at **page 2** of the printout.

There, you will find **Q-Q** Plots for each
group.

These are also called **normal probability
plots**.

The observations are listed from **low to
high** and **plotted against the expected values if the distribution
were normal**.

**Observed values**
are along the **x axis** and the **predicted values** from a normal dist.
are on **the y axis**.

If the distribution is **normal**, the plot should resemble
**a straight line**.

The **detrended normal Q-Q plot** on page
3 is a plot of the amount each score varies from the straight line.

In these, a normal distribution would have
**every score falling on a horizontal line that passes through 0**.
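Outside SPSS, the Q-Q idea can be sketched with Python's scipy (an assumption - SPSS does this for you). One caution: `probplot` puts the theoretical quantiles on the x axis, the reverse of the SPSS Q-Q plot described above, but the "straight line" logic is identical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0, scale=1, size=200)

# probplot pairs each ordered observation with the quantile expected
# under normality; r near 1 means the points hug a straight line.
(theoretical, ordered), (slope, intercept, r) = stats.probplot(sample)

print(f"correlation with the fitted straight line: r = {r:.4f}")
```

For a genuinely normal sample like this one, r is very close to 1; a skewed sample would bow away from the line and pull r down.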

____________________________________________

**Page 3**
has a **boxplot** (box-and-whiskers plot) for each group.

The **median** is a dark line.

The **length** of the box indicates the
**variability**, because the **75th percentile** is the **top**
of the box, and the **25th percentile** is the **bottom**.

Therefore, the **length** of the box shows
**the interquartile range** (the **middle 50%** of the scores).

Lines are drawn from the edge of the box to
the largest and smallest values that are not **outliers** (outliers are cases
between **1.5 and 3 boxlengths** from the edge of the box, and **extremes**
are more than 3 boxlengths away.)

The length of the box is a visual depiction
of one measure of **variability - the interquartile range.**

**The longer the box, the more variability.**
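The box and its fences can be computed directly. A minimal Python sketch (an assumption - SPSS draws this for you), using the 15 pooled scores from above; note that numpy's default percentile interpolation can differ slightly from the Tukey hinges SPSS uses for boxplots:

```python
import numpy as np

# The 15 scores from the three groups above, pooled.
scores = np.array([3, 0, 4, 2, 2, 6, 4, 2, 4, 7, 12, 6, 6, 10, 6])

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1                  # the length of the box

# Fences at 1.5 box lengths from the box edges; cases beyond
# them would be flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(q1, q3, iqr, lower_fence, upper_fence)
```

With these numbers, the interquartile range is 3.5, so a score would have to fall more than 5.25 points beyond the box edge before being flagged as an outlier.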

__________________________________________________

This variable does not depart significantly from normality.

If it did, there are a large number of **different
transformations** you can try to make the distribution normal.

There is not a whole lot that can be done with normality with so few scores, but there is a possibility of transforming even these few scores to get more homogeneity of variances.

There are many different transformations that are possible.

__SQUARE ROOT TRANSFORMATION__

One of the simplest transformations is a **square
root transformation**.

To do this one, you simply **take the
square root of each of your raw data scores**, then do an
**ANOVA** (or other test) on the transformed scores.
This is a transformation that is often also used when a distribution lacks normality or homogeneity.

It is best for bringing about normality when
a distribution departs only **moderately** from **normality**.

And, use it when the skew is **positive**.

In addition, this **transformation**
is almost always the best one to use when your scores are **counts**,
with the variance about equal to the mean.
Such a distribution is called a ** POISSON**
distribution.

*For example, suppose that you know that
a seed manufacturer is selling seed of which 0.1% is assumed to be dead.*

That means that the probability of **any
one seed being dead** is **.001**.

If you take ** 100 samples of 1000 seeds
each**, you will get:

**37 samples with no dead seeds**

**37 samples with 1 dead seed**

**18 samples with 2 dead seeds**

** 6 samples with 3 dead seeds**

** 2 samples with 4 dead seeds**

If your raw data is **the number of
dead seeds in each 1000-seed sample**, then that distribution will
probably be a **POISSON** distribution.

The above is **not** a normal
distribution by any stretch.

It is a **POISSON** distribution, in which,
by definition, **the variance is equal to the mean** (or a little larger).

** You can often normalize such a
distribution, and make the variances homogeneous, by transforming
each score to square roots**.

Just take the **root** of
each score.

If you have **any scores less than 10**,
then use the **square root of X + .5** instead.
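A small simulation makes the point. This is a hedged Python sketch (my assumption - the lecture works in SPSS): Poisson counts drawn at three different rates show the variance tracking the mean, while the square root of X + .5 makes the variances nearly equal no matter what the mean is.

```python
import numpy as np

rng = np.random.default_rng(7)

# For each rate, the raw variance is about equal to the mean
# (the POISSON signature), so groups with bigger means have
# bigger variances - no homogeneity.
results = {}
for lam in (2.0, 5.0, 10.0):
    counts = rng.poisson(lam, size=100_000)
    results[lam] = (counts.mean(), counts.var(), np.sqrt(counts + 0.5).var())

for lam, (m, v, tv) in results.items():
    print(f"mean {m:5.2f}  raw var {v:5.2f}  var of sqrt(X + .5) {tv:.3f}")
```

The raw variances run roughly 2, 5, and 10, while the transformed variances all sit near .25 - which is exactly why this transformation rescues homogeneity for count data.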

______________________________________________________

But, so far, we have not tested the **variances** for
**homogeneity**.

That is because in **step #3** above, we
did not identify a **FACTOR** so that **SPSS** could break the scores
into the **three groups** of five scores each.

When we were checking for normality, we were evaluating the entire distribution.

Let's go back now and identify a **factor**
in **step #3** above. Repeat the first few steps, then here is the **EXPLORE**
box again:

Note that we moved **group** into the **FACTOR
LIST** field.

_____ 4. Now, click on the **PLOTS** button.
The **PLOTS** box will open:

_____ 5. This time, make the **PLOTS** box look like the
above. Notice that you should check the **NORMALITY PLOTS WITH TESTS** checkbox
and the **UNTRANSFORMED** radio button should be chosen.

_____ 6. Click **CONTINUE** and **OK**.
The output will be produced:

This will give us statistics on **each group separately**,
and it will also give us **Levene's** test for homogeneity of variance.

You will find it on **page 3.**

It is **.101**, which is in the "**nervous**"
category, although not significant.

This is exactly what you would have found had
you run a **one-way ANOVA**.
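The same Levene result can be reproduced outside SPSS. A minimal Python sketch with scipy (an assumption - the lecture uses SPSS throughout); the one catch is that SPSS's Levene statistic centers the scores on the group **means**, so you must ask scipy for `center='mean'`:

```python
from scipy.stats import levene

group1 = [3, 0, 4, 2, 2]
group2 = [6, 4, 2, 4, 7]
group3 = [12, 6, 6, 10, 6]

# scipy's default, center='median', is the Brown-Forsythe variant
# and gives a different answer than the SPSS output shown above.
stat, p = levene(group1, group2, group3, center='mean')
print(f"Levene F = {stat:.3f}, sig. = {p:.3f}")
```

The significance comes out at about .101, matching the printout.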

____________________________________________________

Now, we can consider some different **transformations**.

Let's see what the effect would be on homogeneity of taking the root of each score.

If you have **any scores less than 10**,
then use the **square root of X + .5**.

Since we **do** have scores
less than **10**, let's use the **root of X + .5.**

How do we do that?

_____ 1. Return to the data and click on **TRANSFORM**
and **COMPUTE**. The following box will appear:

What we have to do is name a new "**TARGET
VARIABLE**" in the upper left of the box and write an equation to produce
that new variable in the **NUMERIC EXPRESSION** field.

Since it will be the **root** of each score
after adding .5, we will call it **rootplh**, or **root plus
half a point**.

_____ 2. Type **rootplh** in the **TARGET
VARIABLE FIELD**.

_____ 3. Scroll down the list of functions
in the **FUNCTIONS** box until you find the one for **square roots**.
It is called **SQRT(numexpr)**. (Right click on any function to get a definition.)
Click on it to highlight it and then click the **UP ARROW** button just above
this list of functions. This will move the function up into the **NUMERIC EXPRESSION**
field. Now the **COMPUTE VARIABLE** box should look like this:

Notice **the question mark** inside the
function you have chosen.

_____ 4. Since we want to take the **square root** of
whatever number is in the **SCORES** variable, but only after **adding
.5 to each score**, we will highlight the **question mark** and replace it
with **SCORES** by clicking on the variable name and then the **arrow** button.

_____ 5. Now, type a **plus** sign (or select the symbol
on the pictured keypad) followed by **.5** (or select **.5** on the pictured
keyboard). The box should look like this:

_____ 6. Now, click **OK**. You will be returned to the
data and you will see the new variable:

_____ 7. Now, run a **one-way ANOVA** on this data:

Here are the results:

Note that **Levene's** sig. is now **.917** - much
better than before, when it was **.101**.
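The whole TRANSFORM-then-ANOVA sequence can be replayed in a few lines of Python with scipy (an assumption - the lecture does it in SPSS); `rootplh` below is the same SQRT(scores + .5) variable computed in the steps above:

```python
import numpy as np
from scipy.stats import levene, f_oneway

group1 = np.array([3, 0, 4, 2, 2], dtype=float)
group2 = np.array([6, 4, 2, 4, 7], dtype=float)
group3 = np.array([12, 6, 6, 10, 6], dtype=float)

# The TRANSFORM -> COMPUTE step: rootplh = SQRT(scores + .5)
t1, t2, t3 = (np.sqrt(g + 0.5) for g in (group1, group2, group3))

# Levene on the transformed scores (center='mean' to match SPSS).
lev_stat, lev_p = levene(t1, t2, t3, center='mean')
print(f"Levene sig. on transformed scores = {lev_p:.3f}")

# One-way ANOVA on the transformed scores.
f_stat, f_p = f_oneway(t1, t2, t3)
print(f"one-way ANOVA: F = {f_stat:.2f}, sig. = {f_p:.4f}")
```

Levene's significance jumps to about .9 - comfortable homogeneity - and the ANOVA on the transformed scores is significant.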

**__________________________________________**

How do you write it up?

Explain what you did.

You can either report the **untransformed means** along with
the analysis of the **transformed scores**, or you can **de-transform the means**
to get **weighted means**. Do this by **squaring the transformed means
and subtracting .5**. You won't get the same means you got from raw scores,
however. These are weighted means, and you should refer to them as such.
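De-transforming is simple arithmetic. A quick Python sketch (an assumption - any calculator works), using Group 3 from above, shows why the de-transformed mean is not the raw mean:

```python
import numpy as np

group3 = np.array([12, 6, 6, 10, 6], dtype=float)

# Mean of the transformed scores, SQRT(X + .5).
transformed_mean = np.sqrt(group3 + 0.5).mean()

# De-transform: square the transformed mean, then subtract the .5.
detransformed_mean = transformed_mean ** 2 - 0.5

print(group3.mean())        # 8.0 - the raw mean
print(detransformed_mean)   # a little lower - a "weighted" mean
```

The de-transformed (weighted) mean comes out near 7.8 rather than 8.0, because the square root shrinks large scores more than small ones before the averaging happens.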

There are many other transformations. Mertler and Vannatta
(2001) (**Advanced and Multivariate Statistical Methods**) have a
useful chart showing which transformations are often best for distributions
with varying skew.

____________________________________________________

For **moderate positive skew**, as we have already seen,
first try a **square root transformation** as we did above. The SPSS
compute function is **SQRT(var)**, or, if any values are less than 10,
**SQRT(var + .5)**.

**Substantial positive skew** should cause us to try a
**logarithm** transformation. This is usually best if the means
and standard deviations tend to be **proportional** to one another.

**Severe positive skew** should cause us to try an **inverse**,
or **reciprocal**, transformation (**1/var**).

**Moderate negative skew** should cause us to try a transformation
in which we "**reflect**" the variable and then try one of the
transformations above. To "reflect" means find the largest score and
add 1 to it; the result is a constant that is larger than any score in the
distribution. Now, subtract each score from the constant. (This will give the
distribution positive skew, and so one of the above transformations
will then work.) The SPSS function for **reflect followed by square
root transformation** is **SQRT(K - var)**, where **K** is the constant.

For **substantial negative skew**, try **reflect
followed by logarithm**.

**Severe negative skew** should cause us to try **reflect
followed by inverse transformation**. Reflect as usual, then take
the inverse of the new data. The SPSS function is **1/(K - var)**.
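Reflection is worth seeing in action. A hedged Python sketch (my assumption - the lecture would do this with SPSS COMPUTE) on a small made-up negatively skewed sample; because reflection is just a flip around a constant, it reverses the sign of the skew exactly:

```python
import numpy as np
from scipy.stats import skew

# A small negatively skewed sample: most scores high, tail to the left.
x = np.array([2.0, 7.0, 8.0, 8.0, 9.0, 9.0, 9.0, 10.0])

k = x.max() + 1.0      # a constant larger than every score
reflected = k - x      # "reflect": subtract each score from the constant

# Reflection exactly flips the sign of the skew, so the
# positive-skew recipes now apply.
print(skew(x), skew(reflected))

# e.g. reflect followed by square root, i.e. SQRT(K - var):
root_of_reflected = np.sqrt(reflected)
```

After reflecting, the distribution has positive skew of the same magnitude, so the square root, log, or inverse recipes above can be applied as usual.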

**There are many other transformations that are possible.
Tabachnick and Fidell (1996) have a good section on transformations with lots
of examples.**

**At this point, we should probably review logarithms.**

__LOGARITHMS__

What are **logarithms**?

A **logarithm** is a number that represents an **exponent**.

You know what an exponent is - it tells us how many times to multiply a number by itself.

So **10^3 = 1000**

The abbreviation for a logarithm is **LOG**.

The **common logs** are logs to **base 10**.

What does that mean?

The **base 10 log of 1000** is the **number of times
10 must be multiplied by itself to equal 1000** - so the **log of 1000 is 3**.

Why?

Because **10 times 10 times 10**, or **10^3
= 1000**.

We can write it like this:

**log_10 1000 = 3** or, sometimes, just **log 1000 = 3**.

**Remember**, logs can be to any **base**.

The **base 5 log of 25** is the **number of times
5 must be multiplied by itself to equal 25** - so the **log of 25 is 2**.

**log_5 25 = 2**, or, in words:

**2 is the logarithm of 25 to base 5. **

**What is the log of 81 to base 3? 4**

**What is the log of 125 to base 5? 3**

**What is the log of 64 to base 8? 2**

**What is the log of 8 to base 2? 3**

When doing transformations, you can use any base, but **base
10** is probably the most convenient.
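The quiz answers above can be checked with a few lines of Python (an assumption - a calculator works just as well), using the change-of-base rule, log_b(x) = ln(x) / ln(b):

```python
import math

def log_base(x, base):
    """Log of x to an arbitrary base, via the change-of-base rule."""
    return math.log(x) / math.log(base)

print(round(log_base(1000, 10)))  # 3, because 10^3 = 1000
print(round(log_base(25, 5)))     # 2, because 5^2 = 25
print(round(log_base(81, 3)))     # 4
print(round(log_base(8, 2)))      # 3
```

The `round()` is there because floating-point division can return, say, 2.9999999999999996 instead of 3 exactly.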

___________________________________________________

*HANDOUT - transformations.first.lec.after.factorial.homework1.doc*

*************************************************************************

__NONPARAMETRIC TESTS WHEN THERE IS NO HOMOGENEITY OF
VARIANCE__

Often, you will have no alternative but to give up on a **parametric**
test such as ANOVA.

If there is no homogeneity and you can't achieve it through
**data transformation**, that is about your only choice.

You will have to find a ** nonparametric** test
that will work for your data.

Some people call these tests "**assumption-free**,"
although that is really a misnomer - they still make some assumptions.

Unfortunately, these tests are, in general, not as powerful as are parametric tests.

Many nonparametric tests (but not all) work by **RANKING
THE DATA**.

What we often do is give the **lowest score a rank of one**,
etc.

As you can appreciate, **a low score will have a low
rank**, and a high score will have a high rank.

Then, we do our analysis on the **RANKS**, not on the
**raw data**.
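The ranking step itself is simple. A minimal Python sketch with scipy's `rankdata` (an assumption - SPSS ranks for you behind the scenes), showing how ties are handled:

```python
from scipy.stats import rankdata

scores = [7, 2, 9, 2, 5]

# Lowest score gets rank 1; tied scores share the average of the
# ranks they would otherwise occupy (here, the two 2's split
# ranks 1 and 2, so each gets 1.5).
ranks = rankdata(scores)
print(ranks)  # ranks: 4, 1.5, 5, 1.5, 3
```

The nonparametric test then runs entirely on those ranks, which is also why exact score spacing (and hence interval-level measurement) stops mattering.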

You can probably see why **nonparametric tests**
have **less power** - information is thrown away when scores are reduced to ranks.

However, it gets us around the controversy about not being
able to use parametric tests with data that is __not__ **true
interval or ratio data**.

No one argues that **nonparametric tests** are inappropriate
for ordinal data.

______________________________________

The **Mann-Whitney U** is the nonparametric test we often
use if we want to run a t-test but cannot because we lack normality, homogeneity,
or because the data is not interval or ratio data.

This test is also sometimes called the **Wilcoxon W Test**,
and even sometimes the **Mann-Whitney-Wilcoxon**.

Let's use it on the **first two groups** we looked at
in your homework: (transformations.first.lec.after.factorial.sav)

__Exp. Control __

**1.00 1.00**

**4.00 2.00**

**7.00 3.00**

**10.00 4.00**

**13.00 5.00**

**16.00 6.00**

**19.00 7.00**

**22.00 8.00**

**25.00 9.00**

**28.00**

We will first run a **t-test** to investigate homogeneity
of variance and so you can compare the Mann-Whitney results to independent t-test
results.

Here is the **t-test**:

Note the **means**. It appears that the **experimental**
group outperformed the **control** group - the means are 14.5 to 5.00. But note
**LEVENE's** test. There is **no homogeneity**, so the t-test results are suspect.

Now the **Mann-Whitney U**. You will find it by clicking
**ANALYZE**, then **NONPARAMETRIC TESTS**, then **2 INDEPENDENT SAMPLES**.

When the next box opens, make it look like this:

Be sure that **Mann-Whitney U** is checked.

Click the **OPTIONS** button and choose **DESCRIPTIVE
STATISTICS**.

Click OK to run the analysis.

Here is the output:

Look at the **RANKS** box. Remember that **Mann-Whitney**
**ranks data from low to high**. So, the group with the **higher
scores** will have the **higher mean rank**.

The **Test Statistics** box has the result. **Mann-Whitney
U** and **Wilcoxon W** yield different statistics, but the same results.
Both appear here. The **asymptotic significance** assumes large samples and
corrects for ties in ranks. The **exact significance** does **not**
correct for ties, but is appropriate for small groups. They will usually be very close,
as they are here. If they aren't the same, use the **exact** significance,
since our groups are small.

So, the results here are the same as with the **t-test**
- the **experimental** group has **higher** scores. (Null relates to **median**.)
But, note that the **t-test sig**. was **.008** while the **Mann-Whitney**
sig. was larger - another reminder that nonparametric tests have less power.
**BUT**, if critics are right that scores like these are not true interval data, then the **Mann-Whitney** is the more defensible test anyway.

Homework: copy these scores and do both a t-test and a Mann-Whitney. Draw conclusions.

**experimental:
28, 35, 35, 24, 39, 32, 27, 29, 36, 35**

** control: 05, 24, 06, 14, 09, 07, 17,
06, 03, 10**

__END - Kruskal-Wallis
Next Time__