Today we’ll look at one of the first functions I developed, prior to desc_table/kable I mentioned here. The function is called par.2aov.
Aim
First, let’s see what the function is for: in a data analysis, when you have to compare groups defined by 2 categorical variables, you have to choose between a parametric test (2-way ANOVA) and a non-parametric test (Kruskal-Wallis). To do that, two characteristics of the sample must be checked: whether the observed values are normally distributed, and whether the groups have equal variances (homoskedasticity).
To test the first condition, a normality test (e.g. Shapiro-Wilk) must be applied, while the second is verified with a homoskedasticity test (e.g. Bartlett). Both tests are implemented in R, so why did I have to write a function?
There are two reasons. The first is to get a clearer view of the results and to have them in a single list (as you will see in the Output section). The second, and most important, is that this function can also evaluate the possible effect of the interaction between the two categorical variables, by dividing the sample into n*m subgroups and testing normality and homoskedasticity on them.
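For reference, this is roughly what calling the two base R tests directly looks like. It is only a sketch: the built-in ToothGrowth dataset is an assumption used to keep the example self-contained, and it is not the data analyzed in this post.

```r
# ToothGrowth ships with base R (datasets package); it is a stand-in dataset.
tg <- ToothGrowth
tg$dose <- factor(tg$dose)  # dose is numeric in the raw data

# Normality: Shapiro-Wilk works on one numeric vector, so one group at a time
sh <- shapiro.test(tg$len[tg$supp == "VC"])
sh$statistic  # W
sh$p.value

# Homoskedasticity: Bartlett's test across the levels of a single factor
bt <- bartlett.test(len ~ supp, data = tg)
bt$statistic  # Bartlett's K-squared
bt$parameter  # degrees of freedom (number of levels - 1)
bt$p.value
```

Each call returns its own htest object, so the results have to be collected by hand when several groups or factors are involved.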
Function command and inputs
The function is launched with the command mrt.par.2aov(x, y, z, type.of.int = "+/*").
As you can see, there are 3 data inputs: x is the numeric vector of the observed values, and y and z are the factors holding the categorical variables that define the groups. type.of.int determines whether the interaction between the two categorical variables is evaluated as well: + (default) evaluates only the two categorical variables separately, while * also evaluates their interaction.
Main steps
There is not much to say about the steps of the function, as they are few, simple and straightforward. The tests for normality (Shapiro) and homoskedasticity (Bartlett) are applied and two distinct data frames are created with the results.
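As an illustration of the homoskedasticity step, here is a minimal sketch of how Bartlett results for each factor and for their interaction could be gathered into one data frame. It is not the function's actual code: bartlett_row is a hypothetical helper, and ToothGrowth is an assumed stand-in dataset.

```r
tg <- ToothGrowth            # built-in dataset, used only as a stand-in
tg$dose <- factor(tg$dose)

# Hypothetical helper: one row of Bartlett results for a grouping factor
bartlett_row <- function(x, g, label) {
  bt <- bartlett.test(x, g)
  data.frame(categoric = label,
             k.squared = unname(bt$statistic),
             df        = unname(bt$parameter),
             pval      = bt$p.value)
}

# One row per factor, plus one for the n*m subgroups built by interaction()
bartlett_tab <- rbind(
  bartlett_row(tg$len, tg$supp, "supp"),
  bartlett_row(tg$len, tg$dose, "dose"),
  bartlett_row(tg$len, interaction(tg$supp, tg$dose), "interaction")
)
bartlett_tab
```

interaction() builds a single factor with one level per combination of the two variables, which is what lets a one-factor test cover the subgroups.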
If type.of.int is *, an additional data frame is created with the results of the Shapiro test applied to each combination of the two variables, since the basic Shapiro test applied with tapply takes a single categorical variable as its grouping input.
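The per-group normality step can be sketched as follows. This is an assumption about the general approach, not the function's real code: shapiro_by_group is a hypothetical helper, split() and lapply() are used here in place of tapply(), and ToothGrowth stands in for the real data.

```r
tg <- ToothGrowth            # built-in dataset, used only as a stand-in
tg$dose <- factor(tg$dose)

# Hypothetical helper: Shapiro-Wilk on every group of g, as a data frame
shapiro_by_group <- function(x, g) {
  groups <- split(x, g, drop = TRUE)    # named list, one vector per group
  res <- lapply(groups, shapiro.test)   # one htest object per group
  data.frame(
    group = names(res),
    W     = unname(vapply(res, function(r) unname(r$statistic), numeric(1))),
    pval  = unname(vapply(res, function(r) r$p.value, numeric(1)))
  )
}

# One categorical variable at a time
by_supp <- shapiro_by_group(tg$len, tg$supp)

# Both variables at once: interaction() collapses them into a single factor
by_cell <- shapiro_by_group(tg$len, interaction(tg$supp, tg$dose))
by_cell
```

Note that shapiro.test() needs at least 3 observations per group, so very small subgroups would make this sketch fail.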
Output
This is the most “interesting” section, since the output is the reason the function was created. Let’s look at an example right away to see why it is useful:
$`Shapiro categorie`
  categoric fattore         W         pval
1   mpg$drv       4 0.9140492 5.192071e-06
2   mpg$drv       f 0.9062250 1.555910e-06
3   mpg$drv       r 0.9116438 3.317112e-02
4  mpg$year    1999 0.9095696 8.223978e-07
5  mpg$year    2008 0.9704048 1.090310e-02

$`Shapiro interazione`
  mpg$drv mpg$year         W         pval
1       4        1 0.8027046 1.230175e-06
2       f        1 0.8036883 2.868337e-07
3       r        1 0.8487732 4.113255e-02
4       4        2 0.9308291 3.937911e-03
5       f        2 0.9593676 8.922286e-02
6       r        2 0.9447511 4.825434e-01

$Bartlett
    categoric k-squared df      pval
1     mpg$drv 0.7067503  2 0.7023137
2    mpg$year 0.1818015  1 0.6698296
3 interaction 3.7215846  5 0.5901552
As a result, a list of three data frames is returned: the first reports the Shapiro test for each level of each categorical variable; the second (present only when type.of.int is *) reports the Shapiro test for each combination of the two categorical variables; and the third reports the Bartlett test for each variable and for their interaction. In this case, since almost all of the Shapiro p-values fall below the usual 0.05 threshold (so normality is rejected), I will opt to perform a Kruskal-Wallis test.
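To make the final choice concrete, here is a short sketch of the two routes in base R. It rests on assumptions: ToothGrowth is a stand-in dataset, and since kruskal.test() accepts a single grouping factor, the non-parametric route is shown one factor at a time.

```r
tg <- ToothGrowth            # built-in dataset, used only as a stand-in
tg$dose <- factor(tg$dose)

# Parametric route: 2-way ANOVA, with the interaction term included via *
fit <- aov(len ~ supp * dose, data = tg)
summary(fit)

# Non-parametric route: Kruskal-Wallis, applied to each factor separately
kw_supp <- kruskal.test(len ~ supp, data = tg)
kw_dose <- kruskal.test(len ~ dose, data = tg)
kw_supp$p.value
kw_dose$p.value
```

Replacing * with + in the aov() formula drops the interaction term, which mirrors the two values accepted by type.of.int.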
Possible improvements
Although the function was written some time ago, at the moment I don’t see any part of it that could be improved (I have to admit that, while writing this post, I had to modify it because it contained an error).