
par.2aov function

 ·  🕘 4 min read  ·  🤖 Matteo Miotto

Today we’ll look at one of the first functions I developed, before the desc_table/kable functions I mentioned here. The function is called par.2aov.

Aim

First, let’s see what the function is for: in a data analysis, when you have to compare groups defined by two categorical variables, you have to choose between a parametric test (two-way ANOVA) and a non-parametric one (Kruskal-Wallis). To make that choice, two properties of the sample must be checked:
  • Every group into which the categorical variables divide the sample must be normally distributed
  • The variances of the different groups must be homogeneous

To test the first condition, a normality test (e.g. Shapiro) must be applied, while the second is verified with a homoskedasticity test (e.g. Bartlett). Both tests are implemented in R, so why did I have to write a function? There are two reasons. The first is to get a clearer view of the results and to have them in a single list (as you will see in the Output section). The second, and the most important, is that this function can also evaluate the possible effect of the interaction between the two categorical variables, by dividing the sample into n*m subgroups and testing distribution and homoskedasticity on them.
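
As a rough illustration of the two base R tests on a single categorical variable, here is a minimal sketch. It is not the code of par.2aov: the ToothGrowth dataset (shipped with base R) and all variable names are assumptions chosen only to keep the example self-contained.

```r
# ToothGrowth ships with base R (datasets package): len is the numeric
# outcome, supp is a two-level factor identifying the groups.
tg <- ToothGrowth

# Normality: one Shapiro test per level of the factor, keeping the p-value.
shapiro_p <- tapply(tg$len, tg$supp, function(v) shapiro.test(v)$p.value)

# Homoskedasticity: Bartlett test across the levels of the same factor.
bartlett_supp <- bartlett.test(tg$len, tg$supp)

print(shapiro_p)
print(bartlett_supp$p.value)
```

Each call handles one grouping variable at a time, so checking two variables plus their interaction means repeating these steps and collecting the results by hand.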

Function command and inputs

The function is launched with the command mrt.par.2aov(x, y, z, type.of.int = "+/*"). As you can see, there are 3 data inputs: x is the numeric vector of the observed values, while y and z are the factor vectors of the categorical variables that define the groups. type.of.int determines whether the interaction between the two categorical variables is evaluated as well: "+" (default) evaluates only the two categorical variables separately, while "*" also evaluates their interaction.
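
To make the role of each input concrete, here is a small sketch of how arguments like these could be validated. The helper name check_par2aov_inputs and its checks are hypothetical, written for illustration only; the real function may handle its inputs differently.

```r
# Hypothetical helper (not part of the original function): validates the
# inputs described above and reports whether the interaction is requested.
check_par2aov_inputs <- function(x, y, z, type.of.int = c("+", "*")) {
  # match.arg picks "+" when nothing is passed and rejects any other value.
  type.of.int <- match.arg(type.of.int)
  stopifnot(
    is.numeric(x),
    length(x) == length(y),
    length(x) == length(z)
  )
  list(
    x = x,
    y = factor(y),
    z = factor(z),
    interaction = identical(type.of.int, "*")
  )
}

args_default <- check_par2aov_inputs(c(1.2, 3.4, 2.2), c("a", "b", "a"), c("u", "u", "v"))
args_inter   <- check_par2aov_inputs(c(1.2, 3.4, 2.2), c("a", "b", "a"), c("u", "u", "v"), "*")
```

Listing the allowed values in the signature keeps the default ("+") and the valid alternatives visible in one place.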

Main steps

There is not much to say about the steps of the function, as they are few and straightforward. The tests for normality (Shapiro) and homoskedasticity (Bartlett) are applied, and two separate data frames are created with the results. If type.of.int is "*", an additional data frame is created with the results of the Shapiro test applied to each combination of the two variables, because the basic Shapiro test applied with tapply to a data frame takes a single categorical variable as input.
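
The interaction step can be approximated as follows. This is a sketch under assumptions, not the author's code: it uses ToothGrowth from base R, combines the two factors with interaction(), and the column names of the resulting data frame are made up for the example.

```r
tg <- ToothGrowth
tg$dose <- factor(tg$dose)

# One combined group per supp x dose combination (2 * 3 = 6 groups).
groups <- interaction(tg$supp, tg$dose, sep = ":")

# Run the Shapiro test once per combined group.
per_group <- lapply(split(tg$len, groups), shapiro.test)

# Collect the W statistic and the p-value of each group in a data frame.
shapiro_int <- data.frame(
  group = names(per_group),
  W     = sapply(per_group, function(res) unname(res$statistic)),
  pval  = sapply(per_group, function(res) res$p.value),
  row.names = NULL
)

# Bartlett test across the same combined groups.
bartlett_int <- bartlett.test(tg$len, groups)

print(shapiro_int)
print(bartlett_int$p.value)
```

With n levels in one variable and m in the other, the combined factor has n*m levels, which is why the Bartlett test on the interaction has n*m - 1 degrees of freedom.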

Output

This is the most “interesting” section, since the output is the reason the function was created. Let’s go straight to an example to see why it is useful:

Example 1: Determine which test should be used to check whether the hwy values of the cars in the mpg dataset differ by year, by type of drive train (drv), and by their interaction.
    
    library(ggplot2)  # provides the mpg dataset
    par.2aov(mpg$hwy, mpg$drv, mpg$year, type.of.int = "*")
    
    
    $`Shapiro categorie`
      categoric fattore         W         pval
    1   mpg$drv       4 0.9140492 5.192071e-06
    2   mpg$drv       f 0.9062250 1.555910e-06
    3   mpg$drv       r 0.9116438 3.317112e-02
    4  mpg$year    1999 0.9095696 8.223978e-07
    5  mpg$year    2008 0.9704048 1.090310e-02
    
    $`Shapiro interazione`
      mpg$drv mpg$year         W         pval
    1       4        1 0.8027046 1.230175e-06
    2       f        1 0.8036883 2.868337e-07
    3       r        1 0.8487732 4.113255e-02
    4       4        2 0.9308291 3.937911e-03
    5       f        2 0.9593676 8.922286e-02
    6       r        2 0.9447511 4.825434e-01
    
    $Bartlett
        categoric k-squared df      pval
    1     mpg$drv 0.7067503  2 0.7023137
    2    mpg$year 0.1818015  1 0.6698296
    3 interaction 3.7215846  5 0.5901552
    
    

The result is a list of three data frames: the first reports the Shapiro test for each level of each categorical variable, the second (present only when type.of.int is "*") reports the Shapiro test for each combination of the two categorical variables, and the third reports the Bartlett test for each variable and for their interaction. In this case the Bartlett p-values are all well above the usual 0.05 threshold, so the variances can be considered homogeneous, but almost all the Shapiro p-values are below it, so normality is rejected and I will opt for a Kruskal-Wallis test.
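
The reasoning behind that choice can be written down as a short sketch. The p-values are copied from the output above; the 0.05 threshold and the variable names are assumptions made for this example, and the function itself only returns the tables, not a decision.

```r
# p-values copied from the example output above.
shapiro_cat_p <- c(5.192071e-06, 1.555910e-06, 3.317112e-02,
                   8.223978e-07, 1.090310e-02)
shapiro_int_p <- c(1.230175e-06, 2.868337e-07, 4.113255e-02,
                   3.937911e-03, 8.922286e-02, 4.825434e-01)
bartlett_p    <- c(0.7023137, 0.6698296, 0.5901552)

alpha <- 0.05  # assumed significance threshold

# Normality holds only if no group rejects it.
normal_ok <- all(c(shapiro_cat_p, shapiro_int_p) > alpha)

# Homoskedasticity holds only if no Bartlett test rejects it.
homosked_ok <- all(bartlett_p > alpha)

# The parametric test needs both conditions.
choice <- if (normal_ok && homosked_ok) "two-way ANOVA" else "Kruskal-Wallis"
print(choice)
```

Here the variances pass the check, but nine of the eleven Shapiro p-values fall below the threshold, so the parametric route is ruled out.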

Possible improvements

Although the function was written some time ago, at the moment I don’t see any part of it that could be improved (though I have to admit that, while I was writing this post, I had to modify it because there was an error).


Written by Matteo Miotto, Genomic Data Science master student