Use a list to carry out analyses with only some variables from a data set (in R)
Overview
Many functions in R take a whole data frame as input. However, many analyses need only a subset of the variables in a data frame. This post explains how to select only some variables in a data frame to carry out a desired analysis.
Background: Datasets include more variables than those needed for the analysis
Datasets in the social sciences typically have one row per ‘case’ – which, depending on your field, may be a person, family, organization, school, constituency, or country – and several columns that include information about the ‘case’.
One thing that data analysts almost always need to do when analysing their data is to indicate which variables (or columns) should be used in the respective analysis. Consider the simple example of a linear regression with one independent variable (predictor variable) and one dependent variable (criterion variable):
lm(dependent_variable ~ independent_variable, mydata)
Running this model is straightforward because the formula names the dependent variable and the independent variable explicitly, so only these two columns of the dataset (mydata) enter the analysis.
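As a minimal sketch, the following simulates a small data frame (the extra column 'age' and all values are made up for illustration) and shows that lm() uses only the variables named in the formula, even though the data frame contains more columns.

```r
# Hypothetical example data: three columns, but the model uses only two
set.seed(1)
mydata <- data.frame(
  independent_variable = rnorm(50),
  age = sample(18:65, 50, replace = TRUE)  # extra column, not in the formula
)
mydata$dependent_variable <- 2 * mydata$independent_variable + rnorm(50)

fit <- lm(dependent_variable ~ independent_variable, data = mydata)

# Only the intercept and the named predictor appear among the coefficients
names(coef(fit))
```

The column 'age' is part of mydata but plays no role in the model, because the formula does not mention it.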
So far, so good. Things get slightly more complicated with other types of analysis. For example, if you wish to compute Cronbach’s alpha (a measure of internal consistency, which is one form of reliability) for your scales or run an exploratory factor analysis, the relevant functions (e.g., the functions alpha and fa from the package psych) require a data matrix or a data frame as input (i.e., an object consisting of several rows and columns).
The code for estimating Cronbach’s alpha with the package psych (note that the whole data frame ‘mydata’ is submitted to the function):
library(psych)
alpha(mydata)
The code for conducting a factor analysis with the package psych (note that the whole data frame ‘mydata’ is submitted to the function):
library(psych)
fa(mydata)
The problem is that when you submit your data frame as it is, all variables are submitted to the analysis rather than just the relevant ones. To avoid nonsensical factor analyses or reliability analyses that include sociodemographic variables and all kinds of metadata (e.g., how long it took participants to complete a survey), it is crucial to restrict these analyses to the variables of interest.
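To illustrate how much an irrelevant column can distort a result, here is a rough sketch that computes Cronbach's alpha by hand with the textbook formula. The helper cronbach_alpha() is hypothetical and written only for this illustration; it is not the implementation used by psych::alpha(). The data are simulated, and the column 'duration_seconds' is an invented example of survey metadata.

```r
# Hand-written Cronbach's alpha (textbook formula), for illustration only
cronbach_alpha <- function(df) {
  k <- ncol(df)                        # number of columns treated as items
  item_variances <- sapply(df, var)    # variance of each column
  total_variance <- var(rowSums(df))   # variance of the sum score
  (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
}

# Simulated data: six items that all reflect one underlying trait
set.seed(42)
n <- 200
trait <- rnorm(n)
items_only <- data.frame(sapply(1:6, function(i) trait + rnorm(n)))
names(items_only) <- paste0("item_", 1:6)

# The same data plus an unrelated metadata column
with_metadata <- items_only
with_metadata$duration_seconds <- rnorm(n, mean = 600, sd = 100)

alpha_items_only  <- cronbach_alpha(items_only)
alpha_all_columns <- cronbach_alpha(with_metadata)

c(items_only = alpha_items_only, all_columns = alpha_all_columns)
```

In this simulation, alpha for the six items alone should come out clearly higher than alpha for all seven columns, because the unrelated column contributes a lot of variance that it does not share with the items.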
How to select variables within a data frame
In my opinion, the clearest and most readable way to write code that analyses only a subset of variables consists of three steps:
- Create a list with all the variables that should be included in the analysis
- Save this list as an R object
- Tell R to do the analysis only with the variables included in the list
In the following example, the combine function c() in line 1 creates a character vector containing six variable names (referred to here simply as a list). This list is then saved as an object that is called ‘items’ (you could also call it ‘variables’ or anything else that makes sense).
But the real magic happens in the last line. When brackets follow a data frame, the information between the opening bracket and the comma indicates the cases (or, more generally, rows) to which an operation should be restricted. The information between the comma and the closing bracket indicates the variables (or, more generally, columns) to which it should be restricted. Since there is nothing before the comma, all cases are used. However, the presence of ‘items’ after the comma indicates that only the variables included in this list are used in the analysis.
items <- c("item_1","item_2","item_3","item_4","item_5","item_6")
library(psych)
alpha(mydata[,items])
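The bracket notation can be tried out without any package. The following sketch uses a tiny made-up data frame (the columns 'id' and 'age' and all values are hypothetical) to show the effect of what is written before and after the comma, plus a quick check for misspelled variable names.

```r
# Hypothetical survey data: two unrelated columns and six items
mydata <- data.frame(
  id     = 1:4,
  age    = c(23, 35, 41, 29),
  item_1 = c(1, 2, 3, 4),
  item_2 = c(2, 2, 3, 5),
  item_3 = c(1, 3, 3, 4),
  item_4 = c(2, 2, 4, 4),
  item_5 = c(1, 2, 4, 5),
  item_6 = c(2, 3, 3, 5)
)

# paste0() builds the same six names as typing them out inside c()
items <- paste0("item_", 1:6)

# Check for typos: names in 'items' that are not columns of mydata
missing_items <- setdiff(items, names(mydata))

# Nothing before the comma: all rows, only the item columns
all_rows <- mydata[, items]

# Something before the comma: only rows 1 and 2, only the item columns
first_two_rows <- mydata[1:2, items]

names(all_rows)
dim(first_two_rows)
```

One detail to keep in mind: if 'items' contains only a single name, mydata[, items] returns a plain vector rather than a data frame. Writing mydata[, items, drop = FALSE] keeps the data frame structure in that case.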
If need be, the call to alpha can easily be extended with additional arguments. For example, with check.keys=TRUE the function checks whether some items appear to be reverse-coded (i.e., they load negatively on the first principal component), reverses such items for the computation, and returns a warning if this happens:
alpha(mydata[,items], check.keys=TRUE)
Re-using the item list in other functions
This three-step approach is especially useful if several analyses or operations are carried out. For example, it could make sense to also run an exploratory factor analysis to check the relationships between the items:
fa(mydata[,items],fm = "ml", nfactors = 2)
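The same list also works with base R functions. As a sketch with simulated data (the column 'age' and all values are made up), the correlation matrix and the means of the selected items can be inspected like this:

```r
# Simulated data: one unrelated column plus six items sharing a common trait
set.seed(7)
n <- 100
trait <- rnorm(n)
mydata <- data.frame(age = sample(18:65, n, replace = TRUE))
for (i in 1:6) {
  mydata[[paste0("item_", i)]] <- trait + rnorm(n)
}

items <- c("item_1", "item_2", "item_3", "item_4", "item_5", "item_6")

# Correlations among the six items only ('age' is left out)
item_correlations <- cor(mydata[, items])
round(item_correlations, 2)

# Mean of each item across all cases
item_means <- colMeans(mydata[, items])
```

Because the variable names are stored once in 'items', adding or removing an item only requires changing that one line.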
The list can also be used to compute the mean of these items for each case and store it as a scale score:
mydata$seriousness_scale <- rowMeans(mydata[,items], na.rm = TRUE)
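To see what na.rm = TRUE does here, consider a small made-up data frame with three items and some missing answers (the values are hypothetical and chosen only to make the arithmetic easy to follow):

```r
# Hypothetical data: case 2 skipped one item, case 3 skipped all items
mydata <- data.frame(
  item_1 = c(4, 2, NA),
  item_2 = c(5, 2, NA),
  item_3 = c(3, NA, NA)
)
items <- c("item_1", "item_2", "item_3")

# With na.rm = TRUE, the mean is taken over the answered items only
mydata$seriousness_scale <- rowMeans(mydata[, items], na.rm = TRUE)

mydata$seriousness_scale
# 4 2 NaN
```

Case 2 gets a score based on two answers instead of three, and case 3 gets NaN because no answers are available. Whether a score based on only some of the items is acceptable is a decision for the analyst; without na.rm = TRUE, cases 2 and 3 would both be NA.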