Day 1

Basic data analysis | Lesson 1

Learning Objectives

  1. Import datasets into R.
  2. Compute descriptive statistics to explore the dataset.
  3. Create basic data visualizations with histograms, boxplots, and scatterplots.
  4. Calculate the correlation between two numerical variables.
  5. Fit a simple linear regression and plot the fitted line on the scatterplot.
  6. Conduct a t-test for basic hypothesis testing.
  7. Recognize the differences between base R plotting and using the ggplot2 package.

The dataset

The scientific experiment | Imagine that you are interested in determining the effects of a high-fat diet on gene expression. For this study, the scientists obtained data from 60 mice, half of which were fed a lean diet and the other half a high-fat diet. All other living conditions were the same. Four weeks later, a liver biopsy from each mouse was sequenced by RNA-seq, all mice were weighed, and their sex and age were recorded. The results of these measurements are saved in the file diet_mice_metadata.txt, and the gene counts are in the file diet_mice_counts.xlsx.

Think about the experimental design

  • What is the research question? What is the hypothesis?
  • How many variables are in the study?
  • Which variable(s) are dependent? (Dependent or Response variables are the variables that we are interested in predicting or explaining.)
  • Which variable(s) are independent? (Independent or Explanatory variables are used to explain or predict the dependent variable.)
  • Which variable(s) are covariates? (Covariates are variables that are potentially related to the outcome of interest in a study, but are not the main variable under study - used to control for potential confounding factors in a study.)
  • Are the “controls” appropriate? Why?

Hands-on exercises

We will start by looking at the metadata file containing the variables related to each sample (i.e. each mouse): type of diet, final weight, gender, and age in months.

A. Create a new project in RStudio

Start by creating a new project in RStudio. Go to File > New project, and follow the instructions.
Once you are in the project folder, create a new R script file. Go to File > New File > R Script. A blank text file will appear above the console. Save it in your project folder with the name diet_analysis.R.

B. Load the data and inspect it

  1. Download the file diet_mice_metadata.txt (mice weights according to diet) from GitHub https://github.com/patterninstitute/rmind-workshop/blob/main/data/diet_mice_metadata.txt.
  2. Save the file inside a folder named data, located in your current working directory (the folder where the RStudio project was created).
  3. Type the instructions inside grey boxes in pane number 2 of RStudio — the R Console. As you already know, the words after a # sign are comments not interpreted by R, so you do not need to copy them.
    • In the R console, you must hit enter after each command to obtain the result.
    • In the script file (R file), you must run the command by pressing the Run button (on the top panel), or by selecting the code you want to run and pressing Ctrl + Enter.
  4. Save all your relevant/final commands (R instructions) to your script file to be available for later use.
# Load required packages
library(tidyverse)     # to ease data wrangling and visualization
library(here)          # to help with file paths 
library(RColorBrewer)  # color palettes
library(patchwork)     # combine plots in panels for figures

# Load the file and save it to object mice_data
mice_data <- read.table(file = here("data", "diet_mice_metadata.txt"),
                        header = TRUE,
                        sep = "\t", dec = ".",
                        stringsAsFactors = TRUE)
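
Once the file is loaded, a quick inspection confirms that it was read as expected. The sketch below is only an illustration: so that it runs without the download, it first writes a tiny made-up tab-delimited file to a temporary location. The column names (diet, weight, gender, age_months) and all values are assumptions, not the contents of the real file.

```r
# Toy stand-in for diet_mice_metadata.txt: made-up values and assumed
# column names, NOT the workshop dataset.
toy_file <- tempfile(fileext = ".txt")
writeLines(c("diet\tweight\tgender\tage_months",
             "lean\t20.5\tF\t3",
             "lean\t22.1\tM\t4",
             "high_fat\t27.8\tF\t3",
             "high_fat\t30.2\tM\t5"),
           toy_file)

# Same arguments as the read.table() call above
mice_toy <- read.table(file = toy_file, header = TRUE,
                       sep = "\t", dec = ".", stringsAsFactors = TRUE)

dim(mice_toy)      # number of rows and columns
head(mice_toy)     # first rows of the table
str(mice_toy)      # type of each column (factor, numeric, integer)
summary(mice_toy)  # quick per-column summary
```

With the real data, the same four functions can be applied directly to mice_data.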

C. Answer the following questions using R

1. Briefly explore the dataset.

We should use descriptive statistics that summarize the sample data. We will use measures of central tendency (mean, median, and mode) and measures of dispersion, or variability (standard deviation, variance, maximum, and minimum).
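
As a minimal sketch of these measures in base R, the block below uses a small made-up data frame (mice_toy) in place of mice_data. The column names and values are assumptions for illustration only. Note that R's built-in mode() returns the storage type of an object, not the statistical mode, so a small hypothetical helper (stat_mode) is defined here.

```r
# Toy stand-in for mice_data: made-up values and assumed column names,
# NOT the workshop dataset.
mice_toy <- data.frame(
  diet       = factor(rep(c("lean", "high_fat"), each = 4)),
  gender     = factor(rep(c("F", "M"), times = 4)),
  age_months = c(3, 4, 4, 5, 3, 4, 5, 6),
  weight     = c(20, 22, 23, 25, 26, 29, 30, 33)
)

w <- mice_toy$weight

# Central tendency
mean(w)
median(w)

# Hypothetical helper: most frequent value(s) of a numeric vector
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(mice_toy$age_months)

# Dispersion
sd(w)
var(w)
min(w)
max(w)

# Several of these at once, for every column
summary(mice_toy)
```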

2. How is the variable “mouse weight” distributed?

After summarizing the data, we should find appropriate plots to look at it. A first approach is to look at the frequency of the mouse weight values using a histogram.
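
A histogram can be drawn with the base R function hist(). The sketch below uses a made-up weight column in a toy data frame; the column name weight is an assumption about the real file, and the number of breaks is only a suggestion that R may adjust.

```r
# Toy stand-in for mice_data: made-up values and assumed column names,
# NOT the workshop dataset.
mice_toy <- data.frame(
  diet   = factor(rep(c("lean", "high_fat"), each = 4)),
  weight = c(20, 22, 23, 25, 26, 29, 30, 33)
)

# hist() draws the plot and invisibly returns the bin counts
h <- hist(mice_toy$weight,
          breaks = 5,
          main = "Distribution of mouse weight (toy data)",
          xlab = "Weight",
          col = "grey80")

h$breaks  # bin limits chosen by R
h$counts  # number of mice in each bin
```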

3. How is the variable “mouse weight” distributed in each diet?

Since our data of interest consist of one categorical variable (type of diet) and one continuous variable (weight), a boxplot is one of the most informative plots.
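
The same boxplot can be drawn in base R and in ggplot2, which is a convenient place to compare the two approaches. This is a sketch on made-up data: the column names diet and weight are assumptions, and the colors and theme are arbitrary choices.

```r
library(ggplot2)

# Toy stand-in for mice_data: made-up values and assumed column names,
# NOT the workshop dataset.
mice_toy <- data.frame(
  diet   = factor(rep(c("lean", "high_fat"), each = 4)),
  weight = c(20, 22, 23, 25, 26, 29, 30, 33)
)

# Base R: formula interface, one call draws the whole plot
boxplot(weight ~ diet, data = mice_toy,
        xlab = "Diet", ylab = "Weight",
        col = c("tomato", "skyblue"))

# ggplot2: the plot is an object built from layers, then printed
p <- ggplot(mice_toy, aes(x = diet, y = weight, fill = diet)) +
  geom_boxplot() +
  labs(x = "Diet", y = "Weight") +
  theme_bw()
print(p)
```

In base R each function call draws directly on the graphics device, whereas in ggplot2 the plot is stored in an object (p) that can be modified further before printing.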

4. How are the other variables distributed?

There are other variables in our data for each mouse that could influence the results, namely gender (categorical variable) and age (discrete variable). We should also look at these data.
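
For categorical or discrete variables, a bar plot of the counts is one simple option. The sketch below assumes hypothetical column names gender and age_months and uses made-up values.

```r
# Toy stand-in for mice_data: made-up values and assumed column names,
# NOT the workshop dataset.
mice_toy <- data.frame(
  gender     = factor(rep(c("F", "M"), times = 4)),
  age_months = c(3, 4, 4, 5, 3, 4, 5, 6)
)

gender_counts <- table(mice_toy$gender)      # mice per gender
age_counts    <- table(mice_toy$age_months)  # mice per age

# Two panels side by side; restore the previous layout afterwards
op <- par(mfrow = c(1, 2))
barplot(gender_counts, main = "Gender", ylab = "Number of mice")
barplot(age_counts, main = "Age (months)", ylab = "Number of mice")
par(op)
```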

5. What is the frequency of each variable?

  • 5.1 How many measurements do we have for each gender?
  • 5.2 How many measurements do we have for each diet?
  • 5.3 How many measurements do we have for each gender in each diet?
  • 5.4 What if we want to know the results for each of the three variables: age, diet, and gender?
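
One way to answer these frequency questions in base R is with contingency tables. The sketch below is illustrative: the column names (gender, diet, age_months) are assumptions, and the data are made up.

```r
# Toy stand-in for mice_data: made-up values and assumed column names,
# NOT the workshop dataset.
mice_toy <- data.frame(
  diet       = factor(rep(c("lean", "high_fat"), each = 4)),
  gender     = factor(rep(c("F", "M"), times = 4)),
  age_months = c(3, 4, 4, 5, 3, 4, 5, 6)
)

# 5.1 and 5.2: one-way tables
table(mice_toy$gender)
table(mice_toy$diet)

# 5.3: two-way table, with row and column totals added
gender_by_diet <- xtabs(~ gender + diet, data = mice_toy)
addmargins(gender_by_diet)

# 5.4: three-way table, flattened for easier reading
xt3 <- xtabs(~ age_months + diet + gender, data = mice_toy)
ftable(xt3)
```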

6. Is there a dependency between the age and the weight of the mice in our study?
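
A possible approach is to compute the correlation between the two variables, fit a simple linear regression, and draw the fitted line on the scatterplot. The sketch below uses made-up data and assumed column names (age_months, weight). cor() computes the Pearson correlation by default; whether Pearson is appropriate for the real data is a judgment the analyst has to make.

```r
# Toy stand-in for mice_data: made-up values and assumed column names,
# NOT the workshop dataset.
mice_toy <- data.frame(
  age_months = c(3, 4, 4, 5, 3, 4, 5, 6),
  weight     = c(20, 22, 23, 25, 26, 29, 30, 33)
)

# Correlation coefficient and the associated test
r  <- cor(mice_toy$age_months, mice_toy$weight)
ct <- cor.test(mice_toy$age_months, mice_toy$weight)
r
ct$p.value

# Simple linear regression: weight explained by age
fit <- lm(weight ~ age_months, data = mice_toy)
coef(fit)  # intercept and slope

# Scatterplot with the fitted regression line
plot(weight ~ age_months, data = mice_toy,
     xlab = "Age (months)", ylab = "Weight", pch = 19)
abline(fit, col = "red", lwd = 2)
```

Keep in mind that a correlation describes an association between the two variables; on its own it does not show that one causes the other.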

7. Is the correlation between the age and the weight of the mice different for males and females?

8. Is the correlation between the age and the weight of the mice different for different diets?
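
For questions 7 and 8, the correlation can be computed separately within each group. The sketch below defines a hypothetical helper, cor_by_group(), that splits the data by a grouping column and applies cor() to each subset. Column names and values are assumptions for illustration.

```r
# Toy stand-in for mice_data: made-up values and assumed column names,
# NOT the workshop dataset.
mice_toy <- data.frame(
  diet       = factor(rep(c("lean", "high_fat"), each = 4)),
  gender     = factor(rep(c("F", "M"), times = 4)),
  age_months = c(3, 4, 4, 5, 3, 4, 5, 6),
  weight     = c(20, 22, 23, 25, 26, 29, 30, 33)
)

# Hypothetical helper: Pearson correlation of age and weight per group
cor_by_group <- function(data, group) {
  sapply(split(data, data[[group]]),
         function(d) cor(d$age_months, d$weight))
}

by_gender <- cor_by_group(mice_toy, "gender")
by_diet   <- cor_by_group(mice_toy, "diet")

by_gender
by_diet
```

With few mice per group, differences between the coefficients should be interpreted with caution.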

9. Does the type of diet influence the body weight of mice?

Can we answer this question just by looking at the plot? Are these observations compatible with a scenario where the type of diet does not influence body weight?
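
To go beyond visual inspection, a two-sample t-test can be used, with the null hypothesis (H0) that the mean weight is the same in both diets. The sketch below uses made-up data and assumed column names. By default t.test() performs Welch's test, which does not assume equal variances; which variant suits the real data is left as an assumption here.

```r
# Toy stand-in for mice_data: made-up values and assumed column names,
# NOT the workshop dataset.
mice_toy <- data.frame(
  diet   = factor(rep(c("lean", "high_fat"), each = 4)),
  weight = c(20, 22, 23, 25, 26, 29, 30, 33)
)

# Welch two-sample t-test (the default, var.equal = FALSE)
tt <- t.test(weight ~ diet, data = mice_toy)
tt  # printed summary

# Individual elements of the result
tt$statistic  # t value
tt$parameter  # degrees of freedom
tt$p.value    # p-value
tt$conf.int   # confidence interval for the difference in means
tt$estimate   # mean weight in each diet group
```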


10. Now that we have run the t-test, should we reject the null hypothesis or fail to reject it? What are the outputs in R from the t-test?

Final discussion

Take some time to discuss the results with the other participants, and decide if H0 should be rejected or not, and how confident you are that your decision is reasonable. Can you propose solutions to improve your confidence in the results? Is the experimental design appropriate for the research question being asked? Is this experiment well controlled and balanced?