Data Visualization

Learning Objectives

Following this assignment students should be able to:

understand the basic plot function of ggplot2

import ‘messy’ data with missing values and extra lines

execute and visualize a regression analysis

Reading

Topics
- ggplot
Readings
- R for Data Science - Data visualisation
Additional information
- ggplot2 documentation

Lecture Notes

ggplot

Exercises

Acacia and Ants (15 pts)

An experiment in Kenya has been exploring the influence of large herbivores on plants.

Download the data on Acacia for the experiment into a data subdirectory and read it into R using the following command:
```
acacia <- read.csv("data/ACACIA_DREPANOLOBIUM_SURVEY.txt", sep="\t", na.strings = c("dead"))
```
1. Make a scatter plot with CIRC on the x axis and AXIS1 (the maximum canopy width) on the y axis. Label the x axis “Circumference” and the y axis “Canopy Diameter”.
2. The same plot as (1), but with points colored based on the ANT column (the species of ant symbiont living with the acacia).
3. The same plot as (2), but instead of different colors show different species of ant (values of ANT) each in a separate subplot.
4. The same plot as (3) but including a simple model by adding geom_smooth.
Mass vs Metabolism (15 pts)

The relationship between the body size of an organism and its metabolic rate is one of the best-studied and yet most controversial areas of organismal physiology. We want to graph this relationship in the Artiodactyla using a subset of data from a large compilation of body size data (Savage et al. 2004). You can copy and paste this data frame into your program:
```
size_mr_data <- data.frame(
  body_mass = c(32000, 37800, 347000, 4200, 196500, 100000,
    4290, 32000, 65000, 69125, 9600, 133300, 150000, 407000,
    115000, 67000, 325000, 21500, 58588, 65320, 85000, 135000,
    20500, 1613, 1618),
  metabolic_rate = c(49.984, 51.981, 306.770, 10.075, 230.073, 
    148.949, 11.966, 46.414, 123.287, 106.663, 20.619, 180.150, 
    200.830, 224.779, 148.940, 112.430, 286.847, 46.347,
    142.863, 106.670, 119.660, 104.150, 33.165, 4.900, 4.865),
  family = c("Antilocapridae", "Antilocapridae", "Bovidae",
    "Bovidae", "Bovidae", "Bovidae", "Bovidae", "Bovidae",
    "Bovidae", "Bovidae", "Bovidae", "Bovidae", "Bovidae",
    "Camelidae", "Camelidae", "Canidae", "Cervidae",
    "Cervidae", "Cervidae", "Cervidae", "Cervidae", "Suidae",
    "Tayassuidae", "Tragulidae", "Tragulidae"))
```
Make the following plots with appropriate axis labels:
1. A plot of body mass vs. metabolic rate
2. A plot of body mass vs. metabolic rate, with logarithmically scaled axes (this stretches the axis, but keeps the numbers on the original scale), and the point size set to 3.
3. The same plot as (2), but with the different families indicated using color.
4. The same plot as (2), but with the different families each in their own subplot.
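As a reminder of how logarithmic axis scaling works in ggplot2, here is a minimal sketch on a made-up data frame. The name `toy` and its values are assumptions for illustration and are not the assignment data. `scale_x_log10()` and `scale_y_log10()` position the points on a log10 scale while the tick labels keep the original values.

```r
library(ggplot2)

# Hypothetical toy data, not the assignment data
toy <- data.frame(
  x = c(1, 10, 100, 1000, 10000),
  y = c(2, 15, 110, 900, 12000)
)

# Points are placed on log10 scales, but the axis labels
# still show the original (untransformed) numbers
p_log <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(size = 3) +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "X (original units)", y = "Y (original units)")

p_log
```

The alternative of plotting `log10(x)` directly would change the numbers shown on the axes, which is why the scale functions are used here.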
Adult vs Newborn Size (20 pts)

Larger organisms have larger offspring. We want to explore the form of this relationship in mammals.

Download some mammal life history data from the web. You can do this either directly in the program using read.csv() or download the file to your computer using your browser, save it in the data subdirectory, and import it from there. It is tab delimited so you’ll want to use sep = "\t" as an optional argument when calling read.csv(). The \t is how we indicate a tab character to R (and most other programming languages).

The file has some extra blank lines at the end. Get rid of them by using the optional read.csv() argument nrows = 1440 to import only the first 1440 rows.

Missing data in this file is coded as -999 and -999.00. Tell R to treat these as missing values (NA) using the optional read.csv() argument na.strings = c("-999", "-999.00"). This will stop them from being plotted.
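The three optional arguments described above can be tried out together on a small in-memory example before pointing them at the real file. This is only a sketch: the string `raw`, its column names, and the row count of 3 are made up for illustration, and `text =` stands in for a file path.

```r
# Hypothetical tab-delimited text standing in for the real file:
# three data rows followed by two empty trailing rows
raw <- paste0(
  "species\tmass\tnewborn\n",
  "A\t10\t1.5\n",
  "B\t-999\t2.0\n",
  "C\t30\t-999.00\n",
  "\t\t\n",
  "\t\t\n"
)

mammals_demo <- read.csv(
  text = raw,                          # use a file path or URL for real data
  sep = "\t",                          # columns are separated by tabs
  nrows = 3,                           # stop after the first 3 data rows
  na.strings = c("-999", "-999.00")    # treat these codes as NA
)

print(mammals_demo)
```

The `mass` value for species B and the `newborn` value for species C come back as NA, and the trailing empty rows are never read.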
1. Graph adult mass vs. newborn mass. Label the axes with clearer labels than the column names.
2. It looks like there’s a regular pattern here, but it’s definitely not linear. Let’s see if log-transformation straightens it out. Graph adult mass vs. newborn mass, with both axes scaled logarithmically. Label the axes.
3. This looks like a pretty regular pattern, so you wonder if it varies among different groups. Graph adult mass vs. newborn mass, with both axes scaled logarithmically, and the data points colored by order. Label the axes.
4. Coloring the points was useful, but there are a lot of points and it’s kind of hard to see what’s going on with all of the orders. Use facet_wrap to create a subplot for each order.
5. Now let’s visualize the relationships between the variables using a simple linear model. Create a new graph like your faceted plot, but using geom_smooth to fit a linear model to each order. You can do this using the optional argument method = "lm" in geom_smooth.
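For reference, this is roughly how `geom_smooth(method = "lm")` behaves when combined with `facet_wrap()`. The sketch uses simulated data; `toy`, `group`, and the slopes are assumptions chosen only for illustration. Because the data are split into panels, a separate straight line is fitted within each panel.

```r
library(ggplot2)

# Hypothetical simulated data with two groups, not the mammal data
set.seed(1)
toy <- data.frame(
  group = rep(c("a", "b"), each = 20),
  x = rep(1:20, times = 2)
)
# Group "a" has slope 2, group "b" has slope 0.5, plus random noise
toy$y <- ifelse(toy$group == "a", 2, 0.5) * toy$x + rnorm(40)

p_lm <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm") +  # linear fit, computed separately per panel
  facet_wrap(~group)

p_lm
```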
Acacia and Ants Histograms (20 pts)

An experiment in Kenya has been exploring the influence of large herbivores on plants.

Download the data on Acacia for the experiment into a data subdirectory for your project and read it into R using the following command:
```
acacia <- read.csv("data/ACACIA_DREPANOLOBIUM_SURVEY.txt", sep="\t", na.strings = c("dead"))
```
1. Make a bar plot of the number of acacia with different ant mutualists.
2. Make a histogram of the height of acacia (using the HEIGHT column). Label the x axis “Height (m)” and the y axis “Number of Acacia”.
3. Make a plot that shows histograms of both AXIS1 and AXIS2. Due to the way the data is structured, you’ll need to add a second geom_histogram() layer that specifies a new aesthetic. To make it possible to see both sets of bars, you’ll need to make them transparent with the optional argument alpha = 0.3. Set the fill color for AXIS1 to “red” and for AXIS2 to “black” using the fill argument. Label the x axis “Canopy Diameter (m)” and the y axis “Number of Acacia”.
4. Use facet_wrap() to make the same plot as (3) but with one subplot for each treatment. Set the number of bins in the histogram to 10.
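The idea of stacking two histogram layers, each with its own x aesthetic, can be sketched on a made-up data frame. The names `toy`, `width_a`, and `width_b` are assumptions for illustration, not columns from the acacia data.

```r
library(ggplot2)

# Hypothetical toy data with two measurement columns
toy <- data.frame(
  width_a = c(1.0, 1.5, 2.0, 2.2, 2.8, 3.1, 3.5, 4.0),
  width_b = c(0.8, 1.2, 1.9, 2.5, 2.6, 3.0, 3.9, 4.4)
)

# Each layer maps a different column to x; alpha makes the
# overlapping bars see-through so both layers stay visible
p_hist <- ggplot(toy) +
  geom_histogram(aes(x = width_a), fill = "red", alpha = 0.3, bins = 5) +
  geom_histogram(aes(x = width_b), fill = "black", alpha = 0.3, bins = 5) +
  labs(x = "Width", y = "Count")

p_hist
```

Because `fill` and `alpha` are set outside `aes()`, they are fixed values for the whole layer rather than mappings to a column.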
Acacia and Ants Data Manipulation (30 pts)

An experiment in Kenya has been exploring the influence of large herbivores on plants.

Download the data on Trees for the experiment into a data subdirectory. There are a number of problematic entries in this data, so use the readr package to import it:
```
library(readr)
trees <- read_tsv("data/TREE_SURVEYS.txt")
```
1. Add a new column to the trees data frame named canopy_area that contains the estimated canopy area calculated as the value in the AXIS_1 column times the value in the AXIS_2 column. Print out the SURVEY, YEAR, SITE, and canopy_area columns from the data frame.
2. Make a scatter plot with canopy_area on the x axis and HEIGHT on the y axis. Color the points by TREATMENT and plot the points for each value in the SPECIES column in a separate subplot. Label the x axis “Canopy Area (m^2)” and the y axis “Height (m)”. Make the point size 2.
3. That’s a big outlier in the plot from (2). 50 by 50 meters is a little too big for a real Acacia, so filter the data to remove any values for AXIS_1 and AXIS_2 that are over 20 and update the data frame. Then remake the graph.
4. Find out how the abundance of each species has been changing through time. Use group_by, summarize, and n to make a data frame with YEAR, SPECIES, and an abundance column that has the number of individuals in each species in each year. Print out this data frame.
5. Make a line plot with points (by using geom_line in addition to geom_point) with YEAR on the x axis and abundance on the y axis with one subplot per species. To let you see each trend clearly, let the scale for the y axis vary among plots by adding scales = "free_y" as an optional argument to facet_wrap.
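The counting pattern with group_by, summarize, and n, followed by a faceted line plot with free y scales, looks roughly like this on a made-up data frame. The names `toy`, `year`, `species`, and `counts` are assumptions for illustration and do not come from the trees data.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: one row per observed individual
toy <- data.frame(
  year = c(2010, 2010, 2010, 2011, 2011, 2010, 2011, 2011, 2011, 2011),
  species = c("a", "a", "b", "a", "b", "b", "b", "b", "b", "b")
)

# n() counts the rows in each year-by-species group
counts <- toy %>%
  group_by(year, species) %>%
  summarize(abundance = n(), .groups = "drop")

print(counts)

# scales = "free_y" gives every panel its own y-axis range
p_trend <- ggplot(counts, aes(x = year, y = abundance)) +
  geom_line() +
  geom_point() +
  facet_wrap(~species, scales = "free_y")

p_trend
```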

Assignment submission & checklist
