
Introduction to R

a whirlwind crash course on awesome things in R

Introduction to R: Seattle pet names

Description: New to R statistical programming? Join this introductory R workshop in a whirlwind tour of R for data manipulation and visualization! We’ll apply widely used tools from the Tidyverse collection of packages in the RStudio interface to jumpstart your data science work using R. This material is intended to be covered by a two-hour instructor-led class.

Introduction

This course introduces R coding from a data science perspective, and is designed around the step-by-step process of analyzing a dataset.

By the end of this class, you should be able to:

This workshop is not intended to teach you everything you need to know to analyze your own data in R, but instead, should give you an idea of how R coding works and the basic ways that R coders go about developing a project.

Getting started

R is a statistical programming language, while RStudio is an integrated development environment (IDE) that allows you to code in R more easily.

The left side of RStudio is the Console, where you can run R code. The text printed in this panel is basic information about R and the version you’re running. You can test how the Console runs code by entering 3 + 4 and then pressing Enter. This instructs your computer to read, interpret, and execute the command, print the result (7) to the Console, and then show a right-facing arrow (>), indicating it is ready to accept additional code.

You can run code by typing into the Console, then hitting enter:

4 + 5 
## [1] 9

The panel on the lower right shows the files present in your working directory. Currently, that’s probably your Home directory, which includes folders like Documents and Downloads.

Before we continue coding, we need to do a few more things to make it easier for us to work in R. First, we’ll create a new project to use for this workshop:

After your RStudio screen reloads, note two things:

Now we’re ready to create a new R script:

This R script is a text file that we’ll use to save code we write in this workshop. We’ll refer to this window as the script or source window. Remember to save this file periodically to retain the record of the work you’re doing, so you can re-execute the code later if necessary.

By convention, a script should include a title at the top. Then we’ll load the packages we’ll be using.

# Introduction to R

# load packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.6
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Loading a package is like opening an application: it makes previously installed software available and ready to use.

A line of code starting with # is a code comment: this means it’s readable by and useful to humans, but R will ignore anything that comes after #. We’ve included a comment as a title, as well as one describing what the code library(tidyverse) means.

library is a function. Functions tell the computer how to perform particular tasks for you. In this case, library has instructed R to make functions from the Tidyverse packages readily accessible.

We’ve added this code to our script. If we wanted to execute (tell R to run) this command, we have options:

  1. Copy and paste the code into the Console
  2. Use the Run button at the top of the script window
  3. Use the keyboard shortcut: Ctrl + Enter

The third option is the most efficient, especially as your coding skills progress. With your cursor on the line with the code, hold down the Control key and press Enter. You’ll see the code and answer both appear in the Console. A few things to note about this keyboard shortcut:

In practice, a script should represent code you are developing in R, and you should only save the code that you know actually works. This makes it easier to re-create your analysis later. For this class, we’ll be including notes about things we learn as comments.

Importing data

Now that our project and software are available and loaded, we can obtain the data we’ll be using in this workshop.

In the code below:

You can think of this code in its entirety as referencing “the spreadsheet goes into pets.”

# import data and assign to object
pets <- read_csv("https://data.seattle.gov/api/views/jguv-t9rb/rows.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   `License Issue Date` = col_character(),
##   `License Number` = col_character(),
##   `Animal's Name` = col_character(),
##   Species = col_character(),
##   `Primary Breed` = col_character(),
##   `Secondary Breed` = col_character(),
##   `ZIP Code` = col_double()
## )

After executing the code above (by holding the Control key and pressing Enter), you will see:

Now that the object has been assigned, we can inspect the object and learn a bit more about what the data represents.

If you click on pets in the Environment pane, a new tab will appear next to your R script in the Source pane so you can preview the data. These data include pet names obtained from City of Seattle pet licenses.

In this case, we are not retaining a copy of the data on our computer (though we have loaded the data into our working memory). Additionally, we are importing data from a csv, or comma separated value, file. You may need to make different choices in telling R how to import your data; some additional options commonly encountered are available here.
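As a minimal sketch of one such choice, the code below reads a CSV while stating the column types explicitly instead of letting R guess them. The three-row file and the object names toy_path and toy_pets are invented for illustration and are not part of the Seattle data. Zip codes are read as text here, which is an assumption about what you might prefer, not what the workshop does.

```r
library(tidyverse)

# Hypothetical three-row file standing in for a real download
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Animal's Name,Species,ZIP Code",
  "Zen,Cat,98117",
  "Misty,Cat,98117",
  "Finn,Goat,"
), toy_path)

# col_types states the type of each column instead of relying on guessing
toy_pets <- read_csv(
  toy_path,
  col_types = cols(
    `Animal's Name` = col_character(),
    Species = col_character(),
    `ZIP Code` = col_character()
  )
)
toy_pets
```

The empty zip code in the last row is imported as NA, just like missing values in the full dataset.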

Exploring the data

Let’s assume we’re interested in answering the following questions using these data:

Let’s get started by learning a few more ways to look at our data, which will be useful as we figure out how to ask and answer our questions.

First, if we want to take a peek at our data again:

glimpse(pets)
## Rows: 46,388
## Columns: 7
## $ `License Issue Date` <chr> "December 18 2015", "June 14 2016", "June 16 2016…
## $ `License Number`     <chr> "S107948", "S116503", "S116742", "S119301", "2085…
## $ `Animal's Name`      <chr> "Zen", "Misty", "Frankie Lee Jones", "Lyra", "Puc…
## $ Species              <chr> "Cat", "Cat", "Cat", "Cat", "Cat", "Cat", "Cat", …
## $ `Primary Breed`      <chr> "Domestic Longhair", "Siberian", "Domestic Shorth…
## $ `Secondary Breed`    <chr> "Mix", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ `ZIP Code`           <dbl> 98117, 98117, NA, 98121, 98107, 98146, 98108, 981…

A few observations about these data:

  - most of the columns are straightforward to understand
  - we may need to convert the type of data (e.g., chr or dbl) depending on our questions
  - column names with spaces in them are encased in back ticks
  - NA indicates missing data

Next, let’s learn to use some functions to manipulate the data a bit more. We can count the number of each species, which will also tell us how many species there are:

# count number of each species 
count(pets, Species)
## # A tibble: 4 x 2
##   Species     n
##   <chr>   <int>
## 1 Cat     14542
## 2 Dog     31808
## 3 Goat       35
## 4 Pig         3

This function accepts two arguments: the name of the dataset (pets) and the column we would like to assess (Species). It returns a table with two columns, one indicating the name of each species, and the other (n) representing the number of rows in the original dataset for each species. Since there are relatively few examples of goats and pigs in the dataset, we may need to remove those later.
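To see the shape of a count() call on something small, here is a sketch using a six-row toy table (invented for illustration, not the Seattle data). It also shows the optional sort argument, which orders the result from most to least common.

```r
library(tidyverse)

# Toy data invented for illustration
toy_pets <- tibble(
  name = c("Zen", "Misty", "Loki", "Finn", "Luna", "Max"),
  Species = c("Cat", "Cat", "Dog", "Goat", "Dog", "Dog")
)

# Same call shape as count(pets, Species)
species_counts <- count(toy_pets, Species)

# sort = TRUE puts the most common species first
species_sorted <- count(toy_pets, Species, sort = TRUE)
species_sorted
```

The n column always sums to the number of rows in the original table, which is a quick way to check that nothing was dropped.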

Next, let’s take a look at our zip codes. Because R interpreted these as numerical data, we can use a built-in function, range, to find their minimum and maximum:

range(pets$`ZIP Code`)
## [1] NA NA

The $ is a special character that allows us to access a particular column in the data frame.

Since there are missing data in the dataset, we get a very unsatisfying answer. We can add an additional argument to ignore the missing data:

range(pets$`ZIP Code`, na.rm = TRUE)
## [1]   981 99810

Oops, looks like there’s at least one piece of data that has been entered incorrectly (an incomplete zip code). We’ll keep that in mind for some of our later explorations.
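The sketch below repeats these steps on a four-row toy table (invented for illustration) and then pulls out the rows with a suspiciously short zip code. The cutoff of 10000 is an assumption that valid zip codes here have five digits.

```r
library(tidyverse)

# Toy data invented for illustration
toy_pets <- tibble(
  name = c("Zen", "Misty", "Lyra", "Puck"),
  `ZIP Code` = c(98117, NA, 981, 98121)
)

zips <- toy_pets$`ZIP Code`             # $ extracts one column as a vector
zip_range_raw <- range(zips)            # NA NA, because of the missing value
zip_range <- range(zips, na.rm = TRUE)  # 981 98121

# Rows whose zip code has fewer than five digits
suspect <- filter(toy_pets, `ZIP Code` < 10000)
suspect
```

Note that filter() drops the row with a missing zip code, since comparing NA to a number gives NA rather than TRUE.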

Now that we have a better idea of some of the nuances of the data, let’s start to address some questions.

Can I find my dog in this dataset?

I adopted my dog Loki last September, and the website I pulled the data from indicates that licenses from that period are included. I can look for any animal with the name Loki using another function from Tidyverse:

filter(pets, `Animal's Name` == "Loki")
## # A tibble: 117 x 7
##    `License Issue Da… `License Number` `Animal's Name` Species `Primary Breed`  
##    <chr>              <chr>            <chr>           <chr>   <chr>            
##  1 April 10 2019      142748           Loki            Cat     Domestic Medium …
##  2 April 23 2019      8007463          Loki            Cat     Mix              
##  3 May 11 2019        8007924          Loki            Cat     Domestic Longhair
##  4 June 11 2019       349964           Loki            Cat     Pixie-Bob        
##  5 June 28 2019       S150592          Loki            Cat     Domestic Shortha…
##  6 July 20 2019       434474           Loki            Cat     Pixie-Bob        
##  7 August 04 2019     8010346          Loki            Cat     Domestic Shortha…
##  8 August 05 2019     S133431          Loki            Cat     Domestic Shortha…
##  9 August 07 2019     577887           Loki            Cat     Domestic Shortha…
## 10 August 21 2019     S136205          Loki            Cat     Domestic Shortha…
## # … with 107 more rows, and 2 more variables: Secondary Breed <chr>,
## #   ZIP Code <dbl>

The syntax, or way of representing R code with words and symbols, is similar here to other functions we’ve run. We are asking R to show us any rows for which the pet name (from the Animal’s Name column) is exactly (==) Loki. We have to include the name in quotation marks so R doesn’t interpret it as an object or column name.

It looks like there are a lot of pets named Loki! We could assign that output to a new object, then apply another filter to the new object. Instead, we’re going to learn a tool that will help us build more complex sets of data manipulation steps.

The following code results in the same output as the last code we ran:

pets %>%
  filter(`Animal's Name` == "Loki")
## # A tibble: 117 x 7
##    `License Issue Da… `License Number` `Animal's Name` Species `Primary Breed`  
##    <chr>              <chr>            <chr>           <chr>   <chr>            
##  1 April 10 2019      142748           Loki            Cat     Domestic Medium …
##  2 April 23 2019      8007463          Loki            Cat     Mix              
##  3 May 11 2019        8007924          Loki            Cat     Domestic Longhair
##  4 June 11 2019       349964           Loki            Cat     Pixie-Bob        
##  5 June 28 2019       S150592          Loki            Cat     Domestic Shortha…
##  6 July 20 2019       434474           Loki            Cat     Pixie-Bob        
##  7 August 04 2019     8010346          Loki            Cat     Domestic Shortha…
##  8 August 05 2019     S133431          Loki            Cat     Domestic Shortha…
##  9 August 07 2019     577887           Loki            Cat     Domestic Shortha…
## 10 August 21 2019     S136205          Loki            Cat     Domestic Shortha…
## # … with 107 more rows, and 2 more variables: Secondary Breed <chr>,
## #   ZIP Code <dbl>

The difference between these two commands is the use of the pipe, which in R is represented as %>%. Pipes are found in many programming languages. They function by sending the output from the lefthand side of the symbol (in our case, the data from pets) as the input for the righthand side (here, filtering by animal name). A few notes about this code:

  - We don’t include the name of the data in the filter command because the data is coming through the pipe.
  - The indentation in the second line is important for keeping track of code chunks that belong together.
  - When executing multiple lines of code in RStudio, your cursor can be placed on any line and the entire chunk will run.
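A quick way to convince yourself that the two forms are equivalent is to run both on a small table and compare the results. The toy table below is invented for illustration.

```r
library(tidyverse)

# Toy data invented for illustration
toy_pets <- tibble(
  name = c("Loki", "Loki", "Luna"),
  Species = c("Cat", "Dog", "Dog")
)

# Data passed as the first argument
nested <- filter(toy_pets, name == "Loki")

# Data sent through the pipe
piped <- toy_pets %>%
  filter(name == "Loki")

identical(nested, piped)
```

R 4.1 and later also provide a native pipe, written |>, which behaves similarly for simple cases like this one.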

Let’s add some more filters to our previous command:

pets %>%
  filter(`Animal's Name` == "Loki") %>%
  filter(Species == "Dog") %>%
  filter(`ZIP Code` == 98109)
## # A tibble: 2 x 7
##   `License Issue Da… `License Number` `Animal's Name` Species `Primary Breed`   
##   <chr>              <chr>            <chr>           <chr>   <chr>             
## 1 February 17 2020   S111120          Loki            Dog     Chihuahua, Short …
## 2 September 16 2020  S154941          Loki            Dog     Miniature Pinscher
## # … with 2 more variables: Secondary Breed <chr>, ZIP Code <dbl>

Wow, it looks like there are two dogs named Loki licensed in zip code 98109, including a miniature pinscher!
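Chaining several filter() calls is one way to narrow down the data. As a sketch of an alternative, the same conditions can be given to a single filter() call separated by commas, which dplyr treats as "and". The toy table and the column name zip are invented for illustration.

```r
library(tidyverse)

# Toy data invented for illustration
toy_pets <- tibble(
  name = c("Loki", "Loki", "Loki", "Luna"),
  Species = c("Cat", "Dog", "Dog", "Dog"),
  zip = c(98109, 98109, 98117, 98109)
)

# One condition per filter, chained with pipes
chained <- toy_pets %>%
  filter(name == "Loki") %>%
  filter(Species == "Dog") %>%
  filter(zip == 98109)

# All conditions in one filter, separated by commas
combined <- toy_pets %>%
  filter(name == "Loki", Species == "Dog", zip == 98109)

identical(chained, combined)
```

Both styles keep only rows that satisfy every condition; which one reads better is largely a matter of taste.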

Where do goats live?

Let’s continue to build on our knowledge of pipes and Tidyverse functions by addressing our next question. We’ll start with a filter that is similar to what we used in the previous section:

pets %>%
  filter(Species == "Goat")
## # A tibble: 35 x 7
##    `License Issue Da… `License Number` `Animal's Name`   Species `Primary Breed`
##    <chr>              <chr>            <chr>             <chr>   <chr>          
##  1 July 12 2019       S101598          Finn              Goat    Miniature      
##  2 July 12 2019       S101599          Mollie            Goat    Miniature      
##  3 July 31 2019       S136146          Abelard           Goat    Miniature      
##  4 July 31 2019       S136147          Olive             Goat    Miniature      
##  5 October 28 2019    S135973          Linda             Goat    Miniature      
##  6 October 28 2019    S135974          Max               Goat    Miniature      
##  7 December 27 2019   S135866          Grace             Goat    Miniature      
##  8 December 27 2019   828693           Pegasis           Goat    Miniature      
##  9 January 30 2020    S108877          Professor Nibble… Goat    Miniature      
## 10 January 30 2020    S108879          Brussels Sprout   Goat    Miniature      
## # … with 25 more rows, and 2 more variables: Secondary Breed <chr>,
## #   ZIP Code <dbl>

Location is roughly assigned by zip code. We can use two additional functions to group the data by zip code, then tally the number of rows (goats) in each group:

pets %>%
  filter(Species == "Goat") %>%
  group_by(Species, `ZIP Code`) %>%
  tally()
## # A tibble: 12 x 3
## # Groups:   Species [1]
##    Species `ZIP Code`     n
##    <chr>        <dbl> <int>
##  1 Goat         98108     6
##  2 Goat         98112     4
##  3 Goat         98115     5
##  4 Goat         98117     2
##  5 Goat         98118     3
##  6 Goat         98122     3
##  7 Goat         98125     2
##  8 Goat         98133     2
##  9 Goat         98144     2
## 10 Goat         98146     2
## 11 Goat         98177     2
## 12 Goat         98178     2

It looks like zip code 98108 has the most goats (6 of the 35).
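To make the grouping and tallying steps more concrete, the sketch below runs them on a five-row toy table (invented for illustration) and compares tally() with the longer summarize() form that spells out the counting.

```r
library(tidyverse)

# Toy data invented for illustration
toy_goats <- tibble(
  name = c("Finn", "Mollie", "Olive", "Max", "Grace"),
  zip = c(98108, 98108, 98112, 98108, 98112)
)

# Group by zip code, then count rows per group
by_zip <- toy_goats %>%
  group_by(zip) %>%
  tally()

# The same result with the counting written out
by_zip_long <- toy_goats %>%
  group_by(zip) %>%
  summarize(n = n())

by_zip
```

Here tally() is shorthand for summarizing each group with n(), the number of rows it contains.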

This coding approach is referred to as split-apply-combine because we have:

What are the most common pet names in the dataset?

Our next question will allow us to perform a few additional data manipulations, and also introduce some basic visualization.

First, let’s count how many of each pet name are in the dataset:

pets %>%
  count(`Animal's Name`)
## # A tibble: 13,137 x 2
##    `Animal's Name`     n
##    <chr>           <int>
##  1 "__"                1
##  2 "-"                 1
##  3 "?????"             1
##  4 "'Alani"            1
##  5 "'Murca"            1
##  6 "\"Mama\" Maya"     1
##  7 "#1"                1
##  8 "#2"                1
##  9 "30 Weight"         1
## 10 "7's"               1
## # … with 13,127 more rows

That’s still a lot of pet names! These are ordered by name rather than by how often each name occurs. We can reorder them with another function:

pets %>%
  count(`Animal's Name`) %>%
  arrange()
## # A tibble: 13,137 x 2
##    `Animal's Name`     n
##    <chr>           <int>
##  1 "__"                1
##  2 "-"                 1
##  3 "?????"             1
##  4 "'Alani"            1
##  5 "'Murca"            1
##  6 "\"Mama\" Maya"     1
##  7 "#1"                1
##  8 "#2"                1
##  9 "30 Weight"         1
## 10 "7's"               1
## # … with 13,127 more rows

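Nothing changed, because arrange with no arguments has nothing to sort by. We need to tell it which column to use (n), and wrap that column in desc to sort from largest to smallest: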

pets %>%
  count(`Animal's Name`) %>%
  arrange(desc(n))
## # A tibble: 13,137 x 2
##    `Animal's Name`     n
##    <chr>           <int>
##  1 Luna              402
##  2 Lucy              340
##  3 Charlie           334
##  4 Bella             285
##  5 Daisy             224
##  6 Max               219
##  7 Lily              198
##  8 Lola              195
##  9 Jack              190
## 10 Oliver            187
## # … with 13,127 more rows
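arrange() accepts more than one column, which is handy when counts are tied. The sketch below sorts a toy table of counts (invented for illustration) by descending count and then alphabetically by name.

```r
library(tidyverse)

# Toy counts invented for illustration
toy_counts <- tibble(
  name = c("Max", "Luna", "Bella", "Lucy"),
  n = c(2L, 5L, 2L, 3L)
)

# Largest counts first; ties broken alphabetically by name
toy_sorted <- toy_counts %>%
  arrange(desc(n), name)
toy_sorted
```

Bella and Max both have a count of 2, so the second sorting column decides which of them comes first.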

We can also assign the names to a new object so we can visualize them:

# find most common pet names
common <- pets %>%
  count(`Animal's Name`) %>%
  arrange(desc(n))

Now that we have our data organized appropriately, let’s create a histogram of these count frequencies. We’ll be using the plotting package ggplot2 from Tidyverse. There are three steps to creating a ggplot.

  1. bind data: create a new plot with a designated dataset
# basic ggplot
ggplot(data = common) # bind data to plot

The last line of code creates an empty plot, since we didn’t include any instructions for how to present the data.

  2. specify the aesthetic: maps the data to axes on a plot
# basic ggplot
ggplot(data = common, mapping = aes(x = n)) # specify aesthetics (axes)

This adds labels to the axis, but no data appear because we haven’t specified how they should be represented.

  3. add layers: visual representation of plot, including ways through which data are represented (geometries or shapes) and themes (anything not the data, like fonts)
ggplot(data = common,
       mapping = aes(x = n)) + 
  geom_histogram() # add a layer of geometry
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The plus sign (+) is used here to connect parts of ggplot code together. The line breaks and indentation used here represent the convention for ggplot, which makes the code more readable and easy to modify.

In the code above, note that we don’t need to include the labels for data = and mapping =. It’s also common to include the mapping (aes) in the geom, which allows for more flexibility in customizing (we’ll get to this later!).

ggplot(common, aes(n)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This plot is identical to the previous plot, despite the differences in code.
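The note above about placing the mapping inside the geom can be sketched as follows, using a toy table of counts (invented for illustration). ggplot_build() is used here only to extract the computed bin counts so the two versions can be compared without looking at the plots.

```r
library(tidyverse)

# Toy counts invented for illustration
toy_counts <- tibble(n = c(1, 1, 1, 2, 2, 5))

# Mapping given once for the whole plot
p_global <- ggplot(toy_counts, aes(n)) +
  geom_histogram(binwidth = 1)

# Mapping given inside the geometry layer
p_layer <- ggplot(toy_counts) +
  geom_histogram(aes(n), binwidth = 1)

# Compare the bin counts computed for each plot
counts_global <- ggplot_build(p_global)$data[[1]]$count
counts_layer <- ggplot_build(p_layer)$data[[1]]$count
identical(counts_global, counts_layer)
```

Putting the mapping in the geom becomes useful when a plot has several layers that each need different aesthetics.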

Let’s now work on customizing this plot! First, we’ll adjust the bin width as the Console is prompting us:

ggplot(common, aes(n)) +
  geom_histogram(binwidth = 1)

That makes it difficult to interpret the data. We can resolve this by scaling the y axis:

ggplot(common, aes(n)) +
  geom_histogram(binwidth = 1) +
  scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 281 rows containing missing values (geom_bar).

That created some new warnings, but we’re fairly confident in the data so we’re not going to worry about that for now.
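A likely explanation for these warnings is that some histogram bins contain zero names, and zero has no finite value on a log scale. The sketch below shows the arithmetic with a few made-up bin counts.

```r
# Made-up bin counts, including one empty bin
bin_counts <- c(0, 1, 10, 100)

log_counts <- log10(bin_counts)
log_counts             # -Inf 0 1 2

# Only finite values can be drawn on the transformed axis
is.finite(log_counts)  # FALSE TRUE TRUE TRUE
```

Bars for the empty bins are dropped from the plot, which does not affect the bins that do contain data.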

We can also customize the axis labels and add a title:

ggplot(common, aes(n)) +
  geom_histogram(binwidth = 1) +
  scale_y_log10() +
  labs(title = "Frequency of pet names",
       x = "number of pets with name",
       y = "number of names")
## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 281 rows containing missing values (geom_bar).

Most pets have fairly unique names, though some names are very common (with 100-200 pets coming when you call the name!).

Since we’ve done a lot of work to customize this plot, we will likely want to share it with others! We can save it to a file:

ggsave("pet_frequency.jpg")
## Saving 7 x 5 in image

## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 281 rows containing missing values (geom_bar).
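By default, ggsave() saves the most recently displayed plot at the size of the current plotting window. As a sketch of a more explicit approach, you can store the plot in an object and state the size yourself. The toy data, the object names, and the temporary PDF file are assumptions made for illustration.

```r
library(tidyverse)

# Toy counts invented for illustration
toy_counts <- tibble(n = c(1, 1, 1, 2, 2, 5))

# Store the plot in an object instead of only displaying it
p <- ggplot(toy_counts, aes(n)) +
  geom_histogram(binwidth = 1)

# The file extension determines the format; width and height are in inches
out_file <- tempfile(fileext = ".pdf")
ggsave(out_file, plot = p, width = 7, height = 5)

file.exists(out_file)
```

Naming the plot explicitly avoids accidentally saving a different plot if you have drawn something else in the meantime.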

What are the weirdest pet names in the dataset?

Let’s make our job a bit simpler and say we’re interested in both the shortest and longest pet names.

Our first task is to count the length of each name. First, let’s filter out any pets that are missing data for the name:

pet_names <- pets %>%
  filter(!is.na(`Animal's Name`))

Then we can use another function, mutate:

# count length of names
pet_names <- pets %>%
  filter(!is.na(`Animal's Name`)) %>%
  mutate(name_length = str_length(`Animal's Name`))

This function creates a new column called name_length. The data placed in the column is derived from counting the number of characters (str_length, or string length) for each pet’s name (Animal's Name).

Let’s look at the range of lengths:

# find range
range(pet_names$name_length)
## [1]  1 50

OK, I’m really curious about the shortest pet names. Let’s take a peek:

# find short names
pet_names %>%
  filter(name_length == 1)
## # A tibble: 13 x 8
##    `License Issue Da… `License Number` `Animal's Name` Species `Primary Breed`  
##    <chr>              <chr>            <chr>           <chr>   <chr>            
##  1 March 01 2020      8016070          V               Cat     Domestic Medium …
##  2 March 22 2020      133630           M               Cat     Maine Coon       
##  3 April 22 2020      8017333          U               Cat     Maine Coon       
##  4 August 11 2020     573722           -               Cat     Domestic Shortha…
##  5 October 20 2020    732044           Z               Cat     Domestic Shortha…
##  6 March 19 2021      S153992          B               Cat     Domestic Shortha…
##  7 April 05 2021      S156304          Q               Cat     Siamese          
##  8 October 02 2019    S103881          M               Dog     Boxer            
##  9 January 26 2021    S151203          Q               Dog     Catahoula Leopar…
## 10 July 06 2019       S133378          Z               Dog     Terrier, Pit Bull
## 11 September 05 2019  S135617          Z               Dog     Australian Cattl…
## 12 April 08 2020      8016956          Z               Dog     Doberman Pinscher
## 13 May 27 2020        8007383          Q               Dog     Terrier          
## # … with 3 more variables: Secondary Breed <chr>, ZIP Code <dbl>,
## #   name_length <int>

And finally, we can take a look at the longest names:

# find longest names
pet_names %>%
  filter(name_length > 40) %>%
  select(`Animal's Name`)
## # A tibble: 5 x 1
##   `Animal's Name`                                   
##   <chr>                                             
## 1 Ole-Won Cannoli Girgio Nibbles Gess-Trumm         
## 2 Her Grace, Princess Athena, Destroyer of Spiders  
## 3 Lady Kassandra Yu Countess of Wallingford DBE     
## 4 Little Miss Dublin Maeve of the Emerald Isle      
## 5 Nuit Ahathoor Hecate Sappho Jezebel Lilith Crowley

If we pipe the names through select, we’ll be able to view only the names column, so we can fully appreciate their majesty.
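Filtering on name_length > 40 requires knowing a sensible cutoff in advance. As a sketch of an alternative, slice_max() from dplyr keeps the rows with the largest values of a column without needing a threshold. The toy table below is invented for illustration.

```r
library(tidyverse)

# Toy data invented for illustration, including one missing name
toy_pets <- tibble(name = c("Zen", "Q", NA, "Frankie Lee Jones"))

# Drop missing names, then count characters in each name
toy_lengths <- toy_pets %>%
  filter(!is.na(name)) %>%
  mutate(name_length = str_length(name))

# Keep the two longest names and show only the name column
longest <- toy_lengths %>%
  slice_max(name_length, n = 2) %>%
  select(name)
longest
```

There is a matching slice_min() function that could be used the same way to find the shortest names.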

In which months are pets adopted most often?

For our last question, we’re going to perform the most complex data manipulations of the lesson, while also learning a few new functions.

Before we can answer our question, we need to make our date data easier to manipulate. First, we’ll transform the license dates to a format R recognizes as dates (they are currently character data):

pets %>%
  mutate(date = parse_date(`License Issue Date`, "%B %d %Y")) %>%
  glimpse()
## Rows: 46,388
## Columns: 8
## $ `License Issue Date` <chr> "December 18 2015", "June 14 2016", "June 16 2016…
## $ `License Number`     <chr> "S107948", "S116503", "S116742", "S119301", "2085…
## $ `Animal's Name`      <chr> "Zen", "Misty", "Frankie Lee Jones", "Lyra", "Puc…
## $ Species              <chr> "Cat", "Cat", "Cat", "Cat", "Cat", "Cat", "Cat", …
## $ `Primary Breed`      <chr> "Domestic Longhair", "Siberian", "Domestic Shorth…
## $ `Secondary Breed`    <chr> "Mix", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ `ZIP Code`           <dbl> 98117, 98117, NA, 98121, 98107, 98146, 98108, 981…
## $ date                 <date> 2015-12-18, 2016-06-14, 2016-06-16, 2016-08-04, …

Because this is code in progress, we’re not yet assigning to an object, and we’re also using glimpse to peek at the output. There are two important things about the new column:

  - its data type is date instead of chr
  - dates are now in a standard format of YYYY-MM-DD
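The format string "%B %d %Y" describes how each date is written: full month name, day of the month, four-digit year. The sketch below applies it to a few made-up strings, including one that does not match, to show what happens when parsing fails.

```r
library(tidyverse)

# Made-up strings; the last one deliberately does not match the format
raw_dates <- c("December 18 2015", "June 14 2016", "not a date")

# %B = full month name, %d = day, %Y = four-digit year
# suppressWarnings hides the parsing-failure warning for the bad string
parsed <- suppressWarnings(parse_date(raw_dates, "%B %d %Y"))
parsed
```

Strings that cannot be parsed become NA, so it is worth checking for unexpected missing values after converting a date column.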

Next, we can split the dates into separate columns, which will make plotting easier:

# make dates easier to manipulate
pet_dates <- pets %>%
  mutate(date = parse_date(`License Issue Date`, "%B %d %Y")) %>%
  separate(date, c("year", "month", "day"), remove = FALSE)

This creates three new columns named year, month, and day.
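By default, the columns produced by separate() hold text, not numbers. If numbers are needed, one option is the convert argument, sketched below on a toy table (invented for illustration) whose dates are stored as text, with the separator stated explicitly.

```r
library(tidyverse)

# Toy dates stored as text, invented for illustration
toy_dates <- tibble(date = c("2015-12-18", "2016-06-14"))

# convert = TRUE turns the new columns into integers
toy_split <- toy_dates %>%
  separate(date, c("year", "month", "day"),
           sep = "-", remove = FALSE, convert = TRUE)
toy_split
```

For the plots in this lesson, text months such as "06" still sort in calendar order because they are zero-padded, so converting is optional.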

Now we need to count the number of pets licensed each month. We’ll split out each species so we can compare that variable as well:

# reformat data for time series (group by month)
dates_group <- pet_dates %>%
  group_by(month, Species) %>%
  tally()

Now we’re ready to plot! Let’s create a line plot, which is good for visualizing time series:

ggplot(dates_group, aes(month, n)) +
  geom_line()

That doesn’t look very good. This is because we need to plot by species, too!

ggplot(dates_group, aes(month, n, group = Species)) +
  geom_line()

That’s a little better, but which is which? We can color by species as well:

ggplot(dates_group, aes(month, n, group = Species, color = Species)) +
  geom_line()

The goat and pig data isn’t particularly useful. Let’s filter it out before plotting:

# plot time series (only cats and dogs)
dates_group %>%
  filter(Species == "Cat" | Species == "Dog") %>%
  ggplot() +
    geom_line(aes(month, n, group = Species, color = Species))

There are a few new things occurring in the code above:

  - we are using pipes to send data directly to a plot, so we don’t need to specify data after ggplot
  - the aesthetic is now included in the geometry
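The filter above joins two conditions with | (meaning "or"). As a sketch of an alternative that scales better to longer lists of values, %in% checks whether each value appears in a vector. The toy table is invented for illustration.

```r
library(tidyverse)

# Toy monthly counts invented for illustration
toy_group <- tibble(
  Species = c("Cat", "Dog", "Goat", "Pig", "Dog"),
  n = c(5L, 9L, 1L, 1L, 7L)
)

# Two conditions joined with "or"
with_or <- toy_group %>%
  filter(Species == "Cat" | Species == "Dog")

# One condition testing membership in a vector
with_in <- toy_group %>%
  filter(Species %in% c("Cat", "Dog"))

identical(with_or, with_in)
```

Either form can be piped straight into ggplot() as in the code above.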

The Tidyverse also includes a package called lubridate that is specifically designed to make date and time data easier to manage. It requires loading an additional package, and we were able to perform our basic tasks with the functions we already had loaded, but it is certainly worthwhile to check out if you have more dates and times to manage!
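As a brief taste of lubridate, the sketch below parses two made-up date strings in the same style as the license dates and extracts the month and year directly, without splitting columns. It assumes lubridate is installed, which is normally the case when the Tidyverse has been installed.

```r
library(lubridate)

# Made-up strings in month-day-year order
raw_dates <- c("December 18 2015", "June 14 2016")

# mdy() parses dates written as month, day, year
parsed <- mdy(raw_dates)

month(parsed)  # 12 6
year(parsed)   # 2015 2016
```

Functions such as month() return numbers, so the results can be used for grouping or plotting right away.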

Take home messages

Acknowledgements and further reading