We will get acquainted with how R works
We will learn about different types of variables
We will scratch the surface of several R packages, such as parts of the tidyverse (dplyr and ggplot2)
We will create a dashboard with information from the dog bites dataset
Go to this link to download the folder with this exercise: https://github.com/tixwitchy/Dogs-of-New-York
Click the green Clone or download button and choose Download ZIP
Open the DogsofNewYork.Rproj file
In the Console pane on the left side of RStudio, paste the following code and press Enter:
After installing R and RStudio, you need to set a working directory where all your work will be stored.
The best way to do this is to choose File/New Project, which automatically stores all your files in the same place.
Because we have already opened the DogsofNewYork.Rproj file, the working directory is already set for us.
When you install R, basic functions are already available in Base R. You can take a look at Introduction to Base R for additional information.
However, to access functions or data written by other people, there are numerous R packages available.
An R package is a bundle of functions (code), data, documentation, and vignettes (examples).
Important note: R is case-sensitive, so make sure to check spelling and capitalization!
To use an R package, you first need to install it and then load it with library(). Use the following code to install packages and load libraries.
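The original install code is not reproduced here, so the following is only a minimal sketch. It assumes the packages named in this tutorial (tidyverse, plotly, DT, flexdashboard) are the ones you need; the helper variable `to_install` and the chosen CRAN mirror are illustrative choices, not part of the workshop material.

```r
# Packages mentioned in this tutorial (assumed list; adjust to your needs)
packages <- c("tidyverse", "plotly", "DT", "flexdashboard")

# Find the packages that are not installed yet
is_installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)
to_install <- packages[!is_installed]

# Install only what is missing (this needs to happen once per computer)
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}

# Load the packages (this needs to happen in every new R session)
for (pkg in packages) {
  library(pkg, character.only = TRUE)
}
```

Installing is a one-time step, while loading with library() has to be repeated each time you start R.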
Type the following command in your console and press Enter.
## [1] 4
You use <- to create objects in R. It is called the assignment operator.
## [1] 15
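As an illustration of the assignment operator, here is a small sketch. The object names `x`, `y`, and `total` and their values are made up for this example and are not necessarily the ones used in the workshop.

```r
# Create two objects with the assignment operator
x <- 10
y <- 5

# Use the objects in a calculation and store the result in a new object
total <- x + y

# Typing the name of an object prints its value
total
```

Once an object is created, it appears in the Environment pane of RStudio and can be reused in later commands.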
The data set on dog bites is taken from the R package nycdogs by Kieran Healy. For our exercise it has been adapted to include only the year 2017 and a few variables. So let us see what the dataset looks like.
Important note: You will rarely come across a dataset that is already prepared for analysis. Usually, you will spend between 50% and 80% of your time cleaning and preparing data.
First, we will import and inspect a CSV file about dog bites in New York City for 2017 with the following code.
There are 3072 rows, which we will refer to as observations, and 6 columns, which we will call variables. As you can also see, we have different types of variables, such as character, date, and double (continuous).
## Rows: 3,072
## Columns: 6
## $ date_of_bite <date> 2017-01-02, 2017-01-02, 2017-01-04, 2017-01-07, 2017-01…
## $ breed <chr> "Labrador Retriever Crossbreed", "Lhasa Apso", "Pit Bull…
## $ gender <chr> "Male", "Male", "Unknown", "Unknown", "Male", "Unknown",…
## $ spay_neuter <chr> "No", "Yes", "No", "No", "Yes", "No", "No", "No", "No", …
## $ borough <chr> "Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn", "Brookly…
## $ zip_code <dbl> 11231, 11211, 11219, 11216, 11216, 11229, 11216, 11206, …
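The import and inspection step that produces output like the one above could look roughly as follows. This is a sketch under assumptions: the real file name is not given here, so the example writes three made-up rows to a temporary CSV file and reads that instead. The names `csv_path` and `bites_demo` are hypothetical.

```r
library(readr)  # read_csv()
library(dplyr)  # glimpse()

# Write a tiny made-up CSV to a temporary file so the example runs on its own.
# In the workshop you would pass the path of the real dog bites file instead.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "date_of_bite,breed,gender,spay_neuter,borough,zip_code",
  "2017-01-02,Lhasa Apso,Male,Yes,Brooklyn,11211",
  "2017-01-04,Poodle,Unknown,No,Brooklyn,11219",
  "2017-01-07,Beagle,Female,No,Queens,11101"
), csv_path)

# Import the file; read_csv() guesses the type of each column
bites_demo <- read_csv(csv_path)

# Inspect the number of rows and columns and the type of each variable
glimpse(bites_demo)
```

glimpse() lists every column with its type, which is a quick way to check that dates and numbers were imported as intended.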
Measured variables have their outcome expressed in numerical terms (numeric):
Integer: Age, number of kittens
Double (Continuous): Height, weight
Attribute variables have their outcomes described in terms of characteristics or attributes:
Character: Black, yellow, white
Factor (Ordinal): Cold, mild, warm, hot
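The variable types listed above can be tried out directly in the console. The following is a small sketch with made-up values; the object names are chosen only for illustration.

```r
# Measured (numeric) variables
kittens <- 3L        # the L suffix makes this an integer
weight <- 4.2        # numbers with decimals are stored as double

# Attribute variables
colour <- "black"    # text in quotes is a character
temperature <- factor(
  c("mild", "hot", "cold"),
  levels = c("cold", "mild", "warm", "hot"),
  ordered = TRUE     # ordered = TRUE makes this an ordinal factor
)

# Check how R stores each object
typeof(kittens)
typeof(weight)
class(colour)
class(temperature)
levels(temperature)
```

For an ordered factor, comparisons such as `temperature[1] < temperature[2]` follow the order of the levels rather than the alphabet.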
In the top left corner, click the document icon with the plus sign and choose R Markdown. Then choose the Flex Dashboard template.
The tidyverse provides a so-called “pipe” operator, %>%. It passes the result of the left-hand side as the first argument of the function on the right-hand side. It is used to chain multiple operations on data together.
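To see what the pipe does, here is a minimal sketch with made-up numbers. It loads dplyr, which makes %>% available; the object names are illustrative.

```r
library(dplyr)  # makes the %>% operator available

numbers <- c(1, 4, 9)

# Without the pipe: functions are nested and read from the inside out
nested_result <- sum(sqrt(numbers))

# With the pipe: the same steps read from left to right
piped_result <- numbers %>%
  sqrt() %>%   # sqrt(numbers)
  sum()        # sum(sqrt(numbers))

piped_result
```

Both versions compute the same value; the piped version simply lists the steps in the order they happen.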
In the Setup part of the code, we will import the dog bites data set and create a subset with the number of bites per borough, which we will use in the text of our dashboard.
Now let us take a look at the five boroughs with the highest number of bites.
## # A tibble: 5 x 3
## borough n perc
## <chr> <int> <dbl>
## 1 Queens 817 27
## 2 Brooklyn 690 22
## 3 Manhattan 663 22
## 4 Bronx 506 16
## 5 Staten Island 284 9
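A subset like the one shown above could be built along these lines. This is a sketch, not the workshop's original code: it uses a few made-up rows instead of the real data, the name `bites_demo` is hypothetical, and it assumes the percentage is the rounded share of all bites.

```r
library(dplyr)

# Toy stand-in for the dog bites data (made-up rows)
bites_demo <- data.frame(
  borough = c("Queens", "Queens", "Brooklyn", "Bronx"),
  stringsAsFactors = FALSE
)

borough_counts <- bites_demo %>%
  count(borough) %>%                          # number of bites per borough, stored in column n
  mutate(perc = round(n / sum(n) * 100)) %>%  # share of all bites in percent (assumed formula)
  arrange(desc(n)) %>%                        # highest number of bites first
  head(5)                                     # keep at most five boroughs

borough_counts
```

count() is a shortcut for group_by() followed by tally(), so either form gives the same counts.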
We will use inline R code: a backtick `, followed by r and an expression, closed with another backtick. The result is inserted into the text automatically, so if we use a subset for another year, the text will update straight away. To access a particular value in a dataset, you can use square brackets, where the first number is the row and the second is the column. For example, datadogsborougs[1,1] returns:
## # A tibble: 1 x 1
## borough
## <chr>
## 1 Queens
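Row and column indexing can be tried on a small table. The sketch below uses a base R data frame named `boroughs_demo` (a hypothetical name) holding the first two rows of the borough table shown earlier.

```r
# Small table with the first two rows of the borough summary
boroughs_demo <- data.frame(
  borough = c("Queens", "Brooklyn"),
  n = c(817, 690),
  perc = c(27, 22),
  stringsAsFactors = FALSE
)

# [row, column]: first row, first column
first_borough <- boroughs_demo[1, 1]

# Second row, third column
second_perc <- boroughs_demo[2, 3]

first_borough
second_perc
```

With a base R data frame this returns the bare value, while a tibble returns a 1 x 1 tibble, as in the output above.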
### **How much do dogs bite in New York City** {data-height=250, align=justify}
This dashboard shows statistics of dog bites in New York City in 2017. It shows that there were a total of
**`r nrow(datadogs2017)`** bites. The bar chart below shows the three breeds with the most bites.
Although there are certain patterns that might suggest that **Pit Bulls** are the most aggressive, it might
just be the case that there are more of them in New York in comparison to other breeds.
**`r datadogsborougs[1,1]`** and **`r datadogsborougs[2,1]`** are the two boroughs with
the highest percentage of bites **`r datadogsborougs[1,3]`%**
and **`r datadogsborougs[2,3]`%** respectively.
### **Table of Dog Bites in New York in 2017** {data-height=750}
First, in the Setup part of our dashboard document, we will create a table without the last column, which contains zip codes.
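Dropping a column can be done with select() from dplyr. The following is a sketch on made-up rows; the names `bites_demo` and `table_data` are hypothetical and stand in for the workshop's own objects.

```r
library(dplyr)

# Toy stand-in for the dog bites data (made-up rows)
bites_demo <- data.frame(
  breed = c("Poodle", "Beagle"),
  borough = c("Queens", "Bronx"),
  zip_code = c(11101, 10451),
  stringsAsFactors = FALSE
)

# A minus sign in select() drops the named column and keeps the rest
table_data <- bites_demo %>%
  select(-zip_code)

names(table_data)
```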
Now we will add a searchable table just below the title Table of Dog Bites in New York in 2017 with the help of the DT package.
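A searchable table can be created with datatable() from the DT package. This is a minimal sketch on made-up rows; the object names and the chosen options are illustrative, not the workshop's exact settings.

```r
library(DT)

# Toy table (made-up rows)
table_demo <- data.frame(
  breed = c("Poodle", "Beagle", "Pug"),
  borough = c("Queens", "Bronx", "Brooklyn"),
  stringsAsFactors = FALSE
)

# datatable() turns a data frame into an interactive HTML table
# with a search box, sortable columns, and paging
bites_table <- datatable(
  table_demo,
  rownames = FALSE,                 # hide the row numbers
  options = list(pageLength = 10)   # show ten rows per page
)

bites_table
```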
First, we will create a subset to see which three breeds bite the most. We will again put this part of the code in the Setup part of our R dashboard/R Markdown file.
datadogsbreed <- datadogs2017 %>%
  group_by(breed) %>% # grouping by the breed variable
  tally() %>% # tally counts the number of bites per breed
  rename(Number_of_bites = n) %>% # renaming the count column, which is called "n" by default
  arrange(desc(Number_of_bites)) %>% # arranging the breeds in descending order
  top_n(3, Number_of_bites) # keeping only the top three breeds by number of bites
We will use two packages, one (ggplot2) to make a bar graph and another (plotly) to make the graph’s information pop up when hovering. ggplot2 is a package created by Hadley Wickham that is based on a grammar of graphics.
It lets you specify the building blocks of a plot and combine them to create the graphical display you want:
data
aesthetic mapping
geometric object
statistical transformations
scales
coordinate system
position adjustments
faceting
Instead of Chart B, we will write the title Three breeds with the highest number of bites in 2017 and use this code for a bar chart.
p <- ggplot(data = datadogsbreed) + # data
  geom_col(aes(x = breed, y = Number_of_bites), # geometric object and aesthetic mapping (bar order follows the levels of breed)
           fill = c("darkred", "darkgreen", "darkblue")) + # one fill color per bar
  xlab("Breed") + # adding the x-axis label
  ylab("Number of Bites") + # adding the y-axis label
  theme(legend.position = "none",
        panel.background = element_rect(fill = "lightcyan")) # removing the legend and changing the panel background color
# Turn it interactive with plotly
p <- ggplotly(p)
p
In the Setup part of our file, we will add the following code, which transforms the character variable breed into a factor. We will specify the levels so that the dog breeds are ordered by number of bites: the first level will be Pit Bull, followed by Unknown and then Shih Tzu.
Note: To use a particular column/variable in R, we connect the dataset and the column name with the dollar sign ($).
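The conversion to a factor could be sketched as follows. The level order comes from the text above, while the data frame `breeds_demo` and its counts are made up for illustration.

```r
# Toy stand-in for the breed subset (counts are made up)
breeds_demo <- data.frame(
  breed = c("Shih Tzu", "Pit Bull", "Unknown"),
  Number_of_bites = c(3, 10, 5),
  stringsAsFactors = FALSE
)

# The dollar sign selects the breed column of breeds_demo.
# factor() with explicit levels fixes the order used for the bars.
breeds_demo$breed <- factor(
  breeds_demo$breed,
  levels = c("Pit Bull", "Unknown", "Shih Tzu")
)

levels(breeds_demo$breed)
```

Without explicit levels, R orders the levels alphabetically, which is why a plot of a character column shows the bars in alphabetical order.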
In this final part, we will create a stacked bar chart that shows how many of the dogs that bit were spayed/neutered and how many of them were male or female. So, again in the Setup part, we will create a subset grouped by spay/neuter status and gender. We will also create another column to use as a pop-up label.
datadogsgenderspay <- datadogs2017 %>%
  group_by(spay_neuter, gender) %>%
  tally() %>% # counting the number of cases in each combination of the two groups
  mutate(Info = paste("<br>", "Spay/Neuter:", # "<br>" is used to indicate a line break
                      spay_neuter, "<br>",
                      "Number of bites:", n, "<br>",
                      "Gender:", gender, "<br>")) # creating a new variable, Info, to be used as a label
spay_neuter | gender | n | Info |
---|---|---|---|
No | Female | 271 | <br> Spay/Neuter: No <br> Number of bites: 271 <br> Gender: Female <br> |
No | Male | 682 | <br> Spay/Neuter: No <br> Number of bites: 682 <br> Gender: Male <br> |
No | Unknown | 1063 | <br> Spay/Neuter: No <br> Number of bites: 1063 <br> Gender: Unknown <br> |
Yes | Female | 290 | <br> Spay/Neuter: Yes <br> Number of bites: 290 <br> Gender: Female <br> |
Yes | Male | 755 | <br> Spay/Neuter: Yes <br> Number of bites: 755 <br> Gender: Male <br> |
Yes | Unknown | 11 | <br> Spay/Neuter: Yes <br> Number of bites: 11 <br> Gender: Unknown <br> |
Instead of Chart C, we will write and center the title Bites based on dog’s gender and whether they were spayed/neutered {align=center} and use this code for a stacked bar chart:
p1 <- ggplot(data = datadogsgenderspay) + # data
  geom_col(aes(x = spay_neuter, y = n, # geometric object and aesthetic mapping
               fill = gender, label = Info)) + # stacking by gender and using the Info column as a label
  scale_fill_manual(values = c("cyan3", "darkorange", "purple")) + # using specific colours for gender
  ylab("Number of bites") +
  theme(legend.position = "none",
        panel.background = element_blank())
p1 <- ggplotly(p1, tooltip = "Info") # showing the Info label in the plotly tooltip
p1
“We infer that something we see in the data applies beyond the time, place and conditions in which it happened to surface.”
— Ben Jones, Avoiding Data Pitfalls
In order to say that Pit Bulls are really aggressive, we would need to do additional research.
Is it valid to draw conclusions from this number of observations? Is the data reliable?
That is why experts need to be able to create this type of visualisation. They already have the expertise needed to draw valid conclusions, and this tool can help them reach a wider audience as well as follow and contribute to other people’s work.
If your dashboard has Shiny elements, you can publish it through shinyapps.io.