# Wrangling
### Data Visualization for Social Good <a href='https://correlaid.org/correlaid-x/switzerland/'>CorrelAid Switzerland</a>
### February 2021

---

<div class="my-footer">
 <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/>
 <a href="https://therbootcamp.github.io/">https://correlaid.org/correlaid-x/switzerland/</a>
 <a href="https://correlaid.org/correlaid-x/switzerland/">Data Visualization for Social Good | February 2021</a>
</div>

---

<!---
.pull-left45[

# What is "Wrangling"?

<ul>
 <li class="m1"><high>Transform</high>
 
 <ul class="level">
 <li>Change column names</li>
 <li>Create new variables</li>
 </ul></li>
 <li class="m2"><high>Organize</high>
 
 <ul class="level">
 <li>Sort rows</li>
 <li>Join data sets</li>
 <li>Transpose data</li>
 </ul></li>
 <li class="m3"><high>Aggregate</high>
 
 <ul class="level">
 <li>Build groups</li>
 <li>Calculate statistics</li>
 </ul></li>
</ul>

]

]

--->

# Tidyverse

<ul>
 <li class="m1">The tidyverse is...
 <ul class="level">
 <li>A collection of user-friendly <high>packages</high> for analyzing <high>tidy data</high></li>
 <li>An <high>ecosystem</high> for analytics and data science with common design principles</li>
 <li>A <high>dialect</high> of the R language</li>
 </ul></li>
</ul>

---

# <mono>%>%</mono>

<ul>
 <li class="m1">The <high>novel pipe operator</high> from the <a href="https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html"><mono>magrittr</mono></a> package makes chaining commands easy.</li>
</ul>

```r
# Numeric vector
score <- c(8, 4, 6, 3, 7, 3)
score
```

```
## [1] 8 4 6 3 7 3
```

```r
# Mean: Base-R-style
mean(score)
```

```
## [1] 5.167
```

```r
# Mean: pipe-style (%>% requires magrittr or the tidyverse to be loaded)
score %>%
  mean()
```

```
## [1] 5.167
```
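
The benefit of the pipe becomes clearer once several calls are chained. The following is a minimal sketch that reuses the `score` vector; the particular combination of `sort()`, `head()`, `mean()`, and `round()` is only an illustrative assumption.

```r
library(magrittr)  # provides %>%; dplyr re-exports the same operator

score <- c(8, 4, 6, 3, 7, 3)

# Nested Base R: has to be read from the inside out
nested <- round(mean(head(sort(score, decreasing = TRUE), 3)), 1)

# Piped: reads from top to bottom, one step per line
piped <- score %>%
  sort(decreasing = TRUE) %>%  # 8 7 6 4 3 3
  head(3) %>%                  # 8 7 6
  mean() %>%                   # 7
  round(1)

identical(nested, piped)  # TRUE
```

Each step passes its result as the first argument of the next call, which is why the piped version needs no intermediate objects.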

---

# <mono>readr</mono>

<ul>
 <li class="m1">Benefits over <mono>read.csv</mono>:
 <ul class="level">
 <li>Better type inference</li>
 <li>Never converts strings to <mono>factors</mono> (the <mono>read.csv</mono> default before R 4.0)</li>
 <li>Produces a <highm>tibble</highm></li>
 </ul></li>
</ul>

---

# <mono>readr</mono>

```r
library(tidyverse)

# Read in the taxation data
basel <- read_csv("1_Data/taxation.csv")

# Print the tibble
basel
```

```
## # A tibble: 357 x 10
## year quarter quarter_no N
## <dbl> <chr> <dbl> <dbl>
## 1 2001 Altsta… 1 1673
## 2 2001 Vorstä… 2 3204
## 3 2001 Am Ring 3 6579
## 4 2001 Breite 4 5433
## 5 2001 St. Al… 5 6179
## # … with 352 more rows, and 6 more
## # variables: income_mean <dbl>,
## # income_median <dbl>,
## # income_gini <dbl>,
## # wealth_mean <dbl>,
## # wealth_median <dbl>,
## # wealth_gini <dbl>
```
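
The type inference and the absence of factors can also be checked without the course file. A minimal sketch, assuming a small hypothetical CSV written to a temporary path (the three rows are made up for illustration):

```r
library(readr)

# Hypothetical toy file, written to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("year,quarter,income_mean",
             "2001,Quarter A,60000",
             "2001,Quarter B,50000"), path)

toy <- read_csv(path)

inherits(toy, "tbl_df")  # TRUE: the result is a tibble
class(toy$quarter)       # "character", not "factor"
class(toy$year)          # "numeric"
```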

---

# <mono>tibble</mono>

<ul>
 <li class="m1">Benefits over <mono>data.frame</mono>:
 <ul class="level">
 <li><high>Better printing</high>: more informative and cleaner</li>
 <li>More consistent subsetting: <mono>[</mono> always returns a tibble</li>
 </ul></li>
</ul>

```r
# Read in taxation
basel <- read_csv("1_Data/taxation.csv")

basel
```
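
What "more consistent subsetting" means can be sketched with a small comparison. The data below are hypothetical toy values, not taken from the taxation file:

```r
library(tibble)

# Hypothetical toy data, once as data.frame and once as tibble
df  <- data.frame(quarter = c("A", "B"), income = c(60000, 50000))
tbl <- as_tibble(df)

# Selecting a single column with [ , ]
class(df[, "income"])              # "numeric": the data.frame drops to a vector
inherits(tbl[, "income"], "tbl_df")  # TRUE: the tibble stays a tibble

# Partial name matching with $
df$inc   # partially matches `income` and returns that column
tbl$inc  # NULL, with a warning about the unknown column
```

The tibble behavior is stricter, which tends to surface typos in column names early instead of silently returning something else.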

---

# <mono>dplyr</mono>

<ul>
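 <li class="m1">Benefits over Base R:
 <ul class="level">
 <li><high>No more brackets</high>: verbs replace <mono>[ ]</mono> and <mono>$</mono> indexing</li>
 <li><high>Data masking</high>: refer to columns by their bare names</li>
 <li>Tidy selection: pick columns by name, position, or helper functions</li>
 <li>Intuitively named functions</li>
 </ul></li>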
</ul>

<table cellspacing="0" cellpadding="0" class="clean_table" width="100%">
<col width="42%">
<col width="58%">
<tr>
<td>Key verbs</td>
<td>Purpose</td>
</tr>
<tr>
<td style="padding-top:20px">Transformation</td>
<td></td>
</tr>
<tr>
<td><mono>rename()</mono></td>
<td>Rename columns</td>
</tr>
<tr>
<td><mono>mutate()</mono></td>
<td>Create/change columns</td>
</tr>
<tr>
<td style="padding-top:20px">Organization</td>
<td></td>
</tr>
<tr>
<td><mono>arrange()</mono></td>
<td>Sort</td>
</tr>
<tr>
<td><mono>select()</mono></td>
<td>Select variables</td>
</tr>
<tr>
<td><mono>slice()</mono>, <mono>filter()</mono></td>
<td>Select rows</td>
</tr>
<tr>
<td><mono>left_join()</mono>, <mono>inner_join()</mono>, etc.</td>
<td>Join data sets</td>
</tr>
<tr>
<td style="padding-top:20px">Aggregation</td>
<td></td>
</tr>
<tr>
<td><mono>summarize()</mono></td>
<td>Calculate statistics</td>
</tr>
<tr>
<td><mono>group_by()</mono></td>
<td>Group rows for group-wise summaries</td>
</tr>
</table>
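
How data masking and tidy selection replace bracket indexing can be sketched as follows. The tibble `toy` is a hypothetical stand-in with made-up values; only its column names are modeled on the taxation data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with made-up values
toy <- tibble(
  year        = c(2016, 2017, 2017),
  quarter     = c("A", "A", "B"),
  income_mean = c(50000, 52000, 80000),
  wealth_mean = c(100000, 110000, 400000)
)

# Base R: brackets, and the data name has to be repeated
base_res <- toy[toy$year == 2017, c("quarter", "income_mean", "wealth_mean")]

# dplyr: columns are referred to by bare name (data masking),
# and ends_with() picks columns by pattern (tidy selection)
dplyr_res <- toy %>%
  filter(year == 2017) %>%
  select(quarter, ends_with("_mean"))

all.equal(base_res, dplyr_res)  # TRUE
```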

---

# `select()`

```r
# Select two columns
TIBBLE %>% 
  select(VAR1, VAR2)

# Select everything but VAR1
TIBBLE %>% 
  select(-VAR1)
```

```r
basel %>%
  
  # Select columns
  select(year, quarter, income_mean)
```

```
## # A tibble: 357 x 3
## year quarter income_mean
## <dbl> <chr> <dbl>
## 1 2001 Altstadt Gross… 87776
## 2 2001 Vorstädte 84109
## 3 2001 Am Ring 62582
## 4 2001 Breite 52039
## 5 2001 St. Alban 89956
## 6 2001 Gundeldingen 51229
## 7 2001 Bruderholz 96124
## 8 2001 Bachletten 70348
## # … with 349 more rows
```

<!---

# `slice()`

```r
# Slice using :
TIBBLE %>%
  slice(INDEX_START:INDEX_STOP)

# Slice using vector  
TIBBLE %>%
  slice(c(INDEX1, INDEX2, ...))
```

]

```r
basel %>%
    select(year, quarter, income_mean) %>%

# Select rows 20 to 30
  slice(20:30)
```

```
## # A tibble: 11 x 3
## year quarter income_mean
## <dbl> <chr> <dbl>
## 1 2001 Riehen 84857
## 2 2001 Bettingen 83803
## 3 2002 Altstadt Gross… 89525
## 4 2002 Vorstädte 86350
## 5 2002 Am Ring 64797
## 6 2002 Breite 52483
## 7 2002 St. Alban 85906
## 8 2002 Gundeldingen 52035
## # … with 3 more rows
```

]

--->

---

# `filter()`

```r
# Filter using logical comparisons (comma = AND, | = OR)
TIBBLE %>%
  filter(VAR1 == VAL1,
         VAR2 > VAL2,
         VAR3 < VAL3,
         VAR4 == VAL4 | VAR5 < VAL5)
```

```r
basel %>%
  select(year, quarter, income_mean) %>%

  # Keep rows where year is 2017
  filter(year == 2017)
```

```
## # A tibble: 21 x 3
## year quarter income_mean
## <dbl> <chr> <dbl>
## 1 2017 Altstadt Gross… 97111
## 2 2017 Vorstädte 103714
## 3 2017 Am Ring 78761
## 4 2017 Breite 56888
## 5 2017 St. Alban 102457
## 6 2017 Gundeldingen 56544
## 7 2017 Bruderholz 105973
## 8 2017 Bachletten 81580
## # … with 13 more rows
```
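
How several conditions combine inside `filter()` can be sketched on a small example. The tibble `toy` and its values are hypothetical; `%in%` is a Base R operator added here for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with made-up values
toy <- tibble(
  year    = c(2016, 2017, 2017, 2017),
  quarter = c("A", "A", "B", "C"),
  income  = c(50, 55, 80, 95)
)

# Comma = AND: year is 2017 and income is above 60
and_rows <- toy %>% filter(year == 2017, income > 60)

# | = OR within one condition
or_rows <- toy %>% filter(income < 52 | income > 90)

# %in% tests membership in a set of values
in_rows <- toy %>% filter(quarter %in% c("A", "B"))

and_rows$quarter  # "B" "C"
or_rows$quarter   # "A" "C"
nrow(in_rows)     # 3
```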

---

# `arrange()`

```r
# Sort ascending
TIBBLE %>%
  arrange(VAR1, VAR2)

# Sort descending w/ desc()
TIBBLE %>%
  arrange(desc(VAR1), VAR2)
```

```r
basel %>%
  select(year, quarter, income_mean) %>%
  filter(year == 2017) %>% 
  
  # Sort by income
  arrange(income_mean)
```

```
## # A tibble: 21 x 3
## year quarter income_mean
## <dbl> <chr> <dbl>
## 1 2017 Klybeck 41569
## 2 2017 Kleinhüningen 45664
## 3 2017 Clara 50680
## 4 2017 Matthäus 50786
## 5 2017 Iselin 51600
## 6 2017 St. Johann 52890
## 7 2017 Rosental 54543
## 8 2017 Gundeldingen 56544
## # … with 13 more rows
```

---

# `summarize()`

```r
# Create new summary variables
TIBBLE %>%
  summarize(
    NAME1 = SUMMARY_FUN(VAR1),
    NAME2 = SUMMARY_FUN(VAR2)
  )
```

```r
basel %>%
  filter(year == 2017) %>% 
 
  # Calculate averages in 2017
  summarize(
    income = mean(income_mean),
    wealth = mean(wealth_mean))
```

```
## # A tibble: 1 x 2
## income wealth
## <dbl> <dbl>
## 1 72388. 560333.
```

<!---

# `summarise_if()`

```r
# Create new summary variables
TIBBLE %>%
  summarise_if(
    CONDITION,
    SUMMARY_FUN
  )
```

]

```r
basel %>% 
  
  # Calculate averages in 2017
  summarize_if(is.numeric,
               mean)
```

```
## # A tibble: 1 x 9
## year quarter_no N
## <dbl> <dbl> <dbl>
## 1 2009 11.4 5381.
## # … with 6 more variables:
## # income_mean <dbl>,
## # income_median <dbl>,
## # income_gini <dbl>,
## # wealth_mean <dbl>,
## # wealth_median <dbl>,
## # wealth_gini <dbl>
```

]

--->

---

# `group_by()`

```r
# Create grouped summary variables
TIBBLE %>%
  group_by(GROUP_VAR) %>%
  summarize(
    NAME1 = SUMMARY_FUN(VAR1),
    NAME2 = SUMMARY_FUN(VAR2)
  )
```

```r
basel %>%
  
  # Calculate averages for each year
  group_by(year) %>% 
  summarize(
    income = mean(income_mean),
    wealth = mean(wealth_mean))
```

```
## # A tibble: 17 x 3
## year income wealth
## <dbl> <dbl> <dbl>
## 1 2001 63027. 347770.
## 2 2002 63555. 367401.
## 3 2003 63083. 373278.
## 4 2004 62298. 353968.
## 5 2005 63133. 441864.
## 6 2006 64148. 465242.
## 7 2007 66594 435270.
## 8 2008 66463. 401131.
## # … with 9 more rows
```

---

# `group_by()`

```r
# Create grouped summary variables
TIBBLE %>%
  group_by(GROUP_VAR) %>%
  summarize(
    NAME1 = SUMMARY_FUN(VAR1),
    NAME2 = SUMMARY_FUN(VAR2)
  )
```

```r
basel %>%
  
  # Calculate averages for each year
  group_by(year) %>%
  summarize(
    income = mean(income_mean),
    wealth = mean(wealth_mean)) %>%

  # Sort years by average income (ascending)
  arrange(income)
```

```
## # A tibble: 17 x 3
## year income wealth
## <dbl> <dbl> <dbl>
## 1 2004 62298. 353968.
## 2 2001 63027. 347770.
## 3 2003 63083. 373278.
## 4 2005 63133. 441864.
## 5 2002 63555. 367401.
## 6 2006 64148. 465242.
## 7 2011 66050. 398102.
## 8 2008 66463. 401131.
## # … with 9 more rows
```
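
Group-wise summaries are not limited to means. A minimal sketch on hypothetical toy data that also counts the rows per group with dplyr's `n()`:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with made-up values
toy <- tibble(
  year   = c(2016, 2016, 2017, 2017, 2017),
  income = c(50, 70, 55, 80, 95)
)

by_year <- toy %>%
  group_by(year) %>%
  summarize(
    n_rows = n(),           # number of rows in each group
    income = mean(income)   # group mean
  )

by_year  # one row per year: 2016 (2 rows, mean 60) and 2017 (3 rows)
```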

---

# `*_join()`

```r
# Join two tibbles
TIBBLE1 %>%
  left_join(TIBBLE2, 
            by = c("KEY1" = "KEY2"))
```

```r
basel %>%
  group_by(year) %>% 
  summarize(
    income = mean(income_mean),
    wealth = mean(wealth_mean)) %>% 
  
  # Join the yearly averages back to basel via the shared column year
  right_join(basel, by = "year")
```

```
## # A tibble: 357 x 12
## year income wealth quarter
## <dbl> <dbl> <dbl> <chr> 
## 1 2001 63027. 3.48e5 Altsta…
## 2 2001 63027. 3.48e5 Vorstä…
## 3 2001 63027. 3.48e5 Am Ring
## 4 2001 63027. 3.48e5 Breite 
## 5 2001 63027. 3.48e5 St. Al…
## 6 2001 63027. 3.48e5 Gundel…
## 7 2001 63027. 3.48e5 Bruder…
## 8 2001 63027. 3.48e5 Bachle…
## # … with 349 more rows, and 8 more
## # variables: quarter_no <dbl>,
## # N <dbl>, income_mean <dbl>,
## # income_median <dbl>,
## # income_gini <dbl>,
## # wealth_mean <dbl>, …
```
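
The difference between the join variants, and the `by = c("KEY1" = "KEY2")` form for differently named keys, can be sketched with two hypothetical toy tibbles (`incomes` and `labels` are made up for illustration):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; the key is named differently in each tibble
incomes <- tibble(quarter_no = c(1, 2, 3), income = c(50, 80, 95))
labels  <- tibble(id = c(1, 2), quarter = c("A", "B"))

# left_join keeps every row of the left tibble; unmatched rows get NA
left <- incomes %>%
  left_join(labels, by = c("quarter_no" = "id"))

# inner_join keeps only rows with a match in both tibbles
inner <- incomes %>%
  inner_join(labels, by = c("quarter_no" = "id"))

left   # 3 rows, quarter is NA for quarter_no 3
inner  # 2 rows
```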

---

# <mono>tidyr</mono>

<ul>
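 <li class="m1">Benefits over Base R:
 <ul class="level">
 <li>Reshaping between wide and long formats has no comparably simple Base R counterpart (<mono>reshape()</mono> exists but is widely considered cumbersome).</li>
 </ul></li>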
</ul>

<img src="https://github.com/gadenbuie/tidyexplain/raw/master/images/tidyr-spread-gather.gif" height=420px> 
adapted from <a href="https://github.com/gadenbuie/tidyexplain">tidyexplain</a>

---

# `pivot_longer()`

```r
# Wide to long
TIBBLE %>%
  pivot_longer(cols = VARS,
               names_to = "NAME1",
               values_to = "NAME2")
```

```r
# wide to long
basel %>% 
  select(year, quarter, 
         income_mean, wealth_mean) %>% 
  pivot_longer(c(income_mean, wealth_mean))
```

```
## # A tibble: 714 x 4
## year quarter name value
## <dbl> <chr> <chr> <dbl>
## 1 2001 Altstadt Gr… income… 8.78e4
## 2 2001 Altstadt Gr… wealth… 1.01e6
## 3 2001 Vorstädte income… 8.41e4
## 4 2001 Vorstädte wealth… 1.12e6
## 5 2001 Am Ring income… 6.26e4
## 6 2001 Am Ring wealth… 3.01e5
## 7 2001 Breite income… 5.20e4
## 8 2001 Breite wealth… 1.05e5
## # … with 706 more rows
```
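
The default column names `name` and `value` can be replaced through `names_to` and `values_to`, and `pivot_wider()` reverses the reshaping. A minimal sketch on hypothetical toy data; the column names `measure` and `amount` are arbitrary choices:

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data with made-up values
wide <- tibble(
  quarter     = c("A", "B"),
  income_mean = c(50, 80),
  wealth_mean = c(300, 900)
)

# Wide to long with custom column names
long <- wide %>%
  pivot_longer(cols = c(income_mean, wealth_mean),
               names_to = "measure",
               values_to = "amount")

long  # 4 rows: one per quarter and measure

# Long back to wide
back <- long %>%
  pivot_wider(names_from = measure, values_from = amount)

all.equal(wide, back)  # TRUE
```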

---

<h1><a href="https://therbootcamp.github.io/EDA_2020Sep/_sessions/Wrangling/Wrangling_practical.html">Practical</a></h1>