class: center, middle, inverse, title-slide # Wrangling ### Data Visualization for Social Good
CorrelAid Switzerland
### February 2021 --- layout: true <div class="my-footer"> <span style="text-align:center"> <span> <img src="https://raw.githubusercontent.com/therbootcamp/therbootcamp.github.io/master/_sessions/_image/by-sa.png" height=14 style="vertical-align: middle"/> </span> <a href="https://therbootcamp.github.io/"> <span style="padding-left:82px"> <font color="#7E7E7E"> https://correlaid.org/correlaid-x/switzerland/ </font> </span> </a> <a href="https://correlaid.org/correlaid-x/switzerland/"> <font color="#7E7E7E"> Data Visualization for Social Good | February 2021 </font> </a> </span> </div> --- <!--- .pull-left45[ # What is "Wrangling"? <ul> <li class="m1"><span><high>Transform</high> <br><br> <ul class="level"> <li><span>Change column names</span></li> <li><span>Create new variables</span></li> </ul></span></li> <li class="m2"><span><high>Organize</high> <br><br> <ul class="level"> <li><span>Sort rows</span></li> <li><span>Join data sets</span></li> <li><span>Transpose data</span></li> </ul></span></li> <li class="m3"><span><high>Aggregate</high> <br><br> <ul class="level"> <li><span>Build groups</span></li> <li><span>Calculate statistics</span></li> </ul></span></li> </ul> ] .pull-right5[ <br> <p align="center"> <img src="image/wrangling_eng.png" height = "530px"> </p> ] ---> .pull-left3[ # Tidyverse <ul> <li class="m1"><span>The tidyverse is...</span></li><br> <ul class="level"> <li><span>A collection of user-friendly <high>packages</high> for analyzing <high>tidy data</high></span></li><br> <li><span>An <high>ecosystem</high> for analytics and data science with common design principles</span></li><br> <li><span>A <high>dialect</high> of the R language</span></li> </ul> </ul> ] .pull-right65[ <br><br> <p align="center"> <img src="image/tidyverse_wrangling.png" height = "520px"> </p> ] --- # <mono>%>%</mono> .pull-left45[ <ul> <li class="m1"><span>The <high>novel pipe operator</high> from the <a href="https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html"><mono>magrittr</mono></a> package makes chaining commands easy.</span></li> </ul> <br> <p align="center"> <img src="image/magrittr_hex.png" height = "280px"> </p> ] .pull-right45[ ```r # Numeric vector score <- c(8, 4, 6, 3, 7, 3) score ``` ``` ## [1] 8 4 6 3 7 3 ``` ```r # Mean: Base-R-style mean(score) ``` ``` ## [1] 5.167 ``` ```r # Mean: dplyr-style score %>% mean() ``` ``` ## [1] 5.167 ``` ] --- # <mono>%>%</mono> .pull-left45[ <ul> <li class="m1"><span>The <high>novel pipe operator</high> from the <a href="https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html"><mono>magrittr</mono></a> package makes chaining commands easy.</span></li> </ul> <br> <p align="center"> <img src="image/magrittr_hex.png" height = "280px"> </p> ] .pull-right45[ <p align="center"> <img src="image/pipe_en.png" height = "400px"> </p> ] --- # <mono>readr</mono> .pull-left45[ <ul> <li class="m1"><span>Benefits over <mono>read.csv</mono>:</span></li> <ul class="level"> <li><span>Better type inference</span></li> <li><span>Avoids <mono>factors</mono></span></li> <li><span>Produces <highm>tibble</highm></span></li> </ul></span></li> </ul> <br> <p align="center"> <img src="https://github.com/rstudio/hex-stickers/raw/master/PNG/readr.png" height=240px> </p> ] .pull-right45[ <p align="center"> <img src="image/data.png" height=400> </p> ] --- # <mono>readr</mono> .pull-left45[ <ul> <li class="m1"><span>Benefits over <mono>read.csv</mono>:</span></li> <ul class="level"> <li><span>Better type inference</span></li> <li><span>Avoids <mono>factors</mono></span></li> <li><span>Produces <highm>tibble</highm></span></li> </ul></span></li> </ul> <br> <p align="center"> <img src="https://github.com/rstudio/hex-stickers/raw/master/PNG/readr.png" height=240px> </p> ] .pull-right45[ ```r # Read in taxation basel <- read_csv("1_Data/taxation.csv") basel ``` ``` ## # A tibble: 357 x 10 ## year quarter quarter_no N ## <dbl> <chr> <dbl> <dbl> ## 1 2001 Altsta… 1 1673 ## 2 2001 Vorstä… 2 3204 ## 3 2001 Am Ring 3 6579 ## 4 2001 Breite 4 5433 ## 5 2001 St. Al… 5 6179 ## # … with 352 more rows, and 6 more ## # variables: income_mean <dbl>, ## # income_median <dbl>, ## # income_gini <dbl>, ## # wealth_mean <dbl>, ## # wealth_median <dbl>, ## # wealth_gini <dbl> ``` ] --- # <mono>tibble</mono> .pull-left45[ <ul> <li class="m1"><span>Benefits over <mono>data.frame</mono>:</span></li> <ul class="level"> <li><span><high>Better print</high>: More informative and cleaner</span></li> <li><span>More consistent subsetting</span></li> </ul></span></li> </ul> <br> <p align="center"> <img src="https://github.com/rstudio/hex-stickers/raw/master/PNG/tibble.png" height=240px> </p> ] .pull-right45[ ```r # Read in taxation basel <- read_csv("1_Data/taxation.csv") basel ``` ``` ## # A tibble: 357 x 10 ## year quarter quarter_no N ## <dbl> <chr> <dbl> <dbl> ## 1 2001 Altsta… 1 1673 ## 2 2001 Vorstä… 2 3204 ## 3 2001 Am Ring 3 6579 ## 4 2001 Breite 4 5433 ## 5 2001 St. Al… 5 6179 ## # … with 352 more rows, and 6 more ## # variables: income_mean <dbl>, ## # income_median <dbl>, ## # income_gini <dbl>, ## # wealth_mean <dbl>, ## # wealth_median <dbl>, ## # wealth_gini <dbl> ``` ] --- .pull-left45[ # <mono>dplyr</mono> <ul> <li class="m1"><span>Benefits over Base R:</span></li> <ul class="level"> <li><span><high>No more brackets</high></span></li> <li><span><high>Data masking</high></span></li> <li><span>Tidy selection</span></li> <li><span>Intuitively named functions</span></li> </ul></span></li> </ul> <br> <p align="center"> <img src="https://github.com/rstudio/hex-stickers/raw/master/PNG/dplyr.png" height=240px> </p> ] .pull-right5[ <br><br> <table cellspacing="0" cellpadding="0" class="clean_table" width="100%"> <col width="42%"> <col width="58%"> <tr> <td><b>Key verbs</b></td> <td><b>Purpose</b></td> </tr> <tr> <td style="padding-top:20px"><i>Transformation</i></td> <td></td> </tr> <tr> <td><mono>rename()</mono></td> <td>Rename column names</td> </tr> <tr> <td><mono>mutate()</mono></td> <td>Create/change columns</td> </tr> <td style="padding-top:20px"><i>Organization</i></td> <td></td> </tr> <tr> <td><mono>arrange()</mono></td> <td>Sort</td> </tr> <tr> <td><mono>select()</mono></td> <td>Select variables</td> </tr> <tr> <td><mono>slice()</mono>, <mono>filter()</mono></td> <td>Select rows</td> </tr> <tr> <td><mono>left_join()</mono>, <mono>inner_join()</mono>, etc.</td> <td>Join data sets</td> </tr> <td style="padding-top:20px"><i>Aggregation</i></td> <td></td> </tr> <tr> <td><mono>summarize()</mono></td> <td>Calculate statistics</td> </tr> <tr> <td><mono>group()</mono></td> <td>Summarize group-wise</td> </tr> </table> ] --- # `select()` .pull-left4[ ```r # Select two columns TIBBLE %>% select(VAR1, VAR2) # Select everything but TIBBLE %>% select(-VAR1) ``` ] .pull-right5[ ```r basel %>% # Select columns select(year, quarter, income_mean) ``` ``` ## # A tibble: 357 x 3 ## year quarter income_mean ## <dbl> <chr> <dbl> ## 1 2001 Altstadt Gross… 87776 ## 2 2001 Vorstädte 84109 ## 3 2001 Am Ring 62582 ## 4 2001 Breite 52039 ## 5 2001 St. Alban 89956 ## 6 2001 Gundeldingen 51229 ## 7 2001 Bruderholz 96124 ## 8 2001 Bachletten 70348 ## # … with 349 more rows ``` ] <!--- # `slice()` .pull-left4[ ```r # Slice using : TIBBLE %>% slice(INDEX_START:INDEX_STOP) # Slice using vector TIBBLE %>% slice(c(INDEX1, INDEX2, ...)) ``` ] .pull-right5[ ```r basel %>% select(year, quarter, income_mean) %>% # Select rows 20 to 30 slice(20:30) ``` ``` ## # A tibble: 11 x 3 ## year quarter income_mean ## <dbl> <chr> <dbl> ## 1 2001 Riehen 84857 ## 2 2001 Bettingen 83803 ## 3 2002 Altstadt Gross… 89525 ## 4 2002 Vorstädte 86350 ## 5 2002 Am Ring 64797 ## 6 2002 Breite 52483 ## 7 2002 St. Alban 85906 ## 8 2002 Gundeldingen 52035 ## # … with 3 more rows ``` ] ---> --- # `filter()` .pull-left4[ ```r # Filter using logical comparisons TIBBLE %>% filter(VAR1 == VAL1, VAR2 > VAL2, VAR3 < VAL3, VAR4 == VAL4 | VAR5 < VAL5) ``` ] .pull-right5[ ```r basel %>% select(year, quarter, income_mean) %>% # Select rows rows where year is 2017 filter(year == 2017) ``` ``` ## # A tibble: 21 x 3 ## year quarter income_mean ## <dbl> <chr> <dbl> ## 1 2017 Altstadt Gross… 97111 ## 2 2017 Vorstädte 103714 ## 3 2017 Am Ring 78761 ## 4 2017 Breite 56888 ## 5 2017 St. Alban 102457 ## 6 2017 Gundeldingen 56544 ## 7 2017 Bruderholz 105973 ## 8 2017 Bachletten 81580 ## # … with 13 more rows ``` ] --- # `arrange()` .pull-left4[ ```r # Sort ascending TIBBLE %>% arrange(VAR1, VAR2) # Sort descending w/ desc() TIBBLE %>% arrange(desc(VAR1), VAR2) ``` ] .pull-right5[ ```r basel %>% select(year, quarter, income_mean) %>% filter(year == 2017) %>% # Sort by income arrange(income_mean) ``` ``` ## # A tibble: 21 x 3 ## year quarter income_mean ## <dbl> <chr> <dbl> ## 1 2017 Klybeck 41569 ## 2 2017 Kleinhüningen 45664 ## 3 2017 Clara 50680 ## 4 2017 Matthäus 50786 ## 5 2017 Iselin 51600 ## 6 2017 St. Johann 52890 ## 7 2017 Rosental 54543 ## 8 2017 Gundeldingen 56544 ## # … with 13 more rows ``` ] --- # `summarize()` .pull-left4[ ```r # Create new summary variables TIBBLE %>% summarise( NAME1 = SUMMARY_FUN(VAR1), NAME2 = SUMMARY_FUN(VAR2) ) ``` ] .pull-right5[ ```r basel %>% filter(year == 2017) %>% # Calculate averages in 2017 summarize( income = mean(income_mean), wealth = mean(wealth_mean)) ``` ``` ## # A tibble: 1 x 2 ## income wealth ## <dbl> <dbl> ## 1 72388. 560333. ``` ] <!--- # `summarise_if()` .pull-left4[ ```r # Create new summary variables TIBBLE %>% summarise_if( CONDITION, SUMMARY_FUN ) ``` ] .pull-right5[ ```r basel %>% # Calculate averages in 2017 summarize_if(is.numeric, mean) ``` ``` ## # A tibble: 1 x 9 ## year quarter_no N ## <dbl> <dbl> <dbl> ## 1 2009 11.4 5381. ## # … with 6 more variables: ## # income_mean <dbl>, ## # income_median <dbl>, ## # income_gini <dbl>, ## # wealth_mean <dbl>, ## # wealth_median <dbl>, ## # wealth_gini <dbl> ``` ] ---> --- # `group_by()` .pull-left4[ ```r # Create grouped summary variables TIBBLE %>% group_by(GRUPPEN_VAR) %>% summarise( NAME1 = SUMMARY_FUN(VAR1), NAME2 = SUMMARY_FUN(VAR2) ) ``` ] .pull-right5[ ```r basel %>% # Calculate averages for all years group_by(year) %>% summarize( income = mean(income_mean), wealth = mean(wealth_mean)) ``` ``` ## # A tibble: 17 x 3 ## year income wealth ## <dbl> <dbl> <dbl> ## 1 2001 63027. 347770. ## 2 2002 63555. 367401. ## 3 2003 63083. 373278. ## 4 2004 62298. 353968. ## 5 2005 63133. 441864. ## 6 2006 64148. 465242. ## 7 2007 66594 435270. ## 8 2008 66463. 401131. ## # … with 9 more rows ``` ] --- # `group_by()` .pull-left4[ ```r # Create grouped summary variables TIBBLE %>% group_by(GRUPPEN_VAR) %>% summarise( NAME1 = SUMMARY_FUN(VAR1), NAME2 = SUMMARY_FUN(VAR2) ) ``` ] .pull-right5[ ```r basel %>% # Calculate averages for all years group_by(year) %>% summarize( income = mean(income_mean), wealth = mean(wealth_mean)) %>% arrange(income) ``` ``` ## # A tibble: 17 x 3 ## year income wealth ## <dbl> <dbl> <dbl> ## 1 2004 62298. 353968. ## 2 2001 63027. 347770. ## 3 2003 63083. 373278. ## 4 2005 63133. 441864. ## 5 2002 63555. 367401. ## 6 2006 64148. 465242. ## 7 2011 66050. 398102. ## 8 2008 66463. 401131. ## # … with 9 more rows ``` ] --- .pull-left4[ # `*_join()` ```r # Join two tibbles TIBBLE1 %>% left_join(TIBBLE2, by = c("KEY1" = "KEY2")) ``` ] .pull-right5[ <br> ```r basel %>% group_by(year) %>% summarize( income = mean(income_mean), wealth = mean(wealth_mean)) %>% # join back to basel right_join(basel) ``` ``` ## # A tibble: 357 x 12 ## year income wealth quarter ## <dbl> <dbl> <dbl> <chr> ## 1 2001 63027. 3.48e5 Altsta… ## 2 2001 63027. 3.48e5 Vorstä… ## 3 2001 63027. 3.48e5 Am Ring ## 4 2001 63027. 3.48e5 Breite ## 5 2001 63027. 3.48e5 St. Al… ## 6 2001 63027. 3.48e5 Gundel… ## 7 2001 63027. 3.48e5 Bruder… ## 8 2001 63027. 3.48e5 Bachle… ## # … with 349 more rows, and 8 more ## # variables: quarter_no <dbl>, ## # N <dbl>, income_mean <dbl>, ## # income_median <dbl>, ## # income_gini <dbl>, ## # wealth_mean <dbl>, … ``` ] --- # <mono>tidyr</mono> .pull-left4[ <ul> <li class="m1"><span>Benefits over Base R:</span></li> <ul class="level"> <li><span>Did not exist before.</span></li> </ul></span></li> </ul> <br> <p align="center"> <img src="https://github.com/rstudio/hex-stickers/raw/master/PNG/tidyr.png" height=240px> </p> ] .pull-right5[ <p align="center"> <img src="https://github.com/gadenbuie/tidyexplain/raw/master/images/tidyr-spread-gather.gif" height=420px><br> <font style="font-size:10px">adapted from <a href="https://github.com/gadenbuie/tidyexplain">tidyexplain</a></font> </p> ] --- # `pivot_longer()` .pull-left4[ ```r # wide to long TIBBLE %>% pivot_longer(cols = VARS, names_to = NAME1, values_to = NAME2) ``` ] .pull-right5[ ```r # wide to long basel %>% select(year, quarter, income_mean, wealth_mean) %>% pivot_longer(c(income_mean, wealth_mean)) ``` ``` ## # A tibble: 714 x 4 ## year quarter name value ## <dbl> <chr> <chr> <dbl> ## 1 2001 Altstadt Gr… income… 8.78e4 ## 2 2001 Altstadt Gr… wealth… 1.01e6 ## 3 2001 Vorstädte income… 8.41e4 ## 4 2001 Vorstädte wealth… 1.12e6 ## 5 2001 Am Ring income… 6.26e4 ## 6 2001 Am Ring wealth… 3.01e5 ## 7 2001 Breite income… 5.20e4 ## 8 2001 Breite wealth… 1.05e5 ## # … with 706 more rows ``` ] --- class: middle, center <h1><a href="https://therbootcamp.github.io/EDA_2020Sep/_sessions/Wrangling/Wrangling_practical.html">Practical</a></h1>