Wrangling Data Day Texas Slides

Author

Lucy D’Agostino McGowan

Published

January 28, 2018

Since twitter threads are excessively cumbersome to navigate, Maëlle asked me to relocate the list of #rstats Data Day Texas slides to a blog post, so here we are!

The titles link to the slides 👯

Pilgrim’s Progress: a journey from confusion to contribution

Mara Averick

Navigating the data science landscape can be overwhelming. Luckily, you don’t have to do it alone! In fact, I’ll argue you shouldn’t do it alone. Whether it be by tweeting your latest mistake, asking a well-formed question, or submitting a pull request to a popular package, you can help others and yourself by “learning out loud.” No matter how much (or little) you know, you can turn your confusion into contributions, and have a surprising amount of fun along the way.

Making Causal Claims as a Data Scientist: Tips and Tricks Using R

Lucy D’Agostino McGowan

Making believable causal claims can be difficult, especially given the much-repeated adage “correlation is not causation”. This talk will walk through some tools often used to practice safe causation, such as propensity scores and sensitivity analyses. In addition, we will cover principles that suggest causation, such as an understanding of counterfactuals and the application of Hill’s criteria in a data science setting. We will walk through specific examples, as well as provide R code for all methods discussed.
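
As a taste of the tools mentioned above, here is a minimal sketch of propensity score weighting. The data frame, treatment `treat`, confounders `age` and `sex`, and outcome `y` are all invented for illustration, not taken from the talk:

```r
# A minimal sketch of propensity score weighting; the data and variable
# names are hypothetical.
library(dplyr)

set.seed(1)
df <- tibble::tibble(
  treat = rbinom(500, 1, 0.4),
  age   = rnorm(500, 50, 10),
  sex   = rbinom(500, 1, 0.5),
  y     = rnorm(500)
)

# Step 1: estimate the propensity score with logistic regression
ps_model <- glm(treat ~ age + sex, data = df, family = binomial())

# Step 2: build inverse probability of treatment weights (for the ATE)
df <- df %>%
  mutate(
    ps = predict(ps_model, type = "response"),
    w  = if_else(treat == 1, 1 / ps, 1 / (1 - ps))
  )

# Step 3: a weighted outcome model as a simple treatment-effect estimate
lm(y ~ treat, data = df, weights = w)
```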

Statistics for Data Science: what you should know and why

Gabriela de Queiroz

Data science is not only about machine learning. To be a successful data person, you also need a significant understanding of statistics. Gabriela de Queiroz walks you through the top five statistical concepts every data scientist should know to work with data.

R, What is it good for? Absolutely Everything

Jasmine Dumas

Good does not mean great, but good is better than bad. When we compare programming languages, we tend to look at the surface components (popular developer influence, singular use cases, or language development and design choices), and sometimes we forget the substantive (sometimes secondary) components that can make a programming language appropriate for use, such as versatility, environment, and inclusivity. I’ll highlight each of these themes in the presentation to show, not just tell, why R is good for everything!

infer: an R package for tidy statistical inference

Chester Ismay

How do you code up a permutation test in R? What about an ANOVA or a chi-square test? Have you ever been uncertain exactly which type of test you should run given the data and the questions asked? The infer package was created to unite common statistical inference tasks into an expressive and intuitive framework to alleviate some of these struggles and make inference more intuitive. This talk will focus on the design principles of the package, which are firmly motivated by Hadley Wickham’s tidy tools manifesto. It will also discuss the implementation, centered on the common conceptual threads that link a surprising range of hypothesis tests and confidence intervals. Lastly, we’ll walk through some examples of how to use the infer package. The package aims to be useful to new students of statistics as well as to seasoned practitioners.
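
For a taste of that framework, here is a minimal sketch of the infer pipeline for a permutation test of a difference in means, using the built-in mtcars data (the variable choices are illustrative, not from the talk):

```r
# A minimal sketch of a permutation test with infer; mtcars is used purely
# as a stand-in dataset.
library(dplyr)
library(infer)

mtcars %>%
  mutate(am = factor(am)) %>%                   # transmission type as a factor
  specify(mpg ~ am) %>%                         # response ~ explanatory
  hypothesize(null = "independence") %>%        # null: mpg unrelated to am
  generate(reps = 1000, type = "permute") %>%   # build the null distribution
  calculate(stat = "diff in means", order = c("1", "0"))
```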

Something old, something new, something borrowed, something blue: Ways to teach data science (and learn it too!)

Albert Y. Kim

How can we help newcomers take their first steps into the world of data science and statistics? In this talk, I present ModernDive: An Introduction to Statistical and Data Sciences via R, an open source, fully reproducible electronic textbook available at ModernDive.com, co-authored by myself and Chester Ismay, Data Science Curriculum Lead at DataCamp. ModernDive’s authoring follows a paradigm of “versions, not editions” much more in line with software development than traditional textbook publishing, as it is built using RStudio’s bookdown interface to R Markdown. In this talk, I will present details on our book’s construction, our approaches to teaching novices to use tidyverse tools for data science (in particular ggplot2 for data visualization and dplyr for data wrangling), how we leverage these data science tools to teach data modeling via regression, and preview the new infer package for statistical inference, which performs statistical inference using an expressive syntax that follows tidy design principles. We’ll conclude by presenting example vignettes and R Markdown analyses created by undergraduate students to demonstrate the great potential yielded by effectively empowering new data scientists with the right tools.
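
For context on the “versions, not editions” workflow, this is roughly how a bookdown book is rebuilt from its R Markdown source; the entry-point file name follows bookdown’s defaults and is an assumption here, not ModernDive’s actual configuration:

```r
# A minimal sketch of rendering a bookdown book; "index.Rmd" is bookdown's
# conventional entry point, assumed for illustration.
# install.packages("bookdown")
bookdown::render_book("index.Rmd", output_format = "bookdown::gitbook")
```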

Building Shiny Apps: Challenges and Responsibilities

Jessica Minnier

R Shiny has revolutionized the way statisticians and data scientists distribute analytic results and research methods. We can easily build interactive web tools that empower non-statisticians to interrogate and visualize their data or perform their own analyses with methods we develop. However, ensuring the user has an enjoyable experience while guaranteeing the analysis options are statistically sound is a difficult balance to achieve. Through a case study of building START (Shiny Transcriptome Analysis Resource Tool), a Shiny app for “omics” data visualization and analysis, I will present the challenges you may face when building and deploying an app of your own. By allowing the non-statistician user to explore and analyze data, we can make our job easier and improve collaborative relationships, but success requires software development skills. We may need to consider issues such as data security, open-source collaborative code development, error handling and testing, user education, maintenance as methods and packages advance, and responsibility for downstream analyses and decisions based on the app’s results. With Shiny we do not want to fully eliminate the statistician or analyst “middle man”; instead, we need to stay relevant and in control of all the statistical products we create.
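
As one concrete example of the error-handling responsibility the abstract raises, Shiny’s validate() and need() let an app fail gracefully on bad input. A minimal sketch with a hypothetical file-upload app (not START itself):

```r
# A minimal sketch of defensive input validation in Shiny; the app and its
# inputs are invented for illustration.
library(shiny)

ui <- fluidPage(
  fileInput("file", "Upload a CSV with a numeric first column"),
  plotOutput("hist")
)

server <- function(input, output, session) {
  dat <- reactive({
    req(input$file)  # wait quietly until a file is uploaded
    df <- read.csv(input$file$datapath)
    validate(
      need(ncol(df) >= 1, "The file must contain at least one column."),
      need(is.numeric(df[[1]]), "The first column must be numeric.")
    )
    df
  })

  output$hist <- renderPlot(hist(dat()[[1]]))
}

shinyApp(ui, server)
```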

Using R on small teams in industry

Jonathan Nolis

Doing statistical analyses and machine learning in R requires many different components: data, code, models, outputs, and presentations. While one person can usually keep track of their own work, as you grow into a team of people it becomes more important to stay coordinated. This session discusses the data science work we do at Lenati, a marketing and strategy consulting firm, and why R is a great tool for us. It covers the best practices we’ve found for working on R code together across many projects and people, and how we handle the occasional instances where we must use other languages.

The Lesser Known Stars of the Tidyverse

Emily Robinson

While most R programmers have heard of ggplot2 and dplyr, many are unfamiliar with the breadth of the tidyverse and the variety of problems it can solve. In this talk, we will give a brief introduction to the concept of the tidyverse and then describe three packages you can immediately start using to make your workflow easier. The first package is forcats, designed to make working with categorical variables easier; the second is glue, for programmatically combining data and strings; and the third is tibble, an alternative to data.frames. We will cover their basic functions so that, by the end of the talk, you will be able to use them and learn more about the broader tidyverse.
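
A minimal sketch of all three packages in action, on an invented toy data set:

```r
# A minimal sketch of forcats, glue, and tibble; the data is made up.
library(forcats)
library(glue)
library(tibble)

ratings <- tibble(
  flavor = factor(c("vanilla", "chocolate", "strawberry", "chocolate")),
  score  = c(3, 5, 4, 4)
)

# forcats: reorder factor levels by another variable (handy for plotting)
ratings$flavor <- fct_reorder(ratings$flavor, ratings$score)

# glue: combine data and strings programmatically
glue("{nrow(ratings)} ratings across {nlevels(ratings$flavor)} flavors")

# tibble: a data.frame alternative with friendlier printing
ratings
```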

Text Mining Using Tidy Data Principles

Julia Silge

Text data is increasingly important in many domains, and tidy data principles and tidy tools can make text mining easier and more effective. I will demonstrate how we can manipulate, summarize, and visualize the characteristics of text using these methods and R packages from the tidy tool ecosystem. These tools are highly effective for many analytical questions and allow analysts to integrate natural language processing into effective workflows already in wide use. We will explore how to implement approaches such as sentiment analysis of texts, measuring tf-idf, and measuring word vectors.
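 
For a flavor of what that looks like, here is a minimal tf-idf sketch with the tidytext package; the example documents are invented:

```r
# A minimal sketch of tidy text mining: tokenize, count, then compute tf-idf.
library(dplyr)
library(tidytext)

docs <- tibble::tibble(
  document = c("a", "a", "b"),
  text = c("tidy data principles", "text mining is fun", "tidy tools for text")
)

docs %>%
  unnest_tokens(word, text) %>%           # one row per word
  count(document, word, sort = TRUE) %>%  # word counts per document
  bind_tf_idf(word, document, n)          # add tf, idf, and tf-idf columns
```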

Speeding up R with Parallel Programming in the Cloud

David Smith

There are many common workloads in R that are “embarrassingly parallel”: group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk, I’ll describe several techniques available in R to speed up workloads like these by running multiple iterations simultaneously, in parallel. Many of these techniques require a cluster of machines running R, and I’ll provide examples of using cloud-based services to provision clusters for parallel computations. In particular, I will describe how you can use the sparklyr package to distribute data manipulations using dplyr syntax on a cluster of servers provisioned in the Azure cloud.
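
As a small illustration of the “embarrassingly parallel” case, here is a minimal sketch using the base parallel package on a local machine, a stand-in for the cluster setups described in the talk; the simulation itself is invented:

```r
# A minimal sketch of running independent simulations in parallel with the
# base `parallel` package.
library(parallel)

cl <- makeCluster(4)  # a local "cluster" of 4 worker processes

# Each iteration is independent, so they can run simultaneously
results <- parLapply(cl, 1:1000, function(i) {
  mean(rnorm(1e4))
})

stopCluster(cl)
hist(unlist(results))
```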

Making Magic with Keras and Shiny

Nicholas Strayer

The web-application framework Shiny has opened up enormous opportunities for data scientists by giving them a way to bring their models and visualizations to the public in interactive applications with only R code. Likewise, the keras package has simplified the process of getting up and running with deep neural networks by abstracting away much of the boilerplate and bookkeeping associated with writing models in a lower-level library such as TensorFlow. In this presentation, I will demo and discuss the development of a Shiny app that allows users to cast ‘spells’ simply by waving their phone around like a wand. The app gathers the motion of the device using the library shinysense and feeds it into a convolutional neural network, which predicts spell casts with high accuracy. A supplementary Shiny app for gathering data will also be shown. These applications demonstrate that Shiny can be used at both the data-gathering and model-presentation steps of data science.
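
To make the keras side concrete, here is a minimal sketch of a 1-D convolutional network for motion sequences in the spirit of the app; the input shape, layer choices, and five spell classes are assumptions, not the talk’s actual architecture:

```r
# A minimal sketch of a 1-D CNN for sensor sequences with the keras R package;
# all shapes and hyperparameters are hypothetical.
library(keras)

model <- keras_model_sequential() %>%
  # input: 100 time steps x 3 accelerometer axes (assumed)
  layer_conv_1d(filters = 16, kernel_size = 5, activation = "relu",
                input_shape = c(100, 3)) %>%
  layer_max_pooling_1d(pool_size = 2) %>%
  layer_flatten() %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 5, activation = "softmax")  # e.g. 5 spell classes

model %>% compile(
  optimizer = "adam",
  loss = "categorical_crossentropy",
  metrics = "accuracy"
)
```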