Data Science Depot

Len Greski's articles and associated reference content for data science

Data Science Depot is a collection of data science articles I’ve written as part of my participation in the Data Science Community on Coursera.org. I also post references to articles of interest, or content that I’ve used as supporting material for answers I’ve provided to students in my role as a Community Mentor in the Johns Hopkins University Data Science Specialization.

Data Science Careers

Articles in this section cover responses to community members’ questions about how to start or grow a career in Data Science.

  1. Getting Started in Data Science
  2. Interviewing Techniques: Answering Questions about an Unfamiliar Domain
  3. Interviewing Tips: Red Flags or Opportunities? A response to Robinson and Nolis’ July 2018 article, Red Flags in Data Science Interviews

Guidelines for Science

Content in this section relates to the Armstrong et al. (2016) paper and presentation, which help people distinguish true “science” from political / policy advocacy.

  1. 2016 Presentation: Improving Management Science – Problems and Solutions PDF version of presentation given at 2016 Marketing conference in Hong Kong.
  2. 2016 Paper: Improving Management Science – Problems and Solutions Full length version of the paper discussed in the above presentation. PDF version of the paper is also available on researchgate.net. Stored here because these references have a tendency to disappear from the internet over time.
  3. Guidelines for Science: Checklist Checklist that can be used to evaluate whether a research paper is truly science, or advocacy disguised as science.

Future of Data Analysis / John Tukey Retrospective

  1. The Future of Data Analysis by John Tukey, 1962 paper where he challenges statisticians to move away from ever more complicated mathematics to tackle data analysis problems in more realistic ways.
  2. 50 Years of Data Science by David Donoho, a 2015 retrospective on John Tukey’s 1962 paper.

Statistical Methods

  1. Statistical Methods in Psychology Journals Classic 1999 article by Leland Wilkinson and the Task Force on Statistical Inference for the American Psychological Association.
  2. Ordinal Independent Variables in Linear Regression Article by Richard Williams at the University of Notre Dame, making the argument to treat ordinal variables as continuous unless the analysis of linear effects indicates they should be treated as nominal level of measurement.
  3. Interpreting the Magnitudes of Correlation Coefficients Article by James Hemphill, validating Cohen’s (1988) recommendations via a meta-analysis of 380 psychology studies.

Reproducible Research

  1. When the Revolution Came for Amy Cuddy 2017 New York Times article about the difficulties scientists experienced when trying to replicate the research of Amy Cuddy, Ph.D., a prominent TED speaker whose 2010 study on power poses turned her into an internet celebrity. Since then, questions about the reproducibility of her research have sparked a firestorm about statistical practices in social psychology research.

Machine Learning

Johns Hopkins Data Science Specialization

Articles in this section are related to the 10-course Data Science Specialization offered by Johns Hopkins University via Coursera. An analysis of the specialization, The democratization of data science education, was published during the summer of 2017.

Course 1: Data Scientist’s Toolbox

  1. Course Prerequisites and Difficulty Levels Provides an overview of the Data Science Specialization courses, explaining from a practical perspective the courses a student needs as prerequisites to other courses. While students may take more than one class at a time, it’s important to know how information from earlier courses is used in subsequent ones.

    The article also ranks the difficulty levels from most to least difficult, based on the author’s experience in the curriculum as well as Discussion Forum feedback contributed by other students.
  2. Configuring RStudio to work with git / github - Mac OSX
  3. Configuring RStudio to work with git / github - Windows 7, 8, and 10
  4. Using Editor Modes in Discussion Forum Posts
  5. Buying a Computer for Data Science
  6. R and RStudio on Chromebook
  7. Installing R and RStudio on Chromebook Walkthrough demonstrating how to install R and RStudio on a Chromebook with Crouton and Ubuntu Linux.

Issue: Students Struggle to find URLs in Lecture Slides

If you’re interested in the URLs for the lecture slides, they are available in the Data Science Specialization Courses github repository. Each course is stored in a subdirectory within the repository, and the slides are written in R Markdown, a technique you’ll learn in Developing Data Products.

Miscellaneous Articles about Data Science

  1. The Future of Data Analysis by John Tukey, 1962 paper where he challenges statisticians to move away from ever more complicated mathematics to tackle data analysis problems in more realistic ways.
  2. 50 Years of Data Science by David Donoho, a 2015 retrospective on John Tukey’s 1962 paper.

Course 2: R Programming

START HERE

If you’re new to the course and trying to figure out what to do in what order, start with these articles.

  1. Resources for R Programming Provides a summary of student-generated content to support the course, some of which is indexed on the Data Science Specialization’s github.io site.
  2. References for R Programming Provides a list of references for R programming, ranging from beginning to advanced topics.
  3. Data Science Specialization: what is the value? Addresses a common question raised by students in R Programming who are frustrated by the amount of work they have to do on their own to complete quizzes and assignments.
  4. Swirl: common problems & getting help Discusses a couple of frequent problems students have getting swirl to work on their computers, and provides URLs to support from the creators of swirl.
  5. R versus Python Roundup of articles and surveys comparing R and Python, including usage, history, and pros / cons.

The next set of articles includes general commentary about the course, R programming in general, and R in relationship to other statistics packages.

  1. Commercial Statistics Packages: An Historical Perspective
  2. Configuring RStudio to work with git / github - Mac OSX
  3. A Data Frame is Also a List
  4. Forms of the Assignment Operator
  5. Forms of the Extract Operator
  6. S Objects, R Objects, and Lexical Scoping
  7. Thinking in R versus Thinking in SAS
  8. Strategy for the Programming Assignments
  9. Why is R More Difficult than SAS?
  10. R Onboarding for SAS Users
  11. References for R Programming Provides a list of references for R programming, ranging from beginning to advanced topics.
  12. Object Oriented Programming and R Explains how object oriented programming concepts are implemented in R, in response to a student question about accessing content output by the R linear models function, lm().
  13. Scoping in C/C++ vs. R Compares variable scoping in R versus C/C++.

Posts regarding specifics of programming assignments

  1. Assignment 1: Breaking Down Pollutantmean
  2. Assignment 1: A SAS Version of Pollutantmean
  3. Assignment 1: Common Mistakes - Weighted vs. Unweighted Means
  4. Assignment 1: Common Mistakes - complete(“specdata”,332:1) fails
  5. Assignment 1: A More Elegant Solution
  6. Assignment 2: Demystifying makeVector
  7. Assignment 2: makeCacheMatrix as an Object
  8. Assignment 2: Using Github Desktop
  9. Assignment 2: Grading the SHA-1 Hash Code
  10. Assignment 3: Functions to Sort Data Frames

Miscellaneous Code Examples and Instructions

  1. Permanently Setting R Working Directory Link to R-bloggers.com article that explains how to set your working directory permanently in R (instead of RStudio).
  2. Tutorial: Downloading Files Illustrates various ways of downloading files, including binary and text files.
  3. Creative Use of R: Downloading Course Lectures Article illustrating how to use R to automate the download of lectures from Data Science Specialization courses, such as R Programming. Techniques used in this article are helpful to make research reproducible, as required for courses like Getting and Cleaning Data and Reproducible Research.
  4. How to Upgrade R without Losing Your Packages Article by Kris Eberwein on datascienceriot.com that includes code to save your list of packages to an rds file and reinstall any packages that don’t make it through the upgrade process.
  5. Common R Mistakes: Overwriting R Functions with Output Variables
  6. R Programming Cheat Sheet Based on content from R for Everyone by Jared Lander.
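
The save-and-reinstall idea described in the How to Upgrade R without Losing Your Packages entry above can be sketched roughly as follows. This is not the code from that article: the function names save_package_list() and missing_packages() and the file name installed_packages.rds are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: save the names of the currently installed packages
# to an rds file before upgrading R.
save_package_list <- function(path) {
  pkgs <- unique(rownames(installed.packages()))
  saveRDS(pkgs, path)
  invisible(pkgs)
}

# Hypothetical helper: after the upgrade, list the saved packages that are
# no longer installed. The `installed` argument exists so the comparison
# can be checked without actually upgrading R.
missing_packages <- function(path,
                             installed = rownames(installed.packages())) {
  setdiff(readRDS(path), installed)
}

# Typical use (the install step needs a network connection, so it is
# left commented out here):
# save_package_list("installed_packages.rds")
# ... upgrade R ...
# install.packages(missing_packages("installed_packages.rds"))
```

Keeping the comparison separate from the install step makes it possible to review the list of missing packages before reinstalling anything.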

Interesting R News and Blog Articles

  1. R vs. Python: 2016 Survey of Software used for Data Science Overview of results from a 2016 KDNuggets Software Poll, written by Gregory Piatetsky. The follow-up article with expanded analysis is What Big Data, Data Science, Deep Learning software goes together, also on kdnuggets.com.
  2. R and Python vs. SAS and SPSS Jeroen Kromme’s take on strengths and weaknesses of these languages, posted on r-bloggers.com.
  3. Scaling R for Data Science August 2016 article by Federico Castanedo explaining three ways to scale R.
  4. Lexical Scoping and Statistical Computing Article by Robert Gentleman and Ross Ihaka at the University of Auckland describing how lexical scoping works, and why it is valuable in statistical computing.
  5. Data Science Job Report 2017: R Passes SAS, But Python Leaves Them Both Behind Bob Muenchen’s take on the job market for various data science languages.

Course 3: Getting and Cleaning Data

  1. Week 1: Demystifying HTML Parsing: Baltimore Ravens Game Scores
  2. Real World Example: Importance of Getting and Cleaning Data Illustrates what happens when we draw inferences from data without understanding the errors and/or limits of data collection and cleaning. Taken from a chart posted on Twitter during August 2017 by a company that sells data visualization software.
  3. Real World Example: Reading American Community Survey data Illustrates concepts covered in Getting and Cleaning Data with U.S. Census data, including how to process a hierarchical file format in R, as well as using an electronic codebook to generate the parameters required to read the data file into a data frame.
  4. Common Problems: Quiz 1 - Missing Java Runtime Explains how to solve the problem of a missing Java Runtime for the question that requires students to process a Microsoft Excel spreadsheet.
  5. Strategy for Reading Files & APIs / Quiz 2
  6. Common Problems: Quiz 2 - sqldf() driver fails to connect
  1. Tidy Data Hadley Wickham’s paper on Tidy Data, required reading for the course project.
  2. data.table Github Wiki Repository for data.table package, including video.
  3. Tutorial: Downloading Files Illustrates various ways of downloading files, including binary and text files.

Course 4: Exploratory Data Analysis

  1. Assignment 1: Reading a Subset of Raw Data
  2. CONCEPTS: Strategies for Imputing Missing Values

Course 5: Reproducible Research

  1. Assignment 2 Checklist
  2. Configuring knitr to Retain Markdown Output Explains how to configure knitr so that the markdown file and any associated graphics are retained after building an HTML document, so they can be uploaded to Github and viewed there.
  3. Assignment 2: Improving Runtime Performance of Initial Data Load

Course 6: Statistical Inference

  1. Reference Materials for Statistical Inference Start here if you’re looking for help on the statistical techniques taught in this course.
  2. Using MathJax with Discussion Forums, R Markdown, and Github Pages
  3. CONCEPTS: Calculating Area for a Point on the Normal Curve Explains why one cannot calculate the exact probability for a specific value within a distribution for a continuous variable.
  4. CONCEPTS: Variance of a Binomial Distribution Explains why the calculation for variance of a binomial distribution during the Variance lecture looks different than the way it is described in Wikipedia.
  5. Poisson Confidence Interval Explained Explains the formulas on slide 26 of the Asymptopia lecture, and illustrates the differences between the two calculations on slide 27.
  6. Power Calculations: Optimal Sample size
  7. Permutation Tests Explained
  1. Exponential Distribution / Central Limit Theorem - Assignment Checklist
  2. ToothGrowth Analysis - Assignment Checklist
  3. Exploratory Data Analysis in ToothGrowth Assignment Explains the exploratory data analysis requirement for students who have not taken the Exploratory Data Analysis course prior to taking Statistical Inference.
  4. Accessing R Code from an Appendix in Knitr
  5. Theoretical Variance of Sampling Distribution of the Mean
  6. Kable Tables with Data Frames Illustrates how to display a custom table in a knitr document by creating a data frame to contain the information to be rendered with kable().
  7. Installing MiKTeX on Windows 10 / Generating a PDF with knitr
  8. Commentary on Factorial Design in Toothgrowth Analysis Illustrates how to conduct a full factorial analysis of variance with the ToothGrowth data, comparing it to the techniques used in the course project for Statistical Inference.
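
As a small illustration of the idea behind the Calculating Area for a Point on the Normal Curve entry above, the sketch below uses base R to show that the probability of an interval around a value shrinks as the interval narrows, and is zero for a single point. The choice of the standard normal distribution, the value 1, and the interval widths are assumptions made for this example, not taken from the article.

```r
# Probability that a standard normal variable falls in an interval
# centered on 1, for progressively narrower intervals.
widths <- c(1, 0.1, 0.01, 0.001)
probs <- pnorm(1 + widths / 2) - pnorm(1 - widths / 2)

# A zero-width interval is a single point: the area under the curve is 0.
point_prob <- pnorm(1) - pnorm(1)

# The density at 1 is positive, but it is a curve height, not a probability.
density_at_1 <- dnorm(1)

print(data.frame(width = widths, probability = probs))
print(point_prob)
print(density_at_1)
```

The distinction to notice is that dnorm() returns the height of the density curve, while probabilities for a continuous variable come from differences in pnorm(), that is, from areas over an interval.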

Course 7: Regression Models

  1. Why does sum of errors * X equal 0?
  2. Using MathJax with Discussion Forums, R Markdown, and Github Pages

Course 8: Practical Machine Learning

  1. Week 3: Installing Rgtk2 and Rattle on OS X
  2. Week 4: Combining Predictors Math Explained
  3. Course Project - gh-pages Setup with RStudio
  4. Course Project - Improving Runtime Performance of Random Forest Models with caret::train()
  5. Course Project - Predicting Test Scores based on Training Model Accuracy

Course 9: Developing Data Products

  1. Configuring shinyapps.io Application Timeout A walkthrough on how to configure a Shiny application so it doesn’t waste the free monthly server processing time.

Course 10: Capstone

  1. Speech and Language Processing, 3rd Edition Working version of the Jurafsky et al. book on natural language processing, whose content on n-grams is helpful for the capstone.
  2. n-gram Computations and Computer Capacity Explains the amount of memory required to convert the text files for the course project into n-grams, using the quanteda package.
  3. Capstone Strategy Describes a general strategy to get through the Capstone: use the simplest approaches possible.
  4. Choosing a Text Analysis Package Reviews pros and cons of various R packages used for natural language processing, in the context of requirements for the Capstone project.

Content for Community Mentors

  1. Tips for New Community Mentors A list of tips for new mentors supporting the Data Science Specialization, ranging from when to direct students to paid / professional resources such as the Coursera Learner Help Center, to how to optimize the value of content that is posted by mentors.

© 2017 - 2018 Leonard M. Greski - copying with attribution permitted