Data Science Depot

Len Greski's articles and associated reference content for data science

Data Science Depot is a collection of data science articles I’ve written as part of my participation in the Data Science Community on Coursera.org, interactions on Twitter @lgreski, and content I’ve posted on StackOverflow. I also post references to articles of interest, or content that I’ve used as supporting material for answers I’ve provided to students in my role as a Community Mentor in the Johns Hopkins University Data Science Specialization.

More information about Len is available on the About page.

Recent Articles

Date		Article
16 August 2020		R Objects, S Objects, and Lexical Scoping
13 June 2020		Reading Excel Files
30 May 2020		Accessing Data from Elements of Statistical Learning
28 April 2020		Estimating Runtime for an R script

Data Science Careers

Articles in this section cover responses to community members’ questions about how to start or grow a career in Data Science.

Getting Started in Data Science
Interviewing Techniques: Answering Questions about an Unfamiliar Domain
Interviewing Tips: Red Flags or Opportunities? A response to Robinson and Nolis’ July 2018 article, Red Flags in Data Science Interviews

Guidelines for Science

Content in this section relates to the Armstrong et al. (2016) paper and presentation, which help people distinguish true “science” from political or policy advocacy.

2016 Presentation: Improving Management Science – Problems and Solutions PDF version of presentation given at 2016 Marketing conference in Hong Kong.
2016 Paper: Improving Management Science – Problems and Solutions Full length version of the paper discussed in the above presentation. PDF version of the paper is also available on researchgate.net. Stored here because these references have a tendency to disappear from the internet over time.
Guidelines for Science: Checklist Checklist that can be used to evaluate whether a research paper is truly science, or advocacy disguised as science.

Future of Data Analysis / John Tukey Retrospective

The Future of Data Analysis by John Tukey, a 1962 paper in which he challenges statisticians to move away from ever more complicated mathematics and tackle data analysis problems in more realistic ways.
50 Years of Data Science by David Donoho, a 2015 retrospective on John Tukey’s 1962 paper.

Statistical Methods

Statistical Methods in Psychology Journals Classic 1999 article by Leland Wilkinson and the Task Force on Statistical Inference for the American Psychological Association.
Ordinal Independent Variables in Linear Regression Article by Richard Williams at the University of Notre Dame, arguing that ordinal variables should be treated as continuous unless the analysis of linear effects indicates they should be treated as nominal.
Interpreting the Magnitudes of Correlation Coefficients Article by James Hemphill, validating Cohen’s (1988) recommendations via a meta-analysis of 380 psychology studies.
Statistical Paradises and Paradoxes Article by Xiao-Li Meng that argues big data is not a good substitute for random sampling, and paradoxically is more likely to produce biased results than a truly random sample.

Reproducible Research

This set of articles is relevant for what is currently called the replication crisis in social science research, where researchers are frequently unable to replicate or reproduce significant findings from prior studies that were published in peer-reviewed journals.

When the Revolution Came for Amy Cuddy 2017 New York Times article about difficulties scientists experienced when trying to replicate the research of Amy Cuddy, Ph.D., a prominent TED speaker whose 2010 study on power poses turned her into an internet celebrity. Since then, questions about the reproducibility of her research sparked a firestorm about statistical practices in social psychology research.
Openness and Reproducibility: Insights from a Model-Centric Approach Baumgaertner et al. 2019 pre-print article about applying model-centric approaches (i.e., probability theory and statistics) to the practice of reproducible research.
The case for formal methodology in scientific reform Quis custodiet ipsos custodes? …or “who guards the guardians?” In this 2020 pre-print article, Berna Devezer et al. turn the tables on reproducible research advocates, challenging them to stop making the same mistakes and over-generalizations they purport to address.
The New Statistics: Why and How Article by Geoff Cumming at La Trobe University that argues for substantial changes in how we conduct research. He advocates wide adoption of effect size estimation, confidence intervals, and meta-analysis in order to improve research integrity.

Machine Learning

Google Machine Learning Crash Course: Developed by the engineering education team at Google, the Machine Learning Crash Course introduces students to machine learning with the TensorFlow toolkit. The course is based on Python, so some background in Python programming, along with high school algebra, is required.

Tidy Data

Tidy Data Vignette: Per its opening paragraph, the vignette is a code-heavy and informal version of Hadley Wickham’s seminal paper, Tidy Data.
Tidy Data: The version of Wickham’s 2013 paper that was published in the Journal of Statistical Software.

Johns Hopkins Data Science Specialization

These articles are related to the 10-course Data Science Specialization offered by Johns Hopkins University via Coursera. An analysis of the specialization, The democratization of data science education, was published during the summer of 2017.

The content covers all ten courses in the Specialization, from The Data Scientist’s Toolbox to the Capstone course, and is indexed on my JHU DSS Community Mentor Github repository.