Getting Started in Data Science

Students studying data science ask lots of questions about how to break into the field from another line of work.

My primary advice to someone starting out in data science is actually general job-seeking advice. Since 75% of jobs are found through people’s contact networks (reference: 2015 personal conversation with a consultant at the Lee Hecht Harrison career coaching firm), a beginning data scientist needs to develop a contact network that includes data scientists.

The toughest part of any job search is getting past the keyword-based resume filters employed by many HR departments. Building your network is the best way to get past the automated resume filters.

Second, consider jobs that are related to data science, but called by other names: market research, analytics, web analytics, etc.

Third, don’t be afraid to take a job that is “entry level,” even if it requires that you take a salary cut relative to your last job. Why? After a year or two of real world experience, you’ll be a lot more valuable.

Question: What are the key skill areas I must develop to be a viable candidate for data science roles?

A well-rounded data scientist in a corporate environment will have enough breadth of skill to contribute to a team in at least three of the following eleven areas, including a credible story in the first three skill areas.

Area	Description
1. Collection & Analysis	The ability to acquire data from a variety of sources, manipulate it to remove or mitigate impurities (e.g. missing data), and transform it into a format suitable for model building.
2. Modeling	The ability to develop hypotheses, select algorithms based on the characteristics of the data available, and build highly accurate predictive models.
3. Interpretation	Explain the results from modeling in practical terms, highlighting the substantive significance versus statistical significance. Explain how the techniques used along with their underlying assumptions affect one’s ability to generalize modeling conclusions beyond the data used to build a model.
4. Research Methods	Define the unit(s) of analysis for the area to be studied. Determine whether an experimental design is needed to establish cause & effect relationships or test a hypothesis. Decide on other characteristics of the analysis, such as cross-sectional vs. longitudinal analysis. Define null and alternate hypotheses to be tested.
5. Applications	Embed predictive models into systems that are used by customers / end users on an operational basis (e.g. recommending cross-sell up-sell products on an e-commerce website), including the ability to generate predictions at high volume in less than 500 milliseconds.
6. Operations	Manage, update, and support models in production software applications at an acceptable cost structure with no downtime.
7. DevOps	Manage versions of code, algorithms, externally sourced components and test cases. Automate the build and deployment of models and supporting components.
8. Solution Architecture for Data Science	Assign responsibilities to components in a logical software architecture in a way that enables high performance, manageable cost, fault tolerance, security, and ease of scaling with large volumes of data.
9. Software Selection and Supplier Management	Evaluate purchased components ranging from cloud-based infrastructure to machine learning capabilities (e.g. h2o.ai) based on objective evaluation criteria. Define and negotiate contracts with suppliers of purchased components so the costs of applications are manageable as end user usage and data volumes grow.
10. Business Value Management	Define a market opportunity for a data science powered application, including one-time costs, ongoing costs, and benefits over a 3-5 year period. Manage the implementation of the data science powered application to a production deployment, manage its operation, and track benefits to ensure they meet or exceed originally estimated values. Add or modify deployed capabilities to increase generation of benefits relative to costs over the lifespan of the application.
11. Work Management	The ability to take an ambiguous problem and break it down into small work items with clearly defined acceptance criteria, and then move the smaller work items through a series of steps to complete the work and verify that the completed work meets stated acceptance criteria.

Many of the data science curricula in universities are focused on the first three areas described above:

Data collection and analysis,
Model building, and
Interpretation of results.

The Johns Hopkins University Data Science Specialization offered via Coursera covers Applications in addition to the first three areas.

Generally speaking, the last six categories aren’t taught in universities because many of the PhDs teaching data science don’t have sufficient industry experience to teach in these areas, especially Work Management and Business Value Management.

However, experienced technology professionals have many of these skills, and these are the things one can leverage in an interview to gain access to data science jobs when one is at an entry level in the first three skill areas.

Question: Referring to the prior question, what is a “credible” level of skill?

For an entry level data scientist role, “credible” means being able to provide relevant answers to questions that are appropriate for people who have completed a data science curriculum or bootcamp.

For example, students who have completed the Johns Hopkins Data Science Specialization should be able to provide concrete but entry level answers to the following types of questions.

How would you combine one hundred data files and calculate descriptive statistics on the numeric data across the files?
What criteria does one use to select one machine learning algorithm over others?
When does the gradient boosting algorithm deliver a lower error rate versus a random forest?
What does the assumption of homoskedasticity mean? How does one determine whether this assumption is valid for an ordinary least squares model?
How does one interpret the R squared in a regression model?
Why would one conduct an analysis of variance prior to individual comparisons of means?
Questions about one or more programming languages or statistical packages such as R, Python, SAS, Stata, SQL, etc.
Share a situation where you had to clean messy data. What steps did you take to find the problems, and how did you eliminate or mitigate them?
Describe two or more strategies for handling missing values in a data analysis, along with the strengths and weaknesses of each.
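
To make “concrete but entry level” more tangible, here is one possible answer to the first question in the list, sketched in Python. It is only an illustration under some assumptions that are mine, not part of the question: the files are CSVs with a header row, they sit in a single folder, and any cell that parses as a number counts as numeric data. The function name combine_and_describe is hypothetical, and only the Python standard library is used.

```python
import csv
import statistics
from collections import defaultdict
from pathlib import Path


def combine_and_describe(folder, pattern="*.csv"):
    """Pool numeric cells from every matching CSV file and summarize each column."""
    values = defaultdict(list)  # column name -> numeric values pooled across files
    for path in sorted(Path(folder).glob(pattern)):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                for column, raw in row.items():
                    try:
                        number = float(raw)
                    except (TypeError, ValueError):
                        continue  # assumption: blanks and text are simply skipped
                    values[column].append(number)

    summary = {}
    for column, numbers in values.items():
        summary[column] = {
            "n": len(numbers),
            "mean": statistics.mean(numbers),
            "median": statistics.median(numbers),
            # the sample standard deviation needs at least two observations
            "stdev": statistics.stdev(numbers) if len(numbers) > 1 else 0.0,
            "min": min(numbers),
            "max": max(numbers),
        }
    return summary
```

Calling combine_and_describe("data") on a folder of one hundred files (the folder name is a placeholder) returns one dictionary of statistics per numeric column. A good interview answer would also mention what this sketch glosses over: files whose columns differ, values such as "nan" that parse as numbers, and whether silently skipping blanks is an acceptable way to handle missing data.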

“Credible” also means knowing one’s limitations, and relating experiences where one has quickly learned new things.

Question: What are data science career options for an experienced IT professional?

The biggest question is whether one is willing to take an entry level data science job at an entry level salary. Depending on one’s current salary and financial flexibility, it may be worthwhile to take an entry level data science job instead of a technical project or engineering manager role.

Two to three years of significant work as a data scientist will be more valuable to a person’s long-term career prospects than a job that is related to data science but does not let that person develop a portfolio of completed data science projects.

The most important thing a person can do to enhance their career prospects is to develop relationships in the field where they want to work. In the U.S., as noted above, 75% of jobs are found by networking, so the market places a premium on developing relationships before you make a career move.

Data Science Depot

Len Greski's articles and associated reference content for data science
