Interviewing for new data scientists
People who are new to data science often struggle to land their first job in the field. One reason is that interviewers often ask them about problem domains in which they have no previous experience.
An effective strategy for responding to this type of question is to highlight your knowledge of the main activities in a data science project, and explain how you would apply these steps to quickly learn about the domain.
Data Science Activities
Data science bootcamps and specializations teach one form or another of a basic data analysis process that includes the following activities.
- Collect & clean the data
- Conduct exploratory data analysis to understand the basic features of the data
- Identify hypotheses to test and/or variables to predict
- Develop inferential and/or predictive models
- Make inferences / draw conclusions
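As a rough illustration, the activities above can be walked through on a toy data set. This is a minimal sketch using only the Python standard library. The records, the column meanings (ad spend and sales), and the function names are all hypothetical and chosen only for illustration. A real project would typically use a data analysis library and a much richer model.

```python
import statistics

# Hypothetical raw records: (ad_spend, sales). None marks a missing value.
RAW_RECORDS = [
    (1.0, 2.1), (2.0, 3.9), (None, 5.0),
    (3.0, 6.2), (4.0, 7.8), (5.0, None),
]


def clean(records):
    """Collect & clean: keep only rows with no missing values."""
    return [(x, y) for x, y in records if x is not None and y is not None]


def explore(rows):
    """Exploratory analysis: summarize the basic features of each column."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    return {
        "n": len(rows),
        "mean_x": statistics.mean(xs),
        "mean_y": statistics.mean(ys),
        "stdev_x": statistics.stdev(xs),
        "stdev_y": statistics.stdev(ys),
    }


def fit_line(rows):
    """Predictive model: least-squares fit of y = intercept + slope * x."""
    mean_x = statistics.mean(x for x, _ in rows)
    mean_y = statistics.mean(y for _, y in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    rows = clean(RAW_RECORDS)            # collect & clean
    print(explore(rows))                 # exploratory data analysis
    slope, intercept = fit_line(rows)    # variable to predict: sales
    # Draw a conclusion: estimated change in sales per unit of ad spend.
    print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

In an interview, being able to name which function corresponds to which activity matters more than the specific model used here.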
If the project requires embedding the models in some type of production process (e.g. deciding which products to upsell on an e-commerce website), the following steps are also required.
- Embed the models into production software applications
- Test the runtime performance at scale, to ensure prediction algorithms run at “internet speed”
- Build the underlying processes to store and track versions of models, retrain the models, and embed updated models into the production software
- Build the underlying processes to track the versions of purchased and/or open source software so you can reproduce the predictions made at any point in time (e.g. versions of R, and all packages like caret, randomForest, etc.)
- Develop the operational processes to monitor and maintain the applications that have model-based predictive algorithms when they’re being used by customers / end users
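The software-tracking step can be sketched as a small snapshot of the runtime environment stored alongside each model version. The text's example is R with packages such as caret and randomForest. The sketch below uses Python instead, as an assumption, with the standard library's `importlib.metadata`. The function name `snapshot_environment` and the package names passed to it are hypothetical.

```python
import json
import platform
from importlib import metadata


def snapshot_environment(package_names):
    """Record the interpreter version and the installed version of each package.

    Packages that are not installed are recorded as None rather than raising.
    """
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {"python": platform.python_version(), "packages": versions}


if __name__ == "__main__":
    # Hypothetical package list; a real project would list its actual dependencies.
    snapshot = snapshot_environment(["numpy", "scikit-learn"])
    # Saving this JSON next to the trained model makes it possible to rebuild
    # the same environment later.
    print(json.dumps(snapshot, indent=2))
```

A snapshot like this covers only package versions. Reproducing a prediction exactly would also require the model artifact and the input data from the same point in time.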
Using the Data Science Activities to Give Credible Answers
Knowing the process allows a candidate to ask the interviewer questions about the problem domain, which can then be connected back to the candidate’s experiences in a bootcamp or specialization.
The flow of discussion should look like the following:
- questions about data sources and common data problems in the domain
- answers that highlight how you’ve overcome data collection and cleaning problems
- questions about the basic features of the data
- answers that highlight how you’ve assessed basic features of a data set
- and so on through the remaining activities
A candidate could also start an answer by acknowledging a lack of experience in the domain, then summarizing the main activities in a data science project and explaining how these activities would apply to the problem domain.
Conclusion
A data scientist needs to have the confidence that s/he can use the process to competently analyze any problem domain. By knowing the data science process and being able to talk about your specific experiences with it, you can relate your problem-solving skills to any domain. The essence of data science is the same in every domain: lots of messy data, scores of algorithms, and constrained computer resources.