Data Science Standards
What are the Data Science Standards?
The Data Science Standards are a proven process at both the university and the bootcamp level for students to create production grade machine learning for their portfolio, to excel in the job interview. This process has been stress-tested with over 5,000 students and offers you the following:
- A Framework that leads to confidence with client success and career interviews
- A portfolio to share as Proof of Concepts to clients and for career opportunities
- A Standard for mental model and business framework to solving production grade machine learning
- An organized, and centralized repository for state of the art resources for production grade data science
- Available for any technology stack
Foundational Literature:
- Part 1: Project Proposal Criteria - Prepare an Abstract as both a Document and a PowerPoint (Start with 3 to 6 project ideas)
- Part 2: Perform Exploratory Data Analysis, Visualizations, and Feature Engineering
- Part 3: Perform Machine Learning, Performance Metrics, and Deployment for your project
- Part 4: Present your project as a Presentation to your business stakeholders
- Part 5: Submit your project for your Advisors and business stakeholders
Please prepare your project proposal as a sharable document, and a PowerPoint presentation
- Project Title
- What is your Project Theme (I.e., Industry Vertical/Machine Learning Topic)?
- What is your Abstract? Write a 1-paragraph Executive Summary of your Solution.
- Problem Statement & Business Case
- What is the technical problem you are solving?
- What is the applied business case for this problem?
- Business perspective (I.e., Likelihood, sentiment, demand, price, market strategy, groups, automation)
- Data Science Workflow
- What Null/Alternative Hypothesis are you testing against?
- Does X Predict Y? (I.e., Distinct groups, key components, outliers)
- What is the response column/predictor that is important for you to measure?
- What assumptions are important for you to assess and to benchmark?
- What solutions would you like to deliver against?
- How will you measure your benchmarks and their performance drift over time (I.e., Automate jobs/predictive monitoring)?
- What alternative questions would you like to explore and provide solutions?
- What analytics and insights would you like to discover from your data? - What types of graphics or machine learnings would you like to discover?
- What is the business case for your project?
- How will your solution help generate revenue, reduce costs, or impact another Key Performance Indicator or Objective Key Result?
- Who will be impacted (Executive Stakeholders/Sponsors) by your solution? Who is your ideal client/customer?
- Data Collection
- What raw datasets/APIs will you extract for machine learning?
- What are the data schemas for your current datasets (I.e., SQL, CSVs, Parquet, Avro, Snowflake/Star)
- What are the dimensions and sizing (I.e., MB/GB/TB/PB) of your current datasets?
- Is the data from open-source, paid crowdsourcing, internal?
- What is the structures, file types, and quality of the data?
- How will you collect, store, and process the data (I.e., locally, databases, cloud)?
- Of your known data, what is the current data dictionaries that exist, or that you can further describe? (You can create these data dictionaries in a spreadsheet, markdown table, or listed)
- Data Processing, Preparation, & Feature Engineering
- What techniques will you use to improve your data quality?
- How will you handle missing data and outliers?
- What calculations/formulas would you like to create, that may not yet exist?
- Machine Learning: Model Selection
- Which model architecture(s) will you use to solve your problem?
- How will you validate the model performance?
- Model Persistence: Deployment, Training, & Data Pipelines
- How would your results operate LIVE in a production environment? (I.e., Web App, Architecture flow, DAG diagram, end-to-end workflow)
- What technology stack, what integrations, and which Engineers would you cooperate?
- Where will you share your results internally or externally to stakeholders through Marketing, Implementation and Deployments?
- How will you validate your machine learnings with a timeline from development to production? How will you generate more data to train?
The Exploratory Data Analysis is a significant progression from Defining a Data Science Problem to determine the specific characteristics needed to solve the problem. From Data Wrangling, Data Munging, Pre-processing, Pipelines, Data Visualization, and Data Analytics, all these areas are essential for effective Exploratory Data Analysis.
1. Compute and Storage Considerations: Projects that scale require more compute, faster computer, and more storage. In the market, many solutions from many providers exist. If you need Cloud Compute and Storage consider the following options:
- Paperspace - For under $10 per month, basic cloud compute and storage is available, with automation, Docker containers, and pre-installed Python packages in a Jupyter notebook.
- Google Colab - Cloud Notebooks with the potential to accelerate with GPUs and TPUs. Data can be accessed and stored from Google Drive.
- Microsoft Notebooks - Cloud Notebooks and data on Azure.
- Custom environments: Amazon Web Services with EMR, Microsoft Azure, Google Cloud Platform, and IBM Watson Data Studio.
- Note: Today there are dozens of other platforms that can help in the cloud, including Domino Data Lab, Anaconda Cloud, Crestle, Spell.ai, Comet.ml, among others.
2. Developer Environment:
- Pick a consistent Framework (Python or R) that can be used for your end-to-end project workflow.
- Consider a consistent environment for your project development (Jupyter, PyCharm, or Visual Studio Code which support code, Markdown Text, and LaTeX.
3. Data Collection:
- Import your Data in-memory from SQL Databases, APIs, or Files with Pandas IO and Camelot PDFs or BeautifulSoup for web scraping
4. Data Exploration:
- Examine your data, columns and rows and rename and adjust indexing and encoding as appropriate. This Pandas Cheatsheet could be resourcesful for you. Did you also know that Python has excellent built-in functions.
- Explore null, NaN, None, and missing data with python packages such as missingno and pandas-profiling. Repair this data either by dropping or imputing values (I.e., mean, median, ffill, bfill, knn calculation)
- Indexing: Change indices and datatypes as appropriate for your dataset. (I.e., string, category, integer, float, datetime, timedelta). The datetime module will assist for datetime objects.
- Reduce memory constraints: Consider changing datatypes from Int64/Float64 to Int32/16 if memory performance is important for your compute requirements.
- Forensically Repair Data: The regex package, or alternatively built-in functions such as .replace and .apply could be used to fix data issues I.e., ($,;|\n\t, etc.)
- Repair imbalanced datasets with upsampling or downsampling with imblearn or scikit-learn
- Join, Concatenate and merge datasets with Pandas, or SQL modules
- Generate statistics for columns, distributions, pivots, and aggregations with Numpy, Scipy, and Pandas modules.
- Generate custom calculations, including correlation analysis.
- Repair attribute columns with Pre-processing, Pipeline, and parameter tuning
- List hypothesis for response variable to predict, classify, cluster, or reinforce with Machine Learning.
5. Data Visualizations:
- Visualizations can include 100+ types of graphs, available in a sample of the following modules: Turtle, Matplotlib, Seaborn, Plotly/Dash, Bokeh, Altair, Plotnine, Vincent, Mlpd3, Folium, pygal), Sci-kit plot and Yellow Brick.
- Design considerations: Color Maps, Styles, and Palettes. Custom colors can be chosen from Adobe color, Lyft Colorbox, Geenes, and Color Data Styleguides.
- All graphs/plots must be labeled, formatted and reproducible. All graphs must be saved as PNG files in an Images folder, and saved as an overall PDF for project submission.
Scripts & Notebooks:
- Create Jupyter Notebook or Scripts where DataFrames and data files are called for machine learning pipeline
- Revisit your Working Hypothesis(es) to benchmark or backtest your response prediction/classification/cluster/reinforcement.
- Select machine learning modules for your data science (I.e., scikit-learn, statsmodels, pytorch, TensorFlow, Fast.AI, XGBoost, LightGBM, sktime, fbprophet, etc.)
Note: Module versions may require dependencies and may be unstable, and as such, you are recommended to develop and debug in isolated developer environments. (I.e., Conda, Docker, Kubernetes, Cloud instances)
- Perform feature selection/variable importance as a result of Exploratory Data Analysis, Data Visualizations, Dimensionality Reduction, and Model Tuning
- Select Algorithms for Machine Learning (I.e., Linear Regression(s), Logistic Regression, Trees, Proximity Models, Classifiers, Natural Language Processors - Spacy, Word2Vec, Time Series, Neural Networks, XGBoost, LightGBM, Ensembles, Stacked Models)
- Parameter Tuning - Perform Grid Search or Randomized Search to optimize parameters for validation, splits, regularization, tuple/dictionary parameters, etc.
- Compare and interpret appropriate metrics for regression, classification, or clustering models to the base case/null scenario/benchmark against your majority class. Prepare your model for deployment and Model Persistance
Use this Presentation Skeleton for your Data Science, Solution Engineering or Customer Success Demonstration
1. Cover Slide:
- Project Title/Name
- Team Member Names
- Job Titles, Organizations, E-mail Addresses
2. Agenda Slide:
- Topics Included and Timing
3. Introduction Slide:
- Introduction to your Stakeholders
- Introduction to your team
4. Problem Statement Slide:
- Describe Thesis, Problem Statement, or Core Problem
- Describe Solutions both technical/non-technical to the problem
5. Data Analysis Slides:
- Discuss Techniques, Software stack, platforms used
- Discuss Data Dictionary, Feature Engineering Techniques
- Discuss Benchmarks or baseline metrics as statistical controls
- Discuss data visualizations with business context (Maximum 2 visualizations per slide)
6. Machine Learning Slides:
- Discuss the 3 Best Scoring Models or Leaderboard, metrics, and business case interpretation
- Describe how robust metrics performed relative to baseline (Model Persistance)
7. Model Deployment Slide:
- Discuss how solution will be implemented or Deployed in Production
8. Conclusion Slide:
- Abstract of Solution Summary
- Recommendations and Results with applied business context
- Additional Research and Analysis
9. Next Steps Slide:
- Contact information, Github/Gitlab URL, Presentation Link, and Call to Action
10. Appendix Slides
- Works Cited and Media Resources
Submit the following requirements for your project to be considered complete
1. Code Requirements:
- To share with your Advisors on Github, Gitlab, or Bitbucket Repository
- To share all code files, Jupyter Notebooks or Script files, data/database files, and digital assets to be shared in a private repository
- To Include markdown, LateX, HTML, or Restructured Text to document your Jupyter Notebooks, and to include Comments and Docstrings where relevant for code
- To share Final Presentation as PowerPoint AND Adobe PDF
- To save and share all graphs and visualizations as separate PNG files in an Images folder, and as a PDF document
2. Slides Requirements:
- To focus PowerPoint presentation on applied business use case, analysis, insights, and business impact
- To not focus PowerPoint presentation on code
- To use less than 3 fonts in Presentation
- To include less than 20 slides in Presentation
- To present under 7 minutes talking time for Presentation
- To practice and prepare for remarks on 3 minutes Questions & Answers for business stakeholders and executive sponsors
License
To the extent possible under law, David Yakobovitch has licensed this work under Creative Commons, 4.0-NC-ND. This license is the most restrictive of Creative Commons six main licenses, only allowing others to download your works and share them with others as long as they credit the author, but they can’t change them in any way or use them commercially.