Data Science Workbench

Data science teams need an environment that lets them connect with stakeholders, gain context fast, work through data at scale (beyond their PCs), and publish, share, and enable others to make data-driven decisions. Luckily, we are at a point where most of the technologies, tools, and techniques are open source (see below), giving us more time to focus on the domain, the problem, the people, and the process. Ultimately, that focus is what makes a data science team more effective.

What is a data science workbench?

My 94-year-old grandfather (Vern) had a workbench in his garage. I was in love with his space. He made wooden sailboats, toys, and furniture, and fixed wobbly chairs… whatever grandma needed, he figured it out. He seemed to have the right tools, in the right place, at the right time. I was inspired by this.

Back to us. With proven problem-solving methodologies (DMAIC and Six Sigma have been around since the 1980s), open source technology, advances in machine learning, and data everywhere, having a data science workbench is critical. A data science team's tasks range from asking the right questions, gaining domain knowledge, and assessing opportunities and problems to acquiring the right data, tools, and technologies to enable an effective solution. Below is a collection of frameworks, techniques, how-tos, code examples, and tools that I have collected over the years to support the end-to-end range of tasks for a data science team.

Full-Service Data Science Engagements

| Engagements | Description | Insights |
| --- | --- | --- |
| Data Enablement | Make curated datasets available for consumption | Investment can be large early on, but a well-established and well-understood data foundation will scale for descriptive, diagnostic, and predictive analysis. |
| Data Products | Build useful apps, tools, and dashboards, which can have embedded ML/AI algorithms | Starting with descriptive (BI) and automation use cases can establish trust and drive productivity, but upskilling other teams/communities to handle that work is a must so that core resources can focus on more advanced analytics. |
| Consultancy | Identify problems and opportunities; assess impact, risk, and return | Establish trust as early as possible and challenge with “why?” first before getting into details on the “what?”. Many teams want solutions without validating the specific problem. |
| Data-driven Diagnostics | Structured problem solving: identify root causes and effective improvement plans | Starting with executive metrics and mapping them to operational metrics, and even to specific processes, can become an asset for effective diagnostics. |
| Ad-hoc analysis for Executives | Inform decisions on complex and strategic issues | With a mature data foundation, domain context, and a strong understanding of the “horizontal” impact upstream and downstream of an issue, this engagement is important for creating a top-down, analytically driven culture. |

Tasks data science teams need to do well…

| Tasks | Description | Tools & Technologies |
| --- | --- | --- |
| Identify opportunities or problems | Gain context of products, processes, key decisions, and control points | Interviews, SWOT, SIPOC, Process Mapping, Affinity Mapping, Issue Trees, Ishikawa diagram, FMEA, QFD, KPIs |
| Define problem and measure impact | Understand urgency, severity, complexity, and business impact. Should be convincing in order to gain support and resources (if needed) | Problem Statement Guidelines, Avoid Solutioning |
| Set a business goal for project | Goals can be incremental or transformative (or both). Understanding the impact if the goal is not met can help determine the level of precision/accuracy/scalability against cost. | SMART goals, Benchmarking, Normality Test |
| Prioritize Problems / Opportunities | Gain business alignment on priorities | Prioritization Matrix, Cause and Effect Matrix |
| Acquire data | Ability to ingest data from various data sources | Web scraping, APIs, Sensors, B2B feeds, Audio, Images, SQL and NoSQL languages |
| Store data | Select the right database technology, create/design tables, write data, and create a data model (make data consumable) | Postgres, pgAdmin 4, MongoDB, SAP HANA, Hadoop, Cloud Storage |
| Access and transform data | Ability to query data tables and to join, aggregate, clean, and validate a usable dataset | SQL and NoSQL (MongoDB aggregation), Python, Notebooks, Pandas, NumPy, Loops, Excel |
| Automate data pipelines | Schedule datasets to be automatically updated/refreshed | Cron, Windows Task Scheduler, Airflow |
| Visualize Data | Ability to interpret and ask questions of a dataset to gain domain context | Tableau, Spotfire, Google Data Studio, Seaborn, Plotly, Matplotlib |
| Identify key variables | Determine relationships and correlations among variables and their effect on the response variable / control point | EDA, Correlation Matrix, Covariance, ANOVA, t-test, Histograms, I-MR Charts, Kruskal-Wallis Test, Multi-Vari Chart, Density Plots, Normal Probability Plot, Pareto Charts, p chart, Regression, Root Sum of Squares, Run Chart, Scatter plot, Xbar-R charts |
| Select and Build M.L. Models | Ability to select an effective model to address objectives/goals | Classification, Regression, Clustering, Optimization, Decision-Tree, Bayesian Trees, Neural Nets, Distance Formulas, scikit-learn, caret, TensorFlow |
| Implement containerization and virtual environments | Ability to contain experiments and projects within a local environment to avoid version conflicts. Also makes projects portable/shareable | Docker, Docker Hub, virtualenv |
| Understand compute & HW limits | Understand the trade-off between time and cost regarding computational power | GPUs vs CPUs, Differentiable programming (e.g. TensorFlow), Cloud Services, Distributed Computing (e.g. Dask), Kubernetes |
| Document and Storytell | Ability to articulate approach, findings, challenges, and results to both colleagues and business leaders through written and verbal communication | GitHub, git, Markdown, Tableau, PowerPoint, research paper template |
| Package solution | Make the solution useful. Build a minimum viable product (MVP) for the consumer/user to interact with the solution | Flask, Dash, Plotly, R Shiny, Tableau |
| Measure Change Management | Ability to assess adoption of the solution and receive feedback | Stakeholder Analysis, Performance and Compliance Metrics, Data Collection Plan, Surveys, Audits, Marketing and Communication Plan |
| Meta-Learning | Ability to “learn to learn”: identifying mental frameworks and models that accelerate new skills development or rapid adaptation to new environments. | Value of Sleep, Diet, Cardio Habits, Speed Reading, Auditory Learning, Hand-written notes, Daily Practice, Minimalism, Google search, Reddit (“ELI5:whateveryouwant”), RescueTime, How to read a book |
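
As a rough illustration of the "Store data", "Access and transform data", and "Identify key variables" tasks listed above, here is a minimal sketch using only the Python standard library. It is not a recommended stack: `sqlite3` stands in for a production database such as Postgres, and the `machines` and `runs` tables, their columns, and all values are made-up assumptions for the example.

```python
import sqlite3
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def build_dataset():
    """Store two small tables, then join, clean, and aggregate them."""
    # In-memory SQLite is an assumption standing in for a real database.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE machines (machine_id INTEGER PRIMARY KEY, line TEXT);
        CREATE TABLE runs (machine_id INTEGER, temperature REAL, defects INTEGER);
        """
    )
    # Hypothetical sample data; one run is missing its temperature reading.
    conn.executemany(
        "INSERT INTO machines VALUES (?, ?)",
        [(1, "A"), (2, "A"), (3, "B")],
    )
    conn.executemany(
        "INSERT INTO runs VALUES (?, ?, ?)",
        [(1, 70.0, 2), (1, 74.0, 4), (2, 80.0, 7),
         (2, 82.0, 9), (3, 65.0, 1), (3, None, 3)],
    )
    rows = conn.execute(
        """
        SELECT m.machine_id, m.line,
               AVG(r.temperature) AS avg_temp,
               SUM(r.defects)     AS total_defects
        FROM runs AS r
        JOIN machines AS m ON m.machine_id = r.machine_id
        WHERE r.temperature IS NOT NULL      -- clean: drop incomplete runs
        GROUP BY m.machine_id, m.line
        ORDER BY m.machine_id
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    dataset = build_dataset()
    avg_temps = [row[2] for row in dataset]
    defects = [row[3] for row in dataset]
    print(dataset)
    print("temperature vs. defects correlation:", round(pearson(avg_temps, defects), 3))
```

With real data, the same shape usually moves to Pandas (`merge`, `groupby`, `corr`) once the dataset outgrows a handful of queries; the SQL here just keeps the sketch dependency-free.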
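
The Prioritization Matrix mentioned under "Prioritize Problems / Opportunities" can be approximated with a simple weighted score. The sketch below is one possible interpretation, not a standard definition: the criteria, the weights, the 1 to 5 scoring scale, and the candidate projects are all hypothetical and would normally be agreed with stakeholders.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    scores: dict  # criterion name -> score on an assumed 1-5 scale


# Hypothetical criteria and weights (they sum to 1.0).
WEIGHTS = {"business_impact": 0.5, "urgency": 0.3, "ease_of_delivery": 0.2}


def weighted_score(candidate, weights=WEIGHTS):
    """Sum of each criterion's score multiplied by its weight."""
    return sum(weight * candidate.scores[criterion]
               for criterion, weight in weights.items())


def prioritize(candidates, weights=WEIGHTS):
    """Return candidates sorted from highest to lowest weighted score."""
    return sorted(candidates, key=lambda c: weighted_score(c, weights), reverse=True)


# Made-up opportunities, purely for illustration.
CANDIDATES = [
    Candidate("Automate weekly report",
              {"business_impact": 2, "urgency": 4, "ease_of_delivery": 5}),
    Candidate("Reduce scrap rate",
              {"business_impact": 5, "urgency": 3, "ease_of_delivery": 2}),
    Candidate("Forecast demand",
              {"business_impact": 4, "urgency": 2, "ease_of_delivery": 1}),
]

if __name__ == "__main__":
    for candidate in prioritize(CANDIDATES):
        print(f"{candidate.name}: {weighted_score(candidate):.2f}")
```

Keeping the weights in one visible dictionary makes the trade-offs explicit, which tends to help when the goal is business alignment rather than a precise ranking.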