Data science teams need an environment that lets them connect with stakeholders, gain context fast, work through data at scale (beyond their PCs), and publish, share, and enable others to make data-driven decisions. Luckily, we are at a point where most of the technologies, tools, and techniques are open source (see below), giving us more time to focus on the domain, the problem, the people, and the process. Ultimately, that focus is what makes a data science team more effective.
What is a data science workbench?
My 94-year-old grandfather (Vern) had a workbench in his garage. I was in love with his space. He made wooden sailboats, toys, and furniture, and fixed wobbly chairs… whatever grandma needed, he figured it out. He seemed to have the right tools, in the right place, at the right time. I was inspired by this. Back to us. With proven problem-solving methodologies (DMAIC and Six Sigma have been around since the 1980s), open source technology, advances in machine learning, and data everywhere… having a data science workbench is critical. A data science team's tasks range from asking the right questions, gaining domain knowledge, and assessing opportunities/problems to acquiring the right data, tools, and technologies to enable an effective solution. Below is a collection of frameworks, techniques, how-tos, code examples, and tools that I have collected over the years to support the end-to-end range of tasks for a data science team.
Full-Service Data Science Engagements
| Engagements | Description | Insights |
|---|---|---|
| Data Enablement | Make curated datasets available for consumption | Investment can be large early on, but a well-established and well-understood data foundation will scale for descriptive, diagnostic, and predictive analysis. |
| Data Products | Build useful apps, tools, and dashboards, which can have embedded M.L./A.I. algorithms | Starting with descriptive (BI) and automation use cases can establish trust and drive productivity, but upskilling other teams/communities to handle that work is a must so that core resources can focus on more advanced analytics. |
| Consultancy | Identify problems and opportunities, assess impact, risk, and return | Establish trust as early as possible and challenge with “why?” first, before getting into details on the “what?”. Many teams want solutions without validating a specific problem. |
| Data-driven Diagnostics | Structured problem solving to identify root causes and effective improvement plans | Mapping executive metrics to operational metrics, and even to specific processes, can become an asset for effective diagnostics. |
| Ad-hoc analysis for Executives | Provide informed decisions on complex and strategic issues | With a mature data foundation, domain context, and a strong understanding of the “horizontal” impact upstream and downstream of an issue, this engagement is important for creating a top-down, analytically driven culture. |
Tasks data science teams need to do well…
| Tasks | Description | Tools & Technologies |
|---|---|---|
| Identify opportunities or problems | Gain context of products, processes, key decisions, and control points | Interviews, SWOT, SIPOC, Process Mapping, Affinity Mapping, Issue Trees, Ishikawa diagram, FMEA, QFD, KPIs |
| Define problem and measure impact | Understand urgency, severity, complexity, and business impact. The case should be convincing in order to gain support and resources (if needed) | Problem Statement Guidelines, Avoid Solutioning |
| Set a business goal for project | Goals can be incremental or transformative (or both). Understanding the impact if the goal is not met can help determine the level of precision/accuracy/scalability to weigh against cost. | SMART goals, Benchmarking, Normality Test |
| Prioritize Problems / Opportunities | Gain business alignment on priorities | Prioritization Matrix, Cause and Effect Matrix |
| Acquire data | Ability to ingest data from various data sources | Web scraping, APIs, Sensors, B2B feeds, Audio, Images, SQL and NoSQL languages |
| Store data | Select the right database technology, create/design tables, write data, and create a data model (make data consumable) | Postgres, pgAdmin 4, MongoDB, SAP HANA, Hadoop, Cloud Storage |
| Access and transform data | Ability to query data tables and join, aggregate, clean, and validate them into a usable dataset | SQL and NoSQL (MongoDB aggregation), Python, Notebooks, pandas, NumPy, Loops, Excel |
| Automate data pipelines | Schedule datasets to be automatically updated/refreshed | Cron, Windows Task Scheduler, Airflow |
| Visualize Data | Ability to interpret and ask questions of a dataset to gain context of the domain | Tableau, Spotfire, Google Data Studio, Seaborn, Plotly, Matplotlib |
| Identify key variables | Determine relationships and correlations between variables and their effect on the response variable / control point | EDA, Correlation Matrix, Covariance, ANOVA, t-test, Histograms, I-MR Charts, Kruskal-Wallis Test, Multi-Vari Chart, Density Plots, Normal Probability Plot, Pareto Charts, p chart, Regression, Root Sum of Squares, Run Chart, Scatter plot, Xbar-R charts |
| Select and Build M.L. Models | Ability to select an effective model to address objectives/goals | Classification, Regression, Clustering, Optimization, Decision-Tree, Bayesian Trees, Neural Nets, Distance Formulas, scikit-learn, caret, TensorFlow |
| Implement containerization and virtual environments | Ability to contain experiments and projects within a local environment to avoid version conflicts. Also makes projects portable/shareable | Docker, Docker Hub, virtualenv |
| Understand compute & HW limits | Understand the trade-off between time and cost regarding computational power | GPUs vs CPUs, Differentiable programming (e.g., TensorFlow), Cloud Services, Distributed Computing (e.g., Dask), Kubernetes |
| Document and Storytell | Ability to articulate approach, findings, challenges, and results to both colleagues and business leaders through written and verbal communication | GitHub, Git, Markdown, Tableau, PowerPoint, research paper template |
| Package solution | Make the solution useful. Build a minimum viable product (MVP) for the consumer/user to interact with the solution | Flask, Dash, Plotly, R Shiny, Tableau |
| Measure Change Management | Ability to assess adoption of the solution and receive feedback | Stakeholder Analysis, Performance and Compliance Metrics, Data Collection Plan, Surveys, Audits, Marketing and Communication Plan |
| Meta-Learning | Ability to “learn to learn”: identifying mental frameworks and models that accelerate new skills development or rapid adaptation to new environments. | Value of Sleep, Diet, Cardio Habits, Speed Reading, Auditory Learning, Handwritten notes, Daily Practice, Minimalism, Google search, Reddit (“ELI5:whateveryouwant”), RescueTime, How to read a book |
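
To make a few rows of the table concrete, here is a minimal Python sketch that chains three of the tasks: store data, access and transform data, and identify key variables. It is only an illustration under stated assumptions: an in-memory SQLite database stands in for Postgres or a similar store, and the table names (`machines`, `readings`), columns, sample values, and the `pearson` helper are all hypothetical and not taken from any real project.

```python
import sqlite3
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Store data: create two small tables and write made-up sample rows.
# (Assumption: SQLite in memory stands in for a production database.)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE machines (machine_id INTEGER PRIMARY KEY, line TEXT);
    CREATE TABLE readings (machine_id INTEGER, temperature REAL, defects INTEGER);
""")
conn.executemany(
    "INSERT INTO machines VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "B"), (4, "B")],
)
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        (1, 70.0, 2), (1, 72.0, 3),
        (2, 75.0, 4), (2, 77.0, 5),
        (3, 80.0, 6), (3, 82.0, 6),
        (4, 85.0, 8), (4, 87.0, 9),
    ],
)

# Access and transform data: join the tables, then aggregate to one row per machine.
rows = conn.execute("""
    SELECT m.machine_id,
           m.line,
           AVG(r.temperature) AS avg_temp,
           SUM(r.defects)     AS total_defects
    FROM readings AS r
    JOIN machines AS m ON m.machine_id = r.machine_id
    GROUP BY m.machine_id, m.line
    ORDER BY m.machine_id
""").fetchall()

# Identify key variables: how strongly does average temperature track defects?
avg_temps = [row[2] for row in rows]
total_defects = [row[3] for row in rows]
r = pearson(avg_temps, total_defects)

for row in rows:
    print(row)
print(f"Pearson r between avg_temp and total_defects: {r:.3f}")
```

In practice, pandas or a BI tool would likely replace the hand-written correlation, and a scheduler such as cron or Airflow would rerun the query on a cadence. The point is only that each row in the table maps to a small, testable step.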
