Spinning Up a Start-Up Data Team

Part 1: Making Your First Start-up Data Team Hire

Craig Booth
Plumbers Of Data Science

Across all businesses, the vast majority of projects designed to drive value from data resources fail. By some estimates, up to 87% of data science projects are never deployed in production, and by others, “only 20% of analytic insights will deliver business outcomes”.

If you’re considering starting to invest in data in your organization, these numbers can look incredibly disheartening. So, how do you set yourself up to be one of the minority of companies that are able to derive real value from their data?

Before we begin, one word about my personal point of view. Although my background is in academia, where I worked on purely algorithmic code and statistics, my current reality is start-up CTO at Packback. I remain hands-on in the day-to-day running of the data team. This article reflects my lived experience in that world, where we have gone from having zero data teams to having one data team. If your background or company is wildly different from mine, please correct for that before taking anything I say too seriously.

Data Scientist vs Data Engineer

One of the first questions I ask all new engineers who want to work on a data team is where they sit on this spectrum:

A horizontal line, the left is labeled “data engineer” and the right is labeled “data scientist”
The data engineer to data scientist spectrum, image by author

Since not everybody is familiar with these job titles, let’s describe their relative responsibilities. A data scientist

…cleans and analyzes data, answers questions, and provides metrics to solve business problems.

That is, your data scientists are the people in charge of using data to train models, interacting with the business to find and answer valuable questions, and effectively communicating the answers to those questions to a wide — and sometimes non-technical — audience. A data engineer, on the other hand,

…develops, tests, and maintains data pipelines and architectures, which the data scientist uses for analysis.

I would characterize data scientists as researchers and communicators and data engineers as the deeply technical infrastructure support that makes the work of the data scientists usable at scale.

Now, everybody sits somewhere on this line. Some people’s tendency is to pick a specialty and go deep (“I am, and want to continue to be, a data scientist”). Others occupy a broader range (“I do a bit of everything, probably weighted towards data science/engineering”). Notably, many people you speak to will communicate that they are in flux as they figure out what they want in their careers (“well, historically I have been in data science roles, but I want to grow into more of an engineering role”).

As one concrete example, here is me. Ten years ago, my background was almost pure data science and statistics. I made a significant career shift and became a pure software engineer, developing products, which allowed my skillset to widen significantly.

My career path on the data-engineer vs data-scientist spectrum, image by author

During this time I found that I enjoyed working on infrastructure and scalability issues every bit as much as I enjoyed working on pure data science problems. Over the years I have had the luxury of being able to move around and broaden my skill set. I consider myself reasonably strong across the board, but if asked will identify first as a data engineer and then as a data scientist.

Starting a Team Starts with Hiring One Person

With definitions out of the way, let’s talk about how to start a team. Whether you’re hiring a new person in, or transferring people cross-team, I think there are a couple of useful heuristics that point to the perfect candidate to start a new data team.

Rule 1: Before you can do anything, you need to get your data reliably into a centralized location.

A data scientist who exists in an organization without consistent, reliable, timely access to the underlying business data is fundamentally unempowered to succeed.

Before you can begin to think about all the cool things you’re going to do with your data, you need for it to be accessible. I strongly believe that data efforts should begin with data engineers.

You may ask, “how will I get value from my data without access to data scientists?” The answer is that the simple act of getting data centralized is valuable. Taking an organization from having no overarching view of its data resources to having one is incredibly powerful. Simply getting data into a single store and putting a piece of dashboarding software in front of it will allow very many members of your organization to derive valuable insights. Will they be building machine learning models? Nope. Will they be reporting on the metrics that matter most to them doing their jobs? Absolutely!

Furthermore, the very process of centralizing and making available your data will teach you things about the interactions in your company that you never previously realized. For example, simply being able to see how marketing outreach correlates with platform activity will produce a flurry of insights that don’t need trained data specialists to understand.
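To make the idea of a single store concrete, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption for illustration: the table names (marketing_outreach, platform_activity), the columns, and the numbers are hypothetical, and a real team would more likely load into a data warehouse than an in-memory database. The point is only that once two sources sit side by side, the question "how does outreach line up with activity?" becomes a single join.

```python
import sqlite3

# Hypothetical rows: in practice these would be exported from a marketing
# tool and from the product's own database.
marketing_emails = [("2024-W01", 1000), ("2024-W02", 4000), ("2024-W03", 2500)]
platform_logins = [("2024-W01", 310), ("2024-W02", 1180), ("2024-W03", 760)]


def build_central_store(emails, logins):
    """Load both sources into one in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE marketing_outreach (week TEXT PRIMARY KEY, emails_sent INTEGER)"
    )
    conn.execute(
        "CREATE TABLE platform_activity (week TEXT PRIMARY KEY, logins INTEGER)"
    )
    conn.executemany("INSERT INTO marketing_outreach VALUES (?, ?)", emails)
    conn.executemany("INSERT INTO platform_activity VALUES (?, ?)", logins)
    conn.commit()
    return conn


def weekly_report(conn):
    """Join the two sources by week: the kind of query a dashboard would run."""
    query = """
        SELECT m.week, m.emails_sent, a.logins
        FROM marketing_outreach AS m
        JOIN platform_activity AS a ON a.week = m.week
        ORDER BY m.week
    """
    return conn.execute(query).fetchall()


conn = build_central_store(marketing_emails, platform_logins)
report = weekly_report(conn)
for week, emails_sent, logins in report:
    print(f"{week}: {emails_sent} emails sent, {logins} logins")
```

Nothing here requires a data scientist. A dashboarding tool pointed at the same store could run this query and chart it for anyone in the organization.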

Rule 2: You’re going to start simple with the analytics

Even after you get the data into one place, your first port of call should not be to chase today’s fanciest buzzwords (it’s “deep learning”, it’s a “GAN”) but rather to focus on deriving explainable value for the business, and this means using simple tools. You will find that in most situations, the simple tools are plenty to get the job done.

In fact, I would go so far as to say that 80% of data tasks can probably be handled by one of the simplest models, logistic regression. Why do I say this? First and foremost, the effort-reward curve:

The effort-reward curve. There are diminishing returns in increasing effort in any area, image by the author

Getting a model that is 80% accurate may take you an hour. Getting a model that’s 90% accurate from the same data may take weeks or months of research. Especially in an organization where data is truly being exploited for the first time, that 80% is big! The best thing you can do for the organization is to answer lots of questions well enough to add value and be open with everybody about the limits of your analysis. Only once the low-hanging fruit has been harvested should you move on to higher-effort techniques to wring out small amounts of additional performance.

There is a second reason to prefer simple models when possible, and that is the trade-off between accuracy and explainability. Here is a schematic showing the approximate relationship between how accurate a model is and how “explainable” it is (that is, how easy it is for me, as a human being, to gain a logical understanding of how the model arrived at a prediction). Simple models have a huge advantage in that you can actually tell users “this is why the model did what it did”. Especially in an organization that is new to using data to make decisions, being able to explain to stakeholders the why of the decisions is a huge advantage in terms of having those predictions taken seriously.

Accuracy vs. explainability. More accurate models are inherently less explainable, image by the author
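As a rough illustration of why a simple model is easy to explain, here is a small logistic regression written in plain Python and fitted by gradient descent. This is a sketch under stated assumptions, not a recommended production setup: the features (logins last week, support tickets), the renewal labels, and the data are all made up, and in practice you would likely reach for an established library rather than hand-rolled code. What matters is that the fitted model is just one weight per feature, so the sign and size of each weight can be read out and explained to a stakeholder.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Probability of the positive class for one row of features."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic_regression(rows, labels, learning_rate=0.1, epochs=5000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Hypothetical customers: [logins last week, support tickets opened].
feature_names = ["logins_last_week", "support_tickets"]
rows = [[5, 0], [4, 1], [6, 0], [5, 1], [1, 3], [0, 2], [2, 4], [1, 2]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = renewed, 0 = did not renew

weights, bias = fit_logistic_regression(rows, labels)

# The explanation is the model itself: a positive weight pushes the
# prediction towards "renewed", a negative weight pushes it away.
for name, weight in zip(feature_names, weights):
    direction = "raises" if weight > 0 else "lowers"
    print(f"{name}: weight {weight:+.2f} ({direction} predicted renewal)")
```

With a deep model, the equivalent of that final loop is a research project in its own right. Here it is three lines.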

The Profile of Your First Data Engineering Hire

Given the previous sections, I would advocate for making the first member(s) of the data team be “data engineering pragmatists”. The data engineering pragmatist is a data engineer first and foremost, but also somebody who has an interest in seeing how they can use their data work to better the organization. The data engineering pragmatist enjoys seeing their work make an organization function more effectively but may or may not have any interest in growing into a data scientist themselves.

The path of the data engineering pragmatist on a new team is likely that they spend the first period in their role as a pure data engineer, and then as they embed into the organization, they begin to take on simple data tasks to demonstrate the value of the cool stuff they just built. In terms of the exact profile I am looking for in this position:

I want to know that they are a solid values fit with my team

  • Values alignment with the team is by far the highest priority in terms of being able to say yes. A first data engineer must have an investor’s mindset and be fearlessly curious. A values miss, even for a technically outstanding candidate, makes them a definite no.

I want to know if their tendency is towards best practices

  • A first data engineer will have to think big and build a lot. The world is full of data engineering best practices, and I want somebody whose first tendency is to learn the best practice rather than engineer a new thing.

I want to know that they have a solid engineering background

  • A first data engineer is an engineer first and foremost. I hold data engineers to the same technical standards as any software development engineer. A data team staffed by engineers with the capability to build a product from the ground up is a data team that is empowered to make big changes.

I want to know that they have the desire to learn data stuff.

  • Finally, and at the bottom of the list, I want the first data engineer to have the desire to learn data science. Honestly, this is probably optional. There is so much value in data engineering alone, and this first hire will not run out of valuable data engineering work to do.

What Next

This article got fairly long, so in a follow-up, I want to talk a bit about what to do with the nascent team once it exists.

A team without a direction, particularly a new team, will get pulled in a hundred directions and never be able to make decisive moves of its own. Having a long-term strategic roadmap allows the team to set a direction, and increases its probability of getting there by orders of magnitude.

Craig Booth
Plumbers Of Data Science

Tech CTO at packback.co. Values collaboration, teamwork and fun. Likes data, data visualization and baking.