Software and software engineering are relatively well-understood fields. Although technologies change, the general frameworks and ways of working have remained pretty consistent for decades.
On the surface, a data team looks like a software engineering team, so you’d expect things to translate. However, this isn’t currently the case.
Because of the speed at which things are changing in data and machine learning, there aren’t many well-established best practices for data teams - and the few that do exist mainly apply to larger companies rather than startups. This can make it a minefield to navigate when building out a small team.
In this post, I wanted to share some lessons that we’ve learnt at Ophelos whilst building out our data function from scratch.
For a more general insight into the challenges that data teams face globally, this blog post is great:
Lesson 1. Build a team of generalists
In a data team at a large organisation, you’d typically see a combination of:
- Business Analysts: business-focused and able to gain insight from well-structured data
- Data Analysts: more technical than Business Analysts and able to perform deeper-dive analysis.
- Data Scientists: perform more experimental research work and build machine learning models.
- Machine Learning Engineers: even more focused on machine learning than Data Scientists.
- Data Engineers: responsible for infrastructure and building data pipelines to provide accurate and clean data to the other data functions.
- MLOps Engineers: deploy and maintain machine learning models in production.
- Analytics Engineers: a hybrid of a data engineer and an analyst. Responsible for data modelling and ensuring all data is analytics-ready.
That’s 7 different job roles already. And I could go on. However, if your data team is as small as 2, 5 or even 10 people, you don’t have the luxury of covering everything with separate roles. One option is to select a few of the most pertinent roles for your objective. For example, in a team of 3, you might have 1 data engineer, 1 data scientist and 1 data analyst.
What we’ve chosen to do instead is build a team of generalists, where each person can work across the whole stack and cover multiple roles. So for a team of 3, you would have 3 people capable of building data pipelines and machine learning models, and analysing data.
We’ve seen multiple advantages of this team structure:
- Projects can be completed end-to-end from problem statement to production by one person
- Everyone has a greater understanding of the data since they see it through from raw to clean
- The team can move faster as there are fewer bottlenecks
- Better personal development through working on more varied tasks
Lesson 2. Build an open and flexible architecture
The paradigm shift of moving from on-premise data warehouses to a cloud-based architecture has brought with it a huge number of benefits around scalability. However, the modern data stack has continued evolving and is becoming more and more decentralised, to the point where it can be difficult for small teams to work with.
The problem isn’t that there are lots of players in the market to choose from, but that each tool only handles a small part of the overall data stack, with the expectation that you’ll piece together your architecture from many different providers.
For example, you would have a different cloud-based provider for:
- Ingesting data
- Storing data in a data warehouse/lakehouse
- Building data pipelines
- Training and deploying machine learning models
- Building data models
- BI and analytics
- Orchestration
To visualise just how bloated the data tooling industry has become in recent years, check out Matt Turck’s 2023 Data landscape.
With each new provider you add to your stack, you add another layer of complexity and more $$$. Of course, you can always build parts yourself, but this takes time away from the true function of your data team. Our data stack is predominantly built on Databricks, which is able to handle a large chunk of the overall stack.
When defining your architecture we believe there are three key principles to stick to:
- Don’t silo data. Having as much data as possible available in one place is a real superpower.
- Don’t silo functionality. Our workloads regularly involve combining lots of different processes - for example, accessing multiple data sources, running a machine learning model, making an API call and then writing the data back (see the sketch after this list). So having a framework where we can do all of this in one place is essential.
- Fast time to production. We run lots of experiments and our company priorities change frequently, so being able to build and deploy as fast as possible is a real asset.
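To make this concrete, here’s a minimal sketch of what one of these combined workloads can look like when it all runs in a single framework. It’s a hypothetical example written with PySpark and MLflow - the table names, model URI and API endpoint are illustrative placeholders, not our actual setup:

```python
# Illustrative end-to-end workload: read -> score -> call an API -> write back.
# Table names, the model URI and the endpoint URL are placeholders.
import requests
import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks

# 1. Access multiple data sources
customers = spark.read.table("raw.customers")
payments = spark.read.table("raw.payments")
features = customers.join(payments, on="customer_id", how="left")

# 2. Run a machine learning model loaded from the MLflow model registry
model = mlflow.pyfunc.load_model("models:/repayment_propensity/Production")
scored = features.toPandas()
scored["propensity"] = model.predict(scored)

# 3. Make an API call with the results
requests.post(
    "https://internal-api.example.com/scores",
    json=scored[["customer_id", "propensity"]].to_dict(orient="records"),
    timeout=30,
)

# 4. Write the data back to the warehouse
spark.createDataFrame(scored).write.mode("overwrite").saveAsTable("analytics.customer_scores")
```

The point isn’t the specific tools - it’s that nothing above requires handing work off between separate platforms.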
Lesson 3. Focus on providing actual value
This statement is important for anyone who works at a startup, but we’ve found that constantly reminding ourselves of this has helped our team to prioritise work. Everyone who works with software wants to write perfect code and build perfect products. However, it’s important to remember the function of your team and how you actually provide value to the organisation.
There is always some way you can optimise your data pipeline or refactor code to be cleaner. But it’s all about balancing perfectionism with progression, and sometimes things don’t need to be absolutely perfect in order to drive your startup forwards.
Ophelos is an applied technology company, not just a technology company. So the primary purpose of our data team is to solve business problems. We do this through running experiments, analysing data and building machine learning models.
Lesson 4. Observability over testing
As I mentioned earlier, the modern data stack involves piecing together components from multiple third-party providers. More providers mean more integrations, and more integrations mean more integration testing. Furthermore, a lot of these tools have a strong focus on usability and have given little thought to how you’d actually perform unit or integration tests. If you’ve ever tried to unit test a BI platform then you’ll understand the pain.
We have found success in focusing on observability over test coverage. By setting expectations of what the data should look like and alerting on job failures, we’re able to find out quickly when issues occur and fix them fast.
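To illustrate the pattern, here’s a simplified sketch: assert a handful of expectations after a job runs and fire an alert when one fails. This is a hypothetical example rather than our production setup - the table name, column names and webhook URL are placeholders:

```python
# Simplified post-job data checks with an alert on failure.
# The table, columns and webhook URL below are placeholders.
import requests
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

def alert(message: str) -> None:
    """Send a failure notification, e.g. to a (hypothetical) Slack webhook."""
    requests.post("https://hooks.slack.com/services/XXX", json={"text": message}, timeout=10)

df = spark.read.table("analytics.customer_scores")

# Expectations about what the data should look like
checks = {
    "table is not empty": df.count() > 0,
    "no null customer ids": df.filter(F.col("customer_id").isNull()).count() == 0,
    "propensity within [0, 1]": df.filter((F.col("propensity") < 0) | (F.col("propensity") > 1)).count() == 0,
}

failed = [name for name, passed in checks.items() if not passed]
if failed:
    alert(f"Data checks failed on analytics.customer_scores: {failed}")
```

Dedicated data quality tools can take this much further, but even a few simple checks like these go a long way.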
Also, to build on Lesson 3: test as much as you can, especially mission-critical components, but don't go over the top. It’s important to remember what the true purpose of your data team is. If it takes you days or weeks to write comprehensive tests for an experimental feature which might be redundant in 3 months then your time would have been better spent elsewhere.
A final word
It’s hard to tell what the future of data and AI holds. It’s also difficult to predict whether data teams will gradually adopt the ways of working of software engineering teams. In any case, I hope the lessons we’ve learnt navigating this landscape provide value to other new teams out there!