The Optimized Data Stack: Strategies for building a budget-conscious data stack
By Demilade Agboola, Senior Analytics Engineer @ Data Culture
Building a robust data infrastructure is a complex initiative for organizations, and it can easily blow initial budgets out of the water. Ongoing use cases and changing needs can seem overwhelming or almost impossible to tackle; with the right planning, however, the investment will prove its return. Companies that go further and optimize their data infrastructure get considerably more out of it than they put in. By visualizing current data stack costs, leveraging open-source solutions, optimizing data warehouses, and putting a focused effort into forecasting future data needs, data teams can get in front of modern data stack cost optimization.
Data and the growth of data
In a recent Forbes article, Tony Bradley writes that “In 2021, 2.5 quintillion bytes of data were created every day. Experts predict that humans will produce and consume 94 zettabytes of data in 2022.” The data explosion, driven by technological advancements and ease of user access, has been publicized over the past decade by news and social media outlets in everything from videos to articles. With the increase in the quantity of data, the availability of cloud resources to handle it, the rising skill level of data practitioners, and the proliferation of low-code/no-code data tools, data has become even more accessible to technical and non-technical business professionals alike. These professionals are harnessing data to give their businesses an edge, measure risk, increase revenue, drive innovation, and improve efficiency, to name a few use cases.
As data becomes more available, its utility across businesses grows as well. A major concern for business teams, however, is the potentially high cost of setting up the infrastructure needed to gain insights into their operations, customers, and markets. These insights enable leaders to make better decisions, backed by objective data, that improve customer experience, resource allocation, and business growth while also increasing the chances of the business surviving a harsh economic climate. Such insights should not, however, come at an unexpected cost to the company.
The use of the modern data stack
A popular approach to implementing data strategy has come to be known as the “Modern Data Stack” (MDS). As data has exploded, so has the number of businesses investing in data teams and data infrastructure. The MDS consists of data ingestion, data storage, data transformation, business intelligence, and/or AI/ML modeling. It enables automation of many routine data management tasks, freeing up time and resources for more valuable activities. The MDS also includes a range of security technologies and practices that help protect data from unauthorized access and breaches, with security upheld by providers that have signed Service Level Agreements.
The concept of the MDS has grown because of its reduced technical requirements, scalability, flexibility, improved efficiency, support for collaboration, and improved security. Compared to traditional data stacks, the technical expertise required to set up an MDS is no longer a barrier to entry. Plug-and-play tools demand less technical skill and knowledge and enable business stakeholders to engage in data initiatives as well. Storage and computing resources are made available through cloud-based services, which makes it easier to share data and collaborate with partners, facilitates the exchange of ideas and insights, and removes on-premise hurdles. Combined, these factors make implementing a data stack possible for teams that understand the potential offered by the added ease of access.
Cost of the data stack
What most articles, blogs, and product descriptions don’t cover in detail is that the costs of the MDS can skyrocket if adequate thought isn’t given to the strategy required in setting up the infrastructure. Prices of tools can vary depending on several factors, such as the size and complexity of the system, the types of data sources and storage systems used, and the specific tools and technologies that are implemented.
Many of the tools and technologies used in the modern data stack are proprietary software products that require organizations to purchase multiple licenses. Some business needs require migrating legacy systems to cloud infrastructure that can store and process large amounts of data, which can also be expensive. Such migrations also require skilled personnel to design, implement, and maintain the new systems. The labor costs that come with these implementations can include salaries and benefits for employees, as well as the cost of training and professional development, all of which add to the initially assumed total cost of the effort.
The Optimized Data Stack
There are many levers that can be pulled to keep the cost of the MDS down and operate within budgetary requirements. Though the goal is to cut costs, it’s still important to do this without sacrificing the quality or performance of the system. What’s offered in this article is the concept of the Optimized Data Stack (ODS).
It is impossible to reduce costs without an established way to measure the current state of the infrastructure and what it actually costs. Ideally, every part of the data stack should be visible from one dashboard and monitored over time. This gives visibility into which parts of the data stack are growing in cost or data volume at an unexpected pace and need to be addressed.
Collecting the necessary metadata from these tools and funneling it into the data warehouse to estimate the total cost of ownership is a great first step. This data can then be fed into a business intelligence (BI) tool like Tableau or Power BI, which will visualize the rolled-up costs for the relevant business and data owners. An accessible cost dashboard builds confidence and synergy within the company, as surprise surges in cost don’t appear at the end of the fiscal period.
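As a rough illustration of that rollup, the sketch below totals cost per tool per month and flags month-over-month surges. It is a minimal example under stated assumptions, not a reference implementation: the record layout, the tool names, the dollar figures, and the 25% surge threshold are all hypothetical, and in practice the records would come from each tool's billing or usage metadata once it lands in the warehouse.

```python
from collections import defaultdict

# Hypothetical cost records (tool names and amounts are made up for illustration).
COST_RECORDS = [
    {"tool": "ingestion", "month": "2023-01", "cost_usd": 400.0},
    {"tool": "ingestion", "month": "2023-02", "cost_usd": 420.0},
    {"tool": "warehouse", "month": "2023-01", "cost_usd": 1000.0},
    {"tool": "warehouse", "month": "2023-02", "cost_usd": 600.0},
    {"tool": "warehouse", "month": "2023-02", "cost_usd": 900.0},
    {"tool": "bi", "month": "2023-01", "cost_usd": 300.0},
    {"tool": "bi", "month": "2023-02", "cost_usd": 300.0},
]


def monthly_cost_by_tool(records):
    """Sum cost_usd for every (tool, month) pair."""
    totals = defaultdict(float)
    for record in records:
        totals[(record["tool"], record["month"])] += record["cost_usd"]
    return dict(totals)


def flag_surges(totals, threshold=0.25):
    """Return (tool, month, growth) wherever cost grew by more than
    `threshold` compared with the previous month on record for that tool.
    The 0.25 default is an arbitrary, assumed threshold."""
    by_tool = defaultdict(list)
    for (tool, month), cost in totals.items():
        by_tool[tool].append((month, cost))

    surges = []
    for tool, series in by_tool.items():
        series.sort()  # "YYYY-MM" strings sort chronologically
        for (_, prev_cost), (month, cost) in zip(series, series[1:]):
            if prev_cost > 0:
                growth = (cost - prev_cost) / prev_cost
                if growth > threshold:
                    surges.append((tool, month, round(growth, 2)))
    return sorted(surges)


if __name__ == "__main__":
    totals = monthly_cost_by_tool(COST_RECORDS)
    for tool, month, growth in flag_surges(totals):
        print(f"{tool} cost grew {growth:.0%} in {month}")
```

The output of a rollup like this is what would feed the dashboard; the surge check is one simple way to surface the parts of the stack that need attention before the end of the fiscal period.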
Open Source Tools
In some cases, the best tool for business needs might actually be a free open-source tool that provides the same value as a paid tool. By leveraging open-source options, organizations can save money on licensing fees and other costs associated with proprietary software until they reach the data maturity to consider a paid solution, once the initial value has been proven.
However, security is one of the main concerns with an open-source solution, as vulnerabilities in the software can be seen from the codebase. Open-source software may also lack some of the flexibility that paid tools come with out of the box. Ample research is required to determine the advantages and disadvantages of whatever tool is selected, and that research then needs to be weighed against the use case. As an example, if a BI tool is only used by a limited number of people, there might not be an advantage to paying $20,000+ for a tool when a free BI tool, despite slightly less functionality, will provide similar value and still meet the overall business requirements.
Evaluating Data Warehousing
Companies can also cut costs by evaluating how their data warehouses are used in different environments. Creating separate warehouses for the production and test environments, setting up alerts in the infrastructure, and ensuring proper query optimization are just a few examples of measures that can have a high impact on the annual dollar amount spent on the data pipeline.
Putting a focus on optimizing data warehousing needs also allows for future scalability. If this is a consideration from the start, the impact on monthly invoicing as your data increases isn’t as high as it could otherwise be. Left unchecked, costs have a cascading effect and will snowball, since pipelines run continuously for years. If data teams don’t set these pipelines up carefully at the outset, or never revisit them, data warehousing alone can eat through yearly budgets in just months.
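To make the "yearly budget in just months" risk concrete, here is a minimal sketch of the arithmetic behind a budget alert. It assumes a simple average run rate over the months observed so far; the function names and the figures in the example are hypothetical, and a real setup would more likely rely on the alerting features of the warehouse provider itself.

```python
def projected_annual_spend(monthly_spend):
    """Extrapolate a 12-month total from the average of the months observed."""
    if not monthly_spend:
        raise ValueError("need at least one month of spend")
    return sum(monthly_spend) / len(monthly_spend) * 12


def months_until_exhausted(annual_budget, monthly_spend):
    """Months of budget left at the average run rate.
    Returns None when nothing has been spent yet."""
    if not monthly_spend:
        raise ValueError("need at least one month of spend")
    spent = sum(monthly_spend)
    run_rate = spent / len(monthly_spend)
    if run_rate <= 0:
        return None
    remaining = max(annual_budget - spent, 0)
    return remaining / run_rate


if __name__ == "__main__":
    # Hypothetical figures: a 60,000 USD yearly warehouse budget and
    # three months of invoices.
    budget = 60_000
    invoices = [9_000, 11_000, 10_000]
    print("Projected annual spend:", projected_annual_spend(invoices))
    print("Months of budget left:", months_until_exhausted(budget, invoices))
```

In this made-up example, half of the yearly budget is gone after one quarter, which is the kind of signal an alert should raise long before the invoice does.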
Forecasting Future Needs
After measuring data infrastructure costs, adopting free tools where they fit, and optimizing data warehouses, what else can be done to optimize costs? At this point, it’s a good idea to allocate time and resources to forecasting your future data needs at regular intervals and to iterate on data strategy based on new learnings and use cases as data capabilities grow. The volume of data and the organizational use cases are never static and will change constantly over time. As a result, it is important to keep tabs on overarching goals holistically: tools and strategies that were selected because they were the most cost-effective option two years ago might since have increased in price to a point where it is no longer feasible to continue using them.
Forecasting data needs requires constant research, pooling information from teams across various business lines, and an ongoing effort to keep learning about the ever-changing data landscape, all to ensure that the data stack is as cost-efficient as possible. It also helps to monitor tooling prices (both increases and decreases), especially for tools that are new to the market.
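One simple starting point for such a forecast is a straight-line trend fitted to past monthly figures. The sketch below is an illustration under assumptions rather than a recommended model: it uses ordinary least squares on a hypothetical monthly series (storage volume or cost, for example), and it ignores seasonality, pricing changes, and new use cases, which the research described above is meant to capture.

```python
def linear_forecast(history, months_ahead):
    """Fit a least-squares line to `history` (one value per month, oldest
    first) and extend it `months_ahead` months past the last observation."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two months of history")

    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, history))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * (n - 1 + k) for k in range(1, months_ahead + 1)]


if __name__ == "__main__":
    # Hypothetical monthly storage volume in GB.
    volume_gb = [100, 120, 140, 160]
    print(linear_forecast(volume_gb, months_ahead=2))
```

Re-running a projection like this at regular intervals, and comparing it with what actually happened, is one way to turn forecasting into the iterative exercise described above.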
Implementing, testing, and revising your data strategy is integral to proving its value over time. There’s never a one-size-fits-all approach, and many levers can be pulled as you take on internal conversations surrounding strategy for your data infrastructure. You don’t have to start from scratch, either. Leverage what you have and improve upon your data initiatives with a focused and organized effort.
At Data Culture, we are passionate about meeting organizations where they are in their data maturity, helping unlock the value of their data, and strategizing optimal ways to implement, measure, and gain business value from Modern Data Stacks. From assisting in growing out data teams to guiding data infrastructure projects, our team of experts is positioned to assist organizations in navigating the data landscape to meet internal data goals that drive business decisions.