by Michael Distler, Senior Director of Product Marketing at Qlik
Like many of us hunkering down at home these days, I was recently browsing through my media streaming services looking for a movie to watch when I came across an interesting title: The Hummingbird Project. Turns out it’s very similar to the Michael Lewis book Flash Boys (the movie’s director says he was inspired by the book).
Both stories center on some entrepreneurs who decide they want to build a fiber optic connection in an exact straight line between a Midwestern city and New York. The idea was to reduce the transmission time for a stock trade from 17 to 13 milliseconds (or 16 milliseconds in the movie). Even though it cost hundreds of millions of dollars to build this link, they calculated the payoff was worth it.
This made me think about how Data Architects are struggling with this same issue but at a different scale. How can they set up an accessible data source that doesn’t take months or years to build? And how can they ensure that the new data source is kept current? Users are looking for the data to be up to the minute or even the last few seconds.
It starts with adopting a new approach to the overall problem: DataOps. We’ve written a number of blog posts about this emerging concept. DataOps builds on the methods of DevOps, which combines software development and IT operations to improve the velocity, quality, predictability and scale of software development and deployment. DataOps seeks to bring similar improvements to delivering data for analytics, enabling practices, processes and technologies for building and enhancing data pipelines to quickly meet business needs.
One example of where a DataOps approach could have helped is the early big data projects. Several studies have shown that very few of these projects provided a decent ROI or even any real business value. Most of these projects were run by IT and/or data engineers who focused almost exclusively on storing the data in Hadoop or an equivalent technology. Everyone was focused on putting the data into the store, not on how to take data out.
Not surprisingly, these massive data stores were then vastly under-utilized. As no one from the business or data consumer side was involved in defining the requirements, the collected data was either useless or indecipherable. A DataOps approach would have had IT first work closely with the business to define requirements and then take an iterative approach to make sure the initial collected data was meeting the business needs before the spigot was opened up.
Once there is a foundation of DataOps, there are some additional strategies to consider:
- Use change data capture (CDC). CDC technology can be used to continuously identify and propagate data changes as they occur. This means that as soon as a data change is detected on the source system, it’s immediately replicated to the target system. Using a CDC method that is agentless and/or log-based will minimize the performance impact on the source system.
- Automate the creation of data warehouses. Instead of employing the traditional methods of building and managing data warehouses using lengthy and manual ETL development efforts, utilize tools that can automatically generate ETL code and quickly apply updates, thus greatly accelerating both the initial warehouse design process and any subsequent changes.
- Automate the creation of data lakes. Like data warehouse creation, the process of creating and refining a data lake can also be a long and laborious project when using manual coding methods. One should look to automatically generate schemas and Hive Catalog structures for operational data stores and historical data stores. By automating data ingestion, schema creation and continual updates, organizations can realize faster time-to-value with their data lake investments.
- Build and employ an enterprise data catalog. Having one simplified view that can show every available data set makes it easy for data users to find, understand and utilize data from any enterprise repository. Also, if the catalog enables users to be self-service data consumers, that eliminates the need for IT to have to manually gather and prepare responses to the ever-growing number of data requests coming from the business.
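To make the CDC idea from the first strategy a bit more concrete, here is a minimal Python sketch of log-based change propagation. It is an illustration under stated assumptions, not how any particular product works: the `ChangeEvent` type, the `apply_changes` function, and the in-memory list standing in for a source system's change log are all hypothetical names invented for this example. The point it shows is that the target is kept current by replaying only the changes recorded after the last checkpoint, rather than re-reading the whole source table.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One entry in a hypothetical source change log (illustrative only)."""
    lsn: int                        # position of the change in the log
    op: str                         # "insert", "update", or "delete"
    key: Any                        # primary key of the affected row
    row: Optional[Dict[str, Any]]   # new row image; None for deletes


def apply_changes(target: Dict[Any, Dict[str, Any]],
                  log: List[ChangeEvent],
                  last_lsn: int) -> int:
    """Replay log entries newer than last_lsn onto target.

    Returns the new checkpoint (the highest log position applied),
    so the next call can pick up where this one stopped.
    """
    for event in log:
        if event.lsn <= last_lsn:
            continue  # already replicated in an earlier run
        if event.op == "delete":
            target.pop(event.key, None)
        elif event.op in ("insert", "update"):
            target[event.key] = dict(event.row or {})
        else:
            raise ValueError(f"unknown operation: {event.op!r}")
        last_lsn = event.lsn
    return last_lsn


# Example: four changes captured on the source, replayed onto an empty target.
source_log = [
    ChangeEvent(1, "insert", 101, {"name": "Ada", "city": "Chicago"}),
    ChangeEvent(2, "update", 101, {"name": "Ada", "city": "New York"}),
    ChangeEvent(3, "insert", 102, {"name": "Grace", "city": "Boston"}),
    ChangeEvent(4, "delete", 102, None),
]
replica: Dict[Any, Dict[str, Any]] = {}
checkpoint = apply_changes(replica, source_log, last_lsn=0)
```

Because the function only reads the log and tracks a checkpoint, calling it again with the returned checkpoint does nothing until new changes arrive. That is roughly why a log-based approach puts so little load on the source system: the source tables themselves are never queried.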
Want to learn more? Download the eBook: “Enterprise Architect’s Guide: Top 4 Strategies for Automating and Accelerating Your Data Pipeline”.