May 9, 2019 | InBrief

Is data virtualization a panacea for the age-old data centralization problem: Pros and cons

Chances are your legacy enterprise data warehouse is expensive, does not scale, has limited domain coverage, and comes with a fragile ETL process that is costly to maintain and extend. Your recent data lake investment struggles to deliver the value it promised. You’ve probably heard about data virtualization and federation – is this the modern solution to the age-old problem?

A few years ago, Gartner announced the “Logical Data Warehouse” as a modern data warehousing architecture, with data virtualization technologies to enable it. This created confusion for many of our clients and prospects. Is data virtualization the best way to deliver analytics projects? Does the logical data warehouse completely replace the traditional centralized data warehouse? In March I attended Gartner’s 2019 Data and Analytics Summit and had the opportunity to connect with several analysts to clarify both the definition of the logical data warehouse and the applicability of data virtualization technology.

Let’s start by defining the data landscape

  • Logical data warehouse: According to Gartner’s Distinguished VP Analyst Mark Beyer, a logical data warehouse is a unified semantic layer covering the entire enterprise data universe. It is essentially an enterprise metadata layer that maintains consistent definitions of enterprise data assets and provides ways to access and query the data.

  • How data virtualization works: Data virtualization technologies provide centralized read/write SQL query access to multiple underlying data stores: flat files, relational databases, NoSQL data stores, and operational systems such as CRM or ERP via REST API endpoints (see the sketch after this list).

  • Data federation is similar to virtualization but provides read-only access.
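
From the client side, the appeal is that all of this looks like a single database. Below is a minimal sketch in Python, assuming a virtualization server exposed through an ODBC DSN named dv_server; the DSN, schema names, and tables are illustrative assumptions, not any specific vendor’s defaults.

  # A minimal client-side sketch, assuming a data virtualization server that
  # exposes an ODBC endpoint (DSN "dv_server"). The DSN, schemas, and table
  # names are hypothetical.
  import pyodbc

  conn = pyodbc.connect("DSN=dv_server")  # one connection to the DV layer
  cursor = conn.cursor()

  # "crm.customers" might live behind a REST API and "erp.orders" in a
  # relational database; the DV layer makes the join look like plain SQL.
  cursor.execute("""
      SELECT   c.customer_name, SUM(o.amount) AS total_spend
      FROM     crm.customers AS c
      JOIN     erp.orders    AS o ON o.customer_id = c.customer_id
      GROUP BY c.customer_name
  """)
  for row in cursor.fetchall():
      print(row.customer_name, row.total_spend)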

Does this mean that the traditional physical data warehouse is dead? Absolutely not.

Data virtualization excels in agile proof-of-concept applications. By eliminating the need to develop ETL pipelines and store data, it greatly accelerates data exploration and enables quick, iterative development of reports, dashboards, and analytical models. A data virtualization layer acts as an effective intermediary and translator between your BI or data science tool and source systems, including data warehouses. Here is how a typical data virtualization system works (a simplified sketch follows the list):

  • Receives and analyzes the user query

  • Splits up the query and pushes each individual component to its source system

  • Waits for the source queries to complete

  • Collects results from each source query; returned data is often cached on the DV server to help accelerate future queries

  • Aggregates/joins/unions the individual results and returns them to the user
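
The following is a simplified, hypothetical sketch of those five steps in Python – not any vendor’s actual engine. The Source class stands in for a real connector (a JDBC/ODBC driver, a REST client, a file reader), and the query plan that a real engine would derive by parsing SQL in step one is passed in directly here.

  # A toy model of the five steps above; Source and the query plan are stand-ins.
  from concurrent.futures import ThreadPoolExecutor
  from dataclasses import dataclass

  @dataclass
  class Source:
      name: str
      tables: dict  # table name -> list of row dicts

      def execute(self, subquery):
          # A real connector would push SQL down; here we just return rows.
          return self.tables[subquery]

  CACHE = {}  # step 4: keyed by (source, subquery) to accelerate repeat queries

  def run_subquery(source, subquery):
      key = (source.name, subquery)
      if key not in CACHE:
          CACHE[key] = source.execute(subquery)  # step 2: push down to source
      return CACHE[key]

  def run_virtual_query(plan):
      """plan: list of (source, subquery) pairs from the analyzer (step 1)."""
      with ThreadPoolExecutor() as pool:
          futures = [pool.submit(run_subquery, s, q) for s, q in plan]
          results = [f.result() for f in futures]  # steps 3-4: wait and collect
      # Step 5: combine partial results -- here a simple union of rows.
      return [row for part in results for row in part]

  # Example: union order rows held in two separate systems.
  legacy = Source("legacy_erp", {"orders": [{"id": 1, "amount": 250}]})
  cloud  = Source("cloud_crm",  {"orders": [{"id": 2, "amount": 125}]})
  print(run_virtual_query([(legacy, "orders"), (cloud, "orders")]))

The caching in step four is what makes repeated exploratory queries fast, and it also matters for the availability concern discussed below.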

Data virtualization tools isolate the complexities of connecting to multiple source systems and provide a consistent way of querying all enterprise data assets.

Data virtualization cons

  • Data virtualization relies on 100% availability of all source systems: if any one system goes down, queries that touch it fail. This is partially mitigated by data caching on the data virtualization servers (sketched after this list)

  • Data virtualization returns the latest data only and does not store the history of changes in source systems – which is the foundation of traditional data warehousing

  • Query performance is poor for complex, high-cardinality joins and aggregations across source systems

  • Performance impact on sources: pushing queries down to sources can degrade the performance of your CRM or ERP application

  • Data virtualization offers very limited options for managing data quality and master data management (MDM)

  • Data virtualization is unable to handle complex transformations, e.g. those requiring multiple passes and lookups
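
On the first con, the caching mitigation can be pictured as a simple fallback: serve the last cached result when the source is unreachable, within a staleness budget. Everything below – the cache shape, the one-hour policy, the ConnectionError – is an illustrative assumption, not how any particular product behaves.

  # Hypothetical sketch of the caching mitigation: if a source is down, fall
  # back to the last cached (stale) result rather than failing the query.
  import time

  CACHE = {}            # (source name, query) -> (timestamp, rows)
  MAX_STALENESS = 3600  # assumed policy: accept cached rows up to an hour old

  def query_with_fallback(source, query):
      key = (source.name, query)
      try:
          rows = source.execute(query)      # normal path: hit the live source
          CACHE[key] = (time.time(), rows)  # refresh the cache on success
          return rows
      except ConnectionError:
          cached = CACHE.get(key)
          if cached and time.time() - cached[0] < MAX_STALENESS:
              return cached[1]              # degrade gracefully to stale data
          raise                             # no usable cache: the query fails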

To summarize, data virtualization and federation platforms provide quick and effective ways to prototype and explore data, but they cannot fully replace a consolidated physical data warehouse in production applications.

For rapid prototyping, data preparation tools can be a viable alternative to data virtualization, addressing several of its shortcomings. At West Monroe we developed a unique tool that combines data preparation, virtualization, and data warehouse automation capabilities. Our RAP accelerator enables agile prototyping and development and, when you’re ready to deploy, seamlessly migrates your solution into a robust, fully managed ETL/DW pipeline. Interested in learning more? Visit our RAP page or contact me at vorlov@wmp.com for additional information.
