
A modern data warehouse platform acts as the centralized data reporting foundation for any data-driven organization. If you are not already building a data warehouse, chances are you are considering one, or you are maintaining an existing warehouse to centralize and tap into actionable data within your organization.
This whitepaper guides decision makers through the important factors in assessing data warehouse platforms and provides tangible metrics to support their decisions. The performance-per-dollar approach offers an effective way to evaluate platforms by focusing on two metrics that matter for every customer: performance and cost of ownership. These metrics tend to be most important for organizations because they quantify the operational cost of a platform at the required performance and scale. To make the comparison easier, we combined these into a single performance-per-dollar metric, described below.
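To make the idea concrete, performance per dollar can be thought of as throughput divided by hourly run cost. The sketch below is a minimal Python illustration of that calculation; the function name and the sample numbers are hypothetical placeholders, not figures from our benchmark.

```python
# Minimal sketch of a performance-per-dollar calculation.
# The numbers below are hypothetical placeholders, not benchmark results.

def performance_per_dollar(queries_completed: int,
                           elapsed_hours: float,
                           hourly_cost: float) -> float:
    """Throughput (queries/hour) normalized by hourly run cost."""
    throughput = queries_completed / elapsed_hours  # queries per hour
    return throughput / hourly_cost                 # queries per dollar

# Example: 1,200 benchmark queries finished in 2 hours on a $16/hour cluster.
print(performance_per_dollar(1200, 2.0, 16.0))  # 37.5 queries per dollar
```

Dividing by hourly cost rather than comparing raw runtimes is what allows deployments of very different sizes and prices to be ranked on a single scale.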
As a group of data and analytics practitioners, we wanted to see how vendor claims held up in practice and which turned out to be not so scientific. This is not a rigorous scientific study, but rather an exercise our data and analytics team undertook to make well-informed recommendations when helping organizations choose the data warehouse platform best suited to their needs.
For this comparison, we started by selecting four leading data warehousing platforms used by our data engineering practice at West Monroe to implement data solutions. This was a fairly narrowly focused analysis based on cost of performance; we are not claiming to have tested features comprehensively. These data engineering solutions include dashboarding, data warehousing, data source integration, and advanced analytics, including machine learning and AI. Such solutions require a foundational platform that offers a wide array of features alongside powerful performance.
To round out the list to five platforms, we selected the new, GPU-based Brytlyt platform. GPU computing is a fast-evolving field, and we wanted to gain experience with it to see how it compares with more traditional columnar databases.
For consistency, we chose to standardize on AWS cloud infrastructure where possible, minimizing any differences in infrastructure architectures.
Having all five contenders hosted in the cloud enables easy conversion of performance to hourly run cost, which we use to normalize observed performance results.
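The hedged sketch below shows why this normalization matters: raw throughput alone would favor the larger cluster, but dividing by its hourly price can reverse the ranking. All platform names, prices, and throughputs here are invented for illustration only.

```python
# Hypothetical illustration of cost-normalized comparison.
# Prices and throughputs are invented; they are not our measured results.
deployments = {
    "platform_a_small": {"queries_per_hour": 400.0, "usd_per_hour": 4.0},
    "platform_b_large": {"queries_per_hour": 900.0, "usd_per_hour": 16.0},
}

for name, d in deployments.items():
    ppd = d["queries_per_hour"] / d["usd_per_hour"]
    print(f"{name}: {ppd:.1f} queries per dollar")

# platform_a_small: 100.0 queries per dollar
# platform_b_large: 56.2 queries per dollar
# The larger cluster is faster in absolute terms but less cost-efficient.
```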
For each technology, we came up with four deployment sizes, from small to extra-large, in order to align contenders as closely as possible on cost per hour, compute, and storage capacity. The table below summarizes the sizing:
When performing queries on small datasets, we found Brytlyt to be dominant over the other technologies tested. Brytlyt's performance metrics were consistently two to three times higher than those of the other technologies. See details in Figure B below.
Performing queries on the medium-sized dataset pushed the GPU-based Brytlyt to its limits. At the largest server size, Brytlyt continued to deliver substantially higher performance than the other measured technologies. However, Brytlyt was unable to operate on this dataset at the smaller, lower-cost server sizes due to memory constraints. This can be seen in Figure C below.
Excluding Brytlyt, as seen in Figure D below, allows us to better see the performance of the other technologies. SQL Server shows diminishing returns on spend here, while the other technologies continue to show near-linear growth.
For the large datasets, we have no Brytlyt metrics because the dataset was too large to fit into a memory table and perform queries at any of our tested server sizes. Of the other technologies, Snowflake is clearly dominant here, achieving five to six times the performance of the others and continuing to scale in a linear fashion with spend. Redshift and Azure DW began with lower performance at low spend and scaled less per dollar as spend increased. SQL Server shows the lowest starting performance overall and flatlines completely beyond $8/hour (16 cores) as spending increases. This can be seen in Figure E.
Using our performance-per-dollar measure, Snowflake delivers four times the cost efficiency of Redshift and Azure DW.
Snowflake was, in general, the easiest platform to work with: no external tools are required, and loading, querying, and monitoring can all be done via a web-based UI.
Although we were unable to test Brytlyt with the largest dataset due to memory constraints, our results show clear benefits over the competition on a price-for-performance basis if you primarily process extremely complex or compute-intensive queries.
This was expected, as SQL Server is unable to leverage the MPP (massively parallel processing) scale of the other platforms.
As previously stated, these results are not opinion-based; they come from testing structured around cost of performance. By focusing on a single metric, we were able to uncover interesting results. In future work, we hope to expand on these results to include larger datasets as well as a wider array of query types, most notably JOIN performance. For this first pass, we focused on a fully flattened, denormalized dataset, as it has become our target model for use with RAP.
The table below illustrates the sizes of each platform that we selected for our testing, aligned into size groups for an apples-to-apples comparison. Results were further adjusted by total hourly cost.