I blogged previously on my thoughts about what makes a good lakehouse serving layer - the functionality and features to consider when trying to identify the right serving tool for you. In this post I want to go a step further and do a direct comparison of some of the more obvious Azure-based lakehouse serving tools - Synapse Serverless, Databricks SQL, and Synapse Dedicated. I've decided to include Dedicated here, as although it defeats the purpose of the lakehouse paradigm, I'm still often asked whether it's suitable.

You'll notice I've put 2022 in the title of this blog. This is the type of thing that'll have a changing landscape as these tools develop further, and new tools emerge, so I wanted to create that distinction now to support future editions of this post.

In my previous post, I called out 5 main things to look out for when choosing a tool: Concurrency, Security, Performance, Integration, and Cost. Below I've produced a comparison chart across these 5 metrics for the tools in scope for this blog.

2022 Lakehouse Serving Tool Comparison Chart.

For reference, here's a link to Synapse Serverless' Polaris Engine White Paper.

The comparison chart above condenses a lot of information into a single table, but of course there are other things to consider: the maintenance involved with managing each of those tools, how arduous the setup and configuration is, and their unique abilities (or USPs).

Setup and Configuration

By setup and configuration, I mean all of the work involved with getting to the point you can actually serve data from the tool - excluding your specific security setup…

Dedicated

In order to serve out of a dedicated SQL pool, you would first need to ingest the data into that pool. This could potentially include creating credentials, external file formats, external data sources, and external tables, and using PolyBase or the COPY command. Once the data is in a dedicated pool, it may need merging from staging tables into serving tables using slowly changing dimension (SCD) techniques. That's quite a lot of configuration, plus you have to pay an additional ~£18 per TB of data stored inside the dedicated pool each month.

Serverless

Similar to Dedicated, but a little simpler. Here we need to provide a method of reading from the lake, which may again include creating credentials, external file formats, and external data sources - plus views over those entities from which to serve - but there's no PolyBase, no COPY, and no data storage. I've blogged previously about Azure Synapse Serverless Lake Access Patterns, so I won't go into any more detail on that here.

Databricks SQL

Assuming that you're using Databricks as your processing engine, you probably already have your tables saved within the Hive metastore, which makes them available for querying in Databricks SQL. If you're not using Databricks for your processing, then you'd need to mount your lake, create a database, and create your tables specifying the lake location to be read from.

Maintenance

Managing a Dedicated pool can be quite involved. You need to think about the distribution method used for each table to avoid skew, whether table replication is appropriate, and the more traditional maintenance that comes with indexing.

Aside from ensuring your lake is in order, there really isn't any maintenance involved with Serverless - that's the beauty of Polaris. If you're using the Delta format (you should be) then you'll want to apply vacuuming and optimising occasionally to keep things tip-top.

With Databricks SQL you have an endpoint/clusters to monitor and manage, which comes with some overhead. On top of that, you should again ensure your Delta files are in good shape by running a vacuum and optimise every now and then.

Unique Abilities

Dedicated doesn't really have any unique serving capabilities when compared to Synapse Serverless and Databricks SQL. The beauty of Serverless is the Polaris engine.
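To make the Dedicated ingestion step above concrete, here's a minimal T-SQL sketch using the COPY command to load Parquet files from the lake into a staging table. The storage account, path, and table names are hypothetical placeholders - adjust for your own lake layout and authentication method:

```sql
-- Hypothetical example: load staged Parquet files into a dedicated SQL pool.
-- Storage account, container, and table names are placeholders.
COPY INTO dbo.StageCustomer
FROM 'https://mylake.dfs.core.windows.net/curated/customer/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
);
```

The COPY command avoids the extra objects (external data source, file format, external table) that the PolyBase route requires, which is why it's generally the simpler of the two options.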
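The staging-to-serving merge mentioned for Dedicated might look something like the following - a simplified SCD Type 1 (overwrite-in-place) sketch, with hypothetical table and column names:

```sql
-- Hypothetical SCD Type 1 merge: update changed rows, insert new ones.
MERGE dbo.DimCustomer AS tgt
USING dbo.StageCustomer AS src
    ON tgt.CustomerKey = src.CustomerKey
WHEN MATCHED THEN
    UPDATE SET tgt.Name = src.Name, tgt.City = src.City
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerKey, Name, City)
    VALUES (src.CustomerKey, src.Name, src.City);
```

A Type 2 pattern (keeping history with effective dates) adds an expiry update plus insert per changed row, and is correspondingly more configuration to build and test.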
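For contrast, the Serverless setup described above can be as light as a view over OPENROWSET - no data movement, no storage. Again the lake path and view name are illustrative only:

```sql
-- Hypothetical Serverless setup: a view reading Parquet straight from
-- the lake via OPENROWSET - nothing is copied or stored in the pool.
CREATE VIEW dbo.vwCustomer AS
SELECT *
FROM OPENROWSET(
    BULK 'https://mylake.dfs.core.windows.net/curated/customer/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
```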
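And for the Databricks SQL case where your tables aren't already registered by a processing job, registering a table over an existing lake location is a couple of statements - database name and location here are hypothetical:

```sql
-- Hypothetical Databricks setup: register a Delta table over an
-- existing lake location so Databricks SQL can query it.
CREATE DATABASE IF NOT EXISTS serving;

CREATE TABLE IF NOT EXISTS serving.customer
USING DELTA
LOCATION 'abfss://curated@mylake.dfs.core.windows.net/customer';
```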
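The distribution and replication decisions called out under Dedicated maintenance are made at table-creation time. A rough sketch, assuming a large fact table joined on a customer key and a small dimension:

```sql
-- Hash-distribute a large fact table on a common join key to avoid skew...
CREATE TABLE dbo.FactSales
(
    SaleId      BIGINT NOT NULL,
    CustomerKey INT NOT NULL,
    Amount      DECIMAL(18, 2)
)
WITH (DISTRIBUTION = HASH(CustomerKey), CLUSTERED COLUMNSTORE INDEX);

-- ...and replicate a small dimension to every compute node.
CREATE TABLE dbo.DimRegion
(
    RegionKey  INT NOT NULL,
    RegionName NVARCHAR(100)
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED INDEX (RegionKey));
```

Getting these choices wrong (e.g. hash-distributing on a skewed column) is exactly the kind of ongoing design maintenance Serverless sidesteps.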
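Finally, the Delta "vacuum and optimise" housekeeping mentioned for both Serverless and Databricks SQL amounts to two commands in Databricks (table name hypothetical):

```sql
-- Compact small files into larger ones for better scan performance...
OPTIMIZE serving.customer;

-- ...then remove files no longer referenced by the table
-- (subject to the default 7-day retention window).
VACUUM serving.customer;
```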