Azure Stack and Azure Arc for data services

For those companies that can’t yet move to the cloud, have certain workloads that can’t move to the cloud, or have limited to no internet access, Microsoft has options to build your own private on-prem cloud via Azure Stack and Azure Arc. I’ll focus this blog on using these products to host your databases.

Azure Stack is an extension of Azure that provides a way to run apps and databases in an on-premises environment and deliver Azure services via three options:

Azure Stack Hub: Run your own private, autonomous cloud—connected or disconnected with cloud-native apps using consistent Azure services on-premises. Azure Stack Hub integrated systems are comprised in racks of 4-16 servers built by trusted hardware partners and delivered straight to your datacenter. Azure Stack Hub is built on industry standard hardware and is managed using the same tools you already use for managing Azure subscriptions. As a result, you can apply consistent DevOps processes whether you’re connected to Azure or not. The Azure Stack Hub architecture lets you provide Azure services for remote locations with intermittent connectivity or disconnected from the internet. You can also create hybrid solutions that process data locally in Azure Stack Hub and then aggregate it in Azure for additional processing and analytics. Finally, because Azure Stack Hub is installed on-premises, you can meet specific regulatory or policy requirements with the flexibility of deploying cloud apps on-premises without changing any code. See Azure Stack Hub overview

Azure Stack Edge: Get rapid insights with an Azure-managed appliance using compute and hardware-accelerated machine learning at edge locations for your Internet of Things (IoT) and AI workloads. Think of it as a much smaller version of Azure Stack Hub that uses purpose-built hardware-as-a-service such as Pro GPU, Pro FPGA, Pro R, and Mini R. The Mini is designed to work in the harshest environment conditions, supporting scenarios such as tactical edge, humanitarian and emergency response efforts. See Azure Stack Edge documentation

Azure Stack HCI (preview): A hyperconverged infrastructure (HCI) cluster solution that hosts virtualized Windows and Linux workloads and their storage in a hybrid on-premises environment. Think of it as a virtualization fabric for VM or kubernetes hosting – software only to put on your certified hardware. See Azure Stack HCI solution overview

These Azure Stack options are almost all VMs/IaaS, with no PaaS options for data services such as SQL Database (the only data service available is SQL Server in a VM). It is integrated certified hardware and software run by Microsoft, just plug in and go. For support, there is “one throat to choke” as the saying goes. It is a great option if you are disconnected from Azure. It extends Azure management and security to any infrastructure and provides flexibility in deployment of applications, making management more consistent (a single view for on-prem, clouds, and edge). It brings the Azure fabric to your own data center but allows you to use your own security requirements. Microsoft orchestrates the upgrades of hardware, firmware, and software, but you control when those updates happen.

Azure Arc is a software only solution that can be deployed on any hardware, including Azure Stack, AWS, or your own hardware. With Azure Arc and Azure Arc-enabled data services (preview) you can deploy Azure SQL Managed Instance (SQL MI) and Azure Database for PostgreSQL Hyperscale to any of these environments, which requires kubernetes. It can also manage SQL Server in a VM by just installing an agent on the SQL server (see Preview of Azure Arc enabled SQL Server is now available). Any of these databases can then be easily moved from your hardware to Azure down the road. It allows you to extend Azure management across your environments, adopt cloud practices on-premises, and implement Azure security anywhere you choose.

This allows for many options to use Azure Arc on Azure Stack or on other platforms (click to expand):

Some features about Azure Arc:

  • It can be used to solve for data residency requirements (data sovereignty)
  • It is supported in disconnected and intermittently connected scenarios such as air gapped private data centers, cruise ships that are off the grid for multiple weeks, factory floors that have occasional disconnects due to power outages, etc.
  • Customers can use Azure Data Studio (instead of the Azure Portal) to manage their data estate when operating in a disconnected/intermittent connected mode
  • Could eventually support other products like Azure Synapse Analytics
  • Can use larger hardware solutions and more hardware tiers then what is available in Azure, but have to do your own HA/DR
  • You are not charged if you shut down SQL MI, unlike in Azure, as it’s your hardware, where in Azure the hardware is dedicated to you even if you are not using it
  • With Arc you are managing the hardware, but with Stack Microsoft is managing the hardware
  • Can use modern cloud billing models on-premises for better cost efficiency
  • With Azure Arc enabled SQL Server, you can use the Azure Portal to register and track the inventory of your SQL Server instances across on-premises, edge sites, and multi-cloud in a single view. You can also take advantage of Azure security services, such as Azure Security Center and Azure Sentinel, as well as use the SQL Assessment service
  • Azure Stack hub provides consistent hardware, but if you use your own hardware you have more flexibility and possibly cheaper hardware costs

These slides covers the major benefits of Azure Arc and what the architecture looks like:

Looking at the differences when you are connected directly vs connected indirectly (i.e. an Arc server is not connected to the Internet so must coordinate with a server that is connected):

Here is what an Azure Arc data services architecture looks like:

Some of the top use cases we see with customers using Azure Stack and/or Azure Arc:

  • Cloud-to-cloud failover
  • On-prem databases with failover to cloud
  • Easier migration: Deploy locally, then flip a switch to go to cloud

This slide provides details on the differences with SQL databases (click to expand):

More info:

Understanding Azure Arc Enabled SQL Server

What is Azure Arc Enabled SQL Managed Instance

Posted in Azure Arc, Azure SQL Database, Azure Stack, SQLServerPedia Syndication | Leave a comment

External tables vs T-SQL views on files in a data lake

A question that I have been hearing recently from customers using Azure Synapse Analytics (the public preview version) is what is the difference between using an external table versus a T-SQL view on a file in a data lake?

Note that a T-SQL view and an external table pointing to a file in a data lake can be created in both a SQL Provisioned pool as well as a SQL On-demand pool.

Here are the differences that I have found:

  • Overall summary: views are generally faster and have more features such as OPENROWSET
  • Virtual functions (filepath and filename) are not supported with external tables which means users cannot do partition elimination based on FILEPATH or complex wildcard expressions via OPENROWSET (which can be done with views)
  • External tables can be shareable with other computes, since their metadata can be mapped to and from Spark and other compute experiences, while views are SQL queries and thus can only be used by SQL On-demand or SQL Provisioned pool
  • External tables can use indexes to improve performance, while views would require indexed views for that
  • Sql On-demand automatically creates statistics both for a external table and views using OPENROWSET. You can also explicitly create/update statistics on files on OPENROWSET. Note that automatic creation of statistics is turned on for Parquet files. For CSV files, you need to create statistics manually until automatic creation of CSV files statistics is supported
  • Views give you more flexibility in the data layout (external tables expect the OSS Hive partitioning layout for example), and allow more query expressions to be added
  • External tables require an explicit defined schema while views can use OPENROWSET to provide automatic schema inference allowing for more flexibility (but note that an explicitly defined schema can provide faster performance)
  • If you reference the same external table in your query twice, the query optimizer will know that you are referencing the same object twice, while two of the same OPENROWSETs will not be recognized as the same object. For this reason in such cases better execution plans could be generated when using external tables instead of views using OPENROWSETs
  • Row-level security (Polybase external tables for Azure Synapse only) and Dynamic Data Masking will work on external tables. Row-level security is not supported with views using OPENROWSET
  • You can use both external tables and views to write data to the data lake via CETAS (this is the only way either option can write data to the data lake)
  • If using SQL On-demand, make sure to read Best practices for SQL on-demand (preview) in Azure Synapse Analytics

I often get asked what is the difference in performance when it comes to querying using an external table or view against a file in ADLS Gen2 vs. querying against a highly compressed table in a SQL Provisioned pool (i.e. managed table). It’s hard to quantify without understanding more about each customers scenario, but you will roughly see a 5X performance difference between queries over external tables and views vs. managed tables (obviously, depending on the query, that will vary but that’s a rough number – could be more than 5X in some scenarios). A few things that contribute to that: in-memory caching, SSD based caches, result-set caching, and the ability to design and align data and tables when they are stored as managed tables. You can also create materialized views for managed tables which typically bring lots of performance improvements as well. If you are querying Parquet data, that is in a columnstore file format with compression so that would give you similar data/column elimination as what managed SQL clustered columnstore index (CCI) would give, but if you are querying non-Parquet files you do not get this functionality. Note that for managed tables, on top of performance, you also get a granular security model, workload management capabilities, and so on (see Data Lakehouse & Synapse).

Posted in Azure Data Lake, Azure Synapse Analytics/SQL DW, SQLServerPedia Syndication | 4 Comments

Top Azure Synapse Analytics and Power BI questions

I have listed below the top questions I see from customers about Azure Synapse Analytics and Power BI, and the blogs that I have written that try to answer each question. I have also added these questions to my blogs FAQ page that also includes “Modern data warehouse” and “SQL Server migration to Azure”, which I have updated.

Azure Synapse Analytics

Below are the top questions I am seeing from customers about Azure Synapse Analytics (formally called Azure SQL Data Warehouse), and the blogs that I have wrote that try to answer each question:

  1. What is the difference between the GA version of Synapse and the one in public preview? Azure Synapse Analytics confusion
  2. What are the newest features for the SQL Pool? Azure Synapse Analytics new featuresMicrosoft Ignite Announcements 2020
  3. What are the performance options when using Synapse and Power BI? Azure Synapse Analytics & Power BI performance
  4. Can I run Power BI dashboards against Synapse? Azure Synapse Analytics & Power BI concurrency
  5. Can you explain the new SQL on-demand in Synapse? SQL on-demand in Azure Synapse Analytics
  6. What is a “Data Lakehouse”? Data Lakehouse & Synapse
  7. What are all the ways to query data in Synapse? Query options in Azure Synapse Analytics
  8. Are there samples to use in Synapse? Azure Synapse Analytics overlooked features
  9. How can I create federated queries in Synapse? Synapse and federated queries

Power BI

Below are the top questions I am seeing from customers about Power BI, and the blogs that I have wrote that try to answer each question:

  1. What are Composite models and Aggregations? Power BI new feature: Composite models
  2. What are Dataflows? Can I clean data with Power BI? Power BI: DataflowsWhere should I clean my data?
  3. What is the Microsoft Power Platform? The Microsoft Power Platform
  4. Can Power BI access data in ADLS Gen2? Ways to access data in ADLS Gen2
  5. What are performance options when using Power BI against Synapse? Azure Synapse Analytics & Power BI performance
  6. Can I use Power BI dashboards against Synapse? Azure Synapse Analytics & Power BI concurrency
  7. How can I stream data into Power BI and do real-time updates? Power BI Real-time Streaming
  8. What are the various type of security options in Power BI? Power BI Security
  9. What are all the ways to share dashboards and reports in Power BI? Power BI Sharing
  10. What are the ways to monitor performance and usage in Power BI? Monitoring Power BI
  11. What were the latest announcements about Power BI at Ignite? Microsoft Ignite Announcements 2020
  12. What are other questions you hear customers asking? Power BI FAQ

Posted in Azure Synapse Analytics/SQL DW, Power BI, SQLServerPedia Syndication | 4 Comments

Synapse and federated queries

As I was finishing up a demo script for my presentation at the SQL PASS Virtual Summit on 11/13 (details on my session here), I wanted to blog about part of the demo that shows a feature in the public preview of Synapse that is frankly, very cool. It is the ability to query data as it sits in ADLS Gen2, a Spark table, and Cosmos DB and join the data together with one T-SQL statement using SQL On-demand (also called SQL serverless), hence making it a federated query (also known as data virtualization or a logical data warehouse). The beauty of this is you don’t have to first write ETL to collect all the data into a relational database in order to be able to query it all together, and don’t have to provision a SQL pool, saving costs. Further, you are using T-SQL to query all of those data sources so you are able to use a reporting tool like Power BI to see the results (or any tool that supports T-SQL). Synapse is the tool to easily create that virtual layer on top of the data.

The queries are also fast as SQL On-demand will push-down queries from the front-end to the back-end nodes (which contain a SQL Server engine) with those back-end nodes sitting next to the storage (this is done via POLARIS: The Distributed SQL Engine in Azure Synapse, which is being used for SQL On-demand pools and eventually for SQL Provisioned pools). Note there is no cache yet in SQL On-demand so the queries won’t run faster after the first run.

My demo queries tweets in parquet files in ADLS Gen2. Users or Cities can be banned and the query has to return the tweets only from non-banned entities. The list of banned users are in a Spark table and the banned cities are in a Cosmos DB table. Now on to my demo:

Working inside Azure Synapse Studio, first, I wrote a T-SQL view that queried all the parquet files in an ADLS Gen2 folder (the key is the OPENROWSET function):

CREATE VIEW [dbo].[vFQ_ADLSTable] AS
SELECT
  ADLSTable.*
FROM
  OPENROWSET(
  BULK 
  'https://asaexpdatalakejs1234.dfs.core.windows.net/twitterdata/*.parquet',
  FORMAT='PARQUET'
  ) AS ADLSTable

Notice how I am using a wildcard in the file path to query all parquet files in the folder instead of just one.

Then I wrote a T-SQL view that queried a Spark table called bannedusers in the Spark database users. Even though it is a Spark table and the Spark cluster is not running, I am able to query this table using SQL On-demand because of a feature in Synapse that copies the metadata for the Spark table to SQL On-demand so you can use it outside of Spark (see Azure Synapse Analytics shared metadata). Plus, when a Spark table is created it actually stores the data in the Synapse primary storage area (in my case it is in the folder asaexpworkspacejs1234/tempdata/synapse/workspaces/
serrademoworkspace/warehouse/users.db/bannedusers
) so there is no need to use the Spark engine as SQL On-demand can directly query that storage location:

CREATE VIEW [dbo].[vFQ_SparkTable] AS
SELECT 
  SparkTable.* 
FROM users.dbo.bannedusers AS SparkTable

In both views above the data is read directly from storage, in parallel, and filtering is done at the SQL engine nodes themselves. Also, in some cases the data being read is filtered – for example, eliminating some files early on based on partition elimination, and in case of Parquet also skipping reading some column segments where possible, and in the future using such technologies as Query Acceleration for ADLS.

Then I wrote a T-SQL view that queries a database in Cosmos DB called CosmosIOTDemo and the container bannedcities using Azure Synapse Link for Cosmos DB (see demo here):

CREATE VIEW [dbo].[vFQ_CosmosTable] AS
SELECT
  CosmosTable.*
FROM OPENROWSET (
    'CosmosDB', N'account=synapselinkdemoserra;database=CosmosIOTDemo;region=eastus;key=xxx',bannedcities
    ) with ( id varchar(36), city varchar(1), cityname varchar(30), banned varchar(1) ) AS CosmosTable

When using the Azure Synapse Link for Cosmos DB, the SQL on-demand compute engine pushes down predicates to the remote, columnar store of Cosmos DB (aka Analytical Store) and returns the results to the Synapse compute layer.

Finally, I created a view using all three above views to join the data together. The parquet files in ADLS Gen2 are twitter feeds, and the Spark table and Cosmos DB table are reference tables that list the users and cities in the twitter feeds that should be excluded from the query (they have to be in the table with the banned field set to Y to not show in the query). While the twitter feed was only about 4000 rows, this query only took 2-3 seconds (you may have to wait an additional 2-3 seconds for the SQL On-demand pool to warm up the first time):

CREATE VIEW [dbo].[vFQ_Combined] AS
SELECT 
    t1.*, 
    t2.Banned as UserBanned, 
    t3.Banned as CityBanned
FROM 
    vFQ_ADLSTable t1 
        LEFT JOIN vFQ_SparkTable t2
            ON (t1.UserName COLLATE SQL_Latin1_General_CP1_CI_AS = t2.username)
        LEFT JOIN vFQ_CosmosTable t3
            ON (t1.City = t3.cityname)
WHERE (t2.banned is NULL or t2.banned = 'N')
AND (t3.banned is NULL or t3.banned = 'N')

Then I went to the Develop hub, and under Power BI chose “Power BI datasets” (first you have to link to a Power BI workspace), chose “New Power BI dataset”, chose my SQL On-demand pool, and downloaded the Power BI Desktop Source (.pbids) file that has the connection info to the SQL On-demand database. Once I clicked on that download file, it opened up Power BI Desktop and allowed me to choose the combined query vFQ_Combined. You can choose to import the data or use DirectQuery to get the results in real-time as any file that is dropped into the ADLS Gen2 folder along with the other parquet files will be read.

It appears to the end user that the data is stored in a relational database because the view is creating an abstract layer with the metadata (although because of automatic schema inference you don’t even need to specify the field layouts for a file in ADLS Gen2 or a Spark table).

Data in ADLS Gen2 can be moved to different access tiers to save costs, and eventually when SQL On-demand supports querying the relational SQL Provisioned pool you can archive out older relational data to ADLS Gen2 and use a query on that storage and combine it with a query on the relational database to also save costs and improve performance within the relational database.

As I said, very cool!

More info:

Direct Query your Data Lake Files using Azure Synapse Analytics and Power BI

Posted in Azure Cosmos DB, Azure Data Lake, Azure Synapse Analytics/SQL DW, SQLServerPedia Syndication | 6 Comments

Microsoft Ignite Announcements 2020

Microsoft Ignite has always announced many new products and new product features, and this year was no exception. Many exciting announcements, and below I list the major data platform related announcements:

Usage-based optimization with Azure Synapse and Power BI: A new feature that analyzes usage patterns in Power BI and shares the information with Synapse is now available to improve query performances. Synapse automatically creates a materialized view optimized for Power BI users — greatly accelerating the speed of a query performance. More info

Apache Spark for Azure Synapse In-cluster Caching and Shuffle Service (Preview): Caching and shuffle are two of the components of infrastructure for Apache Spark that have the greatest impact on performance. These new services, which are written from scratch, allow the optimization of performance for these components on modern hardware and operating systems. The service is enabled for Apache Spark Pools in Azure Synapse today.

High throughput output from Stream Analytics to Azure Synapse Analytics (General Availability): ​Azure Stream Analytics jobs now have the ability to output to an existing SQL pool table in Azure Synapse Analytics, and can process throughput rates even higher than 200 MB/s. This will support the most demanding real-time analytics and hot-path data processing needs of Azure customers who depend on their Data Warehouse for workloads such as reporting and dashboarding.

Azure AD authentication features for Azure SQL DB, Azure Synapse Analytics, and Azure SQL Managed Instance: ​Three new features using Azure Active Directory (Azure AD) authentication are currently in preview for Azure SQL Database, Azure Synapse Analytics, and Azure Managed Instance. These features help automate user creation using Azure AD applications and allow individual Azure AD guest users to be created in SQL Database, Azure Synapse Analytics, or SQL Managed Instance. First, support is being added for Azure AD user creation in SQL Database and Azure Synapse Analytics on behalf of Azure AD applications. This functionality is useful in the Azure AD application automation processes where Azure AD objects are created and maintained in SQL Database and Azure Synapse Analytics without human interaction. Second, users can use cloud groups to manage Directory Readers role assignment in Azure AD. This change will promote ease of use when setting up SQL Database, Azure Synapse Analytics, and SQL Managed Instance to create users. Finally, Azure AD guest users can now be created as database users and set as Azure AD admin without the need to first add them as members of a group created in Azure AD. These features extend existing functionality, remove user limitations, and provide customers with greater ease of use when setting up the SQL Database, Azure Synapse Analytics, or SQL Managed Instance.

New Azure Cognitive Services and capabilities in preview: ​​In the Decision category, we have the new Metrics Advisor service that proactively monitors metrics and diagnoses issues so organizations can protect their growth engine, from sales revenue to manufacturing. Built on top of Anomaly Detector, Metrics Advisor can be applied to business metric monitoring, AI Ops, and predictive maintenance scenarios.  See an example in action. In the Vision category, the new spatial analysis feature in Computer Vision helps organizations maximize the value of their physical spaces by understanding peoples movements in near-real time – especially critical these days in helping organizations reopen safely by supporting social distancing and other health compliance measures. It comes available in a container to run at the edge. Here’s a walkthrough of how customers are applying it. Additionally for containers, the following are available in preview: for Speech there are automatic language detection and neural text-to-speech containers and for Computer Vision, the Read 3.0 and 3.1 containers for text extraction.

Announcing Azure SQL Edge now in general availability: ​Azure SQL Edge, which brings the most secure Microsoft SQL data engine to Internet of Things (IoT) gateways and edge devices, is now available. Optimized for IoT workloads, Azure SQL Edge supports built-in data streaming, storage, and artificial intelligence packed into a small footprint container that works in connected, disconnected, and semi-connected environments where local storage and analytics are important. You can reduce cost, drive innovation, and improve operational efficiencies by using Azure SQL Edge to stream, store, and analyze IoT data. You simply need to develop your application once and deploy it anywhere across the edge, your datacenter, and Azure.

Hosting catalog databases for all supported versions of SSRS in Azure SQL Managed Instance: Now in general availability, Azure SQL Managed Instance can host catalog databases for all supported versions of SQL Server Reporting Services (SSRS). Last year, SQL Server 2019 Reporting Services introduced native support for hosting catalog databases in SQL Managed Instance. Now you can also use SQL Managed Instance to host catalog databases for earlier supported versions of SSRS. This is especially useful for fast migration of the existing solutions to SQL Managed Instance, without the need to test and adopt the latest version of SSRS. It also helps you get quick business benefits and then modernize further at your own pace. To learn how to configure your SQL Managed Instance to host catalog databases for earlier supported versions of SSRS, visit the Tech Community blog.

Major performance improvements for Azure SQL Managed Instances: We are announcing a set of major performance improvements for Azure SQL Managed Instances, which enable you to migrate your more performance-hungry database workloads to Azure SQL Managed Instance. These improvements include better transaction log write throughput for general purpose and business critical instances and superior data/log IOPS for business critical instances. Increased performance predictability and stability for general purpose service tier through better integration and enhanced tempdb performance are also included. These improvements are automatically enabled for all existing and future Azure SQL Managed Instances at no extra charge, making Azure SQL Managed Instance the best fully-managed database platform for your mission-critical workloads. Learn more about major performance improvements for Azure SQL Managed Instances

Machine Learning on Azure SQL Managed Instance in preview: Machine Learning Services with support for R and Python languages now include preview support on Azure SQL Managed Instance. When using Machine Learning Services with R or Python support in Azure SQL Managed Instance, you can run R and Python scripts to do data preparation and general purpose data processing, train machine learning models in database, and deploy your models and scripts into production in stored procedures. Azure SQL Managed Instance

Configurable backup storage redundancy options for Azure SQL Database coming soon: By default, all backups taken on Azure SQL Database are stored on read-access geo-redundant storage (RA-GRS) blobs that are being replicated in paired regions, ensuring data is protected in case of planned and unplanned events. Preview of configurable backup storage redundancy options for Azure SQL DB is coming soon to South East Asia, with additional regions expected in the coming months. This feature will provide you with more flexibility and choice to pick between locally redundant storage (LRS), zone redundant storage (ZRS) or geo redundant (RA-GRS) storage blobs for their backups. As additional regions are launched in preview, there will be more backup storage redundancy options available to use to meet your data residency requirements or minimize backup costs. You will be able to easily configure backup storage redundancy in the Azure portal. For more information, visit our Tech Community blog.

Azure Synapse Link now supports Azure Cosmos DB API for Mongo DB (Preview): Azure Synapse Link for Azure Cosmos DB now supports API for Mongo DB in preview. You can analyze the data in Azure Cosmos DB API for Mongo DB collections, using Synapse Apache Spark or Synapse SQL serverless, without impacting the performance of your transactional workloads. Azure Synapse Link, announced earlier this year, creates a tight integration between Azure Cosmos DB and Azure Synapse Analytics. You can directly access your Azure Cosmos DB data from Azure Synapse Analytics and run near real-time business intelligence, analytics, and machine learning pipelines. This integration enables you to build cloud-native HTAP (Hybrid transactional/analytical processing) solutions to perform no-ETL (extract, transform, and load) analytical processing on operational data in Azure Cosmos DB.

Azure Synapse Link for Azure Cosmos DB: SQL serverless runtime support coming soon: ​Azure Synapse Link will soon have preview support for querying Azure Cosmos DB data with Synapse SQL serverless. In the coming months, you will be able to run analytical T-SQL queries over data in Azure Cosmos DB in place within seconds. These queries can be used to build rich near real-time dashboards using Power BI integration with Azure Synapse Analytics. This new feature enhances the capability of Azure Synapse Link and will give you a wide range of business intelligence and ad-hoc querying tools via the T-SQL interface. Azure Synapse Link, announced earlier this year, creates a tight integration between Azure Cosmos DB and Azure Synapse Analytics. This gives you direct access to Azure Cosmos DB data from Azure Synapse Analytics to run near real-time business intelligence, analytics, and machine learning pipelines. This integration enables you to build cloud-native HTAP (Hybrid transactional/analytical processing) solutions to perform no-ETL (extract, transform, and load) analytical processing on operational data in Azure Cosmos DB, without impacting the performance of your transactional workloads. Azure Synapse Analytics will soon support Azure Synapse Link with Synapse Apache Spark and Synapse SQL serverless.

Serverless offer for Azure Cosmos DB coming soon on all APIs: ​Azure Cosmos DB serverless is in preview for the Core (SQL) API and will soon be available in preview for all Azure Cosmos DB APIs, including MongoDB, Cassandra, Gremlin, and Table. This offers NoSQL developers a database operations option with zero capacity planning or management required. As a consumption-based option with no minimum, serverless could also significantly lower the entry price for applications. It is ideally suited for small and medium-sized workloads that do not require steady throughput, receive requests sporadically or in short bursts, and have moderate performance requirements. Learn more.

Enhanced management experience for Azure SQL Managed Instance: ​Management operations are an essential part of Azure SQL Managed Instance, covering create, update, and delete scenarios. Most of the management operations In SQL Managed Instance are long-running but did not have operation progress visible. Through the introduction of a new CRUD API version, the SQL Managed Instance resource is now visible from when the create request is submitted. In addition, the new OPERATIONS API adds the ability to monitor management operations, see operation steps, and take dependent actions based on operation progress. Learn More

Transactionally consistent database copy for Azure SQL Database Hyperscale tier: Now in preview, the Hyperscale service tier in Azure SQL Database supports generating a transactionally consistent database copy to either the same or different logical server or region, similar to database copy already supported in other service tiers. This enables scenarios where Hyperscale databases have to be copied for various reasons, such as development and testing. If a database copy is created within the same region, it is a fast operation regardless of data size, similar to a point in time restore in Hyperscale. A cross-region database copy is a slower size of data operation. If you require support for database copy in your workflow, you can now take advantage of unique Hyperscale capabilities such as support for very large databases, fast scaling, and multiple replicas. Learn how to create a database copy and visit our Hyperscale database copy blog for more details.

Hyperspace, an indexing subsystem for Apache Spark™, is now open source: For Microsoft’s internal teams and external customers, we store datasets that span from a few GBs to 100s of PBs in our data lake. The scope of analytics on these datasets ranges from traditional batch-style queries (e.g., OLAP) to explorative ”finding the needle in a haystack” type of queries (e.g., point-lookups, summarization). Resorting to linear scans of these large datasets with huge clusters for every simple query is prohibitively expensive and not the top choice for many of our customers, who are constantly exploring ways to reducing their operational costs – incurring unchecked expenses are their worst nightmares. Over the years, we have seen a huge demand for bringing indexing capabilities that come de facto in the traditional database systems world into Apache Spark™. Today, we are making this possible by releasing an indexing subsystem for Apache Spark called Hyperspace – the same technology that powers indexing within Azure Synapse Analytics.

New MERGE command for Azure Synapse Analytics: ​The new MERGE command in Azure Synapse Analytics allows customers to insert, update, or delete data from one table based on the differences discovered in another table. With the MERGE command, the previous multi-step process to synchronize two tables in a dedicated SQL pool is now streamlined into a single step, thus reducing the data operation cost and improving the user experience. Learn More

COPY command now generally available in Azure Synapse Analytics: The COPY command feature in Azure Synapse Analytics provides users a simple, flexible, and fast interface for high-throughput data ingestion for SQL workloads. With this announcement, we’ve added performance improvements with file splits, ease of use with Parquet auto-schema discovery and complex data type support, and the support of the COPY command within our data integration partner ecosystem including: Databricks, Informatica, Streamsets, Panoply, and Talend. Learn More

Column-level encryption for Azure Synapse Analytics: ​Column-level encryption (CLE) helps customers implement fine-grained protection of sensitive data within a table (server-side encryption). With CLE, customers gain the ability to use different protection keys for columns in tables with each key having its own access permissions. The data in CLE enforced columns is encrypted on disk (and remains encrypted in memory) until the DECRYPTBYKEY function is used to decrypt it. This feature applies to dedicated SQL pools in Azure Synapse Analytics. Learn More

Azure Databricks – Delta engine with Photon: Announced the preview of Photon powered Delta engine on Azure Databricks – fast, easy, and collaborative Analytics and AI service. Built from scratch in C++ and fully compatible with Spark APIs, Photon is a vectorized query engine that leverages modern CPU architecture along with Delta Lake to enhance Apache Spark 3.0’s performance by up to 20x. As organizations worldwide embrace data-driven decision-making, it has become imperative for them to invest in a platform that can quickly analyze massive amounts and types of data. However, this has been a challenge. While storage and network performance have increased 10x, CPU processing speeds have only increased marginally. This leads to the question if CPUs have become the bottleneck, how can we achieve the next level of performance? The answer with Photon lies in greater parallelism of CPU processing at both the data-level and instruction-level. Photon powered Delta Engine is a 100% Apache Spark-compatible vectorized query engine designed to take advantage of modern CPU architecture for extremely fast parallel processing of data. Read the blog to learn more.

Azure Arc enabled data services: Announcing Azure data services anywhere, a new capability enabled by Azure Arc that allows customers to run Azure data services on-premises, multi-cloud and edge, using Kubernetes on the hardware of choice. Azure Arc enabled SQL Managed Instance and Azure Arc enabled PostgreSQL Hyperscale are available in preview now, with other services to come over time. Customers can now deploy a variety of data services, with access to latest innovation, cloud benefits like automation and elastic scale, unified management across all data assets in hybrid environments, and unique Azure security and governance capabilities like Azure Security Center, Policies and Role-based Access Control for on-premises data workloads. Best of all, customers can use modern cloud billing models on-premises for better cost efficiency. Learn more about Azure Arc enabled data services

Azure ML Ignite 2020 announcements: ​Azure Machine Learning Enterprise Edition (preview) capabilities are merging into Basic Edition on September 22, 2020.  With these enhanced capabilities included in Basic Edition, the Enterprise Edition of Azure Machine Learning will retire on January 1, 2021.

Designer

  • No-code drag and drop experience for building and deploying ML models

AutoML UI

  • Automatically build and deploy predictive models using the no-code UI or through a code-first notebooks experience.

ML Labeling Assist

  • The machine learning assisted labeling feature lets users trigger automatic machine learning models to accelerate the labeling task.

Advanced RBAC (preview)

  • AzureML operation level RBAC will allow customers to set custom roles, or re-use pre-built roles to control specific operations for the individual users in a workspace.

mlflow updates (preview)

  • MLflow projects support for simplified job submission from local to cloud
  • Mlflow model registry support
  • Model deployment support (model from MLflow model registry and deploy it to ACI or AKS)
  • Expanded support for MLflow experimentation UI

Workspace Private Link (preview)

A network isolation feature that enables you to access Azure ML over a private IP in your virtual network (VNet).

Azure SQL Database – Cross-SQL MI distributed Tx – Preview: The distributed database transactions spanning multiple Azure SQL Managed Instances will be added soon to enable frictionless migration of existing applications, as well as development of modern multi-tenant applications relying on vertically or horizontally partitioned database architecture. By utilizing distributed transactions once public preview is announced, customers will be able to save time when migrating existing applications that require this capability as it eliminates the need to change application code and to perform extensive testing. If customers develop new applications, they will be able to benefit from partitioning data into multiple databases to overcome current sizing limitations of Azure SQL Managed Instance while utilizing distributed transactions to keep partitioned data in a consistent state. Once preview is announced, two methods of running distributed transactions will be supported using BEGIN DISTRIBUTED TRANSACTION statement from Transact-SQL code and using TransactionScope class from .NET code.

Power BI announcements from Ignite:

(many of these features are talked about at Driving a data culture in a world of remote everything)

  • Power BI Premium will soon be available on a per-user basis​.  Individuals will be able to upsize their per-seat Professional licenses to Premium. During the preview period, the upgrade carries no additional cost. See Answering your questions around the new Power BI Premium per user license
  • Power BI Premium is getting faster, and is gaining an autoscale capability that, when enabled, allows Microsoft to provision additional “V-cores” (virtual CPU cores) to the customer’s cloud tenant for periods of 24 hours, when overloads are detected. The v-cores are automatically removed during idle periods​. See Announcing the Upcoming Evolution of Power BI Premium to enterprise markets and beyond
  • Preview of “smart narratives,” an augmented analytics feature which provides plain-English summarizations of the data in a report, either on a per-visualization or page-wide basis. The narratives automatically update when data is filtered or drilled down upon, and the narratives are editable, both in terms of formatting and for insertion of arbitrary or calculation-driven text. See Smart Narrative (preview)
  • Dedicated Power BI add-in application for Microsoft’s Teams collaboration platform, released as a preview.  The Teams integration includes the ability to browse reports, dashboards and workspaces  and directly embed links to them in Teams channel chats. It’s not just about linking though, as Teams users can also browse Power BI datasets, both through an alphabetical listing of them or by reviewing a palette of recommended ones. In both cases, datasets previously marked as Certified or Promoted will be identified as such, and Teams users will have the ability view their lineage, generate template-based reports on them, or just analyze their data in Excel. See Announcing: New Power BI experiences in Microsoft Teams
  • Through a new preview feature, Power BI asset sensitivity levels, applied as Microsoft Information Protection (MIP) labels, can be applied from within Power BI Desktop, making it unnecessary to apply them to the report and its underlying dataset in the cloud service’s user interface. More info
  • Microsoft is enhancing its Power Query technology to feature a visual interface rather than relying on the data grid view that has been its hallmark. Essentially, Power Query gets a new Diagram View where all queries in the model, and each of the transformation steps within them, will appear together, in a flowchart-like representation. This visual presentation indicates the dependencies between queries and the sequence of individual transformations within each. New transforms can be easily added in the Diagram View as well. Visual data preparation is coming soon. It will be available initially in Power BI dataflows only (i.e. online), and not in Power BI Desktop. The capability will come to Desktop within the next calendar year. More info
  • A visual anomaly detection (coming soon). More info
  • A new visual that can trigger Power Automate processes in a data-driven fashion (also coming soon)
  • A performance accelerator for Azure Synapse Analytics that automatically creates materialized views to accelerate frequent queries from Power BI. More info
  • GA of deployment pipelines (including availability on government clouds) that can move Power BI assets between development, test and production environments. See Announcing deployment pipelines General Availability (GA)
  • A preview of a data impact analysis feature that can notify users of datasets that may be affected by a change to another dataset upstream. More info

More info:

Driving a data culture in a world of remote everything

Microsoft Ignite 2020 Book of News

What’s new in Azure SQL Managed Instance at Ignite 2020

Posted in Azure Arc, Azure SQL Database, Azure Synapse Analytics/SQL DW, Delta Lake, Power BI, SQLServerPedia Syndication | 3 Comments

Azure Synapse Analytics overlooked features

There are some options in Azure Synapse Analytics that are not obvious that I wanted to point out.

While in Azure Synapse Studio, going to the Data hub and clicking the “+” at the top to the right of Data brings up:

Workspace 
Synapse SQL database 
Linked 
Connect to external data 
Integration dataset 
Browse samples

Under “Workspace”, choosing “Synapse SQL Database” gives you the ability to create a database in a SQL on-demand pool. A SQL on-demand pool (which you access via an endpoint) can have multiple databases. Note that SQL on-demand has no local storage so only metadata objects are stored in the databases.

After you create a database, if you then go to the Develop hub and create a new SQL script, you can choose to connect to “SQL on-demand”, and then the newly created database will be available under “Use database”. You can then create views or external tables in that database.  When you go to the Data hub, under Databases you will see the newly created database (you might need to hit refresh first).

Under “Linked”, choosing “Connect to external data” will create a linked service to ADLS Gen2 or Cosmos DB that you can use in the Data hub (under “Linked”).  This is the same method as going to the Manage hub and choosing “Linked services” and clicking “New” (but that option has more sources to connect to). These linked services can also be used in Azure Data Factory (see Linked Services in Azure Data Factory). 

Under “Linked”, choosing “Integration dataset” will allow you to create a dataset that can be used in pipeline activities and in data flows.  When creating the dataset you will need to specify a linked service to use for the dataset (which you could of done via “Connect to external data”), or you can create a new linked service (see Datasets in Azure Data Factory).

Under “Linked”, choosing “Browse samples” will give you a ton of samples to choose from broken into four sets:

  • Datasets: A bunch of datasets that when added will show up under Data -> Linked -> Azure Blob Storage.  You can then choose an action (via “…” next to any of the containers in the dataset) and choose New SQL script -> Select TOP 100 rows to examine the data as well as choose “New notebook” to load the data into a Spark dataframe.  Any dataset you add is a linked service to files in a blob storage container using SAS authentication.  You can also create an external table in a SQL on-demand pool or SQL provisioned pool to each dataset via an action (via “…” next to “External tables” under the database, then New SQL script -> New external table) and then query it or insert the data into a SQL provisioned database
  • Notebooks: A bunch of sample Apache Spark notebooks in various languages (PySpark/Scala/Spark.NET C#/SparkSQL) that will show up under Develop -> Notebooks
  • SQL scripts: A bunch of sample SQL scripts that will show up under Develop -> SQL scripts
  • Pipelines: This is similar to choosing the “Create pipeline from template” on the home page of a Azure Data Factory workspace and will create a pipeline that will show up under Orchestrate -> Pipelines

These samples offer a great way to shortcut the process of building a solution in Synapse.

More info:

Quickly get started with samples in Azure Synapse Analytics

Posted in Azure Synapse Analytics/SQL DW, SQLServerPedia Syndication | 2 Comments

Query options in Azure Synapse Analytics

(updated 11/12/20)
The public preview version of Azure Synapse Analytics has three compute options and four types of storage that it can access (mentioned in my blog at SQL on-demand in Azure Synapse Analytics). This gives twelve possible combinations of querying data. Not all of these combinations currently are supported and some have a few quirks of which I list below.

(NOTE: I’ll demo these features at my sessions at European Digital Week on 9/25 (session info), SQL Bits on 10/3 (session info), PASS Summit on 11/13 (session info), and Big Data Conference Europe on 11/25 (session info). I hope you can join!)

Storage >
Compute
Relational DatabaseADLS Gen2Spark TableCosmos DB
Dedicated SQL poolYY (*1)N (*2)N (*6)
Serverless SQL poolN (*6)YY (*3)N (*4)
Apache Spark poolYYYY (*5)

*1: This is available via an external table which uses the Polybase technology and does not use push-down queries so can be slow. A feature that will be available after Azure Synapse Analytics goes GA called fast parquet will speed up queries over external tables mapped to parquet files (the technology underneath is the same that is being used for SQL on-demand)
*2: This feature will be available soon, but will only support Spark tables created using Parquet as the input format
*3: This feature only supports Spark tables created using Parquet as the input format (explained here)
*4: This feature is available via the Azure Synapse Link and will soon be available
*5: This feature is available via the Azure Synapse Link
*6: This feature will be available after Azure Synapse Analytics goes GA

If you have data in ADLS Gen2 that is stored in Delta Lake format (via methods such as a Spark Table or Azure Data Factory), that data can be read via the Apache Spark pool (no matter if the Delta Lake is built via open source or Databricks), soon via Serverless SQL (a workaround for now is here), and not via a Dedicated SQL pool. Note that Power BI can read the Delta Lake format but requires you to use the Power BI Databricks connector.

If you have data in ADLS Gen2 that is in the Common Data Model (CDM) format (via methods such as Dynamics 365 CDS export to ADLS Gen2 or stored directly, or via Azure Data Factory), Serverless SQL is not yet able to read it, but you can via the Apache Spark pool (see Spark CDM connector). Note that Power BI can read CDM if using Power BI dataflows.

A huge benefit of using the public preview version of Azure Synapse Analytics is the ability to query a file in the data lake by simply right-clicking the file. When you do that a menu will pop-up giving you three choices on how to query the data. Currently this works on three different files types: parquet, csv, and JSON (note that you can always query additional file types using a Spark notebook). The three query choices are listed below with all but one currently supported:

Right-click
object-type:
PreviewNew SQL script ->
Select TOP 100 rows
New notebook ->
Load to DataFrame
ParquetN (*3)Y (*5)Y
CSVYY (*1)Y
JSONYY (*2)Y
Relational TblNYY
External TblNYY
Spark TblNYY
Cosmos DBNY (*3)Y

“Preview” opens a pop-up window with the contents of the file, “Select TOP 100 rows” opens a tab with a T-SQL SELECT statement using SQL on-demand and then you can run the statement to show the results, and “New notebook” opens a Spark notebook that has PySpark code that you can run to load the data into a Spark in-memory DataFrame and display the rows.

*1: This feature was just added and has a limitation in that the column headers do not show correctly (a feature is being worked on to enable the query to infer the column names if the file has a header row)
*2: Replace “varchar(8000)” with “varchar(MAX)” in the SELECT statement if you receive an error when running the query
*3: A work-around is to right-click the file, choose New SQL script -> Bulk load, click the Continue button and you will see a “Preview data” option
*4: Need to replace “key” with the Cosmos DB key in the generated SQL
*5: Because of automatic schema inference you don’t need to specify the field layouts for a file in a query

As far as writing to ADLS Gen2 via a Serverless SQL pool, that could be done using CETAS. Note that dropping an external table does not delete the data. You can also use the copy activity in ADF to read data via views in SQL on-demand and write it to the data lake, but the write is using ADF compute.

Posted in Azure Synapse Analytics/SQL DW, Delta Lake, SQLServerPedia Syndication | 1 Comment

Data Lakehouse & Synapse

I am starting to see this relatively new phrase, “Data Lakehouse”, being used in the data platform world. It’s the combination of “Data Lake” and “Data Warehouse”. In this post I’ll give my thoughts on it, and how the next version of Azure Synapse Analytics that is in public preview fits right in with the Data Lakehouse.

First, I want to clear up a bit of confusion regarding Azure Synapse Analytics. If you login to the Azure portal and do a search, you will see:

Notice that Azure SQL Data Warehouse (SQL DW) is now called Azure Synapse Analytics. This product has been generally available (GA) for a while and is simply a renaming. It contains new features and soon-to-be new features that I blogged about here, but this is separate from other features in a public preview version that I will call “Synapse PP” (see the new docs). I’ll call the current GA version “Synapse GA”. I blogged about this here.

I like to think a major feature of a Data Lakehouse as the ability to use T-SQL to query data the same way no matter whether the data is in a data lake (storage such as ADLS Gen2) or a data warehouse (a relational database such as Synapse GA).

Here are some of the major new Synapse PP features that support a Data Lakehouse:

There are some great benefits to having a data lake that I blogged about here. And there are many reasons to also have a relational data warehouse along with your data lake that I blogged about here. Up until now, the data lake and the relational data warehouse where sort of on their own island, having to jump to a different product to interface with each. For example, your data lake could be in ADLS Gen2 and you can use Azure Databricks to query it with SparkSQL (first setting up a connection to ADLS Gen2), while your relational data warehouse could be in Synapse GA and you use SSMS to query it with T-SQL. Having to use Azure Data Factory or Power BI means opening up another tab in your browser, logging into those products, and connecting to ADLS Gen2. So it can be time consuming and costly to create an environment to query both, and also be confusing to query using two different versions of SQL.

You can now get the best of both worlds all under one roof via Azure Synapse Studio: using the relational data warehouse when you need fast query performance, high user concurrency, enhanced security, or just prefer working in a relational environment; or using the data lake when you have huge data volumes, need instant access to the data via schema-on-read, or need to deal with semi-structured data. You never have to leave your workspace no matter which tool you want to use, and can use T-SQL no matter if the data is in the data lake or in the relational data warehouse. And querying a file in the data lake is simply a matter of right-clicking the file (more on this in my next blog).

So at a high level you can view Azure Synapse Studio as supporting business intelligence, data science, T-SQL plus other languages (Python, Scala, C#, SparkSQL), schema flexibility, ACID compliance, any data type, petabytes of data, streaming and batch, with high performance and user concurrency.

So as you can see, Synapse PP combines the use of a data lake and a relational database to make it a Data Lakehouse. It is not really a “new” technology, but just a merging of existing technologies to make it easier to gain insights from data in order to make better business decisions.

With these new features, especially SQL on-demand and T-SQL against ADLS, I am starting to see a few use cases where you may not need a relational database anymore in your modern data warehouse, which goes against my previous way of thinking (see Is the traditional data warehouse dead?). I still feel you will want to have a relational database in your modern data warehouse architecture a large majority of the time, but there will be exceptions.

For example, I can land a bunch of parquet files into the data lake and create a T-SQL view on top of that data, where that view is stored in SQL on-demand. Then I call that view from Power BI making it appear to the end-user that they are using a relational database and at the same time they are only paying when the query is being run. So that leads to compute costs savings as well as not having to copy the data from the data lake to a relational database. This means less time is needed to build a solution and there is a reduction in the complexity of the solution and therefore additional cost savings. Below are some of the concerns that arise out of skipping the relational database and just using a data lake and the new data/delta lake features that “counter” some of those concerns:

  • Speed: Queries against a relational storage will always be faster than against a data lake (roughly 5X) because of missing features in the data lake such as the lack of statistics, query plans, result-set caching, materialized views, in-memory caching, SSD-based caches, indexes, and the ability to design and align data and tables. Counter: DirectParquet, CSV 2.0, query acceleration, predict pushdown, and sql on-demand auto-scaling are some of the features that can make queries against ADLS be nearly as fast as a relational database.  Then there are features like Delta lake and the ability to use statistics for external tables that can add even more performance. Plus you can also import the data into Power BI, use Power BI aggregation tables, or import the data into Azure Analysis Services to get even faster performance. Another thing to keep in mind affecting query performance is Synapse is a Massive parallel processing (MPP) technology that has features such as replicated tables for smaller tables (i.e. dimension tables) and distributed tables for large tables (i.e. fact tables) with the ability to control how they are distributed across storage (hash, round-robin). This could provide much greater performance compared to a data lake that uses HDFS where large files are chunked across the storage
  • Security: Row-level security (RLS), column-level security, dynamic data masking, and data discovery & classification are security-related features that are not available in a data lake. Counter: User RLS in Power BI or RLS on external tables instead of RLS on a database table, which then allows you to use result set caching in Synapse
  • Complexity: Schema-on-read (ADLS) is more complex to query than schema-on-write (relational database). Schema-on-read means the end-user must define the metadata, where with schema-on-write the metadata was stored along with the data. Then there is the difficulty in querying in a file-based world compared to a relational database world. Counter: Create a SQL relational view on top of files in the data lake so the end-user does not have to create the metadata, which will make it appear to the end-user that the data is in a relational database. Or you could import the data from the data lake into Power BI, creating a star schema model in a Power BI dataset. But I still see it being very difficult to manage a solution with just a data lake when you have data from many sources
  • Missing features: Auditing, referential integrity, ACID compliance, updating/deleting rows of data, data caching, Transparent Data Encryption (TDE), workload management, full support of T-SQL – all are not available in a data lake. Counter: some of these features can be accomplished when using Delta Lake, Apache Hudi or Apache Iceberg (see A Thorough Comparison of Delta Lake, Iceberg and Hudi), but will not be as easy to implement as a relational database and you will be locked into using Spark. Also, features being added to Blob Storage (see More Azure Blob Storage enhancements) can be used instead of resorting to Delta Lake, such as blob versioning as a replacement for time travel in Delta Lake

Some additional reasons for using a relational database include helping to break up a large problem into smaller pieces (think of the relational database as an additional step in the data journey from a data lake to the end-user), and the need to handle slowly changing dimensions (SCD) (see Create Generic SCD Pattern in ADF Mapping Data Flows). Using time travel in the Delta Lake or creating daily folders for each table in the data lake are options for handling SCD but much more complex than in a relational database world (see Slowly changing data (SCD) Type 2 operation into Delta tables). I also see some customers who have source systems that are relational databases and as they are in the early stages of building a modern data warehouse in Azure they will bypass the data lake and copy the relational data right into a relational data warehouse (especially if they have many existing SSIS packages that they are using for their on-prem data warehouse and they want to continue using those for the short-term). They do this for quick wins and eventually will land the data in the data lake.

A strong reason for skipping a relational database is because, for the short term, SQL on-demand is not able to access a relational database, so you can only use this feature against a data lake. So if you want to save costs by using SQL on-demand it would have to be against data in the data lake. This could be especially true if you have a “small” data warehouse.

There can even be a reason to not copy a large amount of data to a relational database because of the large cost savings in storage: SQL Provisioned pool costs at least $135 per TB per month (some regions cost more); for ADLS Gen2, hot access tier is at least $19 per TB per month, cool access tier is at least $10 per TB per month, and archive access tier is at least $2 per TB per month (some regions cost more) with an additional cost per transaction (which varies greatly so check the Azure calculator). But keep in mind the tradeoffs listed above.

I’m still thinking through this, but these are my initial thoughts and it will be interesting to have discussions with customers on this topic when Synapse PP goes GA. Please comment below on your thoughts!

I’ll be presenting and doing a demo of the public preview of Azure Synapse Analytics at my sessions at European Digital Week on 9/25 (session info), SQL Bits on 10/3 (session info), PASS Summit on 11/13 (session info), and Big Data Conference Europe on 11/25 (session info).

More info:

What is a Lakehouse?

The Data Lakehouse – Dismantling the Hype

THE DATA LAKEHOUSE (updated)

Data LakeHouse — Paradigm of the Decade

Eckerson Group Webinar -Shop Talk – the Data Lakehouse – April 17, 2020

All Hail, the Data Lakehouse! (If Built on a Modern Data Warehouse)

An Architect’s View of the Data Lakehouse: Perplexity and Perspective

Data Lakehouses Hold Water (thanks to the Cloud Data Lake)

The Rise of Data Lakehouse with Data Virtualization

Meet the Data Lakehouse

The Future of Analytics: Leveraging Data Lakes and Data Warehouses

Posted in Azure Data Lake, Azure Synapse Analytics/SQL DW, Delta Lake, SQLServerPedia Syndication | 9 Comments

Cosmos DB for the SQL Professional – Referencing Tables

I had a previous blog comparing Cosmos DB to a relational database (see Understanding Cosmos DB coming from a relational world) and one topic that it did not address that I want to now is how to handle reference tables that are common in the relational database world.

A big difference with Cosmos DB compared to a relational database is you will create a denormalized data model.  Take a person record for example.  You will embed all the information related to a person, such as their contact details and addresses, into a single JSON document.  Retrieving a complete person record from the database is now a single read operation against a single container and for a single item.  Updating a person record, with their contact details and addresses, is also a single write operation against a single item.  By denormalizing data, your application typically will have better read performance and write performance and allow for a scale-out architecture since you don’t need to join tables. 

(Side note: “container” is the generic term. Depending on the API, a specific term is used such as “collection” for the SQL API). Think of a container as one or more tables in the relational world. Going a little deeper, think of a container as a group of one or more “entities” which share the same partition key. A relational table shares a schema, but containers are not bound in that way.)

Embedding data works nicely for many cases but there are scenarios when denormalizing your data will cause more problems than it is worth.  In a document database, you can have information in one document that relates to data in other documents. While there may be some use cases that are better suited for a relational database than in Cosmos DB (see below), in most cases you can handle relationships in Cosmos DB by creating a normalized data model for them, with the tradeoff that it can require more round trips to the server to read data (but improve the efficiency of write operations since less data is written).  In general, use normalized data models to represent one-to-many relationships or many-to-many relationships when related data changes frequently. The key is knowing whether the cost of the updates is greater than the cost of the queries.

When using a normalized data model, your application will need to handle creating the reference document.  One way would be to use a change feed that triggers on the creation of a new document – the change feed essentially triggers an Azure function that creates the  relationship record.

When using a normalized data model, your application will need to query the multiple documents that need to be joined (costing more money because it will use more request units), and do the joining within the application (i.e. join a main document with documents that contain the reference data) as you cannot do a “join” between documents within different containers in Cosmos DB (joins between documents within the same container can be done via self-joins).  Since every time you display a document it needs to search the entire container for the name, it would be best to put the other document type (the reference data) in a different container so you can have different partition keys for each document type (read up on how partitioning can make a big impact on performance and cost).

Note that “partitioning” in a RDBMS compared to Cosmos DB are different things: partitioning in Cosmos DB refers to “sharding” or “horizontal partitioning“, where replica sets which contain both the data and copies of compute (database) resources operating in a “shared nothing” architecture (i.e. scaled “horizontally” where each compute resource (server node) operates independently of every other node, but with a programming model transparent to developers). Conversely, what is often referred to as “partitioning” in a RDBMS is purely a separation of data into separate file groups within a shared compute (database) environment. This is also often called “vertical partitioning”.

Another option that is common pattern for NoSQL databases is to create a separate container to satisfy specific queries.  For example, having a container for products based on category and another container for products based on geography.  Both of those containers for my query/app are being sourced from one that is my “main” or “source” container that is being updated (front end, or another app) and the change feed attached to that pushes out to my other containers that I use for my queries.  This means duplicating data, but storage is cheap and you save costs to retrieve data (think of those extra containers as covering indexes in the relational database world). 

Since joining data can involve multiple ways of reading the data, it’s important to understand the two ways to read data using the Azure Cosmos DB SQL API:

  • Point reads – You can do a key/value lookup on a single item ID and partition key. The item ID and partition key combination is the key and the item itself is the value. For a 1 KB document, point reads typically cost 1 request unit with a latency under 10ms. Point reads return a single item. See Understanding the difference between point reads and queries in Azure Cosmos DB
  • SQL queries – You can query data by writing queries using the Structured Query Language (SQL) as a JSON query language. Queries always cost at least 2.3 request units and, in general, will have a higher and more variable latency than point reads. Queries can return many items. See Getting started with SQL queries

The key in deciding when to use a normalized data model is how frequently the data will change.  If the data only changes once a year it may not be worthwhile to create a reference document and instead just do an update to all the documents.  But be aware that the update has to be done from the client side spread over the affected documents, doing it in batches as one big UPDATE statement does not exist in Cosmos DB.  You will need to retrieve the entire document from Cosmos DB, update the property/properties in your application and then call the ‘Replace’ method in the Cosmos DB SDK to replace the document in question (see CosmosDb – Updating a Document (Partially)). If you are using SQL API and .NET or Java, you can consider using bulk support (.NET) or bulk executor (Java). Other ideas would involve using change feed, or if you really need a level of ACID consistency, you can achieve this using stored procedures, with snapshot isolation scoped to a single partition (this is not the same as stored procedures in SQL – rather these are designed specifically to support multi-doc transactions).  

Also be aware that because there is currently no concept of a constraint, foreign-key or otherwise, any inter-document relationships that you have in documents are effectively “weak links” and will not be verified by the database itself.  If you want to ensure that the data a document is referring to actually exists, then you need to do this in your application, or through the use of server-side triggers or stored procedures on Azure Cosmos DB.

What are OLTP scenarios where a relational database is essential?

Avoiding the broader topic of when to use a relational database over a non-relational database, there are a few use cases where a relational database is essential:

  • The customer experience and comfort zone is with relational databases. It is a reality that relational databases are ahead in the maturity curve with respect to tooling (an example would be foreign-key constraint behavior). However, it should be noted that this is not the same as saying that “more use cases are technically better suited to the relational model”. Rather, the barrier to entry in new customer projects tends to be lower because mindshare is greater in the relational space. In these cases, it often isn’t worth the effort for companies to upskill
  • The system really needs strict ACID semantics across the entire dataset. Sharded/partitioned databases like Cosmos DB will not provide ACID guarantees across the entire set of physical partitions (and likely never will). In reality, however, the use cases where this is necessary is quite small. Things like transaction management and other SDK-level things that go along with these aspects come easier in the RDBMS space, but this is really the same as above point – RDBMS is ahead on maturity curve for user-level tooling to help abstract paradigm specific concepts – but this does not make the paradigm better suited to a greater number of use cases
  • Having a single data store that services both operational and analytical needs with equal utility, including tabular models – this is probably the most powerful argument, and NoSQL engines are likely never going to serve a data structure that coalesces as well into tabular models that produce reports, charts, graphs, etc. But again, history has proven that, at scale, the “one fits all” approach can have some non-trivial drawbacks. And the new Analytical Store in Cosmos DB is addressing the need to service both operational and analytical needs

You can create complex hierarchical “relationships” in Cosmos DB, which would have to be modelled in separate tables in an RDBMS. Cosmos DB can’t handle them using joins – but again, this is a paradigmatic/semantic difference, not a fundamental flaw in the database model itself. In order to do the equivalent of what one may be trying to achieve in a relational database, you may have to “unlearn what you have learned”, but this comes back to your comfort level with a RDBMS, which is not a trivial thing and can be the main and very valid reason for staying with a RDBMS.

In summary, in a NoSQL database like Cosmos DB, most use cases are covered. Some things are a little harder (due to lack of maturity in tooling), but most things are easier, many things can only be done in NoSQL (i.e. handling millions of transactions per second), and very few things cannot be done in a NoSQL database. Most NoSQL engines are characterized by having a lot more configurability, tunability, and flexibility than a RDBMS. And in many ways, that is the hardest challenge for newcomers.

More info:

Data modeling in Azure Cosmos DB

Video Data modelling and partitioning in Azure Cosmos DB: What every relational database user needs to know

Video A tour of Azure Cosmos DB database operations models

Posted in Azure Cosmos DB, SQLServerPedia Syndication | Comments Off on Cosmos DB for the SQL Professional – Referencing Tables

SQL on-demand in Azure Synapse Analytics

The new SQL on-demand (to be renamed SQL serverless) offering within Azure Synapse Analytics opens up a whole new world of getting insights into data immediately, at low-cost, using familiar tools/languages. There is no infrastructure to setup or clusters to maintain. It uses a pay-per-query model (you are only being charged per query for the data processed by the query – see pricing) and auto-scales depending on the amount of data being queried. See Quickstart: Use SQL on-demand

A SQL on-demand pool currently can access data from ADLS Gen2, Spark Tables, and Cosmos DB using T-SQL (click picture to enlarge):

Querying data in ADLS Gen2 storage using T-SQL is made easy because of the OPENROWSET function with additional capabilities (check out the T-SQL that is supported). The currently supported file types in ADLS Gen2 that SQL-on-demand can use are Parquet, CSV, and JSON. ParquetDirect and CSV 2.0 add performance improvements (see Benchmarking Azure Synapse Analytics – SQL Serverless, using .NET Interactive). You can also query folders and multiple files and use file metadata in queries.

Here are the three main scenarios that SQL on-demand is great for:

  • Basic discovery and exploration – Quickly view the data in various formats (Parquet, CSV, JSON) in your data lake, so you can plan how to extract insights from it
  • Logical data warehouse – Provide a relational abstraction on top of raw or disparate data without having to relocate or transform the data, allowing an always up-to-date view of your data. By putting T-SQL views on top of data in your data lake, this makes it appear to the end user that they are querying data in a relational database since they are using T-SQL, blurring the line between a relational database and a data lake
  • Data transformation – Simple, scalable, and performant way to transform data in the lake using T-SQL, so it can be fed to BI and other tools, or loaded into a relational data store (Synapse SQL databases, Azure SQL Database, etc.). For example, using the Copy activity in Azure Data Factory you can convert CSV files in the data lake (via T-SQL views in SQL on-demand) to Parquet files in the data lake.  See Azure Synapse Analytics: How serverless is replacing the data warehouse

Different professional roles can benefit from SQL on-demand:

  • Data Engineers can explore the lake, transform and prepare data using this service, and simplify their data transformation pipelines. For more information, check out this tutorial. You could even create a view over data in the data lake and use that to refresh your data in a tabular model
  • Data Scientists can quickly reason about the contents and structure of the data in the lake, thanks to features such as OPENROWSET and automatic schema inference
  • Data Analysts can explore data and Spark external tables created by Data Scientists or Data Engineers using familiar T-SQL language or their favorite tools, which can connect to SQL on-demand. So if you create a Spark Table, that table will be created as an external table in SQL On-Demand so you can query it without having to keep a Spark cluster up and running. Currently this is only supported with Spark Tables created using Parquet as the input format
  • BI Professionals can quickly create Power BI reports on top of data in the lake and Spark tables. You are connecting to a “database” that has no data, but rather views or external tables to data in the data lake. Not only can you create reports, but you can use Power BI to explore the data and do basic discovery

Here are the top FAQ’s about SQL on-demand that I have heard from customers:

  • Will SQL on-demand be able to query SQL Pool tables? It will be able to in the future
  • Is SQL serverless MPP? SQL serverless uses nodes to scale out processing, similar to MPP, although it is completely different architecture from existing MPP data warehouses.  It uses customized SQL engines
  • What are the warmup times when submitting a query? For 95 percentile it is less than 2s. There is a timeout defined after which there is scale-down, so if you don’t execute any query targeting storage during that timeout, your resources will be “taken from you”. Once you get back/connect and execute a query that targets storage, resources are granted back to your endpoint.  Maintained is a pool of warm clusters to get the quick warmup time
  • Is there a predicate pushdown concept for SQL on-demand in Azure Synapse? Yes, there is filter pushdown where SQL on-demand will push down queries from the front-end to back-end nodes. For example, if you query parquet files parquet metadata is used to target only column groups that contain values you are looking for. Microsoft is expanding the range of cases in which filter pushdown is supported
  • What are the best practices for SQL on-demand?  Check out https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/best-practices-sql-on-demand
  • Why should I choose serverless and have a penalty on first (warmup) query, instead of using provisioned?  It depends on your workload. For constant or high workloads provisioned might be a better choice, while for sporadic or ad-hoc or exploratory workloads serverless might be better fit. It also brings a difference in charges: provisioned means you are paying for resources, while in serverless you are paying for what you use
  • So when would you use provisioned SQL (aka SQL Pool, which is exactly what was in SQL Data Warehouse)? When you need consistent performance, high performance, or have a large number of queries being consistently run. Provisioned SQL may give you better and more predictable performance due to resource reservation. But the good thing is because both use T-SQL, it is easy to transition back-and-forth between SQL Serverless and a SQL pool
  • Most customers are starting to use Delta Lake to store their data. When will SQL OD support this file format?  We are working on the support for Delta Lake, in the following months there will be more details regarding it and when will it be available
  • What is the ADLS Gen2 security? There are different methods to access storage account files using SQL serverless. You can use Synapse workspace MSI, SAS, or control ACL on storage level for a particular AAD account
  • Can we read from files stored in Common Data Model (CDM) format within the Data Lake (ADLS Gen2)?  Yes, at this moment we have a way to read CDM data using the SQL script or provisioned script, which gets metadata from the model.json and creates views over CDM data (native support in the query currently doesn’t exist)
  • Why use external tables over views?  Views are faster and have more features, so avoid external tables if possible. Particularly, virtual functions (filepath and filename are not supported in external tables at this moment which means users cannot do partition elimination on external tables)
  • Can you use PBI and AAS with SQL on-demand?  Yes, you can create views on top of your data lake, use wildcards and filepath/filename functions to expose partitioning column values in a view. That way, for PBI and AAS it is just another object (please make sure that you do proper data type casting of virtual function results as stated in best practices document)
  • Why use a logical data warehouse instead of just using AAS or PBI instead? Hot analytics is one reason as well as a logical data warehouse that can be used to speed up user request to delivery time.  Ideally, there would be no ETLs created (it might not be feasible in all cases, so providing another perspective here)
  • Is there any cost control? Yes, cost control will be available per workspace and time period daily/weekly/monthly. Also, cost estimation for later
  • What about CI/CD? When Synapse goes GA there will be improvements overall in CI/CD area, where customers can create group/department and scope the permissions of particular teams to the artifacts and capabilities
  • Can I use other tools with SQL on-demand? Yes, there is a default endpoint (i.e. workspacejs1966-ondemand.sql.azuresynapse.net) for this service that is provided within every Azure Synapse workspace. Any tool capable of establishing TDS connection SQL offerings can connect to and query Synapse SQL on-demand endpoint. You can connect with Azure Data Studio or SSMS and run ad-hoc queries or connect with Power BI to gain insights. You can also use Private Link to bring your SQL on-demand endpoint into your managed workspace VNet

More info:

Talking about Azure Synapse on Microsoft Mechanics!

Azure Synapse Analytics – Getting Started with SQL On Demand

AZURE SYNAPSE ANALYTICS – THE ESSENTIAL SQL ON-DEMAND CHEAT SHEET

THINGS TO KNOW ABOUT SQL ON-DEMAND (SERVERLESS QUERY) IN AZURE SYNAPSE ANALYTICS

Power BI and SQL on-demand

Posted in Azure Synapse Analytics/SQL DW, SQLServerPedia Syndication | 2 Comments