Eighty-nine percent of organizations agree that the rate of change has accelerated in the past two years, and it’s not likely to slow down anytime soon. In an age of relentless change, companies must have a firm grasp on the data and insights that can help them evolve their customer, workforce, and operational strategies accordingly. At its core, this requires a readily scalable approach to data architecture, best enabled by public cloud architectures. Yet the move away from on-premises data storage and toward the public cloud is a tall order. It can be challenging to pick the public cloud platform that meets the needs of your business and makes the best use of the existing skill sets within your workforce. Using an expertise-guided approach to inform your selection is the best way forward. In this blog series, we distill the data pipeline architectures of the top cloud platforms available today. From there, we equip you with practical criteria to inform the right selection for your organization.
In the previous blogs in our series, we walked through ETL services provided by Amazon Web Services and Google Cloud Platform. In this final piece, we’ll take a deep dive into those offered by Microsoft Azure.
Azure Data Factory is an intuitive drag-and-drop tool used to extract and migrate your data from various source systems into Azure. You can use either Azure Blob Storage or Azure Data Lake Storage to house your raw data. A combination of Azure Data Factory and PolyBase lets you transform the raw data into its desired state before loading it into your data warehouse in Azure Synapse Analytics. Azure also provides multiple options for presenting your data to end users, including Azure SQL Database and Power BI for visualization and reporting purposes.
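To make the PolyBase loading step concrete, below is a minimal T-SQL sketch that exposes raw CSV files in Blob Storage as an external table and loads them into a Synapse dedicated SQL pool. The storage account, container, schema, and table names are hypothetical placeholders:

```sql
-- Hypothetical names throughout; adjust to your storage account and schema.
-- (A database-scoped credential is also needed for non-public containers.)
-- 1. Point PolyBase at the raw files in Blob Storage.
CREATE EXTERNAL DATA SOURCE raw_src
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://raw@mystorageacct.blob.core.windows.net'
);

-- 2. Describe the file layout (delimited CSV with a header row).
CREATE EXTERNAL FILE FORMAT csv_fmt
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

-- 3. Expose the files as an external table.
CREATE EXTERNAL TABLE ext.SalesRaw (
    SaleId   INT,
    SaleDate DATE,
    Amount   DECIMAL(18, 2)
)
WITH (LOCATION = '/sales/', DATA_SOURCE = raw_src, FILE_FORMAT = csv_fmt);

-- 4. Load into the warehouse in parallel with CREATE TABLE AS SELECT.
CREATE TABLE dbo.Sales
WITH (DISTRIBUTION = HASH(SaleId))
AS SELECT * FROM ext.SalesRaw;
```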
Scaling Compute Power and Memory
Azure is flexible and scalable. Azure Synapse Analytics (formerly Azure SQL Data Warehouse) decouples compute from the underlying storage, so computing power and memory scale independently of the data you store. If there’s a spike in usage for your data solution, Azure SQL Database and SQL Managed Instance can scale on demand while keeping downtime to a minimum.
Azure SQL Database offers both a Database Transaction Unit (DTU)-based and a vCore-based purchasing model. The DTU-based model provides a blended measure of compute, memory, and input/output resources in three service tiers (Basic, Standard, and Premium) to support various database workloads. The vCore-based model lets you choose compute and storage resources independently across three service tiers: General Purpose, Business Critical, and Hyperscale. In cases where even the highest service tiers fail to keep up with your scaling demands, Azure provides additional options (a short scaling sketch follows the list):
- Read scale-out: A read-only replica of your data where you can run demanding, read-only queries without affecting resource usage on your primary database.
- Database sharding: A set of techniques that enables you to split the data into several databases and scale them independently.
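As an illustration of scaling on demand, the hedged T-SQL sketch below moves a hypothetical database named `SalesDb` between service objectives and shows how to confirm that a connection landed on the read-only replica; the tier choices are examples, not recommendations:

```sql
-- Scale up within the DTU model, e.g. to Standard S3. The operation is
-- online; Azure switches connections over once the new capacity is ready.
ALTER DATABASE SalesDb MODIFY (SERVICE_OBJECTIVE = 'S3');

-- Or move to a vCore-based tier, e.g. General Purpose with 8 vCores.
ALTER DATABASE SalesDb
MODIFY (EDITION = 'GeneralPurpose', SERVICE_OBJECTIVE = 'GP_Gen5_8');

-- With read scale-out enabled, clients that set
-- "ApplicationIntent=ReadOnly" in their connection string are routed to
-- the read-only replica; this query reports which copy you reached.
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Updateability');  -- READ_ONLY on the replica
```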
Storage
Azure Blob Storage and Azure Data Lake Storage are the platform’s two storage options. Azure Data Lake Storage is a vast repository for structured and unstructured data that combines scalability with low cost. Benefits include:
- Hadoop-compatible access treats data as if it were stored in the Hadoop Distributed File System (HDFS), letting compute technologies (like Azure Synapse Analytics) query the data in place without moving it between environments (see the query sketch after this list).
- Security permissions can be set at the directory or file level for data stored in the data lake. All data is encrypted at rest using Microsoft-managed or customer-managed keys.
- Data is stored in a hierarchy of directories and subdirectories for better performance and more seamless navigation.
- Data redundancy takes advantage of the Azure Blob replication model, which keeps copies of your data within a single data center with locally redundant storage (LRS) or in a secondary region with the geo-redundant storage (GRS) option.
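As a sketch of that query-in-place access, the statement below reads Parquet files directly out of the data lake from an Azure Synapse serverless SQL pool using `OPENROWSET`; the storage account and folder path are hypothetical:

```sql
-- Query Parquet files where they sit in Azure Data Lake Storage Gen2,
-- with no copy or load step ('mydatalake' and 'raw/sales' are placeholders).
SELECT TOP 100 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;
```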
Networking
Azure lets you set up SQL Server Always On availability groups to optimize throughput for your ETL solution and ensure it is always available. Replicas in these groups can live in different Azure regions and can use synchronous commit with automatic failover. Azure Accelerated Networking is ideal for computing environments under extreme demand, giving you options to optimize and fine-tune the ETL solution for your network workloads on supported virtual machines (VMs).
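For reference, here is a minimal, hedged T-SQL sketch of creating such an availability group between two SQL Server instances on Azure VMs, with synchronous commit and automatic failover; all server, endpoint, and database names are hypothetical:

```sql
-- Run on the primary replica. Both replicas commit synchronously and
-- fail over automatically. Names and endpoint URLs are placeholders.
CREATE AVAILABILITY GROUP [EtlAG]
FOR DATABASE [StagingDb]
REPLICA ON
    'SQLVM1' WITH (
        ENDPOINT_URL      = 'TCP://sqlvm1.contoso.internal:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = AUTOMATIC
    ),
    'SQLVM2' WITH (
        ENDPOINT_URL      = 'TCP://sqlvm2.contoso.internal:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE     = AUTOMATIC
    );
```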
Below is an example of a modern data warehouse architecture in Azure:
Applying an expertise-guided approach, below are a few key diagnostic criteria you can use to validate whether Microsoft Azure is the right pick for your business:
- If your existing architecture is built on SQL Server Integration Services (SSIS), Azure may offer the most seamless migration relative to the other options.
- If you’re seeking serverless capabilities that are comparable to AWS Glue, Azure Data Factory is a great alternative pick.
- Whether you use structured data in relational databases or non-relational databases, Microsoft Azure is an ideal choice. For non-relational workloads, you can simply swap Azure SQL Database for Azure Cosmos DB.
Moving away from on-premises data storage and toward the public cloud may initially seem like a daunting task. By applying insights and expertise to guide your path forward, you’ll be one step closer to developing a data architecture strategy that can help your organization scale, adapt, and transform for the unknowable future ahead.