zaro

How do I connect Azure Data Factory to external data sources and services?

Published in Azure Data Factory Connectivity 6 mins read

Connecting Azure Data Factory (ADF) to external data sources and services is primarily achieved through Linked Services, which store the connection information needed for ADF to interact with various data stores and compute environments. This allows ADF pipelines to move data between different systems, transform it, and execute activities.

Understanding ADF Connectivity with Linked Services

A Linked Service in Azure Data Factory acts as a connection string, defining the specific details required to connect to an external resource. This includes information such as:

  • Type of data store or compute environment: E.g., Azure SQL Database, Azure Blob Storage, Amazon S3, Azure Synapse Analytics, Spark, etc.
  • Connection details: Server name, database name, storage account URL, endpoint.
  • Authentication method: Credentials, managed identity, service principal, access keys.

By separating connection information into Linked Services, you can reuse them across multiple pipelines and datasets, centralizing your connection management.

Step-by-Step: Connecting Azure Data Factory to Azure SQL Database (using System Assigned Managed Identity)

One of the most secure and recommended ways to connect ADF to Azure services like Azure SQL Database is by using a System Assigned Managed Identity. This eliminates the need to manage credentials directly within ADF.

Prerequisites

Before you begin, ensure you have:

  • An active Azure Data Factory instance.
  • An Azure SQL Database.
  • The Azure Data Factory's system-assigned managed identity granted appropriate permissions (e.g., 'db_datareader', 'db_datawriter', or a custom role) within your Azure SQL Database. You can do this by executing CREATE USER [YourADFName] FROM EXTERNAL PROVIDER; and then ALTER ROLE db_datareader ADD MEMBER [YourADFName]; in your SQL Database.

Configuration Steps

  1. Launch ADF Studio: Open your Azure Data Factory instance and navigate to the Launch ADF Studio button.
  2. Navigate to Manage Hub: In ADF Studio, click on the Manage hub icon (represented by a toolbox or wrench icon) in the left-hand navigation pane.
  3. Access Linked Services: Under the Connections section, select Linked services.
  4. Create New Linked Service: Click the + New button to create a new linked service.
  5. Select Data Store Type: In the "New linked service" wizard, search for and select Azure SQL Database from the list of available data stores, then click Continue.
  6. Configure Connection Details:
    • Provide a descriptive Name for your linked service.
    • Select your Azure subscription from the dropdown list.
    • Choose your Server name and Database name from the respective dropdowns.
  7. Choose Authentication Type: For Authentication type, select System Assigned Managed Identity. This tells ADF to use its own identity to authenticate with the Azure SQL Database.
  8. Test and Create:
    • Click Test connection to verify that ADF can successfully connect to your SQL Database using the managed identity.
    • If the test is successful, click Create to save your new linked service.

Once created, this linked service can be referenced by datasets and activities within your ADF pipelines to interact with the specified Azure SQL Database.

Common Connection Methods and Linked Service Types

Azure Data Factory supports a vast array of data stores and compute environments, each potentially offering different authentication methods to suit various security and operational requirements.

Popular Authentication Methods

  • Managed Identity (System/User Assigned): Recommended for Azure services, providing secure, credential-less authentication.
  • Connection String: A common method for many databases and storage services, containing all necessary connection parameters (server, database, credentials).
  • Account Key / SAS URI: Used for Azure Storage accounts (Blob, Data Lake Gen2), providing direct access.
  • Service Principal: An application identity registered in Azure Active Directory, often used for programmatic access to Azure resources.
  • SQL Authentication: Usernames and passwords directly managed within the database system (e.g., SQL Server, Azure SQL DB).
  • OAuth2: For services requiring token-based authentication (e.g., REST APIs).

Examples of Other Linked Service Types

  • Azure Blob Storage / Azure Data Lake Storage Gen2: For cloud object storage, commonly used for data landing zones.
  • Azure Synapse Analytics: For data warehousing and big data analytics.
  • Azure Databricks: For Apache Spark-based analytics platform.
  • Amazon S3 / Google Cloud Storage: For connecting to other cloud storage providers.
  • SFTP / FTP: For secure file transfer protocols.
  • REST / HTTP: For connecting to web services and APIs.
  • On-premises data sources: SQL Server, Oracle, File System (require Self-Hosted Integration Runtime).

Role of Integration Runtimes in Connectivity

Integration Runtimes (IRs) are the compute infrastructure used by Azure Data Factory to provide data integration capabilities across different network environments.

  • Azure Integration Runtime: The default compute infrastructure fully managed by Azure. It's used for connecting to cloud-based data stores and compute services publicly accessible over the internet.
  • Self-Hosted Integration Runtime (SHIR): A customer-managed agent that needs to be installed on a Windows machine (on-premises or in a virtual network). It's required for connecting ADF to data stores located in private networks (e.g., on-premises databases, data stores in a Virtual Network with private endpoints) or to perform custom activities that require specific software.

Best Practices for Secure Connectivity

To ensure robust and secure data integration in ADF, consider these best practices:

  • Principle of Least Privilege: Grant only the necessary permissions to your linked services and ADF's managed identity.
  • Azure Key Vault Integration: Store sensitive credentials (like connection strings, API keys) in Azure Key Vault and reference them in your linked services. This prevents credentials from being hardcoded in ADF.
  • Private Endpoints and Virtual Networks: For enhanced security, connect ADF to Azure data services using Azure Private Link (via private endpoints) to route traffic securely over the Azure backbone network, avoiding public internet exposure.
  • Regular Auditing: Monitor access logs and activities related to your linked services and data sources to detect and respond to suspicious behavior.
  • Managed Identity First: Always prefer Managed Identities for connections to Azure services when available, as they offer the highest level of security and ease of management.

Quick Reference: Common Linked Services & Authentication

Data Store / Service Common Authentication Types Typical Use Case
Azure SQL Database System/User Assigned Managed Identity, SQL Authentication, Service Principal Relational data storage and processing
Azure Blob Storage Account Key, SAS URI, Managed Identity Unstructured object storage, data staging
Azure Data Lake Storage Gen2 Account Key, SAS URI, Managed Identity, Service Principal Large-scale data lake for analytics
Azure Synapse Analytics System/User Assigned Managed Identity, SQL Authentication, Service Principal Data warehousing, big data analytics
On-premises SQL Server SQL Authentication, Windows Authentication (via Self-Hosted IR) Integrating on-premises relational databases
REST API OAuth2, API Key, Basic Authentication Interacting with web services and third-party APIs

For more detailed information, you can refer to the official Azure Data Factory documentation.