An Automated Guide to Distributed and Decentralized Administration of Unity Catalog


Unity Catalog provides a unified governance solution for all data and AI assets in your lakehouse on any cloud. As customers adopt Unity Catalog, they want to do so programmatically and automatically, using an infrastructure-as-code approach. With Unity Catalog, there is a single metastore per region, which is the top-level container of objects in Unity Catalog. It stores data assets (tables and views) and the permissions that govern access to them.

This presents a new challenge for organizations that do not have centralized platform/governance teams to own the Unity Catalog administration function. Specifically, teams within these organizations now have to collaborate and work together on a single metastore, i.e., govern access and perform auditing in full isolation from one another.

In this blog post, we'll discuss how customers can leverage the support for Unity Catalog objects in the Databricks Terraform provider to manage a distributed governance pattern on the lakehouse effectively.

We present two solutions:

  • One that completely delegates responsibilities to teams when it comes to creating assets in Unity Catalog
  • One that limits which resources teams can create in Unity Catalog

Creating a Unity Catalog metastore

As a one-off bootstrap activity, customers need to create a Unity Catalog metastore per region they operate in. This requires an account administrator, which is a highly privileged role that should only be used in break-glass scenarios, i.e., a username & password stored in a secret vault that requires approval workflows to be used in automated pipelines.

An account administrator needs to authenticate using their username & password on AWS:


supplier "databricks" {
  host       = "https://accounts.cloud.databricks.com"
  account_id = var.databricks_account_id
  username   = var.databricks_account_username
  password   = var.databricks_account_password
}

Or using their AAD token on Azure:


supplier "databricks" {
  host       = "https://accounts.azuredatabricks.web"
  account_id = var.databricks_account_id
  auth_type  = "azure-cli" # or azure-client-secret or azure-msi
}

The Databricks account admin needs to provide:

  1. A single cloud storage location (S3/ADLS), which will be the default location to store data for managed tables
  2. A single IAM role / managed identity, which Unity Catalog will use to access the cloud storage in (1)

The Terraform code will look similar to the below (AWS example):


useful resource "databricks_metastore" "this" {
  identify          = "main"
  storage_root  = var.central_bucket
  proprietor         = var.unity_admin_group
  force_destroy = true
}

useful resource "databricks_metastore_data_access" "this" {
  metastore_id = databricks_metastore.this.id
  identify         = aws_iam_role.metastore_data_access.identify
  aws_iam_role {
    role_arn = aws_iam_role.metastore_data_access.arn
  }
  is_default = true
}

Teams can choose not to use this default location and identity for their tables by setting a location and identity for managed tables per individual catalog, or even more fine-grained at the schema level. When managed tables are created, the data will be stored using the schema location (if present), falling back to the catalog location (if present), and only falling back to the metastore location if the prior two locations have not been set.
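
As an illustration of the location part of this fallback, a team could override the metastore default at the catalog and schema level with something like the sketch below (the catalog, schema, and bucket names are hypothetical examples):

resource "databricks_catalog" "finance" {
  name         = "finance"
  comment      = "Catalog-level default location for managed tables"
  storage_root = "s3://finance-team-bucket/managed" # hypothetical bucket
}

resource "databricks_schema" "reports" {
  catalog_name = databricks_catalog.finance.name
  name         = "reports"
  comment      = "Schema-level location takes precedence over the catalog default"
  storage_root = "s3://finance-team-bucket/reports" # hypothetical bucket
}

Managed tables created in finance.reports would use the schema location, while tables in any other schema of the finance catalog would fall back to the catalog location.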

Nominating a metastore administrator

When creating the metastore, we nominated the unity_admin_group as the metastore administrator. To avoid having a central authority that can list and manage access to all objects in the metastore, we'll keep this group empty:


useful resource "databricks_group" "admin_group" {
  display_name = var.unity_admin_group
}

Users can be added to the group for exceptional break-glass scenarios which require a highly privileged admin (e.g., setting up initial access, changing ownership of a catalog if the catalog owner leaves the organization).


useful resource "databricks_user" "break_glass" {
  for_each  = toset(var.break_glass_users)
  user_name = every.key
  pressure     = true
}

useful resource "databricks_group_member" "admin_group_member" {
  for_each  = toset(var.break_glass_users)
  group_id  = databricks_group.admin_group.id
  member_id = databricks_user.break_glass[each.value].id
}

Delegating Responsibilities to Teams

Each team is responsible for creating their own catalogs and managing access to their data. Initial bootstrap actions are required for each new team to get the necessary privileges to operate independently.

The account admin then needs to perform the following:

  • Create a group called team-admins
  • Grant CREATE CATALOG and CREATE EXTERNAL LOCATION to this group, and optionally CREATE SHARE, CREATE PROVIDER, and CREATE RECIPIENT if using Delta Sharing

useful resource "databricks_group" "team_admins" {
  display_name = "team-admins"
}

useful resource "databricks_grants" "sandbox" {
  metastore = databricks_metastore.this.id
  grant {
    principal  = databricks_group.team_admins.display_name
    privileges = ["CREATE_CATALOG", "CREATE_EXTERNAL_LOCATION", "CREATE SHARE", "CREATE PROVIDER", "CREATE RECIPIENT"]
  }
}

When a new team onboards, place the trusted team admins in the team-admins group:


useful resource "databricks_user" "team_admins" {
  for_each  = toset(var.team_admins)
  user_name = every.key
  pressure     = true
}

useful resource "databricks_group_member" "team_admin_group_member" {
  for_each  = toset(var.team_admins)
  group_id  = databricks_group.team_admins.id
  member_id = databricks_user.team_admins[each.value].id
}

Members of the team-admins group can now easily create new catalogs and external locations for their team without interaction from the account administrator or metastore administrator.
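
For example, a member of team-admins could now apply something like the following sketch themselves (the catalog name is a hypothetical example):

resource "databricks_catalog" "team_a_sandbox" {
  name    = "team_a_sandbox"
  comment = "Created directly by a team-admins member, who becomes its owner"
}

Because the creator becomes the initial owner of the catalog, they can immediately grant access on it to the rest of their team.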

Onboarding new teams

During the process of adding a new team to Databricks, initial actions from an account administrator are required so that the new team is free to set up their workspaces / data assets to their preference:

  • A new workspace is created either by team X admins (Azure) or the account admin (AWS)
  • The account admin attaches the existing metastore to the workspace
  • The account admin creates a group specific to this team called ‘team_X_admins’, which contains the admins for the team to be onboarded:

useful resource "databricks_group" "team_X_admins" {
  display_name = "team_X_admins"
}

useful resource "databricks_user" "team_X_admins" {
  for_each  = toset(var.team_X_admins)
  user_name = every.key
  pressure     = true
}

useful resource "databricks_group_member" "team_X_admin_group_member" {
  for_each  = toset(var.team_X_admins)
  group_id  = databricks_group.team_X_admins.id
  member_id = databricks_user.team_X_admins[each.value].id
}
  • The account admin creates a storage credential and changes its owner to the ‘team_X_admins’ group so they can use it. If the team admins are trusted in the cloud tenant, they can then control what storage the credential has access to (e.g., any of their own S3 buckets or ADLS storage accounts).

useful resource "databricks_storage_credential" "exterior" {
  identify = "team_X_credential"
  azure_managed_identity {
    access_connector_id = azurerm_databricks_access_connector.ext_access_connector.id
  }
  remark = "Managed by TF"
  proprietor   = databricks_group.team_X_admins.display_name
}
  • The account admin then assigns the newly created workspace to the UC metastore:

useful resource "databricks_metastore_assignment" "this" {
  workspace_id         = var.databricks_workspace_id
  metastore_id         = databricks_metastore.this.id
  default_catalog_name = "hive_metastore"
}
  • Team X admins then create any number of catalogs and external locations as required (a sketch follows this list)
    • Because team admins are not metastore owners or account admins, they cannot interact with any entities (catalogs/schemas/tables, etc.) that they don't own, i.e., those belonging to other teams.
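
A minimal sketch of what team X admins could then apply themselves, assuming a hypothetical ADLS container and reusing the team_X_credential created for them above:

resource "databricks_external_location" "team_X" {
  name            = "team_X_location"
  url             = "abfss://landing@teamxstorage.dfs.core.windows.net/" # hypothetical container/account
  credential_name = "team_X_credential"
  comment         = "Managed by team X admins"
}

resource "databricks_catalog" "team_X" {
  name         = "team_X_catalog"
  comment      = "Owned by team X admins"
  storage_root = "${databricks_external_location.team_X.url}managed" # hypothetical path under the location above
}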

Restricted delegation of responsibilities to teams

Some organizations may not want to make teams autonomous in creating assets in their central metastore. In fact, giving multiple teams the ability to create such assets can be difficult to govern: naming conventions cannot be enforced, and keeping the environment clean is hard.

In such a scenario, we suggest a model where each team files a request with a list of assets they want admins to create for them. The team will be made owner of the assets so they can be autonomous in assigning permissions to others.

To automate such requests as much as possible, we present how this can be done using CI/CD. The admin team owns a central repository in their preferred versioning system where they keep all the scripts that deploy Databricks in their organization. Each team is allowed to create branches on this repository to add the Terraform configuration files for their own environments using a predefined template (a Terraform module). When the team is ready, they create a pull request. At this point, the central admin has to review the pull request (this can also be automated with the appropriate checks) and merge it to the main branch, which will trigger the deployment of the resources for the team.
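
As a sketch of what a team's contribution to the central repository could look like, assuming a hypothetical uc_team_onboarding module maintained by the admin team that wraps the resources shown in the next section (all values below are hypothetical examples):

module "team_X" {
  source = "./modules/uc_team_onboarding" # hypothetical module path

  team_X_admins                    = ["alice@example.com", "bob@example.com"]
  resource_group_name              = "rg-team-x"
  resource_group_region            = "westeurope"
  databricks_workspace_name        = "dbw-team-x"
  storage_account_name             = "teamxstorage"
  databricks_access_connector_name = "ac-team-x"
  databricks_catalog_name          = "team_x_catalog"
}

The pull request only adds a file like this; the resources it declares are reviewed by the central admins and created by the pipeline on merge.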

This approach allows more control over what individual teams do, but it entails some (limited, automatable) actions on the central admin team's side.

In this scenario, the Terraform scripts below are executed automatically by the CI/CD pipelines using a service principal (00000000-0000-0000-0000-000000000000), which is made an account admin. The one-off operation of making this service principal an account admin must be executed manually by an existing account admin, for example:


useful resource "databricks_service_principal" "sp" {
  application_id = "00000000-0000-0000-0000-000000000000"
}

useful resource "databricks_service_principal_role" "sp_account_admin" {
  service_principal_id = databricks_service_principal.sp.id
  position                 = "account admin"
}

Onboarding new teams

When a new team wants to be onboarded, they need to file a request that will create the following objects (Azure example):

  • Create a group called team_X_admins, which contains the account admin service principal (to allow future modifications to the assets) plus the members of the team

useful resource "databricks_group" "team_X_admins" {
  display_name = "team_X_admins"
}

useful resource "databricks_user" "team_X_admins" {
  for_each  = toset(var.team_X_admins)
  user_name = every.key
  pressure     = true
}

useful resource "databricks_group_member" "team_X_admin_group_member" {
  for_each  = toset(var.team_X_admins)
  group_id  = databricks_group.team_X_admins.id
  member_id = databricks_user.team_X_admins[each.value].id
}

knowledge "databricks_service_principal" "service_principal_admin" {
  application_id = "00000000-0000-0000-0000-000000000000"
}

useful resource "databricks_group_member" "service_principal_admin_member" {   
  group_id  = databricks_group.team_X_admins.id
  member_id = databricks_service_principal.service_principal_admin.id
}
  • A new resource group, or specify an existing one

useful resource "azurerm_resource_group" "this" {
  identify     = var.resource_group_name
  location = var.resource_group_region
}
  • A Premium Databricks workspace

useful resource "azurerm_databricks_workspace" "this" {
  identify                        = var.databricks_workspace_name
  resource_group_name         = azurerm_resource_group.this.identify
  location                    = azurerm_resource_group.this.location
  sku                         = "premium"
}
  • A new storage account, or provide an existing one

useful resource "azurerm_storage_account" "this" {
  identify                     = var.storage_account_name
  resource_group_name      = azurerm_resource_group.this.identify
  location                 = azurerm_resource_group.this.location
  account_tier             = "Customary"
  account_replication_type = "LRS"
  account_kind             = "StorageV2"
  is_hns_enabled           = "true"
}
  • A new container in the storage account, or provide an existing one

useful resource "azurerm_storage_container" "container" {
  identify                  = "container"
  storage_account_name  = azurerm_storage_account.this.identify
  container_access_type = "personal"
}
  • A Databricks access connector

useful resource "azurerm_databricks_access_connector" "this" {
  identify                = var.databricks_access_connector_name
  resource_group_name = azurerm_resource_group.this.identify
  location            = azurerm_resource_group.this.location
  identification {
    sort = "SystemAssigned"
  }
}
  • Assign the “Storage Blob Data Contributor” role to the access connector

useful resource "azurerm_role_assignment" "this" {
  scope                = azurerm_storage_account.this.id
  role_definition_name = "Storage Blob Information Contributor"
  principal_id         = azurerm_databricks_access_connector.metastore.identification[0].principal_id
}
  • Assign the central metastore to the newly created workspace

useful resource "databricks_metastore_assignment" "this" {
  metastore_id = databricks_metastore.this.id
  workspace_id = azurerm_databricks_workspace.this.workspace_id
}
  • Create a storage credential

useful resource "databricks_storage_credential" "storage_credential" {
  identify            = "mi_credential"
  azure_managed_identity {
    access_connector_id = azurerm_databricks_access_connector.this.id
  }
  remark         = "Managed identification credential managed by TF"
  proprietor           = databricks_group.team_X_admins
}
  • Create an external location and a catalog

useful resource "databricks_external_location" "external_location" {
  identify            = "exterior"
  url             = format("abfss://%[email protected]%s.dfs.core.home windows.web/",
                    "container",
                    "storageaccountname"
  )
  credential_name = databricks_storage_credential.storage_credential.id
  remark         = "Managed by TF"
  proprietor           = databricks_group.team_X_admins
  depends_on      = [
    databricks_metastore_assignment.this, databricks_storage_credential.storage_credential
  ]
}

useful resource "databricks_catalog" "this" {
  metastore_id = databricks_metastore.this.id
  identify         = var.databricks_catalog_name
  remark      = "This catalog is managed by terraform"
  proprietor        = databricks_group.team_X_admins
  storage_root = format("abfss://%[email protected]%s.dfs.core.home windows.web/managed_catalog",
                    "container",
                    "storageaccountname"
  )
}

Once these objects are created, the team is autonomous in developing their project and giving access to other team members and/or partners if necessary.
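
For instance, since the team owns the catalog, they could grant access to the rest of their team with a snippet like the following sketch (or the equivalent SQL GRANT statements); the team_X_users group is a hypothetical example:

resource "databricks_grants" "team_X_catalog_grants" {
  catalog = databricks_catalog.this.name
  grant {
    principal  = "team_X_users" # hypothetical group of catalog consumers
    privileges = ["USE_CATALOG", "USE_SCHEMA", "SELECT"]
  }
}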

Modifying assets for an existing team

Teams are not allowed to modify assets in Unity Catalog autonomously either. To do so, they file a new request with the central team by modifying the files they created and opening a new pull request.

The same applies if they need to create new assets such as storage credentials, external locations, and catalogs.

Unity Catalog + Terraform = well-governed lakehouse

Above, we walked through some guidance on leveraging built-in product features and recommended best practices to address enablement and ongoing administration hurdles for Unity Catalog.

Visit the Unity Catalog documentation [AWS, Azure] and our Unity Catalog Terraform guide [AWS, Azure] to learn more.
