Table of Contents

    Redshift Datasets

    Data producers can import their Redshift tables into data.all and make them discoverable and shareable in an easy and secure manner.

    In data.all we will work with 2 main constructs:

    • Redshift Connections, which store the necessary metadata to connect to a Redshift namespace
    • Redshift Datasets, group of tables imported into data.all Catalog using a data.all Redshift Connection.

    Redshift Connections

    Data.all Redshift Connections are metadata used by data.all and by data.all users to connect to Redshift namespaces.

    1) Both Redshift Serverless and Provisioned clusters are supported 2) Connections use AWS Secrets Manager secrets or Redshift users to connect to the namespace. Check the documentation to understand each mechanism. Additional connection mechanisms might be considered in the future.

    Connection Types

    Here is a table to summarize the 2 different types of connections, keep reading to understand each type in depth.

    Connection type Purpose in data.all Redshift permissions required Grantable permissions
    DATA_USER Import Redshift Datasets READ Redshift tables None
    ADMIN Process Redshift share requests MANAGE Redshift datashares Use Connection in share request

    DATA USER Connections

    DATA_USER connections are used to IMPORT Redshift dataset into data.all. The Redshift user used in the connection should have READ permissions to the tables to be imported.

    Recommendations

    In the following example there are 2 teams, ClusterAdminTeam and MarketingTeam. Both have been onboarded to data.all and can log in to the UI. The ClusterAdminTeamA is a team that administrates a Redshift cluster RedshiftClusterA in the AWS Account of a data.all environment EnvironmentA. The MarketingTeam works in this cluster creating some tables marketingTables

    It has been agreed that marketingTables should be imported to data.all. Which type of connection should we use? We need to create a DATA_USER connection with a user that can read marketingTables.

    And, which team should own the connection? This depends on the data ownership requirements of your teams. The connection owners will be able to import the Redshift dataset, becoming the dataset owners. The Redshift dataset owners are in charge of managing the metadata of the dataset, editing/deleting and approving/revoking share requests. If in your organization the ClusterAdminTeamA is in charge of managing all operations on the datasets then they should be the owners of the connection. If on the contrary, your organization has more distributed control over the operations on the data.all dataset, then the MarketingTeam should own the connection.

    ADMIN Connections

    ADMIN connections are used by data.all to process Redshift data share requests. The Redshift user used in the connection should have enough permissions to MANAGE DATASHARES in the cluster.

    Recommendations

    We will continue the example of DATA_USER connections. Let’s imagine that the MarketingTeam has happily imported the marketingTables Dataset and it is now published in the data.all Catalog. In another AWS Account AccountB, linked to data.all as EnvironmentB, the ResearchTeam works in a Redshift cluster RedshiftClusterB managed by ClusterAdminTeamB. The ResearchTeam wants to request access to marketingTables.

    Which type of connection should we use? We need to create an ADMIN connection with a user that can manage Redshift datashares in both the RedshiftClusterA and RedshiftClusterB.

    And, which team should own the connection? The Connection owners should be teams with administrative rights over the clusters. In this case the ClusterAdminTeamA and ClusterAdminTeamB should own the AdminConnectionA and AdminConnectionB respectively.

    How can the ResearchTeam use the AdminConnectionB? The ClusterAdminTeamB needs to grant “Use Connection in share request” permissions for the connection AdminConnectionB to the ResearchTeam. After that the ResearchTeam will be able to open share requests, but they won’t be able to edit/delete the AdminConnectionB. The steps to grant these permissions are explained in the Update Connection permissions subsection.

    Create a Redshift Connection

    data.all requires Redshift clusters and users to be managed by a dedicated team and infrastructure created outside of data.all. For this reason, data.all will work “importing” existing infrastructure and users, requiring the following information on import:

    • Redshift Serverless namespace/workgroup or Provisioned cluster: the user creating the connection must know the namespace ID and the workgroup for Redshift Serverless or the cluster ID for the case of Redshift Provisioned clusters.
    • Redshift user: Redshift administrators manage Redshift users outside of data.all.
    • Connection details:
      • Redshift user (only valid for Provisioned clusters): data.all will generate a temporary password to connect to the database. In this case no password or secret needs to be provided to data.all.
      • AWS Secrets Manager Secret (recommended): the username and password for the Redshift user can be stored in a Secret that MUST be tagged with 2 tags. Check the pictures below to see how it should look in the AWS Console.
        • tagKey: dataall, tagValue: True - Needed for data.all to be able to access the Secret
        • tagKey: Redshift, tagValue: Any - Needed by Redshift to use as connection

    Redshift Connections are created inside the Environment view. Select an Environment and navigate to the Connections tab. Here you can click on the Add Connection button to create a new Redshift Connection.

    Then, fill in the following form:

    Field Description Required Editable Example
    Connection name Name of the Redshift connection Yes No main-cluster-admin
    Connection type Level of access of the connection. It can either be ADMIN or DATA_USER. See definitions above. Yes No ADMIN
    Team Team that owns the connection. This team is the only team that can use this connection to import datasets. Yes No DataScienceTeam
    Redshift type Type of Redshift Namespace. It can either be serverless or cluster. Yes No serverless
    Cluster Id If the Redshift type is cluster, we need to introduce the cluster Id. Yes No redshift-cluster-1
    Namespace Id If the Redshift type is serverless, we need to introduce the namespace Id. Yes No 0000000-0000-0000-0000-000000000000
    Workgroup If the Redshift type is serverless, we need to introduce the workgroup. Yes No workgroup1
    Database Database that we will connect to inside the cluster. Yes No dev
    Redshift User Only available for cluster Redshift type. This is the user Yes No user1
    Secret Arn Secrets Manager secret arn storing username and password for the connection. See pre-requisites section above. Yes Yes arn:aws:secretsmanager:eu-west-1:000000000000:secret:redshift!redshift-cluster-1-awsuser

    Data.all will verify the connection upon creation. If the database does not exist or if the connection details are not accessible or do not correspond to cluster it will notify the user in the error banner.

    Update Connection permissions

    The owners of an ADMIN connection can grant other teams permissions to use the Connection in a share request. At the moment this is the only type of permission that can be granted and it is only available for ADMIN connections. Check the section on Connection types to understand the usage of this permission.

    To update the permissions, select your environment and navigate to the Connections tab. You will see that the ADMIN connections have a button in the Permissions tab called View and Edit (it is disabled for DATA_USER connections).

    If you click on the button the following window will open. Here you can press the Add group and select a group that will get “Use Connection in share request” permissions to the connection. Do not forget to click on the save icon to save the permissions.

    Delete a Connection

    To delete a connection, click on the trash icon next to the item in the Actions column. If the Connection has been used to import datasets it cannot be removed until all associated datasets are deleted.

    📦 Import a Redshift Dataset

    To create a new dataset, navigate to the Datasets view and click on New Dataset. A window like the one in the picture will allow you to select the type of Dataset you want to create or import. In this case you need to select the Import Redshift Dataset option.

    Next, fill in the creation form with the Dataset details. To import Redshift Datasets, only connections of the type DATA_USER can be used. Therefore, data.all will list the Redshift DATA_USER connections owned by the selected team in the environment and fetch the schemas and tables from Redshift. It is possible to select all tables or a subset of tables as appears in the picture.

    Generic dataset fields

    Field Description Required Editable Example
    Dataset name Name of the dataset Yes Yes AnyDataset
    Short description Short description about the dataset No Yes For AnyProject predictive model
    Environment Environment (mapped to an AWS account) Yes No DataScience
    Organization (auto-filled) Organization of the environment Yes No AnyCompany EMEA
    Team Team that owns the dataset Yes No DataScienceTeam
    Stewards Team that can manage share requests on behalf of owners No Yes FinanceBITeam, FinanceMgmtTeam
    Confidentiality Level of confidentiality: Unclassified, Oficial or Secret Yes Yes Secret
    Topics Topics that can later be used in the Catalog Yes, at least 1 Yes Finance
    Tags Tags that can later be used in the Catalog Yes, at least 1 Yes deleteme, ds
    Auto Approval Whether shares for this dataset need approval from dataset owners/stewards Yes (default Disabled) Yes Disabled, Enabled

    Redshift Dataset fields

    Field Description Required Editable Example
    Redshift Connection Name of the Redshift connection used to read the Redshift tables. Only DATA_USER connections can be used to import Datasets. Yes No main-cluster-userA
    Redshift database schema Name of the Redshift schema where the tables are stored Yes No public
    Redshift tables List of tables to be imported. They can be added at a later stage No Yes customer, orders

    Once a Redshift dataset has been imported, the dataset and its imported tables can be searched by any user in the Catalog.

    🔍 Navigate Redshift dataset tabs

    Overview

    This tab includes meaningful metadata about the dataset and the Redshift connection used.

    Data

    This tab shows the Redshift database, schema and tables imported. From here we can add, edit, delete and see the details of a table.

    Shares Show a list of the share requests for this Dataset. It is possible to verify the health and reapply shares for the entire Dataset

    Manage Redshift Tables

    Add tables

    View and edit tables

    We can view the schema of a table directly from the Data tab, by clicking on the Open table schema button.

    We can also see a full view of the table by selecting the arrow in the Actions column. A new window for the table will open. In this view we can edit the metadata of the table in data.all (Tags, glossary, description) and we can see the schema in full-width in the Columns tab.

    Delete a table

    We can delete Redshift tables by clicking on the trash icon next to the table we want to “un-import”. Un-import is a better word to describe what will happen: the metadata of the table will be deleted from data.all Catalog, but the original Redshift table still exists in Redshift.

    Dataset owners need to revoke access to the table before deleting. Data.all prevents deletion of a table if there are share requests currently sharing the table.

    ✏️ Edit and update a dataset

    Data owners can edit the dataset by clicking on the edit button, editing the editable fields and saving the changes.

    🗑️ Delete a dataset

    To delete a dataset, in the selected dataset window click on the delete button in the top-right corner. data.all Redshift Datasets don’t deploy any CloudFormation stack, no additional resources need to be cleaned up. The original Redshift tables will still exist in Redshift.

    In the same way as it happens with single tables, Dataset owners need to revoke access to all tables before deleting. Data.all prevents deletion of a dataset if there are share requests currently sharing any dataset table.