Redshift Datasets
Data producers can import their Redshift tables into data.all and make them discoverable and shareable in an easy and secure manner.
In data.all we work with two main constructs:
- Redshift Connections, which store the necessary metadata to connect to a Redshift namespace.
- Redshift Datasets, groups of tables imported into the data.all Catalog using a data.all Redshift Connection.
Redshift Connections
Data.all Redshift Connections are metadata used by data.all and by data.all users to connect to Redshift namespaces.
1) Both Redshift Serverless and Provisioned clusters are supported. 2) Connections use AWS Secrets Manager secrets or Redshift users to connect to the namespace; check the documentation below to understand each mechanism. Additional connection mechanisms might be considered in the future.
Connection Types
Here is a table summarizing the two types of connections; keep reading to understand each type in depth.
Connection type | Purpose in data.all | Redshift permissions required | Grantable permissions |
---|---|---|---|
DATA_USER | Import Redshift Datasets | READ Redshift tables | None |
ADMIN | Process Redshift share requests | MANAGE Redshift datashares | Use Connection in share request |
DATA_USER Connections
DATA_USER connections are used to IMPORT Redshift Datasets into data.all. The Redshift user used in the connection should have READ permissions on the tables to be imported.
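As a quick reference, the sketch below shows one way a cluster administrator could grant that read access, using the Redshift Data API through boto3. The workgroup, database, schema and user names are hypothetical placeholders, not values defined by data.all:

```python
import boto3

# Placeholder names: adjust the workgroup, database, schema and user to your setup.
client = boto3.client("redshift-data")

grants = [
    "GRANT USAGE ON SCHEMA marketing TO marketing_user;",
    "GRANT SELECT ON ALL TABLES IN SCHEMA marketing TO marketing_user;",
]

for sql in grants:
    # Each statement runs under the caller's identity, which must itself hold GRANT rights.
    client.execute_statement(
        WorkgroupName="workgroup1",  # for a provisioned cluster use ClusterIdentifier instead
        Database="dev",
        Sql=sql,
    )
```

The same statements can of course be run directly from any SQL client connected to the cluster.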
Recommendations
In the following example there are two teams, ClusterAdminTeamA and MarketingTeam. Both have been onboarded to data.all and can log in to the UI. ClusterAdminTeamA administrates a Redshift cluster RedshiftClusterA in the AWS Account of a data.all environment EnvironmentA. The MarketingTeam works in this cluster creating some tables, marketingTables. It has been agreed that marketingTables should be imported into data.all. Which type of connection should we use? We need to create a DATA_USER connection with a user that can read marketingTables.
And, which team should own the connection? This depends on the data ownership requirements of your teams. The connection owners will be able to import the Redshift dataset, becoming the dataset owners. The Redshift dataset owners are in charge of managing the metadata of the dataset, editing/deleting it, and approving/revoking share requests. If in your organization ClusterAdminTeamA is in charge of managing all operations on the datasets, then they should be the owners of the connection. If, on the contrary, your organization has more distributed control over operations on the data.all dataset, then the MarketingTeam should own the connection.
ADMIN Connections
ADMIN connections are used by data.all to process Redshift datashare requests. The Redshift user used in the connection should have enough permissions to MANAGE DATASHARES in the cluster.
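For orientation, these are the kinds of datashare statements such a user must be able to run; the sketch below submits them through the Redshift Data API with boto3. The datashare name, schema, cluster identifier and consumer namespace ID are purely illustrative placeholders:

```python
import time
import boto3

client = boto3.client("redshift-data")

# Illustrative datashare lifecycle; all names and the namespace ID are placeholders.
statements = [
    "CREATE DATASHARE example_share;",
    "ALTER DATASHARE example_share ADD SCHEMA marketing;",
    "ALTER DATASHARE example_share ADD ALL TABLES IN SCHEMA marketing;",
    "GRANT USAGE ON DATASHARE example_share TO NAMESPACE 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee';",
]

for sql in statements:
    response = client.execute_statement(
        ClusterIdentifier="redshift-cluster-1",  # placeholder provisioned cluster
        Database="dev",
        SecretArn="arn:aws:secretsmanager:eu-west-1:000000000000:secret:redshift!redshift-cluster-1-awsuser",
        Sql=sql,
    )
    # Wait for each statement to finish before submitting the next one.
    while client.describe_statement(Id=response["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(1)
```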
Recommendations
We will continue the example of DATA_USER connections. Let's imagine that the MarketingTeam has happily imported the marketingTables Dataset and it is now published in the data.all Catalog. In another AWS Account, AccountB, linked to data.all as EnvironmentB, the ResearchTeam works in a Redshift cluster RedshiftClusterB managed by ClusterAdminTeamB. The ResearchTeam wants to request access to marketingTables.
Which type of connection should we use? We need to create an ADMIN connection with a user that can manage Redshift datashares in both RedshiftClusterA and RedshiftClusterB.
And, which team should own the connection? The connection owners should be teams with administrative rights over the clusters. In this case ClusterAdminTeamA and ClusterAdminTeamB should own AdminConnectionA and AdminConnectionB respectively.
How can the ResearchTeam use AdminConnectionB? The ClusterAdminTeamB needs to grant “Use Connection in share request” permissions for the connection AdminConnectionB to the ResearchTeam. After that, the ResearchTeam will be able to open share requests, but they won't be able to edit/delete AdminConnectionB. The steps to grant these permissions are explained in the Update Connection permissions subsection.
Create a Redshift Connection
data.all requires Redshift clusters and users to be managed by a dedicated team and created as infrastructure outside of data.all. For this reason, data.all works by “importing” existing infrastructure and users, and requires the following information on import:
- Redshift Serverless namespace/workgroup or Provisioned cluster: the user creating the connection must know the namespace ID and the workgroup for Redshift Serverless, or the cluster ID for Redshift Provisioned clusters.
- Redshift user: Redshift administrators manage Redshift users outside of data.all.
- Connection details:
    - Redshift user (only valid for Provisioned clusters): data.all will generate a temporary password to connect to the database. In this case no password or secret needs to be provided to data.all.
    - AWS Secrets Manager Secret (recommended): the username and password for the Redshift user can be stored in a Secret that MUST be tagged with 2 tags (see the sketch after this list). Check the pictures below to see how it should look in the AWS Console.
        - tagKey: dataall, tagValue: True - needed for data.all to be able to access the Secret
        - tagKey: Redshift, tagValue: Any - needed by Redshift to use the Secret as a connection
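If you prefer to look up these identifiers or create the tagged Secret programmatically rather than through the console, the sketch below shows one way with boto3. The secret name and credentials are placeholders:

```python
import json
import boto3

# Look up the identifiers that the connection form asks for.
serverless = boto3.client("redshift-serverless")
print([ns["namespaceId"] for ns in serverless.list_namespaces()["namespaces"]])
print([wg["workgroupName"] for wg in serverless.list_workgroups()["workgroups"]])

provisioned = boto3.client("redshift")
print([c["ClusterIdentifier"] for c in provisioned.describe_clusters()["Clusters"]])

# Create a Secrets Manager secret carrying the two tags listed above.
secretsmanager = boto3.client("secretsmanager")
secretsmanager.create_secret(
    Name="redshift-marketing-user",  # placeholder secret name
    SecretString=json.dumps({"username": "marketing_user", "password": "REPLACE_ME"}),
    Tags=[
        {"Key": "dataall", "Value": "True"},
        {"Key": "Redshift", "Value": "Any"},
    ],
)
```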
Redshift Connections are created inside the Environment view. Select an Environment and navigate to the Connections tab. Here you can click on the Add Connection button to create a new Redshift Connection.
Then, fill in the following form:
Field | Description | Required | Editable | Example |
---|---|---|---|---|
Connection name | Name of the Redshift connection | Yes | No | main-cluster-admin |
Connection type | Level of access of the connection. It can either be ADMIN or DATA_USER. See definitions above. | Yes | No | ADMIN |
Team | Team that owns the connection. This team is the only team that can use this connection to import datasets. | Yes | No | DataScienceTeam |
Redshift type | Type of Redshift namespace. It can either be serverless or cluster. | Yes | No | serverless |
Cluster Id | If the Redshift type is cluster, enter the cluster Id. | Yes | No | redshift-cluster-1 |
Namespace Id | If the Redshift type is serverless, enter the namespace Id. | Yes | No | 0000000-0000-0000-0000-000000000000 |
Workgroup | If the Redshift type is serverless, enter the workgroup. | Yes | No | workgroup1 |
Database | Database that we will connect to inside the cluster. | Yes | No | dev |
Redshift User | Only available for the cluster Redshift type. This is the user data.all connects as, using a temporary password. | Yes | No | user1 |
Secret Arn | Secrets Manager secret ARN storing the username and password for the connection. See the pre-requisites section above. | Yes | Yes | arn:aws:secretsmanager:eu-west-1:000000000000:secret:redshift!redshift-cluster-1-awsuser |
Data.all will verify the connection upon creation. If the database does not exist, or if the connection details are not accessible or do not correspond to the cluster, it will notify the user in an error banner.
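If creation fails and you want to reproduce the check outside of data.all, one option is to run a trivial query with the same details through the Redshift Data API. A minimal sketch, reusing the placeholder values from the table above:

```python
import time
import boto3

client = boto3.client("redshift-data")

# Placeholder connection details; use ClusterIdentifier instead of WorkgroupName for provisioned clusters.
response = client.execute_statement(
    WorkgroupName="workgroup1",
    Database="dev",
    SecretArn="arn:aws:secretsmanager:eu-west-1:000000000000:secret:redshift!redshift-cluster-1-awsuser",
    Sql="SELECT 1;",
)

# A FAILED status usually points at wrong credentials, a missing database or an unreachable namespace.
status = client.describe_statement(Id=response["Id"])
while status["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)
    status = client.describe_statement(Id=response["Id"])
print(status["Status"], status.get("Error"))
```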
Update Connection permissions
The owners of an ADMIN connection can grant other teams permissions to use the Connection in a share request. At the moment this is the only type of permission that can be granted, and it is only available for ADMIN connections.
Check the section on Connection types to understand the usage of this permission.
To update the permissions, select your environment and navigate to the Connections tab. You will see that the ADMIN connections have a button in the Permissions tab called View and Edit (it is disabled for DATA_USER connections).
If you click on the button, the following window will open. Here you can press Add group and select a group that will get “Use Connection in share request” permissions to the connection. Do not forget to click on the save icon to save the permissions.
Delete a Connection
To delete a connection, click on the trash icon next to the item in the Actions column. If the Connection has been used to import datasets it cannot be removed until all associated datasets are deleted.
📦 Import a Redshift Dataset
To create a new dataset, navigate to the Datasets view and click on New Dataset. A window like the one in the picture will allow you to select the type of Dataset you want to create or import. In this case you need to select the Import Redshift Dataset option.
Next, fill in the creation form with the Dataset details. To import Redshift Datasets, only connections of type DATA_USER can be used. Therefore, data.all will list the Redshift DATA_USER connections owned by the selected team in the environment and fetch the schemas and tables from Redshift. It is possible to select all tables or a subset of tables, as shown in the picture.
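If the list of tables looks incomplete, a quick way to see what the connection's user can actually read is to query the Redshift system view svv_all_tables as that user. A sketch with placeholder connection details and schema:

```python
import boto3

client = boto3.client("redshift-data")

# Lists the tables visible to the connection user in the placeholder "marketing" schema.
response = client.execute_statement(
    WorkgroupName="workgroup1",
    Database="dev",
    SecretArn="arn:aws:secretsmanager:eu-west-1:000000000000:secret:redshift-marketing-user",
    Sql="SELECT table_name FROM svv_all_tables WHERE schema_name = 'marketing';",
)
# Fetch the rows with get_statement_result once describe_statement reports FINISHED.
print(response["Id"])
```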
Generic dataset fields
Field | Description | Required | Editable | Example |
---|---|---|---|---|
Dataset name | Name of the dataset | Yes | Yes | AnyDataset |
Short description | Short description about the dataset | No | Yes | For AnyProject predictive model |
Environment | Environment (mapped to an AWS account) | Yes | No | DataScience |
Organization (auto-filled) | Organization of the environment | Yes | No | AnyCompany EMEA |
Team | Team that owns the dataset | Yes | No | DataScienceTeam |
Stewards | Team that can manage share requests on behalf of owners | No | Yes | FinanceBITeam, FinanceMgmtTeam |
Confidentiality | Level of confidentiality: Unclassified, Official or Secret | Yes | Yes | Secret |
Topics | Topics that can later be used in the Catalog | Yes, at least 1 | Yes | Finance |
Tags | Tags that can later be used in the Catalog | Yes, at least 1 | Yes | deleteme, ds |
Auto Approval | Whether shares for this dataset need approval from dataset owners/stewards | Yes (default Disabled) | Yes | Disabled, Enabled |
Redshift Dataset fields
Field | Description | Required | Editable | Example |
---|---|---|---|---|
Redshift Connection | Name of the Redshift connection used to read the Redshift tables. Only DATA_USER connections can be used to import Datasets. | Yes | No | main-cluster-userA |
Redshift database schema | Name of the Redshift schema where the tables are stored | Yes | No | public |
Redshift tables | List of tables to be imported. They can be added at a later stage | No | Yes | customer, orders |
Once a Redshift dataset has been imported, the dataset and its imported tables can be searched by any user in the Catalog.
🔍 Navigate Redshift dataset tabs
Overview
This tab includes meaningful metadata about the dataset and the Redshift connection used.
Data
This tab shows the Redshift database, schema and tables imported. From here we can add, edit, delete and see the details of a table.
Shares
This tab shows a list of the share requests for this Dataset. It is possible to verify the health of shares and reapply them for the entire Dataset.
Manage Redshift Tables
Add tables
View and edit tables
We can view the schema of a table directly from the Data tab, by clicking on the Open table schema button.
We can also see a full view of the table by selecting the arrow in the Actions column. A new window for the table will open. In this view we can edit the metadata of the table in data.all (Tags, glossary, description) and we can see the schema in full-width in the Columns tab.
Delete a table
We can delete Redshift tables by clicking on the trash icon next to the table we want to “un-import”. Un-import is a better word to describe what will happen: the metadata of the table will be deleted from the data.all Catalog, but the original Redshift table still exists in Redshift.
Dataset owners need to revoke access to the table before deleting. Data.all prevents deletion of a table if there are share requests currently sharing the table.
✏️ Edit and update a dataset
Data owners can edit the dataset by clicking on the edit button, editing the editable fields and saving the changes.
🗑️ Delete a dataset
To delete a dataset, click on the delete button in the top-right corner of the selected dataset window. data.all Redshift Datasets don't deploy any CloudFormation stack, so no additional resources need to be cleaned up. The original Redshift tables will still exist in Redshift.
As with single tables, Dataset owners need to revoke access to all tables before deleting. Data.all prevents deletion of a dataset if there are share requests currently sharing any dataset table.