    S3 Datasets

    In data.all, an S3/Glue dataset is a representation of multiple AWS resources that helps users store data in a data lake and establish the basis for making this data discoverable and shareable with other teams.

    When data owners create an S3/Glue dataset, the following resources are deployed in the selected environment and its linked AWS account:

    1. Amazon S3 Bucket to store the data on AWS.
    2. AWS KMS key to encrypt the data on AWS.
    3. AWS IAM role that gives access to the data on Amazon S3 (Dataset IAM role, see below).
    4. AWS Glue database that is the representation of the structured data on AWS.

    Dataset IAM role

    Usage

    • Assumed by Dataset owners from data.all UI to quickly ingest or access Dataset data
    • Assumed by Dataset Glue crawler
    • Assumed by the Dataset Glue profiling job

    IAM Permissions

    • read and write permissions to the Dataset S3 Bucket (ONLY this bucket)
    • encrypt/decrypt data with the Dataset KMS key (ONLY this key)
    • read and write permissions to the Dataset Glue database and tables (ONLY this database)
    • read permissions to profiling/code folder in the Environment S3 Bucket (ONLY this folder)
    • read and write permissions to profiling/results/datasetUri folder in the Environment S3 Bucket (ONLY this folder)
    • permissions to put logs for the crawler and profiling job runs
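
    To make the scoping concrete, the bucket-related part of such a policy could look like the sketch below. This is an illustrative sketch with placeholder names, not the exact policy data.all generates:

    # Illustrative sketch only: a bucket-scoped read/write statement similar in
    # spirit to the Dataset IAM role policy. The bucket name is a placeholder.
    dataset_bucket = "anydataset-bucket"

    s3_read_write_statement = {
        "Sid": "DatasetBucketReadWrite",
        "Effect": "Allow",
        "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject"],
        "Resource": [
            f"arn:aws:s3:::{dataset_bucket}",    # bucket-level, for ListBucket
            f"arn:aws:s3:::{dataset_bucket}/*",  # object-level, for Get/Put
        ],
    }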

    Data Governance with Lake Formation

    In addition to restricting the access via IAM policies, Dataset Glue database and tables are protected using AWS Lake Formation. With Lake Formation, the Dataset IAM role gets granted access to the Dataset Glue database only.
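
    For intuition, a grant of this kind could be expressed with boto3 as follows. This is a minimal sketch with placeholder ARNs and names, not data.all's internal code:

    import boto3

    lf = boto3.client("lakeformation")

    # Grant the Dataset IAM role permissions on the dataset Glue database only.
    # The role ARN and database name below are placeholders.
    lf.grant_permissions(
        Principal={
            "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/anydataset-role"
        },
        Resource={"Database": {"Name": "anydataset_glue_db"}},
        Permissions=["ALL"],
    )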

    Glue Tables and S3 Folders

    Inside an S3/Glue dataset we can store structured data in Glue tables and unstructured data in S3 folders.

    • Tables are the representation of AWS Glue Catalog tables that are created on the dataset’s Glue database on AWS.
    • Folders are the representation of an Amazon S3 prefix where any type of file can be stored, such as images or unstructured text formats.

    Dataset ownership

    Dataset ownership refers to the ability to access, modify or remove data from a dataset, but also to the responsibility of assigning these privileges to others.

    • Owners: When you create a dataset and associate it with a team, the dataset business ownership belongs to the associated team.

    • Stewards: You can delegate the stewardship of a dataset to a team of stewards. You can type the name of an IdP group or choose one of the teams of your environment to be the dataset stewards.

    📝 The dataset owners team is a required, non-editable field, while stewards are optional and can be added after the dataset has been created. If no stewards team is designated, the dataset owner team will be the only one responsible for managing access to the dataset.

    Dataset access

    In this case we are referring to the ability to access, modify or remove data from a dataset. Who can access the dataset content? Users belonging to:

    • the dataset owner team
    • a dataset steward team
    • teams with an approved share request for the dataset content

    📝 Dataset metadata is available to all users in the centralized data catalog.

    📦 Create a dataset

    To create a new dataset, navigate to the Datasets view and click on New Dataset. A window like the one in the picture will allow you to select the type of Dataset you want to create or import. In this case you need to select the Create S3/Glue Dataset option.

    create_dataset

    | Field | Description | Required | Editable | Example |
    |---|---|---|---|---|
    | Dataset name | Name of the dataset | Yes | Yes | AnyDataset |
    | Short description | Short description about the dataset | No | Yes | For AnyProject predictive model |
    | Environment | Environment (mapped to an AWS account) | Yes | No | DataScience |
    | Region (auto-filled) | AWS region of the environment | Yes | No | Europe (Ireland) |
    | Organization (auto-filled) | Organization of the environment | Yes | No | AnyCompany EMEA |
    | Owners | Team that owns the dataset | Yes | No | DataScienceTeam |
    | Stewards | Team that can manage share requests on behalf of owners | No | Yes | FinanceBITeam, FinanceMgmtTeam |
    | Confidentiality | Level of confidentiality: Unclassified, Official or Secret | Yes | Yes | Secret |
    | Topics | Topics that can later be used in the Catalog | Yes, at least 1 | Yes | Finance |
    | Tags | Tags that can later be used in the Catalog | Yes, at least 1 | Yes | deleteme, ds |
    | Auto Approval | Whether shares for this dataset need approval from dataset owners/stewards | Yes (default Disabled) | Yes | Disabled, Enabled |

    📥 Import a dataset

    If you already have data stored on Amazon S3 buckets in your data.all environment, data.all has got you covered with the import feature. In addition to the fields of a newly created dataset, you have to specify the S3 bucket and, optionally, a Glue database and a KMS key alias. If the Glue database is left empty, data.all will create a Glue database pointing at the S3 bucket. As for the KMS key alias, data.all assumes that if nothing is specified the S3 bucket is encrypted with SSE-S3 encryption. data.all performs a validation check to ensure that the KMS key alias provided (if any) is the one that encrypts the specified S3 bucket.
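
    For reference, a validation of this kind could be sketched with boto3 as below. The bucket and alias names are placeholders, and this is not data.all's actual implementation:

    import boto3

    s3 = boto3.client("s3")
    kms = boto3.client("kms")

    bucket, alias = "imported-bucket", "somealias"  # placeholders

    # Read the bucket's default encryption configuration
    enc = s3.get_bucket_encryption(Bucket=bucket)
    rule = enc["ServerSideEncryptionConfiguration"]["Rules"][0][
        "ApplyServerSideEncryptionByDefault"
    ]

    if rule["SSEAlgorithm"] == "aws:kms":
        # Resolve the provided alias and compare it with the bucket's key
        key = kms.describe_key(KeyId=f"alias/{alias}")["KeyMetadata"]
        if rule["KMSMasterKeyID"] not in (key["Arn"], key["KeyId"]):
            raise ValueError("KMS key alias does not match the bucket encryption key")
    elif rule["SSEAlgorithm"] != "AES256":
        raise ValueError("Unexpected bucket encryption setting")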

    🚨 Imported KMS key and S3 Bucket policies requirements

    The data.all pivot role will handle data sharing on the imported bucket and KMS key (if imported). Make sure that the resource policies allow the pivot role to manage them. For the KMS key policy, explicit permissions are needed. See an example below.

    KMS key policy

    In the KMS key policy we need to grant explicit permission to the pivot role. At a minimum the following permissions are needed for the pivotRole:

    {
      "Sid": "Enable Pivot Role Permissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111122223333:role/dataallPivotRole-cdk"
      },
      "Action": [
        "kms:Decrypt",
        "kms:Encrypt",
        "kms:GenerateDataKey*",
        "kms:PutKeyPolicy",
        "kms:GetKeyPolicy",
        "kms:ReEncrypt*",
        "kms:TagResource",
        "kms:UntagResource",
        "kms:DescribeKey"
      ],
      "Resource": "*"
    }
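
    To attach such a statement programmatically, a hedged boto3 sketch could look like the following. The key ID is a placeholder, and the statement is held in a Python dict mirroring the JSON above:

    import json
    import boto3

    kms = boto3.client("kms")
    key_id = "1234abcd-12ab-34cd-56ef-1234567890ab"  # placeholder key ID

    # The statement shown above, as a Python dict
    pivot_role_statement = {
        "Sid": "Enable Pivot Role Permissions",
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::111122223333:role/dataallPivotRole-cdk"},
        "Action": [
            "kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey*",
            "kms:PutKeyPolicy", "kms:GetKeyPolicy", "kms:ReEncrypt*",
            "kms:TagResource", "kms:UntagResource", "kms:DescribeKey",
        ],
        "Resource": "*",
    }

    # Merge the statement into the existing key policy instead of overwriting it
    policy = json.loads(kms.get_key_policy(KeyId=key_id, PolicyName="default")["Policy"])
    policy["Statement"].append(pivot_role_statement)
    kms.put_key_policy(KeyId=key_id, PolicyName="default", Policy=json.dumps(policy))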
    
    

    ✅ Update imported Datasets

    Importing KMS keys was added in the V1.6.0 release. Any previously imported bucket will have its KMS key alias set to Undefined. If that is the case and you want to update the dataset and import a KMS key alias, data.all lets you edit the dataset in the Edit window.

    import_dataset

    | Field | Description | Required | Editable | Example |
    |---|---|---|---|---|
    | Amazon S3 bucket name | Name of the S3 bucket you want to import | Yes | No | DOC-EXAMPLE-BUCKET |
    | Amazon KMS key Alias | Alias of the KMS key used to encrypt the S3 bucket (do not include the alias/ prefix, just the alias name) | No | No | somealias |
    | AWS Glue database name | Name of the Glue database that you want to import | No | No | anyDatabase |

    (Going Further) Support for Datasets with Externally-Managed Glue Catalog

    If the dataset you are trying to import relates to a Glue database that is managed in a separate account, data.all's import dataset feature can also handle importing and sharing these types of datasets. This assumes the following pre-requisites are complete:

    • There exists an AWS Account (i.e. the Catalog Account) which is:
      • Onboarded as a data.all environment (e.g. Env A)
      • Contains the Glue Database with Location URI (as S3 Path from Dataset Producer Account) AND Tables
      • Glue Database has a resource tag owner_account_id=<PRODUCER_ACCOUNT_ID>
      • Data lake location registered in Lake Formation, with the role used to register it having permissions on the S3 bucket from the Dataset Producer Account
      • Resource Link created on the Glue Database to grant permission for the Dataset Producer Account on the Database and Tables
    • There exists another AWS Account (i.e. the Dataset Producer Account) which is:
      • Onboarded as a data.all environment (e.g. Env B)
      • Contains the S3 Bucket that contains the data (used as S3 Path in Catalog Account)

    The data.all producer, a member of EnvB Team(s), would import the dataset, specifying as S3 bucket the bucket that exists in the Dataset Producer Account, and as Glue database name the name of the Glue DB resource link in the Dataset Producer Account.

    This dataset will then be properly imported and can be discovered and shared the same way as any other dataset in data.all.
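
    As a rough illustration of two of the catalog-side prerequisites above (the owner_account_id tag and the resource link), a boto3 sketch could look like this. All account IDs, the region and the names are placeholders; adapt the resource link creation to the account where it should live per your Lake Formation setup:

    import boto3

    glue = boto3.client("glue")

    catalog_account = "111122223333"   # placeholder Catalog Account id
    producer_account = "444455556666"  # placeholder Dataset Producer Account id
    catalog_db = "shared_catalog_db"   # placeholder Glue database name

    # Tag the catalog Glue database with the producer account id
    glue.tag_resource(
        ResourceArn=f"arn:aws:glue:eu-west-1:{catalog_account}:database/{catalog_db}",
        TagsToAdd={"owner_account_id": producer_account},
    )

    # Create a resource link pointing at the catalog database
    glue.create_database(
        DatabaseInput={
            "Name": f"{catalog_db}_link",
            "TargetDatabase": {
                "CatalogId": catalog_account,
                "DatabaseName": catalog_db,
            },
        }
    )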

    🔍 Navigate dataset tabs

    When we belong to the dataset owner team

    After creating or importing a dataset, it will appear in the datasets list (click on Datasets on the left side pane). In this window, it will only be visible to users belonging to the dataset owner team. If we select one of our datasets, we will see the following dataset window:

    with

    When we DON’T belong to the dataset owner team

    How do we access a dataset if we don’t have access to it? IN THE CATALOG! On the left pane, click on Catalog, find the dataset you are interested in, and click on it. If you don’t have access to it, you should see only some of the tabs compared with the previous picture, something like:

    without

    ✏️ Edit and update a dataset

    Data owners can edit the dataset by clicking on the edit button, editing the editable fields and saving the changes.

    🗑️ Delete a dataset

    To delete a dataset, in the selected dataset window click on the delete button in the top-right corner. As with environments, it is possible to keep the AWS CloudFormation stack to keep working with the data and resources created but outside of data.all.

    ☁️ Check dataset info and access AWS

    The Overview tab of the dataset window contains dataset metadata, including governance and creation details. Moreover, AWS information related to the resources created by the dataset CloudFormation stack can be consulted here: AWS Account, Dataset S3 bucket, Glue database, IAM role and KMS Alias.

    You can also assume this IAM role to access the S3 bucket in the AWS console by clicking on the S3 bucket button. Alternatively, click on AWS Credentials to obtain programmatic access to the S3 bucket (only available if modules.dataset.features.aws_actions is set to True in the config.json used for deployment of data.all).

    overview
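
    If the AWS Credentials action is enabled, the temporary credentials it returns can be plugged into a boto3 session. A minimal sketch, with every value below a placeholder copied from the AWS Credentials dialog:

    import boto3

    # Placeholder temporary credentials from the "AWS Credentials" dialog
    session = boto3.Session(
        aws_access_key_id="ASIA...",
        aws_secret_access_key="...",
        aws_session_token="...",
    )

    s3 = session.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="anydataset-bucket"):  # placeholder bucket
        for obj in page.get("Contents", []):
            print(obj["Key"])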

    🗂️ Fill the dataset with data

    Tables

    Quickly upload a file for data exploration

    Users may want to experiment with a small set of data (e.g. a CSV file). To create tables from a file, we first upload the file, then run the crawler to infer its schema, and finally read the schema by synchronizing the table: Upload & Crawl & Sync.

    1. Upload data: Go to the Upload tab of the dataset and browse or drop your sample file. It will be uploaded to the dataset S3 bucket in the prefix specified. By default, a Glue crawler will be triggered by the upload of a file; however, this feature can be disabled as shown in the picture.

    upload

    2. Crawl data: the file has been uploaded but the table and its schema have not been registered in the dataset Glue Catalog database. If you have disabled the crawler in the upload, click on the Start Crawler button in the Data tab. If you just want to crawl one prefix, you can specify it in the Start Crawler feature.

    crawl

    3. Synchronize tables: Once crawled and registered in the Glue database, you can synchronize tables from your dataset’s AWS Glue database by using the Synchronize tables feature in the Data tab. In any case, data.all will automatically synchronize the tables for you every 15 minutes.

    You can preview your small set of data right away from data.all, check Tables.
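
    The same upload-crawl flow can also be reproduced outside the UI. A minimal boto3 sketch, with placeholder bucket, prefix and crawler names:

    import boto3

    s3 = boto3.client("s3")
    glue = boto3.client("glue")

    # 1. Upload a sample file to the dataset bucket (names are placeholders)
    s3.upload_file("sample.csv", "anydataset-bucket", "raw/sample.csv")

    # 2. Start the dataset crawler to infer and register the table schema
    glue.start_crawler(Name="anydataset-crawler")

    # 3. Poll the crawler state; tables can be synchronized once it is READY again
    state = glue.get_crawler(Name="anydataset-crawler")["Crawler"]["State"]
    print(state)  # RUNNING, STOPPING or READY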

    Ingest data

    If you need to ingest larger quantities of data, manage bigger files, or simply cannot work with local files that can be uploaded, this is your section!

    There are multiple ways of filling our datasets with data, and the steps don’t differ much from the upload-crawl-sync example.

    • Crawl & Sync option: we can drop the data from the source to our dataset S3 bucket. Then, we will crawl and synchronize data as we did in the previous steps 2 and 3.

    • Register & Sync option: we drop the data from the source to our dataset S3 bucket. However, if we want to have more control over our tables and its schema, instead of starting the crawler we can register the tables in the Glue Catalog and then click on Synchronize as we did in step 3.

    How do we register Glue tables? There are numerous ways: directly from the AWS Glue console, with infrastructure-as-code templates, or programmatically through the Glue API, as sketched below.
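
    A hedged example of registering a CSV-backed table with boto3; the database, bucket and column names are placeholders:

    import boto3

    glue = boto3.client("glue")

    # Register a CSV-backed external table directly in the Glue Catalog
    glue.create_table(
        DatabaseName="anydataset_glue_db",
        TableInput={
            "Name": "books",
            "TableType": "EXTERNAL_TABLE",
            "StorageDescriptor": {
                "Columns": [
                    {"Name": "book_id", "Type": "string"},
                    {"Name": "author", "Type": "string"},
                    {"Name": "num_pages", "Type": "int"},
                ],
                "Location": "s3://anydataset-bucket/books/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {"field.delim": ","},
                },
            },
        },
    )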

    (Going Further) Creating Filters on Tables

    Additionally, dataset owners can create column-level or row-level filters on their dataset tables to more granularly restrict data access when sharing with other teams.

    To do so dataset owners can navigate to the Filters Tab for a given table and select Add New Filter:

    dataset_table_filter

    When creating filters, you have the choice to create a column-level filter or a row-level filter. Column-level filters prompt the user to select a subset of columns to include for the table. Row-level filters use row expressions to specify the rows to include for the table.

    An example of creating a column filter is below:

    dataset_table_filter_col

    This filter restricts access on the table to only the 3 selected columns: book_id, author, and publisher.

    An example of creating a row filter is below:

    dataset_table_filter_row

    This filter restricts access to only the rows where book_id is not null, title is LIKE %Harry Potter%, and num_pages is greater than 100. It is important to note that:

    • The row filter acts as the intersection (logical ‘AND’) of the row expression(s) - if you need the union (logical ‘OR’) of multiple expressions you can create separate filters here and apply multiple to the table share item
    • When creating a new row expression, be sure to save it by clicking the save icon (highlighted in red in the image above) before creating the filter

    Once the filters are created, they will show up in the table’s Filters tab:

    dataset_table_filter_filled

    Table filters are not editable. To update an existing filter you must:

    1. Revoke all associated share items using the filter (if applicable)
    2. Delete the table filter
    3. Create a new table filter with any updates as necessary

    These filters can be used when reviewing and approving share objects with table share items to more granularly limit data access.
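
    Assuming these filters correspond to Lake Formation data cell filters under the hood (an assumption, not something stated above), an equivalent filter combining the column and row examples could be sketched with boto3 as:

    import boto3

    lf = boto3.client("lakeformation")

    # Sketch of a combined column + row filter; the account id, database,
    # table and filter names are placeholders.
    lf.create_data_cells_filter(
        TableData={
            "TableCatalogId": "111122223333",
            "DatabaseName": "anydataset_glue_db",
            "TableName": "books",
            "Name": "harry_potter_long_books",
            "RowFilter": {
                "FilterExpression": (
                    "book_id IS NOT NULL AND title LIKE '%Harry Potter%' "
                    "AND num_pages > 100"
                )
            },
            "ColumnNames": ["book_id", "author", "publisher"],
        }
    )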

    Folders

    As previously defined, folders are prefixes inside our dataset S3 bucket. To create a folder, go to the Data tab and, in the folders section, click on Create. The following form will appear. We will dive deeper into how to use folders in the Folders section.

    create_folder

    💬 Leave a message in Chat

    Using the Chats button, users can interact and leave their comments and questions in the Dataset Chat.

    feed

    🏷️ Create key-value tags

    Same as in environments: in the Tags tab of the dataset window, we can create key-value tags. These are not the data.all tags used to tag the dataset and find it in the catalog; in this case we are creating AWS tags as part of the dataset CloudFormation stack. There are multiple tagging strategies, as explained in the documentation.