Build a real-time GDPR-aligned Apache Iceberg data lake

Data lakes are a popular choice for today’s organizations to store the data around their business activities. As a best practice of data lake design, data should be immutable once stored. But regulations such as the General Data Protection Regulation (GDPR) have created obligations for data operators, who must be able to erase or update personal data in their data lake when requested.

A data lake built on AWS uses Amazon Simple Storage Service (Amazon S3) as its primary storage environment. When a customer asks to erase or update private data, the data lake operator needs to find the objects in Amazon S3 that contain that data and take steps to erase or update it. This activity can be a complex process for the following reasons:

  • Data lakes may contain many S3 objects (each may contain multiple rows), and it’s often difficult to find the object containing the exact data that needs to be erased or the personally identifiable information (PII) that needs to be updated as per the request
  • By nature, S3 objects are immutable, so applying direct row-based transactions like DELETE or UPDATE isn’t possible

To handle these situations, a transactional feature on S3 objects is required, and frameworks such as Apache Hudi or Apache Iceberg provide that transactional feature for upserts in Amazon S3.

AWS contributed the Apache Iceberg integration with the AWS Glue Data Catalog, which enables you to use open-source data computation engines like Apache Spark with Iceberg on AWS Glue. In 2022, Amazon Athena announced support of Iceberg, enabling transaction queries on S3 objects.

In this post, we show you how to stream real-time data to an Iceberg table in Amazon S3 using AWS Glue streaming and perform transactions using Amazon Athena for deletes and updates. We use a serverless mechanism for this implementation, which requires minimum operational overhead to manage and fine-tune various configuration parameters, and enables you to extend your use case to ACID operations beyond the GDPR.

Solution overview

We used the Amazon Kinesis Data Generator (KDG) to produce synthetic streaming data in Amazon Kinesis Data Streams and then processed the streaming input data using AWS Glue streaming to store the data in Amazon S3 in Iceberg table format. As part of the customer’s request, we ran delete and update statements using Athena with Iceberg support.

The following diagram illustrates the solution architecture.

The solution workflow consists of the following steps:

  1. Streaming data is generated in JSON format using the KDG template and inserted into Kinesis Data Streams.
  2. An AWS Glue streaming job is connected to Kinesis Data Streams to process the data using the Iceberg connector.
  3. The streaming job output is stored in Amazon S3 in Iceberg table format.
  4. Athena uses the AWS Glue Data Catalog to store and retrieve table metadata for the Amazon S3 data in Iceberg format.
  5. Athena interacts with the Data Catalog tables in Iceberg format for transactional queries required for GDPR.

The codebase required for this post is available in the GitHub repository.


Before starting the implementation, make sure that the following prerequisites are met:

Deploy resources using AWS CloudFormation

Complete the following steps to deploy your solution resources:

  1. After you sign in to your AWS account, launch the CloudFormation template by choosing Launch Stack:
  2. For Stack name, enter a name.
  3. For Username, enter the user name for the KDG.
  4. For Password, enter the password for the KDG (this must be at least six alphanumeric characters and contain at least one number).
  5. For IAMGlueStreamingJobRoleName, enter a name for the IAM role used for the AWS Glue streaming job.
  6. Choose Next and create your stack.

This CloudFormation template configures the following resources in your account:

  • An S3 bucket named streamingicebergdemo-XX (note that the XX part is a random unique number to make the S3 bucket name unique)
  • An IAM policy and role
  • The KDG URL used for creating synthetic data
  1. After you complete the setup, go to the Outputs tab of the CloudFormation stack to get the S3 bucket name, AWS Glue job execution role (as per your input), and KDG URL.
  2. Before proceeding with the demo, create a folder named custdata under the created S3 bucket.

Create a Kinesis data stream

We use Kinesis Data Streams to create a serverless streaming data service that’s built to handle millions of events with low latency. The following steps guide you on how to create the data stream in the us-east-1 Region:

  1. Log in to the AWS Management Console.
  2. Navigate to the Kinesis console (make sure the Region is us-east-1).
  3. Select Kinesis Data Streams and choose Create data stream.
  4. For Data stream name, enter demo-data-stream.
  5. For this post, we select On-demand as the Kinesis data stream capacity mode.

On-demand mode eliminates the need for provisioning and managing the capacity for streaming data. However, you can implement this solution with Kinesis Data Streams in provisioned mode as well.

  1. Choose Create data stream.
  2. Wait for successful creation of demo-data-stream and for it to be in Active status.
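
If you prefer to script the console steps above, the following is a minimal sketch using the AWS SDK for Python (boto3). The function names are hypothetical and not part of the post's codebase; the stream name, capacity mode, and Region are the ones used in this walkthrough. It assumes boto3 is installed and your credentials allow creating Kinesis streams.

```python
def build_create_stream_request(stream_name: str = "demo-data-stream") -> dict:
    """Return keyword arguments for the Kinesis CreateStream API in on-demand mode."""
    return {
        "StreamName": stream_name,
        # On-demand mode: no shard count needs to be provisioned.
        "StreamModeDetails": {"StreamMode": "ON_DEMAND"},
    }


def create_stream(region: str = "us-east-1") -> None:
    """Create the stream and wait until it is ACTIVE (requires boto3 and AWS credentials)."""
    # boto3 is imported lazily so the request builder works without it installed.
    import boto3

    request = build_create_stream_request()
    kinesis = boto3.client("kinesis", region_name=region)
    kinesis.create_stream(**request)
    # The built-in waiter polls DescribeStream until the stream is ACTIVE.
    kinesis.get_waiter("stream_exists").wait(StreamName=request["StreamName"])
```

Calling create_stream() once has the same effect as the console steps; the request builder is kept separate so the parameters can be inspected without touching AWS.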

Set up the Kinesis Data Generator

To create a sample streaming dataset, we use the KDG URL generated on the CloudFormation stack Outputs tab and log in with the credentials used in the parameters for the CloudFormation template. For this post, we use the following template to generate sample data in the demo-data-stream Kinesis data stream.

  1. Log in to the KDG URL with the user name and password you supplied during stack creation.
  2. Change the Region to us-east-1.
  3. Select the Kinesis data stream demo-data-stream.
  4. For Records per second, choose Constant and enter 100 (it can be another number, depending on the rate of record creation).
  5. On the Template 1 tab, enter the KDG data generation template:
"year": "{{random.number({"min":2000,"max":2022})}}",
"month": "{{random.number({"min":1,"max":12})}}",
"day": "{{random.number({"min":1,"max":30})}}",
"hour": "{{random.number({"min":0,"max":24})}}",
"minute": "{{random.number({"min":0,"max":60})}}",
"customerid": {{random.number({"min":5023,"max":59874})}},
"firstname" : "{{name.firstName}}",
"lastname" : "{{name.lastName}}",
"dateofbirth" : "{{date.past(70)}}",
"city" : "{{address.city}}",
"buildingnumber" : {{random.number({"min":63,"max":947})}},
"streetaddress" : "{{address.streetAddress}}",
"state" : "{{address.state}}",
"zipcode" : "{{address.zipCode}}",
"country" : "{{address.country}}",
"countrycode" : "{{address.countryCode}}",
"phonenumber" : "{{phone.phoneNumber}}",
"productname" : "{{commerce.productName}}",
"transactionamount": {{random.number(

  1. Choose Test template to test the sample records.
  2. When the test output looks correct, choose Send data.

This starts sending 100 records per second to the Kinesis data stream. (To stop sending data, choose Stop Sending Data to Kinesis.)
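
To preview the shape of the records the stream will carry without opening the KDG, a rough local stand-in can be sketched with the Python standard library. This is not the KDG itself: the name, address, and product values are fixed placeholders instead of faker output, the dateofbirth format is a guess, and the transactionamount range is an assumption because the template's bounds are not shown above. Field names follow the columns of the customer table.

```python
import json
import random
from datetime import date, timedelta

# Placeholder value pools; the KDG draws these from its built-in faker helpers.
FIRST_NAMES = ["Ana", "Liu", "Omar"]
LAST_NAMES = ["Silva", "Chen", "Haddad"]
PRODUCTS = ["Widget", "Gadget", "Gizmo"]


def sample_record(rng: random.Random) -> dict:
    """Build one synthetic record with the same field names as the KDG template."""
    born = date.today() - timedelta(days=rng.randint(1, 70 * 365))
    return {
        "year": str(rng.randint(2000, 2022)),
        "month": str(rng.randint(1, 12)),
        "day": str(rng.randint(1, 30)),
        "hour": str(rng.randint(0, 24)),
        "minute": str(rng.randint(0, 60)),
        "customerid": rng.randint(5023, 59874),
        "firstname": rng.choice(FIRST_NAMES),
        "lastname": rng.choice(LAST_NAMES),
        "dateofbirth": born.isoformat(),  # placeholder format, not faker's
        "city": "Springfield",
        "buildingnumber": rng.randint(63, 947),
        "streetaddress": "1 Example Street",
        "state": "NY",
        "zipcode": "10001",
        "country": "United States",
        "countrycode": "US",
        "phonenumber": "555-0100",
        "productname": rng.choice(PRODUCTS),
        # Assumed range: the template's real bounds are not shown above.
        "transactionamount": rng.randint(1, 100),
    }


if __name__ == "__main__":
    print(json.dumps(sample_record(random.Random()), indent=2))
```

Note that customerid is generated as a number but stored as a string column, which is why the Athena queries later in this post compare it against quoted values.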

Integrate Iceberg with AWS Glue

To add the Apache Iceberg Connector for AWS Glue, complete the following steps. The connector is free to use and supports AWS Glue 1.0, 2.0, and 3.0.

  1. On the AWS Glue console, choose AWS Glue Studio in the navigation pane.
  2. In the navigation pane, navigate to AWS Marketplace.
  3. Search for and choose Apache Iceberg Connector for AWS Glue.
  4. Choose Accept Terms and Continue to Subscribe.
  5. Choose Continue to Configuration.
  6. For Fulfillment option, choose your AWS Glue version.
  7. For Software version, choose the latest software version.
  8. Choose Continue to Launch.
  9. Under Usage Instructions, choose the link to activate the connector.
  10. Enter a name for the connection, then choose Create connection and activate the connector.
  11. Verify the new connector on the AWS Glue Studio Connectors page.

Create the AWS Glue Data Catalog database

The AWS Glue Data Catalog contains references to data that is used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue. To create your data warehouse or data lake, you must catalog this data. The AWS Glue Data Catalog is an index to the location and schema of your data. You use the information in the Data Catalog to create and monitor your ETL jobs.

For this post, we create a Data Catalog database named icebergdemodb containing the metadata information of a table named customer, which will be queried through Athena.

  1. On the AWS Glue console, choose Databases in the navigation pane.
  2. Choose Add database.
  3. For Database name, enter icebergdemodb.

This creates an AWS Glue database for metadata storage.

Create a Data Catalog table in Iceberg format

In this step, we create a Data Catalog table in Iceberg table format.

  1. On the Athena console, create an Athena workgroup named demoworkgroup for SQL queries.
  2. Choose Athena engine version 3 for Query engine version.

For more information about Athena versions, refer to Changing Athena engine versions.

  1. Enter the S3 bucket location for Query result configuration under Additional configurations.
  2. Open the Athena query editor and choose demoworkgroup.
  3. Choose the database icebergdemodb.
  4. Enter and run the following DDL to create a table in the Data Catalog database icebergdemodb. Note that the TBLPROPERTIES section specifies ICEBERG as the table type and LOCATION points to the S3 folder (custdata) URI created in earlier steps. This DDL command is available on the GitHub repo.
CREATE TABLE icebergdemodb.customer(
year string,
month string,
day string,
hour string,
minute string,
customerid string,
firstname string,
lastname string,
dateofbirth string,
city string,
buildingnumber string,
streetaddress string,
state string,
zipcode string,
country string,
countrycode string,
phonenumber string,
productname string,
transactionamount int)
LOCATION '<S3 Location URI>'
TBLPROPERTIES ('table_type' = 'ICEBERG');

After you run the command successfully, you can see the table customer in the Data Catalog.
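
The same statements can also be submitted outside the query editor. The following is a minimal sketch using boto3, assuming the workgroup demoworkgroup already has a query result location configured as described above. The helper names are hypothetical and not taken from the post's repository.

```python
import time


def build_query_request(sql: str,
                        database: str = "icebergdemodb",
                        workgroup: str = "demoworkgroup") -> dict:
    """Return keyword arguments for the Athena StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # The workgroup supplies the engine version and the result location.
        "WorkGroup": workgroup,
    }


def run_query(sql: str, region: str = "us-east-1") -> str:
    """Submit a statement and poll until it finishes; returns the final state."""
    # Imported lazily so the request builder works without boto3 installed.
    import boto3

    athena = boto3.client("athena", region_name=region)
    query_id = athena.start_query_execution(**build_query_request(sql))["QueryExecutionId"]
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(1)  # simple fixed-interval polling
```

For example, run_query("SHOW TABLES") should return SUCCEEDED once the database exists and the workgroup is set up.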

Create an AWS Glue streaming job

In this section, we use the Spark script editor to create the AWS Glue streaming job, which fetches records from the Kinesis data stream.

  1. On the AWS Glue console, choose Jobs (new) in the navigation pane.
  2. For Create job, select Spark script editor.
  3. For Options, select Create a new script with boilerplate code.
  4. Choose Create.
  5. Enter the code available in the GitHub repo in the editor.

The sample code keeps appending data to the target location by fetching records from the Kinesis data stream.

  1. Choose the Job details tab in the query editor.
  2. For Name, enter Demo_Job.
  3. For IAM Role, choose demojobrole.
  4. For Type, choose Spark Streaming.
  5. For Glue version, choose Glue 3.0.
  6. For Language, choose Python 3.
  7. For Worker type, choose G 0.25X.
  8. Select Automatically scale the number of workers.
  9. For Maximum number of workers, enter 5.
  10. Under Advanced properties, select Use Glue Data Catalog as the Hive metastore.
  11. For Connections, choose the connector you created.
  12. For Job parameters, enter the following key-value pairs (provide your S3 bucket and account ID):
Key                               Value
--iceberg_job_catalog_warehouse   s3://streamingicebergdemo-XX/custdata/
--output_path                     s3://streamingicebergdemo-XX
--kinesis_arn                     arn:aws:kinesis:us-east-1:<AWS Account ID>:stream/demo-data-stream
--user-jars-first                 True

  1. Choose Run to start the AWS Glue streaming job.
  2. To monitor the job, choose Monitoring in the navigation pane.
  3. Select Demo_Job and choose View run details to check the job run details and Amazon CloudWatch logs.
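
The --iceberg_job_catalog_warehouse parameter tells the job where the Iceberg warehouse lives. As an illustration of how such a value is typically wired into Spark, the sketch below assembles the catalog properties that the Apache Iceberg documentation describes for the AWS Glue Data Catalog. The catalog name glue_catalog and the helper name are assumptions; the script in the GitHub repo may set these differently.

```python
def iceberg_spark_conf(warehouse: str, catalog: str = "glue_catalog") -> dict:
    """Return Spark properties for an Iceberg catalog backed by the Glue Data Catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Enables Iceberg SQL extensions such as MERGE INTO in Spark SQL.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.warehouse": warehouse,
        # Store table metadata pointers in the AWS Glue Data Catalog.
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        # Read and write data files through Iceberg's S3 file IO.
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }


if __name__ == "__main__":
    conf = iceberg_spark_conf("s3://streamingicebergdemo-XX/custdata/")
    for key, value in conf.items():
        print(f"{key}={value}")
```

In a Spark script, each pair would be passed to the session builder with config(key, value) before the session is created.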

Run GDPR use cases on Athena

In this section, we demonstrate a few use cases that are relevant to GDPR alignment with the user data that’s stored in Iceberg format in the Amazon S3-based data lake as implemented in the previous steps. For this, let’s consider that the following requests are being initiated in the workflow to comply with the regulations:

  • Delete the records for the input customerid (for example, 59289)
  • Update phonenumber for the customerid (for example, 51936)

The IDs used in this example are samples only because they were created through the KDG template used earlier, which creates sample data. You can search for IDs in your implementation by querying through the Athena query editor. The steps remain the same.

Delete data by customer ID

Complete the following steps to fulfill the first use case:

  1. On the Athena console, make sure icebergdemodb is selected as the database.
  2. Open the query editor.
  3. Enter the following query using a customer ID and choose Run:
SELECT count(*)
FROM icebergdemodb.customer
WHERE customerid = '59289';

This query gives the count of records for the input customerid before the delete.

  1. Enter the following query with the same customer ID and choose Run:
MERGE INTO icebergdemodb.customer trg
USING (SELECT customerid
FROM icebergdemodb.customer
WHERE customerid = '59289') src
ON (trg.customerid = src.customerid)
WHEN MATCHED
THEN DELETE;

This query deletes the data for the input customerid as per the workflow generated.

  1. Test if there is data with the customer ID using a count query.

The count should be 0.
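
When erasure requests arrive through an automated workflow rather than the query editor, the customer ID has to be placed into the SQL text safely. The following is a minimal sketch, assuming customer IDs are always plain digits as in the sample data; the function name is hypothetical. It returns the count, delete, and re-count statements in the order used above.

```python
def delete_statements(customer_id: str, table: str = "icebergdemodb.customer") -> list:
    """Return [count, delete, count] statements for one customer ID."""
    # Reject anything that is not plain ASCII digits before building SQL text.
    if not (customer_id.isascii() and customer_id.isdigit()):
        raise ValueError(f"customer ID must be digits only: {customer_id!r}")

    count_sql = f"SELECT count(*) FROM {table} WHERE customerid = '{customer_id}'"
    delete_sql = (
        f"MERGE INTO {table} trg "
        f"USING (SELECT customerid FROM {table} "
        f"WHERE customerid = '{customer_id}') src "
        "ON (trg.customerid = src.customerid) "
        "WHEN MATCHED THEN DELETE"
    )
    # Count before, delete, then count again to confirm the result is 0.
    return [count_sql, delete_sql, count_sql]


if __name__ == "__main__":
    for statement in delete_statements("59289"):
        print(statement)
```

The strict digits-only check is a design choice for this demo's ID format; a deployment with other ID formats would need its own validation rule.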

Update data by customer ID

Complete the following steps to test the second use case:

  1. On the Athena console, make sure icebergdemodb is selected as the database.
  2. Open the query editor.
  3. Enter the following query with a customer ID and choose Run.
SELECT customerid, phonenumber
FROM icebergdemodb.customer
WHERE customerid = '51936';

This query gives the value for phonenumber before the update.

  1. Run the following query to update the required columns:
MERGE INTO icebergdemodb.customer trg
USING (SELECT customerid
FROM icebergdemodb.customer
WHERE customerid = '51936') src
ON (trg.customerid = src.customerid)
WHEN MATCHED
THEN UPDATE SET phonenumber = '000';

This query updates the data to a dummy value.

  1. Run the SELECT query to check the update.

You can see the data is updated correctly.
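
For rectification requests, the replacement value is free text, so it needs quoting rather than a digits-only check. The sketch below shows one way to do that by doubling single quotes, which is the standard SQL rule for string literals. The function names are hypothetical and this is an illustration, not the post's implementation.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def update_phone_statement(customer_id: str, new_phone: str = "000",
                           table: str = "icebergdemodb.customer") -> str:
    """Return a MERGE statement that overwrites phonenumber for one customer."""
    if not (customer_id.isascii() and customer_id.isdigit()):
        raise ValueError(f"customer ID must be digits only: {customer_id!r}")
    return (
        f"MERGE INTO {table} trg "
        f"USING (SELECT customerid FROM {table} "
        f"WHERE customerid = '{customer_id}') src "
        "ON (trg.customerid = src.customerid) "
        f"WHEN MATCHED THEN UPDATE SET phonenumber = {sql_literal(new_phone)}"
    )


if __name__ == "__main__":
    print(update_phone_statement("51936"))
```

String values in Athena SQL use single quotes; double quotes denote identifiers, so a double-quoted value would be read as a column name.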

Vacuum table

A good practice is to run the VACUUM command periodically on the table because operations like INSERT, UPDATE, DELETE, and MERGE will take place on the Iceberg table. See the following code:

VACUUM icebergdemodb.customer;


The following are a few considerations to keep in mind for this implementation:

Clean up

Complete the following steps to clean up the resources you created for this post:

    1. Delete the custdata folder in the S3 bucket.
    2. Delete the CloudFormation stack.
    3. Delete the Kinesis data stream.
    4. Delete the S3 bucket storing the data.
    5. Delete the AWS Glue job and Iceberg connector.
    6. Delete the AWS Glue Data Catalog database and table.
    7. Delete the Athena workgroup.
    8. Delete the IAM roles and policies.


This post explained how you can use the Iceberg table format on Athena to implement GDPR use cases like data deletion and data upserts as required, when streaming data is being generated and ingested through AWS Glue streaming jobs in Amazon S3.

The operations for the Iceberg table that we demonstrated in this post aren’t all of the data operations that Iceberg supports. Refer to the Apache Iceberg documentation for details on various operations.

About the Authors

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He’s passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Rajdip Chaudhuri is a Solutions Architect with Amazon Web Services specializing in data and analytics. He enjoys working with AWS customers and partners on data and analytics requirements. In his spare time, he enjoys soccer.
