Build Data Pipelines Without Coding with AWS Glue Studio

 

With AWS Glue Studio, you can integrate your data and collaborate on data preparation.

What is AWS Glue Studio?

AWS Glue Studio is a visual interface within the AWS Glue service that helps data scientists, engineers, and developers create, run, and monitor ETL (Extract, Transform, Load) jobs. AWS Glue itself is a fully managed ETL service that makes it easy to prepare and load data for analytics.

AWS Glue Studio tutorial

AWS has announced the general availability of data preparation authoring in AWS Glue Studio Visual ETL. This new no-code data preparation capability for business users and data analysts provides a spreadsheet-style user interface and runs data integration tasks at scale on AWS Glue for Spark. The new visual data preparation experience makes it easier for data scientists and analysts to clean and transform data to prepare it for analytics and machine learning (ML). With this new experience, you can automate data preparation tasks without writing any code by choosing from hundreds of prebuilt transforms.

An AWS Glue job contains a script that connects to your source data, processes it, and then writes it to your data target. A job typically runs extract, transform, and load (ETL) scripts. Jobs can run scripts written for the Apache Spark and Ray runtime environments, and they can also run general-purpose Python scripts (Python shell jobs). AWS Glue triggers can start jobs on demand, on a schedule, or in response to an event. You can monitor job runs to understand runtime metrics such as completion status, duration, and start time.
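As a concrete sketch of starting and monitoring a job run outside the console, the boto3 snippet below starts a run on demand and polls its runtime metrics; the job name and Region are hypothetical placeholders.

import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")  # hypothetical Region

# Start the job on demand; triggers can do the same on a schedule or in response to an event.
run_id = glue.start_job_run(JobName="my-etl-job")["JobRunId"]  # hypothetical job name

# Poll the run to read runtime metrics such as status, start time, and duration.
while True:
    job_run = glue.get_job_run(JobName="my-etl-job", RunId=run_id)["JobRun"]
    if job_run["JobRunState"] in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        print(job_run["JobRunState"], job_run.get("StartedOn"), job_run.get("ExecutionTime"))
        break
    time.sleep(30)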

You can use scripts that AWS Glue generates, or you can supply your own. Given a source schema and a target location or schema, the AWS Glue Studio code generator can automatically create an Apache Spark API (PySpark) script. You can use this script as a starting point and edit it to fit your needs.
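The outline below is a minimal, hand-written sketch of what such a generated PySpark script typically looks like; the S3 paths and column mappings are hypothetical and would be replaced by your own source and target.

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Boilerplate that generated AWS Glue scripts start with.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the source data (hypothetical S3 path).
source = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/input/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Rename and cast columns to match the target schema (hypothetical mappings).
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("name", "string", "attendee_name", "string"),
        ("year", "string", "year", "int"),
    ],
)

# Write the result to the data target (hypothetical S3 path).
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/output/"},
    format="parquet",
)

job.commit()

A script like this only runs inside the AWS Glue Spark runtime, where the awsglue library is available.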

AWS Glue can write output files in multiple data formats. Each job type may support different output formats, and common compression formats can be configured for specific data formats.
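For example, assuming the mapped DynamicFrame from the sketch above, a job can write the same data as gzip-compressed JSON or as Snappy-compressed Parquet; the S3 paths are again hypothetical.

# Gzip-compressed JSON, with compression set on the S3 connection options.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/output-json/", "compression": "gzip"},
    format="json",
)

# Snappy-compressed Parquet, with compression set as a format option.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/output-parquet/"},
    format="parquet",
    format_options={"compression": "snappy"},
)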

Authenticating to the AWS Glue console

In AWS Glue, the business logic that performs extract, transform, and load (ETL) work is called a job. You create jobs in the ETL section of the AWS Glue console.

Sign in to the AWS Management Console and open the AWS Glue console to view your current jobs. Then choose the Jobs tab in AWS Glue. The Jobs list shows the current job bookmark option, the last modification date, and the location of the script associated with each job.
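The same details can also be read programmatically; the boto3 sketch below lists each job's name, last modification date, and script location.

import boto3

glue = boto3.client("glue")

# One page of results; use NextToken to paginate if you have many jobs.
for glue_job in glue.get_jobs()["Jobs"]:
    print(
        glue_job["Name"],
        glue_job.get("LastModifiedOn"),
        glue_job.get("Command", {}).get("ScriptLocation"),
    )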

You can use AWS Glue Studio to edit your ETL jobs, either while creating a new job or after you have saved one, by editing the job script in developer mode or by editing the nodes in the visual editor. The visual editor also lets you add and remove nodes to build more complex ETL jobs.

Creating a job in AWS Glue Studio

You configure the nodes for your job in the visual job editor. Each node represents an action, such as reading data from the source or transforming it. Every node you add to your job has properties that describe the transform or the location of the data.

Data engineers and business analysts can now collaborate on data integration projects. Using the visual flow-based interface in AWS Glue Studio, data engineers define connections to the data and configure the ordering of the data flow, while business analysts draw on their data preparation experience to define the transformations and output. You can also import your existing data cleansing and preparation recipes from AWS Glue DataBrew into the new AWS Glue data preparation experience, continue authoring them directly in AWS Glue Studio, and then scale them up to handle petabytes of data at a fraction of the cost.

Prerequisites for Visual ETL

Visual ETL requires the AWSGlueConsoleFullAccess IAM managed policy to be attached to the users and roles that will access AWS Glue.
This policy grants those users and roles full access to AWS Glue and read access to Amazon Simple Storage Service (Amazon S3) resources.
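As a minimal sketch of attaching that managed policy with boto3 (the role and user names are hypothetical; the policy ARN is the standard one for AWSGlueConsoleFullAccess):

import boto3

iam = boto3.client("iam")
policy_arn = "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess"

# Attach the managed policy to a role used with AWS Glue (hypothetical name).
iam.attach_role_policy(RoleName="MyGlueStudioRole", PolicyArn=policy_arn)

# The same policy can also be attached to an IAM user (hypothetical name).
iam.attach_user_policy(UserName="my-analyst-user", PolicyArn=policy_arn)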

Sophisticated visual ETL flows

After the necessary AWS Identity and Access Management (IAM) role permissions are in place, use AWS Glue Studio to author the visual ETL.

Extract

To create an Amazon S3 source node, choose Amazon S3 from the list of Sources.
Select the newly created node and browse to the S3 dataset. Once the file is in place, choose Infer schema to configure the source node. A preview of the data in the .csv file appears in the visual interface.

To visualize the data, I first created an S3 bucket in the same Region as the AWS Glue visual ETL and uploaded a .csv file called visual ETL conference data.csv.
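A minimal boto3 sketch of that setup, assuming a hypothetical bucket name in the same Region as the visual ETL:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # hypothetical Region

# Create the bucket (us-east-1 needs no LocationConstraint; other Regions do).
s3.create_bucket(Bucket="example-visual-etl-bucket")  # hypothetical bucket name

# Upload the conference data file referenced by the Amazon S3 source node.
s3.upload_file(
    Filename="visual ETL conference data.csv",
    Bucket="example-visual-etl-bucket",
    Key="visual ETL conference data.csv",
)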

Transform

Once the source node is configured, add a Data Preparation Recipe and start a data preview session. The session usually takes two to three minutes to start.

When the data preview session is ready and the data frame has loaded, select Author Recipe to start an authoring session and add transformations. During the authoring session you can inspect the data, apply transformation steps, and see the transformed data interactively. Steps can be undone, redone, and reordered, and you can see each column's data type and statistical profile.

Load

After you have interactively prepared your data, you can share your work with data engineers, who can extend it with custom code and more sophisticated visual ETL flows and incorporate it into their production data pipelines.

Now available

The AWS Glue data preparation authoring experience is now generally available in all commercial AWS Regions where AWS Glue DataBrew is offered. Visit AWS Glue to learn more.
