Amazon DataZone previews end-to-end data lineage

Amazon DataZone Documentation

Unlock data across organisational boundaries with built-in governance.

What is Amazon DataZone?

Amazon DataZone is a data management service that makes it faster and easier for customers to catalogue, discover, share, and govern data stored on AWS, on premises, and in third-party sources. With Amazon DataZone, the administrators and data stewards who oversee an organisation’s data assets can manage and govern access to data using fine-grained controls. These controls are designed to ensure that data is accessed with the right level of privilege and context. Amazon DataZone makes data accessible throughout an organisation so that engineers, data scientists, product managers, analysts, and business users can discover it, use it, and collaborate to derive data-driven insights.

Features

Business data catalogue for Amazon DataZone

Search for published data, request access, and start working with your data in days instead of weeks.

Projects for Amazon DataZone

Collaborate with teams through data assets, and manage and monitor those assets across projects.

Amazon DataZone portal

Access analytics for data assets with a personalised view, through a web-based application or an API.

Governed data sharing in Amazon DataZone

A governed workflow helps ensure that the right data is accessed by the right people for the right purpose.

Use cases

Use the business data catalogue to discover data

Use business terms to search, share, and access catalogued data stored on AWS, on premises, or with third-party providers.

Simplify access to analytics

Discover, prepare, transform, analyse, and visualise data with a personalised view in a web-based application.

Streamline workflows

Increase productivity by collaborating across teams and by giving users self-service access to data and analytics tools.

Manage access from a single location

Manage and govern data access from a single location, in accordance with your organisation’s security policies.

Amazon DataZone is a data management service that helps an organisation’s data producers and consumers catalogue, discover, analyse, share, and govern data. Through a unified data portal, engineers, data scientists, product managers, analysts, and business users can access data throughout the organisation, then discover it, use it, and collaborate to derive data-driven insights.

Amazon DataZone now offers a new feature in preview, data lineage, which helps you visualise and understand data provenance, trace change management, conduct root cause analysis when a data error is reported, and prepare for questions about how data moves from source to target. The feature provides a comprehensive view of lineage events for an asset: events that Amazon DataZone captures automatically from its own catalogue, stitched together with events captured programmatically outside Amazon DataZone.

When you need to validate how data of interest originated in the organisation, you may have to rely on manual documentation or personal contacts. This manual process is time-consuming and can be inconsistent, which reduces your trust in the data. Data lineage in Amazon DataZone can help build that trust by showing where the data originated, how it has changed over time, and how it is consumed. For instance, a lineage view can trace the data from the moment it was captured as raw files in Amazon Simple Storage Service (Amazon S3), through its ETL transformations in AWS Glue, to the moment it was consumed in tools such as Amazon QuickSight.

With data lineage in Amazon DataZone, you can spend less time mapping a data asset and its relationships, developing and troubleshooting pipelines, and enforcing data governance practices. Data lineage lets you gather all lineage information in one place through an API. It then presents that information in a graphical view that helps data users be more productive, make better data-driven decisions, and identify the root cause of data problems.
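To illustrate how lineage information supports root cause analysis, the following minimal sketch models lineage as a list of upstream-to-downstream edges and walks backwards from a consuming asset. It is a conceptual illustration only, not how Amazon DataZone stores or queries lineage; the node names and the `upstream_of` helper are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges as (upstream, downstream) pairs.
# The names are illustrative and do not come from any real account.
EDGES = [
    ("s3://raw-bucket/orders.csv", "glue:clean_orders_job"),
    ("glue:clean_orders_job", "glue_catalog:sales.orders_clean"),
    ("glue_catalog:sales.orders_clean", "quicksight:orders_dashboard"),
]


def upstream_of(node, edges):
    """Return every ancestor of `node`, nearest first (breadth-first)."""
    # Index each node's direct parents for quick lookup.
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, []).append(src)

    seen = {node}
    order = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


if __name__ == "__main__":
    for ancestor in upstream_of("quicksight:orders_dashboard", EDGES):
        print(ancestor)
```

Starting from the dashboard, the walk visits the catalogued table, then the ETL job, then the raw file, which is the order in which you would typically inspect them when a data problem is reported.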

Getting started with data lineage in Amazon DataZone

In preview, you can start populating lineage information in Amazon DataZone programmatically in two ways: by creating lineage nodes directly with Amazon DataZone APIs, or by sending OpenLineage-compatible events from existing pipeline components to capture data movement and transformations that happen outside Amazon DataZone. For assets in the catalogue, Amazon DataZone automatically captures the lineage of their states (inventory or published) and of their subscriptions. This helps data consumers, such as data analysts and engineers, confirm that they are using the right data for their analysis, and helps producers, such as data engineers, track who is using the data they produced.
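As a rough sketch of what an OpenLineage-compatible event might look like, the following Python builds a run event as a plain dictionary using the core fields of the OpenLineage specification (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`, `schemaURL`). The `build_run_event` helper, the dataset and job names, the producer URL, and the schema version in `schemaURL` are all assumptions made for illustration; check them against the specification version you target and against the current Amazon DataZone documentation.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_namespace, job_name, inputs, outputs,
                    event_type="COMPLETE", run_id=None, event_time=None):
    """Build an OpenLineage-style run event as a JSON-serialisable dict.

    `inputs` and `outputs` are lists of (namespace, name) tuples.
    """
    def dataset(pair):
        namespace, name = pair
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,
        "eventTime": event_time or datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [dataset(d) for d in inputs],
        "outputs": [dataset(d) for d in outputs],
        # Hypothetical producer URI identifying the emitting pipeline.
        "producer": "https://example.com/my-pipeline",
        # Assumed schema version; use the one matching your target spec.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
    }


if __name__ == "__main__":
    event = build_run_event(
        job_namespace="example-etl",          # hypothetical
        job_name="clean_orders_job",          # hypothetical
        inputs=[("s3://raw-bucket", "orders.csv")],
        outputs=[("example-catalog", "sales.orders_clean")],
    )
    payload = json.dumps(event)
    print(payload)
    # Sending the payload is left out of this sketch. With boto3, the
    # DataZone client's post_lineage_event operation is the call to look
    # up; its exact parameters should be verified in the SDK reference.
```

Keeping event construction separate from transport means the same payload can be validated or logged before it is sent to any lineage backend.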

Once the information is sent, Amazon DataZone maps the identifiers sent through the APIs to the assets that are already catalogued and starts populating the lineage model. As new lineage information arrives, the model creates versions, so you can visualise an asset as it was at a given point in time and also navigate back to earlier versions.
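The versioning behaviour described above can be pictured with a small point-in-time lookup. This is a conceptual sketch only and says nothing about how Amazon DataZone implements versions internally; the `VersionedLineage` class and its method names are hypothetical.

```python
import bisect
from datetime import datetime


class VersionedLineage:
    """Keep one snapshot of an asset's upstream nodes per lineage event."""

    def __init__(self):
        self._times = []      # event timestamps, kept sorted
        self._snapshots = []  # upstream node set recorded at each timestamp

    def record(self, timestamp, upstream_nodes):
        """Store a new version, keeping versions ordered by timestamp."""
        i = bisect.bisect_right(self._times, timestamp)
        self._times.insert(i, timestamp)
        self._snapshots.insert(i, frozenset(upstream_nodes))

    def as_of(self, timestamp):
        """Return the latest version at or before `timestamp`, or None."""
        i = bisect.bisect_right(self._times, timestamp)
        return self._snapshots[i - 1] if i else None


if __name__ == "__main__":
    history = VersionedLineage()
    history.record(datetime(2024, 1, 1), {"raw_orders"})
    history.record(datetime(2024, 2, 1), {"raw_orders", "raw_customers"})

    # Lineage as it looked in mid-January, before the second event arrived.
    print(sorted(history.as_of(datetime(2024, 1, 15))))
    # Lineage after the most recent event.
    print(sorted(history.as_of(datetime(2024, 3, 1))))
```

Each new event adds a version rather than overwriting the previous one, which is what makes it possible to go back to an earlier view of the asset.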