Google Distributed Cloud Dataproc for Accessing PII data

Due to operational or regulatory constraints, Google Cloud customers who want to build or modernise their data lake architecture often have to keep a portion of their workloads and data on-premises.

With Dataproc on Google Distributed Cloud, which was unveiled in preview at Google Cloud Next ’24, you can now fully modernise your data lake with cloud-based technologies while building a hybrid data processing footprint that lets you store and process, on-premises, the data you cannot move to the cloud.

Using Google-provided hardware in your data centre, Dataproc on Google Distributed Cloud enables you to run Apache Spark processing workloads on-premises while preserving compatibility between your local and cloud-based technology.

For instance, in order to comply with regulatory obligations, a sizable European telecoms business is updating its data lake on Google Cloud while maintaining Personally Identifiable Information (PII) data on-premises on Google Distributed Cloud.

This blog post shows how to use Dataproc on Google Distributed Cloud to read PII data that is stored on-premises, compute aggregate metrics, and upload the final dataset to the data lake in the cloud using Google Cloud Storage.

Aggregate sensitive data on-premises

The customer in this test scenario is a telecom provider that keeps event records of user calls:

customer id	customer name	call duration	call type	signal strength	device type	location
1	<redacted>	141	Voice	379	LG Q6	Tammieview, PA
2	<redacted>	26	Video	947	Kyocera Hydro Elite	New Angela, FL
3	<redacted>	117	Voice	625	Huawei Y5	Toddville, MO
4	<redacted>	36	Video	382	iPhone X	Richmondview, NV
5	<redacted>	110	Video	461	HTC 10 evo	Cowanchester, KS
6	<redacted>	0	Video	326	Galaxy S7	Nicholsside, NV
7	<redacted>	200	Data	448	Kyocera Hydro Elite	New Taramouth, AR
8	<redacted>	178	Data	475	Galaxy S7	South Heather, CT
9	<redacted>	200	Voice	538	Oppo Reno6 Pro+ 5G	Gregoryburgh, ID
10	<redacted>	113	Voice	878	ZTE Axon 30 Ultra 5G	Karaview, NV
11	<redacted>	200	Data	722	Huawei P10 Lite	Petersonstad, IA
12	<redacted>	200	Voice	1	HTC 10 evo	West Danielport, CO
13	<redacted>	169	Voice	230	Samsung Galaxy S10+	North Jose, SD
14	<redacted>	198	Voice	1	Kyocera DuraForce	East Matthewmouth, AS
15	<redacted>	155	Data	757	Oppo Find X	Tuckerchester, MD
16	<redacted>	0	Data	1	ZTE Axon 30 Ultra 5G	New Tammy, NC
17	<redacted>	200	Data	656	Galaxy Note 7	East Jeanside, NJ
18	<redacted>	15	Data	567	Huawei Y5	Lake Patrickburgh, OH

This dataset contains PII. To comply with regulations, the PII must stay on-premises in the customer's own data centre, so the customer stores the data there in S3-compatible object storage. However, the customer now wants to use its larger data lake in Google Cloud to analyse signal strength by location and identify the best places to invest in new infrastructure.

Dataproc on Google Distributed Cloud can run the Spark job that aggregates signal quality entirely on-premises, which allows integration with Google Cloud data analytics while still meeting compliance requirements.

The Cloud Storage output shows multiple low-quality signal areas:

Location	Value
Georgefurt, MS	1.0
Scottside, MA	1.0
Monroemouth, FL	1.0
Lake Robert, OH	1.0
East Lauren, VA	1.0
Shelleyburgh, CT	1.0
Buckville, ID	1.0
Garzaton, WI	3.32
North Danielle, NY	3.99
Port Natalie, ID	5.43
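
The aggregation step described above can be illustrated with a minimal sketch in plain Python. It is not the Spark job from the scenario: it only mirrors the grouping logic (average signal strength per location, weakest areas first) on a few rows copied from the sample table. The function names are hypothetical, and a real job would express the same grouping with Spark rather than the standard library.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the sample call-record table (tab-separated).
SAMPLE_TSV = (
    "customer id\tcustomer name\tcall duration\tcall type\tsignal strength\tdevice type\tlocation\n"
    "1\t<redacted>\t141\tVoice\t379\tLG Q6\tTammieview, PA\n"
    "2\t<redacted>\t26\tVideo\t947\tKyocera Hydro Elite\tNew Angela, FL\n"
    "12\t<redacted>\t200\tVoice\t1\tHTC 10 evo\tWest Danielport, CO\n"
    "14\t<redacted>\t198\tVoice\t1\tKyocera DuraForce\tEast Matthewmouth, AS\n"
    "18\t<redacted>\t15\tData\t567\tHuawei Y5\tLake Patrickburgh, OH\n"
)


def average_signal_by_location(tsv_text):
    """Return {location: mean signal strength}.

    Only the location and signal strength columns are read, so customer
    id and customer name never appear in the result.
    """
    totals = defaultdict(lambda: [0.0, 0])  # location -> [sum, count]
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        entry = totals[row["location"]]
        entry[0] += float(row["signal strength"])
        entry[1] += 1
    return {loc: round(total / count, 2) for loc, (total, count) in totals.items()}


def weakest_locations(averages, limit=10):
    """Sort by ascending average signal (ties broken by name) and keep the first `limit`."""
    return sorted(averages.items(), key=lambda item: (item[1], item[0]))[:limit]


if __name__ == "__main__":
    averages = average_signal_by_location(SAMPLE_TSV)
    for location, value in weakest_locations(averages, limit=3):
        print(f"{location}\t{value}")
```

Because only the aggregated location and value pairs leave the function, the per-customer columns stay behind, which is the property the on-premises job relies on before anything is written to Cloud Storage.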

Reading PII data with Dataproc on Google Distributed Cloud involves several steps that keep data processing compliant with privacy requirements.

To read PII data with Google Distributed Cloud Dataproc, start by setting up your Google Cloud environment.

Create a Google Cloud project: If you don’t have one, create one in GCP.
Project billing: Enable billing for the project.
Enable APIs: In your Google Cloud project, enable the Dataproc API, the Cloud Storage API, and any other relevant APIs.

Prepare PII

Secure storage: Store PII securely in Google Cloud Storage. Encrypt the data and restrict access to the bucket and its objects.
Classify data: Label data by sensitivity and compliance requirements.
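
As a rough illustration of labelling data by sensitivity, the sketch below tags each column of the sample call-record schema and drops anything labelled as PII. The label assignments, the `Sensitivity` enum, and the helper names are assumptions made for this example, not a compliance ruling or part of any Dataproc API.

```python
from enum import Enum


class Sensitivity(Enum):
    PII = "pii"          # must not leave the on-premises environment
    NON_PII = "non_pii"  # may be used in aggregated output


# Hypothetical labels for the sample schema (assumption for illustration).
COLUMN_LABELS = {
    "customer id": Sensitivity.PII,
    "customer name": Sensitivity.PII,
    "call duration": Sensitivity.NON_PII,
    "call type": Sensitivity.NON_PII,
    "signal strength": Sensitivity.NON_PII,
    "device type": Sensitivity.NON_PII,
    "location": Sensitivity.NON_PII,
}


def columns_labelled(label, labels=COLUMN_LABELS):
    """List the columns that carry the given sensitivity label."""
    return [column for column, value in labels.items() if value is label]


def drop_pii(record, labels=COLUMN_LABELS):
    """Keep only columns explicitly labelled NON_PII.

    Columns missing from the label map are treated as PII, so an
    unexpected new field is dropped rather than leaked.
    """
    return {
        column: value
        for column, value in record.items()
        if labels.get(column, Sensitivity.PII) is not Sensitivity.PII
    }


if __name__ == "__main__":
    record = {"customer id": "1", "customer name": "<redacted>",
              "signal strength": "379", "location": "Tammieview, PA"}
    print(drop_pii(record))
```

Treating unlabelled columns as PII is a deliberate fail-closed choice: a schema change then reduces the output instead of exposing a new sensitive field.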

Create and configure Dataproc Cluster

Create a Dataproc cluster using the Google Cloud Console or the gcloud command-line tool. Set the node count and machine type, and configure the cluster with the software and libraries your job needs.
Security Configuration: Set IAM roles and permissions to restrict data access and processing to authorised users.

Develop Your Data Processing Job

Choose a Processing Framework: Consider Apache Spark or Hadoop.
Write the Data Processing Job: Create a script or app to process PII. This may involve reading GCS data, transforming it, and writing the output to GCS or another storage solution.

Submit the Job to the Dataproc Cluster

Submit your job to the cluster via the Google Cloud Console, gcloud command-line tool, or Dataproc API.
Check the job status and logs to confirm that it completed successfully.

Compliance and Data Security

Encrypt data at rest and in transit.
Use IAM policies to restrict data and resource access.
Compliance: Follow data protection laws including GDPR and CCPA.

Delete the Dataproc Cluster

To save money, delete the Dataproc cluster once data processing is complete.

Best Practices

Always mask or anonymize PII data when processing.
Track access to and changes in PII data with detailed logging and monitoring.
Regularly audit data access and processing for compliance.
Data minimization: Process just the PII data you need.
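
One common way to mask identifiers during processing is a keyed hash. The sketch below uses HMAC-SHA256 from the Python standard library; the function names and the example key are hypothetical, and in practice the key would come from a secret store rather than the source code. Keyed hashing is pseudonymization rather than anonymization, so the output may still count as personal data under some regulations.

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace an identifier with its HMAC-SHA256 digest (hex encoded)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_record(record, pii_columns, secret_key):
    """Return a copy of the record with the listed PII columns pseudonymized."""
    return {
        column: pseudonymize(str(value), secret_key) if column in pii_columns else value
        for column, value in record.items()
    }


if __name__ == "__main__":
    # Hypothetical key for demonstration only; never hard-code real keys.
    key = b"example-key-for-demo-only"
    record = {"customer id": "7", "call type": "Data", "signal strength": "448"}
    print(mask_record(record, {"customer id"}, key))
```

The same input and key always produce the same digest, so masked records can still be grouped or joined per customer without exposing the original identifier.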

Conclusion

PII processing with Google Distributed Cloud Dataproc requires careful design and execution to maintain data protection and compliance. Follow the steps and best practices above to use Dataproc for data processing while protecting sensitive data.

Dataproc

Dataproc is a managed, scalable service that supports Apache Hadoop, Spark, Flink, Presto, and more than 30 open source tools and frameworks. Use Dataproc for secure data science, ETL, and data lake modernization at scale, integrated with Google Cloud, at a significantly lower cost.

ADVANTAGES

Bring your open source data processing up to date.

OSS for data science that is seamless and intelligent

Dataproc provides native integrations with BigQuery, Dataplex, Vertex AI, and OSS notebooks such as JupyterLab, so data scientists and analysts can carry out data science tasks with ease.

Google Cloud integration with enterprise security

Security features include default at-rest encryption, OS Login, VPC Service Controls, and customer-managed encryption keys (CMEK). You can enable Hadoop Secure Mode with Kerberos by adding a security configuration.

Key features

Completely automated and managed open-source big data applications

Serverless deployment, logging, and monitoring let you shift your attention from your infrastructure to your data and analytics. Reduce the TCO of managing Apache Spark by up to 54%. Integration with Vertex AI Workbench lets data scientists and engineers build and train models 5X faster than with standard notebooks. The Dataproc Jobs API makes it simple to integrate big data processing into custom applications, while Dataproc Metastore removes the need to run your own Hive metastore or catalogue service.

Use Kubernetes to containerise Apache Spark jobs

Build your Apache Spark jobs with Dataproc on Kubernetes so that you can use Dataproc with Google Kubernetes Engine (GKE) to provide job portability and isolation.

Google Cloud integration with enterprise security

When you create a Dataproc cluster, you can enable Hadoop Secure Mode with Kerberos by adding a Security Configuration. In addition, the Google Cloud-specific security features most commonly used with Dataproc include default at-rest encryption, OS Login, VPC Service Controls, and customer-managed encryption keys (CMEK).

The best of Google Cloud combined with the finest of open source

More than 30 open source frameworks, including Apache Hadoop, Spark, Flink, and Presto, are supported by the managed, scalable Dataproc service. Simultaneously, Dataproc offers native integration with the whole Google Cloud database, analytics, and artificial intelligence ecosystem. Building data applications and linking Dataproc to BigQuery, Vertex AI, Spanner, Pub/Sub, or Data Fusion is a breeze for data scientists and developers.