In big data analytics, data processing and machine learning have often lived in separate worlds. Data engineers used tools such as Apache Spark for large-scale data processing, while data scientists used pandas and scikit-learn for machine learning. This disconnected approach led to inefficiencies, data duplication, and delays in turning data into insight.
At the same time, AI success depends on massive amounts of data, so many organizations need to generate and manage synthetic data that mimics real-world data. Synthetic data is produced either by algorithmically modelling production datasets or by training ML techniques such as generative AI. It can stand in for operational or production data when training ML models or evaluating mathematical models.
BigQuery DataFrames Solutions
BigQuery DataFrames unites data processing and machine learning on a scalable, cost-effective platform, helping organizations accelerate data-driven initiatives, improve collaboration, and get more value from their data. BigQuery DataFrames is an open-source Python package that provides pandas-like DataFrames and scikit-learn-like ML APIs for data at scale.
It runs on BigQuery, drawing on Google Cloud storage and compute. Integration with Google Cloud Functions adds compute extensibility, while Vertex AI delivers generative AI capabilities, including state-of-the-art models. This versatility makes BigQuery DataFrames a strong foundation for building scalable AI applications.
BigQuery DataFrames lets you generate synthetic data at scale and avoids the concerns that come with moving data outside your ecosystem or relying on third-party solutions. When handling sensitive personal data, synthetic data protects privacy: it permits dataset sharing and collaboration without disclosing personal details.
Synthetic data also helps when taking analytical models to production. It makes testing and validation safe, and it lets you simulate edge cases, outliers, and rare events that may not appear in your real dataset. It also lets you model data warehouse schema or ETL process changes before making them, avoiding costly errors and downtime.
Synthetic data generation with BigQuery DataFrames
Many applications require synthetic data generation:
- Real data generation is costly and slow.
- Real data is governed by strict laws, restrictions, and oversight that synthetic data can sidestep.
- Simulations require larger datasets than the real data provides.
Let’s use BigQuery DataFrames and LLMs to produce synthetic data directly in BigQuery. The process has two primary stages, each with several substages:
Code generation
- Define the schema and instruct the LLM.
- The user knows the expected data schema.
- They understand, at a high level, what a data-generating program should look like.
- They describe, in a natural language (NL) prompt, the small-scale data generation code they want built.
- Add hints to the prompt to help the LLM generate correct code.
- Send the prompt to the LLM and get the generated code back.
Code execution
- Run the code as a remote function at the specified scale.
- Post-process the data into the desired form.
Library setup and initialization
Start by installing, importing, and initializing BigQuery DataFrames.
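A minimal setup sketch, assuming a notebook environment; the project ID and location are placeholders:

```python
# Install once in your environment:
#   pip install bigframes

import bigframes.pandas as bpd

# Point BigQuery DataFrames at your project and location (placeholders).
bpd.options.bigquery.project = "your-project-id"
bpd.options.bigquery.location = "US"
```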
Generate synthetic data from a user-specified schema
The first scenario starts from a high-level schema provided by the user.
Consider generating demographic data with name, age, and gender, using gender-inclusive Latin American names. The prompt states this aim and adds further information to help the LLM generate the right code:
- Use Faker, a popular Python library for generating fake data, as a foundation.
- Hold the small-scale data in a pandas DataFrame.
- Generate the code with the LLM (a sketch follows this list).
- Note that the generated code constructs only 100 rows of the intended data; it is scaled up afterwards.
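A sketch of sending the prompt to an LLM from within BigQuery DataFrames. It assumes the GeminiTextGenerator model in bigframes.ml.llm; the prompt wording and the output column name are illustrative, and the exact model class may vary with the library version.

```python
import bigframes.pandas as bpd
from bigframes.ml.llm import GeminiTextGenerator

# Natural-language prompt: the schema, the hints (Faker, pandas), and the 100-row target.
prompt = """
Write Python code that uses the faker library to build a pandas DataFrame
named result_df with 100 rows and the columns name, age and gender.
Use gender-inclusive Latin American names. Return only the code.
"""

model = GeminiTextGenerator()  # assumes default connection and model settings

# predict() takes a DataFrame with a "prompt" column and returns the generated text.
df_prompt = bpd.DataFrame({"prompt": [prompt]})
generated = model.predict(df_prompt)
code = generated["ml_generate_text_llm_result"].to_pandas().iloc[0]
print(code)
```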
Run code
The preceding stage gave the LLM all the guidance it needed and described the desired dataset structure. In this stage, the generated code is verified and executed. This step is crucial because it keeps a human in the loop and validates the output.
Local code verification with a tiny sample
- The code produced in the previous stage looks fine, so run it locally on a small sample first (see the sketch after this list).
- If the generated code had not run, or the data distribution needed fine-tuning, you would go back to the prompt, update it, and repeat the steps.
- That follow-up prompt could include the generated code together with the issue to fix.
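A sketch of the local verification, assuming the generated code defines a pandas DataFrame named result_df (a name carried over from the prompt above):

```python
# Execute the LLM-generated code in an isolated namespace and inspect the result.
execution_context = {}
exec(code, execution_context)

result_df = execution_context.get("result_df")
print(result_df.shape)   # expect (100, 3)
print(result_df.head())
```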
Deploy the code as a remote function
The data matches what was intended, so the code can be deployed as a remote function. Remote functions perform scalar transformations, so the function takes an indicator input (an integer in this case) and returns a string output: the generated DataFrame serialized as JSON. External package dependencies, such as faker and pandas, must also be declared.
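A sketch of wrapping the generated logic in a BigQuery DataFrames remote function. The function body is a simplified stand-in for the LLM-generated code, and the decorator signature follows the positional input/output style; newer bigframes releases may prefer Python type hints.

```python
import bigframes.pandas as bpd

@bpd.remote_function([int], str, packages=["faker", "pandas"])
def data_generator(batch_id):
    # Simplified stand-in for the LLM-generated code: build 100 fake rows.
    import pandas as pd
    from faker import Faker

    fake = Faker("es_MX")  # a Latin American locale; an illustrative choice
    rows = [
        {
            "name": fake.name(),
            "age": fake.random_int(18, 90),
            "gender": fake.random_element(["female", "male", "nonbinary"]),
        }
        for _ in range(100)
    ]
    df = pd.DataFrame(rows)
    # Return the whole batch as a JSON string, one serialized record per element.
    return df.to_json(orient="records")
```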
Scale data generation
Suppose we want one million synthetic rows. Since the generated code produces 100 rows per run, we can initialize an indicator DataFrame with 1M/100 = 10K rows and apply the remote function once per indicator row, generating 100 synthetic rows for each.
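A sketch of this scaling step, reusing the data_generator remote function from above; the column names are illustrative:

```python
desired_num_rows = 1_000_000
batch_size = 100                                   # rows produced per remote-function call
num_batches = desired_num_rows // batch_size       # 10,000 indicator rows

# One indicator row per batch; the value itself only triggers a call.
indicator_df = bpd.DataFrame({"batch_id": list(range(num_batches))})

# Each call returns a JSON string holding 100 serialized records.
indicator_df["json_data"] = indicator_df["batch_id"].apply(data_generator)
```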
Flatten JSON
Each item in the indicator DataFrame's json_data column is a JSON-serialized array of 100 records. Use direct SQL to flatten it into one record per row, as sketched below.
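A sketch of the flattening and persistence steps. It uses the DataFrame.sql property to reference the BigQuery DataFrames table from SQL; the JSON field names follow the schema assumed earlier, and the destination table name is a placeholder.

```python
sql = f"""
WITH batched AS ({indicator_df.sql}),
records AS (
  SELECT PARSE_JSON(record) AS record
  FROM batched,
       UNNEST(JSON_EXTRACT_ARRAY(json_data)) AS record
)
SELECT
  STRING(record.name)   AS name,
  INT64(record.age)     AS age,
  STRING(record.gender) AS gender
FROM records
"""

result_df = bpd.read_gbq(sql)

# Persist the synthetic table back to BigQuery (placeholder dataset/table name).
result_df.to_gbq("your_dataset.synthetic_demographics", if_exists="replace")
```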
The resulting result_df DataFrame contains one million synthetic rows ready to use, or to save to a BigQuery table using the to_gbq method. The run incurs charges for BigQuery, Vertex AI, Cloud Functions, Cloud Run, Cloud Build, and Artifact Registry; see the BigQuery DataFrames pricing documentation for details. In this example, the BigQuery jobs used roughly 276K slot milliseconds and processed about 62 MB.
Creating synthetic data from a table structure
The preceding steps generated synthetic data from a user-specified schema, but you can also generate it for an existing table, for example when creating a copy of a production dataset for development. The goal there is to keep both the schema and the data distribution similar to the original. This requires building the LLM prompt from the table’s column names, types, and descriptions. The prompt could also include data profiling metrics derived from the table’s data, such as (see the sketch after this list):
- The distribution of any numeric columns; DataFrame.describe returns per-column statistics.
- Hints about the format of string or date/time columns; use DataFrame.sample or Series.sample.
- Hints about the unique values of categorical columns; use Series.unique.
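A sketch of collecting these profiling hints from an existing table before folding them into the prompt text. The table and column names are placeholders, and exact return types may differ slightly between bigframes versions.

```python
# Load the existing table (placeholder name) and profile it.
users_df = bpd.read_gbq("your_dataset.usersTable")

numeric_stats = users_df.describe().to_pandas()               # numeric column distributions
name_examples = users_df["userName"].sample(n=5).to_pandas()  # format hints for a string column
gender_values = users_df["gender"].unique()                   # unique categorical values

profiling_hints = (
    f"Numeric column statistics:\n{numeric_stats}\n"
    f"Example userName values: {name_examples.tolist()}\n"
    f"Allowed gender values: {gender_values}"
)
```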
Generating a fact table for an existing dimension table
You can also create a synthetic fact table for an existing dimension table and join it back. For example, if your usersTable has the schema (userId, userName, age, gender), you can construct a transactionsTable with the schema (userId, transactionDate, transactionAmount), where userId is the key relating the two. To accomplish this, take these steps (a sketch follows the list):
- Create the LLM prompt to produce data for the (transactionDate, transactionAmount) schema.
- (Optional) In the prompt, ask for a random number of rows between 0 and 100 per batch instead of a fixed 100, giving the fact data a more natural distribution. You then need to adjust batch_size to 50 (assuming a symmetrical distribution). Because of this randomness, the final row count may differ from desired_num_rows.
- Initialize the indicator DataFrame with the userId values from usersTable rather than a plain numeric range.
- Run the LLM-generated code as a remote function on that indicator DataFrame, just as with the user-specified schema.
- Select userId together with (transactionDate, transactionAmount) in the final result.
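A sketch of the last three steps. It assumes a transactions_generator remote function analogous to data_generator above, and reuses the JSON-flattening pattern; table and column names are illustrative.

```python
# Use userId values from the dimension table as the indicator column.
users_df = bpd.read_gbq("your_dataset.usersTable")
indicator_df = users_df[["userId"]]

# Each call generates a batch of (transactionDate, transactionAmount)
# records for one user, serialized as a JSON array.
indicator_df["json_data"] = indicator_df["userId"].apply(transactions_generator)

# Flatten the JSON and keep userId alongside the generated columns.
sql = f"""
WITH batched AS ({indicator_df.sql}),
records AS (
  SELECT userId, PARSE_JSON(record) AS record
  FROM batched,
       UNNEST(JSON_EXTRACT_ARRAY(json_data)) AS record
)
SELECT
  userId,
  STRING(record.transactionDate)    AS transactionDate,
  FLOAT64(record.transactionAmount) AS transactionAmount
FROM records
"""
transactions_df = bpd.read_gbq(sql)
```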
Conclusions and resources
This example used BigQuery DataFrames to generate synthetic data, an increasingly important capability in today’s AI landscape. Given data privacy concerns and the need for large datasets, synthetic data is a good alternative for training machine learning models and testing systems. BigQuery DataFrames integrates easily with your data warehouse, Vertex AI, and the advanced Gemini model, letting you generate data inside your data warehouse without third-party solutions or data movement.
Google Cloud demonstrated synthetic data generation with BigQuery DataFrames and LLMs step by step. The process involves:
- Code generation: defining the data schema and using natural language prompts to have the LLM generate code.
- Code execution: scaling the code as a remote function to generate large volumes of synthetic data.
- Get the full Colab Enterprise notebook source code here.
Google also showed three ways to apply the technique, demonstrating its versatility:
- Generate data from a user-specified schema: ideal when real data is expensive to produce or tightly governed.
- Generate data from an existing table schema: useful for production-like development datasets.
- Generate a fact table for an existing dimension table: enables creation of entity-linked synthetic transactional data.
Together, BigQuery DataFrames and LLMs make it easy to generate synthetic data, easing data privacy concerns and accelerating AI development.