NL2SQL Using BigQuery And Gemini: A Comprehensive Guide

 Natural language conversion to SQL

Gemini and BigQuery are being used for NL2SQL (natural language to SQL).

Natural Language Processing (NLP) and the traditional Structured Query Language (SQL) were combined to develop the fascinating new technology known as Natural Language to SQL, or NL2SQL. It transforms questions in standard human language into SQL queries with structure.

Not surprisingly, the technology has the power to fundamentally alter the way we interact with data.

Non-technical users, such as marketers, business analysts, and other subject matter experts, may interact with databases, analyze data, and derive insights on their own without the need for specialized SQL knowledge thanks to NL2SQL. By creating complex queries using NL2SQL, even SQL specialists may save time and focus more on strategic research and decision-making.
  • How does it look on the ground? Consider providing immediate access to a chat window where you can post questions and get prompt responses, or
  • "What was the total number of units sold this month?"
"When comparing Q1 and Q2 sales, what are the primary factors influencing the shift in APAC sales?"
To extract data from databases and transform it into business insights, this would have previously needed a specialist. By using NL2SQL, it may democratize analytics by reducing barriers to data access.

However, NL2SQL's widespread adoption is hindered by many issues. In this article, we'll examine Google Cloud NL2SQL options and recommended practices for installation.

Problems with data quality in real-world applications

Let's start by looking at a few of the elements that make NL2SQL implementation challenging.

Although NL2SQL works best in controlled environments with simple queries, real-world production data presents many challenges. Among these challenges are:

Variations in data formatting: The same information may be written in a number of different ways, such as "1000," "1k," or "1000.0" for monetary values, or "Male," "male," or "M" for gender. Furthermore, a lot of organizations have their own poorly defined acronyms.

Ambiguity in meaning: Semantic ambiguity is caused by the fact that large language models (LLMs) usually do not understand domain-specific schemas. When a column name has many meanings, for instance, this may lead to user requests being misunderstood.

Syntactic rigidity: Semantically sound queries may not succeed if they do not adhere to SQL's strict syntax.

Special business metrics: NL2SQL needs to understand table connections via foreign keys and handle complex business calculations. Effective translation of the query requires a deep comprehension of the tables that must be linked and modeled collectively. Furthermore, there isn't a single, accepted method for deciding which business KPIs each company should include in its final narrative report.

Client issues


It's not only the facts that might be ambiguous or badly structured; users' inquiries are also complex or imprecise. The implementation of NL2SQL may be difficult due to these three common problems with user queries.

Questions that are unclear: Even seemingly straightforward questions might be ambiguous. To get the "total number of sold units month to date," for instance, a query would need to specify the date field to utilize, whether to use running_total_unit or average_total_unit, etc. When constructing the SQL query, the ideal NL2SQL solution will proactively prompt the user to choose the appropriate column and use their input.

Underspecified questions: Insufficiently detailed questions are another problem. A user's query about "the return rate of all products under my team in Q4" for instance, lacks sufficient information to indicate which team should fully understand the topic. To get a complete representation of the inquiry, an ideal NL2SQL solution should find any ambiguities in the original input and pose follow-up queries.

Complex questions that need to be analyzed in several steps: Many questions need to be analyzed in several steps. For example, think about identifying the primary reasons why sales fluctuate from quarter to quarter: The research should be broken down into manageable chunks by a competent NL2SQL solution, which should then provide intermediate summaries and a comprehensive final report that addresses the user's query.

Managing the challenges

Google created Gemini Flash 1.5 as a routing agent that can classify requests based on their complexity in order to solve these problems. Once the topic has been categorized, it may improve its findings by using techniques like vector embeddings, ambiguity checks, contribution analysis models, and semantic searches.

It uses Gemini to respond to commands in a JSON format. For example, Gemini may function as a routing agent by answering the following few-shot prompt:

Direct questions

By using draft SQL ambiguity checks, user feedback loops, and in-context learning, the correct column names in scope may be made clear for direct queries. Additionally, it is assured that straightforward queries will produce understandable SQL.

Its approach performs the following for simple questions:

  • Collects high-quality SQL/question combinations.
  • Samples are stored in BigQuery rows.
  • Allows vector embeddings in the question.
  •  Uses BigQuery vector search in conjunction with the user's query to get relevant examples.
  • Provides the LLM context by including the question, example, and table structure.
  • Creates a preliminary SQL
  • Runs a loop that incorporates syntax checking, user input, refinement, and a SQL ambiguity check.
  • Executes the SQL
  • Summarizes the facts in plain language.
 
Based on its heuristic testing, Gemini seems to do well on jobs that check for SQL ambiguity. Google began by drafting a SQL model that included all of the questions pertaining to context and database structure. Gemini was able to get clarity by asking the user follow-up questions as a result.

Analysis of key drivers

Multi-step reasoning-based data analysis, also known as key driver analysis, requires analysts to segregate and arrange data depending on every conceivable combination of features (e.g., product categories, distribution channels, and regions). For this use case, Google recommends integrating BigQuery contribution analysis with Gemini.

The following stages are added to the ones carried out with direct questioning by key driver analysis:
  • When users inquire about it, the routing agent directs them to a dedicated handling page for key driver analysis.
  • The agent uses BigQuery ML vector search to find comparable question/SQL embedding pairs from ground truth contained in a BigQuery vector database.
  • In order to build a contribution analysis report, it then generates and validates the CREATE MODEL statement.
Finally, to receive the contribution analysis report, the following SQL is run:
  • The completed report looks like this:
  • Study of major drivers
  • Google Cloud is credited with the image.
You may further compress the information in natural language using Gemini:

NL2SQL Implementation on Google Cloud

Despite the fact that this may seem challenging, Google Cloud offers a whole suite of tools to help you implement a successful NL2SQL solution. Let's check it out.

BigQuery vector search is used for both retrieval and embedding


Better SQL creation may be achieved by rapidly locating instances and context that are semantically significant via the use of BigQuery for embedding storage and retrieval. Embedings may be made using BigQuery's ML.GENERATE_EMBEDDING function or Vertex AI's text embedding API. When BigQuery is utilized as a vector database, its built-in vector search makes it easy to match user queries with SQL pairings.

BigQuery contribution analysis

Contribution analysis modeling may discover regions of a dataset that are causing unexpected changes by identifying statistically significant differences throughout the dataset, including test and control data. A region is a subset of the data based on a combination of dimension values.

BigQuery ML's freshly released contribution analysis preview facilitates automated explanations and insight building of multi-dimensional data at scale to help answer "why?" inquiries."What happened?" and "What's changed?" in relation to your data.

In conclusion, BigQuery's contribution analysis models increase overall efficiency by making it easier to generate a large number of queries utilizing NL2SQL.

Checks for ambiguity using Gemini

NL2SQL is an often unidirectional process that converts natural language queries into structured SQL queries. To improve performance, Gemini may help reduce ambiguity and improve the output statements.

By asking clarifying questions, you may use Gemini 1.5 Flash to get user feedback when a question, table, or column structure is unclear. This will assist you in enhancing and perfecting the generated SQL query. The development of SQL queries and results summaries in natural language may also be accelerated using Gemini and in-context learning.

The recommended architectural layout

The best NL2SQL methods

Consider the following tips to help you with your own NL2SQL attempt. Determine which questions need to be addressed first: Answering a question may seem simple, depending on the purpose of the final report, but it sometimes takes many stages of reasoning to reach the desired response and plot. Prior to your experiment, collect your query, SQL, and the predicted natural language ground truth.


Using LLMs does not take the role of data preparation and purification, which are crucial. Create new table views as necessary, and ensure that business domain acronyms are replaced with helpful explanations or metadata. Start with simple questions that just need one table before moving on to more intricate join-required problems.

Google's heuristic experiment shows that iteration with feedback is more successful after creating an initial draft of your SQL. Practice iteration and SQL refinement with user input.

Use a custom flow for queries that include more than one step: BigQuery contribution analysis models may provide automated insight generation and multi-dimensional data explanations.

What comes next?

Combining NL2SQL, LLMs, and data analysis techniques is a significant step in making data more accessible and helpful for everyone. Allowing users to interact with databases in natural language helps democratize data access and analysis, allowing more individuals in any organization to make better decisions.

Exciting new developments like BigQuery contribution analysis and Gemini have made it easier than ever to rationalize data, quantity, and value.

Post a Comment

0 Comments