It sounds counterintuitive to use a technology that has trust issues to create more trustworthy data. But smart engineers can use generative AI to improve the quality of their data, creating more accurate and trustworthy AI-powered applications.
Generative AI models excel at answering questions in human-like sentences, but they are prone to hallucinations and cannot gain insights from internal company data that was not part of their training. However, this internal data is critical for many enterprise use cases.
Imagine an AI chatbot that tells employees how many days of paid vacation they have left, or one that tells airline passengers whether they are eligible for a seat upgrade. These use cases require precise answers, and machine learning engineers need access to accurate, up-to-date data to maximize the value of generative AI in the enterprise.
Data governance can play a key role here, helping to manage the operational and reputational risks that can result from improper AI decisions. In particular, by applying metadata that describes the structure and provenance of data and how it is intended to be used, data teams can ensure data quality and improve the accuracy of generative AI-powered applications. This goes beyond the business realm and extends to emerging compliance frameworks that require policies to ensure data integrity, security and accountability.
However, creating this metadata is a time-consuming task for data producers, meaning that busy data teams often cut corners or don’t create it at all. As an analogy, you may recall that Tim Berners-Lee once called for the creation of a “semantic web,” where web content would be much more useful because it would be described in a machine-readable form. This required websites to tag their content manually, and that tagging largely never happened. The governance problem data teams face today is not dissimilar.
But while generative AI is driving the need for stronger data governance, it can also help meet that need. By presenting a generative AI model with examples of how data should be labeled, generative AI can automatically create the necessary metadata. While a human will still need to review the results, the process will be much less laborious than creating metadata from scratch.
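The few-shot approach described above can be sketched as prompt assembly: pair sample records with the metadata a human would have written, then ask the model to continue the pattern for a new record. This is a minimal sketch; the field names, example records, and prompt wording are all illustrative, and the actual call to an LLM is left out.

```python
import json

def build_metadata_prompt(labeled_examples, new_sample):
    """Assemble a few-shot prompt from (record, metadata) example pairs."""
    parts = [
        "You are a data governance assistant. Given a sample record, "
        "produce metadata with a field description and a sensitivity tag."
    ]
    for record, metadata in labeled_examples:
        parts.append(f"Record: {json.dumps(record)}")
        parts.append(f"Metadata: {json.dumps(metadata)}")
    # End with the unlabeled record so the model completes the metadata.
    parts.append(f"Record: {json.dumps(new_sample)}")
    parts.append("Metadata:")
    return "\n".join(parts)

examples = [
    ({"email": "a@example.com"},
     {"email": {"description": "Customer email", "sensitivity": "PII"}}),
]
prompt = build_metadata_prompt(examples, {"zip": "10001"})
```

The returned prompt would be sent to whichever model the team uses; the human reviewer then checks the generated metadata rather than writing it from scratch.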
Start with a data product mindset
The need for high-quality data isn’t unique to generative AI. As data becomes more important for all types of analytics, there has also been a growing interest in building unified data catalogs that make it easier for other teams to find and use data. By using generative AI to create metadata and a data streaming platform to create reusable data products, data becomes much more accessible, driving innovation and productivity.
This metadata includes machine-readable information, such as a data schema and field descriptions, as well as human-readable information, such as who created the data and how it is intended to be used. The key is to provide enough information so that someone else in the organization who wants to use a data asset knows where it came from, how it can be used, whether there is an associated service level agreement (SLA), and how trustworthy it is.
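A metadata record combining those machine-readable and human-readable elements might look like the following sketch. The class and field names are hypothetical, chosen only to mirror the elements listed above (schema, field descriptions, creator, intended use, SLA).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataProductMetadata:
    name: str
    schema: dict                # machine-readable: field name -> type
    field_descriptions: dict    # human-readable notes per field
    owner: str                  # who created the data
    intended_use: str           # how the data is meant to be used
    sla: Optional[str] = None   # associated service level agreement, if any

orders = DataProductMetadata(
    name="orders",
    schema={"order_id": "string", "amount": "decimal"},
    field_descriptions={"order_id": "Unique order key",
                        "amount": "Order total in USD"},
    owner="payments-team",
    intended_use="Revenue reporting and fraud analytics",
    sla="Updated within 5 minutes of order creation",
)
```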
The fundamental element of data management is the schema: metadata that describes the structure of the data. If we present a generative AI model with enough examples of the data itself, or of the code that produces it, the model can infer the schema.
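To make the induction step concrete, here is a simplified, deterministic stand-in for what the model does: scan sample records and map each field to an observed type, flagging fields with conflicting types for human review. The records and type names are illustrative.

```python
def infer_schema(records):
    """Map each field to a type name based on the values observed."""
    observed = {}
    for record in records:
        for key, value in record.items():
            observed.setdefault(key, set()).add(type(value).__name__)
    # A field seen with more than one type is flagged for human review.
    return {k: (v.pop() if len(v) == 1 else "mixed")
            for k, v in observed.items()}

samples = [
    {"user_id": 42, "state": "NY"},
    {"user_id": 43, "state": "CA", "upgraded": True},
]
print(infer_schema(samples))
# {'user_id': 'int', 'state': 'str', 'upgraded': 'bool'}
```

An LLM goes further than this sketch, of course, adding descriptions and recognizing semantic types such as addresses, but the shape of the task is the same: examples in, candidate schema out.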
This process works best when the metadata is created at data production. We can run a generative AI program retrospectively over older datasets to produce metadata, but the results may be less reliable because the original schema has evolved over time. Metadata created at production time tends to describe the underlying dataset more accurately.
Keep people informed
Human review is necessary due to the limitations of the current state of AI. While a model is good at recognizing patterns, it may fail to generalize an entire schema from the limited number of examples it has seen. AI has not yet replicated the intuition and domain knowledge of human experts, which complements the sheer volume of information a model can process quickly. We know that a year has 12 months, that the United States has 50 states, and that street addresses usually include a house number, and this background knowledge lets us spot mislabeled data easily. An AI process may make mistakes because it lacks this basic knowledge or has not seen enough examples. A human can quickly correct those errors before non-compliant data reaches downstream engineers, while still saving substantial time and effort compared with creating the metadata by hand.
For this to work well, data producers need to adhere to the data policies set by the organization. Additionally, as a schema evolves, you may need to adapt the model to reflect the new schema. The choice of LLM is important, but less important than the workflows that support data curation and contextualizing the system prompt. For best results, the model needs not only examples of the dataset or production code, but also guidance on the metadata the model should produce.
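Adapting to an evolving schema starts with detecting that it has changed. A minimal drift check, comparing a previously approved schema with a freshly inferred one, might look like this; the schemas and field names are hypothetical.

```python
def schema_drift(old_schema, new_schema):
    """Report fields added, removed, or retyped between schema versions."""
    added = sorted(set(new_schema) - set(old_schema))
    removed = sorted(set(old_schema) - set(new_schema))
    changed = sorted(k for k in set(old_schema) & set(new_schema)
                     if old_schema[k] != new_schema[k])
    return {"added": added, "removed": removed, "changed": changed}

v1 = {"user_id": "int", "state": "str"}
v2 = {"user_id": "str", "state": "str", "upgraded": "bool"}
print(schema_drift(v1, v2))
# {'added': ['upgraded'], 'removed': [], 'changed': ['user_id']}
```

When drift is detected, that is the signal to refresh the examples and guidance in the system prompt so the model reflects the new schema.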
A data streaming platform is the optimal pattern
The Semantic Web never realized its creators’ vision of making the web machine-readable through manual tagging. And yet the web became machine-readable in ways few predicted in the early 2000s, because machine learning became much better at understanding media created for humans. Similarly, better machine learning now offers a better way to handle the routine tasks that data management requires.
Applying generative AI in this way requires a platform to work with. A data streaming platform that can process data generated in real time is well suited for this. Data streaming platforms are designed from the ground up to present data in a consumable way, making them an efficient environment for applying metadata at production time and creating data products that can be reused in other applications.
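The idea of attaching metadata at production time can be sketched as a producer that pairs every event with its schema and ownership information, and refuses records that don’t match. The in-memory “topic,” the header layout, and the class name below are illustrative, not a real streaming platform API.

```python
import json

class GovernedProducer:
    """Sketch of a producer that attaches governance metadata per event."""

    def __init__(self, schema, owner):
        self.schema = schema
        self.owner = owner
        self.topic = []  # stand-in for a real streaming topic

    def produce(self, record):
        # Reject records containing fields not declared in the schema.
        unknown = set(record) - set(self.schema)
        if unknown:
            raise ValueError(f"fields not in schema: {sorted(unknown)}")
        self.topic.append({
            "headers": {"schema": self.schema, "owner": self.owner},
            "value": json.dumps(record),
        })

producer = GovernedProducer({"order_id": "string", "amount": "decimal"},
                            owner="payments-team")
producer.produce({"order_id": "A-1001", "amount": "19.99"})
```

In a real deployment, the schema would typically live in a schema registry rather than in message headers, but the principle is the same: metadata travels with the data from the moment it is produced.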
A data streaming platform can also help you ensure that governance controls and metadata are integrated into a common data catalog so they can be discovered and reused.
The rapid development of generative AI has created an urgent need for high-quality data and data management, but it has also provided a solution. Over time, generative AI may be able to take on additional management tasks, such as applying data policies, but it is not yet ready for this.
Nevertheless, generative AI can help eliminate much of the routine work of defining and applying schemas and other important data properties, creating a virtuous cycle that improves the quality of generative AI-based applications and makes data available for reuse on a much larger scale.
Industry and academia are beginning to define what AI governance should look like, but the concept is still evolving. In practice, there’s no single definition of what AI governance entails, let alone anything resembling a framework. But we can say with certainty that AI governance depends on data governance, which gives engineers data they can trust when building generative AI applications.
In the future, I would like to see the industry more clearly define what AI governance should look like and data infrastructure providers place a greater emphasis on integrating generative AI into tools and abstractions that promote better data quality.