Generative AI is here, and yes, it will have implications for data modeling as well! As with any new technology, it is often tempting to wield it like a hammer - you keep trying to find nails to hit. Instead, we should try to figure out which nails we already have that could use such a hammer.
In this blog post, we will look into a couple of possible use cases for Large Language Models (LLMs) in the context of Data Modeling. Data Modeling is the basic methodology for designing better data products, and some interesting prospects for LLMs are starting to emerge in this area.
Let’s start by taking a step back. What is a Data Model? For us at Ellie, a Data Model is first and foremost a “description of a business in terms of the things it needs to know about” - as so eloquently described by Alec Sharp. This is the Conceptual level of Data Modeling, where you should always start.
A Data Model, therefore, is a structure that we use to describe something that is going on in “real life”. A customer purchases products from a store - this is what happens in the business, and as easy as it is to write that “narrative” down, it is just as easy to draw a Data Model describing the exact same thing. In effect, with data modeling, we are extracting this structure from a slice of reality.
Now, at this level of abstraction and speaking quite loosely, that is not too far from what LLMs do. They are models of language, created by extracting (extremely complex) structures from massive amounts of textual information. An LLM can extract structures from a vast slice of reality.
A Data Model is created when a modeler figures out a slice of reality and extracts structures from it, right? So the obvious question now is, why couldn’t an LLM do that?
The most obvious LLM use case in this context is, of course, using LLMs to create Data Models. The basic process of Data Modeling is quite well suited for such work:
LLMs are good at processing natural language. We can reasonably expect an LLM to perform quite well on tasks 3 & 4 above - picking out the entities and relationships - given a description of a “slice of reality” from task 2. This is simply about identifying patterns in language. Of course, there are some twists - more on that later.
We could probably expect an LLM to perform quite well on task 2, too - figuring out what is going on - but here is the first and most obvious risk for this use case. A Data Model should always be relevant to your business. Generic industry-standard models can often act as a good starting point, but a highly generic Data Model tends to be quite useless. What if even the description of reality that the model is based on is highly generic?
LLMs will “hallucinate” stuff. This is in their nature; they are, after all, “stochastic parrots”, simply mimicking language that would be statistically probable. We could always ask an LLM chatbot to describe a “grocery store business” and create a data model for it, but what it produces would be extremely generic and bear only a passing resemblance to whatever is really going on in our business.
That’s why the right starting point for LLM use in Data Modeling is interpreting an existing and relevant description of our “slice of reality” and turning that into Data Model components (entities and relationships). If our business is “a person walks into a store and buys a product”, then what are the entities & relationships we should add to our Data Model?
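To make this concrete, here is a minimal sketch of what such an extraction prompt could look like. We assume the OpenAI Python client here purely for illustration; the model name, the prompt wording, and the JSON output shape are our own choices for the example, not a prescription.

```python
# A minimal sketch: hand the LLM a concrete business narrative and ask it
# to propose entities and relationships. Assumes the OpenAI Python client
# and an OPENAI_API_KEY in the environment; prompt wording is illustrative.
import json
from openai import OpenAI

client = OpenAI()

narrative = "A person walks into a store and buys a product."

prompt = (
    "You are helping build a conceptual data model. From the business "
    "description below, list the entities (the things the business needs "
    "to know about) and the relationships between them. Respond as JSON "
    'with the keys "entities" and "relationships".\n\n'
    f"Business description: {narrative}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # any capable chat model would do
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
)

suggestion = json.loads(response.choices[0].message.content)
print(suggestion)
# e.g. {"entities": ["Person", "Store", "Product", "Purchase"],
#       "relationships": ["Person makes a Purchase at a Store", ...]}
```

The interesting part is not the API call but the input: the LLM is only asked to do the extraction step, starting from a description that is actually relevant to our business.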
Naturally, in real life, that description of the business can often be very complicated and messy. The skill of a data modeler is often truly measured by how well they can identify entities and relationships from a sprawling group discussion!
In the experiments we’ve been running at Ellie, the results of this kind of work have been overall positive but varying. An LLM chatbot can understand tasks 3 & 4 above and perform them with some degree of success, but it will also produce nonsense at times. Given that a Data Model usually lives or dies based on how well its entities have been selected, this is not a place where we would like to see nonsense!
The results have been promising enough, however, that we are continuing experiments with different LLM solutions, different kinds of prompts, and different kinds of feedback loops to improve the outcome. We will be writing more about this use case in the future, so keep watching this space!
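To give a flavour of what we mean by a feedback loop - purely as an illustration, not a description of our actual experiments - here is one very simple shape such a loop could take: ask for entities, run a basic sanity check on the answer, and feed any complaints back into the next prompt. The generic-word check and the helper function are made up for the example.

```python
# An illustrative feedback loop: request entities, reject overly generic
# ones, and re-prompt with that feedback. Assumes the OpenAI Python client.
import json
from openai import OpenAI

client = OpenAI()

# A stand-in for real validation: words we never want as entities.
TOO_GENERIC = {"thing", "item", "data", "entity", "object"}


def suggest_entities(narrative: str, max_rounds: int = 3) -> list[str]:
    """Ask for entities, complain about generic ones, and try again."""
    feedback = ""
    entities: list[str] = []
    for _ in range(max_rounds):
        prompt = (
            "From the business description below, list the entities for a "
            'conceptual data model. Respond as JSON: {"entities": [...]}.\n'
            f"Business description: {narrative}\n{feedback}"
        )
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        entities = json.loads(response.choices[0].message.content)["entities"]
        rejected = [e for e in entities if e.lower() in TOO_GENERIC]
        if not rejected:
            break
        feedback = (
            "Your previous answer included overly generic entities "
            f"({', '.join(rejected)}). Please be more specific."
        )
    return entities


print(suggest_entities("A person walks into a store and buys a product."))
```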
While the previous use case focused on producing Data Models with the help of LLMs, interesting prospects are emerging from doing pretty much the opposite - using Data Models to help LLMs perform better.
An LLM has been trained on very messy data. It is a “stochastic parrot”, so it will produce results that kind of work on average. What if we were able to “tell” the LLM what our reality looks like, and guide it to give results that are more relevant to us?
A Data Model is a structure that defines our slice of reality. Giving an LLM such a structure as additional input might help it to formulate something that more accurately matches our particular needs in our particular slice of reality.
In fact, such structures are already being utilized to help guide LLMs to better results in a given domain. For example, here is a scholarly article reviewing the uses of Knowledge Graphs in combination with LLMs. A Knowledge Graph can of course be far larger and deeper than your usual Data Model, but the idea is practically the same (and in fact, a Conceptual Data Model is a kind of Knowledge Graph, but let’s not go down that rabbit hole here).
Without going too deep into the technicalities and science-y parts of Knowledge Graphs, Data Models, and LLMs, it is fairly straightforward to imagine how this use case might work in simple practical terms. If we were simply to append our prompts with a description of what our Data Model looks like - what our entities and relationships are - we might help the LLM not only use the right terminology, but also produce results that make more sense logically. The relationships between the entities in the data model are like rules; they define how things should interact.
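As a rough sketch of that idea: write the data model down as plain text and put it in front of the question. The entities, relationships, and instruction wording below are invented for illustration, and we again assume the OpenAI Python client, though any chat-capable LLM would do.

```python
# A sketch of "append the data model to the prompt": the model description
# acts as context and the relationships act as rules for the answer.
from openai import OpenAI

client = OpenAI()

# Our conceptual data model, written down as plain text the LLM can read.
# The entities and relationships here are invented for the example.
data_model_context = """\
Our conceptual data model:
Entities: Customer, Order, Product, Store
Relationships:
- Customer places Order
- Order contains Product
- Order is fulfilled by Store
"""

question = "Draft a business glossary definition for 'Order'."

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": (
                "Use only the terminology and relationships defined in the "
                "data model provided by the user."
            ),
        },
        {"role": "user", "content": data_model_context + "\n" + question},
    ],
)

print(response.choices[0].message.content)
```

Nothing sophisticated is happening here - the data model is simply extra context - but it nudges the LLM towards our terminology and our rules about how things interact.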
There are, of course, very sophisticated technical methods for adding such “domain knowledge” to an LLM, and people are talking about “Small Language Models” that would have carefully curated, domain-specific input data in order to produce more relevant results. But the basic idea of helping an LLM produce better output by giving it some structure is straightforward, and at its simplest level, something we see could quite easily be done even when interacting with e.g. ChatGPT.
It remains to be seen how much Generative AI development will turn towards domain-specific solutions. One could say it’s more or less inevitable, given how the generic outputs of a “stochastic parrot” will lose their value in the flood of AI-generated content once the novelty wears off.
In short, no.
There will always be a need to extract structure from messy descriptions of business reality in order to design data products and other systems. This activity is not going to go away.
It is also highly unlikely that such design activity could ever be completely automated and handed over to an AI wholesale. In the bigger picture, Data Modeling is about understanding, scoping, making tradeoffs, taking different stakeholders and technical needs into account… It’s not a simple input-output task, and it requires (more than anything!) human interaction.
The way we do Data Modeling might evolve, however. LLMs are tools like any other. If there are benefits to be gained by utilizing an LLM at some point in the process, then let’s do that!
We will continue looking into LLM capabilities here at Ellie. Right now, we are at a highly experimental stage, so we can’t say much - but rest assured that eventually, you’ll hear more!