Video: Office Hours: Building AI‑Ready Data Products | Duration: 1856s | Summary: Office Hours: Building AI‑Ready Data Products | Chapters: Welcome and Introduction (14.415s), Data Products Overview (163.385s), Catalog Documentation Demo (331.165s), Data Product Promotion (595.855s), AI Agent Demo (818.585s), AI Query Results (1052.38s), Q&A Session (1372.17s), Closing Remarks (1700.455s)
Transcript for "Office Hours: Building AI‑Ready Data Products": Hey, everybody. It looks like the top of the hour, as we say here in the US. I know there are a few places around the globe where the top of the hour is actually the bottom of the hour, but hopefully you know what I mean. Alright, so it's time to get started. What we're gonna do today is another edition of a little webinar series we've been doing called office hours. As it says there: Starburst office hours. The format of office hours is all about giving ourselves, you know, thirty minutes tops, but let's say fifteen to twenty, to present and demonstrate something, and then turn the tables and be a real office hours where I'm here to answer questions. So you can absolutely ask questions about what we're about to see here, 100%. But in all fairness, think of this as an ask-me-anything within reason, within the problem scope of data: data engineering, data analytics, data science, AI, those kinds of things. And I'll do my darnedest to give you the correct answer. If I don't know the answer perfectly, I'll do my darnedest to give you a good answer but qualify it. And if I just need to take some time to figure out how we're gonna solve that, I will absolutely tackle that as well. So Quincy is backing me up behind stage. She probably put a message out there. Feel free to chat, maybe shout out where you're from. I'm from Atlanta, Georgia, right here in the good old US of A. So, the South. Hey, Spain. Yeah, southeast of the US, absolutely, but on the New York time zone. And all that time, I left on the screen a little connection-before-content in case anyone does wanna reach out and connect with me. But I would say the most important thing on here is at the bottom right, where it says devrel@starburst.io.
Think of that as your conduit to people like myself here at Starburst to help you out with Starburst, Trino, all kinds of interesting cool stuff, that we can. So I went back, and then the topic for today is AI ready data products. AI ready data products. So a little bunch of words in there. Started with AI, so you you showed up. Whoo. Thank you. But, really, today is much is as much about data products as as it is about AI. These data products are something that we've been building for a number, a number of years, longer than I've been here, which is over four years. And we did it back in the days of things like data mesh and whatnot. But, really, what data products do is they give you a place wait wait for it. They give you a place to wrap your descriptions, your glossaries, your your rules, your, you know, quirks, you know, when you have a weird column that has weird values and that kind of stuff. It gives you a place to document that. Now historically before, Gen AI, this is our tool, this is our approach to really say, hey. Let's take your curated datasets, your ones that you really want wide use of them, and you don't wanna be called every day or tracked all day long. They say, what's this mean? What does that mean? How do I query that? So I'm gonna show you some examples of how that data product information works, but I'm going to tell you a secret right now. The data product is a wrapper on top of a schema full of, guess what, datasets, data objects, so tables, views, materialize views, all that kind of good stuff. And, how we really interact with the meta system level as a human, as a computer is like everything else. Starburst, Trino, but it's always has been. So context is a term you probably see a lot of on LinkedIn and hear a lot about the halls. Maybe concepts like semantic layers is another phrase out there. I'm putting some other quick videos together trying to explain some of those terms. 
But I'm gonna say data products are something we have had for, like I said, for a long time, over five years, and they're a great place, the best place in our space, to put that business context. So let's go look at that and see that. Before I move on, I'm gonna get into this stuff right now because as the promise said, we wanna do some, live demonstrations and things. So I just wanna say, yep. We'll go see this. But what's kinda cool about this, not only is there, you know, a UI that I'm gonna show you, what's really new to us, asked for for a long time, internally and externally. But what's available now, I believe it's still in public preview, but it means it's out there. You can use it, is this data products as code. So building a big YAML file describing your data product, and then just coming to us with that and say, hey. Load this up. And that's great for your CICD environment. It's great for reproducing somewhere else. It might be great because you have a third party, another system that does all your data catalog experience and glossary, and you just wanna make sure we can take advantage of that. Maybe that's all you want us to be, a place to replicate that information. And this whole domain is some is a space, not just us, but the whole world that's kinda focused on helping you with your semantic layers. So let's get on into it. Now it's fair to say a couple things. One, Quincy has done a great job with this tool we have here called Goldcast, and she is set up where I believe you can raise your hand and you can come on screen and ask away and that kind of stuff, and I encourage you to do that at any time. But I do know that often it feels a little awkward. You don't know who's out there. You're afraid maybe to do that, don't hesitate to put it in chat, and I will, be peeking over there. When I look over there, it's chat. But at the end of my demo, I'll definitely stop and, wait around a little bit and see if we can get some questions going. 
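As a rough illustration of the "data products as code" idea mentioned above, here is a minimal Python sketch of what a declarative data product definition could look like before it is handed to a CI/CD job. The session does not show the actual YAML format, so every field name here (`catalog`, `schema`, `tags`, `links`, `datasets`, `columns`) is a hypothetical stand-in, the identifiers `mycloud`, `snow_backup`, `zone_lookup`, `pokedex`, `height`, and `catch_rate` are guessed spellings of names heard in the demo, and JSON is used instead of YAML only so the example needs nothing beyond the standard library.

```python
import json
from dataclasses import dataclass, field, asdict

# NOTE: this structure is hypothetical. The real definition format used by
# the product is not shown in the session; only the general idea is.

@dataclass
class Column:
    name: str
    description: str = ""

@dataclass
class Dataset:
    name: str
    description: str = ""
    columns: list[Column] = field(default_factory=list)

@dataclass
class DataProduct:
    name: str
    catalog: str
    schema: str
    description: str = ""
    tags: list[str] = field(default_factory=list)
    links: dict[str, str] = field(default_factory=dict)
    datasets: list[Dataset] = field(default_factory=list)

def undocumented_columns(product: DataProduct) -> list[str]:
    """Return 'dataset.column' for every column that has no description."""
    return [
        f"{ds.name}.{col.name}"
        for ds in product.datasets
        for col in ds.columns
        if not col.description.strip()
    ]

product = DataProduct(
    name="webinar_data_product",
    catalog="mycloud",
    schema="snow_backup",
    description="Schema of tables that are from Snowflake.",
    tags=["snow", "anime"],
    datasets=[
        Dataset(name="zone_lookup", description="Taxi zone lookup."),
        Dataset(
            name="pokedex",
            description="Pokemon definitions.",
            columns=[
                Column("height", "How tall this Pokemon is."),
                Column("catch_rate"),  # intentionally left undocumented
            ],
        ),
    ],
)

# Serialize the definition so it can be versioned and reviewed like code.
definition = json.dumps(asdict(product), indent=2)
print(definition)
print("Columns still missing a description:", undocumented_columns(product))
```

Keeping the definition in a file like this means a pipeline can reject a change that leaves columns undocumented before it is ever loaded, which is one plausible reason to prefer the as-code route over editing in the UI.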
Remember, questions are okay about this and about anything. Go crazy. Have fun. Stump me if that's your thing, and all that kinda good stuff. Alright. So my notes say, let's talk about the slides. We've done that. Let's talk about what it takes to build a data product in our space. And I think the font is at, yeah, I'm at 110. The bigger I get, the easier it is to see. But to be honest, the more likely it's gonna look a little messed up in a minute here when I start using the UI, because it's gonna be responsive and do its best and all that kind of good stuff. So that said, I got a cluster in this Starburst Galaxy environment called free cluster. I got a catalog called My Cloud. The indication suggests it's an Amazon S3 bucket. It sure is. Got a whole bunch of schemas. I'm looking at this schema right here called snow backup. Snow backup. What is it? Well, it's a backup of some Snowflake tables that I use in some other demos. I just had it represent them, and the two tables are the zone lookup and this Pokedex. Two completely different things. One's about taxi rides down here, and the other one's about, I don't know, Pokemon and the definition of all the Pokemon types and all that kinda good stuff. So I wanna make the point: what if that Snowflake backup was a schema that we have, that the world can access, and it has probably more than two objects in there. We put a lot of curation around it. We were like, this is something I wanna share. Obviously, this is not that. It's just a simple example to show you the tech. But what I can do is go into our catalog view of that particular schema. So I'm gonna find it. My Cloud is that, and then I think I called it snow. Yep. There is snow backup. So this is just a, you know, a visual tool, and I can go in here and update some details about this. And this would be great to do for MCP servers and all that kind of stuff.
Just information. You know? So I can say description is, you know, schema of, tables that, are from Snowflake. You know? Again, it's not Snowflake. Just, you know, special then. It's just the technology I'm doing. And, you know, I could add some high level tags like this is, snow it's got we know it's anime because there's some, you know, what you call it fun stuff. I don't have any other really cool tags that represent, the I wanna say city because it's, what you call it, based on, you know, based on, New York City and that kind of stuff. And I can click here to add links and all that kinda good stuff. I'll add one of a bad one. Pokemon and they should be asked, and I'm gonna get it wrong, pokemon.com. Sure. Let's assume that was it. Alright. So I'm gonna save that. So all I did is just, you know, went into my normal catalog and start enhancing it. I can go into each one of these, tables. As you see, they're already inheriting those tags that I just saw. You come in here and, talk about their catch rates or heights. I could just do silly stuff. I'll add a description. This one's height of how how tall this Pokemon is. Right? Nothing too fancy. And and have a little fun with this and fill lots and lots of good data. Again, it fits better with a bigger screen, but I wanna kinda get it get it going here. At some point, again, I'm the data owner. I'm the data producer. I'm the person that knows this, the business person. I could document this as far as I can. As far as I can go is what you saw there. A little detail about the tay schema, detail about the table. We talked about columns, tags, that kinda good stuff. And there's lots of other tools we have here. I won't go into today. But one way several ways to do this, but one way to promote this to a make this schema dataset a data product is just to go here. Look at the schema and promote it. We'll say bring forward all that good stuff, descriptions and links and context. My what is this? Webinar data product. 
That's the name of our data product. It brought forward, you know, what it saw there. We can put a lot more. We can even put a bunch of markdown here to describe this stuff. Code code, goes here, whatever, you know, just to verify that my markup doesn't look long. Yep. Code goes here. And this could be a little or a lot. You know, you see there's a pretty good lim jet 15 k, that kind of stuff. Also, you know, anytime someone's looking at a data product, we wanna let them see what's going on. So we say, hey, Kenna, where were they where were they peek into and do some previews, add or change those links, change with the contacts, that kind of stuff, and we'll promote that thing up. Now that showed up on our right here under data data products, it showed up as one of our new data products right there, webinar data product. And we can go in here and edit this thing, manipulate. We'll get a lot of statistics about who, when, where, why use this stuff. We'll see those various datasets. We can use our enriched with AI. Purposely, not doing that just for timing, but it would actually go and try to fill in good examples, and we can edit them or approve them and that kinda good stuff. And if we had things like use as examples, we could do that. Now it would take me a little time to build that out. Now if I did that as a service, of course, I can just import it, but I wanna kinda show you what, it looks like from the UI's perspective. So I might just go into a previously created one of these. Hey. There we go. I see some questions already. Thanks. Man, I'm gonna take a look at that one in a minute here. Here's one I I already set up. I called it Air and Space. And as it suggests, it has NASA data space that that in the US Federal Aviation. So there's details there. You know, there's still not a lot of text here, but links out, all that kind of good stuff. 
In those additional links I said, like I said, if you have your own internal doc system you wanna point to, absolutely, you wanna point to known issues. There's those datasets that we talked about. And the datasets here, good news, they already have quite a bit of information filled in about them, describing them because this is a weird table, and it needs some help, including, like, what the whole thing is. It's called an astronaut's table, but it's so it sounds like John Glenn would be in there. No. John Glenn's in there two times because John Glenn went to space twice. So it's more of a fact table about events, missions, individual persons going to space. That's what this table's a lot about. And that's great. That additional context is gonna help an AI agent do some useful stuff. Just for fun, there's another one, Mecca, if anyone knows a little bit about Robotech. I love to play, with Robotech. So this is a mix match hodgepodge, but I wanna show what what it might look like with some more details. And then my favorite screen is this beautiful page here called uses examples, where someone can come in here and say things like, hey. I'm gonna define something called the mission time percentage across whatever, and this is, you know, astronauts, how much time they spend in space. Again, highly documented explaining what's happened. And this is for humans, but you're gonna see well, you probably won't see my demo today, but all this information that we package up in our, data products is information our AI tool and ultimately via MCP server as well, your own AI tool if you wanna use our MCP engine, then leverage this stuff as well, not just table column names, column types, schema name, table name, that kind of stuff, which is kinda where the kind of generation, is today. So let's do, let's show an example of one that's done well, done bad. Let me go back to data products. I look at the time. Quite about fifteen after, something to be alright. 
I went ahead and, and it looks like my query engine has spun down. Let me see if it's awake here. Let me run that again. I think I talked too long. Yep. While that's spinning up, I'll go back to data products. I created one called the without-context data product. Simple schema. I have no details in there. I just imported it like we just did a minute ago. It only has one table. It's called customer. Okay. You can look at this and go, yeah: comment, address, account balance. You know, an AI agent could probably deal decently well with this. No usage examples or anything. So I'm gonna say launch our AI agent. So we have a thing called ADA, the AI data assistant, ADA. And, you know, you can build your own, of course. Ours is targeted to a data product, and then by what kind of role you wanna be in, and of course, under the covers, you can have it run in different places and that kind of stuff. Now I'm gonna go ahead and say, let me just pull this up. I ran these just a bit ago. Rick, you can track lineage, and I'll show you that if I got time as well. So an AI agent, a typical one, ours as well, is using that enhanced metadata. In this case, this one has no enhanced metadata. I just said, hey, give me a high level summary. What I wanted to show you real quick, let me go over there and show you that before I do anything else, is that the underlying table customer has a field over here called priority rating, with values like PP, AP, and so on. In fact, if I did a quick group by, you're gonna see, okay, great, there's a handful of these. It's like seven of these different weird values with, you know, different counts. So if I went back to my data product, and let's assume I'm just asking it, you know, questions like I just said. Data product without, yep. There's 100 ways to get to everything. There's a high level thing that goes straight to this ADA. I was just drilling down there. You know how software works.
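The quick group-by on the priority rating column mentioned above can be sketched as follows. This is an illustration only: SQLite stands in for the actual query engine so the example is runnable anywhere, the column names `custkey` and `priority_rating` are assumed spellings, and the rows are made-up sample data (the real table has about seven distinct codes; only codes named in the session are used here).

```python
import sqlite3

# In-memory stand-in for the demo's customer table. The data is invented
# purely to show the shape of the profiling query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (custkey INTEGER, priority_rating TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "AG"), (2, "AP"), (3, "FR"), (4, "AG"), (5, "PP"), (6, "AG")],
)

# Profile the column: one row per distinct code with how often it appears.
profile = conn.execute(
    """
    SELECT priority_rating, COUNT(*) AS customers
    FROM customer
    GROUP BY priority_rating
    ORDER BY customers DESC, priority_rating
    """
).fetchall()

for code, customers in profile:
    print(f"{code}: {customers}")
```

A profile like this shows how many opaque codes exist and how common each one is, but not what any of them mean, which is exactly the gap the data product documentation is meant to fill.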
So, I said, give me a high level summary of this whole data product, and it kinda said details like where is it and, you know, all we found one table customer and it had this customer priority rating. But it pretty much told me, hey. Look. I don't know what this stuff is. You got some limited metadata. You know, knew a lot of other stuff. Like I said, it can interpret a lot of details about finance and stuff. The segmentations exist, but, again, doesn't know what priority rating is. And, ultimately, it kinda said, why don't you do some more work here? Why don't you do some profiling? When I say you, me as an analyst, but in all fairness, I'm being also the data producer. I'm trying to make it better for my analyst. So I went ahead just to make the point. I said, hey. Based on that priority rating alone, how how might we be viewing customers about favorable, neutral, you know, concerning, that kind of stuff. And it did its best. It based it. It realized I'm asking about those seven priority ratings, but it doesn't know much. A new account, but, really, it just used the information that it had, like balances and that kind of stuff. And that's not wrong, not a wrong way to look at it, but it isn't how this particular weird system was built. And I don't know about you. I've been doing this for three decades, and, inevitably, there are these columns that end up having these smart values and stuff, and these are great example of a data product. We all explain that. So, ultimately, it said, look. You only got about 5% are favorable. Most of your customers are normal. Very few concerning, and it really based it on those averages. That's all it really could do. I played with a little more asset. I think I stopped there. I said, hey. Look. You gotta figure this out. And it pretty much said, hey. You know, it's my recommendation. Go go explain what things like PP mean. Because is it is it past priority? Is it payment problem? All that good stuff. So it didn't know. 
So what did I do? As a data owner in my data product, I'm gonna just dupe the second one just so I don't have to clean it up in on the fly here. But what I end up doing is give it some more detail. Here's that same kind of overview, and I really called out that table, priority rating, and I said, look. Here's what those columns mean. AG, assume good. AP, an angry person. F r, fiscally responsible, gave a description, and then I gave this high level classification, you know, the plus or minus or kind of in the middle. And most most of these are kind of neutral ish in nature. I guess, maybe not most. Actually, only three of the seven. Yeah. But, nonetheless, gave it some details. I also made sure the datasets there's only one dataset. You know, at least it has some details about all the fields and stuff. Most of these are things that AI could probably interpret, but I really called out our guy down there. Hey. It's a multiuse column storing predefined one to two character codes, which each could be considered positive, negative, or or or I should've said positive, negative. Oh, there positive, neutral, and negative. Yes. I could've put all that other information that I put at the highest level here, but in a fairness, I have a lot more prettiness there to do that, and so I went ahead and packaged it in that space. Again, we can use the, the enrichment AI if you really wanted to. But I wanted to show with just that, I ran those same general questions, a few few minutes ago, and the answers I got were a little better. Give me some summaries of this, this data product. Same kind of stuff, blah blah blah. And then it kinda said, wow. There's a key feature out here. It's called this priority rating system. And it described what I told him. Great, good, hard stuff. Actually, you know, rename things. Like, hey. Hard to please customers. I don't think that was the way I described it, but okay. Picky person. Maybe it was exactly. I'm not I can't even remember now. 
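To make the documentation step above concrete, here is a small Python sketch that keeps the code glossary as data and renders it into the kind of column description a data owner might paste into the data product. Only the three codes spelled out in the session are included; the positive, neutral, and negative labels attached to them are assumptions for illustration, and the column name `priority_rating` and all function names are hypothetical.

```python
from typing import NamedTuple, Optional

class CodeMeaning(NamedTuple):
    meaning: str
    classification: str  # "positive", "neutral", or "negative"

# Meanings come from the session; the classifications are assumed here.
PRIORITY_RATING_GLOSSARY = {
    "AG": CodeMeaning("assume good", "neutral"),
    "AP": CodeMeaning("an angry person", "negative"),
    "FR": CodeMeaning("fiscally responsible", "positive"),
}

def classify(code: str) -> Optional[str]:
    """Return the classification for a code, or None if it is undocumented."""
    entry = PRIORITY_RATING_GLOSSARY.get(code.strip().upper())
    return entry.classification if entry else None

def render_column_description() -> str:
    """Build plain-text documentation for the column from the glossary."""
    lines = [
        "priority_rating: multi-use column storing predefined one- to "
        "two-character codes, each considered positive, neutral, or negative."
    ]
    for code, entry in sorted(PRIORITY_RATING_GLOSSARY.items()):
        lines.append(f"- {code}: {entry.meaning} ({entry.classification})")
    return "\n".join(lines)

print(render_column_description())
print(classify("fr"))  # a documented code
print(classify("PP"))  # a code with no documented meaning yet
```

Generating the description from one glossary keeps the human-readable text and any downstream lookups from drifting apart when a code is added or redefined.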
And then went there and looked at lots and lots of good stuff and everything. You know, gave me the mention of priority rate rating system a few times. And, I was like, good. And then he mentioned some other things, like overlaps with other classifications like market segment. That's fair. So I asked the exact same question. Tell me look at the customers. Look at these priority ratings, and tell me what's going on. Give me some feedback. And his feedback actually was, a little different than I expected, like, we're all gonna get from AI. Right? Not all you asked. So you think it's gonna give me so simple of an answer. Sometimes it gives you something to think about. So we went back and really just started talking about those high level classifications. I thought I wanted to know, tell me about those as concepts. So I built some rationale based on the docs I provided. So that says made up, you know, interpreted, generated information, and that was really cool stuff. We broke it down about how many codes and everything. But, ultimately, I kinda said I kinda wanted this. You know, would you like me to query the actual actual distribution of customers and the buckets? And, yeah, that's what I wanted them to do. And I went ahead and I could've just said yes to that, or I just wrote the question more succinctly. Tell me how many customers are good, bad, or problematic, you know, favorable, normal, concerning. Again, using different words than my classification did just to make the point that, obviously, you can kinda figure out words that are similar to understand what's going on. Started out with the same high level stuff, but it gave me what I wanted. You know, where are my breakdowns? Most of my customers are in that 70. If you remember that earlier one gave me, like, oh, there's probably, like, 4% favorable at about 1% concerning. Again, I looked at the finances. Didn't look at anything else, and it was a pretty good off. 
If it was a percent off or something, that'd be cool, but it was pretty, pretty far off. And then it just rattled through and gave me more details and tried to give me some suggestions, like you always want. You know, give me some next steps or suggestions about what I might do next, from querying or enhancing. In fact, I love the last one. Deep dive. Its suggestion was to deep dive into that big, big bucket of normal customers and kinda figure out, is there some additional, a little bit more favorable, a little bit more negative kinda things? And it made me start to think about what's next. So, really, what I wanted to show, and it's 21 after, in that fifteen to twenty minutes, is a demo that said, yeah, bringing business context, putting in some kind of semantic layer, packaging it and pointing an AI agent against it is gonna make those answers better, especially when it's not things that are so simple. Names, phone numbers, sure. AI is gonna do a great job there, it really can. But when you get those interesting datasets, those interesting values that, you know, carry institutional knowledge, then unless you find the person that really has that institutional knowledge, or it's baked into a report or whatever, an AI agent is not gonna figure that out. So we have to do this. I got a whole bunch of thoughts on this whole thing about semantic layers and how it's been around for several decades now. This may be the time, because people seem to want it. We've been building semantic layers, or having the opportunity to, for decades, and I'm getting kinda opinionated. So I'm gonna stop and let you watch some other stuff about this. But this may be the time, if we continue to lean on these AI agents, that we actually do build out and maintain that business context. Alright. I'm gonna go look at some questions. Okay. Couple questions. Ben said, no. Maybe there's some other questions before that. Make sure. Nope. Nope. Nope. Alright.
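The favorable, normal, and concerning breakdown discussed in the demo amounts to mapping each code to a bucket and computing shares. A minimal Python sketch of that arithmetic follows; the code-to-bucket mapping is an assumption (only three of the roughly seven codes are spelled out in the session), the sample list is invented, and nothing here reproduces the actual percentages from the demo.

```python
from collections import Counter

# Assumed mapping for illustration; undocumented codes fall into "unknown".
CODE_TO_BUCKET = {"AG": "neutral", "AP": "negative", "FR": "positive"}

def bucket_shares(codes: list[str]) -> dict[str, float]:
    """Return the percentage of customers in each bucket, rounded to 0.1."""
    counts = Counter(CODE_TO_BUCKET.get(code, "unknown") for code in codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bucket: round(100 * n / total, 1) for bucket, n in counts.items()}

# Invented sample: ten customers, one priority rating code each.
sample = ["AG"] * 7 + ["FR"] * 2 + ["AP"]
shares = bucket_shares(sample)
print(shares)
```

The explicit "unknown" bucket is a deliberate choice: it makes visible how much of the population an agent would be guessing about when a code has no documented meaning.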
Ben said, how does Starburst catalog compare to the Databricks Unity Catalog? I'm using SEP with Unity Catalog because we have the same functionality. So the word catalog means a thousand things. It's one of those overloaded terms, used many, many different ways. When we talk about Unity Catalog, generally, we're talking about something like an Iceberg REST catalog, a Hive Metastore. To me, Unity Catalog is still in the classical kind of abstraction kind of catalog. Hey, I got a table, help me find where things live and where the data lives on the data lake and that kinda good stuff. And what I'm showing here is a data product catalog that's above that. It's even a little bit higher up in that big, big bucket of semantic layers. But, absolutely, we have our own catalog. We can use Unity as you're doing, or use a metastore. Sounds like you know these options. I would say that Starburst catalog, that term means, hey, what's our recommended, or soon to be strongly recommended, option? You know, if you have nothing else, you wanna start from scratch, where would you go? It's our own environment. And I would say it's very similar to what you see in Unity. But Unity has a lot of movement already, and Unity does give us a quick way for multi-product customers, like everyone really is, because I have a feeling you have some Snowflake too, not just Databricks and not just, even, servers. Absolutely. You know? Think of them as synonymous. Again, that's a little different than what I'm showing here, but I think that's your question. Rick: can you track some lineage? Rick, you sure can. I'll just give you a quick glance at that, and we'll move on. I don't have anything super interesting to show you off the top of my head. Let me see here. So I'm gonna go back under catalogs. Do I have anything cool to show that's already set up? Probably not. So I'll just show you something boring, and then we'll go from there.
Let me just glance at my testing aviation. Yeah. Most of these are pretty boring. I was trying to find one of my previously built pipeline setups. So it looks like I got rid of a lot of them. So we are just gonna look at something. Generally speaking, the lineage is gonna be under, oh, let me drill into a table. That one has no tables, so that was a great one to look at. Alright. I'm just gonna grab the one we were looking at. Aviation. Look at astronauts. You're not gonna see a lot of interesting lineage because, guess what, this thing is what it is. So if you'd like to see that, Rick, hang on, I'll try to give you some good pointers and that kind of stuff. But this is our same kind of OpenLineage-style view: where does it go to, where does it come from, double click on each line, and it'll show, you know, what created it, what SQL statement, or whatever. But our lineage by itself is what we see. You know how lineage is. We can only capture what we can see. So if you need end to end from a lot of different other products, maybe you're either exporting this out or putting an agent here. So you got several other options if you need to get, you know, full-on Apache NiFi to Spark to Starburst Enterprise to Snowflake, and you need to see that comprehensive thing. We can't do it by ourselves, but we play our part and ship the same information out, or some agents, you know, dig in deeper and pull it themselves. So the answer is yes. I didn't give you a good view. Minh: is the metadata collected for the table and column descriptions from dbt YAML files? I don't think that is true. I don't know that answer, Minh. I would say we don't have an import of the base dbt YAML files, but if dbt's pipeline is building things and putting those descriptions and all that good stuff in when it creates or alters a table and that kind of stuff, then we would get it that way.
But we don't have, like, a direct connect per se that says go dig in there and all that rich metadata that you might have stored there, bring it here. But as long as it's part of the data definition language, then I'd say, yeah, we'll be able to leverage it. Just not straight from the raw YAML files for sure. Nicholas, do you see any integration between this type of semantic layer? Hey. Remember, you can raise your hand and come on and join the video with us if you're interested. We'd love you to do that. Do I see any integration between this type of semantic layer and with open data contract standard? So I don't know about open data contract standard, but I do know there's, I hate it. Everything, three letter acronyms are always get reused. There's a thing called OSI, I think, open semantic something interface or something, energy you know? And that's something Snowflake we're a part of, others are a part of that's trying to really get our handles around a standard for semantic layer so that we can exchange all good stuff. The argument there was, do I see an integration between this and that? Definitely, that standard, I see us trying our darnedest to be a tier one player in that space as that standard evolves. And then absence of standards, because that is kinda how the world works, sadly, at times, is integrations. So our data catalog, a big point of our system as a whole, is all about optionality, about where the data lives, about where the catalogs are, about where the permissions are. You know, we we have our own everything, but we can use everyone else's everything. So you have lots of optionality, and I can only just tell you the road map of our data catalog is kinda be a catalog of catalogs too. Kinda what we already do anyway, but really stepping up in that higher grained, classical data catalog that people think about, which is the the real rich glossaries and all that kind of stuff. A lot of what you see right here. So yes. Hey, Aaron. Take your time. 
Thank you for time. Sure. Do a couple of us here in SF. I love San Francisco. I know a lot of people have troubles with SF lately, but I love me since San Francisco. I think we're hitting a time, which mean means the event may hang up on us. And if it does, so be it. If you're here, though, I'm gonna hang around and watch the chats. So the like I said, if you're watching this on demand, it might cut off right about now, but it may, also keep any additional information. I think I hit all the questions that I saw rolling in in the, in the, chat there. But if, yep. Yep. Yep. If I've missed you, type it again or shout out really quick. And while I'm looking for those shout outs, last minute questions, I will just say, again, thank you one gazillion times, and know that, you can reach us from a whole bunch of ways. We have a ton of content and a ton of topics. This is my job, is trying to share this information, trying to help you out. A whole bunch of ways to get a hold of me, but last mile, if you have nothing else, you have devrel@starburst.io, and you'll find me well, you'll find me in lots of different ways. And, hopefully, you'll find me being responsive to you. So I thank everyone. Quincy, I think that is it that I see in the chat. So we will, sign off on this one. And, if you're watching this on on demand, thank you for doing that. And if you're here live with you, double double, thank you for doing that. Y'all take care, and have a incredible, beautiful day. Bye now.