Video: Hands-On Workshop: Building AI Functions with Starburst Galaxy | Duration: 3440s | Summary: Hands-On Workshop: Building AI Functions with Starburst Galaxy | Chapters: Introduction and Agenda (7.28s), Starburst Overview (221.605s), AI Architecture Overview (524.575s), Exploring Starburst Galaxy (755.59s), AI Model Configuration (849.955s), AI SQL Functions (1223.94s), RAG and Embeddings (1800.425s), Storing Vector Embeddings (2681.595s), Real-World AI Application (2830.72s), Concluding Thoughts (3126s), Conclusion and Feedback (3182.625s)
Transcript for "Hands-On Workshop: Building AI Functions with Starburst Galaxy": Hey, everyone. My name is Lester Martin, and I'm coming to you live from Atlanta, Georgia. I'm a developer advocate here at Starburst. As the title of today's webinar says, it's a workshop. In fact, I'm going to show you where you can follow these instructions yourself. Anything I do today, I've documented and put out on the web for you to check out.
Again, I'm Lester. Just a quick connection slide before the content, in case you like what we talk about today and want to reach out to me in the future. You can find me here. But ultimately, if nothing else, devrel@starburst.io will find me personally. There's only a handful of Lester Martins out there on LinkedIn, so you won't have too much trouble finding me.
I went back to the top just to see it: Using AI functions with Starburst Galaxy. So I've got an agenda slide. Like most times, I'll do a few minutes of Starburst introduction. If you're an established customer of ours, or you know Trino really, really well, that's awesome. Bear with me; maybe you'll hear a few things you haven't heard. If you're new to us, it's a quick introduction. I won't give you everything you need to know about us, but I'll give you an awareness. Then we'll jump into it. I'll show you where to find the hands-on tutorial and make sure you have that link. In fact, it's in the docs tab up there by the chat; there's a link that says instructions or something like that. You're welcome to check it out. Then I'll jump into the three main sections. First, I'll show you how we can configure, administer, and secure models for folks to use, meaning embedding models and large language models.
Then I'm going to roll on from that and say, hey, great, now that we have those configured, what can we do with them? What can we do with them from SQL? So, functions that are available from SQL; we'll try a variety of things there. And then I'll finish up with a bit of a RAG or TAG, an augmented generation kind of workflow. The classic goal: go get some additional information when you go to the prompt. And again, the whole goal is to do all this within SQL, inside your normal SQL editor.
As it says on the bottom left, and for those outside the US the joke may not be as funny, here in the US we sometimes say, "I'm from Missouri." Missouri is the Show-Me State, so we'll do a lot of "show me." We'll try to do a lot of demos. And the demos reference that hands-on tutorial, which I'll pull up on screen in just a moment. But you're welcome to look ahead in the docs tab at whatever's listed there. Let me see what it's called. I think there's one item there. Yeah, it says AI functions tutorial. Can't miss it.
While I was there, I looked at the chat and I don't see any messages. Maybe give a shout-out just so I know everyone can hear me fine. Maybe just say, "Hey, it's Freddie from Kalamazoo," or something like that. And if you want, give me a quick intro of where you are physically. Hey, thanks, Sarah. If anyone wants to let us know what part of the globe you're in today, that's awesome. Again, I'm in Atlanta, Georgia. It's just north of Florida, and I think everyone in the world kind of knows where Florida is, at least roughly.
Alright, Starburst introduction. Let's get back into it. I do a lot of Starburst introductions, and as a developer advocate, as a technologist, I often go nuts and bolts.
Today I'm going to do a high-level picture and just say a couple of things. In the very center, that's the Starburst logo, but we have another logo there: a bunny named Commander Bun Bun. What that's telling you is that Starburst is made up of the folks who created the open source project called Trino. Originally it was Presto, and then it forked; the same people who created it are on the Trino side of the fork, if that makes sense. Trino is our core query engine. We're an open-source-plus environment, so we do some enhancements to Trino, and then we do some additional things. If you think of Trino as a data analytics framework, we push out to the left and right of it: AI on the right, and ingestion on the left. We're trying to encompass a lot of different activities.
What's maybe not perfectly unique, but really a selling point of Starburst and Trino, is the fact that we can interoperate with a ton of different data sources. We have a plug-in connector architecture that lets folks bring in just about any kind of data system. Not every system is covered, but lots and lots are: NoSQL stores, messaging platforms, relational systems of course, OLAP systems, data warehouses, and other weird stuff too, like search engines. And absolutely data lakes, object stores, HDFS, things like that. We want to get access to all that good stuff, and then let folks collaborate on it via this single point of access. It gives you a single point of collaboration. It's not just that I can go look at system one, system two, and system three; I can actually merge those together with a join and run a federated query across them.
We also bring forward a couple of other things. We have our own Starburst AI agent. We're not going to discuss the AI agent today, but it's on the last slide.
I'll point you to an on-demand webinar where we went through that in great detail. We have an MCP server; I'm not going to discuss that today either. We're going to focus on these functions, but there's a lot we can do there. And of course, if you bring it all to one place, you want to govern it in one place. Governance often means security, or it often means lineage, and I'm talking about both. We offer lineage, security, and even monitoring, all in one place, in one tool set.
If that was a quick walk across the picture, let me fill it out. We're finding datasets all over the place, and in fact discovery is one of the elements where we did a lot more work outside of the core open source Trino itself. We can bring in additional details, and we use the phrase "data products." You don't have to use data products. A data product is really marketing, messaging, and curation on top of existing datasets and schemas and that kind of good stuff. That's another longer, deeper conversation, so I'll move on.
Then we want to let folks use it. These middle two icons are what we've seen people use historically. Analytics, absolutely. Many different companies are embedding our framework in their own offerings, especially where they do a lot of reporting and those kinds of activities. We want to do some of the AI and ML stuff we're going to talk about. And of course, bring your own AI agent. We believe that good data access, good data quality, and good data governance are fundamental to good AI use. This is the data you want your models to leverage. And if you take the AI out of the picture, and even take ingestion out of the picture, and ask what we're really talking about, it's what I've been saying.
You use object stores or HDFS, things like that. Often we like to separate storage and compute, and our engine is a compute engine, not a storage engine. So we'll use any storage you want. We work really well with data lakes. With open file formats, we're talking Parquet and ORC; maybe you've heard those terms. If you haven't, you might have heard of things like Iceberg versus Delta Lake. Those are table formats, and that's the space Apache Hive created with, in all fairness, the original table format. Sitting on top of all that is Starburst. That watery line suggests that outside of the data lake, there's data in a lot of other places.
One of my favorite things about working at Starburst, and I came here about four years ago, is that it's the first company I've seen in my thirty-some-odd-year career that really says, "Hey, customer, guess what? You didn't do anything wrong. We're here to help." Usually a vendor is going to tell you, "Cool, I like what you did, but this would be better." We're definitely going to give you opportunities to be better, but we're absolutely going to work with what you've got to start with, and then assist when you want to make changes and move things around. I think that's pretty powerful stuff.
To wrap up, I've got a slide or two here, and then we'll get on to the AI stuff. We run this thing just about anywhere. Starburst Enterprise is our classic software that you deploy and run wherever you want: on-prem, in the cloud, bare metal, Kubernetes, whatever. Starburst Galaxy is our software-as-a-service model. I'll be doing my demos today in Starburst Galaxy, just because it's easy and because I made sure the instructions I'm about to show you work perfectly in that space. And then there are lots of other ways, like marketplaces.
In fact, we're in a Dell appliance, and we have a few other hardware vendors that we recently announced partnerships with. So in the future, look for concepts similar to what you see with Dell, with some additional hardware vendor integrations, which is kind of nice.
Alright, what did we come here for today? I'm going to pause before I explain anything about this picture, because I'm going to explain it later. But I'm glancing at the chat, and thanks to the folks who said hi and let me know where they're coming from, including South Korea. Hey, hey. My big statement about the chat is that I'm trying not to stare at it. When my head turns like this, you know I'm glancing at the chat. Right now all I see is a bunch of hellos, and that's great. I encourage you to put anything you want in there. You can use the Q&A tab too, but preferably just drop it in the chat. Everything's probably a question anyway, unless you're just saying, "Yay, Lester. Awesome." I'll be glancing, and I'm definitely going to stop at the end and leave time for questions. But if you have a question, jot it down. I'll try to keep peeking at them, because sometimes it's better to stop and focus on a question before we get further and people lose sight, or just can't move forward with what they have.
Alright, so what are we going to talk about today? We're going to talk about this AI architecture. All those things I was talking about a minute ago, you can think of them as that purple band you see right about here: the Starburst data platform. It connects to all these kinds of data sources on the left. And when we say "connects to," we're ultimately talking about storage here. I show some stuff across the bottom.
We really store it here on the left, but I wanted to call out the fact that we're going to leverage that storage very specifically with some of the AI activities we have there. And what you see on the far left and the far right is the same general point: we aren't building models. Absolutely not. We're using the well-known models that we all have access to, and we're going to configure those. You can run them wherever you need them to be, and that's really a benefit of using Starburst. In the cloud it's no big deal; there are lots of partners and providers, and you may get a great contract where they don't use your data for retraining or selling and all that. But there are plenty of companies that want to run Starburst Enterprise in their own realm, and they might want an air-gapped-ish, tightly controlled model that they don't have to worry about. All of that is perfect for us; we live well with all of it. We'll see this picture a bit more as we continue on.
Alright, went the wrong direction. Show me, show me. What do I want to show you? Oh, I'm going to show you this. Very exciting. Let me exit full screen. Hey, Robert. Let's do it that way. Yeah, that's what I'm going to show you. So go to starburst.io whenever you have some time and play around. But the first "show me" is: click on something like Starburst Galaxy and get going here. And realize this phrase, "Start Free," doesn't mean a free solution for only a short period of time. You can create your own Starburst Galaxy environment. I have mine right here, lester dot galaxy, and it gives me the opportunity to keep this thing alive. What Start Free gives you, I think, is about $500 in usage credits.
The website describes what a credit is, but if you have a 10-node cluster, it uses 10 times as many credits as a single-node cluster. You get it. And what's really cool, as the name might tell you, is that you can use a free cluster. I'm going to do all my demos today, including the instructions you have, on a free cluster, so we won't even burn those credits. The only real caveat is size and scale. It's one node, and it likes to shut down after about five minutes. It notices you're not playing with it, and we save money. We used to let it go for about an hour, but someone decided that was costing us too much money, and that's fair. I get it.
Alright, next section: the hands-on tutorial. What instructions are we going to use throughout this? There's a QR code here, but this is the same link that's in the docs tab. It should show you something that looks like what you see there. In fact, "show me" says: let me see if I can show you. I've got it pulled up right here. Here it is: Utilizing SQL-based AI functions with Starburst. I've got about a handful of steps here. This is every single thing you need to do, every single thing you need to type or click on; it's going to baby you through all of it, with the exception of some of the prerequisites. One of the big ones is that you need access to write stuff. I'm going to make you do another tutorial if you don't have your own bucket. If you want to use one of our S3 buckets, that's great; we have a tutorial, and it'll take you just a few moments to set that up. I won't click through that today; I'll let you explore it on your own.
The other part is that you need a provider with models up and running. They don't have to be OpenAI, but all my screenshots are about going to OpenAI, getting an API key, and funding it. To be honest, I point you to them to get that set up. And I'm more than glad to help; I've done that a few times myself.
So, devrel@starburst.io. And all of these tutorials have a link at the bottom. It says "Report a mistake," but it's really just another link that launches an email that'll find me as well. Alright, so there we go. We'll use this tutorial as we go along. Now, I'm going to keep hitting this button because I don't want my five-minute timer to die on me. That's all I'm doing there.
Let's see what else we can show. Back to the slide show. Let's talk about the fact that we don't have our own models, but what we've done is elevate the concept of a model to be almost like any other thing in our environment. What does that mean? We have tables, of course, and views, and all of that, which we can apply role-based or attribute-based access control to. But we also have concepts like clusters and catalogs. So we already have this access model built around things. We just said, great, we have another noun: models. And you can set those up.
So why not just show you? I'm going to run back over to my Starburst environment. Right here on the left-hand side, we've got a big old button called AI models. And guess what? I've already created some here. Just to show you that the instructions cover this: we do a "prepare environment" step, which I've already done, and I've got them all set up. Then the instructions will say, hit this button called "Connect to a model," and they'll tell you what to plug in. I'll show you what that looks like. I'm going to configure two of those, and I'll show you the ones I went ahead and preconfigured; I think that's the easy way to do it. For today I believe I named them both starting with "AI functions." If you look here, this one says embedding and that one says LLM. Either way, when you say you want to connect to an external model, you really have a couple of options.
We're definitely highly integrated with Amazon Bedrock, and Bedrock then lets you use all kinds of models. Or you can use OpenAI's approach. You don't have to use OpenAI itself, but you do need an OpenAI API-compatible service, and those are pretty easy to find out there on the Internet. Once you pick that, you give it a name and a description. After that, well, I guess we should just try it: embedding, OpenAI. It's going to ask you for details depending on what you're doing. If it's OpenAI, we really just need an endpoint and that magic OpenAI API key that I'm saying you have to get, and that I point you to where to find. And you absolutely need a model name, so you need to visit your vendor and see what they call it.
Ultimately, I'm going to close this and say, I created one of these right here. So let's edit it and see what it looks like. I created an embedding model. If you don't know what embeddings or vector embeddings are, two things. One, the lab gives you some overview in a couple of places. In this doc, you'll see this "What is RAG" video. It's five or ten minutes, probably ten, and it does introduce vector embeddings and what we can do with them in a vector database as a concept. Two, I'm going to introduce it again here today. But if it's new to you and the bits I give you today aren't enough, go back and watch that video. If that's still not enough, start looking on the Internet, or come find me, and we'll help make sure you get it.
But ultimately I set one up, and this is just the name I gave it. It's the little handle that I'll reference this model by. I just followed my lab and used "AI function," so "AI function embedding," and that all worked. I also have the other kind of model that we support today, which is an LLM. Same thing: I went to OpenAI.
I use GPT. I always say "GTP." GPT-5 nano. Why? Because I'm cheap. Nano is the cheapest per token, and it works just fine for our lab today. We could spend all day talking about models, which ones are best and why they might be the best. And the reality is we might end up right back where we started, because there are a lot of opinions. What it really means is that you have to find out what works best with your dataset. There's a lot of testing that people go through here.
Now, I'm not going to go in and create a policy on this. You can see my user in the upper left; I'm doing everything as account admin. But you could set policies on who can use these various models. You haven't seen how to use them yet; that's what we're going to next, basically via functions. But you could say: I've got the expensive model that I don't want anyone to use, or only three people to use. I've got the nano model that's pretty cheap, so I'll put it in a group or role called "most people can use this." And maybe the average user doesn't get access to any of these. You can control not just access to a model but also how much it's used: you can meter usage by time or by tokens. Lots of things you can do there. I thought there was a third thing I was going to mention. I guess it's the functions themselves, the ones you're about to see: you could slice and dice those too. You could revoke a lot of them and then grant only specific ones.
That said, let's see some of those functions, because that'll make more sense. That's what we really want to talk about today. Time check: okay, looking good. Show me. So let's talk about these AI functions. Alright, I really didn't explore this diagram too hard yet, and we won't go too hard on it now.
I will just say that what we did, and the lab will point you to the docs page and everything, is offer some SQL functions for this. What are some existing SQL functions? Things like concatenating two strings together, date formatting, trims, rounding, all that fun stuff. Those are just the things SQL already offers us. They're SQL functions, so we added some more SQL functions. We're going to tackle these spaces here, and I've got a slide for each one. But what I really mean, again, is that they're functions, and you embed them in your SQL statements. We're going to see all of this in practice in about thirty seconds, maybe a minute and a half at most.
There's a simple example on screen. Sometimes the stuff on screen doesn't always work, but I think this one would work just fine. You run a function called prompt: what is the capital of the USA, and only provide the name of the capital. And then the other argument, which we have in all of our AI functions, is the model. What's that little personal name that we applied to it? Which model should I be using? Because we could have a plethora of those installed. That will return something like, probably, Washington, DC. Prompting is what you're already doing in your chat tools. If you're not familiar with the word "prompt" but you're using some kind of AI chat tool, you're prompting all day long.
Now, we have a handful of other functions that do things like classification, where we offer up some text. We call this handful of things task functions. What we're really doing is some prompt engineering for you behind the scenes. So you can just think, hey, I just want some data classifications.
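To make "prompt engineering behind the scenes" a little more concrete, here is a minimal Python sketch of how a classification task could be wrapped into a plain prompt. It is an illustration only, not Starburst's implementation: the function name `build_classify_prompt`, the instruction wording, and the sample text and labels are all assumptions made up for this example.

```python
def build_classify_prompt(text: str, labels: list[str]) -> str:
    """Wrap free text in a classification instruction (hypothetical wording)."""
    if not labels:
        raise ValueError("at least one label is required")
    label_list = ", ".join(labels)
    return (
        f"Classify the following text into exactly one of these labels: {label_list}. "
        "Respond with the label only.\n\n"
        f"Text: {text}"
    )


# Illustrative input; in practice the text would more likely come from a column.
prompt = build_classify_prompt("Buy now!", ["spam", "not spam"])
print(prompt)
```

The caller only supplies the text and the labels; the instruction around them is the part a task function would take off your hands.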
I'm going to show you an example of each one of these in a minute, beyond what you see there. This one says, pass me in "Buy now." And these could be things coming from your database, from your tables and your columns; they don't have to be hard-coded like you see here. So this phrase, "Buy now": is it spam or not spam? You pick what you want the classifications to be called, and any number of them. And it says, yeah, that's probably spam. We'll look at that one.
There's sentiment analysis. "I love Starburst": is that positive or negative? Sentiment analysis is going to give you positive, neutral, or negative; I think those are the three answers we'll give you. Translation could definitely be useful to you, especially if you have clients in multiple countries and you're trying to turn everything into a certain format first and then work with it. So, convert to a variety of languages. Grammar correction may or may not be useful to you. Again, these fall into that task-based category where we helped out a little bit. And then I think a really powerful one is data masking. I'll talk a teeny bit more about that later, because I want to get into the "show me" on these functions and show you that they're out there, they're real, and we can use them right now.
Alright, so I'm going to go back over to my query editor. It looks like my server went down, so I'm going to hit the run button and we'll wait a few seconds here. I did bump the font up to about 125 percent, so it may or may not be the easiest thing to read. For now I'm using what you saw on those slides; I'm just hard-coding something.
You might really say select prompt, and then instead of hard-coding "What underlying tech does Starburst use?", you might take some data out of a field, or merge multiple fields together, or pass in a big text string. This is just an easy way to do it without data. As we build the RAG workflow, of course, we're going to build a table and try to show you something a little more real-world than just typing the prompt. The prompt's going to return any moment now. I shouldn't have let the cluster shut down on me, but it did. It's going to tell me something like: you know what, it looks like Starburst, as Lester told you at the top of the hour, is built on top of the open source project called Trino. That took really long. There we go, it finally came back. And there it is. Short answer: it's using Trino/Presto. Here are some details, in case you need more specifics. This is just what you're doing all day long already. Nowadays, even if you just go to Google and type something, you get an AI overview before you get anything else.
Alright, but here are the more fun ones. There's the exact one you saw: "I love Starburst," give me the sentiment on that. If you want to try a bunch of other stuff, you can. It gave me positive. I have a slightly bigger one here. It simply says, tell me what this guy is saying. I pulled this one from Reddit. I follow a lot of subreddits as part of my job, and somebody left this sort of quote. They were really happy with, you know, blah blah blah: "most impressive," does this, does that, "if I'm going to start from scratch, I'm going to make sure this is there," blah blah blah. I already knew it was very positive. I just wanted to make sure the tooling said, yeah, that's positive. And you know what we use this for. We use this for chat logs.
We use this for tweets, whatever we're capturing. And again, the intention might be that this is a column in one of your tables, and you could pull that value out and say, hey, is that good or bad, right or wrong, or somewhere in the middle? Help me get a fresh start on something like that.
You saw classify on the slide, where it said "Buy now" and asked, is it spam or not spam? Well, I used that exact same Reddit ramble and asked, is that spam or not spam? And it was nice enough to say it's not spam. That's good. I guess it didn't have all the code words. But I take that same query one more time, and my main point about classification is that it's not a spam classifier. It's a general-purpose classifier, so you just tell it the labels you're looking for, in your language of choice. I'm using English. I called them cursory, average, or exhaustive. So the engine should figure out: is he just saying it looks cool, or did he write a novel? It was an average kind of write-up, is what I was implying, and it seemed to figure that out for me. Again, that's the power of the language model. It has nothing to do with us. We're just trying to make sure that stuff is available from SQL itself.
And then, of course, fix grammar. I used a string right from Reddit, from the data engineering subreddit. If you spend any time in any subreddit, you know that not everyone out there is a ninja grammarian. Half the time they can't spell, and I'm not being mean or facetious, but there you go. And of course it's an AI response, so it's got that dash that we all love to hate and all that good stuff. But for the most part it kept the spirit of the intent and did a little bit of light cleanup. And if I ran it again, it might actually look a little different.
That's something we need to remember about these models. In fact, I call that out a few times as we work through this lab. In the lab, once you've made it all the way down here to AI functions, I think I called it out. I mention it there, but I'll just say it now: large language models, in general, are probabilistic engines. Right? You ask a question and get a response. Ask the exact same question one second later, and it's not guaranteed that the answer will be exactly the same, especially when we're talking about this kind of free-form world. It's fair to say that the more rigorous your data is, the more deterministic the model's answers likely get. What were the average sales by region code last year? If that's coming off a highly structured table, and we'll talk about that next with the table-augmented generation concepts, then you should expect the answers to be a lot closer. But Lester can't guarantee they'll be perfectly deterministic, because ultimately these are large language models. We have to keep that in mind. So there's still a human in the loop, at least at some kind of advisory level, and that's something we should always be thinking about.
Alright, a couple more things. Let's do a quick translation. Same long message, translated to, what is that, zh-TW: Chinese as used in Taiwan. I won't be able to read that when it comes up, but if anyone on the call can read it, great. I could've made it Korean or something, but I didn't. There we go, there's the translation. Yep, yep, yep. What does it look like translated to Spanish? Easy stuff.
And then I asked, could you chain these things together? You could. So I'm going to chain them. There we go. I'm going to fix the grammar first: take the English and clean up the grammar. Then I'm going to say, okay, with that response, turn around and translate it to Español.
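The chaining just described can be sketched in Python with a stand-in model. Nothing here calls Starburst or any provider: `StubModel`, `fix_grammar`, and `translate` are hypothetical names invented for this sketch, and the stub only counts how many times it is asked for a completion, so you can see what a chain implies.

```python
class StubModel:
    """Stand-in for a configured LLM; counts calls instead of contacting a provider."""

    def __init__(self) -> None:
        self.calls = 0

    def complete(self, prompt: str) -> str:
        self.calls += 1
        # A real model would return generated text; the stub returns a placeholder.
        return f"<response {self.calls} to: {prompt[:40]}>"


def fix_grammar(model: StubModel, text: str) -> str:
    # Hypothetical instruction wording for a grammar-correction task.
    return model.complete(f"Correct the grammar in the following text: {text}")


def translate(model: StubModel, text: str, language: str) -> str:
    # Hypothetical instruction wording for a translation task.
    return model.complete(f"Translate the following text to {language}: {text}")


model = StubModel()
message = "me and him goes to the store yesterday"  # made-up sample text

# Chain: the output of the grammar fix becomes the input of the translation.
chained = translate(model, fix_grammar(model, message), "Spanish")
print(model.calls)
```

In this sketch each step of the chain is its own request to the model, and the second step works on whatever the first step happened to return.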
And the answer here is likely different from the answer I got when I translated the original, pre-correction text. It's kind of a double probabilistic effect. Yeah. And my additional comment from a minute ago would be that, in that scenario, chaining them might not be the best answer, because we have to go talk to the model, and then we have to go talk to it again. This is a case where, while we're trying to help you by not making you become a prompt engineering expert, it'll probably be less expensive to write the prompt yourself. Oh, there we go; I thought I did an example. So you might just use prompt on that one. And in fact this is getting us to what we're going to show next, which is augmented generation. I'm going to say: "Correct the grammar in this next text and then translate that corrected text to Spanish. Here is the text" (there should have been a colon there, but it'll still work), and then the message. And it comes back with something else. If I compare what we got here with the chained one, it may not be exactly the same, for all the reasons I mentioned.
Alright, moving on. You got that; I don't think I need to run that one again. Last one of these basic functions before we do some RAG and TAG stuff, and it's masking. I said earlier that I'd talk a little bit about masking. Masking is something data platforms absolutely do already. We do it in Starburst. We can do column-level masks, we tie those to permissions, and we can do all kinds of cool stuff. The limitation, or the rigor, of that is that it's designed for structured datasets. If we know we have a phone number field or an address field, we can apply all kinds of cool masking. But what if the field is just a big text dump of a chat conversation, or a dump of something else?
So this is a way to, you know, handle that. And plus it's super flexible. You can tell it whatever you want. You know? Hey. Find phone numbers in there. Find addresses. You can put HIPAA. You can put PII, whatever term you think makes the most sense, and make sure the model you're using can determine what falls into those categories. And our general first pass of this doesn't offer you the opportunity to make up your own masks. We just have masks. I would look for that as a future thing, or you could, you know, unwind it a little bit and do the prompt engineering yourself, or ask us to do it for you or something like that. It's pretty darn valuable, the masking, above and beyond just what I talked about earlier. Let's look at our slides. I covered the basics of these kinda task-based things, but in fairness... there we go. Show me. In fairness, what I really want to show you (I got a thumbs up for something. Thanks.) is this notion of, you know, I hate to call it an old term, but in AI terms, RAG is probably, like, the oldest term around. Yeah. It's still pretty useful, and we'll talk about it here, and I'll even introduce you to an alternative, TAG, table augmented generation. So what do they really mean? You saw a bit of augmented generation a minute ago. It really just means bring more data to the prompt so the model can figure out a better answer. And in our enterprise worlds that we think about, what we're really saying is bring data that I know the model wasn't trained on. So if I was a telecom, and a customer is asking about their, you know, let's just say there's still some old DSL connections out there or something. I'm in Kalamazoo, and I have a DSL, and I got a problem. 
What we wanna do is go find some details, some documentation that we use internally, some prior customer resolutions, pull that together and say, hey, large language model, knowing this is the internal documentation of how we work and debug issues, knowing these are some cases that were resolved well, help me with this question, the question that the customer asked, and we should get a better answer. And this is that space where, in fairness (well, that's a pretty picture), this is probably where a lot of people really are when we talk about production. They look at call centers, help resolution, things like that, and try to find a way to, you know, get better answers, be a little bit cheaper. But my hope, at least today, is that they didn't just leave that autonomous all by itself. I'm hoping there are still humans in the loop. And, you know, those that know support centers and call centers and all that, they usually have those levels. And I'm hoping, at some point, there's a human level that's being a supervisor and watching this stuff and chiming in and making sure things are getting better and better and better. Well, I promise not to be a genius and explain vector embeddings, but the problem is this. When we talk about structured columns, favorite color: blue, green, yellow; phone number: +1 2345; address... easy breezy. We have very tabular data, and we can work with that well, but often we have these big chunks of text. They could actually be a whole document, a PDF full of text, but it could be just a column, you know, as well. And the reality is these engines work on this concept called embeddings, or searching embeddings, or storing vector embeddings. 
And, really, all this picture is trying to say, very simply, is we take a word or words, in our case probably a sentence or a paragraph or a chapter or a conversation with a customer, and we push it against some kind of mathematical model, an embedding model. And we say, generate me a mathematical representation of that. And then here's another one, and then there's another one, and here's another one. And we store these in something called a vector store, a vector database, or something. But when we do a bunch of them, what ends up happening is things that are closely related end up being close together in that vector space, as they call it. So what I was just trying to show on the right: maybe things you eat that have bread involved, sandwiches, hamburgers, and hotdogs, kind of fit together. But the color coding looks a little different. Like, sandwiches and hamburgers look a little closer than a hotdog. And again, this wouldn't be easy to see from the words. It had to be done visually to get there, but net net net. And it would be the same thing with customer conversations. You work for Omnicore. You're worldwide. Maybe in reality you're just doing all customer chats. I won't even try and explain why, but they actually will start to kinda group together based on what kind of conversations and activities have occurred. Alright. That's the lightweight version. A mathematical representation of chunks of words, hopefully relevant chunks that make some sense, so that when we compare them to other ones they will naturally kinda clump together. Why do you wanna do all that? That's what we're trying to get to. Alright. So that means, unfortunately, we have to kinda do that ahead of time. We have to look at that data, run it through that embedding model, and store it in some kind of vector database. There are other terms out there, parsing and chunking. 
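To make the "closely related things end up close together" idea concrete, here is a minimal Python sketch. The three-dimensional vectors are hand-made assumptions for illustration, not output from any real embedding model (real embeddings have hundreds or thousands of dimensions), and `cosine_similarity` here is a plain-Python helper, not a Starburst function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-made 3-dimensional "embeddings" (assumption, not model output).
embeddings = {
    "sandwich":  [0.9, 0.8, 0.1],
    "hamburger": [0.8, 0.9, 0.2],
    "hotdog":    [0.6, 0.5, 0.6],
    "invoice":   [-0.7, 0.1, 0.7],
}

# Compare everything against "sandwich"; a higher score means closer in vector space.
for name in ("hamburger", "hotdog", "invoice"):
    score = cosine_similarity(embeddings["sandwich"], embeddings[name])
    print(f"sandwich vs {name}: {score:.3f}")
```

With these made-up numbers, hamburger scores closest to sandwich, hotdog a bit further away, and the unrelated item lowest, which is the clumping behavior the slide is describing.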
I did my "What is RAG" video; if you watch that, it'll give you a little bit longer version of this, and hopefully it'll make some sense. But there's some prep work, and then there's the runtime work. And the runtime work is the RAG app. You know, a query comes in to your application. In fact, we're gonna build a simple workflow in SQL that's gonna do the same thing. We're gonna ask this query a question and it's gonna go, well, let me go see if there's data that looks kinda like that question. And then let me take that and supply it, stuff it, context load it, whatever term you like, and prep the LLM with, hey, knowing this now, in addition to what you already know, what do you think a good answer is? And then hand it back. And there are many pictures on the Internet that make it look a lot more complicated than that, and it absolutely can be, because nothing is golden here. But the main point I wanna make: there's a little bit of effort we have to do up front. I'll show you what that looks like in our little simple workflow. We'll get it all ready, and then we'll leverage that vector information. Alright. Let's do ourselves a quick favor. Let's go back because I know it's probably timed out by now. And we're gonna run, yeah, a mask query again just to keep it alive. Alright. Let's get into it. Let's do a show me. Okay. So, arguably, prompt, you saw it, is a function, and all those other functions are just using prompt. So, really, we have two functions we really need: create or find embeddings and, you know, prompt. And it gets about as simple as that, in all fairness. So in that prep work, what we're gonna do is we're gonna use an Iceberg table. I'm gonna set up a table. I'll show you that in a second here. And we're gonna alter that table. We're gonna add a new column so that we can actually store those vector embeddings lakeside, in the table along with the data. And for those that know a lot about this, you might say, hey. 
What if it's that PDF you showed me over here, Lester? Absolutely. We don't solve that problem. You need some kind of preprocessing. Now, if you already have data that's naturally in logical chunks, and my example will be, so maybe you have a table full of FAA docs or something, by chapter or something like that. It's a good example that the rest of the marketing team and the product team love to show. And each chapter has, you know, book name, chapter name, blah blah blah. And then it has a text field with the chapter in there. So what we need to do is place that embedding somewhere. In fact, we'll use a function called generate embedding to calculate that, and we'll store it in the same table. And then what do we need to do? Well, now our ETL side is done, because the hard work would be taking something like a PDF and getting it into those logical chunks. There's still a lot of work there. And, again, that's not our sweet spot. That's not what we're trying to tackle. There are plenty of pre-work systems, and I could point you to a handful of them off the top of my head here. But, ultimately, once that stuff is out there, we want to say, can we find data that looks like the question? And the good news is you're gonna see in a second that these embeddings, really all they are, are an array of numbers between minus one and one. So it's a multidimensional array, 512 dimensions, 2,000 dimensions. It can get pretty interesting. And in fact, historically, we use a special kind of database called a vector database to store those things. It doesn't calculate them, doesn't figure out the embeddings, but it can store them and see differences. Again, we're putting ours in an Iceberg table here. If you really, really wanna use a vector database, you still can. 
But for a lot of work out there, by narrowing it down to maybe a hundred or a thousand rows, taking that data and just ripping it, pulling it out of an Iceberg table is gonna be cheaper, less complicated, that kind of stuff, and solve probably 90% of your problem. So we're gonna use existing Trino functions, just cosine similarity. I'm gonna show you all that right now. And then lastly, we're gonna say, hey. Call the prompt. And as this real casually says, hey. Supply the user question plus that additional context and get an answer. And let's go see it. I should have saved this until later. I'll just say it real quick in case it comes back up. Table augmented generation really is the same thing, except that your data is very well structured in your tables. Grab columns a, b, seven, and q, and nine, and then use those as the augmentation data. My example is going to be, again, on the free-form text. Alright. So my example says, hey. We got a use case. I'm just showing it because everyone that knows who Fred Flintstone is, they'll see that picture and go, cool. And the use case we're gonna have is Fred Flintstone's writing a journal, and he's using a digital version on his phone, not a hard-copy one. And, really, what we wanna ask is a question like this. Does Fred Flintstone like to send and receive mail? Let's go ask that question. Here it is. According to GPT-5 nano, does Fred Flintstone like to send and receive mail? And the answer probably is... oh, there it is. There's no indication in The Flintstones (that's the TV show, for those that aren't familiar with this cartoon character), there's nothing that really suggests that he likes to send or receive mail, and it just rambles on and, you know, if you got a particular thing, you know, just prompt it. It's like, I don't know. Tell me more about what you're looking for. And that's okay. That's what I expected. So what we wanna do is, what if we had access? 
So this is a terrible use case because Fred thinks he's privately storing his most secret thoughts in his diary online, and we're leveraging them. But use your example. Use the user manuals. Use your patient notes from your hospital, your tech installation notes from the techs that you roll out when you install Internet in residential neighborhoods, that kind of stuff. Alright. So we're gonna build a table. We're gonna call it diary entries. Again, all this is back in that example should you choose to play with it yourself. I hope you do. We built a quick table. It's just called diary entries. It's got an owner, a date, and some text. And then I went ahead and made up not too many records, a handful of records, to load up for Fred. And I kinda backfilled the dates: ninety-nine days ago, ninety-eight days ago, so on and so forth. And they're things like this, you know, very first one. Hey. It's a new year. I'm gonna keep a diary. Things are kinda boring here in Bedrock, you know, but my family is not boring. Just noise, stuff about his day. Good stuff. And, again, you see 13 rows, magic 13, lucky number 13. Now I said, this is a scenario. Great. We already have data kinda parsed and chunked in logical entries, diary entries. This is actually perfect for the use case we're gonna have. So I said, I wanna store our vector embeddings lakeside. I wanna store them in this table. So I just added a new field, and I could've added that in the create table. You know, I probably would have done it. But I called it entry text embeddings. The field is called entry text. This one's entry text embeddings, whatever you want, but it's nothing but an array of doubles. Numbers from minus one to one. And we don't care how many. It's just a big array. And then I wanna do that kind of ETL work. So what do I need to do? And you can do this whenever. You can do this as you're loading it, likely. But what am I gonna do? 
I'm gonna say, hey. Run me an update, and I'm gonna set that embeddings field, here on line 107, to the value of: take, you know, the entry text, calculate those embeddings, and then jam them in there. And I just went ahead and said, hey, in case any of them have already been done, I don't wanna do them all. You know, I just wanna catch up the ones that are missing. So what you see, ultimately, if I click on this new field, the entry text embeddings, is that mathematical representation of, hey. I'm happy to get started journaling. My family is not boring, but my life is, blah blah blah. Okay. So if questions come up, shoot them out there. I'll take a look. So what do we have? We don't have a vector database. We have an Iceberg table in your data lake storing the embeddings. Kinda interesting. You know? No different than using a separate one. But how can we start to leverage this in a real-world example? Well, I'm gonna kinda piecemeal this just to help us get there. Let me give myself a lot of room here. I guess I can't bump the text too much here. So I'm gonna say something as simple as this. Hey. Go out there. Let me run it, and I'll explain it to you. So there we go. Boom. Boom. Boom. So I used a concept called a common table expression. But, ultimately, what am I saying? I'm saying, hey. Give me the entries for this phrase, this question: does Fred Flintstone like to send and receive mail? I take that, calculate an embedding on it, because that's my question, and then do a cosine similarity (because values between minus one and one look like a cosine, if you remember that). Find those other mathematical representations that are really close to that. The similarity creates a score, and then rank them by their score and ultimately limit. So what do I have in the bottom here? 
I asked for the top five. These are the top five. There's only 13 entries. These are entries that say, hey. It looks like the entry has something to do with that. And if you look kinda close, you might notice, even right here without looking at the whole text, you might see something like, let's see here. Hey. Postcards. Interesting. Oh, it actually says in the mail, but some of these other ones, it might just say, hey. I sent and received a couple, three postcards. This one, I think, has a postcard. There it is. Postcards at the bottom. All these because Fred's talking about his postcarding; in all fairness, that's what he started doing. And the trick is that the model didn't have to look for the word mail. It looked for, you know, the stuff. It does what the model knows how to do well. So it found some good text entries or journal entries that will be useful to us. Alright. I realize I'm running quite long. It's already forty-nine after, so I'm gonna get to the punch line. What can I do with those? Well, I could turn around... so I'm gonna run the same query, you know, does Fred Flintstone, calculate embeddings. And then, just to show you a little bit as I go here, I can then say with those results, I just build a JSON object. Because when we send stuff as context to the large language model, putting it in a pretty format helps. That looks pretty ugly. Back in the example over here, I think I burst it out and said, it looks like this. You know, journal entry and then the journal entry. So I'm saying, okay. Great. So I'm just formatting that a little bit. Why? Because I really wanna put all this together. So, again, I'm just running a little bit at a time. Same thing. Do the cosine similarity search on the embedding that we found. Package that up in a nice JSON object, and then that third part is really the new bit that I was talking about here. 
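The first two steps just described (score every stored entry against the question's embedding, keep the top few, then package them as JSON) can be sketched in plain Python. This is an illustration under stated assumptions, not the workshop's SQL: the row layout, the dates, the entry texts, and the tiny three-number embeddings are all made up, and `top_k_entries` and `package_as_json` are hypothetical helper names.

```python
import json
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_entries(question_embedding, rows, k):
    """Score each row that has an embedding, then keep the k highest scores."""
    scored = [
        (cosine_similarity(question_embedding, row["embedding"]), row)
        for row in rows
        if row["embedding"] is not None  # skip rows not embedded yet
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

def package_as_json(scored):
    """Format the selected entries as a JSON array to hand to the model as context."""
    return json.dumps([
        {"entry_date": row["entry_date"], "journal_entry": row["entry_text"]}
        for _score, row in scored
    ])

# Hypothetical diary rows with hand-made embeddings (assumptions, not real data).
rows = [
    {"entry_date": "2025-01-01", "entry_text": "Started a diary today.", "embedding": [0.1, 0.9, 0.0]},
    {"entry_date": "2025-01-02", "entry_text": "Got three postcards in the mail.", "embedding": [0.9, 0.1, 0.1]},
    {"entry_date": "2025-01-03", "entry_text": "Sent a postcard to Barney.", "embedding": [0.8, 0.2, 0.0]},
    {"entry_date": "2025-01-04", "entry_text": "Not embedded yet.", "embedding": None},
]
question_embedding = [1.0, 0.0, 0.0]  # stand-in for the embedded question

best = top_k_entries(question_embedding, rows, k=2)
context_json = package_as_json(best)
print(context_json)
```

In the workshop the same ranking happens inside the query itself; the sketch only shows the shape of the logic.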
This is my prompt. And this is not the world's best prompt engineering example by any means, but it is a quick example. I'm concatenating the string. Hey. Using the list of journal entries provided in JSON. In fact, by doing that, I'm kinda even telling it to ignore the Internet and everything else. So a better prompt would be, like, you know, make sure you leverage the show and la la la. But does Fred Flintstone like to send and receive mail? And what's the answer, you know, using the data I supplied? And the answer, hopefully somewhat readable, says, yeah. Hey. The entries kinda show that Fred is actively into sending and receiving mail, specifically postcards, and here's some key points. And overall, he clearly enjoys sending and receiving mail. That's good. That's what we wanted. And as I said, I could run it again and see if the answer is the same. I'm not gonna do that, especially for time purposes; it would be fun to do otherwise. That's it. That was a RAG workflow. So in other words, this query is really something that someone could implement. This is a trivial example in that simple flow I did, but I think it's a valuable answer, especially when you have, you know, data out there that could be useful. Now who is this targeted at? I'm not suggesting that you don't spend your multi-zillion dollars and get an AI 101 branded company to come help you do stuff, but your data analysts may have access to, you know, patient records, may have access to all those things we talked about for telecom, for a retail store. They might already do analytics on how customers spend their money and how they did it. Why wouldn't you want to let them, especially in their exploratory queries, use tools like this to start leveraging that more free-form data that you have? If it's already in a structured column, you know, they can, for the most part, resolve it well, or they can do what I said a minute ago. 
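The prompt step described above is just string concatenation: an instruction, the JSON context, and the user's question. A minimal Python sketch follows. The wording of the instruction is an approximation of what was read aloud, `build_rag_prompt` and `answer_with_context` are hypothetical names, and `fake_model` is a stub standing in for the real model call, so nothing here reflects an actual API.

```python
import json

def build_rag_prompt(question, context_json):
    """Concatenate the instruction, the JSON context, and the user's question."""
    return (
        "Using the list of journal entries provided in JSON, "
        "answer the question. "
        "Journal entries: " + context_json + " "
        "Question: " + question
    )

def answer_with_context(question, context_json, call_model):
    """Build the augmented prompt and pass it to whatever model function is supplied."""
    return call_model(build_rag_prompt(question, context_json))

def fake_model(prompt):
    """Stub (assumption): a real deployment would send the prompt to an LLM here."""
    return f"(stub) received a prompt of {len(prompt)} characters"

# Hypothetical context, as if produced by the similarity search step.
context_json = json.dumps([{"journal_entry": "Got three postcards in the mail."}])
question = "Does Fred Flintstone like to send and receive mail?"

prompt = build_rag_prompt(question, context_json)
print(prompt)
print(answer_with_context(question, context_json, fake_model))
```

Passing the model call in as a parameter keeps the sketch runnable without any service, and it makes the point from the talk visible: the augmentation is only extra text in the prompt.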
They can kinda package things together... but I think I'm muddying the water with that one. Alright. So just for fun, we got a couple more minutes. I will run just a few more. I'll just run them all at once. I went ahead and, oh, I won't do it now. I did a couple more. Like, how does Fred Flintstone feel about public transportation? Does he go to the movies? And, did I put public transportation twice? Oh, I think I duped it in here. In the instructions, I give you a third question. Public transportation, hobbies, and does he go to the movies? And I gave you the examples with the basic request and with the enhanced request, and remind you of a lot of other stuff. You can do a better job. Oh, I'll remind you that this is a terrible use of private information. Don't really do this unless your customers know you're doing that. But with that said, I think I will pause. It's about fifty-three after. The webinar, my audio and video, is gonna shut off in about seven minutes. So I'm gonna see if there are any questions. Before I do that, I'm gonna tell you a couple of things back in the slides. Quincy, there we go. She mentioned at the end of this, I don't know if it automatically pops up or when you leave or whatever, but it's gonna ask you two quick questions. Any good or not? And what else can we share? What other kinds of information are you looking for? So if you got a moment, it's only two questions, we'd love you to help us out. And then I'm also putting on the screen here this slide while I look at the questions. As I promised, there are other things that we do. We have our own AI agent that works really well, a very highly specialized engine with a chat interface that works really well against the data that Starburst knows about. There's a great webinar with the people that created that and the product managers as well. And then on the bottom there, you see, yeah. Yeah. 
If you're interested, join our community newsletter. If nothing else, you might get that cool shirt that I still haven't got one of. Still getting frustrated by that. So I'm gonna look at questions. I hope today was useful. I hope it was valuable. I hope you do tell us one way or the other so we'll know. And with that, I'm just gonna stare at the screen, stare at the chat for a few minutes. And if it goes dry for about thirty, forty, fifty seconds, we'll just call it a day. So folks that have had enough, thank you again for joining us. If Quincy has any other final thoughts, she'll type in the chat; she's been in the background, helping out, keeping things moving. But I don't think I saw any questions other than folks nicely letting me know where they're at. So if you got a... and they can be technical questions about what you saw, but they definitely could be questions like, I got a scenario; would this be good for this? Because I feel this is really good for eighty, ninety-plus percent of use cases out there. And because we're not trying to be the specialty boutique provider in all this, I'm not suggesting you still can't spend millions and millions of dollars on your AI projects, but this is an approach to bring those concepts, those technologies, to your average person: your data scientists, of course, but also your data analysts. Your data engineers can start using these things in the workspaces and the tooling that they're already familiar with and using. Thank you, Robert. I'm glad you found it so interesting, and thank you so much for that. Awesome, Thomas. Thank you. Thank you. Alright. Y'all are all very nice. I appreciate y'all. Yes. I agree, Dirk. Like everything else, there's always more as you get into it and understand it, but at its highest level, that's it. That's what we're talking about. 
Again, would I take that query with the two CTEs and put it in a custom-built UI and give it to every call agent out there? Maybe. I might, you know, because that's a very specific scenario, do something that probably costs me more money than that and gets perfect answers and has a lot more time and effort involved. But these general-purpose approaches are exactly what we're using behind the scenes in our AI agent anyway. We just interrogate the metadata about the structure, about the data product, about usage examples. We just have a lot of information that's working there for us. Alright. Alright, Quincy. I think, from what I'm seeing, no questions are rolling in, but we're getting some nice thank-yous. So, again, if you can, the thank-yous especially, wink wink, if you hang around when you walk away and answer those two questions, we'd sure appreciate it. Alright. Alright, Quincy. I think I'm just gonna pop out here and go off stage, and, hopefully, that'll shut it down here shortly. Alright. Thanks. She's shutting down my backstage access. Alright. Thanks, everybody. Have a great day, and enjoy the holiday.