Video: Starburst 101 Workshop - January 21 | Duration: 4820s | Summary: Starburst 101 Workshop - January 21 | Chapters: Introduction and Overview (10.8s), Trino's Evolution (288.38s), Open Hybrid Lakehouses (560.215s), Query Engine Architecture (825.395s), Trino's Data Connectivity (1456.595s), Federated Query Benefits (1613.215s), Data Lakehouse Explained (1922.235s), Starburst and Trino (2322.975s), Security and Access (2420.02s), Data Products Explained (2572.985s), Hands-on Query Demonstration (2647.93s), Data Layer Architecture (2754.295s), Schema Discovery Process (2875.51s), Querying Table Data (3027.985s), Building Silver Layer (3148.665s), Iceberg Table Features (3282.965s), Gold Dataset Creation (3466.79s), Governance and Lineage (3763.72s), Data Product Promotion (4202.4s), Conclusion and Thanks (4689.42s)
Transcript for "Starburst 101 Workshop - January 21":
Hey, everybody. It's Lester Martin. I'm here from Starburst. I'm a developer advocate, and thanks for showing up for our webinar today. It's gonna be just myself today. In the background, my colleague Quincy will be out there watching in case we have any audio or video troubles, or just some questions I'm not noticing, that kind of stuff. But it'll be primarily me today. Let me kick this thing up in a slideshow so it'll look a little easier to read. Again, my name is Lester Martin. If you're interested at all in finding out more about me, there are a handful of Lester Martins in this world, but I'm probably the biggest one in technology. There are some details about me here. If you wanna catch up with us, not just Lester but developer relations as a whole, or you have questions, that email address there at the bottom is super useful and relevant. I'll try to put it in the chat sometime today. devrel@starburst.io is a good way to find myself and a few other folks who also receive that email. It won't be just me. Let's see if we can get this going. We've got about ninety minutes allocated on the calendar. I don't know if that's what you signed up for, but that's what I got. I will probably use a good chunk of it, and I'm gonna break it into two pieces. The first is some slideware, some architecture, that kind of good stuff. You know, it's a 101. I've got a little bit of time, so I wanna make sure you understand, and it depends where you are already, what this framework, this query engine called Trino, is, and how we factor into that, how we sit on top of that, how we support and build that as well. "We" being my company, Starburst. And I'll just kinda give you a slew of what all we can offer, what Trino and Starburst ourselves can offer in this data analytics space, this federated query space, this data lake, lakehouse kind of space.
And then for part two, I'll get out of these slides and go to one of our platforms, Starburst Galaxy, and we will build, in a short period of time, a full medallion-architecture-based solution. We'll build the pipeline from ingestion all the way through your bronze, silver, gold kind of stuff. I might use different terms like land, structure, and consume, but it's the same general concepts. We'll even put some data products perspectives on the front of that and then leverage that end to end. So it'll be a half-and-half kind of thing. I'm gonna shoot for being done with my talk in forty minutes, maybe. Forty-five on the high end, but hopefully forty minutes or so from the start of our session. So I'm three to five minutes into that already. And, again, those are the chunks and sections that we're gonna be talking about. Over in the chat, I'm saying, hi, I'm Lester from Atlanta. And I'll even put USA, because I do have a lot of friends overseas, and when I say Georgia, sometimes they think of the nation of Georgia. So, hey, if you wanna shout out who you are and where you're from, that's awesome. If you don't, I understand. No worries. Glad you're here. And I also wanna make sure you know that chat is, at least the way we're set up today, about the only way to contact me: the chat, or even that Q&A tab, but primarily the chat one. I'm not gonna be staring at it, but if you notice me peek up like this, that means I'm looking at the chat window to see if there's anything in there. Very likely, if I see something, I will pause briefly, at least to say, hey, we'll cover that shortly, or we're about to do it, or maybe pause and address it right then and there if that seems most appropriate.
And we should have time at the very end if there are some questions in there that I was unable to capture or address. Maybe there are some questions that are beyond the things we're talking about today. Everything's okay. Any question you want. Ask me anything, preferably in this domain, but you can always ask me, you know, how long does it take light to travel from the sun to the Earth? I think it's about eight and a half minutes, they say. I thought it was seven, but I think it's about eight and a half. Alright. Let's jump on in and get rolling. What's on the agenda? Getting to know Trino. Getting to know Starburst. I guess if you're gonna talk about Trino and Starburst, you really kinda just have to take a moment for some history, and I'll be brief about it because a lot of folks know this. I've been doing this for thirty years. I've been doing all this big data work for the last ten, twelve, fifteen years, but it kinda happened in that Hadoop era. It kinda happened at the height of the dot-com era, when all of a sudden we realized we had a lot more data than maybe we were used to handling. Think about every search history, every collection, all the things on the Internet, and then trying to find relationships between all that. It got to the point where the data solutions that we had in place just weren't really scaling, or you couldn't afford to use them. Maybe they actually worked, but they were so expensive. So this system called Hadoop, this open source Apache project, came along, and it started solving problems. That's great. But to be honest, in the earliest days you had to do it in a Java API called Java MapReduce. I'm a Java programmer, so I think it's interesting, and that's fun, but it's not really business friendly by any stretch. So our good friends at what was then just Facebook, now I guess Meta, decided, hey. We got all these SQL folks that understand SQL.
So they built a veneer, a wrapper around that MapReduce engine, that let people start thinking of all these giant datasets out there in these giant data lakes, these repositories, as tables, and it worked pretty good. Bad news was it was a little slow. It was slow because it wasn't trying to be fast. It was trying to be successful. It was trying to be resilient. It was trying to make sure that, if you had a three-hour query or a three-day query and something went wrong, it continued on its own and resolved the situation and all that good stuff. So simultaneously, a lot of people at Facebook were saying, this is cool. Now it works. Go faster. So another team there, and you see some names in that first major bullet, created this framework then called Presto. So you may know Trino by another name, Presto. And in fact, there was a fork at some point in time, so Presto still exists, and Trino and Presto are kinda traversing their own little ways. Those folks who built this system, they stayed there, they got it working, and really its sweet spot was interactive queries. You know, people or processes are waiting for an answer because they're gonna ask another question, and another question. So it wasn't fire-and-forget and come back a day later. Now, the truth is that engine, even though it wasn't intended to be, initially at least, also could be used for longer-running queries, or what we might call transformation jobs, that kind of stuff. Things where nowadays people might suggest, oh, you can only do that with something like Spark. You can often do that with SQL, because you can do that with Spark SQL, which is either SQL or an API that looks a lot like SQL. But nonetheless, I just wanna make sure you know that Trino was a very fast query engine that runs SQL statements against data lakes and some other datasets we'll talk about shortly.
But even in its earliest days, folks would go, hey. This is so much faster than Hive. I will start to use that for my ETL processing or ELT processing. So good stuff there. Fast forward, or look backward in time: it's fair to say that that engine, that open source Trino query engine, this ludicrously fast, ultra-scalable environment, is used in a lot of places. It's used in a lot of companies that you already think about and know about. Those numbers you see on the screen, most of those are pretty dated, because most of these companies don't like to give you too many details. But when they write something down or share it in a blog post or maybe talk about it at a conference, you scratch those things down. So these numbers are all multiple years old, but these companies are still actively, heavily leveraging Trino. Now, these companies are your giant shops that have really impressive engineering teams, and they often are gonna run open source directly. Absolutely awesome, absolutely cool. This started back in the Hadoop space for sure, but almost all big data, data lake kinds of frameworks do take a little care and feeding, a little bit of love, a little bit of knowledge that you have to buy, spin up, or build up through scratching your knees, or whatever is appropriate. And maybe a bank, maybe an insurance company, maybe an automobile manufacturing company wants the same kind of engine, but doesn't wanna invest in that big of a software team or that big of a platform team to master all those intricacies. So they might be looking for some help. They might be looking for some help with those extensibility things that you see down there. That's where my company, Starburst, kinda steps in. So, my slides. Wow. Yeah.
Didn't really fit my story, but I think those trends on the left probably do make sense for a lot of folks, maybe even that one on the left, AI/ML data demands, something that here in early 2026 we hear about constantly, all the time. But many of those other things, you know, being able to run in a hybrid mode, a little bit of a combination of on-prem and public cloud, or maybe only one of those things, or multiple public clouds: all that flexibility, all that optionality is super important to us, and super important to a lot of folks out there. And, you know, you gotta be careful. Obviously, I just said we're a software company. We're a vendor. We sell something. But our roots are built around that open source engine called Trino, and that itself gives you lots and lots of flexibility and freedom to either move to another Trino-based vendor, and there are others, or even self-host if you want. We don't want you to do that. We'd love you to stay with us, but we gotta earn your business. We gotta show you that we have a lot of value to bring to bear. So we're gonna definitely tackle all of these bubbles that you see over here. Some of these are, like, the top things you're just wanting. Maybe you do want that open source engine for the reasons I just said. Maybe you do want an open source engine because it couples nicely with a, hopefully, very economical way to store lots and lots and lots of data: object storage, or HDFS on your premises, or MinIO or something. But we still want the performance of those prior EDWs and the current day's cloud data warehouses and that kind of good stuff. And absolutely, as I mentioned a minute ago, we're gonna run anywhere you wanna be. We're gonna talk some about Iceberg today. We're gonna use Iceberg. It isn't a dedicated Iceberg conversation today, but I would not say we were a latecomer to this party.
I joined Starburst a little over four years ago, and that was my introduction to Apache Iceberg, and Starburst had already made a big, strong statement that, yeah, Iceberg is that future direction. But it doesn't have to be something used at all. It doesn't have to be the only thing used. We're gonna make sure you have, as it says there, full distributed data access. So other table formats and even other data environments. So, yeah, an open hybrid lakehouse is really what we're kinda pushing. I guess not kinda pushing. It's what we're pushing, what we're advocating. Now, the good news is that open lakehouse, as I fill in all this stuff, doesn't necessarily mean anything you're doing today is wrong, or that anything you wanna do tomorrow is wrong either. I'm gonna use this word optionality a few times today to kinda drive the point home. This slide is just driving home again the point that you already saw. There's quite a bit of usage of this framework, which has existed for a solid decade now. It runs at ludicrous speed and scale. If you're new to Trino, or even new to Presto, welcome to the party. And if you've been around for some time and you know about these frameworks, you already know what I'm saying. This is the place to be. So if we think about what Starburst is, absolutely, as I said a minute ago, we are rallied around, built around, Trino. In fact, we are the people who built this query engine at Facebook and ultimately brought it into a different company here. So we got the core query engine. Now, no one's asked it yet, but I'll say it: you might ask, what is a query engine? Think of a database, but without a formal format the data has to be stored in. So if Oracle's a database, it's a query engine with all those things like a cost-based optimizer and all that, but it also has its own specific way it persists data.
A query engine says, hey. I'm gonna decouple myself a little bit from that. And as long as I have a good way to abstract away from the storage, yet still physically get to things, I can act like a database for a lot of different datasets. Now, it's gonna act its very best on the data lakes themselves, these big repositories such as Amazon's S3 object store, MinIO's on-prem kind of thing, HDFS, something like that, or Dell's S3-compatible storage, or somebody else's object store. So we're gonna bring that query engine. We continue to support it. All of our engineers commit back to that core project as well as working on things that we do ourselves. So this is the classic play that most folks have done. How do you bring an open source engine to an enterprise? How do you make a lot of those cross-cutting concerns, activities, and integrations a lot easier? So we've done that. That's that enterprise adaptability and stuff. The bottom piece, actually, half of this is open source and half of this is us. Absolutely, the query engine I show here is built around this notion: if it's a query engine and it doesn't have a back end, how does it talk to a back end? It is a connector architecture. So there are tons of connectors. Tons meaning 50-ish, not thousands, but 50-ish connectors to very popular, very well understood, known data systems like data lakes and databases. I'll cover all that in a few minutes. I'll slow down here. And then we build some additional ones that we don't necessarily put back in open source, for a variety of reasons. Sometimes it's just that there are significant licensing issues with that other vendor, and so we try to create something that makes sense for them and for us, so we can support it and those kinds of things. And we actually do our own ingestion sets of tools as well, which are not part of open source Trino. And then we do everything we can to make that not just work, but look easy to consume.
We'll talk about data products toward the end of this talk. We'll see them toward the end of the demo as well. And then anything and everything that can make your world go faster. So platform plays, caching plays, engine plays, all that good stuff. We're really about making sure an already super fast engine goes that much faster. That's how we're gonna differentiate ourselves. Yet we still let folks have the flexibility to say, cool, it's still running on Trino. We may have done something to make it a little faster, but it doesn't break someone else using this. Maybe another Trino engine or, more appropriately, maybe even another completely different technology. You know, using these open table formats like Apache Iceberg lets us have many different query engines or processing engines against that same dataset, and that's super important, super critical. Alright. This big slide just says we're in all kinds of vertical spaces out there. We're absolutely in the big ones you see here, the telcos, the banking and finance world. I would say we're in everything else too and beyond, but it's fair to say many of our customers have been in that space. And we did that purposely, because those are long hauls. Those are long commitment cycles. It takes a long time getting folks excited and on board and that kind of stuff. In fact, folks like Citi turned around and invested back in our own company, which is awesome to have as an investor. And then you probably see some other banks. There's a big, big, big one that we have a gigantic deal with. I don't see it listed there, but I think that's because it's an older slide, so I won't say the name. But we're in all the big banks. Absolutely. Alright. So, the query engine itself. No questions in chat. That's awesome. Feel free, like I said, just to shout something out. This engine itself.
Now I guess I'll take a moment and look at something while we're here. Uh-huh. That might be a fair thing to do. I'm gonna talk about what Trino is, what Starburst is, and you might think it's this. You might think it's just the GUI here. I can come in and see all kinds of cool, or maybe unexciting-looking, stuff. Hey, this looks like my classic query editor tool and that kind of stuff. But what's really going on here is that this is a visual element of our control plane, and it's letting me set up all kinds of catalogs, all kinds of clusters, things like ingestion jobs, data quality jobs, data maintenance jobs. This is just the front end to what really is out there. And what really is out there, and this is not unique to Trino, it's pretty normal in these open source big data or data lake or data lakehouse frameworks, is that they run on a bunch of machines one way or the other, a bunch of nodes or servers or whatever you want to call them, and they often follow the pattern that we use, which is a centralized brain and a very scalable set of workers that can crunch and do all the heavy lifting, and they can scale up and scale down as appropriate. And so we take that approach. This may look like a lot at first glance. If you slow down and look at it, you say, okay, it's not that crazy. In fact, I'll just take about a minute or two of your time and click through it to say that, hey, what's really happening is this. There's that cluster we talked about in the middle, the coordinator and its workers. Those are the names of our server types. And people or processes, applications, AI agents, humans, whatever, need to ask questions. So they're gonna send them to us, and it's primarily SQL. We do have some Python APIs. They are in open source.
There are a whole bunch of others that we don't necessarily support at Starburst, though if a customer asked us to, we likely would. There are a bunch of different clients, Go clients and .NET clients and that kind of stuff, that layer on top of these tools just to make it quicker and easier. But net-net, at the end of the day, we're basically issuing some kind of structured query language. And, of course, you see authentication and all that. We're gonna do all the checks and balances. We're gonna do that with pluggable solutions. If you've got authentication systems already, awesome. We hope you do. If you have authorization systems, cool. We can plug into a variety of those. We've got our own authorization system. We'd love you to use ours, but we'll do what's appropriate. So when a question comes in, we make sure it's all valid, and then we check with those systems, those sources on the left over here. And what are we really trying to do? We're trying to get metadata. We're trying to get structural metadata, of course: what's the table, what are the columns and the data types, and so on. But we also want size and scale. We want the footprint metadata: how many records, how many unique values, what kind of ranges. We're really just trying to get the size and scope and an understanding of how that data looks, so that we can build a plan. We can build a query plan. Those who have been around databases have probably heard of a query plan or something like that. Absolutely. These query engines all create query plans. They create execution plans. How best to solve this? And then, like I said, we don't wanna bottleneck, or if I didn't say it, I'm saying it now. One of the benefits of having a brain and then workers is that, usually, we can figure out the work pretty quick, sometimes lightning fast. And then we can say, okay, good. There's a lot of work to do. Go do it. And how do we go faster?
Usually, we bring more machines to bear, so we issue and assign work to those nodes. And those are the ones that are reading and writing from the various data sources you see on the left. Ultimately, they're gonna figure an answer out, and they're gonna give that answer back. And we send it back through the coordinator by default, because that's where clients are connecting by default, but we do have multiple generations of how that client could, behind the scenes, take advantage of the fact that there's a bunch of parallelization in there. So, yes, when you have gigantic result sets, there are concepts that let us not bottleneck that result. But, you know, if there are 10,000 answers, say you're doing an aggregation on your sales department codes or something, and there are 10,000 of those, great, or 10,000 combinations of those times a month or whatever, or millions, we can still very quickly go back through that coordinator. But we have a solution if you need something that has awareness of that cluster-enabled set of workers and can use that parallelization more directly. And when I say all that, I ramble. I went off the script and told you about stuff you probably don't care about. I just wanna make the point that, when you look at these diagrams, people always have a question. What about a single point of failure? What about scale? What about, what about? And the good news is, when you have a ten-plus-year-old framework, you've had time not just to make it work, make it work fast, and make it work all the time, but to go beyond that. What about this? What about that? You can start addressing the big issues first and then get down to the corner cases, so on and so forth. I have been in this Hadoop, Hive, Spark, Flink, Storm, NiFi, Trino space for about fifteen years of my life.
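To make the coordinator-and-workers pattern just described a bit more concrete, here is a minimal Python sketch. It is a toy illustration of the idea (split the work, let workers compute partial answers in parallel, merge the partials at the coordinator), not how Trino is actually implemented. The function names, the thread pool standing in for worker nodes, and the sample sales rows are all assumptions made up for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def make_splits(rows, n_workers):
    """Coordinator step: carve the input into one chunk per worker."""
    return [rows[i::n_workers] for i in range(n_workers)]


def worker_partial_sum(split):
    """Worker step: aggregate only the rows in this worker's split."""
    partial = Counter()
    for dept, amount in split:
        partial[dept] += amount
    return partial


def coordinator_run(rows, n_workers=3):
    """Assign splits to workers, then merge their partial aggregates."""
    splits = make_splits(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_partial_sum, splits))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return dict(total)


# Hypothetical (department code, sale amount) rows.
sales = [("toys", 10), ("books", 5), ("toys", 7), ("garden", 3), ("books", 1)]
result = coordinator_run(sales)
print(result)
```

The point of the sketch is the shape of the work: the aggregation result is small even when the input is large, which is why sending the final answer back through a single coordinator is usually not the bottleneck.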
Lester, personally, and you don't know me, but I'm telling you: I love this framework. I love what it does. I love how it works, and I know how the others work too. And I'm not suggesting any of those are wrong. They're not. They're all really good technology, but there are places where some things shine more than others. And 100%, Trino shines at doing ludicrously fast, super highly scalable querying. Does it work? Yeah. That's the sweetest spot in the world. What's the secondary sweetest spot to me? The fact, and there's a whole bunch of words over here on the right, that we can connect to a bunch of data sources. So I've been suggesting the data lake. Again, what's a data lake? A big repository, a big file system somewhere, arguably. Or maybe it's not really a file system, like object storage, but we can visualize it as a folder, subfolder solution, that kind of stuff. But we don't only work with the data lakes. We don't only work with ADLS and Google's object store and MinIO and others. We can talk to real data systems. When I say real data systems, I'm really talking about databases, be they OLAP- or OLTP-oriented databases. I'm thinking of the NoSQL databases. I'm even thinking of things that don't look like databases. Well, a search engine actually kinda looks like a database. Okay. Could we talk to search engines? We sure can. Can we talk to messaging platforms? We sure can. And we talk to applications. So this is just trying to make the point about that connector architecture: as long as someone's willing, whether it's open source, us producing one for you specifically, or you yourself building your own specialized data source connector, as long as you can follow those rules, what it'll do is offer to the core engine a SQL interface, a table-based interface to those datasets.
So now someone could write ANSI SQL against just about anything that's out there, and absolutely against anything we already have a connector for. So, again, about 50 of those in practice today. A quick consequence or benefit or interesting thing to point out comes from doing this, and I think sometimes it makes more sense when you look at it here, so let me pop out of this mode. Looking at this, my little free server shuts down every couple minutes because I'm not using it. But if you look here, I've got a server, or a cluster. Bump it up a little bit. One, two. I have a little cluster I call the free cluster. That's what I'm using. And then I have a list of things. These are those catalogs. These are instances of connectors. And what we've done in the last year or three is add all kinds of fancy icons that tell you right away what that is. Oh, that's a Snowflake. That's a Postgres, and so on and so forth. And that's an Amazon S3 bucket based framework. What I'm really trying to say is this. When you look into something like this, my cloud, there's gonna be my cloud dot, I think I got one called robotech. Robotech.mecca. What I'm really trying to say is what you kinda see here on lines two and three and six and seven. If you want to fully qualify the table in a Trino world and a Starburst world, you have to reference all three of these things, catalog, schema, and table, not just schema and table like you would if you connected straight to SQL Server or Oracle. Can you parameterize all that stuff? Sure. Can you use a USE command or toggle a user interface to say, hey, I just care about, hopefully that came up on the right page, I just care about the students and instructor schema or whatever? Yes. You could do all that and then just start writing, you know, select from instructors, select from a table name directly, without mentioning all that stuff.
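As a rough illustration of the three-part naming described above, the following Python sketch resolves a table reference to a full (catalog, schema, table) triple, falling back to session defaults the way a USE command would set them. This is only a toy resolver written for this transcript, not Trino's parser; the function name `qualify` and the catalog and schema names in the usage lines are hypothetical.

```python
def qualify(name, default_catalog=None, default_schema=None):
    """Resolve 'table', 'schema.table', or 'catalog.schema.table'
    into a (catalog, schema, table) triple using session defaults."""
    parts = name.split(".")
    if len(parts) == 3:
        catalog, schema, table = parts
    elif len(parts) == 2:
        catalog = default_catalog
        schema, table = parts
    elif len(parts) == 1:
        catalog, schema = default_catalog, default_schema
        table = parts[0]
    else:
        raise ValueError(f"too many name parts: {name!r}")
    if catalog is None or schema is None:
        raise ValueError(f"cannot fully qualify {name!r} without defaults")
    return catalog, schema, table


# Hypothetical names: a catalog "mycloud" and a schema "students_and_instructors".
full = qualify("mycloud.students_and_instructors.instructors")
short = qualify(
    "instructors",
    default_catalog="mycloud",
    default_schema="students_and_instructors",
)
print(full == short)
```

In a parameterized pipeline, passing the fully qualified name around avoids depending on whatever session defaults happen to be set.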
But, likely, when you parameterize all this in your data pipelines and stuff, you're probably gonna parameterize that fully qualified name. So I will probably just lock it down to one or another. So what's the benefit of having all these catalogs? Well, it's fair to say that probably the first benefit is this: I can do something in one connection. Again, you can use our GUI tool like I'm doing, or you can use your BI tool. You can use your homegrown application. You could use DbVisualizer, whatever tool you like to run. We do more than just a query editor, but I'm gonna keep all my stuff in there. From that one connection, you actually have access to lots and lots of different datasets. Of course, access controls are brought to bear. That cluster might have even more catalogs, but maybe I can't see them. I'm currently in an account admin role, so I can see everything the way it's set up, but it's a single point of access to a lot of different datasets. Now you might argue, can't I just make a separate connection to all these things? I could, and that's great when it's just Lester and there are just 20 sources. But what if there are 200 or 2,000 people that need access to those 20 sources? You'd have to do some serious automation to make sure everybody and their brother has all the credentials and all that good stuff. So we're saying a single point of access also gives us a single point to control governance, to control security. Remember I said earlier, we can connect to a variety of different authorization systems or use our own. Looking at the time there, 12:30. Good. Let me hurry myself up here. Super valuable. And even more cool than that, and I'll just wait to show you in the demo instead of jumping back, is that next thing. I can not only individually run a query on Elasticsearch and then one on Kafka and then one on MongoDB. I could write a fourth query that is a join across those three different data sources.
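The idea of one connection reaching several sources and joining across them can be sketched with Python's built-in sqlite3 module. This is an analogy only, not Trino or Starburst: two separate in-memory SQLite databases are attached to a single connection under the made-up names `crm` and `sales` (standing in for catalogs), and one SQL statement joins across them. The table names and rows are invented for the example.

```python
import sqlite3

# One connection, analogous to the single point of access.
conn = sqlite3.connect(":memory:")

# Attach two independent databases; the names "crm" and "sales" are
# hypothetical stand-ins for catalogs backed by different systems.
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS sales")

conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE sales.orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO crm.customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")]
)
conn.executemany(
    "INSERT INTO sales.orders VALUES (?, ?)", [(1, 10.0), (1, 5.0), (2, 7.5)]
)

# A single query that joins tables living in two different databases.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount) AS total
    FROM crm.customers AS c
    JOIN sales.orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
print(rows)
```

The difference in the real engine is scale and heterogeneity: the sources are different systems entirely, and the engine uses their metadata to plan the join rather than pulling everything into one place first.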
So, a federated query. And in fact, it's not something that's like a bolt-on. Thankfully, after companies like ours, and there are a few others, have been advocating for this for some time, pretty much everyone's getting on board. But for us it's not a bolt-on. It's not like some weird single-source connector thing and that kind of stuff. Our system was built around being able to query a bunch of different systems at once. And the benefit, of course, is you don't have to go to all these different data sources and bring data local, either every couple hours for everybody, or myself bringing it onto my local desktop or my local server or something. I write a SQL statement, and I let that Starburst/Trino engine use all that metadata we talked about earlier to figure out a very efficient, scalable, performance-oriented way to resolve that query. That's a lot different than go get everything from table one in system one, table two in system two, and table three. It's really gonna try to do that in an optimized way. Alright. I wanna get to the demo, because I've got thirty-two after on the clock. So let me see what I got left in here. I'll breeze through this real quick here, the data terminology that we use, because these are terms in industry that probably don't have a definitive meaning everyone agrees on. I'm just gonna keep it kinda simple. The data lake, generally speaking, meant something like Hadoop. It meant, yes, absolutely, we can size and scale data. We can throw a bunch of programming languages at it, so on and so forth. But for the most part, it was around time series, immutable data. Data that just keeps growing and never changes. In fact, that's still a significant amount. A high percentage of the data is transactions, ETL things, events that happen in your car. You went fast, you sped, you turned, all those things that just keep happening.
They have a timestamp. They occur. We wanna record them. Why? So we can do analytics on those. But if we use the word data lakehouse, think of, okay, we still want all of that. Part of what Hadoop did, which I didn't get into, was bringing together storage and compute. There was a really good reason to do that. But the reason to turn around and go back to what came before, which was separate storage and compute, is really about economies of scale, and about constraints around networks and these things that are maybe not as tough as they were in the older days. So we can scale up storage and compute independently, which means we can price it and pay for it much more independently. So we want all that. We want the performance, the scalability, the speed, all those things that you think about when you go to a data warehouse kinda tool. And, absolutely, we wanna let you be able to start changing that data, have some mutations, update statements, merge statements, that kind of stuff, because not all datasets are time-series immutable. There's plenty of things out there, and they've always been out there. But there have been hacks and things in the past for how you solve that. Nowadays, you just treat it like it's, I don't know, a database, and do those things you always wanted to do. And we really, really, really wanna be able to get more. We love the good two-dimensional tabular data, absolutely, but the world isn't only that. We wanna make sure that the lakehouse solves both those problems really, really well. Alright. So, one more thing I'll throw at you is Apache Iceberg, something I mentioned earlier. Probably one-ish, two-ish slides on Iceberg. I will say this. That original SQL on these data lakes was called Apache Hive. Very prominent, still around. In fact, the tenets, the architectural tenets that it introduced, are still at play today.
In Iceberg, in Delta Lake, in Hudi, those are the primary kinda next generation. We refer to them as table formats. We have plenty of other webinars and documentation and tutorials on why Iceberg is, in our view, the best one, or the best one in most cases. We're definitely gonna, again, support all those other table formats as well. So if you've heard of Hive, think of Iceberg as the next generation of that. Think of it as having the ability to change datasets in a very consistent way, with versions and snapshots and rollbacks and time travel, all kinds of very exciting features. Again, if you need some more information on that, ping us in the chat so we can make sure we can get it to you later. And we're always doing a new webinar. I think we have one next week. Quincy's there. I think it's next week, on performance optimization for those already on the Apache Iceberg path. Alright. So that open data lakehouse, so I can get to the demo here, says store the data somewhere. Those are just some popular public cloud object store systems. Use file formats. We didn't talk about file formats. There are CSVs and JSONs, but there are more analytics-oriented file formats. The most popular one of them, the one I'm just gonna mention, is Parquet. Parquet works really well with Delta Lake, works really well with Iceberg. The table format is really about the metadata that says, hey, I'm a table. And it also has to know about those files, has to know about how many files, how many partitions, all kinds of exciting stuff. Then you wanna couple an engine to that. Again, if you do all those things below that Trino line, the good news is Trino could be one of many engines that can access and utilize that same stack. So you should always be thinking about what's next. How am I gonna get from here to there?
And I say, good news is you can do it all with Trino. But if you need other frameworks, or you have them and you're using them, we absolutely wanna work alongside and with them. And then we wiggle the little line that suggests water, and we think of that as that lake or lakehouse on the bottom. And then above that, we said you've got a lot of other data still out there. And those data systems that we mentioned earlier in those connectors, databases, data warehouses, are definitely places where we have lots of different data. We wanna be able to integrate all that into one universe, into one galaxy, I guess you can say. And then, of course, put all those cross-cutting enterprise concerns in place. Fernando asked the question: is the Delta Lake format available? It is. It absolutely is. So, just to show you how people find these things, you search for something like, I'm gonna say, Starburst connectors. Maybe I can get away with that. And often, there are gonna be, like, two results. One's the marketing page. It's gonna tell you all kinds of stuff, big, pretty pictures. There you go. Delta Lake right there. Tells you all about it. You know, wanna learn more, go click here. It's gonna take you, oh, take you to a really nice PDF. The other alternative, of course, is just the raw doc pages. Yeah. Tell me very specifically about, yeah, I'll just type in Delta Lake. Boom. Boom. Boom. Delta Lake connector. Absolutely. We support Delta Lake. Yeah. Awesome, Fernando. We also support Hudi, but I'll be fair. We only support Hudi today in a read-only manner. And that kinda makes sense. For those that know a little bit about Apache Hudi, it's pretty heavily coupled with Flink, and a little with Spark Streaming. But if you have Hudi tables, you can use them in an analytical way. We can't update them or add to them via Trino or Starburst today because no one has really released that code, not because it can't be done, but that's why.
It's because that's why. That's what the one guy says. Yeah. Thanks. Alright. I think we're really close. Boom. Don't look at this, but look at this. On the far left, remember I showed you those little steps? This is just saying that every one of these layers is itself very, you know, when I say very complicated, do you need to know all this? Probably not. It depends on what you're doing. Platform architects might need to know some of this. Your data engineers who are looking for super tweaks in performance might wanna start learning some of these things. But with every one of these, I'm just trying to show you maturity, complexity, richness, not, oh my gosh, I have to know how to do all that stuff. For the most part, most of us know how to create a table, select from a table, insert into a table, do a merge on a table, join on a table. If you write SQL, then Starburst and Trino are your friends, and you can plug into, like we said, lots and lots of data sources, and you can use a variety of interface points coming in. This picture kind of couples interfaces with those cross-cutting concerns. They're arguably both plug-ins, but I would say they're kinda two different things. Alright. Now, the security question. Villa asked, where are the security masks and policies, all that, defined? I'll just show you. They're gonna be defined depending on what authorization engine you use. If you use something like Ranger, Immuta, or Privacera, you can do it in that system because we're gonna integrate with it. If you use our own built-in access controls, you're gonna come in here and, what are we talking about? Column masks? Yeah. Column masks. You're gonna come over here and, you know, define a column mask, mask the string, show the first four. You set up whatever your definition is, and then you're gonna go create a security policy that uses said mask in some way. And we usually do that for things like masks, filters, that kind of stuff.
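To make the "show the first four" column mask concrete, here is a small Python sketch of what such a mask does to a value. This is only an illustration of the idea; it is not how Starburst or any authorization system defines or applies masks, and the function names and sample row are made up.

```python
from typing import Callable, Dict, Optional


def show_first_four(value: Optional[str], mask_char: str = "*") -> Optional[str]:
    """Keep the first four characters and replace the rest with mask_char."""
    if value is None:
        return None
    return value[:4] + mask_char * max(len(value) - 4, 0)


def apply_column_masks(row: Dict[str, Optional[str]],
                       masks: Dict[str, Callable]) -> Dict[str, Optional[str]]:
    """Return a copy of row with each masked column passed through its mask."""
    return {col: masks[col](val) if col in masks else val
            for col, val in row.items()}


if __name__ == "__main__":
    # Hypothetical row; only the "card" column has a mask attached.
    row = {"name": "Ann", "card": "4111111111111111"}
    print(apply_column_masks(row, {"card": show_first_four}))
```

In a real deployment the mask expression lives in the policy engine and is applied at query time, so the caller never sees the unmasked value.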
And for general queries, we use not just classic role-based access controls. Right? Yeah. I've got a bunch of roles out there, and people are working on roles and privileges. But we also do a first pass at attribute-based access controls today. We do a subset of that called tag-based or label-based access control. So if you apply PII tags or top-secret tags or Lester tags or whatever, just like I said, you can associate that in a policy. You absolutely can say, hey, only the HR team can see the data that has been tagged as, you know, employee confidential or something like that. So, yep, it's all in our tool here, or it depends where your authorization system lives. If it's built in, you're gonna do it in our tooling. If you're integrating with a third-party system, you can do it in the tool you already know and are invested in. And so if you've got, again, Privacera and Immuta are two good examples, or even Apache Ranger, and you've spent a lot of time and use that for a lot of different things already, then, yeah, likely the answer is let's integrate with that. But if this is your first time bringing all these things into one, I highly suggest you use our built-in access controls. Let those connectors use a superuser kind of model and then use the roles here. Again, the groups and roles could still be inherited from maybe an authentication system as well, or you could build your own custom roles, so on and so forth. It was an easy question. I rambled on about it here in the tool. And everything has a REST API as well, just like all those third parties. Good. Good questions. Alright. 43. I need to get to the demo, so let me just do this and then we'll jump into the demo. Data products. I'm gonna show you data products, so it'll be just as useful when we get there. But data products are really about curating. And I mean our data products, not the phrase data products.
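The tag-based rule described above ("only the HR team can see data tagged employee confidential") can be modeled in a few lines of Python. This is a toy model written for illustration only; the `Policy` class, the `can_read` function, and the tag and role names are assumptions, not the Starburst policy engine or its API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class Policy:
    tag: str                       # tag on the data, e.g. "employee_confidential"
    allowed_roles: FrozenSet[str]  # roles permitted to read data carrying that tag


def can_read(user_roles: Set[str], table_tags: Set[str],
             policies: Iterable[Policy]) -> bool:
    """Deny if any policy's tag is on the table and the user holds none of its roles."""
    for policy in policies:
        if policy.tag in table_tags and not (user_roles & policy.allowed_roles):
            return False
    return True


if __name__ == "__main__":
    policies = [Policy("employee_confidential", frozenset({"hr"}))]
    print(can_read({"hr"}, {"employee_confidential"}, policies))         # HR may read
    print(can_read({"marketing"}, {"employee_confidential"}, policies))  # marketing may not
```

The point of the tag indirection is that the policy is written once against the tag, and any table or column that later receives the tag is covered without editing the policy.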
Our tool, our offering called the data product, is a way to bring curated datasets into a centralized place with a nice, pretty UI and a nice, pretty REST API so that people or processes can say, hey, tell me about the real important things. Don't show me the fact that I've got 600 connections. Show me business-oriented datasets. Show me use cases of those datasets. How do I use them? Give me some example queries, so on and so forth. And create a marketplace, basically, where even if I don't have access to them yet, maybe I learn a little bit about them, and then I learn, you know, maybe I really want that. So I'll go ask someone for access. I think the best way to show that is to go show it in the GUI in a minute. So I will do that at the tail end of my, yeah. Yeah. Killing my demo. Alright. So I wanna do a hands-on. What do we wanna do in this hands-on? Well, the first thing we wanna do while I'm talking is pop out of this mode and go, there's my cursor. I'm gonna go back to my query editor. You see I'm using a cluster called free. I'm gonna run a query on free because I wanted it to wake back up again. The beauty of this tool, it's called Starburst Galaxy. There should be a link in the docs section to go there. We often call it a free trial, but I wouldn't call it a free trial. I would say it's a free service, a free cloud-based offering of Trino and the additional tools that we bring to bear. And why do I say it's free? Because you can use compute resources that are free. They won't cost you anything. The catch is the clusters are not very big, usually one node. And historically, we let it stay alive for a while. Nowadays, I think the max you can keep it alive is five minutes. So it does quiesce pretty quickly, and you saw there it just takes a short bit to crank back up again.
But it's not the classic serverless like you might think where you push a button no matter what, it will always turn back in instantaneous. We need a cluster to run, so we start that cluster back up. Okay. Why did I tell you all that? I said all that. I did all that because I wanted to run. Alright. What am I gonna do? We're gonna build a solution. We're gonna say, hey. Someone has already loaded some New York City. You've heard this before. The old New York City taxicab data. Right? Uber data, taxicab data. Someone has written a system or ingestion process where that stuff is landing on s three already. So we're gonna step away from that. The picture here suggests it was done with done with Flink. Okay. Maybe. Not sure. And then what I'm gonna do is say, hey. Can we take that data and can we reference it in a raw layer or a bronze layer? Can we just simply say, hey. I have a table that's pointing to that ingested data. Because those that know a little bit about this, we don't wanna change the data. We want it as received. Then we're gonna build a little bit of a transformation jobs. It's gonna be very simple because I gotta do all this in a short bit of time that we're gonna build. We're gonna turn that raw table, that that bronze table into something the silver table, Something we can say, hey. This is the core data that we want to report against. It's our structured, well defined, well oiled data. And probably along that way journey, we might realize, hey. There are some other datasets that live in other places. We're gonna find out there's a dataset we like in Snowflake. It's highly related to the New York City area, and we might be able to show you that query federation joining those datasets. And then that will then not lastly, but almost lastly, we'll go, hey. But people do have specific questions, and maybe we need to do a little bit more work to make those dashboards, those human questions go faster. We might do some materialized views. 
We might if they're not a performance problem already, we might just create a view on top of things just so it looks and smells and tastes easier. And then those things will take those and say, hey. These are my curated datasets. We'll surface them up as a data product. Alright. So let's get on into it. Make sure my guy's still running, Spawning back up. Alright. So what am I gonna do? I went ahead and I created this cluster called free cluster. I created a connector here, catalog as we call it, to a to a to a data to s three location. There's nothing in there right now. These system and info scheme are just boring boring schemas. They have nothing there. Well, they have no business data. They just have kinda awareness data. And I also created another one down here, connection to start to Snowflake. I'm calling my taxi zone lookup. So we'll see this here in just a minute. But first thing, if you remember from my picture, I said someone loaded up some data over in s three. So I went to my catalogs view. I found my New York City Uber rides catalog, uses Amazon s three, and I'm gonna use a tool we call schema discovery. I'm gonna say, hey. I need to run a disc I want someone's told me there we go. Someone told me in this big bucket, in this folder, there's a bunch of data that's showing up. Cool. Now didn't ask them anything else. I didn't say what's there, what's the format, or anything. Said, let me just go use this tool and see if it'll help me out. So I'm gonna run a schema discovery on that location, and it came back pretty quick and said, hey. Look. I think what you see in there what I saw in there is just probably a table. He's gonna call it year month. Why? Because there's a folder called year month. We can we can fix that in our silver tier. But he did notice there's subfolders and things like that. So for those that know about partitions, it actually says, oh, there's probably some partitions in here. So let's say, yeah, let's do that. 
That's that sounds good. So we didn't really see it yet, but we're gonna see it, right about here. I was gonna say, usually, that should be done already. So one, two. There we go. And, ultimately, it said, cool. You want to put it into a schema that doesn't exist? I'll create your schema. Well, more appropriate is this. It said, oh, okay. You know what? What I really can do for you is I can build you that table, your month. And here's the datasets. Here's the columns and types I'm gonna see. And you might notice a lot of VARCHARs. This is kinda normal pretty often. Doesn't mean this tool can't go further, but often when it seems like CSV, it kinda shies up and runs to VARCHAR. Sometimes when it's something like a well, I'll leave it at that. Sometimes it gets more detailed. There's different arguments of thought about there. But long story short, it's built what we would call, for those that know a little bit about all this, this is a Hive external table pointing to that location, and it went ahead and create partitioning and stuff. So I said, yeah. That's exactly what I want. I believe it's already run it. And in fact, I'm gonna go look at it. I mean, I can see it from here here, but I wanna go back to my query editor and kinda go see it in more details. And my hope is when I toggle this, there it is. There's a new schema. I use the default discover schema. You probably have would want a better name. And there is that table year month as you see listed on the left here. And in fact, why don't we just go run a query on that table and make sure it is really there. Select all from my year month table. And there I am with all my three part naming conventions, covering it with double quotes, let the system generate that. I would not normally put those double quotes. That's if you did something silly like put spaces and stuff and things. But nonetheless, there's some of the records. What is it? This taxi got dispatched from here, still with that. 
Was it this location this year? Know, this is cleaned up data. Doesn't have all the gory details. It really pretty much has a word I picked this person up and when did I pick them up. I thought it had duration there, but I guess not. Thought we got kept that. And then lastly, just kind of when it was because all the characteristics really are important against this. Okay. I can also just for giggle say because sometimes someone might have done that for you. You might show up and go, where did that table come from? Where's the DDL, the data definition language? You know, always you can wanna show create table and get that information. That exists for anything in the data lake world, not just Trino, not just this discovery tool. I just wanna make the point that it's just simple table. Now what I might argue as I run another query is that that was just one table. Couldn't I just done that by hand? Sure. It was five or six four, five, six columns, simple table. What if you were walking into 10 tables or a 100 tables or, you know, 20 buckets with unknown tables? This is a pretty good tool to at least go see what's out there. Make a stab at this. Again, this is something that back four, five years ago, people wouldn't have done they would have that a little differently. And here, you you still can do it any way you want. This is just offering some tools to kinda get you there. Alright. What did I do? I looked at that. I just said, hey. Tell me. I think I did a I did a distinct on your month. There was that column, and there's my six column. So those these are my partitions if anyone remembers those. Alright. So what did I do? I built a very simple raw or bronze layer. Just pointing to something. Let's build this structure out. Now the reality, what's gonna normally happen let me do a few things. So I'm gonna build another I'm gonna build another scheme, and I'm gonna call it demo, I guess. New York City Royce demo. Yep. Sure. Why not? Might even show up here. 
It might take a second. There it is. Demo. There's nothing in there. This is where I'm gonna do my, silver layer. I'm gonna build that. So let me run a few things, and we'll talk about it. Create this table, insert that, and run a query on that and describe it. Why not? So what do I have here? I said, look. Normally, what happens is you go investigate that table. You learn about it. You do a lot of quality checks. You you you put your data engineer hat on. You figure lots of things out. Ultimately, I did that, and I found out, guess what? Data's in pretty good shape. The date and column names are pretty good. Maybe a few things like location ID used to be in a VARCHAR. Maybe I wanna call an integer. Maybe the time stamp was really a time stamp. And maybe some other things like the way Hive does partitioning isn't as slick as the Iceberg does partition. So what I ultimately did at the outcome of all that work is I said, here's the table. This is the name of my silver table, my well structured data. And if nothing else, even if it was in perfect shape, we're likely gonna do a technical transformation. We're likely gonna get it into a file format such as ORC or Parquet, a a table format such as Delta Lake or or Iceberg. But I did all that. And then if you look here, I'm just doing an insert. This might be something you might do ultimately when you're all finished with DBT or some other SQL mesh or something like that Were you saying lift that stuff, throw it in there? As you see, my names are all good. I just did some light casting. Everything was in pretty good shape. And then I said from all that, show me. It'll run a little bit. Let's see how many records. It was about 12 mil records. It wasn't a whole lot there. Did a select. There's the data. And I just did a describe just so you can see those datasets in there. Alright. So that was super fast, but arguably, it got us pretty far in that silver tail layer. 
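The bronze-to-silver step described above (same rows, light casting from text to proper types) can be sketched with Python's built-in `sqlite3` module. SQLite stands in for the engine here only so the example runs anywhere; the table and column names are hypothetical, and in the workshop the target is an Iceberg table created through Trino, whose DDL and types differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Bronze: everything landed as text, as schema discovery often infers for CSV.
conn.execute(
    "CREATE TABLE bronze_year_month (location_id TEXT, pickup_ts TEXT, year_month TEXT)"
)
conn.executemany(
    "INSERT INTO bronze_year_month VALUES (?, ?, ?)",
    [
        ("132", "2015-05-01 08:15:00", "2015-05"),
        ("161", "2015-05-02 17:40:00", "2015-05"),
        ("132", "2015-06-11 09:05:00", "2015-06"),
    ],
)

# Silver: same rows, with the ID cast to an integer and the timestamp normalized.
conn.execute(
    "CREATE TABLE silver_ride_pickups (location_id INTEGER, pickup_ts TEXT, year_month TEXT)"
)
conn.execute(
    """
    INSERT INTO silver_ride_pickups
    SELECT CAST(location_id AS INTEGER), datetime(pickup_ts), year_month
    FROM bronze_year_month
    """
)

silver_rows = conn.execute(
    "SELECT location_id, typeof(location_id), pickup_ts "
    "FROM silver_ride_pickups ORDER BY pickup_ts"
).fetchall()

if __name__ == "__main__":
    for row in silver_rows:
        print(row)
```

The bronze table is left untouched, which matches the "keep it as received" rule for the raw layer; only the silver copy carries the cleaned types.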
Now, just for a moment, and I won't explore this too hard, I'll say this table is an Iceberg table. And like I said, I don't have time in what we have allotted today, but I have lots of materials, and I like to talk a lot about Iceberg. I just ran a couple of what they call metadata table queries to kinda say, hey, guess what? There's a couple of versions in there. Every time you change the table, and often we're talking about the contents, but even if you change the structure, Iceberg creates a new snapshot ID, a new version of that table. And those versions are really good for two important things, three important things. Time travel. Maybe I wanna see what it looked like a minute ago, or two versions ago, or a week ago if you keep that many versions around. Or, more likely, maybe I messed up in production, and I wanna do a rollback. Oh, no. I really messed up. Roll back to where we were, either before I did this or to a point in time, whatever it is. And then we can use this for branching, tagging, you know, all kinds of cool stuff as well. Again, that's outside the scope of how much time I have today, but I wanna kinda whet your appetite. There's a lot going on in the Iceberg world itself. Now, if you remember, in my picture I also suggested there's something else up here, some Snowflake stuff. So what I did just in the background is run a query from this guy. The Snowflake taxi zone lookup catalog, the taxi zones schema, the table called zone lookup. Brother, I said the name three times. And that data is right there. A little VARCHAR. Let us get all that. A little decimal, VARCHAR, location, borough, blah blah blah. Now what is that data? Is that bronze data? Is that silver data? Is that gold data? You know, it all depends on what the definitions of all those are. It's gonna be either bronze or silver, and I would say it's likely very silvery. It's very shiny.
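The snapshot behavior described above (every change creates a new version; old versions support time travel and rollback) can be pictured with a toy Python model. This is a conceptual sketch only. It is not how Iceberg stores metadata, and the `VersionedTable` class and its methods are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class VersionedTable:
    """Toy model: each committed change appends an immutable snapshot."""
    snapshots: List[Tuple[int, tuple]] = field(default_factory=list)  # (snapshot_id, rows)
    current_id: int = 0

    def commit(self, rows) -> int:
        """Record a new version of the table contents and make it current."""
        new_id = len(self.snapshots) + 1
        self.snapshots.append((new_id, tuple(rows)))
        self.current_id = new_id
        return new_id

    def read(self, snapshot_id: Optional[int] = None) -> tuple:
        """Read the current version, or an older one (time travel)."""
        target = self.current_id if snapshot_id is None else snapshot_id
        for sid, rows in self.snapshots:
            if sid == target:
                return rows
        raise KeyError(f"unknown snapshot {target}")

    def rollback(self, snapshot_id: int) -> None:
        """Point the table back at an earlier snapshot; later ones are kept."""
        self.read(snapshot_id)  # raises KeyError if the snapshot does not exist
        self.current_id = snapshot_id


if __name__ == "__main__":
    table = VersionedTable()
    v1 = table.commit([("ride-1",)])
    v2 = table.commit([("ride-1",), ("ride-2",)])
    print(table.read(v1))   # time travel to the first version
    table.rollback(v1)      # undo the second commit
    print(table.read())
```

In Trino, the real equivalents are queries against the table's metadata tables plus time-travel and rollback statements; the exact syntax is in the Iceberg connector documentation.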
So from me knowing my shop, knowing what's there, me doing some investigation, I realized, yeah, that data is really, really good. And what that data is trying to tell me is what you see here, the borough, these locations are tied to those location pickup locations, drop off locations. They're tied to, some of the boroughs or the boroughs of the New York City area. And then I think boroughs have zones, if I remember that right. Yeah. There you go. Manhattan, Yorkville West, Yorkville East, W WTC, that kind of stuff. So it's just helping us. This is so classic kinda lookup table or maybe maybe dimensional table if you like that terminology there. So I would say that table is also a part of my silver zone. So my silver zone can be very logical, of course, because not only we cross schemas, we might cross technologies and that kind of stuff. Alright. So single point access, got that data, my silver's spread across a couple places. But what could I do with that stuff for fun? Remember I said we can join the datasets. So I'm gonna join the, iceberg table in Amazon s three with the Snowflake table. What did I do? I just took that location ID, and I packed on the burrow and the zone to that. We're like, that's pretty cool. So, really, the truth is I might go a little further and say, you know what? I have another table, or or maybe I have a gold my first gold table. How about that? My first gold table, I'm gonna call it, rides by zone. And what am I gonna populate it with? The contents from that join we did just a minute ago. So it's gotta crank to the 12 mil, not really tiny servers, give it a few more seconds there. But when it finishes, the good news is we can abstract away from, the complexity of that join. It wasn't that complex, but we can just say, hey. If you wanna see rides, when you see all those 12,000,000 rides, every time you look at them, if you already wanted all that information, just run use our rides by zone table because I made it a actual table. 
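The cross-source join and the first gold table can be approximated with Python's built-in `sqlite3`, using `ATTACH` so that one connection sees two separate databases, loosely analogous to one Trino connection seeing two catalogs. Everything named here (`warehouse`, `ride_pickups`, `zone_lookup`, `rides_by_zone`, the location IDs) is a stand-in chosen for the sketch; the workshop's join runs in Trino across Amazon S3 and Snowflake.

```python
import sqlite3

# One connection, two databases: "main" plays the lake, "warehouse" the second source.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS warehouse")

conn.execute("CREATE TABLE main.ride_pickups (location_id INTEGER, pickup_ts TEXT)")
conn.execute(
    "CREATE TABLE warehouse.zone_lookup (location_id INTEGER, borough TEXT, zone TEXT)"
)
conn.executemany(
    "INSERT INTO main.ride_pickups VALUES (?, ?)",
    [(1, "2015-05-01 08:15:00"), (2, "2015-05-01 09:00:00"), (1, "2015-05-02 10:30:00")],
)
conn.executemany(
    "INSERT INTO warehouse.zone_lookup VALUES (?, ?, ?)",
    [(1, "Manhattan", "Yorkville West"), (2, "Manhattan", "Yorkville East")],
)

# Gold table: persist the join so consumers do not have to repeat it.
conn.execute(
    """
    CREATE TABLE main.rides_by_zone AS
    SELECT r.location_id, r.pickup_ts, z.borough, z.zone
    FROM main.ride_pickups AS r
    JOIN warehouse.zone_lookup AS z ON z.location_id = r.location_id
    """
)

gold_rows = conn.execute(
    "SELECT location_id, zone FROM main.rides_by_zone ORDER BY pickup_ts"
).fetchall()

if __name__ == "__main__":
    for row in gold_rows:
        print(row)
```

Materializing the join as a table trades storage and refresh work for simpler, faster reads; a plain view would instead re-run the join at query time.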
Could have made it a view and let it happen at runtime. Could have made it a materialized view, but I, you know, chose to make a table. And, again, these are tools and choices that you might use in conjunction, depending on how you put your pipeline together. Many people have used, and probably still use, dbt as one of those tools that can kinda put these models together. So I say build all these, really understand them in a SQL framework, a SQL tool, first, and then I prefer to go back and kinda backport those into some richer set of tooling. Other people will say do it right there. You know? That's okay. Those are your tooling choices for sure. Alright. So now, if I go over here, back up to the New York City demo schema, I have at least two tables done. Got my ride pickups, and I got my rides by zone. Now, I'm putting my gold and silver together in the same schema just to speed this up. You might choose to do something different, but, ultimately, those are my first two, my silver and my first gold dataset. And remember, my other silver one is sitting in Snowflake proper. Could I build a view here that points to that, so someone didn't have to even think that much further? Sure. That wouldn't hurt anything. It wouldn't cost anything. Alright. What else do we wanna do? People might ask us questions like, I didn't show you this in the doc, but what all this effort was really for, is that my preview or is that the query? Yeah. All this effort was to answer a question where someone said, hey, what I really wanna know is what's the most popular borough for taxi rides by weekday? So, you know, aggregate all the rides by the borough, whether they're being picked up or dropped off there, and I think it's picked up. And then from there, which day of the week is the most popular for each one. So what did I do? I said, hey, create me a view that is the equivalent of this table. Select this guy.
I did, if you're familiar with windowing, a windowing function on the boroughs, and then used a function around that called rank, and then I just trimmed out anybody that wasn't the top one, the most important one, the very first one right there. Rank column is one. Why? So someone can write a query that says this, instead of writing this ten, fifteen line query. It's not that complicated, but we're still hiding that from them. And since that was already performing well, I just made it a basic view. I didn't do a materialized view. So I don't have refresh concerns or anything like that. So I was gonna get a quick answer. And then people ask me, why not? What's the most popular month, you know, by borough? Easy enough. I just tweaked it a teeny bit, made a month version of that as well, and you can kinda see what those look like. To me, they make a little bit of sense. The month one made a little sense because it looked like a lot of them were around May, and I always think maybe people are going to the airport, school's out, people grabbing their kids, going on vacations, and that kind of good stuff. The day-of-week ones, they looked a lot like people going to EWR. It might have been consultants flying home on Thursday night, you know, but I'm not doing too much analysis here. Just giving you some ideas. So what do we see over here in our demo schema? Again, this is the combined silver and gold. We have our silver ride pickups, our gold rides by zone, and then two more gold datasets here. Now I'm gonna take a quick sip. Oh, good. I sped up so fast, we got some time. I didn't see any more questions yet, so I'll just go back to this picture. Believe it or not, with a very simple dataset, these taxi rides, we actually did build ourselves the medallion architecture. Raw or bronze. I always like to say other things like raw and land.
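The "most popular weekday per borough" view described above follows a common pattern: count rides per borough and weekday, rank the counts within each borough with a window function, and keep rank 1. The sketch below shows that pattern with Python's built-in `sqlite3` (window functions need SQLite 3.25 or newer). The table contents and column names are invented, and SQLite's `strftime('%w', ...)` is used where Trino would use its own date functions, so treat the SQL as an analogue, not the workshop's query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides_by_zone (borough TEXT, pickup_date TEXT)")
conn.executemany(
    "INSERT INTO rides_by_zone VALUES (?, ?)",
    [
        ("Manhattan", "2015-05-07"),  # Thursday
        ("Manhattan", "2015-05-14"),  # Thursday
        ("Manhattan", "2015-05-08"),  # Friday
        ("Queens", "2015-05-10"),     # Sunday
        ("Queens", "2015-05-17"),     # Sunday
        ("Queens", "2015-05-11"),     # Monday
    ],
)

# A plain (non-materialized) view: the ranking is recomputed on every query.
conn.execute(
    """
    CREATE VIEW borough_most_popular_weekday AS
    WITH counts AS (
        SELECT borough,
               strftime('%w', pickup_date) AS weekday,  -- '0' = Sunday ... '6' = Saturday
               COUNT(*) AS ride_count
        FROM rides_by_zone
        GROUP BY borough, strftime('%w', pickup_date)
    ),
    ranked AS (
        SELECT borough, weekday, ride_count,
               RANK() OVER (PARTITION BY borough ORDER BY ride_count DESC) AS rnk
        FROM counts
    )
    SELECT borough, weekday, ride_count FROM ranked WHERE rnk = 1
    """
)

top_weekdays = conn.execute(
    "SELECT borough, weekday, ride_count FROM borough_most_popular_weekday ORDER BY borough"
).fetchall()

if __name__ == "__main__":
    for row in top_weekdays:
        print(row)
```

Because `RANK()` gives tied counts the same rank, a borough with two equally popular weekdays would return both rows; `ROW_NUMBER()` would force a single winner instead.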
Bronze, silver, gold, you know, land or raw, structured, and then maybe our consumed tiers and that kind of stuff. And we showed single point of access. We showed query federation across that. There are two more things to talk about. What we could do is go create roles, such as marketing and operations, and could we do our classic RBAC? Yep. Could we do our tag-based controls? Yep. Could we do column masks? Could we do all that stuff? We can. But what I'd rather do is hold those and get through the rest of the demo. And if we do have some specific questions that would be most usefully answered by doing a quick interactive demo, I'll be glad to do that. To me, those seem like table stakes. You know? Yes, I need that, and you do need that. And, again, we will integrate with your authorization system. As I keep suggesting, we have our own. We'd encourage you to use our own. We really wanted to make that core to our business since, you know, that single point of access really means a single point of governance. So we've tackled governance. And governance, to me at least, is absolutely security, but I think it's also, you know, things like lineage and whatnot, meaning something like this. I went to the catalog view, the heavy metadata view of this world, and I'm gonna look at my demo schema. And I'm gonna look at those views, like the borough most popular weekday, and I'm gonna learn all about it and set up tags. I already have it inheriting some tags like city and j just to kinda help things out if I want to later. But what I really wanna show was this lineage tool. I'm gonna pop it up, and this isn't something that we invented, of course. Lineage has been around for a good amount of time. I just wanna make the point that it's included as well. You can use your third-party data lineage tool. You can use ours to do similar things.
You can work with OpenTelemetry, OpenLineage, and all those good tools. You can do integration in or out of this framework. But by itself, what can I do? I can say, hey, there's that view we talked about. You know, where did it come from? What is it? All that kind of stuff. You can click on it and go see it. That's just a definition of it. You can kinda say, where did it come from to get this transformation? And in fact it came from, as you saw just a minute ago, Lester Martin running that SQL statement. I built that just a bit ago, some basic stuff there. And it also shows where it came from. That transformation consumed rides by zone. And, again, same thing. I can just keep walking. That rides by zone was made up of, you know, that join we talked about earlier, so there are those two tables. Zoom in a little bit. Now, zone lookup, probably that should be the end of that ride. Yep. Because that's where it started. But ride pickups actually came from, you know, that generated year month table that we saw earlier. So, again, there's our bronze. There's our silver. There's our other silver, because it came in as silver, or whoever created that did a good job of making it reportable. There's our first gold-tier dataset, and then there's another derived gold dataset on top of that. I don't know if you like all that stuff. I like all that stuff. I think it's pretty cool stuff for sure. Again, we didn't invent any of that. That's all stuff that exists in the marketplace, but we definitely didn't have that stuff four years ago when I got here. It took us some time to build all that out. And we go down to what we call column-level lineage. We'll just save that for some deeper questions and whatnot. Okay. So what will we do next? Well, I would say a couple of things. We got this far. I would probably come to that demo space. I'll come back again to my catalog, get back to that New York City catalog. Yep.
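The lineage walk narrated above (the weekday view came from rides by zone, which came from the join of ride pickups and zone lookup, and ride pickups came from the discovered year month table) is a traversal of a small dependency graph. Here is a minimal Python sketch of that traversal. The dataset identifiers are paraphrased from the demo and the functions are hypothetical; this is not the lineage tool's data model or API.

```python
from typing import Dict, List

# Each dataset maps to the datasets it was directly built from.
UPSTREAM: Dict[str, List[str]] = {
    "borough_most_popular_weekday": ["rides_by_zone"],  # derived gold view
    "rides_by_zone": ["ride_pickups", "zone_lookup"],   # first gold table (the join)
    "ride_pickups": ["year_month"],                     # silver, built from bronze
    "year_month": [],                                   # bronze table from schema discovery
    "zone_lookup": [],                                  # silver lookup in the other source
}


def trace_upstream(dataset: str, graph: Dict[str, List[str]]) -> List[str]:
    """Return every upstream dataset once, in depth-first order."""
    seen: List[str] = []
    stack = list(reversed(graph.get(dataset, [])))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.append(node)
        stack.extend(reversed(graph.get(node, [])))
    return seen


def root_sources(dataset: str, graph: Dict[str, List[str]]) -> List[str]:
    """Upstream datasets that have no parents of their own."""
    return [d for d in trace_upstream(dataset, graph) if not graph.get(d)]


if __name__ == "__main__":
    print(trace_upstream("borough_most_popular_weekday", UPSTREAM))
    print(root_sources("borough_most_popular_weekday", UPSTREAM))
```

Column-level lineage extends the same idea by making the graph nodes individual columns instead of whole tables.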
And I would go drill down to that schema, and I would probably start talking about this. I think it's right here. I would probably start editing. What's this schema about? You know, this is a combined silver and gold layer for New York City taxicab ride data, something like that. And have a lot of fun, fill it all out. Maybe apply some more of the tags that we talked about. You might as well create some if you don't see them here. The interface won't let you create them on the fly here. I believe you've got to go back and define them up front. Maybe add some more contacts. Maybe add my son's email in there for some reason. Populate a few links. New York City taxi conglomerate. Yeah. Forgive me. And then, you know, a URL to it. I don't know what it is. And then I can say, cool. I got all that stuff. Save that stuff. What did I do? I just added more metadata about this demo thing. I can drill down into those tables or those views and start looking at each one of these things, like the borough. Maybe the borough is not a big one. Location ID. Well, is location ID the starting or the ending? Well, I could probably come over here and say, the location for the start, you know, of the ride, that kind of stuff. I can start applying additional details via tags, via descriptions. And in fact, if you have a configured model... I didn't wanna go into that too much here, but we can go configure models in this framework. Today, we do embedding models and we do large language models. These could be cloud-based provider ones like you see here. These could be on-prem, private, you know, firewalled off, kinda hidden-from-the-world models for whatever reasons, privacy or cost reasons, that kind of good stuff. If I use these models, then back in that catalog space, there's this notion that says, hey, auto-fill some of this stuff out for me. 
So we can use AI to read through these schemas, just like MCP and other things do, and then start making good guesses at the definitions I'm trying to type in there from scratch. They'll be annotated as having come from LLMs, and then you can kinda fine-tune it, agree with it, disagree with it, all that good stuff. Why would you do all that stuff? Two reasons. One, the more metadata you have... you know, as humans, we probably won't look this stuff up, sadly. But an AI agent will absolutely leverage as much information as you give it. So we wanna fill all that up with lots and lots of good information. And then what else do you wanna do? Well, you probably wanna go back to your query editor. I guess we could have done it right there in the catalog view. Let me do it that way. New York City. When I'm looking at that schema, I wanna build a data product. So what's a data product to us? Whoops. Schemas. A data product, as I said a minute ago, is a highly curated environment. Now this is where our products differ: how you create that data product works a little differently in Starburst Enterprise than it does in Starburst Galaxy. That was by design a while back, but the eventual design is one approach, and one approach only. I would say Starburst Enterprise actually has more features, more complexity, more interesting things. But Galaxy would say, hey, look. What I wanna do is take this schema that I really think is in great shape. I wanna share it with the world, so I'm gonna promote it up. Grab all those links and stuff, and I'm gonna say, okay, I'm gonna name it a data product. This is my new data product. That's the name of it. What does it really do? New York City taxis. Right? Something like that. And it carried forward the kind of information you already found. We can enhance that, change it, delete it, whatever. 
We could build out a lot more detail using markdown too if we wanna have a lot of formatting, that kind of stuff. You can preview. What am I kinda building? I'm kinda building something like a wiki page or a Teams page or a SharePoint site or something like that. I might even say, hey, if someone's browsing this in a few minutes and they wanna preview the data, what system would I run it on? I could add or change those schemes, add more contacts, add Lester, a Starburst contact, to the list as well. And then I'm gonna promote it up. What did promoting do? Well, promoting elevated it up to a data product. So I'm looking at this. Sometimes this might look a little better. Let's zoom back out. 200. There we go. So I only have three data products right now, at least three that this person can see: Pokemon GO, Air and Space, and the one I just created. Now if I clicked on this, this looks pretty boring. You know? You can probably start to get the vibe that, okay, cool, someone can come and learn about the various... which one did I put a tag on? I thought I put a little comment. Maybe it was rides by zone. You know, all those details that we talked about earlier, we can start to see that kind of information. So if you wanna change that level, you can. You can create concepts like usage examples and overviews. And the best way to show this, instead of me pretending to do it, is to go back to the data products and say, hey, look. I stood one up a little while ago, and I called it Air and Space. And the Air and Space one is, you know, datasets. It's a data product, a collection of datasets. Datasets are tables or views or materialized views. A dataset is linked to a data product, and the data product is a wrapper of a schema. So that should start to suggest, okay, so this is a marketing thing. Tell me about it. Show me what's there. You know, table has individual, this, that, the others. 
It's got a couple of views, and so on and so forth. But what I think is really cool about it is this. We have this concept called, excuse me, the usage examples. And this is where not only do I want people to know about those five or 10 tables or views, but I also know that they probably have further questions, and I don't wanna make a view times a thousand. So I might offer up, hey, if you're looking for, you know, common airplanes or, you know, whatever term you're looking for, we can kinda go look at this thing and understand it a little better. Now, ultimately, what this thing does is let me go grab these queries. I can go run it directly. I was just gonna make the point, if I say query this data... I probably shouldn't have picked this one. It might be linked to a server that's not running. Oh, there it is. What it's really doing is this. At the end of the day, that's a marketing site. That's a drugstore to walk in and walk the aisles and learn about something. Ultimately, if I wanna run a query on it, the good news is it's still a schema with tables and views. So even if you don't know someone marked it as a data product, or even if you decided you like it, you're still gonna go directly with your BI tool, with your query engine, whatever, and run queries directly against that environment, against these astronaut tables, against this airport table, and that kind of good stuff. Not sure what it didn't do there for me, but let's see if I can get one to run. There we go. Looks like my server didn't start up. Okay. But I guess that's what I wanted to say about the data products. It's a mechanism to create, not just necessarily for humans like I'm doing here. Go look at my data products one more time. Absolutely, we want humans to use this tool, and it's all integrated with our comprehensive search, so you can start looking for things. But this is equally or maybe even more powerful. What did we say earlier? 
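As a rough illustration of the structure described above (a data product wraps a schema, holds datasets, and carries usage examples), here is a small Python sketch. This is not Starburst's API or data model. The class names, the `demo_catalog` catalog, the `air_and_space` schema name, and the `airports_by_state` view are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Dataset:
    name: str
    kind: str  # "table", "view", or "materialized view"
    description: str = ""


@dataclass
class UsageExample:
    title: str
    sql: str


@dataclass
class DataProduct:
    """A curated wrapper around one schema (hypothetical model)."""
    name: str
    catalog: str
    schema: str
    datasets: List[Dataset] = field(default_factory=list)
    usage_examples: List[UsageExample] = field(default_factory=list)

    def qualified_name(self, dataset_name: str) -> str:
        """Return catalog.schema.dataset: the plain name any query tool would use."""
        for ds in self.datasets:
            if ds.name == dataset_name:
                return f"{self.catalog}.{self.schema}.{ds.name}"
        raise KeyError(dataset_name)


def build_air_and_space() -> DataProduct:
    # All names below are illustrative assumptions, not the real demo objects.
    product = DataProduct(
        name="Air and Space",
        catalog="demo_catalog",
        schema="air_and_space",
        datasets=[
            Dataset("astronauts", "table", "One row per astronaut"),
            Dataset("airports", "table", "One row per airport"),
            Dataset("airports_by_state", "view", "Airport counts per state"),
        ],
    )
    product.usage_examples.append(
        UsageExample(
            title="Preview airports",
            sql=f"SELECT * FROM {product.qualified_name('airports')} LIMIT 10",
        )
    )
    return product


if __name__ == "__main__":
    print(build_air_and_space().usage_examples[0].sql)
```

The point of `qualified_name` is the one made in the demo: the data product is descriptive packaging, and the thing you query is still an ordinary table or view in a schema.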
People are writing data applications. Some of those data applications might even be AI or generative AI oriented. And you've probably heard about this thing called an MCP server. Do we include MCP? We sure do. But we also include our own agent. We have one in production today, and we're offering a few more very shortly. You could build whatever agents you want, but, ultimately, what I'm trying to say is this: if an agent could see this data product, and that data product was action-packed full of what this data is and what each one of these datasets is, with, you know, a lot more descriptive details and a lot more examples of, you know, what am I looking for? Percentages by the HP tables. That AI tool very likely is gonna get better. The results get better because it knows more about your dataset. Humans can be better because they know more about that, and so on and so forth. So I see I'm in here in thirteen minutes, and the whole concept of what we can do with AI is definitely something where I would stop and say that's a whole other great conversation. But I would argue that at heart, what we do with AI is take other people's models. We let you integrate with them. You bring them under the same governance, just like a table or a cluster would be: RBAC. We put usage amounts on those things. We can meter them, like it suggests. We can hide which ones you can or can't use, all kinds of interesting stuff. We have multiple webinars recorded. We have multiple tutorials. I'd love for you to reach out here or later, through devrel@starburst.io or anywhere else, and try to tackle some of those good details. I think with that, I'm gonna stop my demo now. We're at seventeen after. And like I said, we have the time allotted for about another ten minutes. 
I doubt we have ten minutes of questions, but I did wanna make sure, if folks had some questions about what they saw, maybe around the stuff I was alluding to just now, or maybe just something completely different. I wanna give you a couple minutes here to make sure that if you have those questions, you ask them. But if you don't have any more questions, again, I appreciate you, as I said at the top of the webinar, for your time. I hope you take away something interesting from this, and that it'll be useful. And I'd say, again, there are a million ways to reach out to us. Alright. First question. Hey, look, can I get the presentation and recording? You can absolutely get the recording, because when we're finished, the system will send you an email pretty soon, once this all gets staged. You know, if you had a colleague that registered, they'll also get an email on how to find it. So that'll come to you very shortly. You know how to find me, so you get that. The presentation deck, I don't have it posted there. So maybe the quickest, easiest way, if you really, really want it, is I'm just gonna put my direct email there. Lester Martin at starburstdata dot com, in addition to, you know, the DevRel one that you saw earlier at starburst.io. Yeah. There's nothing in there I mind sharing. I probably should have just put it up on the docs and let you guys download it if you really wanted it. Alright. Again, if you're taking off, thanks for your time. I'm just chilling a little bit longer to make sure no more questions roll in. So, again, thank you for today. Okay. I'm gonna shut it down now, and I'll thank my colleague, Quincy, for backing me up here. And thanks for the great questions. Thank you, Milun, for asking those questions. And everyone else as well, you all have a beautiful day, night, whatever. There we go. Quincy. There we go. Some more, yeah. Yeah. Next week, optimizing. Thank you. 
Optimizing Apache Iceberg performance. So if you're already in that world, come see us next week. We have plenty of news in that space, not just in that webinar, but in that whole space. That's a focus area for a number of us, including myself. So thank you again, and I will hit leave stage and let you have your day back. Thanks now. Bye bye.