Video: Hands‑On with Apache Iceberg: Build, Evolve, Operate | Duration: 5840s | Summary: Hands‑On with Apache Iceberg: Build, Evolve, Operate | Chapters: Introducing Starburst Workshops (17.935s), Creating Iceberg Tables (425.57s), Iceberg Table Creation (1789.79s), Metadata and Files (2068.16s), Snapshots and Traceability (2470.71s), Query Cache Management (2733.425s), Analyzing Data Snapshots (2812.645s), Customer Reported Issues (2883.48s), Iceberg Update Operations (2963.095s), Time Travel Basics (3294.65s), Time Travel Queries (3368.415s), Data Versioning Strategies (3562.495s), Rollback and Recovery (3652.845s), Advanced Features Overview (3794.855s), Creating Aviation Database (3892.56s), Evolving Database Schema (3973.705s), Partition Scheme Changes (4353.765s), Small Files Problem (4533.02s), Compaction in Trino (4632.005s), Snapshot Management Considerations (4727.105s), Conclusion and Teardown (4914.755s), Q&A and Conclusion (5084.325s)
Transcript for "Hands‑On with Apache Iceberg: Build, Evolve, Operate": Hey, everybody. How y'all doing today? My name is Lester Martin. I'm a developer advocate here at Starburst. And today, those of you that are here live and those of you watching this later, are actually gonna participate in something we're trying, something new trying for for this, year. That's another it's a series that we're gonna in many ways, it's nothing new, but it's something we're gonna try to put as a series. We're gonna call it the Starburst workshops, and we'll try to do one about every month initially. Maybe we'll get it a lot faster if it makes sense, if there's an audience for it, etcetera. But the goal here is to make about a ninety minute, live webinar, give you some instructions, give you some hands on, exercises, make sure you have an environment to work it in. And I don't know. We do something. We do something together. So a little bit of a, you know, hands on lab or set of exercises that we work through together. So we're gonna start off with one that's near and dear to me, near and dear to, of course, the Starburst and Trino and the open source data community. It's about Apache iceberg. Alright. So let me get my cursor to move forward. There she goes. I should've put it up there on screen a little bit about me. Again, you heard probably all the most important stuff. My name, and I work at here at Starburst. Down there at the bottom is an email address. You're welcome to maintain and hang hang on to devrel@starburst.io. That's a way to reach me. A few other folks too, but don't hesitate. You can say, hey, Lester, if you're reaching out to me or just, hey. What's going on? And then, of course, we have lots of good assets on the website, and I'll share those as we go along as it makes a little bit of sense. Now before I jump on in, I wanna say a couple things. We have some materials. So if you look over there in the by the chat and stuff, there should be something that says docs, and the docs have a couple different things there. One of them says artifacts. So if you hit on the artifacts and let's just let's just go over there. Artifacts are gonna take you to probably this page right here. So it's GitHub project. Nothing special. Just trying to use a GitHub project to cutting put assets for folks to get ahold of, etcetera. And this is the instructions today. Today, the good news is a lot of these instructions a lot the instructions we'll be using are something that I produced before and published in our tutorials pages as well. We'll just kinda put them all together and work through a few of those together with each other. But this is a good place to come back, especially for anyone that's watching this after the fact. So definitely, if you have trouble finding I'll take it back. It's gonna take you all the way to this one. It's gonna take you to let me get there. Should've clicked on it. It's gonna take you to this one, this workshop name. After today, when I update this, this this will change to view on demand or something, but the rest is the same. There's the presentation slides I'm about to do, some links, etcetera. But more importantly, probably, you know, other than, you we're gonna talk about Iceberg. We're gonna do some full table full table life cycle create update, manipulate, change schemas, change partitions, all that kind of good stuff, and even do some table maintenance just to really make sure your hands are hands on keyboards and getting some hands on experience in iceberg. And if you've already done a lot of that already, feel free to hang around too. Not maybe maybe the steps are something you've done already, but maybe you have some questions that make sense at different, points in time in the conversation. And if you do, don't hesitate to put them in the chat. And I think there's four or five messages I need to look over there and see what they say. And I think Quincy is probably letting folks know or will let folks know. We actually are gonna allow folks to if they wanna do more than chat and ask questions to raise their hand and and come on audio video with us and ask some questions. So be as comfortable as you see. It makes some sense. I see all kinds of cool places, Northern Ireland and Southeastern USA and good old Texas where I spent most of my young young my childhood and my young adult life. I live in Atlanta, Georgia now, which I love. Okay. So what do we wanna do? When you get here, if you find it, and I'm hoping you can find it. If you can't, I do want you to scream out in the chat or some way to let me know. If you do find it, ultimately, I'm looking for you to click on that word instructions. Instructions take you to another markup page that really say a couple things. Today, we're gonna walk through some of the environmental setup for this lab, for this workshop. In the future, we might make that kind of like a prerequisite. If you do it today, you'll already be set up for tomorrow for next time. But if I click on in there, let's just do that. Or let me just tell you what we're gonna do. Do that. And then these other three links are formal tutorials that I have, and I'm gonna show them in the background to show you that you could do those in yourself or share them with your colleagues, etcetera. So for the environment setup, what am I suggesting we use? Well, we need a we need a Trito environment that is set up to work with a writable Apache Iceberg catalog. And how are we gonna do that? You can do it any way you want. I have a lot of alternatively things in here. But if you wanna follow my instructions and and make it really easy for yourself, the best thing you can do is get set up with Starburst Galaxy. How do you do that? Well, there's a couple links in there to get signed up and all that good stuff. But, ultimately, there's a little guide here, also a little tutorial. It's a, probably a little over hold your handy. You know, this is pretty straightforward stuff. And in fact, I'm gonna click on that link right now and paste it because I want you to go ahead. If you haven't done this and you're working with me live or you're walking you know, maybe maybe you choose to, watch today and decide afterwards, you'll do this and do something else. And then next time, you know, after we're all done, you might say, cool. I really wanna do that. Let me go back and set up. But if you are gonna plan on do these steps, please, right now, go to this page you see on my screen. If you don't already have, a login to a Starburst Galaxy instance or if you have step or something else, that's fine too. Whatever makes any sense for you or just bare bones credo. This is just some of the setup for the catalog are gonna be a little bit more arduous than the simple stuff that we do here. Alright. So I I went really slow, hoping people clicked on that free trial because I'm gonna do it myself live with you just so I can make sure the timing all works pretty well. The notes say, hey. There's another thing, and this other thing again is this lab over here. I'm gonna just make sure you know I'm not gonna look at this one because it's pretty self guided. So let's do it. Let's create a new account. Hey. My name is Lester Martin, and I work at this company here. And if you want my personal work email, there it is. Starburst data Lester Martin starburst data dot com. I'm gonna say I'm a data engineer because that sounds pretty cool. And I think from there, that's about it. I don't even think company names are required. I'm gonna put it just so they don't add me into the all the marketing stuff. I'm not a robot, and I'm gonna sign in. So, again, I wanna make sure you know this is pretty darn easy. Now what it's gonna do next is send you, guess what, a code, the old sign in code. So that's gonna be very exciting. Let me pull my browser for you. It's something like that, and it got there that fast. You just need those six digits. Yours will be different than mine, and I'm gonna put those codes into it. I want you to be doing this yourself if you haven't. If you haven't already got set up, that's great. Alright. So it's gonna verify that. Now after it verifies it, again, scream if I'm going too fast or too slow. We'll try to adjust. After it does that, it's gonna say, pick a domain name. As you see there, it's gonna be something.galaxy.starburst.io. So if you want Lester I already have Lester's. You can't have that one. L e s t e r. It's already gone. Well, I should have told you it's already gone. Oh, maybe out of Western Martin. Whatever. We're gonna do I'm gonna do what you do whatever you want. I'm gonna do dev rel work shop o one. That's what I'm gonna do. So it's gonna say, cool. And if it is able to get that domain, no one else has already taken it. I doubt anyone's taken that one. You know, put your name, put whatever feels comfy. Make it as you. Make this personalized to you, not necessarily your company or entity, and then set a password. Don't set the password I have, but set your own password. And then to be honest, it should cross our fingers, log us in, lock us down, get us loaded, we're in. In fact, to show that we're really, really, really in, they even kinda queue it up with some SQL to say, run this, and then set up to, already have access to a catalog called sample. You might see it on the left here, and it's just gonna query these this bank data. Now we're not worried about that bank data. That just says the system's up. We're actually gonna set up what they call a catalog, and it reads a catalog that points to a s three bucket. I'm gonna share information to that we can write to. This is a readable only system right now. We also have one more big problem, and it's not a problem. It's part of this lab. So this lab, I just went through all these steps. I created a galaxy count. You know, I ran the first query. And if you are looking at this now or later, you can kinda stop on this lab. This lab actually has some cool stuff. It sets up a read only to a Postgres, another catalog, and does a really cool federated query. We join Postgres with the data lake data and stuff. Great stuff. But we don't necessarily need to do all that except for step number seven. We need to create another cluster, meaning that this is a list of clusters that we have here. And the problem is, the s three bucket we're gonna use in a few minutes is in US, AWS US East 1. As I hover over this free cluster, it says it's in US East Ohio, which is US East 2, if you know if you know all those silly AWS things. Why is that a concern? Because we're gonna use the free service. We wanna be nice players. We're gonna make sure we put our compute and our storage in the same availability zone, all that stuff, and don't charge a lot of money and everybody be happy. So how do we do that? The instructions tell you it's pretty easy. It's a wizard. You just kinda go over here to this admin link over here in the far left. And if you open that up, there should be one called clusters. So admin clusters in the far left. And guess what? There's already a cluster. We're gonna create guess what? A cluster. Create a cluster. Okay. Here, make it your name. Instructions tell us call it, AWS US East one three. Okay. Might as well. And then it goes ahead and says, if you want to add some of the data, I'm gonna add some of the static, some of the standard datasets that'll be available. And here's where you gotta get a little funny. You gotta make sure you say just type if you type East, I think it'll be the trick. Oh, maybe the AWS East. Well, you gotta look there until you find US East North Virginia. Where does it say all that? It says it right here. Create a cluster. Create. Create. Create. Somewhere in there, he tells us. There it is. US, East, North Virginia. Again, these are just AWS isms. And I'm trying to go let me look at the notes, see if anyone's raising their hand. If anyone's following along and, you know, needs a little more time purposely kinda not pushing it, just rambling a little bit as we go. If you got that far, the red you can follow those instructions, but the for the first for the most part, just leave everything like it is. I will go ahead and preemptively tell us a problem we're gonna have later. The problem is this. We have a cool feature called query results caching that if you run this inquiry exactly as it was a second ago and you have this turned on, it's gonna give you the same results. And, of course, you can set a session variables. Ignore that. Makes a lot of sense for those BI dashboards pulling that same materialized view consistently, constantly, you know, you know, the I'm gonna leave mine turned on just so later I'll have that problem, and that'll remind us there'll be a problem. So I'm gonna create the cluster. US East 1 free, mainly make sure right there, North Virginia. That's what we wanna peg. Create the cluster. Alright. Cluster has been created. It's starting to spin up. Everything's super duper copacetic. In fact, what I can do, I'm gonna go back to my query editor, and then I'm gonna toggle that free cluster up here to the one we just created, AWS US East one. And I'm just gonna run that first query again to make sure it really, really works on the cluster I wanted. And you if you if you look closely, you're gonna see that we have two two clusters here. You can have a a variety of clusters and a variety of sizes, a variety of auto scaling, a variety of, how much caching, how much fault tolerance, all kinds of neat, cool stuff, locations, different cloud providers, blah blah, all kinds of really awesome, stuff. But so to recap, where did we get so far? We got halfway through these instructions. We set up Starburst Galaxy. So all that's here. If you can't find these instructions, if you're joining late, you know, ask a question. Quincy's here to help us out too. She can probably get you going, at least point you to instructions, and and catch up. Alright. The second half is we wanna need we need an s three bucket that you have permissions to write to because you're gonna create tables and do all kinds of fun stuff. So guess what we do? We're nice people here at Starburst, and we say, you know what? If you click on that, we're gonna send you to yet another one of these cool tutorials. And this is another one that might have maybe a few more pieces of instruction than we need, but it also has the credentials that are gonna work today. Might work tomorrow, but they get changed pretty regularly. So I'm gonna say, hey. Let's come over here. And if you find that, it's basically the first few slides are like, oh, yeah. Get an environment. Let's get your Starburst Galaxy environment. We got it. Be an admin. We got it. And then it just like we did with the cluster, it's gonna make it kinda simple. Hey. Find the catalogs page and hit the create catalog button. So let's go find all that in our runtime environment. K. Keep checking over there, seeing how you guys are doing on questions and answers and that kind of stuff. Okay. So I'm over here. Where is that at? It's right under here on the left, under data, catalogs. Alright. And my screen's kinda crunched, and there's a lot going on in this one column, so it doesn't look all that pretty. But I'm gonna go ahead hit create catalog. And I'm not necessarily gonna go back here and show you step by step here. I'm just gonna kinda say it's all there. We'll scan it. Why not here? So we're gonna create an Amazon s three one, and we're gonna name it this name. Temp cap temp temporary catalog is really what we're trying to do. And there's some notes. I'll copy those because those are something I wanna say a few times today and definitely try to mention it again, by the end. Alright. So I'll go back over there, create a catalog, Amazon s three. We wanna call it. And, again, just like the server name or the cluster name, could you call it something different? You could, but the instructions are kind of geared around these names. I'm confident you can work through it. But if you look really close, it's kinda hard to see that warning there in the description says, hey. Delete this catalog when you're done playing with it. And I'm gonna tell you, delete it before the next time you come back. So if you walk away without deleting it, come back to this environment the next day, and all of a sudden things act kinda funky. Don't try to debug them. The problem is this. This catalog is the data is going away at the end of the day. You're gonna have a meta store point, and this stuff isn't around anymore. We do that just so we guys don't, you know, fraud us and make us look bad and run up our bill and all that good stuff. So the short and sweet, simple answer is just get rid of the catalog, and realize it's ephemeral for you, and you can't just rely on us to have an open open ticket, open check on Amazon s three. But if you wanna build your own, create your own bucket and set it up, that's even better. That would be the best. That's what I do on mine, personally. Okay. So you're gonna call it tenant catalog, and we're gonna use the AWS access key and credentials. And this is what I was saying. We you know, this isn't really a great thing, for us, you know, but it's a good thing for you because I'm saying use these credentials that I'm giving you right here. Just copy and paste them in, And we rotate these a lot, so don't expect them to work very long. But today, they should work. So we entered those cast credentials. We'll verify in a moment if they work. What else is in there is basically we need to select, a metastore, like a Hive metastore or something like that. We're gonna use the bundled one inside of Starburst Galaxy, and we need to point to the bucket. Alright. So let's run over there. We're gonna use a Starburst Galaxy, not Glu, not the Metastore, and the bucket name is Starburst tutorials. And the default project name, that's once, and you wanna get all this right is project. So under that bucket, I'll have a little pass. And those instructions are gonna tell you to turn on both of those things. They say it kinda ugly here. Select, select, but I give you a picture. Tutorials, projects, turn on, turn on, and make sure it says Starburst Galaxy. That's what I have right over here. Yeah, man. If you need to click on yeah. I'll give you the I'll give you the page. The page is gonna be for credentials, for creds. See. And on that, you wanna be seeing the one that says step whatever that was, connected to MedStar back. I think it's gonna be called step three or something. There it is. Four. Step four. Step four. And you can copy and paste them out of there, Monica, or anyone else that's having that troubles. Again, I I purposely haven't gone real slow. Raise your hand if you are actively doing this live with me and you got an issue or something. If you're just watching, forgive me for not just blind through it because I do wanna I'm hoping some folks are willing or interested. Yeah. You bet, Monica, to to do this live with me like Monica's doing. So thank you. Alright. So what else is on this page? There were some stuff like fast warm up, leave it alone, default table format. I think this in this guide is gonna tell us market is hive. I don't know what it tells us anymore, but it says do whatever you want. We're gonna do, leave that iceberg. If you set it high, it's not gonna hurt anything. Okay. I'm gonna hit test connection, and I'm gonna look for a green message. Hooray. Any message that starts with hooray that's green is probably a good message, so it's all good. And that's what we should see in this tutorial. Right? Connect it. Test it. Then it says keep on going. So we'll connect to a catalog. And, really, what we need to do here is just come over here and add that catalog we had. If you notice the way this is set up with a free cluster and you're not you're not paying big money or anything, it's only gonna let you bind this catalog to the same avail cluster in the same availability zone. You've got more things, cross cluster, cross available zones, cross cloud providers, all that. That's cool. Today, we gotta keep it cheap. We're all getting trouble, so we're gonna say, yeah. Let's make sure both the bucket and the instances are in East 1. Okay. That's it on that page. The rest, the tutorial tells you when, where, why. Basically, just we're admin. We're just wide open. And the tutorial is also gonna tell you when you see this message, don't hit run discovery. I have some other workshops that we do this, and let's not do that today. Let's do that later, as it says. Now good news is if you're following along, you're doing well, I'm gonna hit next. It's actually gonna say, why don't we head over to our our our query editor? And I think we should. So where is that? A lot of ways you can just hit query data on the side of it. I'll just do it that way, or you could come over here to the top and say query query editor. Ultimately, you wanna make sure we're in the same cluster, AWS US East one. Let's go ahead and use the temp catalog. There should be no schemas there because well, there's the information scheme and system schema. But the next step, let's go ahead and we wanna do this one. It wants it says create me a catalog a schema, but let's change it. If you look really close, I'm gonna I know it's tiny, so I'll let y'all scream if I have to increase it. The more increase of the, you know, the real estate or Morty, if you know what I mean. It'll get kinda messy. It's really saying don't run that. It's saying change first name, last name, postal code to something that sounds like you. So my first name, last name, postal code is Lester. Let's just do it all over again. Lester underscore Martin, and my postal code is 30028. 3004. Sorry. Alright. There's my, first why are we doing that? If you really, really wanna know because under the covers, that s three bucket, under the bucket name, the project name, whenever you create your scheme, it will be a next folder in there. And if everyone types temp first, underbar, last, underbar, postal code, then we'll all be pointing to the same location. We'll all get very confused very quickly. Again, maybe not what you would have done in production. Works pretty good for us here. And if all that goes well, you should be able to if the metadata is caching quick enough. So I'll go this once or twice and hopefully see it. I didn't see it show. My server wasn't running, so it's it's spinning up. And this is the other consequence. We're about to get the iceberg, I promise. This is the other consequence of totally free. It is set up to wind down after five minutes in activity. So once you get started, it won't be a problem, but sometimes, especially for. Let's see if it shows up. There it is. So there's a new schema now. Temp Lester Martin with a bunch of stuff. And you won't see mine, I won't see yours if we do that right because we're all using a different metastore. We're using a different, you know, catalog metastore that doesn't know about each other, but we're sharing the same bucket. So it's that's the consequence of weird stuff. Alright. If we're there, you can walk away from this tutorial now. In fact, think, yeah, it finishes up. It says you're all done all through anyway, I'll close that out in case we need it. Unless we need it. Good news. I'm back to this. Again, this is the instructions and everything. If you don't have this, it's under docs, under artifacts, and it'll take you to the top page and walk down into it, see the instructions and instructions to to set up. Alright. We set up start merge gallery. We set up an AWS three. And all I'm gonna remind you on this page is some point in time, you wanna get rid of that temp catalog, and to get rid of it, you have to disconnect it from the cluster. All that all if we have time, I'll show it later. It's very, very easy to do. And if you don't do it, nothing's gonna blow up. You just might be confused. Alright. Environmental setup is done. Let's get to the instruction. Let's get to the meat of Apache iceberg stuff. Anyone with me? How about that? Let's do it the other way around. Anyone not with me? I'm gonna put you on the spot. Who's not caught up? I don't know. I'm not gonna do either way because I don't wanna put you make you feel there. But I would say again, if you need me to slow down or speed up, holler, holler. So I'm back on this instruction page. Remember, this is a this is the link that we have in the docs, and there's a link to instructions. Here we go. We're gonna do this these three labs. And for the most part, I'm gonna I am gonna have them up side by side because I want you to know they're there, but I'll do a little bit of what I'm doing here. I will definitely go there to copy and paste things, but I'll probably talk about what's in there live as well. Somebody's caught up. That's awesome. Alright. So let's right click, open up in a new window, throw on the left hand side there, and go to the tippity top. Let's create a popular iceberg. Now I think somewhere in my slide deck, I was gonna tell you something. Let's go back and see if I had a goofy slide deck. I did, didn't I? Was it important, Lester? You tell me. That's that. That's the the there we go. Let's see if I had anything in there. Okay. I did have a teeny bit, so let's just put those slides to bed. I guess I got ahead of myself then jumped, on there. Apache iceberg. The good news is I'm gonna give you, like, one, two, three, four, five slides, six slides, and I'll let you see them real quick. And I'm gonna go through them slightly slower than that, but not really. Because if you really wanna learn about Apache Iceberg, that first tutorial that I've just pointed you to is gonna have a nice video to get you started. There's lots of other links, other places. And while I want to explore this in great detail, part of the day is if you know all this, great. If you got deep questions about it, ask them. But it's just a quick overview to say, hey. What's going in Apache Iceberg? Well, it's a table format. Hive is a table format. It's a way to store data in the data lake, have some kind of metadata that lets us logically map tables to it and then run queries. Whoo. Easy breezy. What's different in Iceberg and Delta and these other ones is that they move that metadata out of that metastore, and they put it on the file system and put them on the data lake along with our data. So what this is trying to suggest is there's a metadata folder and a data folder. We know there's not really folders in object stores, but I think you know what mean. Simulated folder. Things look like folder name. And we're probably used to the concept of the bot, bunch of data files, and we're used to the metadata probably living in something, maybe even a relational system behind something like the Hive metastore. The reality is all that data lives as JSONs and Avro files right there also in the data lake. I'm not gonna switch to the s three u, let you see that. I'm gonna let you just have to trust or come back and do some additional, materials with those, additional learning, and that kind of stuff. And I'm absolutely not gonna walk you through in great detail the difference between a metadata file, a manifest list file, and a manifest file. But what I will say is this hierarchy allows lots of good information about schema, about location of files, about statistics of those files, mins and max values as a great example that we'll show you as we go a little bit deeper in there. Alright. So patchy iceberg, a a data lake, a lake house, table format, all good stuff. And I was suggesting there's all that metadata, and I said I'm not gonna go look at s three, but I will tell you there are a series of what we call metadata tables. So via SQL through Trino, we can ask a lot of questions about what's stored in there. And one of the biggest things that's stored in Apache Iceberg is this notion of a versions. Every time we change the data, we get a new version. They use the word snapshot, snapshot one, snapshot two, so on and so on and so forth. So we'll see those as we go along as we need them. And then the the old metadata columns that you may be familiar with from Hive are still around. And, absolutely, we're gonna get a consistent cross, storage engine approach with concurrency controls and everything, how to do transactions. And that so a transaction will build a new version because the transaction is changing data. Just a simple query for atomic consists of isolate durable or an acid test transaction. I will say we're still today with Iceberg and Trino right here in that single statement, single table transaction supported by everybody. There are a few people that have ventured in to multistatement, which would also imply multitable things. In fact, you'll see the standard eventually tackle that very clearly and probably eventually all storage engines do that. And not only can we change the data, we can change the structure. That sounds pretty normal, but I don't know about you. I've been in this business for this part of the this type of technology for over ten years. And sometimes when you alter a table, that may not do what you really wanted to do because alters didn't historically change the data that was on you to fix the data. Icebergs in it kinda act like a database. It only lets you do things that it can handle because it has to make sure that always works, those previous versions and so on and so on and so forth. And you're gonna see this in practice. I won't talk about too much. Another cool thing about Iceberg is we can actually change the partition strategy at any time. Now just because you should doesn't mean you just because you could doesn't always mean you should, but there you go. And I'm gonna show you the little caveat that's calling out here. It doesn't actually change. It doesn't rewrite any data. It just remembers version by version where we are and all new data. Anything new at new day new rights, new ads, new updates, whatever are gonna always apply to the current version's partitioning scheme. And there are tricks if you need it to go back backwards and, roll it up. And, of course, versions give us time travel, query as of, or even roll back to a point in time. There's even more. There's branching and tagging and all that good stuff. But, really, what we're to do is do the exercises. So thanks for letting me stick a few minutes and go through my slides. I put them together for us, so why not? That was also secretly a chance for anyone else besides I think Monica and Greg, I think, are working with me. If anyone else was actually trying to get caught up, I hope they use that time wisely and didn't listen to me just now. They just went ahead and did that. Alright. So we're at the iceberg world. Deep breath. Have a sip of Pibb zero. It just means that Doctor Pepper wasn't on sale. But alright. No product placement this last year. Alright. Overview. A lot of stuff about the scope, etcetera. That video, I promise, you can watch that as a for you know, if wanna hear some more visualizations and stuff. Again, lots of information that we as a company can help you out with backdrop and building up to Iceberg. Get logged in to Starburst Galaxy. Make sure in the admin role, all that. Query a new table. Here's where at this point. Oh, good. We're gonna create and query new table. He's even saying, hey. Go to that, cluster temp catalog, temp first name, last name, postal code, schema that you have or whatever you have. We have that one. And I'm gonna copy this to the clipboard because we're gonna build something. Let's do it. Now I know what also just happened. I waited too long, and my cluster died down. So I'm gonna say create me this table called my iceberg table, and I'm gonna I will go ahead and venture and try to make it a little bit bigger with the hopes of the being more readable for those who are reading and not necessarily just going back and forth. So if you were like me, your server acquiesced down and spinning back up. So we're gonna say create a table. We're gonna run a quick, well, it's gonna verify. We're gonna add some records, and then we're gonna query it. So let's go ahead and get that ready while spinning up. So once it gets done creating, we're gonna add a few records, and then, of course, we're gonna do a select all from my expert table. If t b l. Okay. Schema must be specialized with Schema. Okay. Cool. He told me flat out, dude, you didn't you didn't either lock this down like I just did and run it again, or you didn't type the full name, blah blah blah, you know, temp cat, temp lester. But I went ahead and adjusted it. I got my cursor in the middle of functions. I'm gonna run the statement and run it again. And my table probably gonna be ready about now now. About now. Anytime now. There it is. It's done. Alright. I'm gonna add a few records. Hopefully, those that can, read those know what my favorite book is. I'll leave that for the comments in the chats for the folks that are just looking for something more exciting to do than type these commands, which I think are pretty darn exciting. And then I'm a run. It says three records were inserted. That's the kind of clue in the right over there. And then I said select everybody from my iceberg pay table. I got late Lito late Lito. Lito, Jessica, and Paul know, from House of Trades, if you're familiar with all that noise. If you're not, it's a name and an ID in the description. It's all good. Okay. So we're tracking along. Nothing too nothing too hard. Create a table, add a few records, pretty easy breeze. The tutorial's gonna tell us a little bit about the background, the metadata, and all that living together. I'm gonna say, cool. There are those remember I said there are metadata tables. Just for time purposes, I'm not gonna go back and show you every little thing in the docs, but they're all out there, and I'm gonna reach out to you. I'm a I ran this, and it's actually this this one that's called dollar history. You see it on line 23? You have to prefix it with your actual table name, table dollar history, so you have to put double quotes around it so it didn't freak out with the dollar. And what did history tell me? I pulled only a few of the fields, but, really, history kinda gives me a list of, these snapshots that are out there. Snapshots are those version numbers. And as you notice, they're not monotonically increasing. They didn't go from one to two to three to four. They're a variety of numbers. This first one is the snapshot when I created the table. Has no records. And the second one is, the version that got created when I added those three records. And just some subtle things you might notice, some lineage. It says, oh, okay. I was built on I I I follow this preview one. So they're always gonna have this notion of who was my parent. Yeah. Alright. So I've got some peek into the metadata. They're not gonna beat it up too hard. Let's peek a little bit more. What else is going on? I said there there are files under the covers. We can go to the meta we can go to the the data lake and look them up, or we can query our table, our metadata table called dollar files and say, what's out there? What backs this table right now? And in fact, it says, hey. And and I did a lot of nasty stuff because I just wanted to trim the file name down a little bit. Like I said, I didn't even trim it down, did I? Yeah. Oh, yeah. Did. I got it down to just the file name. Pretty long file name. So this file exists and has three records, and then there's some details about it, some statistics. There are more columns in this, but this is what I wanted to kinda focus on. And I think if you look at there, here's some here's some details. You know, I think I had mentioned a couple of them. I will just mention there we go. We'll we'll hang our hat on this. There's the inserts I did. And if you look at a couple of these values, I'll just focus on the last two. It says lower bounds, one one hundred two Jessica three concert. It says for for column one because Apache Iceberg does a a job that says, I don't care what you call a column. When a new column is created, I'll create a monotonically increasing number in case you drop it on or add one. So column one, which is the old ID, the lowest number it has in this file is 101, and the upper as you see what I'm hovering over, there's a 103. Yes. There's 10103. And while there's only three records, a simple file, you hopefully can gather that knowing that kind of information is gonna help an engine later based on a query, based on looking at this metadata to decide very quickly, can I prune out files from ever being looked at? And this prior to the modern data lakes, we still had this data in the files themselves, parquet files, work files, whatever. But we burn them up into the metadata because we can read that even faster and even not even just look into the parquets and work files and find this out, which is still pretty fast. But if there are thousands and thousands, it'd be a whole lot funner to just totally prune them out before reading them. That's kinda what's happening there. Again, anything that seems weird or conversational, ask away. I'm gonna do a few more things. Let's add a couple records, a few more records. We already had three, I believe. So we're gonna add into my iceberg table. Thufra Howett and some other people here, the beast rebann and that kind of stuff. We added five more records. We're probably gonna just verify that those are there. There's my select all from. Yep. Yep. Yep. They probably are gonna ask me to then look at the files again. Yep. Let's do that. So let's look at select file. What are the files that make up the table now? Well, the files that make up are that first file that had three records in this new file that has these wherever they are, those five more records. And I can glance, and I'll just look at that ID one more time. The lowest ID in this file is 104, and the highest ID in this file is the 301, the 104302 that you see up there in the insert statements above. So if you wanted to go look, could we go look at the data link? We could. But to be honest, right now, it's okay. But when we start making lots of changes later, the files that are there are not always the files that relate to this current snapshot. Lots of things can happen. So these things actually can be your friend. Alright. So that was, and then the history, I guess they wanted to show that. Let's peek at that. When I say they, I wrote that stuff and just call it they. They did this. Lester did that. Same thing as before. We got another snapshot here. This snapshot ID, which is the parent of this. So first one was added with no entries. The second version has three rows, and the third version has the, you know, five more rows. What is that? Eight eight rows in entirety. Cool. Interesting. Neat. What else can we do? Now this one didn't go too far. I just did a real quick, you know, create a table, play with it. So we'll say that's nice. We can close him out, and we will go back over here and say, okay. Cool. But let's do some more stuff. Let's play with our open in a new window. Check our time. K. I think we're doing alright. So this one's gonna do a little more. We're gonna play with the as it says, we'll modify the data. We'll play with the snapshots. We'll do some time travel and those things. Again, this is for those that are never done this, this is a great experience. Just kinda get your feet feet wet. For those that are already down that journey, ask some questions because, you know, I don't wanna make you suggest you do something that bores you to tears, but at least scan it and see if there's anything new in here. Same thing. We'll give you a quick video. You're welcome to check all that out. We'll tell you to log in to that Starburst Galaxy environment. We'll tell you to get to that schema, all the backdrop. And then we're gonna do it again. We're gonna build a different table this time. So I'm gonna copy that one right here, and let me put it in my query editor. So we'll scroll down a little bit here, let's go to the tippy bottom. Now this one, if you look at it, I used to work in phone company when I was a young man. So we'll do a old school, like, a phone provisioning table out there. So phone provisioning has phone numbers, an action, and then when did that action occur? We'll set it up with Iceberg. And I went ahead and set up some partitioning. Not enough time to explain partitioning. It's a way to store data into certain folders. So when you do a where clause later that aligns with the thing you partitioned by, we can look at even less data. There we go. That was the shorthand. Probably made sense to those that know about partitioning. If not, it's a it's a pretty important strategy, and data lake analytics has been around since Hive. So fifteen, you know, fifteen plus years at this point. Okay. So let's go look at those snapshots. Now I'm using a different table this time. There's I used history earlier. I'm looking at snapshots this time. It's kinda interesting. There is history in their snapshots. They have different things, but but nonetheless. There we go. Some details. Now I'm gonna look at this one mainly because it has this other column called summary. Me Let see if they wanted me to address summary or if I wanted you to. I sure did. That's the way I wanted to show you. So this snapshot came along. Remember, what is it? It's just that table up there. But the snapshots do have a little, summary, little JSON blob. And the main thing, if you look really close, it kinda says, look. I got nothing. Whole bunch of empty records. And then it has some additional details that maybe other engines don't care about, and you can do this, and other engines can add some additional data. We kinda say, hey. What was this engine? Oh, it was four Treno four seventy nine using Galaxy blah blah blah. But what I really wanna show is this. We include every time we make a snapshot, we add this little bit of information. We say that Trino query ID was this. If Spark reads this table, it's not gonna look at this thing. If anyone else reads, it's not gonna look it's not gonna hurt. It's for us. And what is it good for? Whoops. It's good for this. Later, when we're looking for things, if we want some traceability to go see exactly that work, I can kinda go up here into our query insights, apply a quick filter here. Let's add a filter on the query ID and this paste that query ID. You could probably have found that just looking. It's gonna trace us back to that operation that occurred, create or there's a DDL here, but likely it's a DML manipulation link updates, you know, delete, merge, that kind of stuff. And for those that know a little bit about all this stuff, you you have access to the query plans and all that good stuff. This one didn't get a query plan. It was just a general general ad. So that will be there always for anything that we did. Now, like I said, if we pick up another version that was done by another engine that we don't have that, that's okay. And there might be some additional information there, but it does help us for most of us who sometimes want that traceability back. Alright. Let's do some work to the table. I'm gonna copy a few more. I guess I could've tried to make this look a little bigger too. Yeah. Alright. I'm gonna add a few more records. I'm gonna add some records. It's empty. Things bone dry, isn't it? Alright. So I add a few records, and those records say, hey. Go into the phone provisioning table. Add me some records. I'm gonna toggle this thing. You know, there are all the normal sweet stuff. Right? Show me the details here. There's all kinds of little checkboxes. They'll say, hey. You know, show me about 10 records or something. There we go. We got two phone numbers that just got added as you see up here. And I went ahead and just tricked it. I kinda said, hey. You know, predate these. They happened a week ago. Somebody ordered a phone. I'm gonna show you that my phone system provisioning system is slow because thirty years ago when I worked for the phone company, getting a new phone line was not very fast. It could take a couple days. But nonetheless so we entered those records, and we did a select. You see them. And if we looked at the the snapshots again, might as well, as you would expect, We should see two snapshots, parent is the previous. And I'm gonna click on that summary of the second way and look a little deeper and see if I see anything. I mean, it's gonna say a couple of things. Look, I added two records. Cool. And those two records were added in a single data file. So if I ran that file thing, I would see one file that had these two records and stuff. And there's a lot of other stuff in here, but for the most part, I didn't do anything too crazy. Didn't make a lot of changes. I'm just kinda hitting on a few little points to show there are, you know, inherent built in, traceability bits all over the place in Apache Iceberg. That's Apache Iceberg. We do it because the specs has to do it, and that's a good thing. Everybody else can leverage that too. Alright. Let's add a couple activation records. So we'll go back up here. We'll do another insert. We'll do an insert and a select just to make sure they're all there. So insert two more records. Take a look. We should see there we go. Oh, uh-huh. Why do I not have two more records? Does anyone remember? I I said it during setup. I said, hey. There's gonna be something that bites us a little bit later. What was that? It was that my cluster is set up for, query cache, query results cache. So I'm just gonna go edit my cluster. Now I could do it the session variable called galaxy query results false or something like that, cache result. I'm just gonna universally say, boom. Turn it off. Now if I go back and run that query again, I should be able to see. And I left that that way because if someone's watching the video and they didn't catch that, they might be crying right now. Hey. Why did it why did it do this? Yeah. Thank you. Five minute cache. Yay. Yeah. You think so? There they are. One will you know, I ordered the two lines, and then it looks like a day later on the thirteenth, and they were activated. Whatever the whatever the stage changes are. Okay. Things are happening. Cool. It's a slow phone company that I used to work for, and we looked at the snapshots. I think so. Yeah. I think so. Oh, let's we did it. We did it. Why? Because I didn't show you what you might see. Similarly, I look at snapshots. There's a third snapshot. I'm gonna look at that summary one more time. The main thing I would point out is the same thing. Two more records were added. One more data file was record, but I overall, I got about four records. I should have a total of two data files. And then, again, as I scan the rest, not so important. If I wanted to go see that, you know, query ID and see how it ran and all that great stuff, I can go check all that out too if it you know, that was something I wanted to chase after. Alright. I don't know. Let's add yet another record. Why not? Boom. Boom. Boom. Why not? Somewhere around here. So let's just add a a phone number one that was phone number two got called. Somebody and, actually, you know, the records are gonna say, hey. Somebody called in and said, guess what? There's a problem. AirReport customer reports unable to initiate car. So act ordered and activated was the good the happy path for our phone provisioning system, but the customer reported something to happen. There was a failure. So we're just catching up and pretending all that stuff happened. Okay. What else can we do? Well, we should look again, but I know we can look at the snapshots one more time. We'll see the same thing. One more record was added. Everybody's happy. Move on. Move on. Alright. Let's do some updates. And here's where we're gonna start to notice some weird stuff for those that don't know a 100% what goes on in the systems and see if I try to explain it somewhere on this slide or next slide or not. Guess not. That's that's big big boy problems, big girl problems for another day. Let's just do it, and I'll try to say it briefly what it is. I'll need to look at the time too and make sure nobody's raised their hand. I'm gonna run two statements. Do an update for one one one one, that says customer, update notes. Oh, what am I doing? I'm actually backfilling the notes. If you remember a minute ago, the notes were all set to null. And then this guy had a note, like, customer called in. My business users realized they wanna know for these default fields. Maybe that's a smart thing. Maybe it's not. That's what they wanted. I argued with them. They said, let's do it. So there we go. We have some we filled in these, everything on the right over here from rows two through five. Right? Customer requested or numbers successfully activated. Okay. That's neat. That's cool. Why what what do we get from that? Well, couple things. We definitely start to see something weird if we look at the snapshots over time. So I'm gonna go find the snapshots and run that. And this happens because well, a couple things. How many do we have a minute ago? We had four. We had build a table here. We had add two records. We had add two records. We had add a fifth record. And then I got two more transactions or two more snapshots. And if anybody wants to type in there and say, why that's great, I'll go ahead and tell you the answer. After I wait a few seconds, I have a sip. We have two more, two more two more snapshots because I told you earlier, we don't do multistatement transactioning. So if you look really close, I did an update and then another eight page. So each one of those is related to them. So if I looked on them, they're both actually kinda funny. They have an operation. We haven't looked before. We just called it a PIN. We have an override operation, and I'll just pick maybe this first one. And so I get a summary, and he says, here's what happens in a nutshell. When we do an update or we do a delete let's do a delete. If we do a delete in Iceberg, what happens is we don't change the immutable data files that are out there. We build a deletion marker. So we say, hey. Cool. This file has a 100 records, and I deleted two of them. Great. Build a affectionately referred to as a delete file. Delete file that says, in that file, these two records don't need to be around anymore. Usually, we do them as a positional delete. It knows which records in there to say chunk. And that creates a concept called, we we tackle that update in a in an approach called merge on read. So we we said we can quickly make that change, but we'll defer the consolidation to a reader. So now someone does a select, they read this file with a 100 records, then they read this delete file that says get rid of two of them, and then they say, here's your 98 files. That's a big conversation, has lots of cool things, but I want to at least try to throw it out there. Let it sink in. So then an update is even crazier. An update says since we can't do an in place delete, we surely can't do an in place update. So we do two things. We create a deletion marker for that record or records that are gone, and then we net new add them. So in that scenario with a 100 delete two, what if we updated two now? Well, instead, we had a 100, then we created a delete file that says two records go away, and then we have a third file that says, here's two new net new records. So the system will take care of that for us, purge pushes the burden, down the line to the select person. And if I look in here, this should give me if that made a little bit of sense, this should this should echo something like that. I'm looking for this. Added records to I'm really looking for we did we did an an update. Two an update. Something I did two updates. So it's actually gonna say it's never gonna say necessary that I update to. It's gonna say, hey. I deleted deleted two, if I can find that, added records, total and it's all those two. There we go. Total positional delete. It said, hey. In this particular snapshot, I actually deleted two records. Okay. It also says I added two records because it remember when I said an update says market deleted and doing that new ad, put it all together at runtime. I'm gonna stop on that because there's a lot that's a if need someone, it's worth thinking about, worth exploring, but that's just fundamentally how it's happened. And for those that know what I've been talking about, there's another strategy. It's called copy on write, which just says rebuild those files at runtime, and there's some trade offs that could make a big difference. So really about how often is the data change, that makes you decide which one of those makes more sense. But for us, we'll keep it simple. They're merge on read. We'll make a, some trailing information that tells about the changes. And we'll let a select engine, assemble that for us. Alright. I'm gonna move forward. Yeah. There we go. Totally leads. There was a I didn't point out, but there was a delete file in there, and there was a add file in there for those two records. And then we did get it again. Okay. Marching forward. Deep breath. Look at the time. I got a 12:50 okay. We still got about thirty, thirty five minutes. I think we're in good shape. Hopefully, folks are still interested and still having fun. And if you are, let's do some time travel because that's where it's kinda starts to be kinda neat. Let's play with that stuff. Now I've been talking a lot, so first thing I'm gonna do is come over here and hit select. I wanna make sure this thing doesn't go to sleep on me. There we go. So the next step talks about time travel. How do we time travel? We do it one of two ways. We either time travel to a specific snapshot ID or we time travel to a point in time. So let's start with the specific version, specific snapshot ID. So he's gonna say, hey. Go run the snapshots table again and copy the snapshot ID for the second snapshot. Alright? Let's do that. Actually, I think I have that query sitting anywhere right about there we go. So I'm gonna run this. I will say there's a this is a little this one should have been drawn around this one. Long story story, there's more yeah. So read the notes, they grab the second one. So I'm gonna go here. This is the one I care about. So I'm gonna grab this on the clipboard. I'm just gonna go put it in here because I don't wanna forget it. There we go. That snapshot is I wanna do some time travel too. So what does it say to do now? It says, okay. To do a time travel clear query, we just add this cool four version as of that. So tell me what's in phone provisioning as of snapshot number two. And what's in there now is just those two ordered records. Why? Because snapshot one was an empty table created. Snapshot two was those first two records, that we order. Now we can walk each one of those, grab each snapshot, go look at it over and over and over. And I sort of v three. We have row level lineage. You can do all kinds of cool stuff. But what do we wanna do? We probably wanna show, okay. Well, let's do it against a point in time. Alright. So I'm gonna copy this, and it's really gonna be in the syntax that you saw there. It's gonna say, for a time stamp as of. That's a quick way to do that. I said a hundred and thirty two hours ago, which I think is about a week ago, if that's my math is right. And it did blow up, and it gave me a good blow up. It actually said, hey, dude. There is no version history for that table before, you know, what is that, a week ago. Now I only did this. It was a more of a kinda make a big, big point that what we're talking about here is not this. So if we the table has a dataset, a data column with a time stamp. We're not talking about what that is. We're talking about what that snapshot time stamp was when it was actually created, that kinda stuff. So if that makes any sense, then we could probably say, well, show me it as of, you know, one minute ago. One minute ago. And a minute ago, it probably has all five of those records with the updates. Yeah. Yeah. Okay. That's cool. And if we wanna kinda go backwards, we I've been talking a lot, so I'm gonna make it, like, nine minutes. There we go. Nine minutes ago, I did not do the update that caught up the notes. You know? Wanna go back maybe eleven minutes ago, maybe it was, yeah, just the first four records, the fifth wasn't added, so on, so on, so forth. So time stamps, this is pretty cool stuff. I personally recommend as technologists, let's be careful with this because often we get excited and we start saying this is stuff I wanna get my business user. I don't have a problem you letting your business user understand this. I don't really have a problem. You making it part of a process and a procedure and all that, but I do have a concern that this may not and likely is not the catchall all things for show me the data as of a point in time because there's a lot of overhead to keep track of all these versions, and we ultimately will probably tackle that from an operational standpoint of how often do we trim these and get rid of these and that kind of stuff because they waste not waste. They keep a lot of space to keep all those history. Deeper conversations why if you want to get into it. But just I would say be mindful. So if you need to have a system that shows you what the day look like last week, two weeks ago, a month ago, a year ago, I would account for that in my data structure. Is that a slow change in dimension type two? Is that, just formally snapshotting and rewriting that whole version every once in a while? Sure. Sure. Sure. Those are good options. There is tagging that I didn't get into great detail. We could tag a very specific version here and never get rid of it because it's tagged. There are various strategies, but don't just be we run and go, cool. This solves all the business problems I ever had about point in time. It could, but it probably won't because you probably don't wanna pay that much money to keep that much forever number of versions around. So there, enough warning because people get excited by this and rightly so should, but this is really a side effect or a cool consequence of having versions. Alright. Soapbox over. You don't get the soapbox in the tutorial. So let's do some rollbacks. Cool. Copy this. Alright. So the rollback is, you know whoops. I copied the wrong thing. Where is that rollback? Oh, no. Here's what the rollback, it wants us to do something. That's fair. It says it says, oh, we'll grab the most current snapshot. Sure. Let's do that. So what does it look like today? The very last one is this. And oops. I did that, and I didn't. Go control c. So my latest latest latest snapshot ID is that, and it's gonna say, cool. Let's do this. Let's do a a delete of something. Let's delete phone number to do do do do do do. Run that. If we do a select from that, you know, we're gonna see it's gone. Right? Select 10 records. Yep. Yep. Yep. All of the to to to is gone. Yeah. Oh, because 222 had activated, ordered, and then error. So somebody said, there was an error. Let's just clean up the report by deleting the bad record. And then you and I went, what did you just do? And we said, let's fix this. Let's let's roll back and catch it again. So this is gonna be dependent, of course, on your compute engine, how you write this, but they all have to support this beautiful feature called rollback. Whoopsie. Boom. Boom. Copy that. And I'm gonna replace it right here. I'm gonna say, hey. For the schema called I think my schema is actually has a different name. It tells me to say temp. Lester Martin, Russell, Georgia 3004. So it's gonna say, hey. Roll back to the phone provisioning table and that schema to this version. And if I did that and I run my select one more time, magically put everything back in there. Now, of course, it also means if I did 17 other things and I did a roll back to particularly when I lost all of those too. So there are other strategies like creating a branch, doing changes there, morphing those back in. So there's a lot of other options, but but at it's hard. This is probably more for the data engineering pipeline folks that realize ten seconds later did something stupid, or they have dependencies on other third party systems that we don't really know, that massive orchestration, that other stuff fails, and they wanna go, how do I get this super, super distributed, emulated transaction? Why don't I just do it roll back there? So a little work programmatically, and then get there. They're back in business, and we saw time travel. And we saw rollbacks. Time travels and rollbacks. Is time travel and rollback cool fun stuff? If so, say yes in the chats. I know someone's there. I'm gonna cry. The folks watching the video later, I'm gonna say, like, 12 people, not just Ren said, yeah. So, cool. Hey. That's good stuff. Alright. What else can we do? Well, if I waited to activate instructions, we could do and this is the one we'll finish on here probably. And then for those that wanna do some iceberg v three stuff, I'll make sure you get started off running. Yeah. I didn't have to go back to the silver layer to get the time stamps. It's yeah. Cool. A lot of people liked it. That's good. I like it too. I might I'm always pragmatic about everything. So I'm always scared that maybe someone misunderstand it, but I think everyone here definitely got and understood what that was all about. So awesome. Let's do some advanced features. So what are we gonna do with advanced features? Well, we're probably gonna do things like evolve our schema, mess with our partition, evolve our partition, and, since performance observations. But to get the best performance observations, we really need to do a thing called table maintenance. We need talk about why it's important and show you how easy it is. And if I have time, I can put my sales hat on and say we have tooling in our thing that you could just make it something we tackle for you and just say we take care of it. If not, you build it as part of your pipeline, and that'll work. So let's do some of that stuff. All the next are gonna give you a nice obligatory video. They're gonna tell you to log back into that Galaxy environment and go back to that, schema that we created. And guess what? You know? I know that I created yet another civil silly simple, workset dataset here. We're moving on from the dune one in the first one, the phone provision in the next one to something. I also grew up around a lot of aviation stuff as a kid. So, that said, I'm not a pilot. My brother flies for American. If you hear, and this is your captain, my name is Eric Martin. Probably time to get off the plane. That said, FAA, let's imply he's probably he's probably legit, but I'm not sure. So he's my little brother, so I don't know if I wanna let him fly to me. Okay. So we got a table. A lot of commentary for sure there. Thank you, Greg. I appreciate oh, everyone's so kind. Thank you. We created a table of planes, and we added some some quick records in there. Just make sure we look at them, make sure they're all there. It's like all from planes, and, yeah, they're John Travolta. And these are all good real numbers. I looked them all up, tail numbers and all that good stuff. Unless John Travolta got rid of the seven zero seven or something, but, there we go. So we built a table. We're happy. We added some records. We looked at it. Yippee. We've done that a few times now. What else could we do? Well, what else can we do, Lester? Well, we can realize, hey. You know what? I messed up. Simple stuff. Right? Rename name to description because that sounds a lot better than than whatever it was. What was it called? Name. Yeah. The name of the plane. It's really a description. Right? John Travolta's Boeing seven zero seven. Yes. That sounds better. What else could we do? We could absolutely we just select on that. We could actually add some more records. I'm trying to think why well, don't ask why, Lester. These instructions you made a long time ago made sense then. They're probably still making sense now. So add, what, one more record. Make sure that's there. Some legit plane. There we go. Sykes 75 Bombardier jet, light jet. You know? And in this one, I probably realized, oh, man. We're getting crazy. We're evolving our scheme. We're getting crazy. We're overloading that description field. Why don't we why don't we break that apart? You know? Why don't we add some columns for all that stuff? Sure. Let's, alter it a few times and add various columns. Let me run a select again. And good news, bad news, it didn't do anything. Right? It just new columns and, you know, and all and scheme evolution rules really are what normal forward moving scheme evolution rules are really what Iceberg likes to let you do. Add things, make things that are easily up you know, casted in the right direction, right kind of data types, so insta big ints, you know, that kind of stuff. Thousands of a second to bill millions of a second precision, all that stuff. They don't like to go the other direction. That's another problem. And the way to solve that usually is to do a drop and then maybe an ad if you really, really have to get there. And that gets a little harder. You have to do something to clean it up. But but, nonetheless, we add a few things, and I might as well drop one more thing. I shouldn't think about that. We realized, you know, that color. Who cares? Color may could change all the time. So we'll get rid of that column. There we go. It's our big table. Now all that's null, so let's fix it. You know, somebody's fixing not just the schema. They decided to shore up all those records. I'm a do a three quick updates to the planes and do another quick select there. And let's make sure everybody looks a little copacetic. There we go. So it look like a real table. Right? Tell numbers, descriptions, class, year, make, model. Alright. And we can still have description, but like this one, we don't have a description left. We unwound all that and, you know, found a good home for it, something that we can report on and query on. Everybody's happy. I'm happy. Are you happy? Greg, are you happy? Because I'm happy now. Let's see if Greg is still I don't know if Greg's happy. Okay. What are we gonna do? Let's do so we evolve the schema. Not too hard. Not that crazy. Why don't we evolve the partitioning? Now does anyone remember if we had any partitioning? Well, we could do a show create table and show what we saw. We could scroll back. I'm pretty sure we did not indicate when we created this table any partitioning. Yeah. Wherever it was. Right? There. Yeah. We didn't call out any partitioning. Good or bad news is you always have partition. You just may have a partitioning called no partitioning or default partitioning or something. So there is still partitioning, but let's change some stuff. Well, another big one, I guess, I should do go into. I realized it's not really planes. It's a aircraft now because I really wanna you know, if I wanna select on planes, it'll say no such table. So, aircraft. There's the pay the data's still there. That's not a shocker. Here we go. I wanna add some partitioning. There are partitionings away that if you have a where clause later that used used a lot of times, that it can eliminate looking at all, like, at a mass level. So we added partitioning that said the the the column called class class was light jet, heavy jet, jet airliner. Those are classes. We don't have a lookup table or dimensional table or anything like that. We just have a preformed field. So we added some partitioning, but as I mentioned, like, at a 100 miles an hour earlier, I said, changing the partitioning strategy doesn't really change anything yet. So how can we verify that? Well, we can do we can do it all at once. We can say, let's add two more records. Yep. There we go. Insert into aircraft to insert two helicopters. You know, we can select our aircrafts, two more helicopters. And we can ask just like all those other metadata tables, there's one around partitions. So let's ask it. Tell us about right now, current version. Tell me the details that are happening in partitioning. And what does it say? Well, it says this. It says, well, remember earlier when you created six records? Well, you didn't have any partition. So all six records are in, arguably a partition of that time called nothing. You know? It's just kinda funny how it writes in this case, so that's okay. But it really showed us that, hey. You added two new records, and they happen to have a part you know, that there was a class. They all had class, but it mattered now. So it depends on what the what the definition of the partitioning is at right time of making sure it gets there. And there's a darn good reason. I think most of the folks on here would say, I think I get it, Lester. If I've got terabytes, petabytes, exabytes of data for years in the past and I decide time to change partitioning, that thing may never finish, you know, if you have to go back and rewrite everything in the new files and break break them apart. And that is the short answer. That is the answer. I will say shortly, you'll see it. You can nudge the system for something to occur to let the partitioning kinda get re rebuilt if needed. And sometimes it is, and I will go ahead and say it's probably by partition by partition so you can rewrite little bits, not the whole thing, but we'll save it for a little bit more. Alright. So there we go. We added partitioning. We had no partitioning. We had two records. We said what they look like. It all looked good. We then said, yeah. We said, okay. Wait. That was wrong. Help me out, Lester. Change the partition again. This time, market for year. Okay. And then we said, tell me what's going on. And we still have the same answer. Nothing new has been written since it's changed, so we still have two file two records are over there in a partition called classes helicopter and six that are, you know, just kind of in a default partition just lumped together. Okay. That's frustrating. I think we showed that there. Yeah. Yeah. Yeah. And then, of course, if we added a few more records, select your account from aircraft or by oh, guess we just want to see the breakdown because we're gonna we're gonna play with it here in a second. I bet you a dollar. So okay. Lump them. Let's see what we have because we know what we logically want. We logically want for the five records we have. They're each have an independent year. So, logically, we were thinking put one in each partition. Good strategy when I have a big data table with five records. So, yeah, have a little fun. Alright. So let's let's do this. Let's insert five one, two, three, four, five net new records. I'm gonna look '64, sixty nine, six oh, and they're all the same year, it looks like. Okay. So let's do that. Copy that and run that. I went in. These are all nullable, so who cares? I'll just put some baloney data in there. I guess they're not nullable, but values. And if we did a select, you know, you you know what's in there. Right? Select all from aircraft. Should have a bunch of records now. Right? 10 records, five, and then those weird five. But if we looked at the partitions now, what might we see? What might we expect? If I had more time, I'd make you tell me what you expected, but let's see if it made sense. New records were added, so they should have aligned with the new partitioning scheme, but previous records probably didn't get changed. And that is what we'll see. We'll see originally, there were those six records that we added when there was no partition that's down here, right there. And then we had two more helicopters when it was based on helicopter on class, and then we added, you know, new records that end up getting partitioned by year one, two, three, four, five. So looks good. I think it looks good. Looks right based on semantics of how this is gonna work. It also implies what I said earlier, just because you can do something, probably not a great idea. Partitioning strategies are probably one of the few that is the biggest, in my book, the biggest performance opportunity we have at tables of size and scale. There are other tricks, absolutely, but partition is an important one. It's the, to me, the biggest important one personally. And, likelihood of us changing that that many times is rare. Likelihood is we don't know what it should be till we get going. Then we go, this is what the big win would be if we adjusted. We see somebody wears this certain way. Alright. So they're all there. They're all replaced. They're a mess. What can we do? Well, let's let's talk we're gonna fix remember, we don't have to fix those. We're gonna fix them by fixing something else. What's really going on is this, especially in this scenario. We've created a bunch of very tiny files, and they're very tiny because we're doing onesie twosies. Right? But, arguably, no matter what, it is highly likely, that tables, especially they're ingested frequently with not the biggest velocity is high, volume is not so high. So what do we start to have? We start to have a lot of small files. If you're not familiar, with the small files problem, I have quite a few blogs and other things about them. I'll let you come find me. We'll talk about them. It's different than the HDFS small file problem, but similar. Ultimate, said, show me what's out there. And we have about what is it? Fourteen fourteen rows, there we go, out there. And as you could imagine, most of them have one or two records. So they're inherently gonna be pretty darn tiny. Right? 500 bytes and 1,200 bytes and that kind of stuff. These are just all the files. This is gonna happen. This is gonna happen in reality. So what happens? How you solve that? Yeah. I keep trying to get ahold of my my little screen here. My little friend. There we go. Well, anyone know the magic word I'm looking for? Starts with c, ends with starts with Com, pack, and ends with Sean. Thank you, Riley. Compaction. Compaction is what? Reading a bunch of small files and rewriting them as fewer larger files. How do we do that in Trino land? Well, we run a command that looks a lot like that. Now that command has a lot of flavors. I used it in the bare bones that just said, look at everything and reoptimize. It's already done. And what it should have it done, it should have went from 14 tiny files to fewer larger files. I still have some tiny data. Thank you, Monica, also. And is that true? I have less files. I only have five files now. Why five? Why not just put them all in one? Well, that is a question. If anyone has the answer, I want them to tell me right quick because I'll tell you in a second. If I can just get hold of my there it is. So it did rewrite them. These are the current files that make up our two, four, six, eight, ten, ten records. So it's five in those five. Yeah. Yeah. 10 records. And the reason why there are five files and not one file is well, for starters, in this small scenario, even if I did not have partitioning, I like what I got at least four just because of how our engine works. So this is this is really a good tool for I still have a big table with a bunch of small files, and I'm gonna in in practice in size and scale, it's gonna work out really well. And it's really, really tiny. No matter how small the cluster is, we're still not gonna perfectly put in one file. But what really happened here is that if we look at the file names, what's in the file name? Year equals 1969. So we still are having partitioning that play. So we actually said, go look at everything. Everything was a candidate to be compacted. And because it was compacted, it's being rewritten. And because it's being rewritten or written at all, it says what's the current, current partitioning strategy, current partitioning strategies on year. So it wrote them all, in those year buckets. In fact, I meant to run the file one, not the other one. Let run that one. There we go. Net net something different. It says these are the five files. There's only two. I would have still seen the year equals. And then I was meant to then go back around this one, which I think is what I had on screen. Show me what's going on by partition. I had to make five files because there was only 10 records each, you know, only two fit each partition, so on and so forth. So what did I inadvertently kinda give you a hint toward? One way to, rewrite data to the new partition strategies, make sure that so let be be creative with that execute optimize command so that it actually targets datasets that you might want to rewrite. I'll leave it at that for q and a if one wants to go there. If not, I'll just sound like I'm rambling. Okay. So we did some compaction. We looked at the partitions again, and there are other activities in here. There's four, five, or six more pieces of of let's see if we do any of them. No. We kinda stopped there. What else can we do? A big one is what we can do is we can roll off snapshots. We can expire snapshots. Maybe we don't want an infinite number of snapshots. Does anyone know why? I already said the answer. Why we probably don't want an infinite number of snapshots. Let me type in there. It starts with s I and ends with z e. So probably size. You know? See, I got a blog post out there. Yeah. Our blog post said that snapshots don't necessarily affect performance. They affect cost. They they affect footprint. That's not perfectly true, but it is true in its spirit of the what it was trying to explain and share. Good callouts. Yeah. Yeah. Yeah. Alright. We're gonna stop there. I think from the answers that a lot of people gave me, they know a lot about this stuff already, which is really, really awesome. And I wanna make sure we have at least some time for some q and a. I will look before I open it wide open and say there is one more step if you're interested in, in, iceberg v three. I got another tutorial. It looks a little different because it's on Lester's tutorial side. I didn't get it into our our tutorial side here. But in that same environment, you already have set up right now while it's fresh. If you wanna click through this, you already again, you already set up most everything. You can, kinda compare and contrast how delete files work with v two and v threes, the move to a thing called deletion vectors. Show you how we implemented the variant data type, and there were some examples, have fun with them, showed a couple other data type goodies that we did there, and then kinda ended with this notion of row level lineage. What you saw so far is we have version lineage, this version, this version. But if you really wanna know what exactly happened between version one and version two, What you can do is do some kind of diffing between a and b, and that may not get what you want, or you can rely on version three to say explaining great laborious detail. Bro, how did we get here? And I'll go ahead and steal the punch. It's full cool and interesting today, but I believe most of this work is, as I say here, foundational knowledge, probably to help with building really fast, really lightweight, incremental, materialized views, a space that Apache Iceberg still needs to make some progress in as well. So this is, I I think, a step in that direction. Science fair wise, it's a whole lot of fun. I hope you like the lab if you choose to do it, but don't feel obligated to it. And then lastly, do the teardown instructions. What are the teardown instructions? Throw that catalog away. In fact, I don't wanna throw them out away because you're gonna might have a question, but I would just go to the list that I would go to the list of stir of clusters first and then look at that free one because in there, you can't delete something that's not gone. So just come here and uncheck it and then save that, and then you would then you can delete the catalog. I think it might even nowadays let you delete it from here for your let me see. Temp. Let's see if it just gets smart and lets you notice. There we go. I think you can edit the configuration and then take it out. Yeah. Yeah. One way or the other, you gotta get it unconnected. So if I try to delete there you go. It won't delete it. I have to go pull it out. Oh, yeah. I can do it this way. I can say add the cluster, which is actually taking it out of a cluster. There we go. That's how you can trick it, and then you can delete it. Okay. Q and a time, comment time, all that good stuff time. Definitely, again, my name is Lester Martin. If you don't know me, reach out there on, reach out there on LinkedIn, connect with me or DevRel or any of the emails you saw in here, lester.martin@starburstdata.com. And Quincy is reminding us, if you do have some, you know, live who wanna come on screen, I don't know. It's it's something we just started using in our software here. Feel free to give it a try. We're gonna pause here. Just stay quiet for a few moments in case anyone's interested doing that. Again, don't feel obligated. Don't know how say it's in the There we go. Questions coming in. Yeah. So Riley says, how do partitions, relate to indexes? Are these two separate concepts, and our index is determined by partitions and iceberg? So that's a awesome question that usually we get when we start looking at data lakes from RDBMSs to the data lake systems. And I will go and start it with, there aren't true or similar exactly the same concepts that you see in indexes as in relational databases because, arguably, I'm not a I never wrote a database kernel, but, secondary index is usually basically, like, another table that's organized by that field, and then the payload is a pointer to those other records. I know it's probably more complicated than that, but that's fundamentally how I visualize it. Data lakes don't have that. They don't even have an index. They don't even have a a true primary key. And I think Iceberg thinks they wanna do that, but there's reasons why I think that maybe we should never do that in data lakes. It's about size and scale. So how do you relate there? They they do and they don't. There are strategies. So partitionings really are more of they're all about our filters, about where clauses. Partitions help us when someone does something like I almost always put a where clause that points me to like you saw there. Maybe I'm maybe I just do work on helicopters. So if almost everyone is looking at their class, then maybe a good answer is let's make a partition by class so that the engine at query time can quickly go, Only look at this folder or, you know, two of folders. That's not a perfect one because we do like low Courtney only feels low Courtney. We do like the concept. I do like the concept that maybe we add a new partition partition every once in a while, and maybe we stop playing with the partition. So the old classic time series immutable datasets like transactions from ATMs and sensors from cars often get partition at some kind of granularity of the time stamp, a day, a week, a month, something because people are often looking for windows in times, and it helps a lot. So so it's not in the a partition is not an index for sure. It doesn't really work the same way, but it is just trying to do the same thing, trying to it's trying to do less scanning. There we go. An index is often helpful to find a particular record, and there are other things we can do to simulate an index, and a lot of that is about sorting. Those files themselves can be sorted independently or they can be sorted holistically. And because the files have metadata about ranges, like, maybe we're we're ordering this thing by a customer ID. So when you're looking for a customer ID for 4,400 and every file the first file has, you know, zero to a thousand, the next one is thousand one to 2,000, we can start to at least quickly the same kind of thing. We're trying to prune out the things we don't wanna scan on, find something a very particular. So those ranging with sorting can get us the concept like indexing. There's also one called or rather called bucketing or hashing or clustering. There's a couple different terminologies where we basically create a hash on a value and then put it a certain way, and then the engine notices, oh, okay. You're looking for 4529? I know exactly if we have it, I know exactly where to go look, for that. So there are ways to get you there, but there's not the true index or multiple indexes. And all these tricks really usually have one approach. You can only partition one way. Can You only assort one way. You can only bucket one way. And sometimes, especially with sorting and bucking, they don't bucketing, they don't play nicely with each other either. So there are ways and tricks. I'd love to, connect with you and, you know, talk through this a little bit more or give you some resources if you want. Okay. Another question. And, Riley, you're welcome to say that wasn't enough. Monica, how do I remove a duplicate and keep only one? Well, it's, the the deducing problem. So I would say iceberg, just like hive, doesn't necessarily on its own resolve the duplication. So it depends on where you try to dedupe prior to inserting. You're trying to do, like, an upsert. So theirs talks about theoretically upserting. But to be honest, my take on that is that's kinda you kinda own that responsibility in your pipeline. Like it or not, often, people do if it's one thing to say, I'm getting file bunches of files added, but what if I'm catching up another system? Then what I mean by that is maybe every hour, we grab the deltas or a true CDC or something. Commands like the merge statement might help us not not introduce a duplicate so we don't create that problem by changes that are occurring, but it doesn't necessarily inherently just dedupe out of the get go. So there are functions that exist that can help us find those things, little win you can do windowing functions to find them all. That could be expensive depending on how big the data is, but so ways to find duplicates, but then you would really just need to decide you'd have to decide which one to delete things. So it's either catch it up front in your process or, you know, after the fact of the process. So that's a terrible answer, but it is the reality that I know, that I've done in the last decade plus. Our iceberg table is probably for appropriate for small datasets, one k to 14 k row tables that are costing you. You know what, Greg? I would say, Lester, and I think most people would tell you, that data lake analytics in general aren't a replacement for relational database systems. So I'm gonna answer very directly, but I'm gonna give you a caveat. Nah. Probably not. Now. Probably not. They're just not designed for if I was Amazon in my in my inventory master, I'm changing that thing constantly. No. Let's don't do that. If I'm a credit bureau, and I build these big credit repositories and people that can call in and by law, I have to change things. Great. So we are looking for a percentage of records that are changing and a velocity that makes sense. And this thing can go faster than I'm probably suggesting it could, but it is not a general purpose replacement for a classic database. So if it fits on your laptop, it still fits on laptop. Okay. Why might you say, okay. I'll still do it. Well, rigidity drives agility. The less tools we give ourselves, the more time we can focus on business problems and less on technological problems. So, absolutely, in any modern data lake, there are a lot of tables that are small. They're often dimension tables and other things. But the core tables, those big fact tables or whatever they are are often gigantic, and that's why we went to data lake. So, absolutely, things that are tightly fit within each other make greater sense. But if I just had a 100 or a thousand tables with a million and a half rows apiece, I don't know. Maybe Postgres solved that problem better than, any data lake solution. So I'm sure the sales team was like, Lester, what do you say? No. They they better say that too. Alright, Greg. That was my answer. Have Quincy. That's me. We have one minute left. That minute's probably gone by. Riley, thank you. Yeah. I ought to back up. Thanks, Vince. So I'm gonna ask the question of Quincy. Did we did it already cut off or we or if anyone's out there, maybe just tell me you're still out there. If so, I will try to continue to ask. You're still alive, says Riley. Okay. So, Quincy, I'm gonna keep trying to see if I can make sure I got all these taken care of. And then when they dry up, I'll disconnect, and, hopefully, they'll make it into the on demand as well. I'm gonna backtrack to make sure everybody's Riley, duplicate data, small tables, minute only. Great. Thank you, Riley. Lindsay, you also hear listeners from yep. How to backup? Thanks. Makes sense. You're still alive and perfect. How to backup? The only question I see left. So, Monica, my statement of this, just kinda like about everything's different. Right? If we're going from a database to this world, there is a difference. And I'm not gonna say you can't back up, but there's not inherent backing up concepts by that. What I mean is every time we version, you're back you're making equivocally, like, an incremental, dump or whatever you wanna not incremental dump. Backup incremental full backup incremental backup. You're creating a long trail of incremental backups, but you still are pointing to a place, meaning all kinds of things. Delete there do a delete all. Everything goes away. Well, good news is you can roll back to a second ago you're there. But if that whole object store that everything's sitting on, that whole repository disappears, so Amazon AWS East disappears, good news is while you may be in trouble, a lot of us are in trouble if Amazon's East data centers go away because everyone in the world's relying on them. But if you really, really, really have to, you really, really wanna put your on prem, MinIO and Amazon s three, There still is probably some third party tools that are trying to help you with that kind of stuff, but I I still think the simplest and best answer is own that in that pipeline and realize, you know what I'm gonna do? I'm gonna push two places. I know that may not sound like what you wanna hear, but that is still a very good classical answer. But Iceberg could do other tricks too. You could branch things and then, you know, for you know, make a copy of that somewhere else. But, ultimately, I would bake it. If you want a true 100% true old fashioned backup, you need to rewrite the data one more time, one way or the other, because, inherently, the backups are built on remembering all these immutable datasets and counting on them not going away. And in some ways, that can get your Doctor problems. If you can tackle the the at the data repository and solve your Doctor checks, your all that good stuff, then then your application gets checked off the box too. So hopefully, that helps a little. And I am I think I'm caught up on questions. I'll keep looking at them in corner of my eye, but I will say I really appreciate, I love talking about this kind of stuff. I like being a little goofy. I hope everyone that's still around had a good time. I hope everyone picked up something today. Maybe everyone didn't love every moment, but I hope everyone it was a good use of your time. And if it wasn't and I can make it better, ping me. Reach out to me. I'm glad to talk to you about anything, even mister Pibb or Pibb versus doctor Pepper. If you wanna have that conversation, I'm always great for that too. Other than that, I think Quincy and I will sign off, and we'll see y'all at the next if you wanna come to the next one, in the docs, there's a we're doing our office hours in two weeks. That's a little bit shorter. It's like a thirty minute session. We'll do ten, fifteen, twenty minutes of demo and then open mic questions. We're doing that on our back and ABAC and column, a lot of security stuff. And then that same week, I think we're doing another workshop about migrating from Apache Hive to Apache Iceberg. So keeping on the iceberg thing for the next one. So we love to see you one of those. Go to our events page. Talk to us. We we love to talk. We love to share, and I thank you for your time. Bye bye.