Video: Starburst Workshop: Migrating to Apache Iceberg from Hive, Step-by-Step | Duration: 4628s | Summary: Starburst Workshop: Migrating to Apache Iceberg from Hive, Step-by-Step | Chapters: Workshop Introduction (9.6s), Setup Starburst Galaxy (112.305s), Creating Temp Catalog (404.2s), Configuring Cluster Settings (576.29s), Exploring Trino and Iceberg (732.235s), Starburst and Iceberg (865.06s), Iceberg Table Demo (1064.175s), Migration Strategies Explored (1345.525s), In-Place Data Migration (1957.69s), Data Type Challenges (2073.345s), Data Type Conversion (2269.95s), Bucketed Tables Explained (2460.01s), Bucketing in Iceberg (2830.825s), Upgrading to Iceberg (3232.545s), Migrating Large Tables (3351.65s), Data Migration Strategy (3610.99s), Q&A Session (3948.81s), Community Resources Recap (4485.67s), Collaborative Problem Solving (4537.03s), Conclusion and Thanks (4564.75s)
Transcript for "Starburst Workshop: Migrating to Apache Iceberg from Hive, Step-by-Step": Hey, everyone. My name is Lester Martin. I'm, your dev advocate here at Starburst, and, we're gonna do a web a war a workshop webinar today. So this is, I think, the second in the series that we've started up recently. We've done things like this in the past. We're just trying to have a regular cadence of picking something interesting, but doing something that's not demo only, something that you, the viewer, the the participant, can follow along with us. Do it now, do it later, do it never, whichever is the most appropriate. And the topic today, as you see on screen there, is converting from or migrating your Apache Hive tables to Apache Iceberg. So I'm gonna make some assumptions that you know already a little bit, at least a little bit about Apache Hive and data lake analytics and all that kind of good stuff for sure. So if you don't, you know, we'll pick up some along the way. We definitely can respond to chat either during the session today or or afterwards when I review the chat if I'm somehow missed you or something to sync up with you. You're also seeing at the very bottom of the screen an email address that we'll find myself and a couple others. But you're welcome to say, hey, Lester, if you send an email to that or, hey, I saw you know, if you need some help. So that is just kinda like a I don't wanna it's like the first mile and the last mile. If you're looking for some help from Starburst or about Trino or something, that's one of many resources, but it's one that's, you know, kind of a direct resource too should you choose to use it. Alright. So you learned a lot about me, good old Lester Martin. What we're gonna talk about today in all of our this workshop series, the intention is not to be heavy on slides. Sometimes we do none. We did none last time, I think. This time, I have a few slides because the the subject matter does warrant x times to stop. 
It just conceptually talks some stuff out and then go do it. And so we'll do that. But I wanted to go ahead and before we even jump into it, let you know since this is a hands on environment, the reality is in the lab in the, the session you're seeing, there's a there's a chat session I'm sorry, a chat tab and then a a docs tab. And if you'd navigate to the docs tab, there's probably something that says instructions. And if you click on that instructions, you should see something like this. You should get routed. It's actually a a GitHub page. So if you get a, you know, a GitHub error, just hit refresh or something. Sometimes they have those too too busy or something like that. And, really, what it's gonna say is, hey. Just like the last one we did on this, we're gonna use an environment. I'm gonna use the environment called Starburst Galaxy. You can use that because all the instructions you hear are geared around that and help you get it going. But if you had your own, you know, Trino environment with an iceberg connector configured up, you should be able to do these activities in that environment as well. That's something you choose to do. Now last time, it took us a little time to set this up. Not very long. The setup really says, hey. Go over here, and I wanna I I wanna just kinda kick this off a little bit, pause a little bit. I might call the group and see if we're ready to proceed here. But generally speaking, you wanna just click on the sign up links, you know, if you don't have a Starburst Galaxy, and you don't really need to tell us a whole lot in email. These would be a work email or non Gmail, you know, if you have something that looks workish. I don't know how far our checker goes. But that'll just if you fill that out, you're gonna get a good old fashioned, you know, email that says, cool. You know, here's your six digit code. It'll give you a link. It'll take you back. You type those six digits in. 
And then, in fact, it's gonna say, you know, tell me the domain you want to create. If you look at mine, it appear mine is lester. It's probably hard to see, The top left corner there, lester.galaxy. So you're gonna pick and you'll see it visually how to set that up. So I'm encouraging you if you haven't done that, which I think a lot of you may not have or you don't have an environment to work in, do that while we're talking. I would even start that right now. And I probably won't give you as much enough time as you need, but I'll start my conversation. You can kinda, like, double dip, kinda work on this and listen to that vice versa. And, again, as I said earlier, you can work on this how you see fit. If it make if you're the kinda person that would rather watch today and see go find the resources as we go so you know where everything is, and then you finish the day and you go, did I learn what I wanted to learn? Great. I'm done. Did I wanna try it out? Great. Try it out. And there's always folks that, like, I'll do it all all at once. So, absolutely, for you folks, I want you really, really to be setting that up right now, getting that environment set up. The good news is that's the that's the pretty quick part, getting Galaxy stood up. Even on this instructions page, you know, that sent to the environment, there's even a tutorial. You're welcome. I think, like I said here, the process is pretty self explanatory. I'll bump this up too. Boom. Boom. The process is rather self explanatory, but there's this tutorial here. You're welcome to look at lots of screenshots if you need them. But just don't feel like you need to do all the steps of this tutorial. It'll walk you through connecting to a database and some cool things. And once you're all done all through with that, what you end up having is something like you have see here, Lester, I'll log out and log back in. Eventually, you know, you'll get to the login screen. 
You'll sign back in, and then you'll see something like this. Now just for those that know that love that stuff, we have a pretty cool dark mode. I'm gonna keep it in light mode just because I think it plays a little easier on the eyes and on the recording if I do it in bright mode. So for those that love that other stuff, have it. Once you get that going, though, it's gonna throw you in the query editor, and it's gonna, run against some data called our sample. So there's this down there, the main called sample with burst bank and stuff. So we'll have some queries and validate it runs. But what you need to do next is you need a because we're gonna be writing some data lake tables, some some We're gonna build some Hive tables. We're gonna convert them to to to Iceberg, that kind of stuff. So you need, some kind of writable access. So if you've got a could set up, if you got a Trino or get Starburst Enterprise, Starburst Galaxy, you're already set up and you can write iceberg table somewhere ready. Great. You're already there. If not, I did give you another link to another tutorial. Now this one, I went ahead and I'll just look at it. It's worth taking a look at because the notes are really in here. So I'm gonna go here. And generally speaking, what it tells you is I'm just kinda thumbing through it, and I'll show it to you. You know, it says get logged in. Yeah. Yeah. Yeah. And then it says to go to the catalog section. So I want you to see this. This is right here, data, catalogs. And then here we go. It's gonna say, hey. I'm gonna create a catalog. If you follow the instructions in that tutorial, and you can just reference it briefly, you just wanna say, hey. S three. And then there's a screen that's gonna help you fill out this page here. Mainly, it's gonna tell you use the name temp cat for the the the name, and then it's gonna say use AWS access keys. It's gonna show you those. Yeah. I'll show you over here. So you're say, hey. Yep. Create that catalog. 
Push to Amazon s three. Call it to catalog. It's all you know, it really will take about three minutes tops to do it all, and then it's gonna say, cool. Let me help you. You here's the the s three credentials that will work right now. Now these aren't gonna work forever. They are gonna work right now. They work for us today, and I'm gonna make sure because I have the same setup. And then it's just gonna give you the details for what else to reference everything else. It's really just pretty much most of these are defaults. Turn a few switches on. And then when all that's all done all through, it's gonna say, hey. Hit this beautiful button called test connection, which will hopefully put a big green bar. So let's go look at it. Don't mind while I'm talking here. So anyone that's trying this and having trouble screaming out or anyone that's trying and you're wanting me to make sure you I'm waiting for you. It's okay to type. I'm I'm working on it right now, Lester. If you're not, that's okay too. It's up to you. Now I will go ahead and just show you. I'll discard mine and say, I already created temp catalogs. Tip cat. Just so you can see it, I'll go to the edit screen. And, again, this is really all it's gonna make you fill out, a name, some credentials, oops, some credentials which would give you. Go ahead and use that name just to help, you know, make it work everything smoothly. The tutorial might say set it to hive. I'm pretty sure it does for a reason. Default's iceberg. It doesn't matter for us. Let it default iceberg is fine. If you're here and if you're thinking about this or watching this video later, I would even recommend we'll leave that out of the conversation for now. I'm sorry. And then, ultimately, you know, it's gonna there's a pull down here, and I've already selected Starburst Galaxy for the metastore and then just pick these use these very specific names. 
These will get you rooted into an s three bucket to a folder name that this user has permissions to do some writing. Once that's done, really, there's just a big button here that says test, and then it'll it'll say cool. It looks good. And as this guide here shows, close that one down, it'll it'll drop you into the says permissions, and we're gonna if you do it my way, you're gonna be all super user and everything. You just know, yeah. Save it. We're gonna keep it simple. Save it. And then they want you to add it to a cluster. Now cluster, the good or the bad news is that was part of the other setup. I'll help you that if you're having struggles on the way, we can stop. If anyone's having trouble, that tutorial said something like, you're you're when you set up, you'll probably have clusters that look like this, one called free cluster. And probably the free clusters are not in the region I want you to have. So, that guide talks about building a cluster, the very first one that you set up, and you're welcome to go through those. Let me just tell you what it'll look like. It'll be, like, you hit create cluster. You'll name it the name we gave you, US East one free or something. Can use it whatever you want, but it'll help you. And then at this point, you can just pick the catalogs you wanna add. Or if you did it first, it's fine. But you're just gonna leave it like it is. Hey. I want a free standard cluster. If you're here and you got a minute, the best thing I would tell you is turn, oh, I think they're turned off by default. That's good. These options, these cool caching options, I wanna turn them off. Looks like they already are. Alright. So I'm gonna pause. I know I didn't walk through it, but, for those that are gonna do that or haven't done that or get stuck along the way, not only do I give you all this instruction, I gave you a little video of me doing it. 
So, again, for those that are in that maybe watching this on demand or you're here today live, but you say, you know what? I don't have five or ten more minutes to set that up, Lester. Let's just keep going. I'm gonna go ahead and get started. But, but if anyone was setting that up in the background, and you need me to if you're really close and you like me to hold for you, don't hesitate to chat and say, hey. I'm almost there, Lester, because what I'm a do now without further to do a do to do a do without first further to do, I think, is the way you say it. Right? Let me go to top and run this little query just to keep my cluster lab. Let go back to the slides and say, let's just see what we got now. So then the so the intention here is to do what? That barcode is the same thing in the instructions in the doc tab over there. It's to set up your environment. And and then also those instructions tell you to grab some SQL, and that SQL is also back in those instructions. It says right here, it says, hey. After you get all set up, you got this cluster, you know, select that in that pull down like I had and hit activities. Go copy all this stuff, copy, and then back in our query editor, just blast it in here. So I've already blasted all the SQL in mind, and I encourage you, if you're doing this with me right now or you're doing this later, to include all that code as well. So we're gonna walk through every bit of that. Don't just run it. Top to bottom, we'll do it together, or I'll do it in a measured manner as we go along. Alright. I see no one saying wait for me. I see no one say anything. So, then I would say, just so I know someone's out there, you know, at least post where the heck you're from. And where you're at today, I'm in Atlanta, Georgia, United States Of America, Southeast, US Of A, land of the four seasons. Beautiful. Okay. So I'm gonna assume we're kind of up to that point. We got an environment going. That was our first hands on. Hey. 
Mexico City, Knoxville right up the street. I don't know if you pronounce it Fred, P-H-R-E-D, Fred. Thanks. If you know what the three stars in the Tennessee state flag are for, feel free to type it in here. And if you don't and you wanna know, obviously you can Google it, but you can always ping me directly, and I'll let you know. So that's my little test to see if you know your Tennessee trivia. For those who don't know, the state of Tennessee has this round thing on its flag and then, I think, three stars in the middle of it, and there's a reason for those three stars. Okay. Let's talk about Trino and Iceberg, mainly Iceberg because that's what we wanna talk about. In case nobody knows what the stars stand for, I'll just say it: Memphis, Nashville, and Knoxville, those three cities, which, if you look at a map, folks, are in the lower left, the middle, and the upper right. It's a weird state. But enough Tennessee trivia. Moving on. Iceberg. Let's try it. Okay. So we're not trying to learn Starburst or Trino or anything like that today, but I'll just say it: we're gonna use this engine. Fundamentally, under the covers, the SQL engine itself is called Trino. It's a clustering technology. Hopefully many of you are customers or know about Trino already. It's a distributed, massively parallel processing engine that reads data from all kinds of places, not just data lakes. And it doesn't just read them; we can actually join them all together and do some awesome federated queries. Starburst, at its core, is that engine. So let me kick it back to the big view. Sorry. That's the little engine right here with Commander Bun Bun, the space bunny character. Maybe it's a them, I don't know. That represents the open source project. So that's a heck of a part of what Starburst is, but it's not all that Starburst is. Starburst is that plus more. 
We'll primarily stay razor focused in here, but Iceberg matters in a lot of these different elements. So I want to make sure you know there's a bigger Starburst play here, and then hone it back to where we're trying to be: Iceberg itself, Apache Iceberg, the open data lake table format, the modern table format. And, you know, there's Ryan Blue and Daniel Weeks. They work over at Databricks nowadays, or at least Ryan does; I don't know about Daniel. But the project is alive and well, for sure. And the main trick here is, there are a lot of cool features, but one of the cool integration features is that it can work with lots and lots of different engines. That's one of the architectural tenets. I'm not gonna go into this slide. If you've been around Iceberg or heard a little bit about Iceberg, you've probably seen this kind of drawing once or twice or thrice. I would just simply say that Apache Hive put the metadata about a table off in a thing called the metastore, the Hive metastore. And that was kind of the superglue that said, okay, what's the logical definition of a table, and where are those files on the data lake? And arguably, in many ways, that concept is still very much alive. It's just that the classical metastore, which we call a catalog nowadays, in Iceberg vernacular at least, is something we don't wanna put too much of the burden on. We wanna actually put all that metadata out on the data lake alongside the files themselves. We have a lot of information about that stuff if that's relatively new to you. For us today, we'll survive if we don't understand it perfectly well. I wanna make sure you know it's there. And then there you go. 
If you went back to that slide where I said Starburst is a lot of things, this is still that kind of Trino plus some optimizations and stuff. And I wanna make the point that we don't just support Iceberg. We support the other modern table formats. We support Apache Hive, and then we support all that other stuff we've always been doing with our 50 connectors, the Mongos, the Elastics, the Snowflakes, anything and everything out there, and we can bring it all together. Okay. Last thing: Starburst can run either as deployable software that you can run wherever you want, in the cloud, on prem, Kubernetes, bare metal, whatever. And then we have a software as a service offering, and that's what we're using today, Starburst Galaxy. Okay. Let's see what all this means. So I'm hopeful that a few folks might have set up an environment. If you did, let's just verify everything is working pretty good. So I'm back over here in my world. I'm gonna run this query just for a second here. I keep running this query because these free clusters quiesce about every five minutes. So if I keep it alive, it'll be a little bit snappier. Yep. It shut down. It's starting up. Alright. So in this cluster, I did set up that temp catalog that I mentioned earlier, right here. It shows an S3 kind of configuration. And once I know it's up and running, let's go back over to the instructions that I gave you. The first thing they want you to do in that catalog is go ahead and create a schema. And if you look closely, it says, hey, clean this name up: change the first name, the last name, and the postal code. And to be honest, this string could be anything as long as it's not anything anyone else is using right now. We're in a shared bucket. 
And to make all this work the way we wanted, quickly, with a lot of people, y'all in different places, y'all pointing at the same place, I gave everyone this kind of universal access. So if we use the same name, we're gonna step on each other's toes. If you run create schema with temp, first name, last name, and postal code as is, you and the other person that does that are gonna have weird results. So what I did is I changed it to mine. My name is Lester Martin. That is not my ZIP code, but it's pretty close. Okay. And if all that goes well, and it did, it might even show up. There it is. There we go. We got temp Lester, 323. Yours will show up. You won't see anyone else's, but arguably, theoretically, you have access to see the S3 stuff. So, nonetheless, alright. What am I gonna do first? Let's just build a table real quick, a quick Iceberg table. Yep. I called it dune characters, after my favorite book from the old days. Interesting set of movies and TV shows over the years. And then I'm just gonna add some records into it, make sure they're in there, add a few of the big characters of Dune, at least the first book. And, yep, there they are. You know, Jessica, Paul, they're pretty important folks. And for giggles, just to make sure we remember this, it's Iceberg, and you're welcome to be running these on your own. Yep. We're building snapshots. Remember, snapshots are those versions. We change the data or change the structure, and we create a new version of the table. And what's nice about versions? Well, I'll show you in a second in case you forgot. I'll add about five more characters, so we've got a total of eight now: Thufir Hawat and all those fun folks, and the Beast Rabban and everyone. And that got us up by just one more snapshot, because that was one big insert statement. So there's a third snapshot now, as you see here. 
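The schema, table, and snapshot steps described above can be sketched as a small Python helper that only builds the SQL text. This is a minimal sketch under stated assumptions, not the workshop's actual script: the `tmp_` prefix, the column names, the third sample character, and the `type = 'iceberg'` table property (Starburst Galaxy style) are all illustrative, while `"dune_characters$snapshots"` follows the Trino Iceberg connector's metadata-table naming.

```python
from typing import List


def unique_schema_name(first: str, last: str, postal_code: str) -> str:
    """Build a schema name that is unlikely to collide in a shared bucket.

    The 'tmp_' prefix and the underscore layout are hypothetical.
    """
    parts = [first, last, postal_code]
    # Keep only letters and digits so the name is a plain SQL identifier.
    cleaned = ["".join(ch for ch in part.lower() if ch.isalnum()) for part in parts]
    return "tmp_" + "_".join(cleaned)


def setup_statements(catalog: str, schema: str) -> List[str]:
    """Return the SQL for: create schema, create table, insert rows, list snapshots."""
    table = f"{catalog}.{schema}.dune_characters"
    return [
        f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}",
        # Column names are assumptions; 'type' is the Galaxy-style table property.
        f"CREATE TABLE {table} (id integer, name varchar) WITH (type = 'iceberg')",
        # Sample rows; 'Leto' is a placeholder for the third character.
        f"INSERT INTO {table} VALUES (1, 'Paul'), (2, 'Jessica'), (3, 'Leto')",
        # Each committed change adds one row to the $snapshots metadata table.
        f'SELECT snapshot_id, committed_at, operation FROM {catalog}.{schema}."dune_characters$snapshots" ORDER BY committed_at',
    ]


if __name__ == "__main__":
    schema_name = unique_schema_name("Ada", "Lovelace", "12345")
    for statement in setup_statements("tmp_cat", schema_name):
        print(statement + ";")
```

Paste the printed statements into the query editor one at a time rather than running them all at once, so each new snapshot is visible as it is created.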
And the point I really wanna make is, what can you do with those snapshots? You can do cool stuff like time travel. I'm gonna grab the middle one there. And as my notes say, replace this, and it's gonna say, hey, select everything from that table as of that specific version I picked. Okay. That was one operation ago, when I just had three users. That's cool. You can also do a timestamp, as of a time, you know, five seconds ago, five days ago, five weeks ago. There's more in that vein: there's branching, there's tagging, there's all kinds of cool stuff, should you want to keep an auditable version or something like that over time. And then, of course, we can do rollbacks. Usually good for more like testing, trying things out when you didn't like something. Oh, because that name is that generic name. Let me just steal this up here. Catalog temp. So it just says, for that schema, for dune characters, roll it back to two revs ago. So when I select now, there you go. And, again, that's more operational. That's more data engineering, less user-oriented kinds of concepts. Alright. Pausing before we jump any further. That was to remind us that Iceberg exists, that it works just fine on Trino, and to just kind of tour a few little things. Okay. If there are no questions on that, I'll keep it going. Again, I hope somebody is trying to play along today, not everybody just watching, just because it's more fun, I think, to give me hard questions. Alright. So if we're gonna talk about migrating from Apache Hive to Apache Iceberg, there's always this thought about, do I really wanna do something? My intention here is not to suggest you shouldn't go or might not want to go, but what are some scenarios that might make you never do it, or probably defer doing something like this, or do it for selected targets, that kind of stuff. 
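The time travel and rollback operations mentioned above can be sketched as follows. This is a hedged illustration that only assembles SQL strings: the `FOR VERSION AS OF` and `FOR TIMESTAMP AS OF` clauses and the `system.rollback_to_snapshot` procedure follow the Trino Iceberg connector documentation as far as I know, but newer releases may prefer a different rollback form, so check your version. The function names and sample identifiers are hypothetical.

```python
def time_travel_by_version(table: str, snapshot_id: int) -> str:
    """Query the table as it looked at a specific snapshot ID."""
    return f"SELECT * FROM {table} FOR VERSION AS OF {snapshot_id}"


def time_travel_by_timestamp(table: str, timestamp: str) -> str:
    """Query the table as it looked at a point in time.

    'timestamp' is a literal such as '2024-01-01 00:00:00 UTC'.
    """
    return f"SELECT * FROM {table} FOR TIMESTAMP AS OF TIMESTAMP '{timestamp}'"


def rollback(catalog: str, schema: str, table: str, snapshot_id: int) -> str:
    """Make an earlier snapshot the current state of the table.

    Assumes the catalog exposes the system.rollback_to_snapshot procedure.
    """
    return (
        f"CALL {catalog}.system.rollback_to_snapshot("
        f"'{schema}', '{table}', {snapshot_id})"
    )


if __name__ == "__main__":
    # The snapshot ID 42 is a placeholder; real IDs come from the $snapshots table.
    print(time_travel_by_version("tmp_cat.my_schema.dune_characters", 42))
    print(time_travel_by_timestamp("tmp_cat.my_schema.dune_characters", "2024-01-01 00:00:00 UTC"))
    print(rollback("tmp_cat", "my_schema", "dune_characters", 42))
```

Time travel only reads an older version, while a rollback changes what every later query sees, which is why it fits testing and data engineering work more than end-user queries.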
And in the good old US of A, we say things like, is the juice worth the squeeze? Does that lemonade taste as good as all the effort of squeezing all those lemons? So what are some problems, or what are some potential reasons why not? One comes down to maybe you have a really old pipeline, or a pipeline built on an old strategy that's been around a long, long time. And what I mean by that is some people use a thing called a Hive external table, and they use some totally out-of-band, non-Apache-Hive solution to build data files and just put them somewhere. And what's nice about that, that's kinda how this whole data lake analytics thing started. People just put files in a folder. They wrote a create table to point to it, and off they went. When they wanted more records, they just put more files in there. When they wanted to get rid of some records, they took some files out. So if you've got a solution like that, you can still march forward and migrate. I just wanna make the point that your data pipeline's gonna have to change a little bit. So, yep, you'd still drop your files where you want to drop them, but once we make this an Iceberg table, you're gonna have to fire off an Iceberg function that tells Iceberg, hey, guess what? There are some more files that are part of this table. There's something very similar to this in Apache Spark as well. And, historically, just as a side note for those that know about this, they recently did the upgrades that take care of the case where the table is partitioned. Last time I did this, a while back, I kinda said, well, if your table's partitioned, you're really in trouble. I haven't done exhaustive testing, but I know the PR has already been pushed through, so it should be in a Trino release and/or a Starburst release at any moment now, if it's not already there. I need to double check that. 
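The "tell Iceberg there are more files" step might look like the sketch below. It assumes the engine exposes an `add_files` table procedure through `ALTER TABLE ... EXECUTE`, as recent Trino Iceberg connector documentation describes; whether your Trino or Starburst release includes it (and how it handles partitioned tables) needs to be verified, as the speaker notes. The table name, bucket path, and function name are hypothetical, and the format check simply reflects the three data file formats Iceberg supports.

```python
ICEBERG_FILE_FORMATS = {"ORC", "PARQUET", "AVRO"}


def add_files_statement(table: str, location: str, file_format: str) -> str:
    """Build the SQL that registers already-written data files with an Iceberg table.

    Assumes an add_files procedure is available in the connector release in use.
    """
    fmt = file_format.strip().upper()
    if fmt not in ICEBERG_FILE_FORMATS:
        # Files in other formats would have to be rewritten first.
        raise ValueError(f"{file_format!r} is not an Iceberg data file format")
    return (
        f"ALTER TABLE {table} EXECUTE add_files("
        f"location => '{location}', format => '{fmt}')"
    )


if __name__ == "__main__":
    # Hypothetical landing folder written by an out-of-band pipeline.
    print(add_files_statement(
        "tmp_cat.my_schema.events",
        "s3://example-bucket/landing/events/",
        "parquet",
    ))
```

In a pipeline like the one described, this statement would run after each batch of files lands, replacing the old assumption that files become visible just by appearing in the folder.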
So might need to upgrade, not a brittle pipeline, but a very kinda disjointed pipeline. And those aren't bad. I built a lot of those in the last ten, fifteen years. I bet you a lot of them are still running, so they would have to be just adapted just a little bit. What else? Well, I would say this is the kind of tongue in cheek one, but, but it's serious. The benefits you're really gonna get out of, iceberg, what if you don't really need them? What if you don't really need your ACID transactions? You have a little bit of that in hive ACID, of course, but if you really are, you know, just a true immutable representation of transactions that are maybe time series immutable, there's nothing changes, you probably need ACID transactions a lot less than you think, but you still might want them. They still may be valuable. They still may be quantities. Maybe you don't want versioning that we just mentioned. Maybe you don't want the coolness of what they call hidden partitioning, a way to not build another column often to kinda create a partition value, you know, and put them in the right place. Just it's it's some nice features there. And you just know I am not gonna adapt this schema. I got it right. Data's never gonna change. You can update schemas in Hive, but my experience has been often you'd you also need to couple that with a data migration of some sorts to make sure what you think it's to be and what's out there. So it cause a lot of rewrites. Or maybe your partition strategy, you're not sure what it is. So those are some of the features, plenty of other sessions that we walk talk to those, and I think I reference a number of tutorials or even the last, workshop we did, which you can find on our site, walk through a lot of those features and explore explore them. Maybe another one, same kind of thing. What you got to work is not just the raw functionality features, but it scales performance. What I got works, Lester. It works really good. 
I don't expect it to stop working really good. I like my vendor. I like everything I have. It's hard to tell someone to change something if it really works really good. So if it really works really good, good for you. Absolutely. This would also be a kinda little bit of a call out here. There also could be some things that maybe you've done with HMS that might offer some incompatibility. So I don't wanna go into too many corner cases there, but it's something worth exploring. If I've done something too special that's binding me to something like something that bound me hard to a high that I can't really, you know, get over without some effort. And then lastly, of course, maybe you just don't have the time or the money. So we talk about money when I'm suggesting resources here, but sometimes it's not just money. Maybe it's just time. I really wanna do a good job on this. I wanna do some testing. I wanna make sure I have some fallbacks if things go wrong, all that stuff. Maybe you just can't get there. So there are some reasons potentially to not migrate. You know, there's no reason to do work just because it sounds cool. We do work for benefits. This session I think our last session was much more about the benefits of iceberg. This one's about, okay. You you decided to go. But those are some reasons you might not you know, there we go. Back to that, you know, that meme out there. But that common thing, don't break it. You know? Don't don't fix something that's not broke. Right? Why why fix? Why don't we know? But I would argue I truly would argue that if we truly believed in that model for everything, you know, there we go. We're still in horse and buggies, much less whatever we're using for personal transportation. They got cars, trains, that kind of stuff. So I just put a feedback loop there and said, okay. If it ain't broke, don't fix it. Nothing's great. Maybe just like everything else in life, build that forward, to do to yourself that says, okay. 
When do I need to come back and take a deep breath? It could be interrupt driven: a new CTO comes in, a new CIO tells you we're gonna do this, all that stuff. But at some rational, reasonable point in time that I set, I might say, let's march forward. Alright. Let's talk about some migration strategies and then go do some hands on, because I'm spending more time than I want in these exciting slides. So you decide you wanna march forward. You wanna do migrations. You wanna move one or more or all of your Apache Hive tables into Iceberg. It probably shouldn't be too mind blowing to imagine there are just two primary approaches. One of them is the classic solution that's always been out there, the shadow migration, or, you know, stop and cut over. Shadow usually implies you do it alongside something, but it could mean just, okay, stop at 3:00 this afternoon, do whatever you have to do, and then turn it on as soon as we can. And the shadow part might be that you could also do some of that work early, prior to the cutoff, to limit some outage or downtime there. So we'll run through that migration process and do some of those things. But there is, and I wish it was more applicable sometimes, a concept called an in-place migration. If you find yourself with a table that fits a certain set of criteria, it can be a metadata operation, kind of like a table takeover. Just go, hey, okay, it's not Hive anymore. It's an Iceberg table, because I read, I saw, I put in place enough of that metadata from what I see there. And the nice thing there is that one can be pretty darn fast because it doesn't rewrite the data. And if your table's huge and large and long and all that good stuff, rewriting the table is not always the most fun thing to do. Even if we decide to do that, we might talk about strategies to make that easier too. So there we go. 
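A shadow migration, in its simplest form, rewrites the Hive table into a new Iceberg table and validates the copy before cutover. The sketch below is one possible shape for that, not the workshop's procedure: it assumes a Galaxy-style `type = 'iceberg'` table property on a CREATE TABLE AS SELECT, uses a plain row-count comparison as the validation step, and leaves the actual cutover (repointing readers and writers) out. All table names are hypothetical.

```python
from typing import List


def shadow_migration_statements(hive_table: str, iceberg_table: str) -> List[str]:
    """Return SQL for a copy-then-validate shadow migration.

    Step 1 rewrites every row into a new Iceberg table.
    Step 2 compares row counts between the source and the copy.
    """
    return [
        # The full data rewrite is what makes this approach slow on huge tables.
        f"CREATE TABLE {iceberg_table} WITH (type = 'iceberg') AS SELECT * FROM {hive_table}",
        # A minimal sanity check before switching anything over.
        f"SELECT (SELECT count(*) FROM {hive_table}) = "
        f"(SELECT count(*) FROM {iceberg_table}) AS counts_match",
    ]


if __name__ == "__main__":
    for statement in shadow_migration_statements(
        "tmp_cat.my_schema.orders_hive",
        "tmp_cat.my_schema.orders_iceberg",
    ):
        print(statement + ";")
```

Because the copy can be built while the Hive table keeps serving queries, most of the work can happen before the cutoff, which is the "do some of that early" idea described above.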
In-place migrations: it's a takeover. The files are still there. We didn't change the files. I did call out in the bottom right, since I talked about external tables, for those that are aware what that means: Hive has external tables and managed tables. Sometimes people call them other names, non-managed and managed, or external and internal. I call them external tables and then just tables, or managed tables. We don't have that distinction in Iceberg. Everything is a managed table, so everything is just an Iceberg table. External tables won't be external tables anymore. So to do the in-place migration, because that's the easy one, what are the criteria? Well, some of them aren't that bad. If you have tables at size and scale in Apache Hive, there's a good chance, if you've done some performance tuning over the years, that they're already in something like the ORC file format or the Parquet file format. Maybe even Avro, but that one is probably less likely. So if you're there already, great. If you're in a CSV-backed table, tough luck. If you're in a JSON-backed table, tough luck. RC files, all that stuff, you're kind of out of luck. You're going to have to rewrite those. And that could change, let me just say it that way. I would say that the structure and the framework that is Iceberg doesn't necessarily prevent other file formats. I believe Tabular was supporting more than those three at some point in time. And there might be some movement that says we can do more. I would say it's probably not a good idea unless they're even more advanced columnar stores than Parquet or ORC. I wouldn't want to try to do it on top of CSVs. That's already a data lake problem. Alright. So, an in-place migration. Let's go see that. Let's go see why it won't work. Let's do a demo, Lester. I bet my query shut down, so my server shut down. It sure did. Great. Alright.
So this is that SQL file that you could have grabbed from GitHub. And what I did is, in each of these little demo sections, I've got a little, yeah, okay. Did I show you all that? I did. I showed you a lot. I think I'll just show you that it won't work. How about that? Let me look at my notes while it's spinning up. Yeah. Okay. So, yeah, we'll do it now. Believe it or not, I did prepare for this. It doesn't seem like I did, but I did. Alright. So, in-place migration: let's try it with some non-available data types. In each one of these incompatible data types, I gave you some SQL. I've got a little setup. So I'm going to do a quick setup here. I'm going to say, hey, let's build a table in that schema we set up, and I just call it the customer hive JSON table. Ugly names, just to tell you what it is. It's a Hive table format using the JSON file format. And if you look closely, I just copied over, and I should have copied over a smaller set, the scale factor 10 TPC-H data. If you don't know what that is, it's a data generator. It's going to create some customer data for us. And there we go. Fifteen million rows, something like that. So, basically, I'd walk in now and say, okay, I've got this table. I'd say show create table and see, hey, it's a Hive table, as you see at the bottom here. Let me make sure this is up at 125 percent. Yeah. Probably hard to see in some cases. So that's my table. That table, as I said, has about 15 million rows. 1.5 or 15 mil? 1,500,000. So sorry. So what am I going to do? How do I change it? It's about as complicated as this: do an alter table and change the type. Change it to Iceberg from what it was. And you can do this in other engines too. Spark has similar mechanisms. Failed to migrate table. That didn't tell me a lot. So I'm going to navigate over to my insights, and I'm going to look at that thing a little harder.
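The "alter table and change the type" step described above can be sketched as a small statement builder. This is only an illustration, not the workshop's own SQL file: the catalog, schema, and table names are hypothetical, the `type = 'ICEBERG'` property reflects the Starburst Galaxy style of upgrade as understood here, and the `migrate` procedure is the open-source Trino Iceberg connector's counterpart. Check your engine's documentation before relying on either form.

```python
def in_place_migration_sql(catalog: str, schema: str, table: str) -> dict:
    """Build two candidate statements for an in-place Hive-to-Iceberg upgrade.

    Neither statement rewrites data files; both only add Iceberg metadata.
    """
    fq_name = f"{catalog}.{schema}.{table}"
    return {
        # Starburst Galaxy style (assumption): flip the table's type property.
        "alter_type": f"ALTER TABLE {fq_name} SET PROPERTIES type = 'ICEBERG'",
        # Open-source Trino Iceberg connector: the migrate procedure.
        "migrate_procedure": (
            f"CALL {catalog}.system.migrate("
            f"schema_name => '{schema}', table_name => '{table}')"
        ),
    }


# Hypothetical names, loosely following the demo's "customer hive JSON" table.
statements = in_place_migration_sql("lake", "workshop", "customer_hive_json")
for label, sql in statements.items():
    print(f"{label}: {sql}")
```

For the JSON-backed demo table either statement would be expected to fail, which is the point of the demo: the upgrade only works when the existing files are already in a format Iceberg accepts.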
So I pulled that query up, and it said, yep, failed to migrate table. I'm a Java programmer, and that usually tells you to go to the bottom. So I go to the bottom, and there it is. There's an exception. It says unsupported storage format of JSON. Ah, okay. Fair enough. I gotcha. We don't do that. It needs to be a certain file format. So what could I do? Well, I'm going to go ahead and introduce you to the notion of what you might do conceptually, at least in its most rudimentary form. What else would you do if you can't do that? Well, probably, you're going to do something like this. I'm going to keep it super simple. Hey, build me a new table that's built on the output of reading that table. So a CTAS, a create table as. So I'm creating an Iceberg table. And what am I doing? I'm reading those 1.5 mil with this teeny tiny cluster, so I should have made it really small, and rewriting them again. And then guess what? I'm just going to drop my JSON table, so this one was a little bit of an outage, right? And it was a rewrite, and that rewrite was fast because it only had a few records. And then I just renamed the new one back to the old name. So if I ask how many records, it's 1.5 mil, and that name, customer hive JSON, is now an Iceberg version two table using Parquet. That was a rewrite. And I'm going to, of course, talk about how you probably won't do that if your table's terabytes, petabytes, exabytes. You probably don't want to do it all at once because that could take a lot of time, but it is something to be aware of. Alright. So it needs to be one of a certain set of file formats. What else? Well, here's the one that I think hurts the most, that gets a lot of us. There is going to be a different set of data types that Iceberg allows than Hive allows.
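The rewrite path just described (CTAS into Iceberg, drop the old table, rename the new one back) could be sketched like this. The schema and table names are hypothetical, the `_new` suffix is an illustrative choice, and the `type` table property assumes a Starburst Galaxy style catalog; in a plain Trino Iceberg catalog you would drop that property.

```python
def ctas_swap_sql(schema: str, table: str, file_format: str = "PARQUET") -> list:
    """Return the three statements of a simple rewrite-and-swap migration."""
    old_name = f"{schema}.{table}"
    new_name = f"{schema}.{table}_new"
    return [
        # 1. Rewrite: read every row of the Hive table into a new Iceberg table.
        f"CREATE TABLE {new_name} "
        f"WITH (type = 'ICEBERG', format = '{file_format}') "
        f"AS SELECT * FROM {old_name}",
        # 2. Outage begins: the original name disappears.
        f"DROP TABLE {old_name}",
        # 3. Outage ends: the Iceberg table takes over the original name.
        f"ALTER TABLE {new_name} RENAME TO {old_name}",
    ]


for statement in ctas_swap_sql("workshop", "customer_hive_json"):
    print(statement + ";")
```

Each statement commits on its own, so the table is unavailable between steps 2 and 3. That is tolerable for a small table and is the reason the staged approach comes up later for large ones.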
And some of them aren't such a big deal, but the one that has caused me concern, of course, is the timestamp. Hive normally supports the millisecond timestamp, and Iceberg, through its file formats, is really leaning toward microseconds. So it's not thousandths, but, what is it? Thousandths, millionths, and then beyond. In fact, they have a nanosecond one that goes to even more precision. But net net, what I'm going to say is my experience has shown that timestamps are the thing that hurt us, the thing that's going to bite us, and probably bite us the most, because our biggest tables likely are some kind of immutable time series, or at least have a timestamp for when an activity occurred, and this is really going to end up causing us some grief. So let me show you what I mean, because it is a bummer. And I bet if you think about it, you'll have some questions. Why can't you just do a cast, Lester? Or why can't you do this? Ask those questions, but I'll go ahead and give you the short answer real quick. The problem is the file itself. Say that Parquet file that's ready to convert: it has its own schema. It has its own data types, and it was written with a less precise timestamp, and you want to go to the other timestamp type, and it doesn't know how to do that well. It's like you want to recast. You just want to tell it, hey, cool, keep the details, but call it something different. I don't know how to do that with a raw Parquet file. And if I knew how to do it, then I guess our engine guys would have made it work. So, nonetheless, I built the table in the background with a whopping 200 rows. And as you see here, it's a Hive table, Parquet file format, which we hope is going to do well, but there are some problems. And I don't think I focused on timestamp. I have some dates in there. Yeah. But I'm really focused on this, the chars.
Char is one of those types that Hive has but Iceberg doesn't. It just has varchar. You know, don't specifically call it char. Just make it varchar. So I built that table, and I said, let's convert it, and it's going to blow up as I promised. Failed to migrate table. So I'm going to go over there and look. A lot of today is showing you some things that might go wrong, but I'm going to leave you with some strategies that I promise are pretty good. I hope that was the right one. So it's going to say failed to migrate, scroll to the bottom, and there is the real culprit. The type char(16) isn't supported. And there may be other chars in there. I think it's probably just reporting the first one, that's all. Oh, I hit this, I don't like it, I quit, I walk away. That kind of stuff. So the good or the bad news is, could you continue forward with the incompatible data type right here? Could you do what we did before? You sure could. I'll just quickly build a CTAS. Build me a new one that looks like the old one. And the conversions get tackled there. So it's going to do the conversion because it's rewriting those files for us. And as you see, to show that it's rewriting them, if you do a describe on that new table, there are just varchars. There are no more chars in there like we saw earlier. So then the rest was what I did before. Drop the table, rename the new table to the old table's name, describe it, show it. Is that sweet and easy? On tables that are not very big, pretty easy. In fact, for those that make it to the end, I'll point you to a personal project of mine, a little en masse converter that's doing all these kinds of things. You know, look at a schema, get a list of tables, see if they have problems, call them out, do some conversions, all that good stuff. You don't have to do this all by hand.
You can automate the heck out of a lot of this stuff here. Alright. So we were able to tackle it. There it is. It's Iceberg, Parquet. And in the describe you saw the different data types. Okay. What else are some of the requirements? Well, this is an interesting one. Not everybody knows a lot about bucketed tables, or clustered-by tables. If you're not familiar with them, it's a mechanism. Partitioning says put data in certain folders based on a value, usually something like ranges of dates or maybe a region of the country. Buckets are for a different cardinality, much higher cardinality. So you might have, say, bank accounts, and maybe you've got millions and millions of customers, so, customer ID. If you bucket a table on customer ID, it's basically going to take that value, make a hash of it, and then guarantee, kind of like partitioning, that customer ID 4529 always lives in bucket G. Another customer always is in bucket H. Another customer might live in a whole other bucket or in one of those buckets. So we'll have a lot of customers, or whatever we bucketed by, clumped together, but we do guarantee they'll always go there. What does it help us with? Historically it's usually only done for joining gigantic tables, because it lets the engines be a little more strategic and not have to do a lot of shuffling. But it actually can be pretty helpful for one-off lookups. Hey, find me everything that happened with customer 29. There's more to that conversation if you're not familiar with bucketing. So if you really don't understand that, that's okay. Just run along and say, okay, but I saw a table that's bucketed by something, so what's going to happen? Well, it's a little different here. Let me show you. Oops. I'm going to keep it right here. Alright. So in my query editor, I did a setup. Now this one, I think I'll just run it, get it started. Yeah.
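The "hash the value, always land in the same bucket" idea can be sketched in a few lines. This is a toy illustration only: it uses CRC32 as a stand-in hash, whereas Hive and Iceberg each define their own hash functions, so the bucket numbers below will not match what either system produces.

```python
import zlib
from collections import Counter


def bucket_for(customer_id: int, num_buckets: int) -> int:
    """Map a key to a bucket deterministically: hash it, then take the modulo.

    CRC32 is a stand-in here; real engines use their own hash functions.
    """
    digest = zlib.crc32(str(customer_id).encode("utf-8"))
    return digest % num_buckets


# The same customer always lands in the same bucket.
first = bucket_for(4529, 5)
second = bucket_for(4529, 5)
print(first == second)

# Many distinct customers spread across the five buckets.
spread = Counter(bucket_for(cid, 5) for cid in range(1, 10_001))
print(sorted(spread.items()))
```

Because the mapping is deterministic, an engine looking for one customer only has to read the bucket that customer hashes to, and two tables bucketed the same way on the same key can be joined bucket by bucket.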
So the setup says create me a table. We'll call it orders hive, ORC file format, make it bucketed. And if you look really, really close, this is still a create table as. It's just that when you do a create table as, you can be very specific with all the details. So I'm telling it: Hive, ORC. I did a partitioning strategy. This thing has order priorities, like five or six priorities. So it'll create a folder based on the order priority, and then I used that bucketing concept on customer key. I want to lump them into five. So, again, I'll stop trying to explain what bucketing does, other than getting this setup going here. Alright. So there we go. I'm coming along, and I ran a couple of queries. Let me just kick them all off, and we'll look at them. I said, just show me some data first. Alright. That's easy. Order key, customer key. There's that order priority. There it is, 3452, all that good stuff. I did a describe. There it goes. This is the table. There's that partition key, order priority. It doesn't mention anything about the bucketing. As the third query I said, remind me, what are those priorities? What are those field values that I partitioned by? Because you're going to see this in a second. It'll create folders. And then lastly, I said, show me the create. Alright. Well, here, we're going to hone in on this. Partitioned by order priority, and then there's that Hive vernacular, bucketed by the customer key. And when you bucket, you need to tell it how many buckets. This is part of the solution concept here. Alright. So five buckets. Meaning that if I had a million customers and there wasn't any weird, crazy, bizarre skewing, well, even with skewing, probably by the fact that we do a hash of that value, we'll get a very evenly dispersed thing. That's another reason to do bucketing. Sometimes your data is highly skewed.
This is a way to kind of flatten out the skewing. There we go. So at least three good solutions that bucketing can help us with. Alright. So I did all that. Why? I did all that to show what happens if we have these five values for the partition. There's a metadata or special column called dollar path on everything in Hive. It's been around forever. We access it too. So I just said, look at every record in the table, and every record will have a kind of special pointer called dollar path that says, what file did I come from? So I just said, give me a distinct list of those files. This is how I would do it in Hive. So what you see here, I've got 25 rows. And if you look closely, right about here where my cursor is, right about the middle, and I'll click on one of these, there is this folder right here that says order priority equals one urgent. Those are the folders: one, two, three, four, five. There are five files in this one. There are another five files in the two folder, another five files in three. And this is telling us, if I click on this, that this data file was, again, for order priority two, and then the file name. The way all this works is that the four says this is one of those five buckets. This is the fifth one there, zero-based numbering. So bucketing creates files, in Hive at least. It creates files based on what goes in there. Again, if you didn't know all that, that's fine. If it's still a little fuzzy, that's okay. That's a deeper conversation. But I wanted to go into it because it is a weird corner case. Alright. So there we go. I set it up, and we see what is happening. I'm going to do an in-place upgrade. Now the good or the bad news is, and I think it's kind of somewhere in the middle, it's not going to fail. It ran.
And if I ask it to show me the create table of the orders hive ORC bucketed table, and I scroll to the bottom, it's an Iceberg table. It's partitioned by the same value, but there is no mention of the bucketing. So what I'm saying is, if you have a Hive bucketed table, or clustered-by table, or other terms you might know, that will convert okay, but from that point forward you're going to lose the writing to those buckets. That sounds like a lot, doesn't it? It is a lot, but it's something to consider. Okay. Now the rest, since I spent so much time on this, I'll go a lot faster, because the bucket heads already know what I'm talking about, and they can probably decipher this a little faster. I would just simply say, hey, I'm going to take a quick look. This is the same table, same records. I just did the same query again. Show me those files that exist. There are still 25. It's the exact same files. We did not rewrite anything. We just hijacked it and put some metadata around it. But the gotcha is, what if I did another insert? Because now the table doesn't have details like that. So I said, hey, add another partition to this. I made up a value called bogus for that order priority field, because I purposely want to force it to create a new folder. That way I know those files are the only ones that were written. And I should have done something a lot smaller than the scale factor 10. It's not very big, but it'll take this thing a minute here on this teeny tiny one-node underpowered free cluster. My bad for doing that. Maybe, you know, just switch it. Nah. We'll just give it a second. It'll kick up here in a second. It will write as if the bucketing doesn't matter or doesn't exist. So it's going to finish now that it hit nine. It'll fly through the rest here, 9%. And I'm going to show it to you because that's what matters. Alright.
So it ran, I asked the same question as before, and I have 29 rows now, not a multiple of five. Why? Because the only thing new is these first four things, and I'm going to click on one of them just to see the full name. Again, this is for my bucket heads. Everyone else, you might just take a sip of coffee and say, ah, I'll look at that later, Lester. There is the order priority bogus. And then this ORC file that I created has nothing in it that helps me, visually at least, go: there's a heuristic that tells us this is part of bucket one, two, three, four, five, or zero, one, two, three, four, whichever way you like to see it. It just doesn't do it. And why do I have four? I have four because I've got a small little cluster, and it has four little processes, four little threads that could write files. I got four. That's all. It had nothing to do with bucketing. The bucketing is gone. It has disappeared. Now the cool or the bad news is, and I'll just do it top to bottom and then end this bucket conversation, because last time I did this, it took a while. I mean, the conversation took a while. So what this is trying to show as well is one strategy. You can just say, hey, you know what? Fine. I will not expect that to be magic. I'll just do the same kind of thing as before: build a new table, define it the way I want, read from the old, write the new. I would say there's still a strategy that you could exercise at size and scale if this is the only issue. You could still do the conversion, and then you could do an alter and add bucketing back in, because it does do bucketing if it's net new. Oh, there we go. I rebuilt the table. That's all that was. Sorry. There we go. So I could do that, what you see here. This is: build a whole new table and declare it how Iceberg sees bucketing.
So Iceberg actually does bucketing, but it does it in its hidden partitioning world, which means it's not going to create a file. It's going to create a folder for those buckets. So what I mean by that is, it looks a little different here. It's listed in the partitioning. Here, let me just show it. Lester, show it. There it is. Scroll to the bottom. So the bucketing is a subtype of partitioning. It's an approach to do clustering. Net net, it's the same thing. It doesn't create a file, it creates a folder, and then the files go in the folder. Net net, it's still going to work out. The engines can figure it all out just fine, and the same benefits exist. And there we go. The difference when I built this: I built it from scratch again, and I'm back to those 25 files. I'm just going to pick on this one. So here's the thing. Here is the path. Oh, there it is. Order priority two. So there's the folder for the partition on the priority. And then what did it do? It built another, a sub-partition folder, and it actually called it out very clearly. Hey, this is bucket zero. This is the first bucket, and then it put a file in it. So you can think about that as, isn't that different? It is different. But if you understand what bucketing is doing, you get the same benefit here. The engine is still going to do that same thing of, oh, you're looking for customer ID 99, I know it lives in bucket number three. So it's going to find that folder with the bucket number three designation, as opposed to a file that starts with, you know, 00002 or something like that. Alright. I think I beat it up. I will just say this. Yeah. I'll just run the rest here to finish the little goofiness out there. I'm just doing the same things. You know, swap it out, move them over. If this is the only problem you have with your in-place conversion, then you're still okay.
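The contrast between the two layouts (bucket encoded in a file-name prefix versus a bucket folder) can be sketched with a small path inspector. The paths and naming patterns below are assumptions for illustration: a six-digit zero-padded prefix for the Hive-style bucket file, and a `custkey_bucket=N` folder for the Iceberg-style layout. Real paths from a `"$path"` query may look different.

```python
import re


def describe_data_file(path: str) -> dict:
    """Pull partition values and a bucket number (if any) out of a data file path."""
    segments = path.strip("/").split("/")
    # Folder segments shaped like key=value are treated as partition values.
    partitions = dict(seg.split("=", 1) for seg in segments[:-1] if "=" in seg)
    info = {"partitions": partitions, "bucket": None, "layout": "unbucketed"}
    # Iceberg-style (assumed naming): the bucket is its own folder, <column>_bucket=N.
    for field, value in partitions.items():
        if field.endswith("_bucket"):
            info.update(bucket=int(value), layout="bucket-folder")
            return info
    # Hive-style (assumed naming): six-digit bucket number prefix on the file name.
    match = re.match(r"^(\d{6})_", segments[-1])
    if match:
        info.update(bucket=int(match.group(1)), layout="file-prefix")
    return info


# Hypothetical paths for the three situations shown in the demo.
hive_file = describe_data_file("orders_hive/orderpriority=2-HIGH/000004_0")
iceberg_file = describe_data_file(
    "orders_iceberg/data/orderpriority=2-HIGH/custkey_bucket=0/data-file.orc"
)
after_upgrade = describe_data_file(
    "orders_hive/orderpriority=bogus/20240101_000000_00001_abcde.orc"
)
print(hive_file)
print(iceberg_file)
print(after_upgrade)
```

The third path stands for a file written after the in-place upgrade, where nothing in the folder or file name identifies a bucket anymore.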
What you can just do is do the upgrade, alter it to an Iceberg table, don't rewrite, nothing got rewritten, and then do an alter on your table and say, hey, rethink the partitioning. Because Iceberg supports not just the schema being evolved, it also allows partitioning strategies to change. So if you haven't thought about that, that's an interesting feature. And it would only start then. It wouldn't go back automatically and rewrite anything. You would start at that point and march forward. And all those benefits you really wanted in the past are still there because of all this metadata that gets cached up and that kind of stuff. So it's surmountable. It just takes a little more thinking. Okay. Let's see if we can finish up. I took forever on bucketed by. Maybe I shouldn't have, but it is a really interesting corner case. It kind of ignores it, but you can work your way around it. This is my point: in-place migrations are pretty rigid. They're pretty darn rigid. I think the timestamp, most of all, is the one that's going to bite most people. But if you can do it, it works just fine, and we'll do one that works just fine, because we need something that was easy. Alright, Lester. Let's build another table, another customer table. We'll call it customer hive ORC partitioned. I even snuck in there Bloom filters. We're not going to describe what a Bloom filter is. But since I've made a big deal about bucketing not getting supported, if you know what Bloom filters are and you're using them, good; if you don't, contact me later. The good news is the conversion will tackle that for us. So if I did a show create, just like you see up there at the top, it's going to have two levels of partitioning. It's going to have a Bloom filter built on the phone number. I'm biting my tongue, trying not to explain the Bloom filter to you. There we go. So what could you do? Could you upgrade that? Yes, you can.
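The "upgrade first, then rethink the partitioning" step could look like the following sketch. It assumes the Trino Iceberg connector's `partitioning` table property with a `bucket(column, n)` transform; the table and column names are hypothetical stand-ins for the demo's orders table.

```python
def add_bucketing_sql(table: str, partition_column: str,
                      bucket_column: str, num_buckets: int) -> str:
    """Build an ALTER TABLE that evolves the partition spec to include a bucket transform."""
    return (
        f"ALTER TABLE {table} SET PROPERTIES partitioning = "
        f"ARRAY['{partition_column}', 'bucket({bucket_column}, {num_buckets})']"
    )


# Hypothetical names: keep the existing partition column, add five buckets.
sql = add_bucketing_sql("workshop.orders_hive_orc_bucketed", "orderpriority", "custkey", 5)
print(sql)
```

As described above, a change like this only affects data written afterwards. Files that already exist are left alone unless you rewrite them yourself.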
So I did an alter, upgrading to Iceberg, and then I said, show me the create table statement for this table, as if it had just originally started out as Iceberg. And there it is. It's an Iceberg table now, not Hive, with the same partitioning. And guess what? A Bloom filter is still built on phone number, with the same Bloom filter false positive percentages and all that kind of fun stuff. There's always another knob for performance. A Bloom filter is one of those knobs. So, okay. I think it's pretty cool that it can tackle that just fine. Alright. So I gave you a lot of things that won't let it work well. And to be fair, those two decision trees that I built along the way in this journey kind of ended up as one big decision tree. It looks like it's the end of the world. It looks harder there. We wanted to take some time and go down the corner cases. If you're going to do this at size and scale, with help from us or someone else, likely you're going to automate this. There's not a perfect tool out there that I'm aware of, from anyone, that just says do it, but there are some pretty good ones that get you pretty far. We've helped customers down this path, and we'd be glad to, from professional services. So it's not per se a product, but we're glad to walk you through and help you with that. And, again, if you want to just roll your own, I'll give you a pointer to my little tool. So there's our maybe don't do it. There's our can I do it easily? But, likely, you're probably going to have to do some of the shadow migration activities as you go along. And if you do any migration, this is what I meant at the beginning, it's going to take time and effort and money. You know, test the snot out of this. Make sure it's really, really going to work for you.
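The corner cases walked through so far can be folded into a rough pre-flight check. This is a sketch of the decision tree as described in the session, not an authoritative rule set: the format list, the treatment of `char` as a blocker, and the timestamp and bucketing warnings simply mirror what the demos showed, and your engine version may behave differently.

```python
# File formats the session named as candidates for in-place upgrade.
IN_PLACE_FORMATS = {"ORC", "PARQUET", "AVRO"}


def in_place_check(file_format: str, column_types: dict, bucketed: bool = False):
    """Return (blockers, warnings) for a proposed in-place Hive-to-Iceberg upgrade."""
    blockers, warnings = [], []
    if file_format.upper() not in IN_PLACE_FORMATS:
        blockers.append(f"file format {file_format} must be rewritten")
    for column, type_name in column_types.items():
        lowered = type_name.lower()
        if lowered.startswith("char"):
            # The demo failed on char(16); a rewrite converts it to varchar.
            blockers.append(f"column {column}: {type_name} is not supported")
        elif lowered.startswith("timestamp"):
            warnings.append(f"column {column}: check timestamp precision")
    if bucketed:
        warnings.append("Hive bucketing is dropped; re-add it as an Iceberg bucket transform")
    return blockers, warnings


# Hypothetical tables echoing the demos.
json_result = in_place_check("JSON", {"custkey": "bigint"})
char_result = in_place_check("PARQUET", {"phone": "char(16)", "name": "varchar(25)"})
bucket_result = in_place_check("ORC", {"custkey": "bigint"}, bucketed=True)
print(json_result, char_result, bucket_result, sep="\n")
```

An empty blocker list suggests trying the in-place upgrade on a test copy first. Any blocker points toward the rewrite or staged approach.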
Make sure that, checking the time here, make sure you automate all this. So take some time. Test, try, validate. If you've got five tables to do, you can do it by hand. If you've got a big environment like most of us do, that's not what you're going to do. You might do that on your first project. Take a little stab. Take something less critical. Do all the same things you do with any other kind of system migration that keeps you out of trouble. And then I would say, oops, that last one said consider staging the rewrites for very large, hopefully very heavily partitioned tables. So what are you really going to do? I'll walk you through an example of that here so you can have one. I've got a big table with a thousand partitions. What do I do? I have to rewrite them to get this. Well, I do not want to wait for all that to happen. That'd be a long downtime. The short answer is, you know what you could do? You could create a new table that looks like the old table, and then you could briefly rename the original table to something else, like old. And then immediately after that, you could build a view that says, hey, do a union all of the new Iceberg table and the old Hive table. So for that brief moment where you had to rename it and then the new view got created, which could be very short, there was a moment in time where maybe the table wasn't accessible, just because someone hit the button right then. But, generally speaking, that's a tiny window. And then what you can do is start deciding, hey, how much of this table do I want to migrate at any given time? So, basically, I would say, you know what? Maybe take some of the more recent partitions, convert them into the new, drop them from the old. Convert them, drop them, convert them, drop them.
And then if you ever finish, which you should eventually, you can just drop the old table completely, drop the view, and then rename the new Iceberg table back to the old table name. So there are minor hazard windows in there you have to consider, because we don't have the old two-phase commit transactions and all that stuff of the old world. And in most Iceberg instances today, we don't have multi-statement commits either. We have single-statement commits. So there are some hazard windows. But what might it look like? It might look like this. If you were having fun in the code, maybe this will be just as fun. Alright. So I'm going to do another setup here, and it says create me that famous customer table again. So it's right here, customer hive parquet partitioned. It's partitioned by market segment, a different one than you saw before. What does that market segment look like? Well, I ran a couple more queries. The first one was just show the create table. That's what you saw, setting it up. Second one: there are only five of these. So we're creating five partitions, because I can do a five-partition migration strategy faster than a 5,001-partition one. And then I just said, hey, how many records are in there? Because we want to make sure our table isn't getting confused or something like that. We don't want to have too many records or too few records. So I've got a whole whopping 1,500. So remember what I said. You can do a one-time create table. So there's my new table. It's an Iceberg table. It's the same format as my old table. I just called it underbar new at the end. What else could I do? Here's my hazard window. I won't even run the query yet. We can do something like that. Now, you know, it went fast. It's going to be faster in the real world, but it took us a second or two to do all that.
So there was a hazard window. But what did I do? I created a view that had the old table name, customer hive parquet partitioned, and I created it as a union all of the new and the old. And then I want to make sure the records are there, so 1,500 records in there. Why? Because 1,500 are in Hive and zero are in the, what do you call it, the new one. As a reminder, the partitions are those five values I showed right here. And then guess what? You could automate passes or just trigger them, whatever you want. So in the first pass, I'm going to simply say, hey, insert into new, select from the old, and I'm going to target just one of those partitions, where it's automobile. And then as soon as that's done, I want to delete automobile from the other one. And then I'm going to make sure that the view sees 1,500 records across the board. Oh, yeah. 1,500. There we go. So that was the hazard window I was warning you about. Because today, at least, definitely in Trino, and some of the other engines have tried to do this on their own, we don't have multi-statement commits, but Apache Iceberg as a specification, as a standard, is absolutely trying to move toward them. So you don't see a start transaction, insert into new, delete from old, commit. What you see is insert into new, delete from old, and they both are auto-committing. So the danger, if you don't drive people away from this for a moment in time, is, yeah, you're going to double dip on a few records. Well, on any record that is in that window of time. So, something to consider, an age-old problem in the data lakes out there. And then guess what? After that, it's just doing the same thing. There we go. I'll just do it again. We'll do another one, building, and we'll still have our 1,500 records all the way through. Yep. You can go count individual tables if you want to. And then for me, I decided that's so fast.
I'll just do the remaining ones. So you can do it one at a time or a lot at a time. And you could do it not on partition values only. You can do it on anything with a predicate, of course, if you don't want to do everything. I'm saying partitions will probably make more sense because those files will be isolated together within a folder already, as opposed to spread across lots of partitions. So those operations will go a lot faster if your table's large enough to require this. And if it's large enough to require it, very likely you've already highly partitioned it. Alright. So there, I'm all done. I did them all. I'm going to drop the view. I'm going to drop the old table. I could leave it there. I just dropped it because I don't want to see it anymore. I could keep it for historical purposes. And then I'm going to alter the new one to have the old name, and then I'm going to run a few queries to say, hey, what does it look like, what are the partitions, and how many records are in there? So there we go. I dropped it. I dropped it. I altered. Show. The show says, there I am, Iceberg with market segment, all in Parquet. There are the partitions. I used a special metadata table called partitions that says, show me what you've got. There they are. Machinery, household, etcetera. And then lastly, I hope, there we go. 1,500 records in there. The good or the bad news is I think that gets me pretty far toward where I was going to go, which is to say, yeah, you can do this. You take the code that's in here and build automation around that. You can build it in Starburst. There are a lot of different approaches to this. You can schedule it some other way. The main thing is make sure it works. Figure out a strategy that works for you. Test it manually. It looks good. Automate it. Test the automation. Looks good. And then you're ready to go.
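The staged approach just walked through (new table, rename the original, union-all view under the original name, then move one partition value at a time) can be rehearsed end to end with an in-memory SQLite database. SQLite is only a stand-in here, not Trino or Iceberg, and the table and column names are hypothetical. What the sketch shows is the bookkeeping: the view's row count should stay constant after every completed pass.

```python
import sqlite3


def staged_migration(conn, table: str, partition_col: str):
    """Move rows from an 'old' table to a 'new' one, one partition value per pass.

    Returns the starting row count and the view's row count after each pass.
    """
    old, new = f"{table}_old", f"{table}_new"
    cur = conn.cursor()
    # One-time setup: an empty new table with the same columns.
    cur.execute(f"CREATE TABLE {new} AS SELECT * FROM {table} WHERE 0")
    # Hazard window: between the rename and the view, the name does not resolve.
    cur.execute(f"ALTER TABLE {table} RENAME TO {old}")
    cur.execute(
        f"CREATE VIEW {table} AS SELECT * FROM {new} UNION ALL SELECT * FROM {old}"
    )
    expected = cur.execute(f"SELECT count(*) FROM {table}").fetchone()[0]
    values = [row[0] for row in cur.execute(
        f"SELECT DISTINCT {partition_col} FROM {old} ORDER BY 1")]
    counts_after_each_pass = []
    for value in values:
        # In Trino these two statements auto-commit separately, so a reader
        # between them would see the partition's rows twice through the view.
        cur.execute(f"INSERT INTO {new} SELECT * FROM {old} WHERE {partition_col} = ?", (value,))
        cur.execute(f"DELETE FROM {old} WHERE {partition_col} = ?", (value,))
        counts_after_each_pass.append(
            cur.execute(f"SELECT count(*) FROM {table}").fetchone()[0])
    # Cleanup: drop the view and the emptied old table, then take the name back.
    cur.execute(f"DROP VIEW {table}")
    cur.execute(f"DROP TABLE {old}")
    cur.execute(f"ALTER TABLE {new} RENAME TO {table}")
    conn.commit()
    return expected, counts_after_each_pass


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (custkey INTEGER, mktsegment TEXT)")
segments = ["AUTOMOBILE", "BUILDING", "FURNITURE", "HOUSEHOLD", "MACHINERY"]
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(i, segments[i % 5]) for i in range(15)])
expected, counts = staged_migration(conn, "customer", "mktsegment")
final_count = conn.execute("SELECT count(*) FROM customer").fetchone()[0]
print(expected, counts, final_count)
```

In a real run the insert and delete would be Trino statements against the Iceberg and Hive tables, and the count check after each pass is the validation step worth automating.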
Make sure you know what your back-out plan is. Well, if you don't drop any of those things like I did, your back-out's pretty easy: roll back. The gotchas, of course, are gonna be like anything else with a migration. Once you hit that point, the migration's done. Technically, you can roll back, you know, logically roll back to where you were then. But if you do make more updates, you'd be responsible for replaying those. So the rollback is there, but the replay isn't. So, like anything, it's still some effort. Alright. So what else could you do from here? I'll give you a couple of links. These are on that PDF you can download. They get you back to our solutions and info pages, a lot of "please help me out doing this" kind of thing. There's another link to the site that I have there. And in fact, I'll let anyone reach out to me directly, because it's not Starburst-sanctioned per se, but I have a little Lester Martin site. I'll paste the URL. Why not? I have a little early cut at an Iceberg migration tool myself that really runs in a Jupyter Notebook, and I'll let you come find it and see if that's of interest to you. Yeah, it's pretty interesting. But it needs some more work and some more love. I'll gladly take your pull request if you wanna move it down the line for us as well. Okay. So I'm trying to find the... oh, over here. Optionally, check out my personal v1 of a migration tool at... boom. Okay. With that said, we'll just transition to a full-on Q&A session here. And we have plenty of time. We allocated about ninety minutes. We're at sixty-seven, sixty-eight of those, so we have a little bit of time. If anyone does wanna open-mic it about anything and everything, open mic means you have to type it, unfortunately.
I don't think we have our webinar set up to allow folks to come on verbally, or on video either. But if you do have some questions about the information we talked about today, questions about Starburst, questions about Iceberg, questions about data lakes, questions about data engineering, just keep kinda scoping it up. My time is already allocated for another twenty minutes, and I'd be more than glad to try to help you out with those questions. But other than that, for those that have no questions or don't wanna wait around to see what might come in, I appreciate your time today. I hope it was useful. If you have any trouble at all finding this material, Quincy is helping me in the background here, and this is all gonna trigger an email that gets sent out, all that good stuff, with the recording to share with your friends and remind them about it. And then, of course, when I went back to that page, it said devrel@starburst.io, you know, for this question, that question, or whatever situation you might find yourself in in the future. Alright. I'll have a little sip. And then if we get no questions, we'll just shut her down. Waiting for Quincy. Oh my goodness. Alright. I think the questions, Quincy, have dried up. I think we've got everything accounted for in the chats. Oh, here we go. Yay. Thank you. Alright. So Arul, though, is typing this question, but I know what it's gonna be. He says, hey, mister Lester, I got a question about pass-through queries. So this is just a generic Starburst, Trino kinda question. And the short answer is this. Keep typing your specific question, but let me just go down and click on this slide and say, the cool news is... cool, I see what you're saying. Well, let me answer generically, and then I'll answer that. So in general, Trino being a query engine, which really means it's a database without a definitive "only persist this way instead of that way." Right?
Oracle says, I'm a query engine, and I store data in my ORA files, and that is what I do. We act like a database, but we have some flexibility in where we store data. So the reality is, whenever something comes in, because we're federated, part of that planning and optimizing is getting metadata: structural metadata and sizes, you know, characteristics metadata. How many records? How many null records? How many things in this category, that category? We get a lot of that metadata to help us decide how best to attack a particular problem, and that's probably easier to imagine on the data lake because we have to do that. But even with a database system, even an Oracle or Postgres, whatever, which has a lot of that stuff inherently, we'll go out there and ask it for that. Tell me this. And then the optimizer will make some decisions. What did I say all that for? I said it all because we don't necessarily do a pass-through, at least not by default. But we also might, if we're only running a query, nonfederated, only on that one system. Maybe it's going to Teradata. And the query says, hey, give me the count, you know, grouped by market segment. All I want is the count of how many orders by market segment, for last month or for always or whatever. That metadata is gonna yield a couple of things. It's gonna tell us something like, there are trillions of records, because they didn't put a date range or something, and there are only 42 market segments. So we're gonna go, you know what? We know we can aggregate very well. We have this cluster and all that. But the truth is we'd have to pull all those raw records across the wire to us to then start that. So we're absolutely gonna do some level of pushdown. Not no matter what; where it makes sense. Let me say it that way. Where it makes sense. So we're gonna say, hey, you know what? I'll push that part of the query down. Maybe there's a sort by the total number. I'll be honest.
It looks like we always just say, take the sort off. We know how to do sorts fast, so let's not even ask another engine to sort. So cool. But the aggregation, let them do it. They're gonna have to crunch it in Teradata. It sends us 42 records back. We either sort them or not, depending on the query, and then hand the answer back. So we are not gonna pass through a query by default, but we may pass through some elements, if not arguably the whole query, after we make some quick checks. Okay. Arul had a different question: how do I send parameters to a pass-through query, blah blah blah? Well, the second part of that statement would be, there are some ways that you could say, hey, I want you to pass this query down. Those two ways are roughly these. First, the connector setup for that system has this concept called... it's not eagerness. I forgot the terminology, but it's something like eagerness, where we say, how hard should we try to figure it out and do a better job? We can turn that very low. That pretty much says, probably based on the query itself, before you even look at the metadata, if you find that kind of query, just pass it through. So we can get a sort of pass-through, not guaranteed. And then, second, we can very specifically, via some syntax, say, hey, Trino, Starburst, trust me. Take this query. Don't even parse it. Don't do nothing. Send it to the engine. It knows what to do. Maybe there's a function that exists there that we just don't have, and you need it. You gotta have it. So that's at least the initial strategy. So there are ways to make that happen. Syntactically, it's kinda like wrapping a query in a function call as a pass-through. And then Arul's final question, his real question, was, okay, great, I got some parameters. How do I send those to it? Well, Arul, on that one, it'd be more like, well, how did you get the parameters in to wherever you're invoking the query in the first place?
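To make the two ideas above a little more concrete, here is a small sketch under stated assumptions. An in-memory SQLite database plays the role of the remote system, and the `orders` table and its sizes are invented for illustration. The first two functions compare pulling raw rows and aggregating locally against pushing the GROUP BY down, counting how many rows cross the "wire" each way. The last function builds the explicit pass-through form using Trino's `query` table function, which is available on connectors that support it; the catalog name `example` is a placeholder.

```python
import sqlite3

SEGMENTS = ["AUTOMOBILE", "BUILDING", "FURNITURE", "HOUSEHOLD", "MACHINERY"]


def make_remote(n_rows=10_000):
    """Build a hypothetical remote system holding an orders table."""
    remote = sqlite3.connect(":memory:")
    remote.execute("CREATE TABLE orders (orderkey INTEGER, mktsegment TEXT)")
    remote.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(i, SEGMENTS[i % len(SEGMENTS)]) for i in range(n_rows)],
    )
    return remote


def count_locally(remote):
    """No pushdown: fetch every raw row, then aggregate on our side."""
    rows = remote.execute("SELECT mktsegment FROM orders").fetchall()
    counts = {}
    for (segment,) in rows:
        counts[segment] = counts.get(segment, 0) + 1
    return counts, len(rows)


def count_with_pushdown(remote):
    """Aggregate pushdown: the remote groups, we receive one row per segment."""
    rows = remote.execute(
        "SELECT mktsegment, count(*) FROM orders GROUP BY mktsegment"
    ).fetchall()
    return dict(rows), len(rows)


def passthrough_sql(catalog, native_query):
    """Wrap a native query in the query table function so the engine sends it
    to the remote system as-is. Single quotes are doubled for the SQL literal."""
    escaped = native_query.replace("'", "''")
    return f"SELECT * FROM TABLE({catalog}.system.query(query => '{escaped}'))"


remote = make_remote()
local_counts, moved_without = count_locally(remote)
pushed_counts, moved_with = count_with_pushdown(remote)
wrapped = passthrough_sql(
    "example", "SELECT mktsegment, count(*) FROM orders GROUP BY mktsegment"
)
```

Both routes return the same counts; the difference is `moved_without` versus `moved_with`, which is the "pull all the raw records across the wire" cost described above. The string in `wrapped` is what you would submit to the engine when you want the inner query handed to the remote system without being rewritten.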
So if you're running an application or you've got a BI tool, how does it get parameters? Same, same, same. So you're just gonna substitute those parameters. If you're doing it in our query editor, our query editor itself fundamentally doesn't have that. There's not a "run workbook seven and then supply these three parameters" kind of thing. But you could do a logical version of that. You could use lookup tables to find values. You could possibly do secrets, if the runtime can sort that out, but that's probably not something you do with Starburst Galaxy and that kind of stuff. So I know it was a bit of a short answer to the real question you had and a long answer for the concept, but it goes back to what I said. However you would normally invoke a query to Starburst or to Trino, on that engine, how would you get parameters into your query itself? Maybe the parameter is the table name. How would you do that? It would be the same thing for how you pass it specific parameters or anything else. So we can talk offline and try to sort that out for you specifically, though. Cool. Cool. Alright. Awesome. I appreciate that. Other thoughts or questions while we're here? Cool. And while I didn't put it on here, I would say that we do have this beautiful site called the Starburst Forums. There's the Trino Slack. There's your support account, of course; if you've got paid support, use that as appropriate. But for questions like the one we just did there with Arul, a great place to put them is our community forum. It's searchable and findable on our website, pretty easy, under Resources. But this is just your classic kinda Q&A setup here. You know? Tell me about this. Tell me about that. How do I apply a column mask, whatever, whatever. This is your old Stack Overflow kinda model. So someone asked a question.
Lester jumped in and said, hey, here's what you do. And then, thankfully, the question was a little more than that. Cody on the insurance side jumped in: oh, no, I see what you're asking. So you're getting a wider audience of people helping you out, and, absolutely, you're gonna get, hopefully, the great "it's all working, thank you very much" feedback when you're all done, with someone voted the best answer and all that kind of normal stuff. We help them out. Alright. So Quincy did post, while I was rambling, our next webinar. From the events page, you can get to them as well. Plenty of great stuff. I don't think, Quincy, I have any other call to action in my slides. I think I just got a big thank-you slide. Yep. So my call to action is to let us try to thank you again, which means give us some more of your time in the future. We'd love to share ideas, thoughts, webinars, tutorials, blog posts, professional services, any way we can help you. I know that's my mission in life, not only because I get paid to do it, but because I actually like to do it. So, with that said, I'm gonna sign off. I'm just waiting a few more seconds in case we get another question. And I would say, we did not. I'll go ahead and stop my sharing. And, again, thank you so much for your time today. Hope everyone got something useful out of it, and I'll see you on the next workshop. Thanks now.