Video: Building Data Pipelines with Ibis and Starburst | Duration: 4980s | Summary: Building Data Pipelines with Ibis and Starburst | Chapters: Workshop Introduction (11.12s), IBIS Overview (143.975s), Environment Setup (240.53s), Cluster Setup (368.805s), Catalog Setup Credentials (507.005s), Environment Setup Verification (641.685s), Setup and Configuration (871.815s), Transformation Pipelines (1048.265s), Data Frames Explained (1207.335s), IBIS Framework Overview (1334.13s), Cluster Connection Setup (1524.15s), API Data Operations (1795.755s), DataFrame Operations (2070.01s), Method Chaining Transformations (2197.81s), API vs SQL Approaches (2400.41s), Q&A Break (2679.56s), Bank Data Exploration (2735.1s), Phone Number Standardization (3055.945s), Choosing Your Framework (3164.665s), Phone Number Validation (3267.185s), Data Type Transformations (3592.225s), Bronze to Silver Transformation (3787.94s), Data Loading Methods (3949.01s), Gold Dataset Creation (4134.29s), Creating Database Views (4320.885s), Query Plans Analysis (4568.295s), Wrap-Up & Next Steps (4640.53s), Resources and Closing (4801.54s)
Transcript for "Building Data Pipelines with Ibis and Starburst":
Hi, everyone, that made it today. Hope everyone's doing really good. My name is Lester Martin. I, as I like to say, I'm your friendly neighborhood, developer advocate from Starburst, as you see on the screen there. And, today, we're doing another one of our, workshop series webinars. So kinda used to do this all the time, but we kinda firmed it up this year and put some repeatability and that kinda stuff. So we're doing this about once a month. And these workshops are about a ninety minute long, webinar, and we're doing stuff. I'm giving you materials. I'm giving you access. I'm giving you rights. I'm giving you instructions. And I'm walking you through those with you, meaning that even if you don't want to do them or if you're doing them, but you have some questions, conceptual questions, intro you know, questions about what's going on or just, hey. Something's broken, or why doesn't this work? You know, in many ways, it's a little bit like a a live workshop. It's a it is a live workshop for those of you here with me this first time. If you're watching this on demand, the good news is all that material is available, all the slides, all that good stuff. There's lots and lots of good ways to reach us. In fact, one way to reach us, if you don't know, is, turn the slide there, is this email address you see at the bottom. That's not the only way to reach us here at Starburst, but it is a quick way to reach me if you're, and me trying to represent and help, you out getting the most of out of our products and those kinds of good stuff. Plenty of other ways for those that don't know how to reach me directly through LinkedIn or even email or whatever. The Slack channels, all that good stuff is available to you, as well. But do jot down or download the docs. There's a presentation link in that presentation link. Obviously, you have the presentation I'm looking at includes that DevRel app. Now I'm gonna go back up a slide again to the top and say, what we're gonna do today is, well, we're gonna explore IBIS. For those that maybe don't know what IBIS is, it is a Python framework that's gonna allow us in fact, maybe I'll just go show you a slide that I have somewhat devoted to that after the environment set up. We'll just talk about it briefly. IBIS is, you know, it's a it's a data frame API. So it's a framework that allows us, if you are familiar with data frames if you're not, we'll talk about them as we go along today. We won't actually explore them in great detail about the functional programming. We'll talk a little bit about lazy execution and others. So if you're not new I mean, you are new to data frames such as through PyStar PySpark or something. Some of this will feel a little strange, but those that came from a background or even use maybe nondistributed ones like pandas or others, any data frame API, you'll fit it home here. So it's a data frame API, and we'll see a lot of that code. But what makes IBIS a little bit unique is the optionality play for data frame APIs. So TreeNet will have to think of it as the optionality to all kinds of data sources we can plug in everything. Well, I just says write your code using this particular data frame API and and then tell me at runtime where you wanna run this thing at, what back end. So we're gonna run today. I was gonna do two or three of them just to make it more fun, but we're just gonna run it in, the Trino back end. We're gonna run it in Starburst. In fact, we're gonna run-in Starburst Galaxy. And to make that useful and fun and interesting, we need to, you know, go back up a slice. We need to set up our environment. This is the part that could take a few minutes. So I wanna go ahead and kinda jump start and get you going. There is a barcode. Sorry. There's a QR code, not a barcode. A QR code that's gonna help you get this set up if you don't have it set up. And I can come back to this or, again, download it from the docs page over there. But the main thing is it's ultimately gonna send you to if you click this link, you're gonna go to well, in fact, you're gonna go to this right here, instructions. And the long story short is this particular workshop says, hey. You know what? Just use the generic instructions that Lester already set up about setting up your environment. And that environment setup can be a lot of things. You need Trino with, with, with a arguably an iceberg back end, but I guess you just need Trino enabled to write to a table anywhere. But my preference is you set up Starburst Galaxy, software as a service, instance of Trino Plus, and then do some kind of, catalog that references references an object store such as AWS s three. So good news is if you click through all that stuff, again, it's all in the chat. I'm sorry. On the docs, this one will be called, demo instructions that should take you right here if you don't have it, in front of you. And, really, all you have to do is you just go to Starburst Galaxy. And guess what? If you don't have it if you not already have a Starburst Galaxy, do this right now. Oops. That took you to a a tutorial, but, arguably, the tutorial is just gonna take it here. It's gonna say, hey. Go to you know, give me a few informations. And, literally, this is what it is. I have a little shorts video that takes about a you know, I do it under a minute, but you basically gonna fill in this address, name, address. You're gonna get a quick email with a code, and the screen will say, you know, go get your code under email, and then you're gonna set, a domain name. So my domain name, for example, is as you see, let me run this query just to keep my server alive. You look it's hard to see on the front the front here, but the upper left, mine is lester.galaxy.starburst.io. So it's gonna let you set up anything you want there, something long as it's not taken. Lester was available. Lester probably would have been available if I didn't take it, so it's kind of a unique name. Alright. So you wanna get set up with that. The the instructions also are pointing to you and said, hey. If you that's pretty self explanatory. You can do that. Instructions say if you want a little help, there is a tutorial here that'll walk you through it in in, you know, laborious detail, and that's cool. But I will say if, before I look at that quickly here is, you know, there's a couple caveats. There's more in that tutorial than you need to do. You don't need to set up the, you know, additional catalogs and stuff you can. And I encourage you to do them, but maybe not. If you're setting this up today for the first time, maybe I wouldn't do it now. And it does say, you know what? Go ahead and do this the the setup step about creating a cluster. So you get set up. You can skip a lot of this stuff, but do this create cluster. It is very likely the cluster that gets created. I'll look over here in Starburst. It will create a cluster probably called free dash cluster, and that may or may not be where we want it to be. It runs in a a variable different, it it it gets positioned in a particular AWS, region and, probably US East, but it might be one or two. And I want to make sure, as instructions are here, you know, go ahead and follow those those bits and pieces here. Let's see. Create a cluster. You know, create a create a cluster. And, you know, you can don't have to name it exactly what it says here, but that's all my instructions or references name. AWS US East one free. It's a long, long, set of steps here. So that's important to get that set up. Why and I'm encouraging those that are gonna do the lab with me. I think there's some chats already, so I'm gonna look over there. Hopefully, just still good. For those that are wanting to set up your environment, I'm purposely kinda rambling here. So I want you to get in there and give it a try. If you got some questions or if you get to the point where I'm about to do something, you're like you think you're a minute or two away, say something. I'm almost there. Hold on a second. But it's very straightforward to set that up. Secondarily, after you have a Starburst Galaxy or in a TriNet environment, you need, a place to store some tables because we're gonna store some information here as well. And I did the same thing. I gave you a link to another tutorial that we have. And this tutorial, same thing as before. Go look at my instructions here. It gives you some caveats. You know, be sure basically, you know, use the name they suggested. There's a couple steps here that say, you know, plug your name in there, replace your dash name or first dash last with your name. Do that. It's important so we don't step on each other's toes too badly. And then lastly, and I'm just gonna go look at those steps real quick here. It's gonna say, hey. I'm giving you some credentials, some some, Amazon S3 credentials. Maybe not the best way to do security here, but it is a a valid way. And I'm gonna say and I'll show it to you. It's gonna say, hey. Get the credentials from me. Well, those credentials are in another doc over there called what is the doc called? Catalog config. If you open that up, it's a PDF, and it's gonna because I should open that one up myself. It's gonna give you the fact that and I'm back in the tutorial here, not that PDF. It's gonna give you let me see if I can get you there. Log in, create a catalog, new catalog, create a catalog, Amazon. S three, call it temp cap. Again, it could be any catalog if you adjust the instructions. And it says on this step, get the information from me. Today, right at this moment, I put them back in this tutorial here. These credentials aren't gonna last forever, but they'll get us through, the short near near future. And the good news is you really can't do too much. It's only locked down to a single bucket. And, I got some jobs watching that to make sure you guys don't go crazy and, store a bunch of other data there that has nothing to do with, I don't know, training workshops and whatnot. So for those that are here today, the the the keys will actually be in there. For those that are watching it later, you may have to as, my doc over here says, at least if I'm gonna swing it over, it might say, in general, for those who are watching this after the fact, the the short answer is you probably need to email me, dev rel, @starburst.ao, to give you the latest, greatest set set of credentials. Alright. So there we go. Some callouts there. This is that other doc that I mentioned there. Alright. So if all that goes according to plan, and I'm gonna hope it's going according to plan, What you're gonna end up with, once you do all these steps and I'll just go click on mine to show them to you. You're gonna end up with an environment that looks like this. So I'll step back over to your query editor. I'm gonna run this to keep my server alive. So what I ended up doing, I got a cluster. So under, admin clusters, I also have a cluster called a AWS US east one free. It could be any anything in The US East 1 domain because my bucket is there. And we're using all a lot of free resources, so nobody wants to pay for cross region transfers and all that kind of stuff. So I put the cluster where the bucket is, and everybody will be happy. So all my notes take you there. What else can I say? So you'll get that set up. And then under catalogs, you're gonna get a catalog called POP cat. So those that have already have some environment or those that are just gonna watch today, we're about to do the steps. This is just gonna be a preamble to get everybody a chance to set things up and ask questions as we go. And, there we go. And I got a catalog called TempCat, as as I mentioned. Oops. I should've said edit configuration. There it is. TempCat, AWS key, Lester. There's one of those examples of things in the notes that said put your name, not your dash name, you know, Lester, Texas. There we go. And and and it was pegged, and assigned to that cluster. So what I'm gonna do is take a deep breath and say, while I read the chats, if you're trying to just oh oh, let me say this. So so with that, you should be able to come to here, find, AWS east one, and find catalogs maybe such as the sample one. The sample one has a domain, schema called BurstBank, and I'll just force it to run a query out of that. And, hopefully yeah. There we go. There's some data. So we're trying to see if you're there. Now if you wanna be there live with me and you're having trouble or you need a few minutes, say something real quick. I'm gonna catch up in the chats and see what's going on here. Where's everybody from? Hey there. Ukay Ukraine Croatia. I got a lot of pen pals. A lot of pen pals in Ukraine. So, welcome, Ivan, Croatia. Quick question is sure. The stuff that's being recorded will the same URL that you, went to to sign up this would be the same URL someone would come for on demand, that hasn't registered and find it. And then if you're here and you gotta leave, we're still gonna get a message in the near future. It says, hey. The recording's up. Take a look at it. So sure will. Quincy answered all that stuff. Thanks. Thanks, Quincy. Based in on Alabama in Alabama, but from, Columbia. That's Wes. Hey. Quick joke is, they found out I saw it on CNN last night. They found out that the toothbrush was invented in my neighboring state. I live in Atlanta, Georgia over over there in Alabama. And the proof they have, they said if it was invented in any other state, it would be called the toothbrush. Think about it. I'll be here all week. Alright. On Galaxy thank you, Quincy. In Galaxy, do you need to register for a lab account? You do. You just need to go there, follow those links, or just go to starburst.io@productsgalaxy. There'll be a sign up link. Absolutely. Alright. Thanks, Wes. I appreciate you. If you're in Alabama right now, you know what I mean. Have fun over there. No. Just kidding. My dad was born in Alabama. Red red level Alabama. Teeny tiny town. Alright. So I don't see anyone else a few people ask questions about it, but I haven't seen anyone say, Lester, don't get started. Don't get started. So with that, again, if you want to do this later, hang with me. Watch what we do. Come back and do it later. In fact, you can do some of the steps, see some of the materials, and kinda look at it while I'm looking at it when you're on screen, as well. Alright. So the assumption is, at this point, if you're watching it yourself, hit pause till you're caught up. DevRel at starburst dot I don't know if you need a little help with these tutorials and everything, but you wanna get set up, with servers galaxy and, a catalog that you can write to. Again, preferably named as a temp cat, preferably the catalog is this, just because the instructions are there. There are some tear down instructions. I'll mention at the end of all this because these are pretty temporary. In general, my strategy and my recommendation is unwind this thing completely. You really don't have to, but what happens is the data often leaves and the metadata stays around, then you run a query layer and go, why is my table blowing up? Well, the metadata is around, but the data is not around. It gets really confused. So, like, kinda short answer is tear down. Okay. I'm back to the regular regular instructions for this IBIS pipelines, this instruction page. And, it's also gonna say, okay. Cool. Last thing you need to do before we get started is make sure this works. So you're gonna hit this link that says visit the Jupyter notebook. Now this is just gonna open up another page right here on that GitHub project. If you look at it, it should look like if you're familiar with notebooks, if you're not, they're web based micro IDEs, I guess you could say, run bits of code, sometimes see visualization stuff. We're gonna keep it pretty textual today because we just wanna focus on connectivity and API and stuff. But if you wanna you could take this, Jupyter Notebook, run it in your own environment. But if you don't have all that, I would just hit this nice little button at the top. It's gonna say open in Colab, and that's a that's a Google service where they'll run some machines for you and stuff. In fact, if you haven't ever used it, it's gonna ask you it's gonna make sure you have a Google account. So if you don't have a Google account, grab the notebook, run your own Jupyter, or run-in some other Jupyter instance you have, whatever. Or just say, yeah. That's me, and eventually, we'll get here. Now I'm gonna leave this up in the background here. And in fact, what I would say, if you got this far, just for giggles, come down here where it says explore it, IBIS, and just hit, this first play here. This this install bit. Just so say run. Yep. Yep. Looks good. And the moment it starts spitting some stuff out, we'll go, cool. It's kinda working. Yeah. It looks like it's I see different okay. Yeah. It is working, and there we go. Collecting a bunch of stuff. So so you wanna find this. You wanna find this. That's so helpful. This. Alright. Let me pull that back up because I lost, the instructions. When I clicked on that, I should have set new window and all that stuff. Alright. There we go. Why does that not only go in there? There we go. Alright, Lester. Let's get into it. We got our notebook up over here. I know it's working. There's a machine here. It's doing some work. That's good. We'll come back to this very shortly here. I'm gonna go back to my Slideware. I'm gonna go ahead to my chat, make sure we're all good, and let's get into it. Alright. So that's what we wanna do. We've got environment up and running. That's important to try to get that going, or just watch. Some people just wanna see what's kinda going on, and that is okay. Again, Starburst, is, you know, is Trino. Trino is a back end, one of the back ends that works in that IBIS environment. We have our environment. So let's talk about with a few slides, and then let's go straight back to that load book and spend the rest of our time there. I just wanna make the point. The the presentation, the workshop name was about pipeline, you know, that good stuff. What am I talking about? I'm talking about transformation pipelines. I'm talking about landing data, cleaning it up, and then optimizing it for, quick access and that kind of stuff, your dashboards or applications, so on and so on stuff. Nothing new. Probably, if you're a data engineer or know a data engineer, you've heard this story about a thousand and one times. That's great. I just wanna make the point that that stuff, those steps, those logical concepts really are pipelines. They're different kind of focuses. Usually, we hear letters like e's and t's and l's and sometimes i's and that kind of stuff. Okay. All good. I'm just really trying to make the point you can use SQL or Python, to be honest, for that. And if you're targeting Trina, if you use a different engine, to be honest, you probably have about the same thing. If you're running in Spark app, about the same kind of options. And then I just wanna say, we're gonna talk about Python, but I just wanna remind folks you could do your transformation pipelines in SQL. And there's some good reasons. You can use tools like DBT and others that help orchestrate all that. You can have a full on orchestration workflow tools like Airflow, etcetera. The only got you, I would say, with SQL, really is, at some point, some of us at least go, man, that's a lot of SQL. It's kinda hurting my head. It is, you know, give or take. I'll go back one. There's other concepts like maybe it's not necessarily code like we think of. So some of the things that we have as software developers may or may not be used with us. So so or all that easy to do or even possible to do. Alright. So we're talking about data frames. So what are data frames? Data frames really are, they look like collections for those that have never seen them before. Look like a big array. Hey, Trino. Give me all the orders from the order table. Right? Boom. Boom. Boom. So I got a collection of orders. What's kinda weird about these data frames, at least in this distributed world, they're not really loaded yet. They're really just instructions to load data. They use this concept called lazy execution. So they wait until so you can write a whole bunch of code, and maybe nothing happens until you actually need to see some results or save results and that kind of stuff. That's a weird concept if you have never heard of it. If you know exactly what I'm talking about, you're like, yeah. Took me a little while to kinda get there in my head, but but I got there at one point. Arguably, the good or the bad news is it's writing code, and some people just like to write code. Good news is regardless if we write raw SQL as a human or even maybe a generated one, you know, with our AI tools or AI friends and that kind of stuff, or we use a data frame API, we're still gonna get SQL in Trino at least. And that's what I just often assume. Many of those back ends use SQL. Not all of them, but me and them use SQL. So that's what it's kinda driving to. I would say if you looked at what the SQL looks like, and we're gonna see that when we run some code here, the SQL that the data frame engine might generate might look a lot more nasty, a lot more verbose of the way they're written. The good and the bad news is that's okay because these engines have cost based optimizers. They're they got statistics. They got smarts. And, ultimately, they should, regardless of how you express it in your SQL statement, be able to optimize and say, I see what you're doing. You try and do the same thing, and they should build if you're familiar with a concept called query plans, they should build an execution approach that looks identical or pretty darn close, to what someone might have written with raw SQL. A bunch of slides here just kinda glass and so data frame APIs in Starburst. Starburst built on tree mail. Well, we're gonna focus on IBIS today, but there is another framework called PyStarburst, just just to make sure you're aware of it. PyStarburst, only works on Starburst. It's it's not a Trina open source element, and it looks, to be honest, looks very much like PySpark if you're familiar with that framework. The real trick is the difference is it sends to the to Starburst cluster, data frame approach, the kind of lineage, the definition. And then inside of Starburst, he turns that around and build SQL. Good stuff. Just, you know, letting you be aware of it. Again, IBIS is focused today, and I really like IBIS. I would like PowerStar First for a lot of reasons too, but I really like IBIS, especially if you're in a shop that your back in may evolve. And I don't mean, like, this year it's that, next year it's that. That's valid. But maybe you do your your local testing and memory, you know, on a local laptop and your your, CICD type bigger testing. You might do that in something else that uses a a smaller environment, and then you do your production in something else. This is a a cool framework that lets you test those things out. You know, you and I know that still didn't hurt to know exactly what it's gonna do in that end game, but the code looks the same. And this is an example what we're about to jump into. We're gonna write some code here. I'm gonna give you some code in the notebook. I'm gonna let you hit the play button so you can manipulate if you want to. But this is the kind of code that that I'm talking about. This is a data frame API. Now data frame APIs, just go ahead and say it for those that are familiar, they're gonna go, hey, Lester. Can't you just write SQL in a data frame API? You can. I'll show you that in a little while. But I'm gonna focus on more of the first class version of that methods like filters and grouping by, and that gets in that kind of joins as opposed to just writing raw SQL. If you don't do 99.9% of raw SQL, maybe just do it and run an SQL and use a workflow to run it all. So, so you have options. You have choices. You know, a lot of activities, approaches, etcetera. So similar APIs, and then we'll jump on it. And same thing. Similar just like SQL and a data frame generated thing should come up with the same query plan. Same thing between these two popular, frameworks. I have seen, and that was some time ago when I was digging deeper where I saw a few times where PI Starburst actually did something a little smarter. But that, arguably, with my statement earlier, shouldn't be the case. We should ultimately the engine should be able to say, hey. What are you doing? But, you know, there's always gonna be a case where something just runs tiny bit better for some reason in one system versus another. But overall, I would say they're very, very, very, very compatible and configure and and comparable about what gets run. And ultimately, again, the main reason I'm talking about IBIS today is if you especially if you're an open source tree or not a Starburst customer, or you have those options and you need some other back ends to play with or participate in, that's a good reason. So enough slides, lot of not a lot of detail, but a lot of slides there for you. The rest of the deck is this, this hands on lab. So let's get on into that. Alright. I'm gonna switch over for a second. And, what happens with the Starburst galaxy, just to go ahead and let you know, today, Starburst itself and sure itself is not built serverless kind of so it does spin up a cluster. I have a really tiny free cluster, one machine, but the free model, we don't keep it running forever. So I cut you out. It it quiesques. It shuts down after about a five minute period, so I'm just kinda kicking it back up, for it to run. Now what we should have, and I think is that right there? I'll go to the top. I think I copied up. There we go. What those labs, those tutorials should have helped us do is something like this. It told us to create a schema called, in our temp catalog called temp first name, last name, postal. So I put Lester tester 12345. I adjusted all this. I'm looking at that particular thing, so I can just reference stuff local here. And, just to make sure it's all working, I'm gonna do myself a tiny favor. Should work great. Scan must be specified. Yeah. It should be oops. I've toggled it off. I'm just gonna make sure, yep, that I can load a table and it's gonna be there because we'll mess up if, it doesn't work later from, from my bus. Yeah. It looks great. All good. Alright. The good news is the rest of the work is over here, or 90% of the work is gonna be over here. We're gonna sometimes go back to that galaxy system. Alright. I'm gonna make that a little bigger too. It looks kinda tiny for those. Here we go. I'm gonna keep it at probably one ten. So if it's a little small, you you chime in and let me know. I will say, Quincy might have mentioned it. Well, Quincy, tell me if I'm wrong on this one. Make make a quick message here. Quincy's in the background here. I think we might be set up where if you need to not just ask a question chat, but you wanna come live audio or audio video, I think there might be a concept to raise your hand. But, if that's not there, I'll let Quincy message us. But chat's always good. I'm peeking over there. Alright. We're back in this Jupyter notebook. Gave you lots of good links about when where why. We got through all this stuff. And in fact, we went ahead and did this install bit down here. Right? Well, for this to run, for the very next thing to run, like, if I go ahead and run this next little piece, it's gonna blow up. It's gonna blow up because I didn't set a lot of details, like what's my username and that kinda good stuff. Okay. That's fair. I actually made that the very first playable notebook up here, our note up here. A little paragraph up here says, let's run this Git pass. Now this Git pass is gonna ask us for your host name, username, and password. How do you find all that stuff? Well, I tried to put it right here, so I'm gonna pull this down and see if I can get them all on the screen at the same time. Yeah. Why not? It's gonna say, hey. Go back to Starburst Galaxy. Go under admin. Go under clusters. So this is all the instructions. That's one. Look at the cluster list. Find the cluster called thought I listed it there, but I guess not. The US East one, the one you're using. That's this guy. Scroll to the right, and there's a little vertical ellipses way over there. And there's a option called partner connect. There's a million ways to figure this out. This is a quick way. And then the note said just hit drivers and connect. Drivers and clients, and there's an IBIS link here. I'm going to click on IBIS. I need these two values and my password. I'm going to grab the host, click. I'm going to paste it over here. Boom. Hit enter. My user actually is what it looks like. We we usually supply our, default role, the role we wanna be in when we log in. So we're using Account admin. Then I'll type my password in. Now the notes do kinda mention here that you won't know this isn't really gonna validate your password. We're just storing some properties in these variables here. So that way, when I come back down to this boilerplate code and run it again, it should work because it found some value. So it is just simply gonna this is just a quick check. It's gonna connect, make sure everything's good. And if it wasn't good, you probably got an exception like you saw. And, if it was good, you got this code word called connection is good. Alright. Pause for a quick second. Awesome. Let's see. Let's see here. Thanks for that, Quincy. I appreciate it. Alright. What I have now is a couple different sections in this notebook, and I'll I'll collapse them, here so you can see a little bit what we're gonna run through. Yeah. I have major I guess, maybe two sections. I got this section called let's just play with the API for a bit, talk about it, looking looking, starter burst stuff, and then maybe do, like, a real world example. Very simple one, of course, that piggybacks on maybe the stuff we run through here. So, here's a quick example. Select a full table. So you wanna get a data frame. We'll call it I'm gonna make that a little bigger too. And if it just looks terrible, we'll know in a minute here. It's okay. Go get me a so use this object, con. What's con? Con is what we set up up here. Con is connect. Hey. I just connect to Trino. This is where you would swap things out when you need a different back end in your code. And in fact, if you had dependency injection, you could do all kinds of cool stuff. Alright. So we say, hey. And the table function, if you if you went and looked up the API, it's gonna say, yeah. Tell me something like the name of the name of the, table you want, customer. And then, optionally, because Starburst has this notion of a bunch of catalogs, and each catalog can have a bunch of schemas and, you know, each schema can have a bunch of tables. You could use the connection defaults here. We we logged it into heck. I logged in to something that doesn't even exist. I should have put my name here. In fact, I need to do that for, for later. Lester tester 12345. We should use whatever we said we would use earlier. So I'm gonna run that one more time because I do want that connection based on that. Okay. Cool. Alright. Run this one more time. Make sure it works great. So what did that do? Yeah. It probably went to Trello and said run you know, give me a select star from, that customer table. And we're gonna verify that. Pretty easy to do by simply going back to Starburst and then looking under our query insights. This is the things that have recently run. And guess what? There's one that said select all from starburst I'm sorry. Sampleburst bank customer. Yeah. That was the query. Very sophisticated, very fancy. Makes a little bit of sense, though. And then we have, you know, we have this object. We have this data frame, customer data frame. You know, what else can we do? Well, if we really didn't want all the table, we if we wanted to select name account balance, usually what you would see happen in something like a data frame is you say, okay. Cool. Maybe use table to get it, and then maybe that gave you a data frame. And then on that data frame called function called select, this is only one of few of the fields. So it's a way to do projection. Now the truth is is what I was trying to say earlier, this doesn't mean that there's an object called custody f that has, you know, thousand customers in it, and then from that with all the columns and then we strip it down. It just means that every time I need to see something, save something, any kind of IO, it triggers an event to occur. So that means if I went back to my queries here and looked at it again I'll refresh it, make sure I'm not wrong here. There we go. You might look for you might see something more like that. Hey. In fact, it represented a weird way. It said, look, select name account balance from customer, you know, and then made a select. So I made like a sub select. Select all from select. The good news is that actually does exactly what it would normally do, something simple and easy. It was it just kinda goes back to what I was trying to say earlier that the data frame APIs, one are lazy loaded. They wait till something has to happen, then they backtrack and they figure out what needs to be done, and then they run it all at once. But that SQL statement sometimes looks a little bit, I don't know, more a lot more robust than we would have typed it as a a mere mere mortal. Again, same run time. It's just it's just a way it's gonna work. So I'm gonna take that projected one. So so projection, easy breezy. I wanna do a where clause. We'll use something use a use a function called filter. You know? Hey. I only want the count balance is greater than $50. That sounds great. If I went back and looked, guess what? It's gonna do something very similar. It's really just gonna do select these three columns from that table, you know, where account balance is greater than 50. And what I'm trying to show this early in the steps here is that, really, it's it's, you know, it's a matter of style. It's a matter of, familiarization. It's a matter of, you know, what your team is dictating or wanting to do, when you build things out. So here's another example. I went ahead and said, hey. You know what? Let's let's show how that might look like. Like, I showed you that as three or four different statements. I might do something like this. I might say, hey, connection. Give me the table nation from this this this schema here. Instead of select, I'll do drop. So if you got a 100 columns, you want 98 of them, drop a few. If you have a 100 columns, you want two or three of them, do the select, but they're the same thing. They're just projections. So I only really want region key and oh, I want region key and comment to go away. And then I really said, you know what? Maybe I'll rename some of those things, nation name and nation key. You know, I just wanted to clean up the names a little bit. Actually, I wanted to call them the thing on the left. That's what you see here. In underbird nation key nation. I'm just trying to make the point. That's that's like an as, you know, select this as that kind of model. And maybe for fun, we can kinda go take a quick look and make sure I'm not making this stuff up, See what that one might look like. And it should look a little bit weirder, a little bit more gnarly. Oh, that actually turned out super easy. There we go. There it is. It's it's using that same pattern of sub select. Select those two values as this from that, you know, boom boom boom. Pretty boiler plate easy breezy stuff. Alright. Hopefully, for those that have seen the data frame API before you go, got it, Lester. I hear you. Move on. And for those that maybe have never seen it, and there's always usually everyone's in the middle. So I'm, I'm just trying to give you a flare of what's going on going on here. What else might you do? You might do a join. Again, programmatic joins look like that. You take your data frame, got join, tell me the other data frame you're joining to, then if I that own column, you know, own characteristics or own criteria, you could you could declare across it, full left, full right, you know, right, full outer, whatever you want to do as well with another optional parameters defaults to to enter like a regular SQL kinda does. Okay. Great. We join that nation table with that, with the, filtered d with the filtered, which was the the customers earlier. Okay. That's cool. And we could do some more stuff. Normally, what happens when you join two data frames, you get all the columns. So if you had ten and ten, you got 20. So, normally, what you follow that with is, okay. Now get rid of a couple of those columns or or just keep a couple of columns, either select or drop. And, again, it may sound weird. It may feel weird, but it is, arguably, could be very thought of as very simple. You know, do a little thing. Do a little little thing, and don't worry about, the fact that it looks like it's inefficient. Looks like we've done gotten this dataset four, five, six times and keep doing something to it over and over. We haven't. We've just described a bunch of concepts that we want, and then we finally say, give me the answer. We're expecting the combination of IBIS to say, what does that look like really? And then building, say, SQL that thinks it's pretty darn good, shipping it to a back end like Trino, and then Trino can then further go, okay. Let me see how I can really tighten this down and do this as as succinctly as possible. Alright. If you wanna do some group buys and aggregates, guess what? It has functions like group by. So group that thing by the nation name. We'll do some aggregates such as, account and an average here. We'll call it number of customers, average, account balance based on those nation names. So they're joining all that kind of good stuff. And then, of course, who who doesn't want some sorting? So there we go. Further tacked on that thing dot sort. Alright. And I don't think I need to look too deep over there. Now I will say if you if you saw that, you're like, man, that's a whole bunch of little one liners, Lester. Is that really how people write code? You can. Some people love to write code that way. I would say a lot of people that spent time in data frames probably write code that looks a little bit more like this. They try to package up. They try to have less, declared variable names for a data frame, unless they just need one. Or usually just need one comes down to code simplicity, or sometimes it's just hard to reference. Often, like, in joins, it's hard to reference kind of like a unnamed thing, an anonymous kind of named. So you often build the tables as data frames and then do a join on them as well. So that's what you see here. I'm basically so do all my filtering and projection and all I can as much as possible in the nation's table, do as much as I can on the customer to get it down to what I want, and then do the join there, including all the other stuff. So what you see is I'm I'm just method chaining everything. A data frame, these functions are called transformations. They build another data frame, not another transformation function built. So it's just logically build in this kind of lineage, this hierarchy, this history, this this set of instructions that naturally fit together with what's gonna happen in all these distributed engines on the other side. Alright. So you can put it all together and replace what we saw above, and this might be what you wanna do here. And what and, again, those are relatively new to this. You use these notebooks mainly because they let you do little bits of code at a time. Try this. Try that. Try that. Try this. And then often, while you could theoretically productionize a notebook and have this thing scheduled and run, more than likely, you're gonna lift this code out of here and go put it in a dot p y file and then either run it with a script or have a scheduler or automation tool, run things for you. Alright. So that was just a API glance over real quick there. I hope it was, of somewhat, interest as well. I will say this. Let's look at that very last one I just ran. Yeah. I wanted to see that one. So I'm gonna go back into the query insights, refresh, make sure at the top. Then I'm gonna glance at those times. Yep. Yep. Yep. Yep. Yep. Yep. Ultimately, that last set of stuff I built, I should've built this. It at quick glance, it might look like a lot, but you you see a lot of spacing, and it builds the date the SQL that IBIS is gonna build is gonna be heavy on sub selects as you know, you could it could have built them as CTEs or something else if it really, really wanted to. And it actually uses all this kinda table. It makes up little names for himself just to kinda keep himself honest. But this is the SQL statement, that he created. So it really was just one command at the very end. There's all that API, then you said, show it to me. Okay. Package it up and send it. Now this is 56 lines. If I looked back into my, notebook, I said something else. I said, you could just write SQL via a data frame API. It's not usually the coolest way to do it. And in fact, that connection option just has an object a function, sorry, called SQL. Same thing in in, pi star pi starburst, same thing in pi spark. They all have this way of, like, hey. I just wanna write some SQL statement. And you can see this is not 56 lines. This is the exact same thing. Right? Joining customer on nation. I got my, you know, where clause on the customer thing here. I got my all my projection and calculations based on my group by. You see down here, do all my aggregations. And then sometimes, as I said earlier, we'll just make sure they have the same results. There's results here. Saudi Arabia, Vietnam, Saudi Arabia, Vietnam. The numbers all, lined up there. And that will look like over here in the query editor refresh that one more time. And that should look like what I've sent over. See if I'm not lying to you. That is it. Now it took it and it ran a SQL pretty, you know, pretty fine pretty DFI kind of thing. But that's the SQL, that I wrote over there just a minute ago. Arguably, this is the SQL right there. It's wrapping it with this limit because it only wants to give me a few rows and that kind of stuff. But that looks a little shorter than that 50 some odd line, you know, the the whatever that was, ten twenty. In mine, I did the old school language, so it's probably only about eight or nine lines. But, you know, when you put enough space in, it gets a little bigger. But it is the query, that I typed right here in a hard code. It said send it. Now why would you or would you not you know, why would you use one or the other of these kinds of models? This one right here, the SQL I'm sorry, the APIs oriented versus the dot SQL. Well, it's the same argument if anyone in in their data frame with a distributed back end. This yeah. I've I've been around a spark for not quite a full decade now, but sometime now. And, similar situations. Usually, programmers like this kind of world better. So, again, a lot of it is style, approach, etcetera. Sometimes, to be honest, it it just makes sense to go, let me just write the SQL, put it in here. And you might say, well well, Lester, you said if you use SQL, just use something else. Well, maybe you just use the SQL to get to this initial data frame, then maybe use some Python libraries to do something else, so on and so on and so forth. And in fairness, personally, Lester enjoys writing code like you see here. But in fairness, if I was on a team of 15 or 20 people and people leave every quarter, we lose a person and gain a person, all that good stuff, and I want to get promoted someday, I'd probably do more of this in my code because I know there are more people that can follow behind me and understand that. And if they don't fully understand, they can easily just copy and paste it and and run it in the query editor as well, any or any other tool they like to see. So I understand why. In fact, probably would recommend sometimes people do some amount of this in any data frame API. But, again, if I had our smaller group and we work together and we really cared about each other and we were gonna work together for a while, and we all like to program, I'd probably do it in this. So, again, style, taste, directions, standards, all that kind of stuff make a difference, and all teams, like it or not. Okay. So I'm gonna pause there and see if there are any rough questions, get another sip, and then we'll jump into my, okay. Let's talk about a real dataset that has real problems and talk about progressing through that pipeline, you know, that raw, you know, structured, consumed layer or the whatever they call it, bronze, silver, gold type datasets. Since the questions are low too, if there's any interest, like this is boring or this is if you're boring, you already hung up. If it's somewhat interesting in line with what you were hoping for, give us a shout out. Let us know we're doing okay. And if we're missing the mark, if I'm missing the mark, you know, we still got a little time today. If there were things you're hoping we were gonna go down today, don't hesitate to put those, over there in the chat as well. Okay. Alright. Let's do this. I'm gonna run this again because I'm a keep my cluster alive in the background. Alright. Let's do a real world example. So let's what if, you know, let's what if we had a a bank dataset? And, we'll focus initially in this schema. Well, we'll look at the schema. That bank dataset that I'm talking about or bank queer, bank schema looks a lot like this is stuff by using the Starburst Galaxy, Gully stuff. Kinda classic stuff, accounts, customers, profiles, employees, mortgages, blah blah blah. So we're gonna look at this. I'm not gonna look at every one of these tables. We're only gonna look at a handful of them to start with. And in fact, to talk about the classic, you know, where's that picture? You know? That classic statements of, like, you know, data landed and let's just get it to this. We're gonna focus on this kind of transition then from raw land landed data to some kinda cleanse and rich, you know, structured dataset. I don't have a lot of time, so don't expect to be the most hard problem. But I'm gonna show you the concepts that you probably would do, tools, and approaches you might take, especially when you're doing your early exploration and building out, what you ultimately will do in some automated fashion. Alright. So what you probably would have done in practice is you might would just go over here and say, oh, I wanna see that dataset, Lester. Okay. So I'm pulling up this customer. I think I might have ran this in the beginning just to show it, but here we go. Jessica Wilkins got an address, you know, when they're you know, some details about who they are. Do they have direct deposit? What's their FICO score? Those are outside The US's FICO. It's really it's it's our credit, rating. This this 376 for this individual, Margaret Larson, she might as well just, never try to buy anything again on credit because that's a rally. I don't know if you can get that low. I would say probably the is it 700 or 800 is the kinda like gold standard, you know. But there's a lot of numbers. It's a it's a score of your credit worthiness. People banks look at that to decide if they wanna loan you some money. Hence, it's in a bay in a bay in a in a bank dataset about customers. Alright. So I looked at it. And then I might, as a human, you know, as I'm exploring this dataset for the first time, I might start going I'm and I'm just giving you some notes that show what I kinda said. I said, well, Look at this. There's a country code, and I see there's US and Canada. And I can do some queries and oh, in fact, I wanna do that. Let's just see what's going on here. So I said, what do I do here? I said, hey. Let's take that bank customer data frame here from here. Bank customer bank customer, and let's just group it by country state. And I wanna just see, you know, who we're talking about. What's our what's our customer database here, look like? And then I really wanna look, by country state, give me, you know, how many, customers in that and maybe just for giggles, what's the average credit score? I'm just curious. And then I went ahead and dumped it into a concept called Pandas wise so I could do something, like, a little bit more verbose and or or feature rich about how I look around here. Well, what I found out by going here and if I asked to do a 100 per page, I might have found out really quickly there's 66 entries. And I sorted it by, I think by country and then the state field. Well, there's my Canadas and there's my US. So I live in The US. There are 50 states in The US, and there are about, what, seven or so provinces, maybe eight or nine or less than 10 provinces up there in Canada. And I'm like, why the heck did I get, you know, 67? This is just the kind of problems we go through. I did a little more digging. I realized they have we have a lot of the USAE's and AN's instead of the army and the navy kind of, the post office, state states. Okay. I looked at the Canada stuff. I found out they include their territories, which are different than their provinces just like in The US. And in fact, that number 66 ended up 67, I think it is, ended up, being okay. So I'm down here, and I said, yeah. That sounds about right and everything. But I kinda you know, I also said to myself, well, probably, you know, if I'm trying to make the world better, maybe, oh, it is oh, maybe they they nail postcode pretty good. That's we call it ZIP code. They call it something else. I think postcode is kinda generic. State isn't very generic. Some countries have states, some have provinces, some have regions, and whatever. You know? I guess, ultimately, my my argument would have been to these folks is like, hey. You know what? Maybe maybe own, validating the addresses upstream and letting us trust you downstream because that's kind of where we ended up going here. So if nothing else, I built some some comfort level that, yeah, things are pretty good from addresses. And, absolutely, if I want to build some quality checks here on data and this is customer, so you don't, you know, you don't have to get changes all the time, but transactions or whatever come in. You know, there's nothing wrong with building some quality checks just to kinda do some reporting on new data and that kinda stuff. But, ultimately, we said, okay. That looks good. And, you know, I'm using a figurative example. I'm just trying to say what you might kinda do as a human. So you say, okay. Great. Addresses, for the most part, are in good shape. Let's not really play with it. Let's look at those phone numbers again. I'm gonna scroll back up and scroll over. These phone numbers are all over the place. That might be the first started, there we go. There's a question. I'll take a look in a second here. They might look at those phone numbers, or I looked at those phone numbers, and, man, they do not have a good data capture upfront because some of these are clearly a US number like this one. This one's clearly sorta u and they're all US numbers. I mean, I didn't look at even the Canada ones, you know, look like US because we can play that game with Canada. Sometimes they have extensions. It's all over the place. And with that and I'll look at the question next. I went ahead and said, well, I need a little help. You know? And I realized there's this great little library out there in Python called phone numbers. So I do a quick PIP install, make sure it's on this box. And then I just wanna make sure in this next little module, because I'm just in Python, I wanna make sure that library works. So I got a quick example with just some hard coded phone numbers, and it said, what can this thing do from a standardization? Validation and standardization trick here. And what I did notice, and I added what I did notice is for all these different formats, it does pretty good. One thing it does terrible at is, things like extensions. It just throws that stuff away. Extension one zero three, e x EXT, extension or box or whatever. You know, they just go away in this e I one sixty. I did a lot of digging on phone numbers. My team, somebody here at Starburst built that test dataset, and they purposely made it some nasty stuff. Really, what I ultimately said, look, let me let me look at the question. We have some weird data there, so that's good. Alright. So, if customer start to use Starburst and wants to write some Python code, and which data frames to use, in which cases we recommend them this either. So, yeah, that that question, Ivan, is what I put way back here. I kept it kinda high level and said right here. That that was my rough decision making characteristics. I said, absolutely. If you're a Starburst customer and you assume you'll be on Starburst for some time, I would recommend you focus on PI Starburst. I think it's just gonna it's a little tighter, closer to us. The the people that wrote that code here so if you ever see something that's a little quirky or want something worked on, it's gonna be easy for us as a support company to help, you know, make it better for you. But, absolutely, IBIS has been around for some time. So the source project, we do not have any committers on IBIS. I know a few friends that, were part of the regional build out and all that kind of stuff couple years several years ago now, I guess. But, you know, if you're if you're not on Starburst or on Trina, then, you know, I missus your answer. Or you just really, really, really see that super value of that, you know, optionality play that we talked about right here where, hey. Yeah. I'm on, Starburst and Tran today, but I don't know. I change my mind every five minutes. Good luck with good luck with your day if that's what you do. I hope you don't change every five minutes. You can you can have options, but don't do that. So that's my general gist. Yes. If you're dedicated and long term belief of being a Starburst customer, Lester would say, you know, just Lester and my company too, but this one where we line up. Yeah. Use use pie Starburst. Yeah. There we go. Thanks. Thank you. Alright. I appreciate a question. Alright. So what did I find out? What did I what did I do? I got myself in a situation where I could actually, take a good look at the phone numbers that we have and run them through that phone number thing and then see what's going on here. So a lot of code here that basically just I mean, it's a lot more wrapped up and stuff. But, ultimately, what I'm gonna do is I'm gonna say, look. So I built me a little quick function here, you know, standardized phone numbers. I hard coded it for US. I don't I don't have to use anything, but just send it the phone number. And then I'm using that parse function that it has available, and it has ways to say was it number was it valid or not, so on and so on and so forth. Ultimately, all I'm doing is gonna say, look, and I'm wondering what's stalling out here. I wonder if my cluster quieted and just fell asleep, which is okay if it did. No. It looks like it's running. Make sure that's gonna run. Okay. So I'm not so sure what happened on my oh, in my IBIS world. Okay. It finished, I think. Yeah. Okay. I'm not sure where I got lost at. I just built a Python function, like a user defined function for my, my data frames code here. And then what I really did is remember I had bank customers? That was just the raw table. I just said, from that, I only want the phone number. And to use you have some options in IBIS, to use, Panda, Xero, a few different kind of frameworks that have their own, and and IBIS has its own built in functions, and then you have hard coded user functions and engines. What I created was a model. I said, I wanna use a Python function, raw any any Python code that I wrote or I grabbed. The problem with that and Trino is Trino don't know what to do and when, Ibis sends him just some arbitrary Python. So he doesn't turn around. So by using any open ended Python, u d f, it depends on the back end. So that one won't work. Meaning, what what I mean by won't work, it won't work inside the Trino engine. So what I did is I said pull it local. Two pandas got the data frame and then loaded into a kind kind of a different kind of data frame API called pandas, kind of like a fully populated collection. And then from there, I said, hey. Look. Slap onto it another column. I'll call it phone standardized and, you know, use this function called standard phone against the phone. And so what did I get? I got a little quick table that says this. It says, hey. Here's the phone number you gave me. I'm just validating the phone numbers. The bad news is, as you can look around, there's very few entries that it was able to consume. So what is that telling you? It's telling you, it's not really saying that the the phone numbers are wrong. If I look at this number that converted well, the bad news is it it lost the extension right here on 981. If I look at this next phone number, format wise, it's okay. That phone number library is rigorous. It's actually making sure they're valid by area codes and prefixes. It's going pretty far to do some hardcore checking, and there's no 001 prefix inside of an area code, call 911. There's no 10. So a lot of them are actually failing because you know, as I scroll back up here, this was a test generated dataset, and it had some, you know, interesting numbers. So maybe we should have gone back and, populate that with some real valuable numbers, but there we go. So, again, what am I trying to show? I'm just trying to show what you might do as a data engineer. What I took from all that with this particular one, I said to myself, look, where's my phone number? Kind of assessment down here. I said, yeah. I said, hey. Look. Most of these values are not getting standardized. You know? It actually doesn't look like that these numbers are really all that valid or for one reason or another. I reminded you that those extensions get lost. And I ultimately said, look. I can't right now with this dataset because I'm not gonna stop production, from going forward. I'm gonna say, look. I can't do anything rigorous today. I can't really standardize these phone numbers, because I'm hitting 10% at best. If I was at 90%, I would say, sure. Let's standardize and then go back and, over time, reinforce it. But I'm really gonna do do the reinforcement. I'm gonna tell my upstream team, say, guys, you know, it's on you to do a quality job of catching this. I can do things, but but, you know, this one's tough. You need to kinda do a better job for me. And they would you can break out things like extensions and all that kinda good stuff. So, ultimately, I said, you know what? For us, that column, I'm gonna leave it like it is when we build our silver table, which really kinda says if I go back up here, I did the same kinda argument for, addresses. They were in better shape, I think, than the phone numbers, but but I kinda said, you know what? My due diligence said everything's okay. The reality is I might have done some of these. I might have cleaned them up and standardized the phone. I might have, you know, greater of exception important, that kinda good stuff. So let's show a few things that we could do some work with. So I looked at there was a couple fields with dates. So bank customer had one called registration date, one called date of birth. So I said, you know, give me all that. That's what you see here, and they were strings as you see here. And I just wanted to say, are they okay? Are they in good shape? So all I did is I added, you know, using the in in this in this data frame API, it uses the word mutate. The function mutate that says, take that data frame and mutate it. So add some more, add some more columns to it. That's the way I did it. And I added columns, DOB casted right here, and registration date casted. And there we go. I just took that object and I fired a cast method. I said, can it work into the the date data frame? And this one actually works, easy to validate because, I'm gonna go and make it a little bit funner. If anything failed again, I'm just exploring. I'm not productionalizing this. If anything failed at all in that dataset, they would have blown up like this. It'd be nasty. So I I I put the customer key, which is just like a number 12345628910 and said cast it to a date. It didn't do so well. I was actually worried it might treat it as milliseconds since the UNIX epoch or something, but, no, it said, no. I I can't do that. So that was good. Let me know. Yeah. Everything's good. Now what does that mean? It means for me, you might for you, you might say that's fine. I went around gold table. Sorry. My silver table strings are fine. I decided in my shop, you know, when when I when I productionalize this or when I build my silver table, I'm gonna just do a cast. So I'll make sure I take the raw and replace the values. I'll keep the I'll keep the same column names. I'm fine with them. I could change those if I want to, but I will change the data type. I'll cast it and, replay it in the silver table appropriately. I kinda did something else. There was this system way up here called or field way up here called paycheck underbar d d, paycheck direct deposit. No. Yes. Yes. No. I did a little digging and looking around, and I determined there we go. I just ran it. I basically said, you know, same thing. Take the bank customer, get those get get the pay get the paycheck, DD field, and then I just built a little case clause using that mutate. I built another field called called, now what I call it? Direct I call it direct deposit. There it is. Having trouble for a second. And I put made it a Boolean, and I if I found why, you know, I made it true and no false. And I went a little harder. I uppercase it. I took the first letter in case they put yes or no, and I went out of my way. Anything that didn't fit into that rigorous, I decided to make it no. And I did account, and guess what? There's mostly, people do have, mostly the people don't have direct deposit at this place, which is interesting because that's not the reality that I know. But that was all good. It was in really good shape. So when I decided, you know what? This is a good indicative thing. I'm gonna go and use that logic, same thing, when I build my transformation from, silver I'm sorry, from bronze to silver. Alright. So I'm gonna look back here and say, we were doing this. We researched the heck out of the bronze. Again, simple example, only focus on a few of the columns. But this kinda work that often people do when they get their first time at a dataset. It's a whole different story when you inherit something someone else built. But what do you go through? You do these kinds of things, and then you say, okay. Cool. What is the set of actions, transformations, whatever, that will take this, that customer landed dataset, that bronze dataset, build us over. And in my simplistic world, it really came down to this. I said I just put all that code together. I said, look. Go get the customer, mutate it, which means, you know, swap out the date of birth, go ahead and cast it to a date, switch the registration date, opt it to a date. If you wanted to change the names, there's there's a rename what you saw earlier, how to do that kind of stuff, kinda build a little bit dictionary and stuff. And then I also said, let's swap out the let's keep it paycheck d d, but I wanna build it, Boolean value and use that logic. So this is the world's easiest, maybe second easiest, transformation job. It was expressed really, really quickly on, multi chained a chain set of data frames. It was a, select you know, insert into select from kind of model. You know? You could so they easily could have converted that, into SQL. In fact, let's go look and make sure that's true. So I'm gonna go back to query insights. Should should have been reloaded, but we'll just make sure. And I'll scroll to the right. 104104. These last two statements. Use it the first it does a quick kind of a eager exploratory check to get column names and stuff, but ultimately, it did this. It said, cool. What I'm gonna do is so this SQL statement is the conversion. Read from my raw table, my customer table, the one I ingest it already there, and then what am I gonna do with it? It's what you saw earlier. For the most part, everything's in great shape, a little bit of casting here, and a little bit of casting down here to those dates. And the biggest chunk is this right here. Hey. Build a build a case statement so and look really hard at that, field and replace it with a a true or a true or false or null if it couldn't make a check. So there's a nullable field. And, again, that'd be an easy thing to check afterwards. Soon as you're done, he could say, were there any were there any nulls? What's the count of nulls? None. Okay. Good. It was the pass all the QA on that kinda good stuff going forward. But I did feel like I had good confidence in the system. Now that was just a a query. Right? I didn't do anything with it. I said, that's my logic that summarizes one of my tables to do its transformation. I didn't automate. I didn't do anything, But let's go ahead and play with a little more. I'm gonna run this and say, well, what would I do normally? Well, after that code, instead of dumping it to screen, what I really probably would have said is this. I probably would have said, hey. You know what? I wanna just create a table. Why don't I call it customer silver? What am I using? The silver t. That's this guy right here, silver t. And then I'm I'm doing it like a CTAS, create table as with this statement here. It's I'll show you in a second. Probably, you would do more of an insert, and I went ahead and built a override through. I'm just testing this out, trying to get it going. Ultimately, it said, yeah. The table created. And for giggles, I just said, go back from the start and go read that customer silver table and show me some records. There we go. Comes in. There's our date date fields. There's our boolean fields. Here's our other date fields. So, again, not the world's most complicated data, transformation, but that's kinda what you would likely do. And then to be fair, what you'd probably do is not do a a create table. Right? You would once you decide what that table looks like, you'll probably define that table and be happy with it. And then what you would likely do if it's a typical kind of batch world, you just kinda do this. You'd build that silver t, whatever it is, the thing that says ready to load to silver, and then you would just use your connection object or your connect objects insert statement into that table you cared about, you know, with, silver you know, with the data frame that's gonna fit just nicely to it. And what I did do is I put a before count and after count. There was only a thousand rows that we're playing with. I just added those same thousand rows in there again, so I got 2,000. If I giggled for fun and ran again, it'd probably say two and then, 2,000 and then 3,000 at the bottom. Yeah. That's that's how I'd kinda see, likely from some kind of batch or micro batching cycles where we're just grabbing stuff, clean it, enrich it, load load it, move on. Clean it, enrich it, load it, move on. And this does also just like, just like the just like the create has a concept like override. They have a concept called flush it out. You know? By default, it's not flushing out. It's just saying if the table exists, add to it. But you could say, yes. If the table exists like it should, empty it out first and then load it. That's not what I'd wanna do. I wanna keep adding it. Unless it was customers, then maybe I might do a flush if I'm just maybe that's what I do. Maybe customers are such a nonvolatile thing. I every six hours, every 20, whatever, you know, it's only a million people. Maybe I just do this. I'd say, you know what? Make sure the customer looks really, really, really, really good and just overlay it. You throw throw out, add it in, call it a day, and move on. Alright. If all that was good, then I said, you know what? That got us a gold we got we got us a we didn't focus on how we ingest it. We started with an adjusted dataset. We built us a quick silver version of that. And just for fun, since we're here, we're looking at bronze silver, gold, or my preferred terms, a land structure consume. What what what is that? Let's build a gold dataset. So what are gold datasets? If you remember, they are those ones that, we want quick performance or and or ease of use, ease of understanding, that kind of good stuff. So maybe maybe, just maybe, I wanna run this just so it's running. We could say, hey. Basically, we wanna we wanna group customers by their medallion status. They have that gold, silver, titanium, all that kinda good stuff, by their medallion status and the state they live in. So Texas gold, Texas silver, Texas bronze, Texas titanium, whatever it is. I lump them together, and what I really want to do is I wanna build a a dataset that really focuses on auto loans. So only get customers that have auto loans, group them together by the medallion and their location, and then give me two important things, the the number of loans in that group and the average auto payment that has been made on, again, active loans. So I'm not gonna run you through all this code in great detail. I'll just simply say, I got a customer, did a little, just a couple fields from customer. I got a couple fields from profile. That's where my segment, my gold, silver, bronze stuff comes in. I went ahead and joined those two real quick just so I've got a a customer state they live in and segment. So just and and you could and if you wanna see these, you'd normally be building these one at a time and looking at it and inspecting it. I did all that. And then I've been you know, got took took all my debugs out there. I didn't wanna walk you through all those unless anyone has a question. And then I said, you know what? Why don't I go look in my account table? And I'm looking for I need a couple values. I need to on that, I really wanted to only get values that, you know, show that there's not a loan that's open. And I really wanna make sure if it's open, maybe it's open and their balance is zero and they haven't closed it yet. I don't I just want I wanna decide if my rules said, yeah, More than zero balance, open account is what I cared about. Great. So I've got loans. And then guess what? I got a separate table for payments. So I said, go get me some payments. And then I put it all together. Right? Took the took the active autos. I joined it with the the customer kinda roll up, and then from there, I joined it with the payments. And then from there, I grouped it by that the state in the in the segment. So why? So I can figure out those two answers I had asked for. How many loans? What's the average payments? And then I just did maybe one more pass. I I did a little bit of cleanup, and I rounded up, the payments to the nearest dollar. And there we go. For Connecticut, diamonds, you know, there were fifth only have 51 auto loans there. You know, I got a thousand customers in this database. Not very big. This will table. With a $556 auto payment. Okay. Great. What was nice about that is this was just the logic to figure out what that would look like in data frames. The gold tier, you need a dataset. Right? I can't give this to a SQL writer. I can't give this to a data analyst. Can't give this to a Tableau guy and say, have fun. I could, and they could unwind it. They might even go say, oh, let me go look in this query inside and see what the heck Lester did. And they might go, oh my gosh. Well, that wasn't it. Don't think it is. Nope. That's let me reset refresh. Thought it was auto refreshing. See if this will do it. I could, you know, use a tool like this and come in here and say if it paints any second now. I always love when something takes more than it should. K. I'm not sure what happened. I'll just reset it. Complete queries. Look at it one more time. Here we go. You love here we go. This is not that bad, 70 lines. Yeah. This is the logic. Could I give someone the sequel? Sure. I don't wanna do that by any stretch of imagination. What I really wanna do is, you know what? Why don't I? No. I guess the quick statement is, if you if you thought that was ugly to get there, it's not that ugly in SQL, and you can always write SQL and data frame APIs. Usually, two or three ways to get there. I used the kind of a keep it simple, couple CT common table expressions, and ultimately got it the same answers here. Connecticut diamond has 51, you know, counts with $556, yada yada yada. But this dataset that I'm building, this data frame, the view, just like earlier, I created or inserted into table, I could do that. I could build a new table and load it and that kind of stuff. But right now, I've decided, you know what? This thing runs pretty fast. I will just use, the connection the connect to object's ability to create a view. So now I don't really have to worry about staleness. It's always gonna be accurate. So I'm gonna run that and say, oops. What was never closed? Okay. Right here. I guess I committed it with a bug. So, hopefully, if you run it, you can see that in these parenthesis right here. So I'm gonna say, you look, go out there and build a view. And we're gonna look in the database. We're gonna look in Trina and make sure we see that in Starburst. Build one called auto loan aggregates view. Right? It goes VW. And then to make sure it's really good, I said, hey. Build the, connection table auto loan. So let's just go look at all that and see what really happened. What really happened is a couple things. First and foremost, to the average person that's using whatever tool they have over here, I think it's already there, under temp luster one two three four five where I've been writing, guess what? There's my auto loan aggregates view that got created. In fact, I could just run a good old select from it. From the bottom of the screen down there, hit run, scroll up. There we go. There it is. You know, Connecticut diamond 51, that kind of stuff. And in fact, if I wanna know what that thing is, I could just ask it. Like I say, show show, create, view, temp. I'm gonna actually actually, I'm gonna say auto load aggregates view. And so what really got created? Yeah. It's creative view with this nasty looking SQL that he created there. It's not a materialized view, just a view. And, again, maybe it looks overkill, but that's okay. It's gonna deal just fine. It's gonna take that in stream, build a query, run it. Same thing if I looked in my, kind of results here. I can actually go backtrack. If I look far enough, I could probably see this. A few minutes ago, this is what I just sent over. Create a view. Guess what? This is what we just saw. Right? Create the view. Here's the view. So on, so on, so forth. That's great. And then, really, what we just saw when we just ran a question, what is the refresh show create oh, that's what I did. This one. This is what ran. I I ran a select all from this, you know, blah blah blah. And if you took the VAC query and looked at it a little harder, looked at the advanced tab, you'll see it actually you know, it did reference all those various tables that we just talked about to ultimately get, you know, to that result. So it really ran, you know, if you know a lot about query plans and all that good stuff, pretty multi step job here to do different source queries against things like account, against things like scan filter. Right? Here, the customer, you know, so on and so on, so the customer profile. And in fact, if I'm those that know how to read this well, I would go I would say, cool. It's doing a broadcast join here. It's doing another broadcast out of here. There's another broadcast. So it's joining, and each time it's joining, it's doing a broadcast. If you know what a broadcast join is, that's a good thing. And, yeah, that's pretty cool to me at least. Alright. So I'm back over here, and what you might have noticed is I hit the end of one. I'm at the end of my, kind of, notebook, here. So to refresh and then I'm, you know, if I got some questions, I'm glad to take them. And it can be other than just IBIS and Python and, you know, anything in the data realm preferably with something to do with TreeNow or I could argue something something something to do with TreeNow or Python or something. But, ultimately, you know, we just talked about IBIS as an option for Python programmers that want to do data pipelines, transformation pipelines, data aggregation build up pipelines, that kind of stuff, how you could use something like a notebook to try this out or directly with, Trino or, in this case, Starburst Galaxy, Formulate what you have, what you're doing. I didn't show what you would normally do to operationalize, but, hopefully, you can argue, oh, those snippets were the bits I need to productionize depending on how complicated your world is. You know, could your whole, orchestration and scheduling system might be a big old Python file, and a cron tab to run it. You know? Or it might be something much more elaborate, or more mainstream, maybe like a airflow system, you know, where you build a workflow with or maybe a a workflow of a bunch of independent actions, and you make it all very modularized and all that kind of good stuff. Those are one those are elements we let you decide, as a Python programmer what you want to do with that stuff, how you wanna incorporate it. We help you figure out, because those are all gonna work just fine with, Trino and Starburst all day Sunday, and we just don't sell tools in that space. So if you want opinions, you ask for them, we'll give them, or you hire consultants and we'll, you know, sit with you and talk to you first, share opinion, share thoughts, ask questions, and then we'll pick something to do together, and implement it. But we only have, like, a workflow system and tooling, for SQL. And I would argue it's, still nothing to the level of combination of, like, a DBT and an airflow together or something like that. It's much more of a scheduled set of sequels that we could put together and build a, you know, periodic schedule against that and let it all surface and error messaging, all that stuff come out of our, engine, into your current login stuff. Hopefully, that's what you were hoping for today. That was what I was thinking about when, Quincy and I were talking about this some time ago. And in the last couple days, while I was getting my thoughts together and making sure I had something for you, that's, what I was hoping to do. And then, actually, I was glad it all worked like it should've. And, I think at that point, I'm gonna take you to the last slide I got for the day, which is, what are some good next steps for you? And, you know, of course, go to IBIS, you know, type it's in it's in the deck in there, but IBIS if you just type IBIS space data frame, you'll find it, the IBIS data the the IBIS site. See all the back end, see all kinds of good like, all sites have lots of little quick starts and that kinda good stuff. But I plugged the heck out of some starburst elements here, our dev center. That's a kinda culmination of a bunch of things. Our forum site there, which is a q and a system, kinda kinda like an old school Stack Overflow, I guess you could say. If anyone uses that kind of fun stuff anymore, if you want a human to answer it for you, not an AI, I'd ask her, come put it in the forum. And, and then let someone else give you the answer, and then let the AI agents in the future find it there. Other webinars are upcoming, so on and so forth. Definitely, I gave you a link to a blog about PI Starburst, that alternative one. I have personally on my own blog and and the corporate blogs and, probably more PI Starburst stuff than I have IBIS. Gave you, again, the link back to all the stuff we're doing here over there on GitHub and, plugged my book that I'm working on on Apache Iceberg, with O'Reilly that will be coming out later, this year as well. Beyond that, I wanna remind you that I appreciate you being here. I keep staring at the questions to make sure I didn't miss anything. I'll remind you of that devrel@starburst.io. And I will say for those that are listening to this on demand, if you got here and you're like, I saw it all, I wanna do it all, and you're missing something, again, devrel@starburst.io. If you don't feel like you have everything you need, ping me. I'll probably be the primary one that reach out to you, and answer from that. But it does go to a small group of us here in case I'm out and about. Someone else can help you out, etcetera. I thank you so much for your time. I I meandered and rambled as long as I could looking for questions. Don't see them. So I'm gonna bid everyone a farewell, a great day, and, come back to our events page again to Starburst. The other Starburst website, resources, events, so it'll show our event calendar. We do we do these workshops about once a month, give you all the instructions. We often do them now or later. And we also do one called office hours. That's usually about thirty minutes, and we do a ten to twenty minute demo on some topic, whatever the topic of that time is. And then we really, really just turn around and make it and ask me anything. And, we usually do get folks showing up with some interesting questions. So if you just wanna come listen, we'd love to see you there too. Thank you again for your time. Thank you, Quincy, for all the help, and I will speak off stage. Thanks again. Bye. Bye.