Video: Python for the Modern Data Lakehouse: PyStarburst, Ibis, and Beyond | Duration: 3532s | Summary: Python for the Modern Data Lakehouse: PyStarburst, Ibis, and Beyond | Chapters: Introducing Python Options (7.92s), Session Overview (90.54s), Introducing the Speakers (182.095s), Starburst Platform Overview (256.85s), User-Defined Functions Explained (504.905s), Python UDF Implementation (751.395s), UDF Implementation Example (1116.62s), Distributed Data Frames (1386.005s), PyStarburst vs IBIS (2243.83s), Demonstrating PyStarburst Capabilities (2477.835s), Conclusion and Q&A (3200.305s)
Transcript for "Python for the Modern Data Lakehouse: PyStarburst, Ibis, and Beyond": Hey, everyone. How y'all doing? My name is Lester Martin, and joining me is Angel. I'll let him introduce himself in just a moment. Let me go ahead and get my screen share going for everyone. Here it comes in three, two, one.

We're here for a webinar today. I shortened the title for the slides so it would fit, but what we want to talk about today is some options you have with Python, using Python as a programming language in the Starburst ecosystem or in the Trino ecosystem. So I put a couple of different icons over there: the round Starburst circle, Python, Trino, and Ibis, which we'll talk about as well. I'm checking the channel under the docs to make sure that's working just fine. Yep. We should have something called hands-on notebook. I'll talk about that briefly. And I would also invite anyone to say hi, "I'm Lester from ...", and maybe let us know where you're connecting in from today. I'm here in Atlanta, Georgia, on the East Coast of the United States, and it's about noon here.

So, again, Angel and I will give a little more introduction about ourselves in just a moment. But first I'm going to show you that hands-on notebook. There's a link to it, so you can go ahead and get there. All of our examples and demos for the other bullets are in it, and I want to make sure you have the opportunity to exercise some of those yourself, should you choose to. So we're trying to make that pretty easy. 
So we'll first run through using Python for user-defined functions as an option, and then we'll transition to maybe more real Python programming in your own IDE. We'll use a notebook today, and we'll try to make that notebook setup easy breezy for you, but you can lift the code and run it wherever you want to. And then we'll transition to probably the bigger part of today's session, which is around the data frame API. What can we do? What options do we have in there? Because those who know about data frames know that there are multiple APIs out there. And if you're new to data frames, that's fine. We'll give a brief introduction as we get there. We have about an hour slotted. Angel and I are going to try to do this all in somewhere around forty to forty-five minutes. We'll see how it goes exactly, but that will give us time at the end for some Q&A. As I mentioned, do not hesitate to use the chat or even the Q&A tab, though the chat's probably the easiest. Drop us a message, drop us a question as we're going, and we'll see if we can resolve it and get you figured out. Alright. So I'll let Angel introduce himself, then I'll tell you a little more about myself, and then we'll jump into the session. So welcome, Angel.

Yes, of course. My name is Angel, and I'm from Spain. I have been working for Starburst for almost four years now as a solution architect, focusing on performance, configuration, and some very interesting DevOps practices. We will probably be talking about this a bit today if we have enough time, and I hope it will be interesting for all of you. Thank you.

Perfect. Thank you so much, sir. And my name is Lester Martin. I'm a dev advocate here. What does it really mean? 
It means I help do events like this, webinars, user groups, and I post to our blog, write tutorials, answer questions on social media, a little bit of everything. But, ultimately, I'm here to help make folks comfortable and productive, resolve issues, and that kind of stuff. So don't ever hesitate to connect with me or reach out to me. I'll put it in the chat when I have a non-talking moment here, but devorel@starburst.io will find me and a few other folks.

Alright. So let me go ahead and jump in. We want to focus on the Python bits, but I think I put three slides in here first. They're not even very technical, which is okay. It's just to make sure we all know a little bit about Starburst. I would say that Starburst, the logo in the middle here, is a data platform built on top of the open source project called Trino, trino.io if you want to point your web browser there. And what's pretty darn unique about Trino itself is that it's a query engine. It's not a database. It has database concepts and a kernel and a cost-based optimizer and all that good stuff, but it's separate from storage. So we like to work with, and we work really well with, data lakes, HDFS, object stores, that kind of stuff. But we can touch all other kinds of data sources and data systems. You might have heard the older term data virtualization, and that's an appropriate description. A single point of access to all those tools allows us to have federation across all of them. I can join between AWS and Mongo and Elasticsearch. I can make a query run across all three of those technologies and more. That single point of access gives us a single point of collaboration and things like federation, and other things like data applications and data products. We have our own AI agent in our software. We'll probably have more. We have an MCP server. 
We have our own functions and that kind of stuff too, of course. So we're not an AI model builder, but we integrate with that world as well, since we firmly believe that the data you have under structure, under guidance, under governance is probably some of the best that you could use for your AI solutions. And as I just mentioned, governance: for a lot of people, and for us as well, that means security and maybe lineage too. So we want to bring a lot of technologies together. And if we leave the conversation about the AI piece that I rambled on about a minute ago, we have plenty of other artifacts and webinars we can point you to. It's fair to say that we see ourselves as open, meaning we can talk to a lot of different technologies and sources, and hybrid, meaning we can run anywhere: in the cloud, on prem, in multiple clouds, somewhere in the middle. All of that is appropriate to us.

And while I won't walk that stack from bottom to top, it just says what I said a minute ago. We believe in a separation of storage and compute. We're a compute engine. We need storage. We can connect to lots of things, but where we have the opportunity to shine is when we connect to a data lake. And the things that we advocate the most there, not the only things we can do, are open file formats such as Apache Parquet and open table formats such as Apache Iceberg. And if all those things are new, again, that's why I'll put the address in the chat so you can reach out to us. We have a plethora of information in a million different formats. We want to help educate and help you learn. So there we go. Starburst, powered by Trino, supports that whole data lake world, but you also see that little waterline above the top that says there's other stuff that isn't in the data lake. 
There are lots of other data sources out there, and that's what I was suggesting a moment ago. We want to have a single point of access to all that because likely, no, surely, you have at least one or more of these things with data sitting out there, and we want to include that in your environment. We can run this thing anywhere, like I said. We primarily have two product names, I guess you could say. Starburst Enterprise is our version of enterprise software. You install it where you want to, where you need to: on prem, in the cloud, Kubernetes, bare metal, whatever is the right answer for you. And then we have Starburst Galaxy. That's a hosted, as-a-service kind of model, the quickest way to get up and running with Trino. Other than one demo today, we'll do everything in Starburst Galaxy, and I'll explain why when we get there. I think that was those pieces.

Alright. So we want a hands-on notebook. The bad news is I did not pull that up. So I'm going to go over to the docs tab and click where it says hands-on notebook, and that launches this right here. Let me put it in my other browser so I don't convolute everything. Apologies for being a few seconds behind here. I thought I had all my tabs up and running. Okay. So this is what I want to talk about briefly. In my slides, we see a hands-on notebook, and there's a QR code right here that you can see on screen. The link is over there in the docs tab, and what you should see is this: a file in a GitHub project called Starburst DataFrames exploration. All the code we're going to run is in there. What I would encourage you to do, if you want to do this along with me, or later if you're watching the video, is look right there at the top, where it should say Open in Colab. 
And if you do that and you're already logged in to Google, it's likely, if you've never done this, that Google asks whether you're okay with this. So be okay with it. It's a little Google-hosted environment with a notebook and a machine and all that kind of good stuff. I encourage you to do that, and I'll come back, of course, and walk you through it. All the instructions for everything we do around those technologies are in there. So that was my show-me. At this point, I'll cut back over, go full screen again, and let my counterpart walk us through this section about user-defined functions. There we go.

Yes. So, as you can see, if you ask the Google AI, in this case Gemini, what a UDF is, the definition it gives you looks like a complete hallucination. It says something about DVDs or Blu-ray, which is not our case. The reality is that a UDF, in the context of many databases, Oracle, BigQuery, Snowflake, almost every database, allows you to add some custom logic. The SQL language can be very powerful, of course, but not all possible transformations or validations can be done with the functions that are already implemented for you. So the idea is that you can extend this set of functions and do your own stuff. As you can see in this definition, there are many possibilities. And we are lucky in this case, because those are precisely the languages that we have: SQL, Python, and Java. For Java, Lester created a very good tutorial, and a link will be provided at the end of this material. If you want, you can also test it, but it's a bit more complicated. 
So the idea of UDFs, as you can see on this fragment, is that a function can be a scalar, aggregate, or table function. The difference is that a scalar function is one input, one output, usually a column value from a table. An aggregate takes many values and outputs only one summary. And with a table function, you have an input and you get a whole table back. So that's what a UDF is in the end. Next one, please.

Absolutely. I was having fun flipping it for you 10 times. There we go.

Yeah, no problem. So in our case, you can also have UDFs in SQL, but this is not the main use case that we want to present here. We want to focus on Python UDFs. It's very easy to start. First, you need to declare that it is a function. You need to give it a name, and then you can start plugging in the parameters that you want to use, like a standard function in almost all well-known languages. The second step is to add a return declaration. Then you need to declare, in this case, that the language is Python. And there is a reference to a handler, which is the name of the real function that we are going to implement. Next one, please. And in the end, there is a dollar sign delimiter to start and finish the real Python declaration, and this is where you implement the real logic. So nothing really complicated.

As a summary, because we haven't introduced the SQL UDFs, I just want to give some guidance on when to use each of those. Usually, when you have logical transformations that are not very complicated, it is better to use SQL UDFs. But as you know, SQL is a very powerful language, and in some cases it still cannot do all the transformations that you would like. In such cases. 
So it's a good idea to use Python, and that's precisely the main idea that we have here. For example, imagine that you have a JSON file that you want to validate. That cannot be done easily with SQL, but there are very good libraries in Python that make your life very, very easy. That would be the perfect use case.

Then execution. As you probably know, when you execute the code near your data, it will be better. A SQL UDF runs natively in the engine and is compiled directly. Python runs on so-called WebAssembly, which is bundled with the Starburst cluster, so there is some translation between Python and the engine. And so performance: as I said, when you run the logic near your data, you will usually get better performance. Python will also work very well, but if you are going to do a very complicated transformation on, say, a billion-row table, then it is probably a bad idea.

For complexity, I wouldn't add a lot of complexity in SQL because it is also more complicated to debug. With Python, on the other hand, you can even create your own unit tests, for example, and use the full power of the language. You can do integration tests and everything else in your environment before going to the real data. So debugging, as I said, is a bit more complicated with SQL.

And security. The good thing about the SQL UDFs is that they execute the same way from the web UI, for example, or from your own Python client code. This is what we will be showing later. And with Python UDFs, this environment is running as an extra sandbox. 
So there are some limitations. You cannot use all the libraries from the language because, of course, you wouldn't want to start messing with threading or very complicated networking calls, for example. So you can use the full power of the Python language itself, but not every library.

Well, why don't we show you some of that real quick, unless you've got something that I missed. I'm sorry, I jumped to the show-me slide. Alright. So the show-me would be this. As I said earlier, there's this notebook that you have a link to in the docs tab. The first big section is these UDFs, and there are a couple of links here, which were just referenced, about other kinds of UDFs. Can you build them in Java and SQL? There are lots of good examples, docs, that kind of stuff here. But what I want to run is a very, very simple function, and then maybe a slightly more complicated one. And I need to run them somewhere. Usually, you'd run a notebook here, but I'm going to run it in a Starburst Enterprise environment instead of Galaxy. Why? Because we haven't supported it in Galaxy yet. We released this to our enterprise customers, and we're trying to get a lot of good, solid feedback. At some point in the near future, we'll probably make it a public preview and then a generally available offering as well.

But here's a great example. Let me zoom in a little bit. I thought I zoomed in. Nope, at 125 already. There's an example of creating a function, a super-duper simple function. You know what? I need to change my role. That one doesn't have enough rights. Let's see here. That sounds like a good one. Why not? We'll go all the way. Let's try that again. Create a function. I'll call it answer. And if you look really closely, it has everything we just said there. Right? 
It actually takes no arguments, because I wanted to make a simple one. It returns an integer. It's Python, my implementation code sits within these dollar signs, and my WITH clause is basically telling it which handler to use when it's called. Again, no arguments. Those who know the lore of The Hitchhiker's Guide to the Galaxy probably get the joke about the answer being 42. If you don't, that's for another day. And then there we go, exercising that. Super simple. Run that thing. I didn't even run a query against a table because I didn't have anything to send it. The answer that I got back was 47. You know, it's 42 plus 5.

But what's probably more interesting is a slightly more realistic example. In Trino land, there's a connector called TPC-DS, from the Transaction Processing Performance Council; DS is decision support, I believe. It's a star schema. There's a table out there called customer demographics, and I just grabbed a few of the fields: gender, what kind of education this person may have had, and whether they currently have any dependents who are college students. What I wanted to do, without going into great detail, is build a UDF called university sports fan chance. Can I build a UDF that, when sent the gender, the education status, and the number of dependents in college, makes a guess at how likely, low, medium, or high, they are to be good sports fans? I'm just trying to determine whether I should be marketing to them. Yes, that's a classic machine learning problem and lots of other stuff, but I wanted to pick something that sounds somewhat interesting. If I go back to the top, the truth is it's all right there. My handler is called chance level. I have several other functions, because I wanted to make that point, and there's my chance leveler. 
It takes those three arguments, returns a string, and then I just have a little bit of function-calling-function kind of stuff. I say, hey, take the education status and generate me a score based on that. There's a function right up here that does that. I basically took the model that the higher you went in college, the more likely you are to be a fan, unless you got an advanced degree. Then you're too smart to care about sports, so the numbers actually start falling off. So that gave me a score, and then I took that plus the gender and the number of students that are in school, and I fired a function called overall score. I was leaning a little toward men possibly being more enthusiastic than women. Okay, you can call me on that one. I'm just taking a stab here. And then I definitely put a lot of value on this person having dependents who are in college. You really are fanatical when your child or your dependents go. So I build the overall score, and then lastly, this determine-ranking function just levels it: less than 10 is a low, less than 20 a medium, or a high if you're 20 or beyond, something like that.

To run it, we just put it in a SQL statement like any other function. There's my function right there, the university sports fan chance, on line 73. And I actually like the fact that some of the highs came up. Here are some great examples. Regardless of whether you're male or female, and regardless of how much university you did, if you've got five of your children currently in college, one, you're broke, and two, you're probably a high sports fanatic. Alright. That's the quick example. Again, those things are available to you in that notebook if you want to play around with them, along with, as I said, links to other kinds of UDF activities. 
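The scoring logic walked through above can be sketched in plain Python. This is only an illustration of the described structure (a handler that calls helper functions and then levels the score at 10 and 20); the specific weights, the helper names `education_score`, `overall_score`, and `determine_ranking`, and the column types in the `CREATE FUNCTION` wrapper are assumptions, not the notebook's actual code. The wrapper string follows the declaration steps described earlier in the session (name, parameters, return type, language, handler, dollar-sign body).

```python
# Hypothetical weights: the session only says scores rise with education
# and fall off again for an advanced degree.
EDUCATION_POINTS = {
    "Primary": 2,
    "Secondary": 4,
    "2 yr Degree": 6,
    "4 yr Degree": 8,
    "College": 10,
    "Advanced Degree": 5,
}


def education_score(education_status):
    # Unknown or unexpected values contribute nothing.
    return EDUCATION_POINTS.get(education_status, 0)


def overall_score(edu_points, gender, dep_college_count):
    # Assumed weighting: a small bump for men, a large one per dependent in college.
    gender_points = 2 if gender == "M" else 0
    return edu_points + gender_points + 4 * dep_college_count


def determine_ranking(score):
    # Thresholds as described: under 10 is low, under 20 is medium, otherwise high.
    if score < 10:
        return "LOW"
    if score < 20:
        return "MEDIUM"
    return "HIGH"


def chance_level(gender, education_status, dep_college_count):
    # The handler: the one function the SQL engine would call per row.
    score = overall_score(education_score(education_status), gender, dep_college_count)
    return determine_ranking(score)


# Sketch of the SQL wrapper; the Python source above would go between the $$ markers.
UDF_DECLARATION = """
CREATE FUNCTION university_sports_fan_chance(
    gender varchar, education_status varchar, dep_college_count integer)
RETURNS varchar
LANGUAGE PYTHON
WITH (handler = 'chance_level')
AS $$
... Python source ...
$$
"""

print(chance_level("F", "Unknown", 5))
```

Because the helpers are ordinary functions, they can be unit tested locally before the source is pasted into the declaration, which is the debugging advantage mentioned for Python UDFs.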
Now what we really want to talk about, like I said earlier, when I find my slides, is this: what if you wanted to get out of a SQL editor and go into your code editor, your IDE, or just a .py file with vi, or maybe a notebook like I'm using? What options do you have? Well, for a very, very long time you have had the Trino Python client, the classic Trino client. I have one slide here before I go into my demos because it's pretty straightforward. You do a pip install, you set it up, you add some properties, and then you fire off some queries. It's very useful, but there's a thing to consider, and that's in the show-me. Let's see if I can get there. This time, I'm going to show you in Starburst Galaxy, I believe. I thought I had it all up. Python functions, that's the on-prem one. Okay, Galaxy demo. Why am I not seeing my stuff? Let me think for two seconds about what I'm doing wrong here. Oh, because I'm going to run it in the notebook now. Thank you, Lester. Why not?

Okay. So the rest of the stuff I'll run out of the notebook that you have access to. One of the first things you need to do when you run this notebook is, oops, not mail, not send mail. You've got to love it when you're on top of your game here. Let's get rid of these and all that good stuff. Yes, yes, yes. Don't save it. Don't worry. I created a quick little thing that says, hey, give me some credentials. The lab tells you how to get set up with Starburst Galaxy. I want you to do that, but I'm already set up. Meaning, over here, I've got a cluster. Let me make sure it's running. Yeah, it's sort of running. Let me just run something so it'll stay alive here, create or replace function. Yeah. Ultimately, what I need to do is get the user, the host name, and that kind of stuff. 
This is how I would do it if I was in Starburst Galaxy. I'd go up here under admin and find my clusters. All this is documented in the notebook. I'd say, tell me about Partner Connect, and then ultimately, show me this Trino Python client. I need these values right here. I need the cluster, so I'm going to grab that real quick, go over here, and paste it in. Boom, boom, boom. I'm also going to grab my fully qualified username. It has my username and my role in there. Paste that. And then, hopefully, I know my own password.

And then what did I want to show you? Well, you have to do a pip install. I already did the pip install a minute ago, so it should fly through pretty fast and say everything's set up. Let it spin. Yep. And then there's that boilerplate I talked about. I need to do some imports. I need to make a connection object that basically uses those properties and a few other hard-coded ones, and I'm doing it with the simplest authentication. There are a lot of other authentication options, OAuth and so on. I just have the basic username and password model that Starburst Galaxy uses right out of the box. And if everything went well, and I didn't want to do any real work yet, I just run a simple SELECT statement to have it give me the phrase "connection is good". So it's working. Great. So that's easy: set up, import it, make a connection.

And then, in fairness, there's nothing sophisticated in what we do with the Python client. It has one big function: execute a query. So here's an example. I said, go out there, select all from this nation table, and you see the contents down below. In reality, that rows object is all you need. If I printed that rows object by itself, it might look kind of ugly. I might have to build a loop to make it look pretty. 
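A minimal sketch of that connect-then-execute boilerplate follows. It assumes the `trino` package (`pip install trino`) and basic username and password authentication as described; the host, user, and password values are placeholders you would copy from your own cluster page. So the sketch runs without a cluster, the final check uses the standard library's `sqlite3` as a stand-in, since both expose the same DB-API cursor interface.

```python
import sqlite3  # stand-in only, so the sketch runs without a cluster


def connect_to_cluster(host, user, password):
    """Open a DB-API connection with basic authentication (needs `pip install trino`)."""
    # Imported lazily so the rest of the sketch works without the package installed.
    from trino.auth import BasicAuthentication
    from trino.dbapi import connect

    return connect(
        host=host,            # placeholder: cluster host name from the admin page
        port=443,
        user=user,            # placeholder: fully qualified username, including role
        http_scheme="https",
        auth=BasicAuthentication(user, password),
    )


def run_query(conn, sql):
    """Execute one statement and return the rows as a list of dicts."""
    cur = conn.cursor()
    cur.execute(sql)
    columns = [col[0] for col in cur.description]
    return [dict(zip(columns, row)) for row in cur.fetchall()]


# Stand-in smoke test, mirroring the "connection is good" check from the demo.
demo_conn = sqlite3.connect(":memory:")
status = run_query(demo_conn, "SELECT 'connection is good' AS status")
print(status)
```

Against a real cluster, the same `run_query` helper would take the object returned by `connect_to_cluster(...)` instead of `demo_conn`.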
So what I did instead is use a popular data frames package called pandas. Again, if you don't know what data frames are, hang on a little bit more. We'll talk briefly about them. Ultimately, I wrote a few extra lines of code to make it look pretty here in my notebook. That's pretty normal. It just means I can run any kind of query, and it should make the point that that's all this API does. It runs a query, it runs on the cluster, the power of the cluster itself does all the work, and then the results have to come back to me.

So what would not be a great example is something like this. I run a query that says, just give me a subset of the customer columns, and I hold on to that. So that ran in the cluster and came back local. Then I run another query that says, go back and get something from nation. I'm making both of those local pandas data frame objects, and then I do a join on them. That works just fine for this tiny little dataset because it's not very big. But if you think about what's happening, I'm bringing all the customers and then all of the nations local to this machine, within its bounded memory and all that good stuff. To be honest, a better answer when you're using the Python client API is to do as much as you can in a single SQL statement. I only need the answer. I didn't need all that intermediary stuff. So, absolutely, let the engine, let the cluster, find the answer. It should give the exact same answer down here, but let Trino, let Starburst, grind that stuff out.

Not overly crazy or complicated. It should be fairly straightforward. Let me see if I can find my slides. Here we go. That should be simple and straightforward. But to be honest, we can move past that, because we have other opportunities to be more formal. 
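The difference between the two approaches can be sketched with any DB-API connection. The example below is a stand-in under stated assumptions: it uses an in-memory `sqlite3` database in place of the cluster and plain Python structures in place of pandas, and the tiny `customer` and `nation` tables are made up for illustration.

```python
import sqlite3

# In-memory stand-in for the cluster, with two tiny made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nation (nationkey INTEGER, name TEXT);
    CREATE TABLE customer (custkey INTEGER, name TEXT, nationkey INTEGER);
    INSERT INTO nation VALUES (1, 'ARGENTINA'), (2, 'BRAZIL');
    INSERT INTO customer VALUES
        (10, 'Customer#10', 1), (11, 'Customer#11', 2), (12, 'Customer#12', 2);
    """
)


def fetch(sql):
    return conn.execute(sql).fetchall()


# Less ideal: two round trips, both full result sets held in client memory,
# and the join performed locally.
customers = fetch("SELECT custkey, name, nationkey FROM customer")
nations = dict(fetch("SELECT nationkey, name FROM nation"))
local_join = sorted((cust_name, nations[nk]) for _, cust_name, nk in customers)

# Preferred: one statement, so the engine does the join and only the answer returns.
pushed_down = fetch(
    """
    SELECT c.name, n.name
    FROM customer c
    JOIN nation n ON c.nationkey = n.nationkey
    ORDER BY c.name
    """
)

print(local_join == pushed_down)
```

Both paths produce the same rows here; the point is that only the second one keeps the intermediate data off the client machine as the tables grow.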
I mentioned that pandas is a data frame API. I'm going to let Angel tell you lightly about data frames, and then we'll show you some of what I call distributed data frame APIs. With those, we write a bunch of code locally, but we don't really execute anything until we actually need some kind of results, or we save them, and that kind of stuff. So I'll turn it right back over to you, sir.

Yes. Well, I will start with the technical specification. For Python, for the language, there is a standard data frame API. As Lester said, you have pandas, and you have many other implementations. You also have PySpark. So there are different data frame implementations, and the problem was that, until this unified specification, your code couldn't be translated from one framework to another. Then a new standard was developed, and this is what really brings a lot of power, because your code can be moved from one framework to another and should work more or less without changes.

So what is, in fact, a data frame? A data frame is something like a table, like what you can have as a spreadsheet in Excel, for example. The idea is that you have rows and columns, and you can do a lot of manipulations with different data types. So you have numbers, text, dates, etcetera, in each column. And the good thing, as we have said previously, is that you've got pandas, which is very well known, but you can also use the same operations across one framework and another.

Let me jump in there. I'd say that, as a programmer, a data frame is really just a collection. It could be physical, or it could be logical. What you saw a minute ago was a physical collection. 
A list of things that all look the same, and they just have structure to them. But we're going to talk about how maybe it doesn't have to be a physically loaded thing. It might just be instructions for how to get there. And that's the part I think is hard to understand until you get deeper and deeper into how these distributed engines work. I'll turn it back over, sir.

Yeah, no problem. Please continue if you want. So what's the important thing? With this powerful standard, you also have the power of one of the implementations that we want to present. This is named PyStarburst. In the end, this works on top of this whole set of Python libraries. And the idea is that you can do select, filter, join, sort, etcetera, without having to write a very long SQL statement like the one Lester mentioned previously in his example. With many tables and many joins, you can do very, very complex things in SQL, but that's not what we really want to do. We want to move from the world of SQL to something more oriented to data scientists or data engineers, where you might be more comfortable with code, real Python code, for example.

So what happens? As Lester has displayed here, you write a SQL statement that seems to be more or less complex. You have many different tables here. You do a join. You pull many different attributes and apply several conditions. This is the code from your point of view. But internally, as you can see on the right side, there are many transformations that will be done by the engine automatically for you. So you don't need to worry about that at all. In the end, when you create your data frame, you will generate the plan that you see on the right, but from your point of view, you only need to worry about the different conditions and what you really want to do with the data. 
And, as you can see here, behind the scenes you have how the plan would work, at least logically, and on the right would be the real implementation. Next slide, please.

As I said, we have two different APIs to show: the PyStarburst API and the Ibis API. So what is a PyStarburst data frame in the end? As you can see on the slide, the logical blocks are these: you have the code on one side, on the left, and when you create a data frame, what you are really building is a logical plan. From this logical plan, you will generate SQL in the end. And in this case, I wouldn't say Trino. In fact, it would be Starburst, because that's one of the main differences. But the engine will generate your SQL based on the code that you see in this fragment above.

So what's important here? As it is mentioned here, you have a PySpark-like syntax. If you are really comfortable with PySpark, then the conversion is almost automatic. There is almost nothing to change. You also have lazy execution, which is really, really important, and that's one of the huge differences between this API and what you do with your classic Trino Python client code. In the end, what you want is an optimal execution. With lazy execution, you delay the real execution until you really need the data. So that's the main difference from executing the code in the Trino Python client.

And since we didn't show you a lot of code: I know some of you already know about something like PySpark. Good for you. And I'm glad you're here. 
For those that don't, it'll make a little bit more sense as we do the demo. As you can see above there, what Angel is saying is that you're gonna logically express a whole bunch of things, and it'll feel like, wow, this is not efficient, I keep going back. We're not really going back. We're waiting until we hit something that triggers the engine to stop being lazy and do some work. So we're building a lot of logical instructions, and then we say, hey, show me what I got, or save it somewhere, or put it in another table, insert it somewhere else. Anything that's kinda I/O-ish in nature will trigger it to do some work, and then everything he just said happens. It's beautiful stuff. Do you want me to go on, or do you wanna stay on the PyStarburst page here?
Yes, we can move on. So now, if we compare this with Ibis: it's again based on this common standard that was built for Python, and the idea is that you can do data manipulation in Python. The good thing is that Ibis supports many, many different products, not only Starburst, whereas PyStarburst is only for us. So you can move very easily from one database to another using the same code. Please continue. As you can see on both sides of this slide, there are almost no differences between the APIs. In PyStarburst, you define a data frame, and to do that you need to create a so-called session. We will see how to build the session later, in the full code. The way to build a data frame is with this session. So you say, I want to pull this table. I don't want to get, for example, the region key and comment columns, as shown here, and I want to eventually rename some. That would be the same as writing, in SQL, column X AS another name. As you can see, in Ibis it's almost the same. There are almost no changes.
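To make the lazy execution idea above concrete, here is a minimal pure-Python sketch. It is not PyStarburst code: the `LazyFrame` class, its methods, and the `log` list standing in for "queries sent to the engine" are all made up for illustration. The only point is that `select` and `filter` just record instructions, and nothing is sent until an action such as `show` is called.

```python
class LazyFrame:
    """Toy lazy data frame (hypothetical): it stores a plan, not data."""

    def __init__(self, table, columns=None, predicates=(), log=None):
        self.table = table
        self.columns = columns
        self.predicates = tuple(predicates)
        # Shared list that stands in for "SQL actually sent to the engine".
        self.log = log if log is not None else []

    def select(self, *cols):
        # Transformation: returns a new plan, runs nothing.
        return LazyFrame(self.table, list(cols), self.predicates, self.log)

    def filter(self, predicate):
        # Transformation: returns a new plan, runs nothing.
        return LazyFrame(self.table, self.columns,
                         self.predicates + (predicate,), self.log)

    def to_sql(self):
        cols = ", ".join(self.columns) if self.columns else "*"
        sql = f"SELECT {cols} FROM {self.table}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        return sql

    def show(self):
        # Action: only now is the plan turned into SQL and "sent".
        sql = self.to_sql()
        self.log.append(sql)
        return sql


plan = LazyFrame("customer").select("custkey", "name", "acctbal").filter("acctbal > 9000")
sent_before_action = list(plan.log)  # still empty: nothing has run yet
result_sql = plan.show()             # the single round trip
print(sent_before_action, plan.log)
```

The real libraries build a much richer logical plan, but the shape is the same: transformations are cheap bookkeeping, and the action at the end is what costs a trip to the cluster.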
So it is very easy to move from one to another. If you want to, for example, create some filters, it is the same function: dot filter, and then you apply a predicate with a given condition. For example, here on the left we are saying the column needs to be greater than 9,000, and on the right it's the same. One small difference is how you display the data that you are manipulating. In our case, for PyStarburst, you need to use the show function. In Ibis, it would be with print, and you can pass some size.
You'll see, we're about to get to the labs in a second. I would say those that know PySpark know that PySpark always gives you two or three ways to do the same thing. Well, PyStarburst is modeled after PySpark, so we're still gonna have that kind of two or three ways to do something. And definitely, at the bottom there, there are two or three different good ways to print something out or show something from Ibis. But I'll let you jump back in there.
Yep. Well, the slide shows that it is almost the same execution plan, so nothing really changes here. Please continue. This is also a very small piece of guidance, because sometimes you might need to decide what to use. If you are using Starburst, you can, of course, choose PyStarburst. If you are using only Trino, then you need to use Ibis because, as you can see here, PyStarburst is not supported. On the other side, the Ibis code would be, I would say, transportable to some other databases.
Yeah. Awesome. Well, cool. Good news is, let's do some demos. Let's check all this out in our notebook and walk through and see these things in action. So it's the same notebook I just used. Here we are.
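The side-by-side comparison from the slide might look roughly like the sketch below. It is an approximation, not the notebook's code: it assumes a PyStarburst `Session` and an Ibis Trino connection already exist and are passed in, and the `tpch.tiny` table names and the 9,000 threshold are illustrative. The function names are hypothetical. Method names follow the public PyStarburst and Ibis docs as best I know them; check them against the version you install.

```python
def pystarburst_version(session):
    """`session` is assumed to be an existing pystarburst Session."""
    # Imported here so this file still loads where pystarburst is not installed.
    from pystarburst.functions import col

    # Pull a table, but leave out two columns we do not need.
    nation = session.table("tpch.tiny.nation").drop("regionkey", "comment")
    # Same filter idea as on the slide: keep rows above a threshold.
    customer = session.table("tpch.tiny.customer").filter(col("acctbal") > 9000)
    customer.show(10)  # action: this is what triggers the query
    return nation, customer


def ibis_version(con):
    """`con` is assumed to be an Ibis backend connection, e.g. to Trino."""
    nation = con.table("nation").drop("regionkey", "comment")
    customer = con.table("customer")
    customer = customer.filter(customer.acctbal > 9000)
    # Executing and printing is the Ibis counterpart of show().
    print(customer.head(10).to_pandas())
    return nation, customer
```

The two bodies differ mostly in how a column is referenced and how the result is displayed, which is the point the slide is making.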
We finished up in here with the Python client, and I've got a section called PyStarburst that points to the docs, all that kind of good stuff. A reminder that it's only gonna work in Starburst, not open source Trino. I went ahead and did my pip install. A warning: if you do this notebook, when you do this pip install, it'll make you restart the session. And if you restart the session, you'll have to re-enter your host name and all that good stuff. I went ahead and did that in the background just to make sure it wouldn't slow us down. So let me find where I'm at again. Okay. Here's my boilerplate. In many ways, this is a little bit like what we saw with the Python client. It's just the connection stuff. But I would argue that if you know a little bit about PySpark, really, as Angel was saying, we're trying to get ahold of an object called the session. And this is the piece of code that's gonna look and taste a little different between PyStarburst and PySpark, because we're not connecting to a classic Spark driver and executors and all that good stuff. We're ultimately gonna connect to Starburst. But after that, once you create that session object, everything else should look and smell and taste the same. So I didn't do a whole lot. I used a function on the session object called sql, so you can directly issue a single SQL statement if you want. That's fine. The trick is that it doesn't execute anything by itself. I'll try to explain a little bit more. Here I actually chained a function called collect that said show it. Nonetheless, that was just some boilerplate to know where we're going. Now, what I wrote here is a handful of things. I'm gonna walk you through something quick and easy, and this is probably the same thing we saw earlier. Right? So, hey, that session object has functions like table. Hey, give me everything from customer. Now, what I'm doing is actually I'm kinda learning my code.
I'm interactively exploring and figuring out what I'm gonna do. So I'm doing a lot of those actions. I'm forcing it to go run and give me a result. And I'll show you in a minute that that may not be what you want to do once you figure your logic out. So I'm figuring it out. Yep, that's customer. That looks good. And then I say, oh, my DBA has told me since I was three years old, don't get all the columns if you don't need them, especially if these are columnar stores. So why don't I just take that same object? This is the data frame. This cust_df is a variable. What is it holding? It's holding a data frame object, and it came from this table function. What am I doing? I'm firing another method on it: select. Used this way, it's saying take the data frame I have and just keep certain fields from it. So this is kind of the equivalent of select these three names from table customer, and I said show it. Again, I may not really wanna show it, because by showing, I did slow down. I had to wait for it to go and run and all that good stuff. This would not be what you'd want your production code to do. This is you figuring things out. So let me just stop saying that and show you at the end. Alright? So I realized, oh, yeah, I probably don't wanna see anything that's not at least about a thousand, ten thousand dollar balance. So on that new object, the projected one, what am I doing? I'm firing a function, filter, doing something like a where clause, and, yep, that looks kinda like what I'm looking for. And then, you know what? Like you said, we wanna do a join, so I need another table. And this is the first time I introduce this: we're not trying to master or teach you the data frame API, but I'm trying to give you a few pointers as you get into it. Often, what you see in data frame API coding is people chaining functions together. So here, I went back to the source.
So I'm back to that session object, which is not a data frame, and fired off table, which says give me everything from nation and put that in a data frame. I didn't name it, but I called another function on that data frame called drop. This is the opposite of select: get rid of a couple of columns. So drop two or three. And then I called another data frame function that said, hey, rename. You know, this is the AS. Rename the name. Rename the nation key. And then I triggered it. It looks like it has a bunch of loops and all that stuff, but it doesn't. It's ultimately gonna hit that show and go, okay, what did you want me to do? You wanted me to select those two fields, with the new names, from nation, and then run. Pretty simple stuff, pretty normal stuff. And by having two different objects, nation_df and, I think it's, filtered_df, I could do a join. So I'm firing, on filtered_df, a join with that new nation data frame, and that's just the syntax. Define the key, and you could be more elaborate if you wanted to. It's a simple one. And then I just said, same thing. That's a new object. I called it join_df. Show me that result to make sure I'm kinda right. Is this what I was looking for? Again, round trips, gotta go to the cluster, all that good stuff. And then I realized that's pretty close. I really wanna drop these two other nation columns, because that is a consequence, a side effect, of joining via the API. You don't really get to drop the extra columns you don't want in that call. Usually whatever is in both data frames will show up. So trim out what you don't want by dropping them. And then, ultimately, I probably wanna do some sorting. So again, I'm just taking that same data frame from the previous step over and over and over. Ultimately, I say, you know what? This is kinda what I wanted. And that's what the power of a notebook is for, especially when you're learning and figuring out datasets. You don't have to write code and submit the code.
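The walkthrough above could be sketched in PyStarburst roughly as follows. This is a reconstruction from the narration, not the notebook's actual code: the function name, the `tpch.tiny` table and column names, the new column names, and the 9,000 threshold are assumptions, and `session` is assumed to be an already-created PyStarburst session. Method names (`table`, `select`, `filter`, `drop`, `with_column_renamed`, `join`, `sort`) follow the PySpark-style API as I understand it; verify them against the PyStarburst docs.

```python
def build_customer_report(session):
    """Sketch of the demo pipeline; `session` is an existing pystarburst Session."""
    # Imported here so this file still loads where pystarburst is not installed.
    from pystarburst.functions import col

    # Start from the customer table and keep only the fields we care about.
    cust_df = session.table("tpch.tiny.customer")
    projected_df = cust_df.select("name", "acctbal", "nationkey")

    # The "where clause": only customers above an (illustrative) balance.
    filtered_df = projected_df.filter(col("acctbal") > 9000)

    # Back to the session for a second table, chaining drop and renames.
    nation_df = (
        session.table("tpch.tiny.nation")
        .drop("regionkey", "comment")
        .with_column_renamed("name", "nation_name")
        .with_column_renamed("nationkey", "n_nationkey")
    )

    # The join keeps the columns of both sides, so trim the keys afterwards.
    joined_df = filtered_df.join(
        nation_df, filtered_df["nationkey"] == nation_df["n_nationkey"]
    )
    return joined_df.drop("nationkey", "n_nationkey").sort("acctbal", ascending=False)
```

Nothing in this function runs a query. A caller would trigger the work with an action, for example `build_customer_report(session).show()`.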
You just push the play button, get results, you know, all that good stuff. But toward the end of this, I realized, you know what? That's kinda what I wanted to do. So I put my more formal data engineering hat on and say, time to go to production. So what am I gonna do? I'm gonna take that logic I just did and remove all of those show-me-the-intermediate-results calls. Just build everything you can and then fire off one function at the end that says give me those values, or save them, or share them, or something like that. So all of these instructions are the same thing we just saw, but put together, really, in, like, two commands. Build this, build this, and then show that second one. The power of lazy execution is that none of this is running until we hit one of these functions. They're called actions. So we wanna hit one of these action functions. And then behind the scenes, as we were trying to explain, it's gonna build a data frame lineage. That data frame lineage is gonna turn into probably a really long SQL statement, longer than if we just wrote it ourselves. But the optimizer, the cost-based optimizer, is gonna go, I don't care if you're a good or a bad programmer, a SQL pro or not, I'm gonna do my best to make it the most efficient. So it's gonna run it, and it's gonna do pretty well. And the point I wanted to make, because we also called it out, is that you could also just run SQL statements, kinda like we did with the Python client. Arguably, if you have an app that just periodically needs to run a few queries and that's it, the Python client would probably be just fine for you. If you're a data engineer building a long pipeline with a bunch of activities, especially if you wanna write it all in Python, then more than likely the data frame API is gonna be your best friend. These two are really producing the same thing. And while I don't show you a great example, you can mix and match.
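The difference between the exploratory style and the "production" style described above can be sketched without any library at all. Everything here is hypothetical: each helper just wraps the previous plan in another layer of SQL (a crude stand-in for data frame lineage), and `show` is the only thing that counts as a round trip to the engine.

```python
def table(name):
    return f"SELECT * FROM {name}"

def select(plan, *cols):
    return f"SELECT {', '.join(cols)} FROM ({plan})"

def where(plan, predicate):
    return f"SELECT * FROM ({plan}) WHERE {predicate}"

def order_by(plan, ordering):
    return f"SELECT * FROM ({plan}) ORDER BY {ordering}"

def show(plan, log):
    """The only 'action': record that this SQL went to the pretend engine."""
    log.append(plan)
    return plan


# Exploratory style: an action after every step means a round trip every time.
explore_log = []
step1 = table("customer")
show(step1, explore_log)
step2 = select(step1, "name", "acctbal")
show(step2, explore_log)
step3 = where(step2, "acctbal > 9000")
show(step3, explore_log)

# Production style: build the whole lineage, then trigger one action at the end.
prod_log = []
final_sql = show(
    order_by(where(select(table("customer"), "name", "acctbal"),
                   "acctbal > 9000"), "acctbal DESC"),
    prod_log,
)
print(len(explore_log), len(prod_log))
print(final_sql)
```

The generated statement is deliberately clumsier than hand-written SQL, with a subquery per step. That mirrors the remark above that the lineage turns into a longer statement and the engine's optimizer is left to flatten it.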
You could use SQL to build your data frame and then start running these other functions, the filters, the wheres, the aggregations, and that kind of stuff, and we'll ultimately get there. So, Angel and I were talking about this. The folks that know data frame APIs and know PySpark already are gonna go, cool, I got it, that was too simple. And the folks that are new or completely new to something like that are gonna go, what? That's okay. Those are the expectations we have. Really, the goal is to make sure you know there are options to run these various activities in Starburst and Trino. So I'll just hit play on these. I'll just run two quick examples. More likely, this is something you might see. Again, very simple. It's using those same tables again. Actually, I threw a third table in there. So we're joining across three tables and ultimately doing some aggregations, as you see with the group by here, and some sorting. And then we're gonna get some kind of result. I'm gonna let that finish spinning. I think I waited too long and my free cluster shut down on me. It likes to shut down quickly if I don't use it. So that's finished. I'm gonna kick this one off too. It should run a lot faster. Those are some results. Here's another example. Again, this is something you have access to look at later. A windowing clause, if you know about windowing, partition by, all that good stuff, and you see the value of that. Here's an example just to show that we can do all the same stuff that you can do in complex SQL with this data frame API. And as I said above, if you don't like this, you can still write SQL, and there's still value in that, for sure. Alright. I'm seeing that there are already only about ten minutes left.
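For anyone new to the windowing clause mentioned above, here is a small runnable illustration of `PARTITION BY` in plain SQL. It uses Python's built-in `sqlite3` module as a stand-in engine so it runs anywhere (it needs SQLite 3.25 or newer for window functions). It is not the notebook's example, and the table and rows are made up. The same `rank() OVER (PARTITION BY ...)` idea is what the data frame API example expresses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, nationkey INTEGER, acctbal REAL)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [
        ("c1", 1, 9500.0),
        ("c2", 1, 9900.0),
        ("c3", 2, 9100.0),
        ("c4", 2, 9700.0),
        ("c5", 2, 9300.0),
    ],
)

# Rank customers by balance *within* each nation, without collapsing rows
# the way GROUP BY would.
rows = conn.execute(
    """
    SELECT name, nationkey, acctbal,
           rank() OVER (PARTITION BY nationkey ORDER BY acctbal DESC) AS rnk
    FROM customer
    ORDER BY nationkey, rnk
    """
).fetchall()

# Keep only the top customer per nation.
top_per_nation = [row for row in rows if row[3] == 1]
print(top_per_nation)
```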
So what I'm gonna do is quickly tell you the cool news: the rest of these little paragraphs in this notebook are duplicating what I just did. So I went and installed Ibis here. And because Ibis can work with a lot of different back ends, it's gonna tell me, and I don't wanna do that. So I'm gonna go ahead and hit that. Oh, it's restarting. Because of time, I'm not gonna go back and put my username and password in there, but it looks very similar. Right? Set up an environment, and then I did the exact same code, the exact same functions. This is where those subtle differences might show up. In fact, that showed a different way to do an output. You're gonna get the same results with these four. It's the exact same statements. There we go. I showed you the same thing. Okay. You do it all interactively, then package it up as one big set of calls. You can also do SQL directly in Ibis, but the warning that the people who wrote Ibis will give you all over the website is that they don't really want you to do that. And why not? Well, with PyStarburst, if you write some SQL, it's just gonna send the SQL on, and all that good stuff. Ibis, on the other hand, is trying to be generic. Because it knows a whole bunch of different back ends, it can do a good job of taking its data frame API, understanding all the nuances, and sending the right thing. But when you use SQL directly with Ibis, they chose not to try to take generic SQL and rebuild it as another SQL. They simply pass it on through, as is. So you may lose some of that portability benefit that you have with Ibis as you jump around between different back ends, of which Trino is only one. But the truth is, at the end of the day, you can still do it. We did those same examples that I did up above again, number one.
And then with the windowing one, we actually made the point I was trying to make earlier: could you write some SQL, create a data frame, and then from that data frame continue on and write API calls? Absolutely. We did a filter, and then we did an order by. Just keep in mind what I said about Ibis. Likely, what's gonna happen is that the SQL is gonna run, and then the rest is gonna run. So there are some cautionary tales about how this all works in Ibis land. The short answer is, if you go Ibis, go data frame API, and try not to write straight SQL in that space. Okay. I think that gets us through that demo. And I think all I really have left is the old Q&A thing. We put a nice QR code up for you to click on. So I'm gonna look over there, because I haven't looked in a little bit, and see if any questions were asked. Okay. We see some questions. Awesome. I don't know if Angel jumped in. Hi. So Greg says, Long Island, New York. Are there any updates coming to PyStarburst? The short answer is there's not an immediate update coming to PyStarburst. I'm gonna be mindful and careful here, so I'll be very transparent, but very careful too. PyStarburst is a supported platform. We also have a partnership with Dell, where we do an appliance that has Starburst Enterprise and those things. That's been about two years running now, and part of that was we decided, for that platform, to also bring, in addition to Starburst Enterprise, Apache Spark as well, full-on classic Apache Spark. So we've devoted the same people that built PyStarburst to that. They've been working really hard to support Apache Spark in that appliance. So that's software that we're packaging and supporting with Dell. And it also let us think about other opportunities.
So PyStarburst will always be there. I'm not suggesting at all that we're gonna run an Apache Spark instance as part of Starburst Enterprise. Theoretically, it could happen. More likely, maybe something like what Snowflake has done; they have a Spark Connect API. So we are investigating what our next big push for our data frame users is. I'm not discouraging you from using PyStarburst. I actually encourage it. It's there. It ships. It's live. But you're absolutely right. We're in bug fix mode at best right now, while we make a good decision about what the next few months of effort for this team should focus on, and it could include other opportunities. So, short answer: no current releases are slated. Alright. So what else have we got there? Data frame from a SQL query, to_pandas, returning all columns as string. I'll have to look at that with you, Greg. Let's find out what happened there. I'm not so sure that should have happened, but I am familiar with what you're doing. So, Greg is saying, I'm running a data frame, and then I'm putting a to_pandas function on the end of that. What that's gonna do is bring his results back to the local machine and then take that data frame and build it into a pandas data frame that he can manipulate. And Greg is saying he's got some problems: all of his fields are showing up as string only. So, Greg, I hope you and I can catch up. I'm gonna put my email right here for everyone to see. Lester.martin@starburstdata.com.
And to be honest, the best place, Greg, for you and anyone else, if you have any kind of problem: if you do me the lightest favor in the world and post it here, then not only will I make sure that I or someone else gets you an answer, even if it's a bad answer, but other people can pick up on it and learn from it, should that problem exist for them too. But you can email me directly. I might, you know, see if I can get us to put it out there while I run it down. So, yeah, you're probably not doing anything wrong. It sounds wrong. I'm gonna help you figure out what's going on there. Any other questions before we run out of time in the next three minutes here? Angel, got any other comments you wanna chime in with before we move on? Because I think we're gonna disappear in just a few more seconds. He's quiet. In two minutes. I said a few seconds; my backstage crew here said two minutes. If anyone remembers that Chuck Woolery or something, "I'll be back in two and two," but never mind. Some old weird dating app from, I don't know when that was, the nineties or something. Okay. I think that's all the questions that I see. Greg, I'm hoping you and I will catch up. I will go hunt your email down too in the list here and make sure we do sync up. But if I don't reach out quickly enough, please just ping me, and I'll start looking at that problem for you. Alright. I thank the folks backstage here. Thank you, Angel, for joining me here today, and I thank all of y'all for taking some time and hanging out with us. Have a beautiful time with all things Python. And for the live folks, this is December. This is December '5. So enjoy this holiday season, whatever it means to you, but have a good time. Take a little time off if you can, and I'll sign off for now. Thanks now. Bye bye.
Thank you.