Video: Starburst Office Hours: Connect Once, Query Everywhere | Duration: 1900s | Summary: Starburst Office Hours: Connect Once, Query Everywhere | Chapters: Introducing Starburst Office Hours (13.695s), Starburst and Connectors (152.275s), Connectors and Configuration (461.17s), Starburst Galaxy Catalogs (653.195s), Multi-System Data Integration (885.28s), Starburst Data Storage (1462.61s), Data Integration Strategies (1539.315s), Dell AI Platform Integration (1635.45s), Cross-Cloud Data Billing (1716.495s), Concluding Office Hours (1776.33s)
Transcript for "Starburst Office Hours: Connect Once, Query Everywhere":
Hey, everyone. My name is Lester Martin. I am just starting to share my screen, and there it goes. And what I will do is put this up in big old big old screen, drag all some of these little notices off my screen so it won't bother everybody. Hopefully, that it's all there. And, I'll jump into a little bit of slideshow. So, again, my name is Lester, and today, we're gonna start a new webinar series. We've kinda done something like this before, kind of an ask me anything series. We did a you'll learn with Lester a couple years ago, but we're gonna call this one Starburst office hours. So we'll try to do this initially probably once a month. If we're really lucky, if we get a lot of audience participation, maybe get it to be, you know, twice a month, once a week, maybe even more. But, ultimately, we're trying to create a quick a quick webinar or show, whatever. We spend maybe thirty minutes entirety in its entirety. And the first, you know, ten to twenty minutes will probably be me presenting something, probably some kind of demonstration or explanation or something, something hands on of interest that's in this realm. And and then we'll spend the rest of the time in some q and a. And we have tools like chat that you probably see in the screen there, and Quincy just shouted out, looks like. We have chat, and then Quincy's also gonna let us know toward at any bet any point, I guess, you can say, there's gonna be a way to raise your hand and say, hey. I wanna come on screen and join and join you on stage there, Lester. So this is something new for us. We're gonna give it a hard try, and we encourage you to to do that. But, again, if you feel more comfortable writing things in the chat, that's really, really, really good as well. Alright. So I left my my shameless self promotion slide up for a little while. A little bit about me. I'll let you just have read it if you wanted to, that kind of stuff. Now, this material that I'm gonna do, I'll go ahead and say it real quick. This material is actually in the link that says docs. If you click on that, it's gonna take you just to a a GitHub project, particular folder. And there's a link in there to these slides. They're very short, five or six slides. And then some appropriate links are thing of things that we'll be talking about today. Alright. Go back over there. Alright. So what are we gonna talk about? Well, first, I guess, it's it's worthwhile to talk about the fact that Starburst is an engine, framework of data platform built on top of that open source Trino. In fact, the folks that created that when it was really originally called Presto moved on into created changed the name eventually to Tritos, founded Starburst, all this kind of good stuff. And like a lot of open source plus companies, we, you know, try to make sure it works really well for folks. We can install it and configure it, run it, integrate it with all your security stuff. Absolutely. And we're we're gonna talk about a lot today is this notion of connectors and and extensibility or optionality. We not only get a lot of the open source or or the open source connectors that are available, we may create our own or we may enhance some of the existing ones as a company Starburst. And then things that we won't spend a lot of time on today is, you know, you look to your platforms to do even more, to run faster and faster and faster without breaking the fundamentals of of the things like, that that Trino provides. We we're real big on that. And that last big pop up at the bottom says, hey. We run, a bunch of places. We run as software. You install wherever you want in the cloud, Kubernetes, on prem, whatever, or we have Starburst Galaxy. That's our software as a service. I'll be using, that in my demos today. I would what I'd mention, like, just to go back when we even have a platform or appliance with Dell, and we actually have some other hardware partnerships that you'll hear about very, very soon as well. So today, we're focusing hone in really quick on the on Trino proper, of course, and then those connectors, what what makes good use elsewhere. And those connectors are really what Trino is about. Trino is a query engine. It's not a database. It's a processing engine, a query engine, a compute engine, SQL based, massively parallel processing engine. But we have built a solution where there's a connector architecture, connector plug in architecture. Someone could build a connector to our world. And I'll show you those links in just a second, but it can be to anything as long as we try to represent it in something that we could see with SQL. So that's our kinda gotcha. We wanna make sure these things are appropriately represented. So if it's something completely different than SQL, we might have to do some work in that mapping tier. But if it is something that, you know, could looks like a table, smells like a table, it'll probably be a table. And that'll give us this cool concept called, you know, a single point of access. People often historically call that data virtualization, and I'm cool with that phrase. It's an old term, but data virtualization often suggest that we just do pass throughs to you're just like a middle middle person in the in your architecture there to take a query, pass a query, return a query. And we can do that, but we're gonna do better off when we have a data lake in the mix. And we're gonna really shine if you have a bunch of different data lakes or a data lake or three data lakes and Oracle and Mongo and etcetera, etcetera. Why? Because we can do a query a join query across all those sources. We call that query federation. And that's what the kind of the end of my demo will be. We'll set up a couple connectors. We'll do some joining across them all and go, wow. That was neat. And then we'll turn it over to q and a. And this one was just showing you a fit quick figured example, especially with that color coding of cool. Wouldn't it be nice to join your Hive customer table with your MongoDB, bank account data with your Oracle, you know, customer profile data? It sure would. And wouldn't it be cool if somebody did that for you? So let's see if we can do that. Let's go to our demo for the day and see what they have. Now I mentioned I am using and I'm gonna hit this query because I'm using Starburst Galaxy. And I thought I bumped it up to one I'm gonna bump it to one twenty five. It eats up a lot of real estate, but it might make it a little bit more visual. Those that like, you know, dark mode, we have that. I am on the fence sometimes. I love dark mode, but sometimes, like, just give it to me nice and bright. So I'll leave it that way. So what we do in Starburst Galaxy again, this conversation is really about Treno. I'm gonna use Starburst Galaxy because it's a very fast way to get up and running with Treno. In fact, just go to starburst.io, you know, and then just pick products. And under Starburst Galaxy, you'll see that we have a start for free. And what we're giving you for free is you all you can get a free Starburst Galaxy setup all the time forever. Mine's called lester. As you see way up here in the top left. Lester.galaxy. But what we're also giving you is a lot of credits. Now cool thing is I'm gonna do stuff today that costs no credits. I'm gonna use a free cluster. Doesn't cost me anything forever. The consequence, it does shut down after a little while of inactivity, and I have to incur a few seconds while it spins itself back up. And you have to kinda connect your own data. We'll supply some data to get us started. Now I mentioned a couple links here. So I showed you Treno, showed you Starburst, but let's look at a couple of those things. So in there, it's all in that connector space. So I'm gonna say Treno connectors. Simple enough. And I'm hoping that there we go. This is the docs or the most current Trino version, that's four seventy nine. And this is the list of connectors that come with open source Trino. And what you do with connectors for the most part, let's just pick online, maybe Cassandra. We haven't talked about that in a while. In the software that you installed yourself, you really the solution is you ultimately need a properties file that talks about that system, and that property file, you know, needs to live in a magic place. So it depends how you install it. That could be in home charts, could be YAML files, that kind of stuff, I guess, meant to say. Could be just dot property files if you're just doing, like, a binary installation. But you set up some properties, and then all of a sudden, once you do that, you see something like this. Your cluster starts to have connections. I didn't undo a Cassandra one. Here's a a Snowflake one, as you can see right there. Snowflake one that has some Pokemon data that's present to us. Alright. So these are a bunch of lists. Now the truth is there are other lists. There's a Starburst connectors. When I'm making the point here, Starburst connectors are well, there's a there's a a marketing page with lots of cool icons and searching that all cool. But in the day, one of these, if you really, really wanna get into them, they're gonna send you to them. Oops. So they're gonna you can have a nice PDF. I was really looking for just the the doc site because that's what I wanted to focus on. I think that's it. Yep. Now this connector list, in some ways, looks a lot like the other one. But if you look really closely, it's hard to see until I zoom in a little bit. These connectors actually are gonna show up as three different ways. The black hole let's pick them one. The DuckDB connector. That's an open source, comes from the Trino community. We're part of the huge part of the Trino community. And then you guys see ones like this, exclusive DynamoDB or GreenPlum. So those are ones that we at Starburst created, offered up, etcetera, and you can only get with a license. And then we do some that we would just call improved improved. So we're doing some additional things. Sometimes that could mean that maybe there's just some statistical things that we can do to make our engine run faster and they don't violate anything. So the goal with improved is not to break what you get from the open source community, but offer some usually performance, improvements and that kind of stuff. Alright. We're talking a lot. What we wanna see is how you do all this. What does it do? So I'm gonna go run back over here to my Starburst Galaxy. And, again, I'm just gonna run a query to keep this thing alive. Now I mentioned in in the in Trino and Starburst, you kinda go create these files. It's a little easier in Starburst Galaxy. It's a software as a service, so we're gonna just give you a wizard to walk you through that. You're gonna say something like create a catalog. For example, I want one for against some an Amazon history bucket I have. So I'm gonna click on that. I'm gonna type in some information, pick a metastore. And in fact, I'll show you the one that I'm gonna use right here. I think it's called My Cloud, very, very fancy name, Cloud. My Cloud. Yep. There he is. So I hit edit, configuration. And there you go. I named something, talked about security. This is the one that can support all kinds of table formats, Iceberg, Ive, Delta. And, and then want some know some some metastore, what what now we might start to call catalog again. So we have this where it means catalog, and they have there's another word that means catalog, one of those heavily overloaded overloaded words. So we can set up things like that. We can set up we're gonna do one called, I don't know. Think we just called it office hours. Yeah. Office hours post grads, another one that went here via wizard and simply said, hey. You know, connect to and I did a direct connection. You could do all kinds of sophisticated security. I went to AWS and lit up an RDS and since the post grads and exposed it to the Internet. Maybe not the smartest thing to do, but good enough for today because I'm gonna go shut it down as soon as we're done and tear it down because I did it on my personal account, and I don't wanna go broke either. And, you know, like I said, you can create all kinds of catalogs. In Starburst Galaxy, we'll limit to what we feature, meaning we put a lot of support around this, so we wanna make sure it works. Again, if you do an open source Treno or a Starburst enterprise installation yourself, you'll actually have a lot more options than we currently have, you know, put a lot of rigor around feeling really, really comfy with them in our software as a service offering. Okay. Let's go do something. Alright. So what if in this environment, I went ahead and did a lot of that. I created a bunch of these. This cluster has, what is that, about 15 or 20 different data sources tied to it. And something we've done a cup about a year or so, I don't know if I love it or hate it, is we we actually put some cool indicators of what these things really are. Like, this Postgres internal is a Postgres connector as you saw with with the icon, with the calling it out. In the old days, we just have a, like, a database icon. And I kinda deep down prefer that because part of that single point of access is to tell folks, don't worry about it. Don't worry that that's Oracle or Postgres or elastic search or something. You know, don't worry about that and and just think about the logical names. But that said, it is pretty cute and pretty interesting. It helps a lot. So so that said, what am I gonna do? I'm gonna look at this one down here called sample. So we generally don't have a lot of data we offer. We offer a sample with a couple different datasets, but one of them in there is a schema called burst bank. So the sample catalog has a schema called burst bank with ideas like this. In fact, it has a table called burst bank I'm sorry, called account. So I'm gonna do a quick select all from count and say, okay. Cool. You know, couple. There's some records in there. And what we're gonna try to find it out that I wanna ask a couple questions. I wanna say, what kinda if you provided that, what is that? Well, this is a by doing a show create table. I said, look. It's a and I'm looking for my clock to see how I'm doing on time. Okay. This is a hey. It's a Hive table. As you see, way down here. It's a Hive table using Apache org file format, and then we talked about what's there. Very exciting. In fact, it's terrible. It looks like it's all VARCHARs, but okay. That's fine. It is what it is. Alright. So what I can do is, I I didn't mention it, but it look if you notice what I'm doing here. I'm using a three part naming convention instead of the classical two part. You connect to Oracle, you say schema dot table, you connect to SQL servers, say database dot table. And either one of those, if you don't wanna do the two part, you say use the database with the schema. You have the same thing here. We would say use sample dot burst bank. We can just go to the first part. But generally speaking, if we go all the way down, then we could I meant to show by doing that, you could just say you know, you can do what you might wanna do out of the convenience. You can just say, you know, select all, from account. So same table, HiveTable, and, you know, this GUI lets you set them all things like that. I myself do a lot of three part names because I'd like to show off that things work all over the place. But let's look at this data a little bit and say, let's start to do some kinda real world scenario. We found this table. We're trying to figure out what we can do with it. We see, hey. I'm curious. I saw there were some mortgage information. So let's just say, give me all of them where that mortgage, what was it called? The mortgage ID is not null. Because when I did the select a minute ago, I didn't notice there were some nulls in there. And in fact, I started looking at the dataset, and I realized it's kinda kinda got a pattern. It has a products field column, which has an array of types. So you can look for ones with mortgage a couple different ways. So I said, yeah. They'll only want mortgages. And then I realized, the mortgages have things like I don't know. Where is it at over here? Mortgage yeah. It's got a mortgage close date. So I really want the open ones. Now I think there's actually here mortgage status. So there's lot of ways to find this out as the world takes us down that path, obviously. So we're gonna say, hey. I'm looking at just, accounts that have, that have a mortgage that's open. That's what I'm focused on. And I said, cool. But who is this person? Who's this customer in there? Well, I find out over in Postgres, and that's this one, o h Postgres SQL, I have a table called customer master. So my customer master is in my relational system, my kind of account data is in this hive system, and I say, cool. Can I join those together? But before I go too much further, went ahead and fun with it. I started looking at it. It has a lot of Canadian folks. I decided I really wanna look at a lot of maybe South Central, Southwest, kinda Texas, Oklahoma, Louisiana folks. So I got a handful of those folks, and I wanna join them. I wanna join my Hive table with my Postgres table. So that's all I'm doing there. And I went ahead and just fully qualified them with catalog dot schema dot table. Catalog dot schema dot table, And I got pretty far, and I did a pretty cool thing. I joined some data lake stuff with some relational stuff. And in fact, I went a little further. I said, and if you really wanna see that folks that love to look behind the scenes, I could kinda say, well, what just happened? And I got a lot of information here, but maybe what I'll click on first is this quick view that, yeah, absolutely, I did read these two different datasets. Those that know a little bit about query plans either with Treno or with Spark or Hive. If you know a Spark or Hive query plan, to be honest, you'll be pretty good with a Treno query plan. If it's new all that's new to you, that's okay. If you wanna walk into it, there's other ways. You can always kinda look at a visual representation of what that thing look like. In fact, it's it's not all that exciting. It just did for those that know a little bit about this kind of these parallel engines, it did a table scan of that customer the Postgres data, and then it broadcasted that. So it could do a map side or a or a broadcast join where you just limit a lot of shuffle, and then there we go. We did a quick partial sort to sort. Very exciting stuff. I'm a get us back to the queries here because I wanna get to the q and a. And I think what I said oh, yeah. Look. I said, hey. Guess what? I have another table that lives somewhere else that has the payment information, yet another place. So there's some payment data for those mortgages. And if I ask the system to tell me about it, I'm actually gonna find out it is an Apache Iceberg cable. In fact, they're using the Parquet format. And for giggles, I went ahead and made it the version three, some pretty pretty, advanced stuff, some pretty new new new stuff instead of go version two for those that track Apache Iceberg. So there we go. What I wanna do now is put it all together. Yeah. I only want those columns, of course, and we go back. Well, okay. I don't know what I did wrong there. Just get the values we need. I guess I didn't qualify that because it's looking somewhere else, and that's okay. But, ultimately, what I really wanna do is this one anyway. Go back and look at that in a minute and see what I did wrong. I really want this issue a three way a join across all three of those systems. And there it is right there. And I'm gonna do that same query details page. I'm gonna jump straight to that advanced tab and see absolutely. There we go. I hit my Postgres. I hit my hive table here. I hit my on one data lake, and I hit my iceberg table again on another data lake. And I can bake it more more than just the data lakes. Those are more interesting and fun ones to me. And, again, if you really wanna kinda go break break it down and see what really happened, there's plenty of tools here to help you see that. Okay. So I got all that running, looked at my query details, and that ultimately, what was I really trying to do today, as my, you know, exploratory analysis kind of phase was taking me. I really just wanna take that information. So, again, people who live in Oklahoma, Texas, or Louisiana that have, open mortgages, why? Because I wanna see how many, I wanna see the average. There we go. And I wanna see the average payment buys those in those particular states I mentioned with open, not closed, open mortgages. And to be honest, they're all about the same, about, what, 52, $5,300, not a small mortgage payment. And the reason they're probably so tight because, yeah, it was this is data that was likely created by a generator in the first place, and it probably didn't have too much variability. Or you could just say, these states live near each other. The average ought to be about the same. I don't know. Alright. So I think from a demo, that's what I was trying to show, today. And so what I'd like to do, is and I'll just leave this on the screen here. I'm gonna switch over and look at the chat and see what kind of questions we may have in there already. If you don't have a question, you just wanna say, hey. You're listening. Just shout out where you're coming from, that kind of stuff. I'm in Atlanta, Georgia myself. And Quincy, in the background, colleague of mine, is gonna help folks that might want to ask their question out loud either, you know, with audio, video, come on screen, and that kind of stuff. But I see so she's asking that. I see one from Vilo. Does this connection to Snowflake, Postgres, etcetera use federated query or federated catalog? Well, I would say today, it uses federated querying. So each one of these, cat you know, we're using the term catalog to mean an instance of a connector type, and that's tree node vernacular for ten plus years, a catalog was at a connection to of a certain type. Now when we say catalog, like in your question, a federated catalog, we're probably talking about a meta store. Right? Look up a description of. So today, each one of these definitions here needs some kind of catalog. So if there are data lake, they're gonna you have to supply something, nice progress catalog, Hive metastore something. And more than one connector can use the same metastore catalog. You could have three different connections for some reason that use the same Hive metastore, if that made sense. But they are need they're configured independently. They may be the same one. And then some catalogs are catalogs that are in the system. So Snowflake has its own internal catalog. So when we connect the Snowflake, we're using Snowflake oh, excuse me. Snowflake's catalog. We could also use their Polaris catalog and other stuff. I will say on that question, I hope so. The answer was, yeah, there it's a federated query with each, connector. Use likely have its own catalog, and that's okay. Starburst, Trino, the coordinator, the brain engine is gonna grab metadata from both or all three of those catalogs. And then ultimately, from a plans perspective, gonna figure out what to do and go at it. But you bring up a really cool stuff called so we believe there's lot of interoperability with that, and I do not wanna steal our product management thunder. But I would say, that phrase federated catalog means a lot to us too. And I would say very soon, we might be seeing something on that space because if I was a company built on interoperability and high flexibility, I would see the value of making sure all those catalogs could be meshed together. But I I'm speaking a little too much on that one, Elisa. Hopefully, that helped. Alright. If they didn't, just raise it up again. Okay. Quincy had a quest Quincy Quincy. Alright. Gerson on Starburst platform. I'm reading these without looking at it. Do I always need to gather data straight from data sources using connectors, or can we load the data into our own Starburst storage and process transfer that? So, Christian, the answer is not yes or no because I wanna make the answer's yes or no. The answer depends what you wanna do, but let me debunk one little thing. Starburst doesn't have storage. So when you go to Snowflake, Snowflake internally managed table, they want you to bring that data to them. Here, we don't really have that. We don't have a persistence layer. Now that said, what you may do is you say, well, primarily, I'm using Starburst and Treno against AWS's s three or something like that. You might think of that, but you're bringing that storage. You're gonna get billed for that storage independently from us, that kind of stuff. So you never have to bring anything us because we don't have a place to put it. But, the subtle question underneath that might be, do I have to or could I? Do I have to or do I need to do some transformation and rewrite that data somewhere else? So if you got data on Oracle and SQL Server, do you always wanna federate between Oracle and SQL Server, or do you wanna push one or both of those into maybe a data lake? The answer is yes, yes, and yes, and yes. So we have a lot of flexibility and optionality in what you wanna do. And the short answer is what's right? What's the right answer for you? And I would say if it is occasionally that you need to join these two datasets and that makes perfect sense in those systems, a federated join is probably awesome. You don't waste any transformation jobs, duplication of data spaces. You don't have a concurrency issue to worry about. I'm sorry. Current, currency from, staleness. You wouldn't have a staleness issue, but you don't have a multi user concurrency issue to deal with either. Now you have three dashboards that are driving off these three kinda combined views that are that are merges of that joins of all that. Yeah. Likely, I would probably build a transformation job that ran that query and built an aggregate or whatever and stored that in the data lake and then have my my dashboards ping on that. Heck, yeah. I think that would make a lot of sense. So the answer is we'd let you do whatever you need to do. And, of course, from a consulting company, from a DevRel space, etcetera, we are more than more than glad to offer our opinions or, you know, we'd like to think they're they're valuable opinions and and etcetera. But, but, ultimately, the answer is up to you if you need to. And often, it comes down to that mix of performance, cost, staleness, etcetera. Oh, you had some more questions on that. So let me read on. So Gerson still had one thinking about Starburst use on a Dell AI data platform, which has its own solution. So that Dell AI platform also can reach out to data somewhere else too. The Dell Data AI platform, part of know, there's several parts of that Dell appliance. We provide our Starburst slash Treo software. We also provide a a Spark runtime to that. It's the only place that we support Spark directly here. But the short answer is in that Spark space or in that Starburst kinda analytical engine, sure. Absolutely. Highly opinionated data data lake s three compatible storage, all that good stuff on the Dell appliance and but do you need to bring it there? Technically, you don't have to. That's up to you. That's up to you as it makes sense or not, etcetera. But, yeah, if you're in a Dell if you have that in your environment, at your shop, behind your wall, and you got some Oracle and Postgres systems, you don't have to push those into the object storage. You may want to, but you don't have to. And if they were outside the firewall, as long as you, security wise, are comfortable with that, performance wise, all that stuff, then we can actually go out. So I hope that helps, Kristen. If not, just ask again. I'll ask, how does the Galaxy handle the egress charges when you're joining data from multi cloud different well, we handle it by, for the most part, by the fact that you're gonna pay for it. So I'm I'm I'm joining between GCS and s three, and I went ahead and made cross cloud just to make the point that, hey. Guess what? There's no shortness of, you know, AWS. You can be in the availability zones and, you know, don't cost to move things around. That bill's gonna come from the cloud provider depending on how they charge you on data moving around. Hope that helps. Yeah. But we don't wanna we don't we wanna charge you in Galaxy for the processing engine that happens, how big of a cluster you need, all that kind of stuff. So so the bill's gonna include that, but you're likely still gonna have some direct billage from your, from your object store provider. Alright. I haven't seen anyone, take the bait. Come on screen. I would love you to. And in fact, we're we're at the thirty minute mark, which doesn't mean I'm leaving, but it does mean for those that allocated a whole thirty minutes and you hung around, I appreciate you, and I hope you come back and do it again. I think the next office hours that are on our events page, maybe Quincy can post the events page on there. It's gonna be about our back and our back and call it. We're I'm just gonna run through some of the security things that I believe, you know, we do a good job at. One might argue that they're kinda table stakes, but I wanna make sure, you know, we integrate with all those players that are out there, privacy errors and mutants, but we also have our own built in access controls that I believe we're pretty pretty mature, at this point from time and effort in the field and from feature points as well. So I will pause a few more seconds. If no more questions roll in, I'm gonna call it a successful day. I'm gonna invite you to come back again in the future. We're also doing a series that we're starting up next week called workshops, and they will last about ninety minutes. And I will provide, a lab guide, and we will make sure those that are going very fast that we can do it all together, the lab together. We'll do it on Starburst Galaxy. And for those that watched but had trouble keeping up, they'll have the video. They'll have the materials. They can do the lab later. And for our on demand friends, we can come back and and do it at another time. Same thing here for you on demand folks. And, thank you, Quincy, for wrapping this up. Thank you everybody for being here, and thank you for all the great questions. We'll see y'all next time here on Starburst office hours. Thanks again. Hi.