Video: Optimizing Apache Iceberg Performance | Duration: 4816s | Summary: Optimizing Apache Iceberg Performance | Chapters: Introduction and Overview (9.84s), Apache Iceberg Overview (437.105s), Data Platform Optimizations (1811.35s), Join Optimization Techniques (2126.73s), Bucket Join Optimization (2899.45s), Iceberg vs Hive (3650.09s), Orphan File Cleanup (3754.375s), Orphan Cleanup Considerations (3865.235s), Z-Order Optimization Discussion (3947.635s), Understanding Table Operations (4046.62s), Optimize Data Compaction (4238.44s), Variant Data Handling (4329.4s), Maintenance and Optimization (4528.69s), Stats in Iceberg (4609.88s), Conclusion and Farewell (4684.455s)
Transcript for "Optimizing Apache Iceberg Performance": Hey, everyone. Looks like my colleague is still coming on screen and there he is. Excuse me. Marius Gromma and myself, Lester Martin will be presenting today. Our webinar is titled Apache Iceberg performance, optimizing iceberg performance. We are slated for about an hour. I think we might have blocked ninety minutes your time. We're targeting an hour today. And we're gonna try to leave about the last, ten, maybe even fifteen, but let's say ten minutes to q and a. But, I will say don't hesitate as we go along, to ask your questions. You only have chat or that QA tab, but try try to use the chat if you can. But, if it if the QA feels better, that's fine. I'll primarily be watching that. You'll see in a second, I will do, maybe just the first few minutes of talking here, and I'll turn it over to my esteemed colleague to kinda go through the depth and breadth of the conversation today. And I will be looking at those questions, for Marius as we go along. Anything I can't answer in the chat, during, his presentation, we'll we'll catch you at the end unless it's just so timely that I think I need to interrupt him and bring it up. Alright. With that said, I think we'll turn the slides here and see what we have for you. I think next slide will be something like, yeah, our quick agenda. And, for the most part, like I said, me, I'll do bullet one, quick speakers and starburst intro. We'll keep that really, really short. I promise to be well under five minutes. I'm including my minute and a half. I think I've already rambled. And then look. Mary is here to introduce himself in just a minute, run you through some of the topics that we wanted to address today. I will say this. We gave ourselves an hour. We're talking about, performance optimizations. We're gonna cover a number of topics, but there are always more topics. And the topics we cover, there's always gonna be a deeper conversation to be had. So I will put in the chat ways to reach us, posting questions, sending an email to me. Hey, Smriti. And like I said, he'll introduce a big second. So he's gonna run through a number of topics. No need to you already read the bullets while you saw me there. So if we turn the tab, I've already been introduced to myself. I get Lester Martin. Got my q QR code there. If you wanna find me on LinkedIn or anything, I represent our developer relations team or dev advocacy, all that kind of good stuff, you know, things like webinar, blog post, on and so forth. And that's me. I live in Atlanta, Georgia, United States Of America, where it's only lunchtime, noon, and I'll turn it over to my colleague, George, himself as well. Hi, everyone. My name is Marius Gramma. I'm joining from Vienna, Austria, and I'm with engineering at Starburst. I focus mainly on lakehouse table formats, and I am also one of the open source contributors on the tree node query engines. Thank you. Perfect. Yeah. As you turn the slide, what you're hearing is if if you didn't catch that, you're hearing it from the experts today. Lester always has opinions, and I have great ideas and opinions. But, you know, I'd argue I'm a user of this technology. Maris represents the engineering team that's been building this stuff for a long, long time, not just not just the connectors like Iceberg, but the core engine, all that stuff. He's he's a great resource on Tickle Pink that he took some time out for us today. So, hey, Maris. If you want, hit enter, like, two or three times to fill in those columns. I would just say this. Look. Starburst, to keep it super sweet, Starburst is our data platform. What is our data platform built on? It's built on the open source Trino query engine. If you don't know about Trino, go to trino.io. It's a massively parallel processing engine for ludicrous speed, query federation, all kinds of really awesome stuff. Means we can just read data from all kinds of places. And then Starburst as a whole, we won't talk about this stuff today, but has a few more value add features around that as well as performance enhancements and stuff like that. We will definitely be chatting and making sure you know where our events page is. We got a nice webinar next week about differentiation from Trino to Starburst. But, we'll turn the slide and say, what else are we doing here? One more picture, I guess, I would say this, and you can even hit the bump one more time, fill those words in on the right so I can just hammer through this. This just says, look. We think of ourselves as an open hybrid data stack hybrid run on prem, run-in the cloud, run-in both, run-in all. And open meaning, you know, we wanna talk to not just open standards like Iceberg that we'll talk about today, but open to all kinds of technology. So that that stack from the bottom just says separate we have a we believe in separation of storage and compute, obviously. We, wanna look to file formats that are very open. You hear you hear that word open. We'll talk about probably parquet off and on during today's conversation. Above that are concepts like table formats. You don't see listed there Hive, but Apache Hive was that original table format where table format just means how do we track metadata and data so that people can write queries and get answers. Pretty easy stuff. And, hopefully, today, we'll talk about some optimizations. And then Starburst built on Treno sits above that. And that waterline says, this is the data lake, but we also believe 100% that you have data all over the place. You can ten, fifteen data lakes and then lots and lots of other data sources. So we wanna be able to federate Ergo query in a single like a join across multiple datasets in multiple different locations. Last slide, then I'll be quiet. Just says, hey. We run a lot of different we run we have a product, Starburst, where we have some form factors. The main ones are Starburst Enterprise. That's classic software that you license. You install it however you want, wherever you want. Pretty typical Kubernetes, either on prem or in the cloud, or you maybe use one of the marketplaces on the different cloud providers to do all that kind of stuff for you, or use Starburst Galaxy. That's our software as a service out there in the cloud. In under that docs tab, you will see, a link that says Starburst Galaxy. If you haven't used that before, get set up free to use, for now, forever, that kind of stuff. I could ramble more and more about Starburst, Starburst Galaxy, form factors, all that. But I'll say, if you got some questions on that, I'm here. Put them in the chat while Marius turns his side and really gets into what, hopefully, you came here, to talk about today. And right before he says it, I'll say it one more time. Please ask your questions. Put them in the chat. There's no there's no voice opportunity today, unfortunately, for you. But put them there, and I'll be the traffic cop in, for Maris as he goes along. And, with that, I think I'll pass the baton. Hi, everyone. I'll I'll start by making a quick introduction to what Apache Heisberg is. Apache Heisberg and but first of all, Apache Itsberg is a table format. And if table format is a relatively new concept for you, you can think about it as as similar to what the file format is, like Parquet. A file format, the same as file format is tracking the rows or the data within a file. The the table format is tracking the data files within a table, and it comes with with organization and metadata that helps you to quickly find the data within the table. Apache Iceberg is such a table format. It it has been accepted in 2017, and it it it grew out. It it has been created out of frustration of dealing with idiosyncrasies related to Hive tables. And along it currently, there are multiple multiple table formats which are well known. They are Apache Iceberg, which we'll be talking in-depth in this webinar, Delta Lake, and Hudi. More more all of them were created more or less around the same time, and all of them have more or less overlapping concepts. But, obviously, some of them point better than the others in specific use cases. We are going to present now this, as mentioned, Diceberg table format has been created somehow. Actually, all the table formats have been created somehow inspired by limitations related to Hive. Hive was very, very heavy in storing data about tables, metadata like schema partitions, column metrics, table properties, all in the Hive metastore, and this was causing various headaches for the for the data engineers. Iceberg comes with a slightly different approach by storing in the Iceberg in the metastore or which is called also Iceberg catalog because there are multiple implementations of it. It just stores the latest table pointer, which is the location to a metadata file. And the the metadata about how the data is stored in Iceberg is structured hierarchically. We have the manifest list which contain partition summaries. So it summarizes data, mean and max stats for partitions. And then we have the manifest, which contain a collection of files of data file, which contain references towards data files along with column metrics, so min and max stats for those. And you will see in this presentation that having this multilayered data metadata structuring can become very effective for effective filtering for filtering data. We are going forward with speaking about table partitioning. In the world of iceberg in the world of big data iceberg, hidden partitioning is a bit of a magic trick that solves one of the most annoying problems for that existed in Hive. There was a disconnect between how the data is stored and how it's queried. If in Hive, the user had to know which month does timestamp course the timestamp column column corresponds to in Iceberg, you don't have to care about this. And when you say, I want to I want to get my data greater than a specific date, Iceberg is smart enough on on to know how how to make the transformations behind the scenes in order to make performant query filtering. The physical data layout is separated from from the logical partitioning. And in the image here, you can see that the the cut is made at the beginning of the 2009. And from then on, the partitioning is happening by day. And we can showcase here I've inserted I've inserted a data row when the when the filter when the partitioning was done on month, and then I altered the the table part tables partitioning on day and enter a new data row. And we can see when querying the the path to the data files from within the table, we can see that they the this is the file corresponding to the initial partitioning, and this is the file corresponding to the newest partitioning. Yes. So the takeaway here is that the old day old data in the ICEBERT table stays put, and the new partition when the new partition spec, this is a simple metadata operation. It does not involve any data rewriting, and the new data that's inserted in the table just follows the new partitioning rules. We are going forward with filtering techniques. The hidden partitioning is as, as I've mentioned, is this magic trick that Iceberg has on its sleeve that allows allows the capability of not having to create any artificial columns in order to filter efficiently. Then we have the metadata based files keeping. If you remember from the initial image around Iceberg, I was mentioning that both the manifest lists and the manifest files have min max stats. So we have the partition level at the file level stats. And if you deal with query engines in general, you will likely hear about the term predicate pushdown, which basically means that evaluation of the filter is moved towards the data source. In case of a relational database, we are likely going to strive to push the filter within the relational database. Within Iceberg. We are going to strive to to perform partition pruning and files keeping through metadata. We are going to make a small demo right away. And there is also file level additional additionally to the metadata level, there is a predicate pushdown is acting also at the level by doing row group skipping or in case that the Parquet files have also Bloom Bloom data sketch, it can be employed for doing equality checks or inquiries. We are going to touch this as well. Now effective filtering, I'm going to show you here how the how the how the metadata filtering actually acts on a on a a on an iceberg table. Let's assume that we have an IoT table storing events events data, and we are partitioning it by the day in which the event event occurred and also by the bucket. Bucket is a a partition transformation that that takes an a value and puts it in one of the in this case, one of the 64 buckets. And if we look at the query that is being run by a user for retrieving the events that correspond to the device ID 74 done in the time range of tenth of October 1 at 10AM up to October 4 at 8PM. This specific query will be translated internally by the engine into something that Iceberg can understand a bit easier. So it will do the partition transforms under behind the scenes. So device ID 74 corresponds to a device ID bucket one and event the the the date range, the time time stamp range presented in the original filter will correspond to the day being between the October 10 and the October 4 the October 1 and the October 4. And if we if we look within the manifest list, we we will we will look at the partition summaries and try to overlap the lower bound and upper bound of our partitions to see which of them match. And we see for the manifest two Avro that the lower bound of it is October 15, which is after the date range that we are looking for. And, therefore, we can skip it. And if we go within the manifest one, we we have on the right right hand side, the manifest file, we see that we apply again the we we check again the partition values, and we see that two of the two of the data files are matching. But when we we look at their lower and upper bounds ranges, we see that 02 data parquet has the the lowest device ID 186, while 65 data parquet has the lowest device ID 65. And we are looking for the device ID 74. Reason why from all of this from this from this many files, we actually have to filter only within the 565 data parquet. Yes. For the for file level filtering optimizations, we we have two topics, sorted files. Our query engine offers the ability to write files sorted after a certain column, and this this is pretty helpful in case that column will be involved in filters that, you know, usually involved in filters because in this way, we will skip we will potentially skip entire row groups, the Parquet files. And from a writer perspective, if the data is clustered, it will be better compressed. Bloom filters is a technique that is employed in in verifying the existence of a key within within the data file. So we can skip the entire file altogether if the data sketch says I definitely do not have this value inside my data. It's worth of mention. There is a slide. So this is there is a this is a data sketch, and there is also an additional setting for for specifying the probability of heating. So it can be further tweaked, the Bloom filters. One thing to keep in mind is it the the scenarios on which you employ these equalities and in filters. Merge modes are basically table properties, so not necessarily they are they are technically table properties, but they affect very much how the operations retrieving data from the table work. Margin read is the commonly employed technique and is the default mode that we have on our connector. And this one optimizes for faster rides. So in case that there are deletes being performed within the operation, the original data files that that will be created delete files that will need to be reconciled dynamically when reading data from the table. Copy on write has a different approach. It optimizes for faster reads. So whenever doing operations then modify the content of the table, the the the the the data files that con that get content deleted will be rewritten. This obviously can result in faster joins, but the the writes will be obviously slower. The documentation is linked in in the here in the in the documents that are provided with with this webinar. We go forward, and we are speaking about the table partition strategies in Iceberg. The most commonly employed strategy is the time series. We are speaking here about temporal transforms, so day day or month for high volume logs or hours or minutes when dealing with real time streaming data. High cardinality strategy, this is when we want to partition based on on a column that has very, very many a very high number of values. So a good example of it would be user identifier, device identifier. And then we employed the the bucket transform on on it. Identity strategies to be employed when the number of possible values in the column domain are relatively small to medium. So a region, a country, a department, something that does not explode in terms of values. The golden rules for partitioning card that we should aim for one to 10 gigs of per partitions. The partition if the partitions are smaller, feel free to update the level. As as I mentioned initially, the the partition evolution is a metadata operation, so it doesn't cost very much to do. It doesn't cost at all to do. So feel free to to tweak it if you see needed. The recommended number of files per partition is between one hundred and five hundred data files per partition. There is a small caveat or small hint here that this needs to happen after compaction, and we will we will touch very soon what this means. And the general rule of thumb should be that strategy result the partitioning strategy should not result in more than 10 k partitions in order to keep the query planning decent. The small file small file problem and delete file problems are classic performance killers in big data, and Apache Iceberg is also not immune to this anti pattern. It basically means that a table is composed out of thousands or dozens of thousands of tiny files, which we'll touch a bit we'll touch a bit more in detail in one of the following slide. But the consequence of having so many small files is that the query planning and query execution in general is tends to get slower. The delete files is another anti pattern that relates to merge on read strategy. And this can happen when there are frequent updates or deletes, like GDPR requests or CDC updates. And we end up with thousands of small delete files. And if you remember from the previous slide, when dealing with delete files, then the engine needs to reconcile them dynamically when reading data from the table. So if the select starts taking a lot of time, you it's it's quite likely that the engine spends. If the engine starts to get much slower than it used to be a few weeks or months ago, it can be because the engine spends a great deal of effort in figuring out which are the active roles. For both of these anti patterns, the recite we have is to optimize the table, which basically means rewriting parts of the content and in order to have consolidated data files that are have an optimal size and are delete free. We are moving forward, and we are going to slightly touch the topic of caching. And there are various flavors of caching. The most straightforward to understand is when you're in the IDE and you execute the same query multiple times. And that's query result caching where you just get the queries out of cache so the results are stored somewhere. And if the table did not change, you'll get the exact same results as for the query as you've executed it a few minutes ago. File system caching is a technique that is to be employed when you want to spare the store the object storage on retrieving the very same files for, let's say, a hot table that you have. If your users are clearing very small amount of tables, then it's it's worthwhile to investigate setting up a file system cache, and then those files will be retrieved from the local file system instead of going to the object storage. In memory, Iceberg metadata caching is a similar technique, but it does not rely on the file system caching. It relies on in memory, and it it it has been added to to the connector in order to speed up the the query planning. So, again, if you have some tables which are very often employed by your users and you want to speed up the query planning, you don't and do not go every time to the object store for retrieving the metadata, you can set up the metadata cache. Starburst offers also an autonomous performance acceleration layer, and I will let Lester to chime in and present. it. Sure. I I mean, I think it's worth even before I say that, I think it's worthy to get that when we talk about this slide, I always call it, like, your data platforms choices to help you. So we didn't have to do anything to the tables per se. We didn't have to restructure anything. We have to quote, unquote rewrite a query. There are things that we could do just to, you know, enable these things. And, mostly, they also have ways that you don't want that. You know, if you don't want the query results cache and you can do things like set a session property, says turn that stuff off for me because I really, really, really need immediate. What, Marisa was talking about at the end, we call it warp speed here. It's something, that we actually bought through an acquisition a couple years ago and just integrated into our product. And what I really love about this thing is it's really like a columnar store, you know, hyped up, amped up. So just like you think how columnar stores store all these chunks in these, these row groups and these columns, it's doing something very similar saying, hey. As I read that stuff, let's just load it up, put it on some SSDs, and that kind stuff very close to machines. And I love that it's actually not gonna ever let you get bad data because it's keeping track of that data too. So the cache that it knows about it also knows that, hey. Has that data been touched? The moments touch and it validates. So that's the caching part, but the indexing is the piece that we always get in trouble with in the data lakes. We, you know, we we talked about mirrors, talked about partitioning. We get great benefits there. We talked about things like what was the thing about sorting and bloom filters and things like that briefly. This is another place. What if you have a bunch of access patterns? So an autonomous agent that's looking at what's going on intelligently, not only caching what needs to be, but starting to realize how fast and how quickly to find things based on those predicates that maybe aren't the ones that line up with the the partition or the inherent sorting you did and that kind of stuff. So I think I think we're wanting to give you wanting to give you a couple of customers. I decided maybe I'll just not point out to a few customers, but I will put some some links in there about if you'd like to try this out yourself, we got a little lab that will work great on Galaxy a little before and after and really go, wow. That's pretty cool stuff. And I see what you're talking about. So, again, I love as a programmer at heart, these things you can might say, oh, these are boring, but these are the things that really can make make or break some nicer systems by enabling some of these switches and enabling enabling some of these features. So yeah. Thanks. Thanks, Maris. Sure. Thank you. We are moving forward, and we are speaking about maintaining stats or statistics. The way to to collect the statistics is through analyze command, but it's worth mentioning that we we have specific configurations which allow maintaining the stats seamlessly when doing DML operations on the table. Extended the way to see the stats for a table, you it's being done through show stats, or you can do even show stats for a specific query on the table if you want it more granular. Now stats, why are there why are the number of distinct values and row counts important? It's because they have very much a cost optimizer in making better planning decisions. The join reordering is one very common use case where figuring out which is the small table and which is the big table is very, very important in in in getting a small a faster finishing query or a slower finishing query. Also, the join distribution, whether to make a broadcast join, whether we are dealing with a very small table that can be broadcasted to all the workers or whether we're dealing with too big tables that need to be shuffled across the workers, this is something that table stats can help with. The the image of this POFIN is presented here because that's the name of the in iceberg, we the table statistics are stored in POFIN files. And worth it's worth mentioning that it this this specific addition to the Iceberg table format has been done through through Starburst, and we are very proud of it. Recently, there have been added to iSquare also partition statistics, you can leverage those as well. We have added also support for them. So this can also speed up the query planning. With regards to file size considerations, I've touched already the small file anti pattern. Again, this is about dealing with too many very, very small or small files. And the impact on the query engine is that query planning overhead, there's lots of such files. There's lots of manifest list, manifest files, small data files. So the the planning takes a long time. The query engine is bombarding the object store with with request, and this can end up requests to the same location, more or less the same location, the same prefix as free prefix, and this can end up in front link. And the pattern is generally inefficient because most of the time is spent on opening and closing data files instead of processing data. How to how to fix the small file problem and the delete files problem. Again, I I I the the the result is regular table maintenance. What Trina offers is the command x alter table execute optimize. Here is depicted a situation that one of our customers had where they were complaining that the queries were very, very slow. And when we looked at their data file distribution, active data file distribution, we've noticed that they had very, very many small file very small files small files, and only a few of those were towards an optimal size. And after applying the consolidation operation, the optimize command, the number of the number of small very small files has been has dramatically decreased, and most of the the bulk of the files were in the optimal range. I'm presenting also here with regards to optimize. We've created a table with very small files, and this this ended up so with this command, with this session command, you can end up in having very, very small files from your c test statement, create label as select. And we see here that now that there's we have here a query. I can make it available if needed. Can be also obtained through your favorite LLM. It just says how many very small files are there, and we see here that there are 150 files approximately very small. And after applying optimize, so I've modified the target max file to towards a bigger value and applied optimize on the table. And then we see that the number of files has the the very small files have completely disappeared, and the number of small and optimized there are only three files now. We are going back to the presentation. Optimize comes in different flavors. And in case you're wondering why there are so many flavors, I think the answer would be that our customers employ it in various ways in order to catch only the the only the partitions or the files which have been last updated in order to avoid optimizing the whole table. Optimize manifest is another utility method that helps in reorganizing the manifest list. It it it helps in in the situation where the query planning starts to become slow. So in such situation, maybe when dealing with streaming ingestion use case where there are lots of small manifest files or if there's late arriving data that causes the newer manifest overlap with older time ranges. So when back filling some data into the table, this can end up in increased work for query planning. And what what this utility does, it does it consolidates and orders the manifest files, and this can help out the the query planning. I would just say I'll throw a little a little commentary. to side to one more back even before that on the many flavors of Optimize. Lester Lester is a big advocate of, you know, this kind of humpback whale, and I'll just put a chat of what I mean by that. Partition definitions were, hey. You know what? We build a new partition, we fill it up, we maybe tweak it and tune it, but at some point, we stop playing with it. So if you have a good so time series immutable data, it's one of those classics. Things like that are perfect for using targeting. You know, with it can just be a simple partition key, you know, just something that's logically gonna get that partition because we can stop rewriting that data. We don't wanna as our data gets bigger and bigger and bigger, have to constantly be thinking about rewriting. So if you do have a partition strategy, let your data slow down, if not totally just stop changing, you can, at some point, kinda mothball that effort and say, hey. It's already well optimized, but we've been re been, well re recompacted, rebuilt, and that kind of stuff. So thanks. I'll put a comment in this. Hey. And for those out there, lots of great questions are happening in the chat. Keep them coming. You know, everyone else that hasn't chimed in, keep them coming. And I'll Maris, if if I can't. get any, We promise I'll save you a couple of minutes. Alright. Go ahead. Thanks, we're. going be soon through with the presentation, and we are going. to dive into q and a section. It's looking good. Thank you, man. Sure. We we are slow we are presenting very briefly also some others housekeeping operations that the query engine is providing. And these are either keeps every version of your data for the so called time travel feature, which allows you to query the table as it once was at the previous time stamp. Now if you don't want to if you don't expire the whole snapshot, your storage costs will balloon and because the data the deleted data is never physically removed from the storage. So do consider to expire the snapshot whether the seven days, which is the default value, is a good value for you or whether you want a different value. In any case, please do consider applying regularly snapshot expiration. Orphan file cleanup is something that, you know, in an ideal world would never happen. But, yeah, every once in a while, some of the queries crash, and they leave unreferenced data behind. So the metadata file, which was supposed to reference them, was not committed to to the iceberg table, which means that there's no link to those files. So for such use cases, it's worth as well in order to keep the storage costs low is is worthwhile to regularly run the removal of orphan files. Yeah. We are are touching now joint optimizations. And on this section, we will be speaking about dynamic filtering, cost page cost based joint reordering. We will not be touching, but there is a blog post. Again, I'm reminding you that in the docs tab of the webinar, there is the link to the the slides so you can access it. The TLDR for the scope of this webinar is that the build side is the small table. The probe side is generally the big table. And it's very important to have table statistics available when when dealing with joins. Merge on read versus copy on write, we've already touched. They are technically table properties, but dealing with with reconciling dynamically delete files for merge on read can affect your your join heavy operations. Bucket joins, we will touch as well briefly. We are moving forward, and we will be speaking now about dynamic filtering. This is one of the most impactful join optimizations in Trino. And you can think about it like in the image linked here in slide where you join a large fact table, in in this case, the sales, and with smaller dimensions table, the items. And what Trino does, it it tries to to to gather the the item IDs that correspond to the filter, make a range for them, and pass that range very simplistically. It will pass that range to the sales table scan. And, therefore, when the the the sales table scan will then filter specifically on these item IDs or item ID range. And for Iceberg specific benefit, if you remember, we have the min max values at the partition level within the manifest lists and the min max ranges at the file level within the manifest files. So the connector can use these dynamic filters to skip entire data files or partitions that do not contain the matching keys. Dynamic filter technique, again, relies on table stats in order to identify which is the fact that it does the the dimension table in order to try to make out make up a dynamic filter out of it. And it results in huge IO savings. As mentioned, we skip data we skip reading data. So there are much less bytes read from data files or not at all. And, also, in terms of network savings, we are also very good because there is less data being shuffled between the workers for the join. Bucket joins or partition aware joins is a technique that generally in in a distributed join, Trino has to shuffle data across the network in order to ensure that the rows with the same join key from the table a and table b end up on the same worker node. Shuffling is expensive and often is one of the biggest bottleneck in a query. And what's this optimization about? It's if the table a and table b both have both are partitioned on a high cardinality column, like, for example, a user ID, they are using the bucket partition transform, Trino will know that the bucket one of the table a and the bucket one of the table b make are con make an are containing the same possible ID. So there's no need for for shuffling the data. You may see in the explain plan that the exchange operator is skipped in such use cases. Here, I would there is a work of mention that explain plans can seem a bit scary if you see them. But if you use your favorite LLM, it does a very, very good job in translating it in plain English for you. So please take this with you when you do explains for the query for the queries. We are ending our presentation with, I would say, a good to know hint. Insert override in case you you dealt with it or you ever you encountered it. You can think that while standard inserting to simply a to an existing table, insert override performs a two step stance. First, it wipes the existing data for for that party for a specific partition, and it replaces it with the result of your insert. Now you Treno does not support does not offer insert override. However, it this can be employed through merge. The if you if you look up how to do insert override via merge on your favorite LLM, you will very easily find a recipher that. What you want potentially finding your LLM is that branching, iceberg branching, can be employed to speed up. It's it's often speedier, faster than the merge statement. And how does it work? You create a branch for your table. You delete the data from your partition that you want to exchange. You insert your new data, and then you fast forward your dev branch to the main branch. And that's it. Good. This completes our presentation. I got very question on that one, Maris, your attention. because I I haven't tested that myself. If you go back, I've been I've been meaning to. What happens, if you go back again, what happens if, create the branch, you know, simultaneously six more? I mean, soon thereafter, six more sub snapshots are created, then I'm I make my change in my branch and I say bring forward. Is the bring forward gonna do a merge or gonna replace what you bring forward? Because it could be dangerous. you're asking about this statement. Yeah. I can't read it too well, but I think it says if it's the one on the last middle of slide, this oh, yeah. Yeah. Fast forward. Does it just say, hey. Tough luck. This is the snapshot I want, or or does it I I would think the answer is yes, but I'm not sure. The fast forward so branching in you can think of branching as it may be a bit deceiving. Iceberg branching is not the same as git branching. When you fast forward means if I can apply these changes and reconcile with changes that happened in the. meantime, Perfect. then the operation succeeds. It's, it won't. it's kinda like any other concurrent commits. If if six other people made a snapshot, I'm based on six ago as long as the engine goes, okay. I'm not the I'm not based on the most recent one, but if I can rebaseline and I'm okay, which is really good and lot of, like, inserts that okay. So it does do that. Yeah. The engine probably has to do that rebaselining, figure it out, and then decides kinda okay. Cool. Yeah, and And it in practice, it turned out that this was much faster than merge. Yeah. But I. That makes sense. in the technical details just in order to keep things. That makes sense. small. Well, hey. There's a couple of questions I didn't get answered if you oh, I'm sorry. I didn't know if you're ready for them. I can throw a match in, and I'll look at the couple of. last few. One of them I saw, and I gave a quick answer, I'll let you think about it. The golden rules that you primarily were talking about with partitions and maybe file sizes, are those any different for materialized views? My suggestion was the materialized views, in essence, a table, and we ought to follow these same kind of rules because the engines are gonna act upon them. But if you had any other materialized views, perspectives that I wasn't thinking about, folks would love to hear that. For I mean, for materialized views, I I cannot give K. a a golden rule. However, I can share that we are very close to to improve. We we used to keep a bit of unnecessary data into the materialized views. So the old data was still there, and. we are now in process of curating and improving the mechanism in order to keep only the latest data. Cool. But saying the materialized view should not have more than this much data. I I think it it really varies on the use case. I don't know whether there's a Yeah. magic number. I agree with you. And I think if you for some reason, your materialized views creates too many tiny partitions with tiny files, then you messed up. So I think maybe the rules are the same. Hey. Here's one you might know. There was also one from Sebastian about the dynamic filtering, the. fact that it has this kinda give up time out thing. And. he says he's seen he's seen that maybe talk to that about one more why that's a good idea, not a good idea. He was thinking maybe it's not a good idea ever, but what was the thinking there on that one? If you So there is indeed this specific configuration method that how long it takes, what's what's the timeout. You can control it also with the session property. And if this turns out to be very slow, it would if if collecting of the data filter turns out to be very slow, then we will just fall back into the regular join. up on it. Yeah. That's what I was saying too. Maybe that's all that's all I was saying. If if we the whole benefit was to get this nice finite list, but what if it's taking forever to do it anyway? Then likely, that means we'll probably need just about everything anyway. So but but you're right. It's a it's a it's a parameter. Though I'm moving on to the question. So if I'm asking a question, you don't like the answer, put put it back in there. Another one I saw, and I think this one I can answer for us here on the fly and see if you disagree with me, It was about oh, no. No. No. Not that one. Not that one. Someone asked about the all all the table maintenance. You know? Are they running the background? The they're they're being run as a job. They're making changes. And the point that Maris was saying just a few minutes ago is, you know, if nobody else does anything and your your rewrites are working just fine and come back, then everything will be fine. But it'll just catch up. The reality is it probably will be fine unless you change those datasets again. So generally speaking, we don't really need to know, but, you know, that how do you know that that statement finishes? So it is like a query. It's like a SQL statement. So you can go verify that it has run to completion. If you're doing it as a human and pushing the button, you know, wait till it says I'm done and it's done, that kind of thing. Hope that hope that helps, Shreedhar. Alright. What else did I miss in here? Something about 32 processors each and demos with workers. Nick, we might have to table that one. Think, you know, I think, Jamie kinda offered up an answer to this. He's not really aware of any issues. I don't know if you see this one, Marius. I don't not really iceberg performance, but it's about workers with 32 processors each and diminishing returns. I'm gonna I'm gonna skip past find. it. Yeah. It's not too far back, about 10 messages back from Nick Staver. I'll let you take a look at that. I think we're nothing to there. There must plans to. open source Iceberg v three. While you're looking at that, I'll go and answer the other one. We don't have a definitive date on that. I will say there are some PRs in open source Trino too that you were worth looking at that possibly might make it in April or something. But the short answer, Sebastian, is absolutely look for v three, open source solution in Trino one way or the other very soon. There we go. I think that's the best way to say it. Features. That's a quick one. That's right. You know, hey. And I'm stealing these questions, Maris. Thank you. So tell me if I'm wrong. Hey, Jake. You know, iceberg features, like, all this hidden partition, all what's the biggest impact on the query performance cloud storage? To be honest, it's really big about because when all like, Hive would run on HDFS, and that metastore was a database that was painful. But to be honest, it gave us, like, directories, and then we had to go pull those directors to get the files. That worked pretty good in old Hadoop HDFS days because that lived in memory. It was just like a quick directory listing on object stores, though. There's not really a directory, and it gets very chatty. So a lot of the costs are just those gets and looking up and that kind of stuff. So that was a big reason Iceberg and others said, let's just bring as much of that metadata out, stored as a fewer number of, you know, just metadata oriented files, read those quickly. And then in memory, I know what was out there already, you know, based on the mutability, the snapshot, all that good stuff. So I think many of the reasons are that, just limiting you know, working better with an object store and your cost do go down. And it is it is on, you know, less files to go get, but less files to figure out I need. You had a partition with thousands of files and another partition, thousand files, that stuff is in those quick metadata reads early and not having to constantly just go back to s three or s three compliant or something like that. So Jake or Maris, if I missed that, was screaming at us. And if you're looking, Maris, if you see some other stuff that you wanna chime in, just go ahead. I'm just trying to catch up here. With regards. to the question with, I don't have off the top of my head an answer for for the one that you've asked me about, the one with 32 workers and. return. I have for for the question about what you've already answered. Indeed, Iceberg compared to Hive in Hive, you have if you go if we go to the very first slide, if we see here and in Hive, we don't have the data files from within a table link. So we have to list them, we have to get them, and we have to before we process them in Iceberg. However, all the files are linked within these manifest files. So you do not have to do any listing operations. And the as as Lester pointed out, the you just get data. So I it in practice, it has shown it has it was clear that with Iceberg and the cloud object storages, this combination worked much, much better than Hive. Yes. Good stuff there. Hey. We're gonna continue to hang around. We have time. If you had to leave, thank you. If you're leaving, that's fine. We're we're gonna try to answer these questions here, until it dries up or maybe another thirty minutes go by. But the question about orphan files, Marisa gave you a great scenario, which is, you know, what about someone you just expired, but someone has an active running query? That's great. But I would say things go wrong. Things blow up sometimes in around. And someone someone might say, why wait a week? I think that was a question. Well, I would say wait a week because it is kind of exhaustive search to find out, check everything and all those snapshots, and then check everything. that exists, and then reconcile and see if there's any leftovers. The leftovers should be rare. The leftovers shouldn't be killing you on cost of performance of storing them. You wanna clean them up periodically, but let's not clean them up every hour. So so something like a, you know, a week probably is not wrong. Now many people just couple all their maintenance activity in a back to back to back, and that's perfectly fine. It works. And that was a lot of the questions that we did a lot of q and a while Maris was talking about this. It depends. You know, it really depends on maintenance. We we and others offer automated maintenance, but today, I wouldn't go as far as to say those are all super intelligent AI. They're kinda schedule oriented for the most part. Now by letting someone else do it for you, you've given them the chance to be really smart about it. But I would say today, your biggest, largest, most gigantic tables that with frequencies and stuff, it's probably on you to kinda do some deciding how quick, how often, how frequent. And there are people at they are over to a year or two ago talking about expiring snapshots within an hour. You know, they had they had such a volume and such a streaming system that made sense for them. So I'll pause and look a little bit more, but orphan cleanup is an interesting thing. Orphan cleanup. There you go. is an opinion. one agree. one thing to keep in mind. Someone was saying I I do zero seconds. Doing zero seconds has a is a sword with two cuts because, yes, you do cut everything that may be wrong. But if you have an ongoing query which takes a longer time, then you may be removing the way the query engine works, it goes from the bottom to the top. So it creates, first, the data files. It adds them list. It creates a metadata that links it, and then it commits it. So if you are at the stage where you're creating the data files in the query engine, they are not referenced by any metadata metadata file. So they are, theoretically, they are orphaned, and, therefore, you may be corrupting the the table this way. So please do do use, as someone suggested, at least one day time lapse just to be on the safe side. You do not want to corrupt the the content of your iceberg table. Yeah. Iceberg supports column sorting, but also the order optimization. Yes. Indeed, Sebastian. Correct. I support what Starbase offers is the ability to write when you write content, it will sort it by by a specific column. Z order optimization is something that definitely exists, and we have been looking into. The practicality of it has not been big enough so that we invest the resources to do it because it involves rewriting the whole partition in order to be z order compliant. So this for for Starboard, this is not offered at at at the moment. I like it when I type something and Mary says it, it's the same thing. I was a little more rough. I kinda said, we don't we can't or don't write in z order, but but, clearly, if you had another engine rewrite those, Mark, we're gonna read them because we don't know there's z order. They're just files of metadata, like a snapshot, so it doesn't really matter. So, yes, if you do have a scenario and I and if you just go out to to Gemini when you say what is z order in Lester Martin's YouTube video is gonna come up and give you a visual for those that don't know what z order in. So maybe I'll do some shameless self promoting. Hey. Shreedhar reasked the question. I'll try to answer it a little more slowly. Shreedhar, he said, hey. When I run those alter table, you know, optimize. that stuff, He's like, when do you know it's done? It's not run everything's running. There's not really a background. There are active jobs and completed jobs. So just like you run a query and it might take a second or a minute or five minutes, when that query is finished, you get an answer. The same thing's gonna happen with the with the optimize. When you run, it's not gonna come back until it's done. It's not necessarily gonna just say finish it and unless I missed that, Maris. I'm pretty sure you'll see it. Worst case, you just go look at your query in our query analyzer, and you'll see if it's running or if it's not done. But I thought that block and hold, Miris. So is that not true? It's not gonna just back and say I'm I'm done. Are we are we speaking about two. Well, line, two, wait. It won't return until it's done. Line two, yes, alter table. it will not return until it's done. There you, go. we return to the this initial. image. What will happen is that from this myriad of files. that you have here, it will consolidate them, and, it will create new manifest, manifest lists, and. metadata. And only when this thing is being committed to the Iceberg catalog, only then the operation completes. So Yep. that's worth to keep in mind. So so, Shreedhar, I will say, yes. We're gonna we're not blocking and holding that command and that window is waiting. But, arguably, like I said, there's not really a background. There just may be many concurrent jobs, multiple concurrent jobs. So someone else is running a query on that table. Someone else is running an insert on that table. Heaven forbid, someone else is running a merge on that table, and we're relying on the spec and the engines that implement that spec, how they tackle concurrency. And we said it earlier, which is really just about, hey. If it's based on the the previous, snapshot, great. You win. If not, it just goes back and tries they call it kinda rebaseline. It says, hey. Can I look and see if my changes would be okay? And that usually means an easy answer that is a whole bunch of other people that insert sent you, it'll work just fine. It's just the engine has to kinda go back and re kinda catch himself up and go, okay. Let me set it again. Is it current? Yep. Write it. And it'll do that a few times and give up after a little while. So, hopefully, Shreedhar, that answers the question, what is running and how this running and all that good stuff. There's not necessarily a little meter on a table to tell you that. You'd have to know who's running it. So if you're running the jobs, the statements manually in your data pipeline or by hand, it's on you to know that it's running. See it if you're using an automated, framework or engine like our we have automated table maintenance in Galaxy at least. You know, you kinda have to look at that UI and see what it says or say, hey. I trust it that it's taken care of for me. Try you if you wanna know, it's just a previous statement that's run. You can look at any kind of query the query history server and, see all that good stuff. Alright. I think we're pretty close to call. them. I'm gonna look at it. I'm looking at it again, make sure that what else we didn't miss. And I think Optimize is just optional data compaction, unless I may say. So, Jamie, I think optimize is an optional data compaction. Yeah. That's its primary business. I mean, it came around even before we had iceberg. It was around a compact we had these concepts in Hive too. Had go back sometimes and rewrite things that maybe showed up in smaller sizes. So optimize for Iceberg is is in essence, it's a rewrite. They go back and look, and it primarily unless you teased it to do something different, it primarily looks at file sizes and says, hey. There's a threshold. Anyone that's below that threshold is a good candidate, you know, to bring up and rewrite. But it also does those delete files. It actually reads it because they fall into the small category. And he goes, okay. And we didn't go into those semantics, but the delete is really a or an update is really a delete and a rewrite. You know? So it does create a those in many scenarios, if you do lots and lots of updates, those are the ones, as you see in the bullets, they're gonna create a lot of very small files if you do very surgical strikes. But it it depends. Likely, for most people, it's gonna be the how frequently we ingest, how much data there was in that frequency that you usually gonna drive the small files for most people. Alright. Lots of great questions. I'm gonna keep working backwards. Sebastian, on the v three, the variant types today today in Starburst, we actually while writing to Parquet files, the iceberg, it's truly a variant to us. We're tackling with a strategy that 99% of the time people today are talking about storing JSON in there. So my understanding variants can be really anything, like an m p four movie or an m p three song or something, but majority of people are using JSON. So we tackle it today. We're talking about what do you do for true, true, true variant, but we're saying, hey. We already have semantics for our JSON data type. So we're saying when you create a table, you create as JSON, it'll be the variant. So the gotcha that I see if someone in Spark or some other magic world creates a variant column and puts an m p four in there or something, we probably wouldn't know what to do with that, but we're working on, working on that. So we tackle as a JSON today, and then we have lots of options and features. We don't do any kind of shredding, all that, you know, variant shredding or index shredding and stuff. We have some open PRs on ourselves to work on those kinds of things, but that's I don't think anyone's doing it. That's very, very logical. I mean, very theoretical stuff that people talk about. Yes. I think this is still still something we are debating, so to say. So I think in the next few weeks or with maximum months, we will settle them. So I think that by the end of this quarter, we should have a clear a clear on how to deal with variants. Yeah. We already have one released, but we are still debating about the shredding part and about all the other. details. Right. A 100%. Hey. I'm gonna rattle through a few more quick ones. I did put a comment right now. It's a link to our forum site. Our q and a site. There is the Trino Slack, and I'm not discouraging you from going there. But we have a Starburst for them, and it's a good very easy to find things. They do they'll be durable. They'll be around for others, so feel free to yes. It's the the management yes. It's not a map column type. It is a JSON. And and I've got some, I'll put a chat I'll put a message in here if you wanna play with our v three. I got a lab that works. Walk I you through every single step. So I'll put that in there in just a moment, if you're looking for that. Sebastian, hang on just a second here. The manage iceberg tables, yes, you would not have to do maintenance. The manage iceberg pipelines, as they're called, manage iceberg pipelines comes from Kafka, comes from s three. Those are automatically being automatically maintained. So that's there. And then we have a separate thing, which is the managed iceberg pipelines uses. If you have your own iceberg table that you create yourself, not technically an external iceberg table, but an iceberg table, you absolutely could turn on our we can say this table or this schema or this whole catalog, whatever. You know, you target it in what you want. You target which options and frequency, but it's, it's primarily kind of a yeah. That yeah. Yes. So anything you let us do for you, then you can stop. And in fact, I would encourage you not to run maintenance on top of something that's all someone else is already running. So because there was that map map one, Optimization, we talked about that one. And if you see any of that I'm missing there, various, please jump in. I think, keep a look at over partitioning, that was my comment other day. I I 100% agree. That's something I've I've been doing this for twelve, thirteen, fourteen years since Hive and Hadoop. And everyone wants to use the cool toys, but the cool toys are often for the biggest of tables. Everyone that wants to do z order, but to Marius's point, the size of the data and the use case might not really fan out, but for a few people and a few scenarios. So we wanna use the tools, but sometimes something simple is always easier. Got that one. Got that one. There was a long query, I guess, you know, that's from Marius. Can you share the SQL that you used to get yeah. It's all in that chat, but if you come back and ping us DevRel at I can I I actually, you did. I see it. I see it. Thank. you. Thank yeah. Thank you. Thank you for that. There's metadata tables to tell you about the files, the files of, you know, that that metadata. You see it all. Okay. I think I'm a go back to the bottom. Think so, Jamie, you said that I don't think you need to compute stats with iceberg tables like you do with the HiveTable. Yes and no, Jamie. The reality is there's a lot going on here. The stats are still useful for the engines like Trito to make some earlier decisions too. Like, hey. Am I even gonna think about doing things like like, you know, join reorder, know, that kind of stuff. So they're still important, and then there's this fine line of the just automatically since arguably at right time, these engines know a lot of details about the metadata, but it depends. You can turn it just say, hey. Keep it current all by yourself, or you can find a kind of a sweet spot of, I'll I'll turn that off and then maybe periodically come back and update those stats. So some things, you're right. There's so much information in the metadata tables themselves, metadata files themselves that provide you that, but I would not say, originally, that was my thinking that we don't need stats at all. I think Trina and Starburst still want you to have some stats that are maintained at at a little higher level. So I encourage you to do that, but Marius is closer to it than I, he wants to say something. Alright. So I got z order and my v three post. I'm gonna put those in there right now. I'm gonna be quiet. I did take I did already share them. Oh, good. Thank you. Nice. Sure. Alright. Last plug for me. Oh, yeah. Good stuff. My last plug is, again, come back and, feel free to put us some stuff right there in our forum site. This is general questions. This is a scenario. This is a I got a user group. Anything and everything, we'd love to see you there. I'm trying to drive some traction there. The Trino Slack gets lots of traction. I I love our little forum here. Keeps track of things, all the gamification, all that silly fun stuff. So, you know, if you don't wanna do it, that's okay. But I encourage you, and I promise you that I stare at every one of those. So if something comes in, I'm gonna try my best to if I can't answer, I'm a find somebody. I'm a try my very best to get an answer. With that, we're gonna say, Maris, any final thoughts before we do a wave bye bye? Thank you very much for the engagement, and thank you for the opportunity, Esther. Thank you for the time, man. I appreciate it. And most probably the thanks go back to everyone that was here, and this was a great set of questions. We appreciate them so much more than you know. We're gonna go back and read them a few times and make sure we addressed everything. We're not missing something, work on things that we could do better to educate and teach folks. You got any questions at all? One last thing, debrel@starburst.io will find me and a few others. But there's just if you run out of places to ask a question, you got my permission to send an email right there. And, we thank y'all. So appreciate everybody. Y'all have a beautiful whatever part of the day it is, and, we'll see you very, very soon. Bye bye now. Thank you. Bye bye.