Video: Unlocking Data Ingestion into Apache Iceberg | Duration: 3360s | Summary: Unlocking Data Ingestion into Apache Iceberg | Chapters: Starburst Platform Introduction (0s), Managed Ingestion Overview (504.508s), New Feature Announcements (947.998s), Managed Ingestion Walkthrough (1040.878s), Starburst Performance Advantages (1952.763s), Upcoming Events Overview (2132.538s), Streaming Data Handling (2359.843s), Data Transformation Pipeline (2953.353s), Addressing User Concerns (3139.678s), JSON Array Processing (3194.363s), Concluding Remarks (3301.948s)
Transcript for "Unlocking Data Ingestion into Apache Iceberg":
So, again, my name is Lester Martin. It's about 12:01 here on the East Coast of the US, and I guess I'll go ahead and get us started. Today's webinar is really about our managed Iceberg ingestion platform, which has a lot of great features. We had one of these a couple months ago, Ahmed and myself, and this one's really kind of a follow-up. We'll present a lot of the same information, but we're absolutely gonna show and demo. We have Grant, you can see over there, gonna do our demos today. We're gonna get to a bunch of new features, so I'm gonna let you know about them here in just a few moments, and then I will, excuse me, I will step in there. I clearly don't know how to share the slides, but there we go. There's the folks that you're seeing on the screen: names, titles, that kind of good stuff. For me, I put an email address, devrel@starburst.io. A message there will find me. If you just don't know who to talk to, you can't find the right person, you know, maybe I don't have the answer, maybe I do. But if I don't, I will do my best to help you navigate through Starburst and get somebody to help you out if it can't be me. And, again, Ahmed Niyaz is gonna step in here in just a minute and introduce himself. I'll just go ahead and do my bit, and then we'll let Grant come and do the hard part, the good stuff, which is showing us the demonstration. So if you see a quick agenda, there we go. I'll do the Starburst introduction, do a little bit of Iceberg with Starburst, Iceberg with Trino kind of stuff, and then we'll jump right into it. So just a few minutes of that, I promise. I'll keep it down to a low roar, get there quickly. And we'll let Ahmed Niyaz, the product manager here, really give you a good vibe, a good understanding of our offering here, which will include right in the middle.
As you see, there are some really interesting demos that Grant will take care of. And we'll come back after that and, you know, tie it off with some performance numbers, some road map items. I'll probably just mention there are other events in the future, and then we're gonna open it up for Q&A at that point. Now, there is no reason that you shouldn't and couldn't use our chat as we're talking to offer questions. I see folks shouting out where they're from. Thank you all for that. I'm from Atlanta, Georgia, here in the US. And like I said, feel free to use that chat window for questions. We'll be tackling those as we can during the session, and then I'll try to kinda shepherd them all at the very end in case we might have missed something. So if you get to the end and you don't hear us asking about your earlier question, that was my fault. So, you know, chime in again, and we'll get there. Alright. So, the obligatory Starburst information, all that good stuff. Well, I'm gonna throw on the screen a very big slide, and I'm gonna stay on it very briefly. And the point I really wanna make is that Starburst as a company, as a platform play, is a lot of things. So we're gonna talk about some subsets of that. We're not gonna go into all the really awesome things that we do with AI like everyone else does. We definitely have an AI place and story. We definitely have performance enhancements, improvements, accelerators. In fact, the features we're gonna talk about are, you know, unique to us. They're not Trino. And then there we go in that purple. You see the gold Trino, Commander Bun Bun, if you're familiar with the mascot for Trino. That is kind of core to where Starburst has come from. It's kind of core to our engine itself. That's our core query engine and the things that it can talk to.
So I just wanna take a moment while I have it on screen and say, we have a lot of things, but the thing that we're gonna hone in on is that Trino area that I've mentioned as a core compute engine, etcetera. And then we're gonna bolt on top of that with some ingestion. So, the normal stuff that you think about: separate storage and compute, lots of file formats, lots of table formats, and then that query engine, Starburst on top of Trino, to access all that, with the fact that there are other things than just data lakes. There are relational systems. There are applications out there. There are messaging platforms. We feel strongly that if you avoid that information, you miss the mark. So we support connectivity to all those things, a single point of access. And then, not just direct one-to-one access, we allow federation across a variety of sources. You'll join your data lake stuff in S3 with your MongoDB tables, with your Snowflake tables, with Salesforce objects, blah blah blah. Good stuff. Good stuff. Alright. Last big slide on Starburst is that, today at least, we have our product line kinda split in two, but they're very much the same. There's Starburst that you just run where you need to run it, or the way you wanna run it, and then an opinionated version of Starburst. And we'll probably do our demos, I think, out of Starburst Galaxy today. And if not, we'll let Grant or someone make a little note of that. And for the most part, think of them as the same, just deployment choices, definitely from a road map perspective. One day in the near future, we'll just say Starburst, and that may mean you choose to have a couple on prem, a couple in the cloud, a couple in that other cloud, and you see them all from the one plane. We're not there today, but think of these solutions as the same. Install it yourself or let us run it for you. Alright. Starburst and Iceberg.
If you're familiar with Iceberg, and I hope you are already: modern table format, the one that's kinda taken the industry by storm. I guess this slide is about trying to tell you what we do with, and have invested in, Iceberg. I would absolutely say that upper corner mentions, you know, Trino was the engine. I like to say that Trino was one of the primary engines, if not the primary engine, over at Netflix when those folks back then were building Iceberg, and we have just been on that journey from the get-go. Absolutely great place to run it. As you work around from the lower left, we're gonna do our darndest to stay on top of where Iceberg evolves to. So the most recent revision was version three, and we were, I think, arguably the first vendor to come out with v3. So we often say one of the first, but I think it's pretty safe to say the first: version three specs, that kind of good stuff. On the bottom left, we're gonna hear a little bit about this today in the context of our ingestion pipelines. But for your non-ingestion-pipeline tables, absolutely, we're gonna support all those table maintenance features that you can run manually. You can run them via your scheduler or your workflow system, or you could use our automated tooling to do that. So you might hear a little bit more about that, but I wanna make sure you get that even outside of these live kind of tables you're gonna hear a lot about today, we do have a managed, automated Iceberg table maintenance solution. Absolutely. Our big plug is performance, performance, performance. As you see in the top corner, Iceberg is no different. And as you work your way around, I'm gonna hit the very bottom and say, absolutely, we're gonna do things with AI. What we can do with AI in Iceberg is we can store things like vector embeddings, if we choose to, as just another column on an Iceberg table.
So we call that Lakeside AI: bringing in all those vectorizations, and then you have the vectors, all those embeddings, and you have all the same access, governance, security, all those things at play. And then last, in the middle on the right, it just says, hey, the reality is, if you know about Iceberg, you know that part of this is not just the engines, not just the locations, but this notion of a catalog, this metastore kind of solution. And, absolutely, we provide you tons of options in that space because there are different options out there. In fact, I think I'm ending my conversation, and I'll end it on that word, options. Just think of us as the optionality engine, the optionality company. Anything and everything we do has that word in mind, and it's a lot bigger than flexibility. It's a lot bigger than configurability. It's saying you have a bunch of options, and we can make it work in your space. With that said, I think I might shut up and turn it over to Ahmed. Thanks, Ahmed.
Thank you, Lester. That was awesome. Hi, everybody. Great to be with you all today. My name is Ahmed Niyaz. I am the product lead for Iceberg lakehouse features at Starburst. I am excited to share details today about our managed ingestion feature, which is specifically focused on Iceberg. So Lester has eloquently shared how Starburst is the platform of choice to query Iceberg tables, especially in hybrid and federated use cases. Right? However, we have been hearing from prospects and customers about their overall experience with Iceberg, and a significant pain point in the Iceberg adoption journey is data ingestion into Iceberg tables, otherwise known as Iceberg lake hydration. And as we talked to customers about their specific pain points, here in this slide are some of the main themes that came up. Right? Number one was support for different types of ingestion based on the use case.
The two main use cases that continuously come up are batch file ingestion for higher latency use cases and Kafka streaming ingestion for near real time or lower latency use cases. And prospects and customers were struggling to find a platform or a tool set that is focused specifically on Iceberg, but also allows you to build different types of ingestion pipelines based on your workload requirement. Right? The second pain point was around Iceberg table maintenance. And folks on this call who are Iceberg users know the rude awakening they got around poor table health and poor query performance related to Iceberg table maintenance, especially when it came to high throughput Kafka streaming use cases. Right? And this includes things like data compaction, snapshot expiration, orphan file removal, manifest file rewriting, also optimizing for delete files, like position deletes and equality deletes, etcetera, like the whole shebang. Right? And so our customers were sharing feedback that managing this operational burden of table maintenance efficiently is a significant pain point because it's not letting them realize the full benefits of Iceberg. Right? The next pain point was related to all of the Iceberg configurations needed to make sure that the Iceberg table is kept optimized and the manual effort to manage these configs. So these are things like schema management of raw data into Iceberg data types, Iceberg partitioning specification, sorting specifications, standard data retention of tables, etcetera. Right? So that was a significant pain point as well. And finally, the higher level pain point that decision makers were looking at was the total cost of ownership of an end-to-end Iceberg ingestion pipeline. And also baked into that cost was quality and reliability. So sometimes an entire engineering team is allocated just for managing pipelines when they could be focused on much higher business value initiatives. Right?
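To make the maintenance burden described above a little more concrete, here is a minimal Python sketch that only builds the statements someone might schedule by hand for one table. It assumes the open source Trino Iceberg connector's `ALTER TABLE ... EXECUTE` procedures (`optimize`, `expire_snapshots`, `remove_orphan_files`); the helper name, the table name, and the retention value are hypothetical, and this is not the serverless maintenance implementation discussed in the session.

```python
# Hypothetical helper: returns maintenance SQL as plain strings.
# Nothing here connects to a cluster; table name and retention are assumptions.

def maintenance_statements(table: str, retention: str = "7d") -> list[str]:
    return [
        # Compaction: rewrite many small data files into fewer, larger ones.
        f"ALTER TABLE {table} EXECUTE optimize",
        # Snapshot expiration: drop snapshots older than the retention window.
        f"ALTER TABLE {table} EXECUTE "
        f"expire_snapshots(retention_threshold => '{retention}')",
        # Orphan file removal: delete files no longer referenced by the table.
        f"ALTER TABLE {table} EXECUTE "
        f"remove_orphan_files(retention_threshold => '{retention}')",
    ]


for statement in maintenance_statements("lake.sales.orders"):
    print(statement)
```

Multiply this by every table and every schedule, and the orchestration cost the speaker mentions becomes easier to picture.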
The cost also includes tooling and platform costs, whether that's Kafka costs, pipeline costs, maintenance costs, etcetera. From a reliability standpoint, it's critical to have good observability to get near real time metrics and also error handling of your pipelines to make sure your production workloads are running smoothly. So when it comes to reliability, users were also thinking about specific guarantees such as exactly-once Kafka message processing. And depending on the tool that you use, exactly-once processing is not guaranteed. Many systems guarantee at-least-once processing, which would require you to do post-ingestion reprocessing to remove duplicates. Right? Okay. More operational overhead and a lower quality pipeline. So all of these pain points resulted in friction in Iceberg adoption within organizations, and for organizations that went with the do-it-yourself kind of approach, a pretty hefty operational burden. Right? And this image here is a good illustration of a common Kafka to Iceberg ingestion pipeline. Typically, this involved a Kafka message being routed through Kafka Connect or Flink SQL into S3 buckets as JSON files. And then these JSON files are written into Iceberg tables in Parquet format using an engine like Spark. Right? Again, this requires significant orchestration, and we haven't even included the table maintenance and the observability aspects in this image. So at Starburst, we have been spending the past eighteen months looking at solving for these pain points. And in late 2025, we moved our managed ingest feature from experimentation to general availability. The diagram here is an illustration of a managed ingestion pipeline built via the Starburst Galaxy data ingest tool. The experience here is pretty straightforward: you connect to a source and configure a destination.
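The extra work that at-least-once delivery pushes downstream can be sketched in a few lines of Python. This is a minimal illustration, assuming each record carries its Kafka topic, partition, and offset; the `KafkaRecord` type and the sample values are hypothetical, and a real pipeline would do this at table scale rather than in memory.

```python
from typing import Iterable, Iterator, NamedTuple


class KafkaRecord(NamedTuple):
    topic: str
    partition: int
    offset: int
    value: bytes


def deduplicate(records: Iterable[KafkaRecord]) -> Iterator[KafkaRecord]:
    """Yield each (topic, partition, offset) once, keeping the first copy seen."""
    seen = set()
    for record in records:
        key = (record.topic, record.partition, record.offset)
        if key not in seen:
            seen.add(key)
            yield record


delivered = [
    KafkaRecord("orders", 0, 41, b"a"),
    KafkaRecord("orders", 0, 42, b"b"),
    KafkaRecord("orders", 0, 42, b"b"),  # redelivered after a retry
    KafkaRecord("orders", 1, 42, b"c"),  # same offset, different partition
]
unique = list(deduplicate(delivered))
print(len(unique))  # 3
```

With an exactly-once guarantee, this cleanup pass is not needed, which is the operational saving being described.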
We allow connection to various Kafka sources for near real time streaming and connection to an S3 bucket for batch file ingestion. So these would be the sources. Right? So these raw data files are initially landed in a raw Iceberg table. Think about this as your bronze layer, so to speak, and then automatically transformed into schematized Iceberg tables, which we call transform tables. Think about this as your silver or gold layer depending on your need for post processing. Right? So we will obviously go through a demo here shortly, but I just wanted to give a mental model of how this works within our platform. So the core fundamentals here are focused on a specific set of benefits to solve for the pain points that we had talked about before. Right? This entire system is run on serverless elastic compute that Starburst scales up and down as needed. Users pay only for the compute they use, and there is zero infrastructure that a customer has to manage. These are live pipelines requiring zero orchestration. You set this up initially. It runs automatically. The entire pipeline is built specifically for the Iceberg table format. So this includes all of the required Iceberg capabilities to reduce the operational burden and the overhead of Iceberg adoption, things like table maintenance. And we actually have a serverless table maintenance implementation here, a proprietary implementation that we built to run on serverless compute, specifically focused on high throughput and high volume streaming ingestion use cases. So that will be for all Iceberg table maintenance operations: compaction, orphan file removal, snapshot expiration, manifest file rewriting, etcetera. Right? It also includes a user interface to manage the schema of your Iceberg tables. Starburst infers the data types automatically first, and users have the ability to change them as needed.
It also includes the ability to specify Iceberg partitioning and sorting for each table, where there are defaults applied, and users can override as needed as well. And finally, under the Iceberg native umbrella, there is a built-in time travel feature via the UI where you can not only revert back to a previous version of your Iceberg table, you can also re-ingest the data from the raw layer or the bronze layer in case of data corruption or errors such as schema drift. Right? So, also, this entire system is fully API supported and is integrated into the OpenTelemetry standard of metrics, where we currently support integration with AWS CloudWatch for real time observability. And last but not least, it comes with reliability guarantees like exactly-once processing for Kafka streaming. And users can also take advantage of all of the Starburst native governance features such as RBAC. The adoption in 2025 was amazing. Thank you for all of the great feedback and the amazing adoption response for this feature from our customers. In 2025, we saw over 2 trillion records ingested for production workloads, resulting in over 10 petabytes of raw data ingested with an average compression of 90% for stored data. We are continuing to expand the different use cases that we can support with this feature, and we are really excited to announce today the release of three new features: CSV file ingestion, including other text delimited file types, and Avro support for streaming ingestion, which comes in tandem with the release of schema registry integration as well. And without further ado, I will turn it over to Grant for a live demo of these features to see it in action. Hey, Grant.
Hey there. Thank you. Yeah, let me know if you all can see my screen.
Not yet, but I think you're still working on it.
Yeah. Let me try one thing. Alrighty. Looks good now?
Yep. Perfect. We see it.
Okay. Yeah. So hi, everyone. I'm Grant.
I'm an engineer on the ingest team, and I'm gonna walk through today how to use managed ingestion for two different sources, so for both file ingest and streaming ingest. So let's get into it. So this is the data ingest page in Starburst Galaxy. There are two main concepts that we wanna talk about, and those are sources and tables. So sources are where you can ingest data from. And today, as we talked about, that is streaming systems like Kafka, so there's different flavors of that, Confluent or open source Kafka, MSK, and then object storage or file based systems like S3. And so that's where you can ingest data from, and where you can ingest data to is Iceberg tables. So Starburst is very much pushing Iceberg as the data lake table format to use. And so whenever you ingest data through managed ingest, it will go into an Iceberg table. And we call them live tables because they're updated, you know, in semi real time. So let's walk through a file ingest use case first, since I think everyone here probably uses object storage. And so you can configure a new source. You give it a name. You give it, you know, some bucket that you wanna ingest from, and then maybe you only want files under some prefix, and then you give it some credentials. So here, you know, it's standard AWS: use an access key or a cross-account role. This is how you would do it. You click test connection, and it would work. But I already have one preconfigured that we can use today. So once we have our source created, we can create a live table. And, great. So you immediately see there's this concept of raw table versus transform table, and you might wonder what that is. I'm sure people have heard of ETL, or extract, transform, load. We use a pattern called ELT: extract, load, and then transform.
And so the benefit of that is you store all the raw data untransformed in one Iceberg table, and then you can create a transform table off of your raw table and do your transforms on it. So this kind of two-table model is beneficial because you can do things like reprocess your data. Your data is immutable over time. So it gives you lots of benefits if you need to make corrections to data or things like that. So let's create the raw table over this file source. We can choose our subdirectory here, and I know it's CSV slash Titanic. So what is this data? It is some Titanic CSV data that I grabbed from Kaggle. So it's just some data about, you know, people who were on the Titanic and how old they were and what class they were, things like that. Great. So we wanna ingest this data. We need to choose where we're going to ingest the data into. So imagine the source is a source directory, a source bucket we wanna load data from, whereas the target is where we wanna write the data to. So those could be two completely separate S3 buckets. We also need to choose a schema. This is like a database schema that we wanna load data into, and then give it a table name. So, you see here, we support JSON and now CSV. This is new. I know this is a CSV file since I just showed it to you. There's one header line. We can configure... oops, misclicked. And then you can configure delimiter characters, quote characters. It's very common to use CSV with a quote character, so I'm gonna keep that default. And I guess the other interesting bit here is you see this polling frequency. So file ingest is not as real time as streaming ingest. The lowest polling interval we support currently is thirty minutes. You know, if you wanna go lower than thirty minutes and you wanna go to a couple minutes, we'd suggest that you use streaming ingest. So we can now test this connection. Okay.
It says it was successful, so we can save the raw table. And you notice we didn't actually have to do any schematization of the data, and that's because it's storing that exact raw copy of the data in our data lake. So let's do titanic transform now. And now we're gonna do validate again. This is creating a second table, the transform table. This one's actually schematizing the data. So it's gonna break out all the CSV data into separate columns, into a normal tabular Iceberg table that you would expect. You know, for CSV, we only support VARCHAR, or string, types. If you were to use JSON, JSON has richer type support. So based on the file format, you can have different type support. So in CSV, really, the only thing you can configure is, like, the name. You see, if I just wanted to change this passenger ID column to just passenger, there's a table sample; that's what it would look like. Or if I didn't want one of these columns, for example, I don't care about the passenger's class on the Titanic, I could delete it. So let's create our transform table now. And so now we should be able to see our raw and transform tables showing up in this view. And instead of waiting the potential thirty minutes to ingest all the data, I preloaded some so we can make this demo quick, and I will show you what that looks like. So let's first query our raw table. So what you'll notice about the raw table is there is a row per line in the file. So the source file row position, you know, that's the row position. And then the raw file contents here, this is the string where you can see the actual fields in the CSV file are not parsed. It's just the raw string. And then there's also some metadata about the source file path and things like that. Now let's look at the transform table. This is the one you really care about most.
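The raw table and transform table described above can be mimicked with Python's standard `csv` module. This is only a sketch of the two-table idea, not the product's implementation: the file path, the sample rows, the dictionary keys, and the zero-based row position are all assumptions, and splitting on lines ignores quoted fields that contain newlines.

```python
import csv
import io

SOURCE_PATH = "csv/titanic/sample.csv"  # hypothetical object key
FILE_TEXT = (
    "PassengerId,Pclass,Name\n"
    '1,3,"Braund, Mr. Owen"\n'
    '2,1,"Cumings, Mrs. John"\n'
)


def to_raw_rows(path, text, header_lines=1):
    """Raw layer: one row per data line, contents kept as an unparsed string."""
    lines = text.splitlines()[header_lines:]
    return [
        {"source_file_path": path, "row_position": i, "raw_contents": line}
        for i, line in enumerate(lines)
    ]


def to_transform_rows(raw_rows, columns, rename=None, drop=()):
    """Transform layer: parse each raw string into named string columns."""
    rename = rename or {}
    parsed = []
    for raw in raw_rows:
        reader = csv.reader(
            io.StringIO(raw["raw_contents"]), delimiter=",", quotechar='"'
        )
        values = next(reader)
        # Every value stays a string, mirroring the VARCHAR-only CSV mapping.
        parsed.append(
            {rename.get(c, c): v for c, v in zip(columns, values) if c not in drop}
        )
    return parsed


raw = to_raw_rows(SOURCE_PATH, FILE_TEXT)
rows = to_transform_rows(
    raw,
    ["PassengerId", "Pclass", "Name"],
    rename={"PassengerId": "passenger"},  # the rename shown in the demo
    drop={"Pclass"},                      # the column removed in the demo
)
print(raw[0]["raw_contents"])
print(rows[0])
```

Because the raw rows are kept untouched, the transform step can be rerun with a different column mapping, which is the reprocessing benefit of the ELT pattern.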
It is the transformed tabular data that's more efficient to query. And you see now each line is converted into an Iceberg record with properly parsed fields. And there's some system metadata as well if you need to correlate it back to the raw table. So, yeah, that's how you'd ingest CSV data. We also support JSON, but I didn't show that today. And you need to do this kind of manual schema mapping from your CSV data to the Iceberg table. Great. So let's go over streaming ingest now. So we talked about streaming ingest. A use case would be you want lower latency data. So let's create a Kafka source, just like we created a file source. You have to put your broker information in. Oops. I put a space in. I'm not supposed to do that. So your broker information will look something like this. It's a broker with some port. Again, depending on if you're using Confluent or Redpanda or one of the other vendors we support, it might look a little different. And then, the different auth types. So we support cross-account roles for MSK and then also some different SASL auth types for open source Kafka. Again, I already have one preconfigured, so let's go with that. This is configured to our Confluent demo cluster, and it's kind of a similar pattern here as file ingest. Right? There's still that raw table and that transform table. And so let me show you what the data in the topic we want to ingest from looks like. So we are choosing a topic to ingest from Kafka. This one is sample data orders Avro. So I will go to... let me close out of those. I don't need them anymore. This is what the sample data orders Avro topic looks like. It has about a million records in it. They're in the Avro format. And each record is just some fake data that's generated with some, you know, order ID and time and some information about the order.
What's interesting about Avro, unlike CSV or raw JSON, is that Avro works with a thing called a schema registry, and the schema registry defines a schema for your data when you publish the data. So the benefit of that is, unlike with JSON, where, you know, maybe a producer publishes a JSON record... oops, it looks like it refreshed the page... where maybe the JSON record's published and a key is published as a string instead of an int, and then your downstream pipeline expects a different data type, now it fails, and you need to do this kind of reprocessing. The benefit of the schema registry is you define the schema once up front. When you publish your data, you cannot publish invalid data because the schema registry protects you from publishing invalid data. And you'll see in a little bit how we use that schema registry to not only protect you from invalid data but also auto-evolve your schema for you over time. So now let's go and ingest this data. Again, we choose a catalog to ingest into. And just like file ingest, we have to give it a name. There's some configuration. I'm gonna skip over a lot of this. We just wanna choose the defaults. And then, starting from the earliest messages is probably what we care about, so we will ingest all the million records. Much like before, we save our raw table, and then we go in and create the transform table. And you notice here, when we choose Avro, we have to connect the schema registry, and that's because we are going to automatically determine the schema for you from the schema registry definition. So you don't need to provide anything. That is kind of a huge benefit of using the schema registry. And as your schema evolves over time in the schema registry, your Iceberg table will auto-evolve over time. And because the schema registry is guaranteed to be correct, you don't need to worry about any issues. Great. So, choose our schema registry. Again, you could go and fill this out.
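The string-versus-int failure described above can be shown with a toy check in Python. This is not a schema registry client and does not use Avro; `ORDER_SCHEMA`, the field names, and `validate` are hypothetical stand-ins for the idea of rejecting a mistyped record at publish time instead of discovering it downstream.

```python
# Hypothetical schema: field name -> expected Python type.
ORDER_SCHEMA = {"order_id": int, "item": str}


def validate(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    return errors


good = {"order_id": 7, "item": "book"}
bad = {"order_id": "7", "item": "book"}  # key published as a string, not an int

print(validate(good, ORDER_SCHEMA))  # []
print(validate(bad, ORDER_SCHEMA))   # ['order_id: expected int, got str']
```

With plain JSON nothing performs this check before the record lands in the topic, so the mismatch only surfaces when the downstream pipeline fails.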
You just need to provide a schema registry URL and access key and secret key. Again, we support Redpanda and Confluent today.
Grant, quick question. Is the schema registry there to help us understand what the raw data looks like, and/or is it to help us map into the transform table itself?
Yeah. Good question. So a schema registry is not a concept that we at Starburst built. It's an open source concept, and it's used when you're publishing data to Kafka to ensure compatibility of data over time. So the reasons why you'd wanna do that in general with Kafka are to prevent issues, like I mentioned before, where different records have different schemas, and then you go to process them and you run into these incompatibility issues. So to answer your question now, where it's used: it's used during the transform process, and it's used during transforms because the raw table only stores bytes. And let me see if I can actually show you that. This might actually clear it up a little bit. So, for the raw Avro table, there's some metadata about, okay, the topic and the partition, the offset, things like that. Right. But the key bit is the raw value, and this is bytes. You know, we're displaying it in the UI, so it's a base64 encoded representation of bytes. But this payload is a binary payload, and it contains a reference internally to the schema in the schema registry. Great. So when we go to then process these values in the raw table and convert them to the transform table, we go to your schema registry, look up the schema, and then use that schema to deserialize the message. Does that help explain how it's used in the transform process?
It does. Thank you for that. I was actually trying to help set up another question in the chat, but your response plus Ahmed's comments, I think, got the question answered. So I appreciate you sharing it for us.
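The "reference internally to the schema" can be illustrated with a short Python sketch. It assumes the commonly used Confluent wire format, in which a message starts with a zero magic byte followed by a 4-byte big-endian schema ID and then the Avro-encoded body; the session does not state which framing the raw table holds, and the schema ID and body bytes below are made up.

```python
import base64
import struct


def split_wire_format(payload: bytes) -> tuple[int, bytes]:
    """Split a framed message into (schema_id, avro_body)."""
    if len(payload) < 5:
        raise ValueError("payload too short to contain a schema reference")
    # ">bI": big-endian, 1 signed byte (magic) + 4-byte unsigned int (schema ID).
    magic, schema_id = struct.unpack(">bI", payload[:5])
    if magic != 0:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, payload[5:]


# Build a made-up message, then base64 it the way a UI might display bytes.
fake_avro_body = b"\x02\x06abc"
message = struct.pack(">bI", 0, 100001) + fake_avro_body
displayed = base64.b64encode(message).decode("ascii")

schema_id, body = split_wire_format(base64.b64decode(displayed))
print(schema_id)  # 100001
```

A transform step would then look up `schema_id` in the schema registry and use the returned schema to deserialize `body`; that lookup is left out here because it needs a live registry.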
To echo it back, what I think I heard was, absolutely, it's truly about what is in the inbound messages, in Kafka in this situation. And what we're really saying is the raw table is a very structured table for us: where it came from, what topic, what partition, what offset, all that kind of stuff, and then the bytes. And then the transform table in this case, if we have a good schema registry, it's saying, hey, let's make sure it fits that. And if it doesn't, you know, you guys have already kinda talked about errors and all that other kind of stuff as well. So awesome. Awesome. Thank you.
Great. I will quickly run through this so we can get to the transform bit, since that's interesting. So we've created the raw table already. We are now creating the transform. You see, when we were configuring CSV ingest, you had control over all these things. You could choose the data type. You could choose the column name. Here, we just provide a read-only view saying, this is what we're gonna create, but, again, what controls the schema is your schema registry. This schema here gets mapped directly to the Iceberg schema. So we provide this read-only view, and you just create the transform table. And then, great. So now let's look at the transform table. I mentioned the raw table holds the binary value of the record payload, and then the transform table parses those bytes, uses the schema to break them apart into all the separate columns, and now you have a normal Iceberg table with separate columns, and it supports, you know, nested fields like this nested struct of city and state. Excuse me. And, yeah, so now you can run all sorts of queries over this data. And because you're using streaming ingest, it's available in about three minutes instead of thirty minutes. So I will stop there. I'm a little bit over time.
But feel free to ask questions in chat, and I will send it back to Ahmed.
Alright. Thank you, Grant. That was amazing. I already see some questions coming in, but folks on the chat, please continue to post your questions and thoughts. We will be responding to these chats, but also expanding on these in the Q&A session here momentarily. Okay. So you all saw how easy it was to configure and manage your ingestion pipelines from Grant Nicholas's demo. But the lingering question is, why should I use Starburst instead of other solutions out there? Right? We outlined the benefits, how it's specific to Iceberg, all of the, you know, serverless aspects of it, reliability, observability, all of that. But it's also good to compare ourselves head to head against other services in the market. Right? So, in recent months, we engaged with a third party benchmarking service, Concurrency Labs, to compare ourselves against Confluent Tableflow and AWS Data Firehose for a core use case. Shout out to Ernesto at Concurrency Labs for his thorough process and documentation. We can share more details about the use case after this webinar as part of the follow-ups. But the results were pretty amazing. Seven times faster ingestion rate, 72% better average compression, and most importantly, the total cost of the ingestion workload was about 80% cheaper than competitors. So, yeah, very, very good validation of this platform. We are really excited about the core fundamentals that we put together when it comes to cost and performance and getting that balance right. But now it's about expanding out to a lot more different use cases, and that's why we just released Avro support and CSV support today, and more to come here shortly. Alright. So, drum roll. We have been hearing from our Starburst Enterprise customers, and the feedback has been overwhelming over the past six months.
Mostly around asking us when they can test this out in Starburst Enterprise, in self-hosted and on-premise deployment models. Right? So to that end, we have made the commitment to move all of these components to Starburst Enterprise in the new multi-cluster paradigm. So the first release of this will be the automated table maintenance feature that will run on a dedicated cluster. This will also come with a lake ops feature that provides metrics and metadata about your Iceberg tables, and we have the initial release for this planned for the August LTS. And we have high confidence in delivering one of either file ingestion or streaming ingestion for the November LTS, with the remaining one to follow shortly. So look forward to that. Please share your feedback, your use cases with us so that we can cater the road map to make sure that we can capture the excitement on this feature and make sure that you guys get access as a preview user. With that, thank you for your attendance today and listening in, and I will give this back to Lester. Hey. Awesome. I will say, absolutely, you wanna work through your account teams and get all that stuff, for you Starburst Enterprise customers. But, I know, this is always a topic. I mean, you know it without my saying it, but I'll just echo. It's been a question, a comment, a concern, and a request. So if that's of interest to you, just for fun, feel free to shout out, oh, I've been waiting for that, you know, in the comments, but we know you have, for those that are existing customers. If we move along, you know, the good news is we're rapidly getting to the Q&A, but I wanna mention a couple things before I get there. And we've been addressing most of the questions so far. Iceberg Summit is coming up in a couple days here, a couple weeks, I guess. So, yeah, a couple more weeks, out there in San Francisco. I hope you're going.
If you're going, please, please, or anyone on your team or colleagues or people you just know in industry, make sure they swing by the Starburst booth. Give us a holler, see what's going on. We'd love to share all kinds of good stuff, do demos, do whatever we can to get you excited and remind you of our commitment to Iceberg. In fact, Iceberg is what brought me to Starburst about four years ago, and I've been focused on it ever since. So good stuff. I'm looking forward to that as well. And I mentioned there are other events out there. I didn't wanna overwhelm you with QR codes here, but if you just type Starburst events in your favorite search engine, Google or whichever, or your AI tool of choice, you'll find this list. And I just put a couple up there on screen. This week, we're doing a migrating from Apache Hive to... did I write that right there? To Iceberg from Hive. Yes. From Hive to Iceberg. That will be hands-on. It's a workshop series I do, and we give you all the instructions, make sure you have an environment and that kind of good stuff. And then you see some other cool stuff. Product management's gonna talk about what they're doing. We definitely have some more things in our AI space that interest a lot of folks. We'd love to share and see and touch, including our own agent, including the MCP server, all that kind of good fun stuff. And lastly, before I give you the Q&A slide, absolutely come to starburst.io at any time. I encourage everyone, even if you're a Starburst Enterprise customer and you wanna have your own Starburst Galaxy setup, I have mine, lester.galaxy.starburst.io. Come get yours. And understand that free trial isn't a trial that turns Galaxy off. It just means that the credits, the $500 of free credits, will run out before too long, so use them up. But the environment will be there. You can have, as it says, three forever-free clusters.
I use it all the time, and it costs no money at all other than your storage system and that kind of good stuff. So please, please, please. With that said, absolutely, we're gonna take a hard look at the Q&A. I was gonna shepherd it, but I saw at the end here a whole bunch of questions rolled in. So I'm gonna pause in case any of my colleagues want to shout out. I'm gonna catch myself up here, and then we'll work through these questions, and then we'll wind her down. I guess I'll mention one thing. I did see a lot of questions around the Google ecosystem. So Ahmed and Grant can jump in if they want, but I will say that, primarily, we're not pigeonholed. We're limiting the features we go after initially. With all the great cloud providers, like in the instances of GCS or Azure, object stores as a concept are pretty straightforward, so see those as a future thing. Same thing with schema registries. You know, we're targeting one today, but as Grant said, these are things where we expect multiple people have different schema registries. So over time, we could have access to more than the one that we're doing today. Alright. I'm catching up. Cool. So, yeah, I have a couple of things that I got from the chat here. I think you mentioned the optionality. Obviously, we will be planning to support, across the board, different object store destinations and object store sources, and also streaming ingest sources as well. So that is coming down the line. And if anyone has, like, specifics they want to share, we can definitely work with you to prioritize those as we evolve our road map. One question came in from Charlton. Hey, Charlton. Great to see you. In JSON, how is schema drift handled? For example, if we are all set up with raw to transform to a materialized view for post-processing, I'm assuming that the raw is ingested as normal. Does transform build a new column and guess, or does it fail, dead letter, or something else?
So, Grant, do you wanna take this, or I can take a stab as well? Yeah. Yeah. I can take it. So for your example, where there is a field you defined in your schema but it's not present in that JSON object for some reason, then the value will default to null. And we also support dead lettering. So I didn't get to that today, but there is a dead letter table, or an errors table, associated with each table. And so if there are invalid records, then we'll also publish a record to the dead letter table. So feel free to try it out. Like Ahmed said, there are some credits you can use to try it out, and it's just difficult to give that demo in ten minutes. Cool. Thank you, Grant. One other question from Ivana here. Grant mentioned that streaming ingest data would be available after three minutes. Does that mean that the checkpoint is set to three minutes? I'll take the latency portion of this here first, and then maybe you can chime in with the checkpointing here, Grant. But from a latency perspective, excuse me, from a latency perspective, we are trying to figure out the right balance between ingesting to Iceberg at lower latency, but also making sure that the data is in a compacted manner, so that the immediate data that you ingest, if it's very high throughput, where it's one gigabyte per second, for example, is queryable in a, you know, performant manner. Right? So we made the decision to sort of take a ninety-second, sort of, approach based on, you know, the throughput of data coming in. We batch them a little together, micro-batching, and then push that into Iceberg. So our current averages go from ninety seconds all the way up to three minutes depending on, you know, how long it takes to publish the data and transform the data, etcetera. We are currently, in the near future, planning to release a low latency mode offering where it will be within sub, you know, sixty seconds.
We are planning a thirty-second latency offering. Essentially, as the data comes in, we would publish to Iceberg immediately. This does mean that you would, you know, pay higher compaction costs and take an initial performance hit on your query. But, again, it depends on your use case. Right? If that's what you really need for a near-real-time use case, we have no problem offering that. It's just trying to get the balance of performance versus latency. So today, ninety seconds to three minutes is typically our average. Coming soon, you have the option of selecting low latency mode, which will allow you around thirty-second latency, hopefully. And, Grant, did you have anything specific on the checkpointing question here? I think you answered it, which is, we chose three minutes, but there's some trade-off between latency and performance. And we thought that was a reasonable first pass, but there are ways to get that lower for those extremely latency-sensitive customers. Yep. So there's a question from Vijay. I started to answer it out loud instead of typing it. He was asking, in case you're not reading the chat, about talking to other metastores. And the short answer is yes because, ultimately, in this process, we're saying where do we wanna land this kind of stuff, and, you know, whatever that connection is, we use the word catalog often, but I'm gonna use the word connection, to wherever we're storing that. It will have a configured, you know, Iceberg catalog, and that catalog could be Glue. That catalog could be Unity. Now with Unity, I'd have to go back and make 100% sure. I know there was a period where we could really primarily read, but I think we're past that with Databricks. I think we're reading and writing and other things. So the short answer is it should be, Vijay. But in general, the answer is yes. We want to support any and all catalogs we can get our hands on, and it's not dependent.
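[Editor's illustration] The latency-versus-compaction trade-off described above can be sketched with a toy micro-batcher. This is a minimal sketch under stated assumptions, not the product's batching logic: `micro_batch` and its `window_seconds` parameter are hypothetical names, and arrival times are plain seconds.

```python
from typing import Iterable, List, Tuple


def micro_batch(records: Iterable[Tuple[float, str]], window_seconds: float) -> List[List[str]]:
    """Group (arrival_time, payload) pairs into batches.

    A batch is closed once a record arrives at least `window_seconds`
    after the first record of the current batch. Each closed batch stands
    in for one commit (one set of files) written to the table.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    window_start = None
    for arrival, payload in records:
        if window_start is None:
            window_start = arrival
        elif arrival - window_start >= window_seconds:
            batches.append(current)
            current = []
            window_start = arrival
        current.append(payload)
    if current:
        batches.append(current)
    return batches


events = [(0, "a"), (30, "b"), (95, "c"), (100, "d"), (200, "e")]
print(len(micro_batch(events, 90)))  # wider window: fewer, larger commits
print(len(micro_batch(events, 30)))  # narrower window: more, smaller commits
```

A narrower window makes data visible sooner but produces more small commits, which is the extra compaction cost and initial query hit mentioned for the low latency mode.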
The ingestion pipeline tool isn't coupled to "only works with A, B, or C." So if we have a connection set up, a catalog set up, that goes to Iceberg and you can write to it, then there we go. Let's see if that's it for that one. Yeah. Just to add to that, like, when you're creating an S3 catalog in Starburst, and you select Iceberg as the table format of choice, or even Delta Lake or anything else, you do have to select the metastore. We currently support Unity, Glue, Polaris, our own Starburst Galaxy metastore, and one other, if I remember correctly. So all of these options are open to you. We want to be the, you know, open platform across all of these functionalities. So yeah. A definite yes there. I don't know if you saw. There's a great question from Charlton in there about schema drift, and you guys had a good example. I'll let the experts here chime in. But generally speaking, absolutely, the schema is gonna evolve. We're gonna be able to evolve with that. And what Ahmed earlier was kinda mentioning is you could even go back and say, you know what? Now that the schema changed, seems like things went a little awry or something, you can go back and replay. You know, you can say roll back to a point and then replay, not from Kafka direct, but off the fact that we made a copy of that binary data and put that in the raw table. So if the experts have something more sophisticated than that to answer, I'll let them, and I'm gonna go back. There are a couple of questions I'll look at while we're thinking. Yeah. I think we answered the schema drift question. Okay. It has been. There are options. Yeah. There's a dead letter queue, and then there's also pause and notify, so that folks can come and replay the data after resetting the Iceberg table. So did anyone touch on that credentials one already?
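[Editor's illustration] The schema-drift behavior discussed in this Q&A (a missing field defaults to null, an invalid record goes to a dead letter table) can be sketched as follows. This is a minimal sketch, not the product's logic: `EXPECTED_FIELDS`, the `transform` function, and the shape of a dead-letter entry are all hypothetical.

```python
import json

# Hypothetical target columns; in the product these come from the table schema.
EXPECTED_FIELDS = ("id", "name", "city")


def transform(raw_values):
    """Split raw payloads into transform rows and dead-letter entries.

    - A field that is absent from a valid JSON object becomes None (null).
    - A payload that cannot be parsed as a JSON object is routed to the
      dead-letter list together with the reason.
    """
    rows, dead_letters = [], []
    for value in raw_values:
        try:
            obj = json.loads(value)
            if not isinstance(obj, dict):
                raise ValueError("payload is not a JSON object")
        except ValueError as exc:  # also covers JSON and UTF-8 decode errors
            dead_letters.append({"value": value, "error": str(exc)})
            continue
        rows.append({field: obj.get(field) for field in EXPECTED_FIELDS})
    return rows, dead_letters


rows, dead = transform([b'{"id": 1, "name": "Ada"}', b"not json"])
print(rows)       # the missing "city" column is filled with None
print(len(dead))  # the unparseable payload is kept for inspection
```

Because the untouched payload is stored alongside the error, a bad record can be inspected or replayed later instead of silently dropped.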
Credentials from, you know... again, I don't know if I have an answer for that. Can credentials be retrieved from H Vault? I guess I don't know the answer to that one. Sure. Grant, do you know about that? I'm looking to find the question. One second. Sure. It's essentially asking, can credentials be retrieved from H Vault? Yeah. I'm unsure what H Vault is, but I can say that in Starburst Galaxy, we store credentials for you. So we have our own secret store. It's encrypted properly. So you input credentials either through the API or the Starburst Galaxy UI. And then, when you saw that I was referencing, like, a source or a live table, you didn't actually see the credentials in the UI, but there were credentials behind those things. So that's how credentials are managed. Hey, Akhil. Right. Akhil had questions, and I think it's somewhat more generic to Iceberg as a whole, but maybe I misunderstood. So, two-part question. He said, if I delete an Iceberg table, is it gonna delete the data as well? Iceberg tables are managed tables in general, and I'm not gonna talk about the raw and the live tables we saw there. If you wanna change those and get rid of those, let's go back to the tooling. Let's wind down the processes that are creating them. Let's not try to go delete anything. I'm not even sure if we have a safeguard that prevents something like that from happening on a delete. But conceptually, yes, Iceberg tables, when they're deleted, wind themselves down. But, again, if these are those raw and transform tables, I would recommend strongly, I beg you, go back to the tooling, stop the activities, decide, you know, is it time to throw those away, that kind of stuff. And before my colleagues jump in, I'll say, the next question was still kind of generic. How to convert an already existing Parquet file to an Iceberg format? Now this tooling isn't tackling that for us today here.
So now, theoretically, over time, instead of just ingesting CSV and some other things off S3, if someone's publishing raw Parquet files, we could theoretically slightly enhance this to pick up another file format. Sure. We're not doing that per se. We have other tools that can do that. Schema... oh, goodness. What's it called? Guys, remember? Yeah, I can share an update here as well. So Parquet file ingestion is actually a heavily requested feature. It is part of our road map, coming soon. Please reach out to us and we can work through your specific use case and make sure that's reflected in the evolution of the road map. So, yes, in the near future, you will be able to ingest Parquet as it lands into an Iceberg table using this same feature, with all of the, you know, schema registry, partitioning, and sorting applied there, and table maintenance as well. So yep. I'm gonna make just a general comment, because I did see two questions earlier there. While we have a raw table that gets ingested and dropped, and we have this natural transform table, this doesn't eliminate the possibility or need if you have a really rich set of transformations. I mean, things that are beyond this. You heard earlier about error checking, conversion of things, enrichment from three other tables. That's still part of your transformation pipeline, and I would see our transform table as the input for something else. So don't hesitate to continue to do those activities. And I wouldn't look for us to try to enhance that transform table creation process to be good old-fashioned, you know, massive graphical tooling that handles custom rules and all that stuff. We're not trying to rebuild, you know, an Informatica of the 1990s here per se. People still have that problem today, and they solve it in the tooling of choice. And, again, the tooling could be as simple as SQL.
The tooling could be, you know, dbt or other things like that. So don't totally misunderstand. I think a few people kinda took it like, could we do this, this, and this? We could, but I don't think we really want to tackle that in this tool. It'd be an open-ended forever tool. Yeah. Great color, Lester. I think it also touches on Charlton's question here. Can we delete from the raw table if the file has ceased to exist and we want it to stop showing up in the transform table? If not, what's the process for that? So, very timely question. This has come up multiple times, both on just bad data that has come through from Kafka, for example, where it should not have been ingested, and also just, like, GDPR, CCPA, and PII data use cases, where the data has to get deleted. So we are currently working on allowing support for deleting using the standard Trino dialect. So you can just go to your query editor and then write a Trino delete command on the live table. It is not yet supported. We plan to deliver that in the next couple of months here. So, Charlton, let's talk through the timeline. And if raw is the first thing that you need, we'll figure out, you know, how to make that happen. But on top of that, generally, for more advanced aggregations, you know, specific rules and stuff like that, we have the materialized view and view process within Starburst Galaxy that you can use for post-processing. You can use any tool that you want on top of the live tables to post-process your data as well. We will continue to slowly add critical transformations like filtering, things like fanning out the data into different live tables, etcetera, as we think about this as, like, an end-to-end pipeline that requires very low latency. But, yeah, please share your feedback, and we will continue to add those, you know, in the near future here. I am looking back. I don't know if you guys addressed David's question. Let me know. It seems pretty generic.
Seems to be isolated, and a bug. David, if there's more to it than this, say the word. It sounds like you're saying, yeah, I'm using Iceberg, I guess on S3, and at some points in time, you know, the metadata seems to disappear. That's not a behavior I've seen ever. Not saying you're not having an issue or a bug by any means. Feel free, if other folks have some ideas, chime in. But I don't think it was necessarily directed toward these live tables, that kind of stuff. It was just a conceptual problem. So, David, chime in if there's more to it than that. That does not seem normal. That does seem bad. I'm not sure what the answer is when, you know, everything should be there and it just disappears. We need to, you know, help you figure that out outside of this webinar, I think, is the best answer. Yeah. One other question just came in from Andrea. If JSON messages contain arrays of data in fields, is the new ingestion functionality able to flatten those arrays as well? So we support nested JSON in general. So test out the functionality via our product, in the sense that with one click, you can unnest the data that is nested, and it would already have the inferred data types. Some JSON types like arrays are still yet to be supported, and I've seen that in some cases it is not supported. So depending on your specific use case, it might be. So look through that. Give us the feedback if it's not, you know, supported, and we can sort of incrementally improve that in the near future here. Yeah. I can provide a little bit more context on this. So, just to be clear, yes, we support arrays in JSON. What you're asking is, can you unnest and pull out just, like, a couple values from an individual array into a single field? And the answer is, you can. There are some different methods to do it. It's kinda complicated, so I'm not gonna walk through it. But try it out yourself. Yeah.
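[Editor's illustration] The two array operations raised in this question, flattening an array into one row per element and pulling a single value out of an array into its own field, can be sketched generically. This is a minimal sketch in plain Python, not the product's transform syntax: `explode` and `pick` are hypothetical helper names.

```python
def explode(record, array_field):
    """Flatten one array field: emit one output row per array element."""
    items = record.get(array_field) or []
    base = {key: value for key, value in record.items() if key != array_field}
    return [{**base, array_field: item} for item in items]


def pick(record, array_field, index):
    """Pull a single element out of an array field, or None if out of range."""
    items = record.get(array_field) or []
    return items[index] if -len(items) <= index < len(items) else None


order = {"order": 7, "tags": ["rush", "gift"]}
print(explode(order, "tags"))  # one row per tag, other columns repeated
print(pick(order, "tags", 0))  # first tag as a scalar value
```

Exploding changes the row count of the table, while picking keeps one row per message, which is one reason the choice is left to the person defining the transform.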
Is the answer roughly, though... I remember, prior to the schema registry, that in that transform definition, now that we have editability, that was our place where we could start cherry-picking and pull things out and that kind of stuff. Is that part of the answer there, Grant? Not to drag you into it. Because if we use the schema registry and it wants to be nested, it's gonna be nested. Right? Yeah. Good call out. So I didn't mention schema registry because, with schema registry, you have no control over the schema. It is exactly what you said. We wanna adhere. Yeah. So this is only talking about JSON. Yeah. Awesome. Awesome. Well, folks, I am definitely not trying to discourage any last questions, any last alibis sneaking in. This is me rambling in case they do. But while I'm rambling, let me make sure I thank my cohosts here that did all the hard work, Grant and Ahmed, who really live and breathe this product set every day, all day. To reinforce what I said earlier, devrel@starburst.io. If you just don't know who to go to, send me an email through that, and I'll help you find folks, help you get your answers, that kind of good stuff. And I think with that, I didn't see other alibis. I'll just salute you and, again, thank my friends for being here with me, and we're looking forward to doing it again. Thank you, everybody. Talk to you guys soon. Take care. Bye bye now.