Iceberg V4: A Write Optimized Persistent B-Tree

This past week, I read the Iceberg community’s V4 spec design document. It looks to be one of the more important designs in data and analytics space right now. There was a talk on this at the recent Iceberg summit https://lnkd.in/gyYtP-D7

I took me some time to understand and see that underneath the Iceberg jargon, this problem can be tied back to a classical algorithms problem of building and efficiently maintaining a dynamic and persistent index over immutable files on disk (the files being Parquet). And the operations that this index needs to support can be boiled down to three simple operations:

  • UpdateTable(TableVersion, Delta) -> NewTableVersion. The Delta is files added/removed and tombstones over rows with a caveat: the index doesn’t need to track specific rows, the update is “add a file + mark an old row ID dead.” This is the DML path.

  • Scan(TableVersion, Predicate) -> Files. Given a version and a predicate, return the matching files so engine can prune aggressively and read only what it needs. This the standard query path.

  • GetTableVersion(Time) -> Version, query a past version, basically time travel.

Requirement: UpdateTable should cost the order of the delta, not the size of the table while still being able to track the interior metadata (file-level column min/max, counts etc) to make the pruning work. This is so small updates to very large tables are fast, and queries remain fast all the time.

Hey, wait a second (referring to my graduate school notes from Prof Michael Bender from 14 years ago at Stony Brook), IT’S a persistent B-tree!

..but hang-on, B-trees are not good at writes of large immutable leaf nodes (in this case parquet files)

but so what, Prof Bender has an answer for that too, It’s a write-optimized B-tree!

← all writing