Building an HTAP Data Warehouse with Apache Arrow
Scott Dykstra, CTO and Co-Founder of Space and Time, interviews Principal Software Engineer Brent Gardner on becoming an Apache Arrow committer.
Brent is an experienced software engineer whose background spans big data, scientific modeling, streaming, and data analytics. He has been building columnar databases since 2009, working on cryptocurrency code since 2017, and using Rust since before 1.0. In addition to development, Brent is a skilled public speaker, teacher, and leader who works across a variety of programming languages and domains. Before joining Space and Time, he provided the guidance and technical vision to take multiple startups from zero to one. Brent has been an active contributor to Apache Arrow, Apache DataFusion, and Apache Ballista since joining Space and Time.
Scott: Welcome to Exploring Space and Time. I'm Scott Dykstra, CTO of Space and Time, and I'm here with my friend Brent Gardner, a lead engineer at Space and Time. Brent, do you want to introduce yourself?
Brent: Hi, I'm Brent. I've been working on the OLAP side of Space and Time. We've been working with the Apache Foundation on some cool columnar Arrow storage, and I'm excited to talk about that today.
Scott: Awesome. Well, one of the things we're most excited about is your contributions to the Apache Arrow codebase and becoming a contributor, a committer, to Apache Arrow. Tell us what that means. What does it mean to be contributing to Apache Arrow?
Brent: I'm really honored. I've got to start with that. It's been a heck of a journey. I've been wanting to contribute to open source for most of my career, but not many opportunities come up. So, I'm really grateful that Space and Time has sponsored me full-time to contribute to the Arrow Project, Arrow and DataFusion and Ballista, as well as some upcoming research in those areas as well.
Scott: Yeah, we are as well. It's exciting to have one of our lead engineers contributing and giving back to the open-source community. We believe strongly in Apache Arrow as the future of vectorized in-memory operations and also kind of an interface for lots of different database technologies to communicate with each other in a sort of succinct way. So, I’m really excited about this work. How did this start? Tell me a little background on DataFusion and how you came across that project and got involved with DataFusion, and what is DataFusion?
Brent: Let's start with what DataFusion is. The Arrow project is a language-agnostic way to share memory between processes, specifically in a columnar format. That's very attractive for all of the different databases that have needed to interoperate and communicate. Having a standard interchange format is like having 110 volts or a six-foot-wide railroad track; once you lay that down, you can do all sorts of stuff on top of it. Arrow also brings some cool vectorized operations. There was a reference from CMU recently saying that unless your database has a vectorized execution engine, you're going to spend millions of dollars recreating one, and DataFusion was on that shortlist. Arrow has some of these vectorized operations, where you keep your CPU cache lines full and fully utilize the wide SIMD lanes in modern processors, but on its own Arrow only gives you columnar operations. If you want to run SQL or anything higher level on top of that, or do Python bindings with data frames and all the stuff that data scientists are used to doing, that's where the DataFusion project comes in.
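To make that layering concrete, here is a minimal sketch of running SQL over a Parquet file with DataFusion's SessionContext. The table name, file path, and query are made up for illustration, and the example assumes the datafusion and tokio crates as dependencies.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    // DataFusion plans and executes SQL on top of Arrow's columnar format.
    let ctx = SessionContext::new();

    // Register a Parquet file as a queryable table (hypothetical path).
    ctx.register_parquet("events", "data/events.parquet", ParquetReadOptions::default())
        .await?;

    // Results come back as Arrow record batches.
    let df = ctx
        .sql("SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id")
        .await?;
    df.show().await?;
    Ok(())
}
```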
And then finally, Apache Ballista stacks on top of that to serialize execution plans and execute them in parallel across many nodes, and it really starts to be a competitor to Apache Spark.
Scott: Yep, that makes sense. So Arrow is the in-memory vectorized format. DataFusion is essentially a database engine in a way, a query engine. And Ballista is distributed processing in a cluster for DataFusion. How did you get involved with DataFusion in the first place?
Brent: It can't hurt to know the author. Andy Grove and I go way back to when we were working at a company that was building our own system that kind of rivaled Apache Spark. Spark took off, and Andy and I were both responsible for porting our engine onto Spark. There are a lot of great things about Spark, but there are some downsides as well. And I think the limitations of the JVM were already starting to crop up; I mean, this was six years ago, maybe. We would run 20-hour-long jobs that would crash due to an off-heap OutOfMemoryError, then we would give more memory to the operating system and run them again, and 20 hours later they would crash due to the JVM running out of memory. So when we were spending all this time trying to get the Java Virtual Machine to have access to native memory, it became clear that maybe there was a better way.
This was also about the time when Rust was starting to enter the scene as a programming language. So I think Andy, as Andy tends to do, spent a lot of his spare time on a personal hobby project creating what he knows best: a SQL parser and execution engine. Sort of taking all the good parts of Spark, a directed acyclic graph of operations, and slicing that up and distributing it, but doing that in a columnar way instead. So he built that execution engine on top of Apache Arrow, and it turns out they're like peanut butter and jelly; they work really well together. And that's sort of how we got to where we are today. As far as how I ended up working on it, again, that's thanks to Space and Time. I'm a sort of jack of all trades. I've been hopping around doing some embedded Android development, some AI object detection, and various things. But when the opportunity came up to contribute back to sort of the cutting edge of distributed databases, I was excited and honored. So here I am.
Scott: And we're excited about the nine PRs to Arrow, your 29 PRs to DataFusion, and 21 to Ballista. With all these PRs and all this engagement up the stack from Arrow to DataFusion to Ballista, how do you engage with the different contributors globally? I assume a couple hundred people committing to Arrow on a regular basis, probably tens to DataFusion, and single digits to Ballista. What does that look like?
Brent: Yeah, I guess it's kind of hard to say on a daily basis, but there are 600 contributors to Arrow, 300 or 400 to DataFusion, and, I don't know, probably a similar number to Ballista. So there are a lot of people working on this; they don't necessarily all contribute on a daily basis. I'm lucky because Space and Time's interest is sufficiently aligned with these projects that I'm able to contribute full-time. And I think as we build more and more of Space and Time on top of Arrow, we're going to see more contributors coming over and getting contributor status, because I know we already have several developers who have been committing part-time as well. As far as collaboration goes, the Apache Software Foundation is a great out-of-the-box way to manage open-source projects, and its rules of governance have served these projects well over time. One of those rules is that everything must be archived and archivable. The way that works with GitHub is that all of the GitHub messages go to the mailing list, and that gets archived. So almost all of the communication actually goes through GitHub: pull requests, issues, and discussion about code there. And that's how I spend a lot of my day, to be honest: reviewing pull requests, getting my own pull requests reviewed, filing issues, et cetera.
Scott: Awesome. And what's it like working with this global team? Are most of these contributors from large database companies? Who are these folks and where are they coming from?
Brent: It's hard to say, because there are so many of them, but some of the key contributors, I think, work at companies like Space and Time. As I was saying, if you're going to make a modern database, then you need a vectorized execution engine. And the database problem is surprisingly not yet solved; it seems to be one of those things that's evergreen. Despite databases being an extremely mature technology that's been around for 40 or 50 years, new needs keep coming up around big data, analytics, and time series, and there's been no silver bullet so far. So there are a bunch of companies like Space and Time building databases, and they need these core components. Those are the people I work with on a daily basis.
Scott: Gotcha. Awesome. So let's talk about how Space and Time can, will be, and is leveraging this technology. Arrow is really an in-memory, vectorized format, DataFusion is a framework for building a database, and Ballista is a distributed framework for distributing DataFusion. So as we build HTAP, how could some of these or all these technologies play into a modern HTAP data warehouse?
Brent: Wow, that's an interesting question. I think from the OLAP side, that's a fairly solved problem; Ballista more or less answers that. If you have a whole bunch of Parquet files sitting out on object store, and you need to query them and join them as quickly as possible, I'd certainly say Ballista is a contender; while it's not necessarily always the fastest, it is definitely the fastest on some types of queries. So that side of things is plumbed and getting better every day. The OLTP side is a little bit more on the research side. Arrow has immutability as a core concept, so having DML statements modify data is an interesting thing to try to add to the Arrow ecosystem, and I think that's certainly ruffled a few feathers. There are ways to accumulate data in Arrow formats while still honoring the memory-safety guarantees that the language of choice, and in particular Rust, expects. As long as you don't go changing historical data, you're probably pretty good, and that works well with the concept of multi-version concurrency control. So it's an open area of research, and the ability to parse and plan inserts, updates, and deletes just went in a couple of weeks ago.
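As a rough illustration of that multi-version idea, the sketch below (an illustrative example, not Space and Time's actual design) models an update as a new, higher-versioned row appended in a fresh Arrow batch, leaving earlier batches untouched; a reader would resolve the highest visible version per key. It assumes only the arrow crate, and all schema and field names are hypothetical.

```rust
use std::sync::Arc;

use arrow::array::{Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), ArrowError> {
    // Every row carries a version; readers pick the latest version per key.
    let schema = Arc::new(Schema::new(vec![
        Field::new("key", DataType::Int64, false),
        Field::new("value", DataType::Utf8, false),
        Field::new("version", DataType::Int64, false),
    ]));

    // Initial data: once built, the batch is immutable.
    let v1 = RecordBatch::try_new(
        schema.clone(),
        vec![
            Arc::new(Int64Array::from(vec![1, 2])),
            Arc::new(StringArray::from(vec!["a", "b"])),
            Arc::new(Int64Array::from(vec![1, 1])),
        ],
    )?;

    // An "update" to key 1 is appended as a new batch with a higher version,
    // so historical data is never modified in place.
    let v2 = RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int64Array::from(vec![1])),
            Arc::new(StringArray::from(vec!["a-updated"])),
            Arc::new(Int64Array::from(vec![2])),
        ],
    )?;

    println!("stored {} original rows and {} update rows", v1.num_rows(), v2.num_rows());
    Ok(())
}
```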
Scott: And is this your first time becoming a named contributor to an Apache top-level project?
Brent: Yeah! Like I said, I don't think there are many companies that are as devoted to open source. I appreciate Space and Time's stance on this, but from what I understand, nobody in the crypto ecosystem really wants to trust closed-source code. So I think it ends up working out well for all parties involved: Space and Time is willing to contribute these innovations back to the broader community. At previous jobs, that just wasn't something they were willing to do, and without it being part of your full-time job, it's hard to be a contributor.
Scott: And I think we'll continue to ramp up our contributions and our embrace of the open-source community as we go. The Web3 and decentralized ecosystem that we're building in is by nature open source, and as we build our business and our technology and work through our roadmap, we'll open source more and more. It'll be exciting to roll all that out. I think we're going to embrace DataFusion and Arrow more and more as we continue to build, and it's exciting that you're already at the tip of the spear here, because it's only going to get more important as we focus on this next generation of HTAP. One last question to wrap things up. Brent, it's been an absolute pleasure having you. I'm so curious: is this community active on Slack? You said everything has to be archived, so everyone's emailing. What does the communication pattern look like? You've got folks everywhere from China to Europe to the US all contributing. What's the communication standard?
Brent: I don't mean to mislead anyone. There is a Slack, and I believe there's a Discord as well, and I'm definitely logged into that Slack all day long. It's good for asking questions like, "Hey, I want to do X, how do I go about doing that?" and we try to help out anyone who jumps in there with questions. What needs to be archived is the decisions about which way the code is going and how the architecture works, and all of that is done through pull requests and the discussions on them. But there's absolutely a Slack and a Discord, and you can find all of that through the apache/arrow-rs repository.
Scott: What's next? In the short term, what's next for the DataFusion community? What is the focus? What are the next set of PRs?
Brent: Wow. That's an interesting question, because there's always a lot going on, and different parties have a lot of different interests. I know from the Ballista side of things, there's a lot of work going on as far as trying to support streaming, not just batch, being able to allocate resources well. In DataFusion, as I said, there's some work being done to support writes as well as reads. There's always query-planner optimization going on. I'm probably not the best person to answer that question, because there's so much happening and I'm focused on the things that drive our database forward here at Space and Time.
Scott: Absolutely, and I'm glad you are. Well, it's thrilling and it's exciting to have a talented engineer like yourself contributing to this fast-growing ecosystem that is Apache Arrow, and DataFusion on top of Apache Arrow, and Ballista distributing DataFusion. We see extreme value in these technologies. We know this is the future. We're excited to be building on top of it, and we're really thrilled that a talented engineer at Space and Time is also contributing back. So, Brent, congratulations.
Brent: Thanks, Scott. I'm really excited that I get to do this, and it's been great talking to you today.