Persistent Data Structures - An Introduction

Feb 07, 2021

Before we start our essay on Persistent Data Structures, I just want to take this opportunity to share with you all something super interesting I have been working on.

I am starting an 8-week long Cohort-based Course on System Design that aims to cover a wide range of topics and help you ace designing systems that scale.

This is a prime cohort-based course aimed at providing an exclusive and crisp learning experience. The program will cover most of the topics under System Design and Software Architecture including but not limited to - Architecting Social Networks, Building Storage Engines and, Designing High Throughput Systems.

The program will have a blend of Live Classes happening on Weekends, 1:1 Mentorship sessions happening on weekdays, assignments and, group projects. The program is designed to be intense, crisp and interactive so as to accelerate the overall learning.

The first cohort commences on the 6th of March, 2021. So if you are interested in attending, please consider enrolling. You can find more details about the program, its structure, syllabus, and benefits here 👇

About System Design Masterclass

Thank you for supporting me and now we get back to our usual, an exhaustive introduction to Persistent Data Structures.

Ordinary data structures are ephemeral implying that any update made to it destroys the old version and all we are left with is the updated latest one. Persistent Data Structures change this notion and allow us to hold multiple versions of a data structure at any given instant of time. This enables us to go back in "time" and access any version that we want.

In this essay, we take a detailed look into the world of Persistent Data Structures and see the basics of its implementation along with where exactly we can find them in action. This essay is meant to act as an exhaustive introduction to the topic and in future essays, we will dive deeper into the specifics of each data structure.

This essay is loosely based on the iconic paper published in 1986 titled Making Data Structures Persistent by James R. Driscoll, Neil Sarnak, Daniel D. Sleator, and Robert E. Tarjan.

Persistence

Persistent Data Structures preserve previous versions of themselves allowing us to revisit and audit any historical version. Depending on the operations allowed on the previous versions, persistence is classified into three categories

Partially Persistent

Partially Persistent Data Structures allows access to all the historical versions but allows modification to only the newest one. This typically makes historical versions of the data structure immutable (read-only).

https://user-images.githubusercontent.com/4745789/107117960-e4a66480-68a3-11eb-97a0-c8c412527471.png

Fully Persistent

Fully Persistent Data Structures allows access and modification to all the historical versions. It does not restrict any modifications whatsoever. This means we can typically revisit any historical version and modify it and thus fork out a new branch.

https://user-images.githubusercontent.com/4745789/107117958-e112dd80-68a3-11eb-971b-58034e693f44.png

Confluently Persistent

Confluently Persistent Data Structures allow modifications to historical versions while also allowing them to be merged with existing ones to create a new version from the previous two.

https://user-images.githubusercontent.com/4745789/107117954-da846600-68a3-11eb-9c34-b9489c170710.png

Applications of Persistent Data Structures

Persistent Data Structures find their applications spanning the entire spectrum of Computer Science, including but not limited to - Functional Programming Languages, Computational Geometry, Text Editors, and many more.

Functional Programming Languages

Functional Programming Languages are ideal candidates for incorporating Persistent Data Structures as they forbid, while some discourage, the mutability of underlying structures. These languages pass around states within functions and expect that they do not update the existing one but return a new state. Programming languages like Haskell, Clojure, Elm, Javascript, Scala have native Persistent implementations of data structures like Lists, Maps, Sets, and Trees.

Computational Geometry

One of the fundamental problems in Computational Geometry is the Point Location Problem which deals with identifying the region where the query point lies. A simpler version of the problem statement is to find if a point lies within or outside a given polygon. A popular solution to determine a solution to the Point Location problem statement uses Persistent Red-Black Trees.

Text and File Editing

The most common operation required by any Text or File editing tool is Undo and Redo and having persisted all historical versions through a persistent data structure makes these most frequent operations very efficient and a breeze.

Implementing Partial Persistence

There are a few generic techniques that help in implementing Partial Persistence. Just to reiterate, partial persistence allows access to all the historical versions but allows modification to only the newest one, to create a newer updated copy of data.

Copy on Write Semantics

A naive way to implement Partial Persistence is by utilizing the Copy-on-Write semantics and naively creating a deep copy upon every update. This technique is inefficient because upon every writes the entire data structure is deep copied.

There are certain methods built upon certain storage paradigms that copy only what matters making the entire CoW efficient. I have already gone in-depth of Copy-on-Write semantics and I encourage you to check that out.

The Fat Node Method

Instead of creating copies of the entire data structure, the Fat Node method suggests that each cell holding the value within the data structure is modified to hold multiple values (one for each version) making it arbitrarily fat. Each value node thus holds a value and a version stamp.

The Node-Copying Method

The Node-Copying method eliminates all the problems with the Fat Node method. It allows each node to hold only a fixed number of values in it. Upon exhaustion of space, a new copy of Node is created and it holds only the newest values in it. The old node also holds pointers to the newly created node allowing browsing.

Having this structure helps in making update operation slightly efficient by reducing the number of nodes to be copied during writes, considering the in-degrees to each node is bounded.

Path-Copying Method

The path-copying method copies all the nodes coming in the path from the root to the node being modified. This way it tries to minimize the copy and promotes reusing some of the unmodified data. This method comes in super handy in Linked Data Structures like Lists and Trees.

https://user-images.githubusercontent.com/4745789/107144006-33b0d000-695e-11eb-9e13-959eaeba44f4.png

Implementation of Full Persistence is a mammoth of a topic on its own and deserves its own essay. Hence skipping it from the scope of this one.

Reducing the memory footprint

Since persistent data structures thrive on high memory usage, they require some garbage collection system to prevent memory leaks. Algorithms like Reference Counting or Mark and Sweep serves the purpose pretty well.

Thus when a historical version is not referenced anymore in the program space, the corresponding objects and nodes are freed up.

References

Other essays you might like

If you liked what you read, consider subscribing to my weekly newsletter at arpitbhayani.me/newsletter where, once a week, I write an essay about programming languages internals, or a deep dive on some super-clever algorithm, or just a few tips on building highly scalable distributed systems.

You can always find me browsing through Twitter @arpit_bhayani.

If my work adds value, consider supporting me

Arpit’s Newsletter

Discussion about this post

Ready for more?