What you need to know about data structures to be a better software engineer
From blood oxygen level sensors on the latest Apple 6 watch series to the latest video upload on social media to user activity tracking in a web application, data is created at a neck-breaking pace in our world today .
This massive amount of data generated has to flow through software systems from point of creation to the point where they are stored and on to the point of consumption, as efficiently as possible. I often think of software systems as a conduit for passage of data.
It therefore becomes critical that software applications and systems employ very optimal ways of representing data that passes through its system. Little wonder most technical interviews at companies that process massive data test the candidate’s knowledge of data structures.
I held over 58 free mock technical interviews in 2020 and during these interviews, one question I got asked a lot was: How much of data structures does one need to pass an interview and to become a sound software engineer?
Is it sufficient to know just a few basic data structures or do I need to know all data structures including red-black trees , tries etc? Which of the many data structures do I need to know? Is there a silver-bullet data structure that can solve all problems or like an MIT Professor once said, “when in doubt, throw in a hash map”?
This article will be my attempt to answer these questions.
I'll start by talking about what data structures are before moving on to what I think are the important things to know about each data structure. Finally, I'll pick a few fundamental data structures and do a deep dive on them especially highlighting their uniqueness and the use cases they are best suited for.
I recognize that some of the ideas in this article are subjective and that there could be a different perspective about this. I'd love to hear your thoughts on this.
Let’s jump right in!
What are data structures?
Data structures are a way of representing data.
Sorry, did you expect something overly technical? Hang on for a second.
Imagine you have a set of books. How would you arrange them?
You could decide to place them on top of each other as shown in the image above.
Or you could place them side-by-side (an array, a list), as in the image shown below.
Imagine if you were searching for the last book, which of the two layouts would make it easier for you to find the last book?
Hopefully, you said the side-by-side arrangement because you could just go straight to the book whereas, with the first layout, you’d have to remove every book on top of it first before getting to the last book.
So basically if you know you would be searching for books a lot, the side-by-side arrangement seems like a better choice.
However, imagine that the first book is the one you are currently reading and all you care about is that you want to be able to access that book. Then either arrangement would be fine right?
This is what data structures are about really.
A data structure presents a unique layout in memory for storing and organizing data. And just as we’ve seen with the example above, the arrangement of the data in determines the efficiency of certain operations.
Do you still want a more ‘technical’ definition? Alright then, here’s one:
A data structure is a data organization, management, and storage format that enables efficient access and modification.
What data structures should you know?
I’ll intentionally skip this part and answer this later, so let’s move on.
Okay I’m kidding.
There’s a lot of existing content that tells you what data structures you should know. I’m careful not to add to it because I don’t think anyone should put a cap on how much you can know.
That said, I believe in the 80–20 principle. What 20% of data structures do you need to know to not just excel at a technical interview, but to be a better software engineer at your job?
Here’s what I think that 20% would look:
- Arrays
- Linked Lists
- Hash Tables
- Trees
- Graphs
Sorry if that looks like 50 not 20.
The data structures listed above are important because they form the building blocks for a number of more sophisticated data structures. Red-black trees for example, are first trees, before anything else; and trees are just a bunch of connected linked lists with some defined hierarchy. Similarly, sets use hash tables for their internal implementation.
In addition to knowing these fundamental data structures, it is crucial that you have an in-depth knowledge of the data structure(s) you use every day at work or at a company you hope to work for. For example, if you hope to work at a graph database company or on the Google Maps team, having a rich knowledge of graphs and trees would better set you up for success.
With that said, let’s talk about the important details you should know about a data structure.
What about a data structure should you know?
If you had infinite time, infinite mental resources and you love learning, you may want to try to learn everything about all the data structures. Sadly, infinite time and resources are still a utopian dream.
To utilize our finite resources efficiently, we need to focus on the more important things, the biggest bang for our buck. What then are the important things a good software engineer (and that’s you because you are reading this article) should know about data structures?
Here are the main things you should know about a data structure:
- Read/Access time: How long does it take to read from the data structure?
- Write time: How long does it take to update an existing entry in the data structure?
- Insert time: How long does it take to insert a new entry into the data structure?
- Delete time: How long does it take to remove an entry from the data structure?
- Search time: How long does it take to find an entry in the data structure?
It is important to note that the times listed above are measured using something called the Big-O Notation.
One other thing I will add to this list will be: does each record in the data structure have an associated overhead? For example a node entry in a linked list, has more memory footprint than an entry in an array.
Knowing these things help you understand what each data structure is best suited for and making better decisions about when to use which.
Some fundamental data structures
Now that we know what we should be looking for in a data structure, let us apply this to a few fundamental data structures in a bid to understand their peculiarities and what use cases they are best suited for.
1. Arrays
An array is a data structure that stores a collection of elements in contiguous memory, with each element identifiable by its index or position.
Contiguous is a big word for saying side-by-side. Arrays use a memory layout that allows elements to be placed side by side (think to our side-by-side book arrangement).
The more important things to know about the array are:
- Read/Access: Constant time
- Write: Constant time for fixed-size arrays. Constant amortized time for dynamic arrays with linear time as worst case.
- Insert/Delete: Linear time because though you can directly access the element to delete using its index, you have to move elements after the target element.
- Search: Linear time - you have to explore each element.
- When to use: You get the gains of an array (constant time read/write) when you have a fixed number of elements to store, e.g. number of alphabets, countries in Asia etc.
- When not to use: Arrays aren't best suited for search operations. They also aren't best for cases with frequent inserts and delete operations because it has to copy all the elements after the index of the inserted/deleted element.
2. Linked Lists
A linked list is a linear collection of elements where each element (often called a node) is 'linked' to the next element. Each node stores a reference (something that points it) to the next node, something almost akin to a chain.
Traversing a linked list starts from the first node and uses the reference to the next node, all the way till the final node in the list.
Here are the things to note about the linked list:
- Read/Access: Linear time
- Write: Linear time.
- Insert/Delete: Constant time if inserting/deleting at the beginning. Linear otherwise.
- Search: Linear time as you have to explore each node.
- When to use: Linked lists are best suited for cases when you expect frequent insert/delete operations at the top or end of the list. A good example would be stacks and queues.
- When not to use: Don't think linked lists if you want fast search or read operations, except you are reading from the top or the end of the list.
3. Hash Tables
A hash table is a data structure that stores data in an associative manner by storing key-value pairs. By using a hashing function for each key, the hash table ensures that no two keys in the hash table are the same.
Sets and hash maps make use of hash tables as their internal implementation.
There’s more to say about hash tables and I will do a deeper dive in my upcoming series on data structures but for now, the key things to know about the hash table are:
- Read/Access: Constant time
- Write: Constant time.
- Insert/Delete: Constant time
- Search: Constant time.
- When to use: Always baby, always! But seriously, hash tables have a broad range of use cases because of its constant read, write and insert time.
- When not to use: Hash tables aren’t the best when it comes to iterating over its entries. Also because of its associative nature, each record in a hash table comes with extra memory overhead so use it when you absolutely need the keys. For example, if all you need is constant read/access time for a fixed data, then a simple array is more memory-optimal.
4. Trees
A tree is a non-linear data structure that represents the relationship between a root node and zero or more sub-trees in a hierarchical structure.
There are different implementations of the tree data structure including but not limited to: binary trees, binary search trees, red-black trees, AVL trees and n-ary trees. If you will like to learn more about these implementations, I’ve added some links in the resources section.
The time efficiency for the different operations using the tree data structure are:
- Read/Access: Best-case: Logarithmic, Worst-case: Linear
- Write: Best-case: Logarithmic, Worst-case: Linear.
- Insert/Delete: Constant time if inserting/deleting at the beginning. Linear otherwise.
- Search: Linear time as you have to explore each node.
- When to use: When you want to represent hierarchical relationships and when you want faster search on sorted data (binary search tree).
- When not to use: Unsuited for searching through unsorted data.
As you may have noticed, I have skipped the section on Graphs because graphs are simply a collection of connected trees and so the same things mostly apply.
TL;DR
I once worked on a project where a big part of the optimization that needed to be done in reducing the page load time of the app was to change the data structure used for storing a part of the app state from an array to a hash-map. We were carrying out a lot of search operations and an array data structure wasn’t the best for it.
It is therefore important to know the unique features of each data structure and what use cases they are best suited for.
So summarily,
- For fast lookup operations, a hash-map should be one data structure to consider.
- For numeric search operations, think binary search trees.
- For character search operations showing relationships between the elements, think tries.
- For fast reads and writes, think hash tables and arrays (when the data is of a known size).
- For frequent inserts and deletes at the start/end of the data structure, linked lists are your best bet.
- To represent relationships between entities, trees would be the most suitable.
- To represent directed or undirected non-hierarchical relationships between entities, think graphs.
Conclusion
Understanding data structures is not just another item to mark off your checklist for a technical interview. Data structures are a powerful tool in the repertoire of a software engineer to help you write cleaner, more efficient code.
The best engineers know their tools and know when to wield which to solve a specific problem.
A good tool improves the way you work, a great tool improves the way you think
Let’s all aim to use great tools today. 😊
If you are interested in a more detailed study on each data structure, I will be starting a series next month where I’ll be doing a deep-dive into the data structures I’ve listed here and a few others. Please subscribe here so you don’t miss out on it.
I'll also be resuming free mock interviews for 2021 by next month, March. So if you ever need to do a free mock technical interview, please feel free to reach out on Twitter @meekg33k .
I hope this was worth your time. E go be! ✌️