Data Structures
Last updated on May 1, 2022
import numpy as np
import pandas as pd
For most of this course we are going to work with pandas
DataFrames. However, it’s important to start with an introduction to the different types of data structures available in Python, their characteristics, their differences and their comparative advantages.
The most important data structures in Python are:
- lists
- tuples
- sets
- dictionaries
- numpy arrays
- pandas DataFrames
- pyspark DataFrames
Lists
A list is a mutable array data structure in Python with the following characteristics:
- can hold any type of data
- can hold different types of data at the same time
- can be modified
We can generate lists using square brackets.
l = [12, "world", [3,4,5]]
print(l)
[12, 'world', [3, 4, 5]]
Since lists are ordered, we can access their element by calling the position of the element in the list.
l[0]
12
Since lists are mutable, we can modify their elements.
l[0] = 'hello'
print(l)
['hello', 'world', [3, 4, 5]]
We can add elements to a list using .append()
.
l.append(23)
l
['hello', 'world', [3, 4, 5], 23]
We can remove elements by calling del
on it.
del l[0]
print(l)
['world', [3, 4, 5], 23]
We can combine two lists using +
. Note that this operation does not modify the list but generates a new one.
l + [23]
['world', [3, 4, 5], 23, 23]
We can also generate lists using comprehensions.
l = [n for n in range(3)]
print(l)
[0, 1, 2]
Comprehensions are a powerful tool!
l = [n+10 for n in range(10) if (n%2==0) and (n>4)]
print(l)
[16, 18]
Tuples
A tuple is an immutable array data structure in Python with the following characteristics:
- can hold any type of data
- can hold different types of data at the same time
- can not be modified
We can generate tuples using curve brackets.
# A list of different data types
t = (12, "world", [3,4,5])
print(t)
(12, 'world', [3, 4, 5])
Since tuples are ordered, we can access their element by calling the position of the element in the list.
t[0]
12
Since tuples are unmutable, we cannot modify their elements.
# Try to modify element
try:
t[0] = 'hello'
except Exception as e:
print(e)
'tuple' object does not support item assignment
# Try to add element
try:
t.append('hello')
except Exception as e:
print(e)
'tuple' object has no attribute 'append'
# Try to remove element
try:
del t[0]
except Exception as e:
print(e)
'tuple' object doesn't support item deletion
We can combine two tuples using +
. Note that this operation does not modify the tuple but generates a new one. Also note that to generate a 1-element tuple we need to insert a comma.
t + (23,)
(12, 'world', [3, 4, 5], 23)
We can generate tuples using comprehensions, but we need to specify it’s a tuple
.
t = tuple(n for n in range(3))
print(t)
(0, 1, 2)
Sets
A set is a mutable array data structure in Python with the following characteristics:
- can only hold hashable types
- can hold different types of data at the same time
- cannot be modified
- cannot contain duplicates
We can generate using curly brackets.
s = {12, "world", (3,4,5)}
print(s)
{(3, 4, 5), 12, 'world'}
Since sets are unordered and unindexed, we cannot access single elements by calling their position.
# Try to access element by position
try:
s[0]
except Exception as e:
print(e)
'set' object is not subscriptable
Since sets are unordered, we cannot modify their elements by specifying the position.
# Try to modify element
try:
s[0] = 'hello'
except Exception as e:
print(e)
'set' object does not support item assignment
However, since sets are mutable, we can add elements using .add()
.
s.add('hello')
print(s)
{'hello', (3, 4, 5), 12, 'world'}
However, we cannot add duplicates.
s.add('hello')
print(s)
{'hello', (3, 4, 5), 12, 'world'}
We can delete elements of a set using .remove()
.
s.remove('hello')
print(s)
{(3, 4, 5), 12, 'world'}
We can also generate sets using comprehensions.
s = {n for n in range(3)}
print(s)
{0, 1, 2}
Dictionaries
A dictionary is a mutable array data structure in Python with the following characteristics:
- can hold any type
- can hold different types at the same time
- can be modified
- items are named
We can generate dictionaries using curly brackets. Since elements are indexed by keys, we have to provide one for each element.
Dictionary keys can be of any hashable type. A hashable object has a hash value that never changes during its lifetime, and it can be compared to other objects. Hashable objects that compare as equal must have the same hash value.
Immutable types like strings and numbers are hashable and work well as dictionary keys. You can also use tuple objects as dictionary keys as long as they contain only hashable types themselves.
d = {"first": 12, 2: "world", (3,): [3,4,5]}
print(d)
{'first': 12, 2: 'world', (3,): [3, 4, 5]}
Since dictionaries are indexed but not ordered, we can only access elements by the corresponding key. If the corresponding key does not exist, we do not access any element.
try:
d[0]
except Exception as e:
print(e)
0
We can access all values of the dictionary using .values()
.
d.values()
dict_values([12, 'world', [3, 4, 5]])
We can access all keys of the dictionary using .keys()
.
d.keys()
dict_keys(['first', 2, (3,)])
We can access both values and keys using .items()
.
d.items()
dict_items([('first', 12), (2, 'world'), ((3,), [3, 4, 5])])
This gives us a list of tuples which we can iterate on.
[f"{key}: {value}" for key, value in d.items()]
['first: 12', '2: world', '(3,): [3, 4, 5]']
Since dictionaries are mutable, we can modify their elements.
d["first"] = 'hello'
print(d)
{'first': 'hello', 2: 'world', (3,): [3, 4, 5]}
If we try to modify an element that does not exist, the element is added to the dictionary.
d[0] = 'hello'
print(d)
{'first': 'hello', 2: 'world', (3,): [3, 4, 5], 0: 'hello'}
We can remove elements using del
.
del d[0]
print(d)
{'first': 'hello', 2: 'world', (3,): [3, 4, 5]}
We can cannot combine two dictionaries using +
. We can only add one element at the time.
try:
d + {"fourth": (1,2)}
except Exception as e:
print(e)
unsupported operand type(s) for +: 'dict' and 'dict'
We can also generate dictionaries using comprehensions.
d = {f"k{n}": n+1 for n in range(3)}
print(d)
{'k0': 1, 'k1': 2, 'k2': 3}
Numpy Arrays
A numpy array is a mutable array data structure in Python with the following characteristics:
- can hold any type of data
- can only hold one type at the same time
- can be modified
- can be multi-dimensional
- support matrix operations
We can generate numpy arrays are generated the np.array()
function on a list.
a = np.array([1,2,3])
print(a)
[1 2 3]
We can make 2-dimensional arrays (a matrix) as lists of lists.
m = np.array([[1,2,3] , [4,5,6]])
print(m)
[[1 2 3]
[4 5 6]]
Since numpy arrays are mutable, we can modify elements by calling the index of the numpy array.
m[0,0] = 89
print(m)
[[89 2 3]
[ 4 5 6]]
We can check the shape of a numpy array using .shape
m.shape
(2, 3)
We can expand dimensions of a numpy array using .expand_dims()
.
a = np.expand_dims(a,1)
print(a)
[[1]
[2]
[3]]
We can transpose matrices using .T
.
a = a.T
print(a)
[[1 2 3]]
We add elements to a numpy array using np.concatenate()
. All elements must have the same number of dimensions and must have the same number of elements along the concatenation axis.
m = np.concatenate((m, a), axis=0)
print(m)
[[89 2 3]
[ 4 5 6]
[ 1 2 3]]
We cannot remove elements of numpy arrays.
try:
del a[0]
except Exception as e:
print(e)
cannot delete array elements
We can do matrix operations between numpy arrays. For example, we can do multiplication with @
.
a @ m
array([[100, 18, 24]])
If we use *
we get dot (or element-wise) multiplication instead.
a * m
array([[89, 4, 9],
[ 4, 10, 18],
[ 1, 4, 9]])
There is a wide array of functions available in numpy. For example np.invert()
inverts a squared matrix.
np.invert(m)
array([[-90, -3, -4],
[ -5, -6, -7],
[ -2, -3, -4]])
Comparison
Which data structure should you use and why? Let’s compare different data types
K = 100_000
l = [n for n in range(K)]
t = tuple(n for n in range(K))
s = {n for n in range(K)}
a = np.array([n for n in range(K)])
Size
Which data type is more efficient?
import sys
def compare_size(list_objects):
for o in list_objects:
print(f"Size of {type(o)}: {sys.getsizeof(o)}")
compare_size([l, t, s, a])
Size of <class 'list'>: 800984
Size of <class 'tuple'>: 800040
Size of <class 'set'>: 4194520
Size of <class 'numpy.ndarray'>: 800104
Size is very similar for lists, tuples and numpy arrays.
Speed
Which data type is faster?
import time
def compare_time(list_objects):
for o in list_objects:
start = time.time()
[x**2 for x in o]
end = time.time()
print(f"Time of {type(o)}: {end - start}")
compare_time([l, t, s, a])
Time of <class 'list'>: 0.021067142486572266
Time of <class 'tuple'>: 0.02071404457092285
Time of <class 'set'>: 0.021012067794799805
Time of <class 'numpy.ndarray'>: 0.010493993759155273
Numpy arrays are faster at math operations.