What are the differences between std::set and std::unordered_set?

Here is a graphical and quantitative explanation of

- how unordered_set works,
- how fast it can access data,
- how much memory it occupies, and
- how it compares to std::set.

### Exactly how many steps does it take unordered_set to access its data?

I filled an unordered_set with random numbers and plotted how the data are distributed inside it.

While inserting the random values, I counted the number of entries in each bucket.

*Figure: empty and filled buckets, as a percentage of the number of entries in the hash table*

I used this code to get the above graph.

You can see that:

- 50% of buckets contain 1 entry
- 17% of buckets contain 2 entries
- 3% of buckets contain 3 entries
- 0.4% of buckets contain 4 entries

In other words,

- 40%-60% of the values in a hash table are in buckets with 1 entry
- 30%-40% of the values in a hash table are in buckets with 2 entries
- 9% of the values in a hash table are in buckets with 3 entries
- 1.8% of the values in a hash table are in buckets with 4 entries
- 0.3% of the values in a hash table are in buckets with 5 entries

So,

**for ~94% of the values in a hash table it takes, on average, about 1.2 steps to access them**.

Calculation method:

- 50% of the values can be accessed in 1 step (avg = 1)
- 35% of the values can be accessed in 1-2 steps (avg = 1.5)
- 9% of the values can be accessed in 1-3 steps (avg = 2)

avg number of steps = 50%·1 + 35%·1.5 + 9%·2 = 1.205 ≈ 1.2

#### Why are there peaks in the graph?

Each time the number of entries in the set reaches the bucket count, the number of buckets is doubled and the table is *rehashed*. Note, however, that **rehashing does not copy the actual entries** of the unordered_set; it only redistributes the nodes of the buckets' linked lists.

#### Why are there more than 100% empty buckets?

The number of buckets varies between 1 and 2 times the number of entries the set actually stores, and since the graph's percentages are relative to the number of entries, the empty buckets can exceed 100%. About half of the buckets in the table are generally empty. Having a lot of empty buckets is important because it increases the probability that a new value does not land in an already occupied bucket.

### How much space does std::unordered_set occupy?

The bucket count varies between 1 and 2 times the number of entries.

Each bucket contains a forward_list.

The size of a forward list is just one pointer to its first node.

The size of each node in the list is:

- the size of data itself, plus

- the size of a pointer pointing to the next list node.

Summing up, an unordered_set of n entries may occupy

#bytes = bucket_count * sizeof(forward_list) + n * (sizeof(data) + sizeof(forward_list_node))

therefore, with 8-byte pointers, # of bytes may range

- from **n*(16+sizeof(data))** when bucket_count = n
- to **n*(24+sizeof(data))** when bucket_count = 2n

### How much space does a std::set occupy?

Each node of a std::set has a *left*, *right*, and *parent* pointer, plus the data. Thus:

#bytes = n * (3*sizeof(pointer) + sizeof(data))

therefore # of bytes in a set is

- exactly **n*(24+sizeof(data))** (with 8-byte pointers)

### How do memory and performance compare between a std::set and a std::unordered_set?

On average, you can expect a similar memory footprint for std::set and std::unordered_set. The unordered_set's footprint varies with its bucket count, but if you know beforehand the number of entries you're going to store, you can reserve() the appropriate size and be better off than with a std::set.

Performance-wise,

- std::unordered_set has O(1) average-case complexity for insert and search, while
- std::set has O(log n) complexity.
