Many Dictionaries Using Massive Amounts Of RAM

November 28, 2022 Post a Comment

I have a very simple Python script to create (for test purposes), 35 million dictionary objects within a list. Each dictionary object contains two key/value pairs. eg. {'Name': 'Jo

Solution 1:

The overhead for a dict object is quite large. It depends on your Python version and your system architechture, but on Python 3.5 64bit

In [21]: sys.getsizeof({})
Out[21]: 288

So guesstimating:

250*36e6*1e-9 == 9.0

So that is a lower limit on my ram usage in gigabytes if I created that many dictionaries, not factoring in the list!

Rather than use a dict as a record type, which isn't really the use case, use a namedtuple.

And to get a view of how this compares, let's set up an equivalent list of tuples:

In [23]: Record = namedtuple("Record", "name age")

In [24]: records = [Record("john", 28) for _ in range(36000000)]

In [25]: getsizeof = sys.getsizeof

Consider:

In [31]: sum(getsizeof(record)+ getsizeof(record.name) + getsizeof(record.age)  for record in records)
Out[31]: 5220000000

In [32]: _ + getsizeof(records)
Out[32]: 5517842208

In [33]: _ * 1e-9
Out[33]: 5.517842208

So 5 gigs is an upper limit that is quite conservative. For example, it assumes that there is no small-int caching going on, which for a record-type of ages will totally matter. On my own system, the python process is registering 2.7 gigs of memory usage (via top).

So, what is actually going on in my machine is better modeled by being conservative for strings assuming -- unique strings that have an average size of 10, so no string interning -- but liberal for ints, assuming int-caching is taking care of our int objects for us, so we just have to worry about the 8-byte pointers!

In [35]: sum(getsizeof("0123456789") + 8  for record in records)
Out[35]: 2412000000

In [36]: _ + getsizeof(records)
Out[36]: 2709842208

In [37]: _ * 1e-9
Out[37]: 2.709842208

Which is a good model for what I'm observing from top.

If you really want efficient storage

Now, if you really want to cram data into ram, you are going to have to lose the flexibility of Python. You could use the array module in combination with struct, to get C-like memory efficiency. An easier world to wade into might be numpy instead, which allows for similar things. For example:

In [1]: import numpy as np

In [2]: recordtype = np.dtype([('name', 'S20'),('age', np.uint8)])

In [3]: records = np.empty((36000000), dtype=recordtype)

In [4]: records.nbytes
Out[4]: 756000000

In [5]: records.nbytes*1e-9
Out[5]: 0.756

Note, we are now allowed to be quite compact. I can use 8-bit unsigned integers (i.e. a single byte) to represent age. However, immediately I am faced with some inflexibility: if I want efficient storage of strings I must define a maximum size. I've used 'S20', which is 20 characters. These are ASCII bytes, but a field of 20 ascii characters might very well suffice for names.

Now, numpy gives you a lot of fast methods wrapping C-compiled code. So, just to play around with it, let's fill our records with some toy data. Names will simply be string of digits from a simple count, and age will be selected from a normal distribution with a mean of 50 and a standard deviation of 10.

In [8]: for i in range(1, 36000000+1):
   ...:     records['name'][i - 1] = b"%08d" % i
   ...:

In [9]: import random
   ...: for i in range(36000000):
   ...:     records['age'][i] = max(0, int(random.normalvariate(50, 10)))
   ...:

Now, we can use numpy to query our records. For example, if you want the indices of your records given some condition, use np.where:

In [10]: np.where(records['age'] > 70)
Out[10]: (array([      58,      146,      192, ..., 35999635, 35999768, 35999927]),)

In [11]: idx = np.where(records['age'] > 70)[0]

In [12]: len(idx)
Out[12]: 643403

So 643403 records that have an age > 70. Now, let's try 100:

In [13]: idx = np.where(records['age'] > 100)[0]

In [14]: len(idx)
Out[14]: 9

In [15]: idx
Out[15]:
array([ 2315458,  5088296,  5161049,  7079762, 15574072, 17995993,
       25665975, 26724665, 28322943])

In [16]: records[idx]
Out[16]:
array([(b'02315459', 101), (b'05088297', 102), (b'05161050', 101),
       (b'07079763', 104), (b'15574073', 101), (b'17995994', 102),
       (b'25665976', 101), (b'26724666', 102), (b'28322944', 101)],
      dtype=[('name', 'S20'), ('age', 'u1')])

Of course, one major inflexibility is that numpy arrays are sized. Resizing operations are expensive. Now, you could maybe wrap a numpy.array in some class and it will act as an efficient backbone, but at that point, you might as well use a real data-base. Lucky for you, Python comes with sqlite.

Solution 2:

Let's look at this

>>> import sys 
>>> sys.getsizeof({'Name': 'Jordan', 'Age': 25}) * 35000000
10080000000

So ~10 GB. Python is doing exactly what you are asking it to do.

You need to split this up into chucks and check them sequentially.Try this as a starting point

Solution 3:

... 35 million dictionary objects within a list. Each dictionary object contains two key/value pairs. eg. {'Name': 'Jordan', 'Age': 35}

You're right that this manner of storage has considerable overhead.

The Flyweight Design Pattern suggests that the solution involves factoring-out the commonalities. Here are two ideas for alternative storage of the same data with better space utilization.

You can use __slots__ to save space on instances of classes (this suppresses the creation of per-instance dictionaries):

class Person(object):
    __slots__ = ['Name', 'Age']

s = [Person('Jordan', 35), Person('Martin', 31), Person('Mary', 33)]

It is even more space-efficient to use dense data structures like a pair of parallel lists:

s_name = ['Jordan', 'Martin', 'Mary']
s_age = [35, 31, 33]

If there duplicates in the data, you save even more space by interning the values:

s_name = map(intern, s_name)

Or in Python 3:

s_name = list(map(sys.intern, s_name)

Solution 4:

There is compact approach to have millions small objects, which require less memory than with __slots__.

pip3 install recordclass

from recordclass import dataobject

class Person(dataobject):
   name:str
   age:int

Each instance save 16 bytes for python>=3.8 and 24 bytes for python<3.8

Python Library