
# Append list of Python dictionaries to a file without loading it
From: https://stackoverflow.com/a/36246957/2268280

If you want to avoid actually **loading** the file, `json` is not really the right approach. You could use a memory-mapped file instead and never load the file into memory: a `memmap` array can open the file and build an array "on disk" without reading its contents into memory.

Create a memory-mapped array of dicts:

In [1]:
import numpy as np
a = np.memmap('mydict.dat', dtype=object, mode='w+', shape=(4,))
a[0] = {'name':"Joe", 'data':[1,2,3,4]}
a[1] = {'name':"Guido", 'data':[1,3,3,5]}
a[2] = {'name':"Fernando", 'data':[4,2,6,9]}
a[3] = {'name':"Jill", 'data':[9,1,9,0]}
a.flush()
del a

Now read the array, without loading the file:

In [2]:
a = np.memmap('mydict.dat', dtype=object, mode='r')

Calling `tolist()` loads the contents into memory as a Python list, but that step is not required -- you can work with the array on disk without loading it.

In [3]:
a.tolist()

[{'name': 'Joe', 'data': [1, 2, 3, 4]},
 {'name': 'Guido', 'data': [1, 3, 3, 5]},
 {'name': 'Fernando', 'data': [4, 2, 6, 9]},
 {'name': 'Jill', 'data': [9, 1, 9, 0]}]
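
An object-dtype array holds references to in-process Python objects rather than the objects' bytes, so a file written this way may not be readable from another process. A minimal sketch of one alternative, assuming the records fit a fixed-width layout: a structured dtype stores the values themselves in the file, and a new record can be appended as raw bytes without reading what is already there. The `record` layout, the 16-character name limit, and the `records.dat` file name are illustrative assumptions, not part of the original answer.

```python
import os
import tempfile

import numpy as np

# Hypothetical fixed-width layout: a name of up to 16 characters and four integers.
record = np.dtype([('name', 'U16'), ('data', 'i8', (4,))])
path = os.path.join(tempfile.mkdtemp(), 'records.dat')

# Write the first three records through a memory map.
out = np.memmap(path, dtype=record, mode='w+', shape=(3,))
out[0] = ('Joe', [1, 2, 3, 4])
out[1] = ('Guido', [1, 3, 3, 5])
out[2] = ('Fernando', [4, 2, 6, 9])
out.flush()
del out

# Append a fourth record by writing its raw bytes to the end of the file;
# nothing already stored is read back.
extra = np.array([('Jill', [9, 1, 9, 0])], dtype=record)
with open(path, 'ab') as f:
    f.write(extra.tobytes())

# Reopen read-only; the length is inferred from the file size and the dtype.
records = np.memmap(path, dtype=record, mode='r')
as_dicts = [{'name': str(r['name']), 'data': r['data'].tolist()} for r in records]
print(as_dicts)
```

The trade-off is that every record must have the same size, so longer names would be truncated and the `data` lists must all have four entries.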

Creating a memory-mapped array takes a negligible amount of time regardless of the size of the file (e.g. 100 GB), because only the mapping is set up; data is read from disk when the array is indexed.
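
As a rough illustration of that lazy behaviour, the sketch below maps a numeric file and copies only a three-element slice into memory. The file name `big.dat` and the array length are arbitrary choices for the example (kept small here; the same calls apply to much larger files).

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'big.dat')
n = 1_000_000  # illustrative size; about 8 MB of float64 values

# Create the file and set only its last three values.
big = np.memmap(path, dtype=np.float64, mode='w+', shape=(n,))
big[-3:] = [1.0, 2.0, 3.0]
big.flush()
del big

# Mapping the file does not read its contents.
view = np.memmap(path, dtype=np.float64, mode='r')

# Only the requested slice is copied into an in-memory array.
tail = np.array(view[-3:])
print(view.shape, tail)
```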

## Applications
* Filter some data

In [6]:
a[0]['data']

[1, 2, 3, 4]
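
Beyond indexing a single element, rows can be selected with a boolean mask. The following is a sketch under the assumption that the `data` lists are stored as a plain numeric memmap (the file name `data.dat` and the threshold are made up for the example); it keeps the rows whose first value is greater than 1.

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'data.dat')

# Store the four 'data' lists as rows of an integer memmap.
rows = np.memmap(path, dtype=np.int64, mode='w+', shape=(4, 4))
rows[:] = [[1, 2, 3, 4], [1, 3, 3, 5], [4, 2, 6, 9], [9, 1, 9, 0]]
rows.flush()
del rows

# Reopen read-only; a raw memmap needs the 2-D shape to be given again.
rows = np.memmap(path, dtype=np.int64, mode='r', shape=(4, 4))

# Build the mask from the first column, then copy the matching rows.
mask = rows[:, 0] > 1
selected = np.array(rows[mask])
print(selected.tolist())
```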

* Load a file into one memmap:

See [loading csv column into numpy memmap (fast)](https://stackoverflow.com/a/36779509/2268280)



In [8]:
import pandas as pd

In [10]:
pd.DataFrame(a.tolist()).to_json(
    'kk.json',orient='records',
    lines=True)

In [11]:
from numpy.lib.format import open_memmap

In [15]:
import os
import tempfile

# reference array built from the 'data' lists that were written to kk.json
data = np.array([d['data'] for d in a.tolist()], dtype=np.double)

# we need to specify the shape and dtype in advance; the file goes into the
# platform's temporary directory instead of a hard-coded /tmp
path = os.path.join(tempfile.gettempdir(), 'arr.npy')
mmap = open_memmap(path, mode='w+', dtype=np.double, shape=data.shape)

# parse at most 10000 rows at a time, write them to the memmapped array
n = 0
for chunk in pd.read_json(
    'kk.json', orient='records',
    lines=True, chunksize=10000):
    # each chunk is a DataFrame; stack its 'data' lists into a 2-D array
    rows = np.array(chunk['data'].tolist(), dtype=np.double)
    mmap[n:n+rows.shape[0]] = rows
    n += rows.shape[0]

print(np.allclose(data, mmap))
# True
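
A file written with `open_memmap` is a regular `.npy` file, so it can be reopened later as a memory map with `np.load(..., mmap_mode='r')`, which takes the shape and dtype from the file header. The sketch below is self-contained and uses its own temporary path and sample values, both chosen for illustration.

```python
import os
import tempfile

import numpy as np
from numpy.lib.format import open_memmap

path = os.path.join(tempfile.mkdtemp(), 'arr.npy')

# Write a small array through a memory-mapped .npy file.
arr = open_memmap(path, mode='w+', dtype=np.double, shape=(4, 4))
arr[:] = [[1, 2, 3, 4], [1, 3, 3, 5], [4, 2, 6, 9], [9, 1, 9, 0]]
arr.flush()
del arr

# Reopen without reading the data; shape and dtype come from the .npy header.
loaded = np.load(path, mmap_mode='r')
print(type(loaded).__name__, loaded.shape, loaded.dtype)
```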