Basic Usage#

Basics of File Structure#

In this section we continue with the household example from the introduction, diving into more detail about writing and reading hepfiles. In hepfile, all data is grouped into buckets, which in this case correspond to households (and their Household IDs). people and vehicles would be separate groups, with all of their data contained in datasets inside each group. For houses there are two options: you could either make a new group houses, or you could put its data in the _SINGLETONS_GROUP_, a special group designed to store datasets where each bucket has exactly one entry. Because we require every household to have exactly one house, each household has only one entry in # of bedrooms, for example.
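As a rough sketch, the household file could be set up like this using the functions introduced in the rest of this section (the dataset names, such as 'age' and 'n_bedrooms', are made up for this example):

import hepfile

household_data = hepfile.initialize()

# One group per repeated collection; each bucket is one household
hepfile.create_group(household_data, 'people', counter = 'N_people')
hepfile.create_dataset(household_data, 'first_name', group = 'people', dtype = str)
hepfile.create_dataset(household_data, 'age', group = 'people', dtype = int)

hepfile.create_group(household_data, 'vehicles', counter = 'N_vehicles')
hepfile.create_dataset(household_data, 'vehicle_type', group = 'vehicles', dtype = str)

# One-per-household values go in the _SINGLETONS_GROUP_
hepfile.create_dataset(household_data, 'n_bedrooms', dtype = int)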

Writing data#

Initialize the data dictionary#

To create the data dictionary, run

my_data = hepfile.initialize()
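All of the examples below assume hepfile has already been imported:

import hepfile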

Create groups#

To create groups, run

hepfile.create_group(my_data, 'my_group', counter = 'my_counter')

Be aware that if nothing is passed for counter =, hepfile will use 'N_' + group_name (here, 'N_my_group') as the counter name.
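For example, creating a second, hypothetical group without an explicit counter:

hepfile.create_group(my_data, 'my_other_group')   # counter defaults to 'N_my_other_group'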

Create datasets for those groups#

To create a dataset inside of 'my_group', run

hepfile.create_dataset(my_data, 'my_dataset', group = 'my_group', dtype = str)

If nothing is set for group =, then the dataset created will be put into the _SINGLETONS_GROUP_, as shown below

hepfile.create_dataset(my_data, 'my_unique', dtype = int)

If nothing is set for dtype =, the dataset is assumed to store floats. If that assumption is wrong, it will cause problems when writing the data to the HDF5 file, so make sure to set the dataset type correctly. Dataset types cannot be changed after the fact.

An additional feature for convenience is that multiple datasets in the same group and of the same type can be created at the same time by inputting a list of names instead of a single dataset name, like so:

hepfile.create_dataset(my_data, ['data1', 'data2'], group = 'my_group')

Create a single bucket dictionary#

To create a bucket dictionary with the same structure as the overall data dictionary, run

my_bucket = hepfile.create_single_bucket(my_data)

Note that the entire structure of the data dictionary must be finalized before this call, since any datasets or groups created in my_data afterwards will not be reflected in my_bucket.
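Since the bucket is just a dictionary keyed by the 'group/dataset' paths, you can sanity-check its structure directly (the exact set of bookkeeping keys may vary between hepfile versions):

print(list(my_bucket.keys()))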

Loop over data and pack the buckets as you go#

An example of writing data into a bucket and then packing it into the data dictionary is shown below. Note that hepfile.pack can handle both numpy arrays and Python lists, but Python lists are more time and space efficient.

for i in range(5):
    my_bucket['my_group/my_dataset'].append('yes')
    my_bucket['my_group/data1'].append(1.0)
    my_bucket['my_group/data2'].append(2.0)

my_bucket['my_unique'] = 3

hepfile.pack(my_data, my_bucket)

Unlike in ROOT, there is no need to set the group counter at all: hepfile does so automatically, looking at the first non-counter dataset in each group and setting the group counter for that bucket to the length of that dataset. This automatic check takes some extra cycles, so it can be turned off by setting AUTO_SET_COUNTER to False. Simply replace the last line of the previous code block with the following:

my_bucket['my_group/my_counter'] = 5
hepfile.pack(my_data, my_bucket, AUTO_SET_COUNTER = False)

Note that the flag must be set to False, or anything you put in for the counter will be overwritten.

If, for debugging or peace of mind, you want to make sure that all datasets belonging to one group in the bucket are the same length, set the STRICT_CHECKING flag to True. If you were to run the following code, where two datasets in my_group end up with different lengths, pack would not update the data dictionary and would warn you about the mistake:

for i in range(5):
    my_bucket['my_group/my_dataset'].append('yes')
    my_bucket['my_group/data1'].append(1.0)
    my_bucket['my_group/data1'].append(1.5)   # mistake: data1 gets two entries per iteration
    my_bucket['my_group/data2'].append(2.0)

my_bucket['my_unique'] = 3

hepfile.pack(my_data, my_bucket, STRICT_CHECKING = True)

Normally, pack clears the bucket after writing the data to the data dictionary. To disable this behavior for debugging purposes, set the flag EMPTY_OUT_BUCKET to False. The following two lines are equivalent to hepfile.pack(my_data, my_bucket):

hepfile.pack(my_data, my_bucket, EMPTY_OUT_BUCKET = False)
hepfile.clear_bucket(my_bucket)

Finally, if you want to look at the structure of the bucket dictionary while packing it, you can set the verbose flag to True. Note that this has no effect unless AUTO_SET_COUNTER is True (its default).
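For example:

hepfile.pack(my_data, my_bucket, verbose = True)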

Adding metadata to groups and datasets#

To add metadata to the groups and datasets you can use hepfile.add_meta. This function takes the data dictionary, the full path to the group or dataset, and a list of the metadata to add to that group or dataset. Here is an example:

hepfile.add_meta(my_data, 'my_group', ['Some data description'])
hepfile.add_meta(my_data, 'my_group/data1', ['Some unit for the data'])

Write the data to file#

To write the data dictionary to a file, run

hepfile.write_to_file('my_file.hdf5', my_data)

Note that the data dictionary must be complete, as you cannot edit the file once it has been created.

Write metadata to file#

On execution of write_to_file, some metadata will automatically be written to the file. This includes the date the file was created and the version numbers of hepfile, numpy, h5py, and Python used to create it. If more metadata is needed, it can be added with the following line of code:

hepfile.write_file_metadata('my_file.hdf5', mydict = {'author':'John Doe'})

Due to limitations placed on hepfile by h5py, only 60k bytes of metadata can be added to the attributes of an HDF5 file.

If you do not want hepfile to rewrite the default metadata while adding your own, you can set the flag write_default_values to False like so:

hepfile.write_file_metadata('my_file.hdf5', mydict = {'author': 'John Doe'},
                            write_default_values = False)

If you want to delete all existing metadata from an HDF5 file, you can set the flag append to False. Note that this will delete the default metadata as well, so it must be added again; this can be done by calling the function once more, passing nothing for mydict and either setting write_default_values to True or leaving it at its default. An example is shown below:

hepfile.write_file_metadata('my_file.hdf5', mydict = {'author': 'John Doe'}, append = False)
hepfile.write_file_metadata('my_file.hdf5')

Adding header information to a file#

Many other file formats (ROOT, FITS, etc.) store information about the file, experiment, or observation in a header. hepfile provides this functionality with the hepfile.write_file_header method. Just like hepfile.write_file_metadata, this takes a filename and a dictionary of data to store in the header:

hepfile.write_file_header('my_file.hdf5', mydict={'Observer': 'John Doe', 'Observation Time': '00:00:01'})

Reading data#

Load in the data#

To load the data in from the file my_file.hdf5, run

data, bucket = hepfile.load('my_file.hdf5')

data is a dictionary with all the data from the file (organized in the hepfile schema), and bucket is an empty dictionary with the same structure ready to be filled with specific buckets from data.
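As a sketch of filling the bucket, assuming hepfile's unpack helper copies bucket number n from data into the bucket dictionary (the index 2 here is arbitrary):

hepfile.unpack(bucket, data, n = 2)
print(bucket['my_group/data1'])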

Let's say you only want to see the datasets my_unique and data1. You can limit memory use by pulling in only these datasets from the file using the desired_datasets argument. Simply call

data, bucket = hepfile.load('my_file.hdf5', desired_datasets = ['my_unique', 'data1'])

data and bucket will then contain only the datasets 'my_unique' and 'my_group/data1' (filled in data, empty in bucket). Note that desired_datasets works by string matching: passing just 'data' would extract both 'my_group/data1' and 'my_group/data2'. To extract a specific group and every dataset in it, passing the group name works, since 'my_group' in 'my_group/my_dataset' == True.
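For instance, the following loads every dataset in my_group and nothing else:

data, bucket = hepfile.load('my_file.hdf5', desired_datasets = ['my_group'])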

Additionally, the file may contain more data than you want to analyze. In this case, simply set the subset argument to the range of buckets you want to study. For example, if you only cared about buckets 2-5, you would run

data, bucket = hepfile.load('my_file.hdf5', subset = [2,5])

Alternatively, if you want to load in just the first N buckets, you could run

data, bucket = hepfile.load('my_file.hdf5', subset = N)

If N is greater than the total number of buckets, the upper range will be set at the last bucket in the data file.