Working with hepfile Metadata and Headers#

[1]:
import hepfile as hf

Let’s start by writing a simple hepfile, just like the one from the Writing hepfiles from Dictionaries tutorial.

[2]:
# create the data

event1 = {
    'jet': {
        'px': [1, 2, 3],
        'py': [1, 2, 3]
    },
    'muons': {
        'px': [1, 2, 3],
        'py': [1, 2, 3]
    },
    'nParticles': 3
}

event2 = {
    'jet': {
        'px': [3, 4, 6, 7],
        'py': [3, 4, 6, 7]
    },
    'muons': {
        'px': [3, 4, 6, 7],
        'py': [3, 4, 6, 7]
    },
    'nParticles': 4
}

events = [event1, event2]
[3]:
# write the data to a hepfile
filename = 'output_from_dict.hdf5'
data = hf.dict_tools.dictlike_to_hepfile(events, filename, how_to_pack='awkward')
data.show()
[{jet: {px: [1, ..., 3], py: [...]}, muons: {px: ..., ...}, ...},
 {jet: {px: [3, ..., 7], py: [...]}, muons: {px: ..., ...}, ...}]

Hepfile Metadata#

Metadata is stored using hdf5 attributes and can be accessed at the file level, group level, and dataset level. The following subsections give examples of writing and accessing each of these types of metadata.
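
Because the metadata lives in ordinary hdf5 attributes, it can also be inspected with plain h5py. The short sketch below is not part of the hepfile API, just a way to see where the attributes end up; the group- and dataset-level attributes will only be populated once such metadata has actually been written to the file.

import h5py

# peek at the raw hdf5 attributes hepfile writes (plain h5py, no hepfile needed)
with h5py.File('output_from_dict.hdf5', 'r') as f:
    print(dict(f.attrs))              # file-level attributes
    print(dict(f['muons'].attrs))     # group-level attributes
    print(dict(f['muons/px'].attrs))  # dataset-level attributes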

Writing File Level Metadata#

To write file level metadata, we use the hepfile.write_file_metadata function.

[4]:
help(hf.write_file_metadata)
Help on function write_file_metadata in module hepfile.write:

write_file_metadata(filename: 'str', mydict: 'dict' = None, write_default_values: 'bool' = True, append: 'bool' = True, verbose: 'bool' = False) -> 'h5.File'
    Writes file metadata in the attributes of an HDF5 file

    Args:
        filename (string): Name of output file

        mydict (dictionary): Metadata desired by user

        write_default_values (boolean): True if user wants to write/update the
                                        default metadata: date, hepfile version,
                                        h5py version, numpy version, and Python
                                        version, false if otherwise.

        append (boolean): True if user wants to keep older metadata, false otherwise.
        verbose (boolean): True to print out statements as it goes

    Returns:
        h5py.File: HDF5 File with new metadata

As you can see above, the write_file_metadata function has an optional mydict argument for writing additional file metadata. By default, the date, hepfile version, h5py version, numpy version, and Python version are also written to the file metadata; this can be disabled by setting write_default_values=False. The write_file_metadata function also requires the filename to which the metadata is saved. Finally, the append argument lets the user decide whether to append to or overwrite existing metadata. Setting append=False can be very dangerous because important information for reading and writing hepfiles is stored in the metadata, so only do this if you know what you’re doing!
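
For illustration, here is a sketch of those optional arguments in use; the 'Project' entry is just a made-up example key, and this sketch is not one of the executed cells below.

# add one extra piece of metadata without re-writing the default date/version info,
# while keeping (appending to) whatever metadata is already in the file
hf.write_file_metadata(
    filename,
    {'Project': 'hepfile metadata tutorial'},  # example mydict entry (made up for this sketch)
    write_default_values=False,
    append=True
);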

To update the file metadata, let’s add an author name and institution.

[5]:
meta = {'Author': 'Your Name',
        'Institution': 'Siena College'}

hf.write_file_metadata(filename, meta);

Reading File Level Metadata#

To view the file level metadata, we can use the hf.print_file_metadata function, which simply takes a file name and prints out the metadata.

[6]:
hf.print_file_metadata(filename);
date                 : 2023-07-20 21:53:24.666306
_NUMBER_OF_BUCKETS_  : 2
h5py_version         : 3.9.0
hepfile_version      : 0.1.7
numpy_version        : 1.23.5
python_version       : 3.9.17 (main, Jul  5 2023, 21:05:34)
[GCC 11.2.0]
Author               : Your Name
Institution          : Siena College

As you can see, the default information is in this metadata, and so are the Author’s name and Institution that we added earlier! To get the metadata as a dictionary instead, we can use the hf.get_file_metadata function.

[7]:
out_meta = hf.get_file_metadata(filename)
print(out_meta['Author'])
print(out_meta['Institution'])
Your Name
Siena College

Writing Group and Dataset Metadata#

Just like classic hdf5 files, hepfiles can also have metadata attached directly to groups and/or datasets. This allows us to include important information about a specific group, or to attach things like units to datasets. This needs to be done directly on a hepfile data object, so let’s edit the data object from above and add some group metadata.

First, we need to convert the data object from an awkward array into a more classical form using hf.awkward_tools.awkward_to_hepfile.

[8]:
newdata = hf.awkward_tools.awkward_to_hepfile(data, write_hepfile=False)
newdata
[8]:
{'_GROUPS_': {'_SINGLETONS_GROUP_': ['COUNTER', 'nParticles'],
  'jet': ['njet', 'px', 'py'],
  'muons': ['nmuons', 'px', 'py']},
 '_MAP_DATASETS_TO_COUNTERS_': {'_SINGLETONS_GROUP_': '_SINGLETONS_GROUP_/COUNTER',
  'jet': 'jet/njet',
  'jet/px': 'jet/njet',
  'jet/py': 'jet/njet',
  'muons': 'muons/nmuons',
  'muons/px': 'muons/nmuons',
  'muons/py': 'muons/nmuons',
  'nParticles': '_SINGLETONS_GROUP_/COUNTER'},
 '_LIST_OF_COUNTERS_': ['_SINGLETONS_GROUP_/COUNTER',
  'jet/njet',
  'muons/nmuons'],
 '_SINGLETONS_GROUP_/COUNTER': array([1, 1]),
 '_MAP_DATASETS_TO_DATA_TYPES_': {'_SINGLETONS_GROUP_/COUNTER': int,
  'jet/njet': int,
  'jet/px': dtype('int64'),
  'jet/py': dtype('int64'),
  'muons/nmuons': int,
  'muons/px': dtype('int64'),
  'muons/py': dtype('int64'),
  'nParticles': dtype('int64')},
 '_META_': {},
 'jet/px': array([1, 2, 3, 3, 4, 6, 7]),
 'jet/njet': array([3, 4], dtype=int32),
 'jet/py': array([1, 2, 3, 3, 4, 6, 7]),
 'muons/px': array([1, 2, 3, 3, 4, 6, 7]),
 'muons/nmuons': array([3, 4], dtype=int32),
 'muons/py': array([1, 2, 3, 3, 4, 6, 7]),
 'nParticles': array([3, 4])}

Now that we have a more classical data object, we can use hf.add_meta to add metadata to the protected _META_ group. hf.add_meta takes in a data object, a group (or singleton or dataset) name, and the metadata to add. Let’s first add metadata to the muons group.

[9]:
hf.add_meta(newdata, 'muons', 'This is data for a subatomic particle')
newdata
[9]:
{'_GROUPS_': {'_SINGLETONS_GROUP_': ['COUNTER', 'nParticles'],
  'jet': ['njet', 'px', 'py'],
  'muons': ['nmuons', 'px', 'py']},
 '_MAP_DATASETS_TO_COUNTERS_': {'_SINGLETONS_GROUP_': '_SINGLETONS_GROUP_/COUNTER',
  'jet': 'jet/njet',
  'jet/px': 'jet/njet',
  'jet/py': 'jet/njet',
  'muons': 'muons/nmuons',
  'muons/px': 'muons/nmuons',
  'muons/py': 'muons/nmuons',
  'nParticles': '_SINGLETONS_GROUP_/COUNTER'},
 '_LIST_OF_COUNTERS_': ['_SINGLETONS_GROUP_/COUNTER',
  'jet/njet',
  'muons/nmuons'],
 '_SINGLETONS_GROUP_/COUNTER': array([1, 1]),
 '_MAP_DATASETS_TO_DATA_TYPES_': {'_SINGLETONS_GROUP_/COUNTER': int,
  'jet/njet': int,
  'jet/px': dtype('int64'),
  'jet/py': dtype('int64'),
  'muons/nmuons': int,
  'muons/px': dtype('int64'),
  'muons/py': dtype('int64'),
  'nParticles': dtype('int64')},
 '_META_': {'muons': 'This is data for a subatomic particle'},
 'jet/px': array([1, 2, 3, 3, 4, 6, 7]),
 'jet/njet': array([3, 4], dtype=int32),
 'jet/py': array([1, 2, 3, 3, 4, 6, 7]),
 'muons/px': array([1, 2, 3, 3, 4, 6, 7]),
 'muons/nmuons': array([3, 4], dtype=int32),
 'muons/py': array([1, 2, 3, 3, 4, 6, 7]),
 'nParticles': array([3, 4])}

Notice that the metadata is now stored in the _META_ group!

We can also add metadata for singletons. Let’s add some to nParticles.

[10]:
hf.add_meta(newdata, 'nParticles', 'This is how many muons were observed in each event.')
newdata
[10]:
{'_GROUPS_': {'_SINGLETONS_GROUP_': ['COUNTER', 'nParticles'],
  'jet': ['njet', 'px', 'py'],
  'muons': ['nmuons', 'px', 'py']},
 '_MAP_DATASETS_TO_COUNTERS_': {'_SINGLETONS_GROUP_': '_SINGLETONS_GROUP_/COUNTER',
  'jet': 'jet/njet',
  'jet/px': 'jet/njet',
  'jet/py': 'jet/njet',
  'muons': 'muons/nmuons',
  'muons/px': 'muons/nmuons',
  'muons/py': 'muons/nmuons',
  'nParticles': '_SINGLETONS_GROUP_/COUNTER'},
 '_LIST_OF_COUNTERS_': ['_SINGLETONS_GROUP_/COUNTER',
  'jet/njet',
  'muons/nmuons'],
 '_SINGLETONS_GROUP_/COUNTER': array([1, 1]),
 '_MAP_DATASETS_TO_DATA_TYPES_': {'_SINGLETONS_GROUP_/COUNTER': int,
  'jet/njet': int,
  'jet/px': dtype('int64'),
  'jet/py': dtype('int64'),
  'muons/nmuons': int,
  'muons/px': dtype('int64'),
  'muons/py': dtype('int64'),
  'nParticles': dtype('int64')},
 '_META_': {'muons': 'This is data for a subatomic particle',
  'nParticles': 'This is how many muons were observed in each event.'},
 'jet/px': array([1, 2, 3, 3, 4, 6, 7]),
 'jet/njet': array([3, 4], dtype=int32),
 'jet/py': array([1, 2, 3, 3, 4, 6, 7]),
 'muons/px': array([1, 2, 3, 3, 4, 6, 7]),
 'muons/nmuons': array([3, 4], dtype=int32),
 'muons/py': array([1, 2, 3, 3, 4, 6, 7]),
 'nParticles': array([3, 4])}

Finally, we can add some units to a dataset by giving it metadata! Let’s add units to all of the momentum components.

[11]:
for key in ['jet/px', 'jet/py', 'muons/px', 'muons/py']:  # loop over all the momentum components
    hf.add_meta(newdata, key, 'kg * m / s')  # attach units to each component
newdata
[11]:
{'_GROUPS_': {'_SINGLETONS_GROUP_': ['COUNTER', 'nParticles'],
  'jet': ['njet', 'px', 'py'],
  'muons': ['nmuons', 'px', 'py']},
 '_MAP_DATASETS_TO_COUNTERS_': {'_SINGLETONS_GROUP_': '_SINGLETONS_GROUP_/COUNTER',
  'jet': 'jet/njet',
  'jet/px': 'jet/njet',
  'jet/py': 'jet/njet',
  'muons': 'muons/nmuons',
  'muons/px': 'muons/nmuons',
  'muons/py': 'muons/nmuons',
  'nParticles': '_SINGLETONS_GROUP_/COUNTER'},
 '_LIST_OF_COUNTERS_': ['_SINGLETONS_GROUP_/COUNTER',
  'jet/njet',
  'muons/nmuons'],
 '_SINGLETONS_GROUP_/COUNTER': array([1, 1]),
 '_MAP_DATASETS_TO_DATA_TYPES_': {'_SINGLETONS_GROUP_/COUNTER': int,
  'jet/njet': int,
  'jet/px': dtype('int64'),
  'jet/py': dtype('int64'),
  'muons/nmuons': int,
  'muons/px': dtype('int64'),
  'muons/py': dtype('int64'),
  'nParticles': dtype('int64')},
 '_META_': {'muons': 'This is data for a subatomic particle',
  'nParticles': 'This is how many muons were observed in each event.',
  'jet/px': 'kg * m / s',
  'jet/py': 'kg * m / s',
  'muons/px': 'kg * m / s',
  'muons/py': 'kg * m / s'},
 'jet/px': array([1, 2, 3, 3, 4, 6, 7]),
 'jet/njet': array([3, 4], dtype=int32),
 'jet/py': array([1, 2, 3, 3, 4, 6, 7]),
 'muons/px': array([1, 2, 3, 3, 4, 6, 7]),
 'muons/nmuons': array([3, 4], dtype=int32),
 'muons/py': array([1, 2, 3, 3, 4, 6, 7]),
 'nParticles': array([3, 4])}

Reading Group and Dataset Metadata#

To view the metadata, all we need to do is retrieve the _META_ group from newdata.

[12]:
# get all the metadata
print(f"All Metadata:\n{newdata['_META_']}")
print()

# metadata of muons
print(f"Muons Metadata:\n{newdata['_META_']['muons']}")
All Metadata:
{'muons': 'This is data for a subatomic particle', 'nParticles': 'This is how many muons were observed in each event.', 'jet/px': 'kg * m / s', 'jet/py': 'kg * m / s', 'muons/px': 'kg * m / s', 'muons/py': 'kg * m / s'}

Muons Metadata:
This is data for a subatomic particle
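
Note that so far we have only edited newdata in memory. To persist this group and dataset metadata you would write the dictionary back out to a hepfile; a minimal sketch, assuming hf.write_to_file accepts a packed data dictionary like this one (the output filename here is just an example).

# write the metadata-augmented dictionary to a new hepfile
# (assumption: hf.write_to_file takes a filename and a hepfile data dictionary)
hf.write_to_file('output_with_meta.hdf5', newdata)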

Hepfile Headers#

While headers are not directly built into hdf5, we add the ability to write headers for hepfiles because they can hold important information about the data in the file. Additionally, other file structures that users may want to translate to hepfiles have header information that needs to be stored in the hepfile. We allow for headers by saving the information as a set of datasets underneath a protected group name (_HEADER_). There are some useful functions built into hepfile to help with writing and reading this protected group.
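
Because the header is stored as ordinary datasets under the protected _HEADER_ group, it can also be inspected with plain h5py if needed; a quick sketch (assuming a header has already been written to the file, as we do below).

import h5py

# list the raw datasets hepfile stores under the protected _HEADER_ group
with h5py.File(filename, 'r') as f:
    if '_HEADER_' in f:
        for name, dset in f['_HEADER_'].items():
            print(name, dset[:])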

Writing Header Information#

To add a header to a hepfile, we just need to use the hf.write_file_header function. This takes in a filename and a dictionary of the header information. As an example, let’s add the author, institution, and a phone number, this time to the hepfile header.

[13]:
hdr = {'Author': 'Your Name',
       'Institution': 'Siena College',
       'Phone Number': 1234567890}

hf.write_file_header(filename, hdr);

Reading Header Information#

To show the header, we can use hf.print_file_header, which just takes in a filename, prints the header, and returns the formatted string.

[14]:
hf.print_file_header(filename);
################################################################
###                      Hepfile Header                      ###
################################################################
################################################################
Author:                 Your Name
Institution:                    Siena College
Phone Number:                   1234567890

To return the header instead of printing it, we can use hf.get_file_header, which takes in a filename and a return_type. The return_type can be either dict, which returns a dictionary, or df, which returns a pandas DataFrame.

[15]:
header = hf.get_file_header(filename, return_type='df')
header
[15]:
      Author    Institution  Phone Number
0  Your Name  Siena College    1234567890
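
To get the header as a plain dictionary instead, the same call can be made with return_type='dict', per the description above; a quick sketch.

# same header, returned as a dictionary rather than a dataframe
header_dict = hf.get_file_header(filename, return_type='dict')
print(header_dict)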

More Advanced Headers#

Let’s say we instead have the following header from another file type, like a FITS file, that has fields, values, and comments.

[16]:
fits_hdr = '''
AUTHOR:\t\tYour Name / This is a comment that says this field is the author's name
INSTITUTION:\t\tSiena College / This is another institution
BEAM ENERGY:\t\t13 / TeV
BEAM TYPE:\t\tprotons / Beam Type
'''

print(fits_hdr)

AUTHOR:         Your Name / This is a comment that says this field is the author's name
INSTITUTION:            Siena College / This is another institution
BEAM ENERGY:            13 / TeV
BEAM TYPE:              protons / Beam Type

We can then parse this header and organize it into three different datasets to be stored in the header: fields, values, and comments.

[17]:
hdr = {
    'fields': [],
    'values': [],
    'comments': []
}

for line in fits_hdr.split('\n'):
    if len(line) == 0:
        continue  # skip blank lines

    hdr['fields'].append(line.split(':')[0])                        # field name before the colon
    hdr['values'].append(line.split(':')[1].split('/')[0].strip())  # value between the colon and the slash
    hdr['comments'].append(line.split('/')[-1].strip())             # comment after the slash

print(hdr)
{'fields': ['AUTHOR', 'INSTITUTION', 'BEAM ENERGY', 'BEAM TYPE'], 'values': ['Your Name', 'Siena College', '13', 'protons'], 'comments': ["This is a comment that says this field is the author's name", 'This is another institution', 'TeV', 'Beam Type']}

This is now in a workable format to pass into hf.write_file_header

[18]:
hf.write_file_header(filename, hdr)
hf.print_file_header(filename);
################################################################
###                      Hepfile Header                      ###
################################################################
################################################################
comments:                       This is a comment that says this field is the author's name
                        This is another institution
                        TeV
                        Beam Type
fields:                 AUTHOR
                        INSTITUTION
                        BEAM ENERGY
                        BEAM TYPE
values:                 Your Name
                        Siena College
                        13
                        protons