Converting a ROOT File to a hepfile

An important application of hepfile is storing high energy physics data, so many users may want to convert their ROOT files to hepfiles. This tutorial walks through how to do that.

[1]:
# imports
# note that you may need to pip install uproot
import hepfile as hf
import uproot
import awkward as ak
import numpy as np
import pandas as pd

First, we need to download a ROOT file from CERN’s open data repository. This file is large so it may take some time to download.

[2]:
# Download a file for us to play with
!curl http://opendata.cern.ch/record/12361/files/SMHiggsToZZTo4L.root --output SMHiggsToZZTo4L.root
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 40.4M  100 40.4M    0     0  11.1M      0  0:00:03  0:00:03 --:--:-- 11.1M

Now we can use uproot to read in the ROOT data and look at it.

[3]:
# This is all for demonstration purposes, to show people how this type of
# writing could be done.
# But of course people could just create their own awkward arrays.

f = uproot.open('SMHiggsToZZTo4L.root')
events = f['Events']
print('Events:')
print(events)
print()
print('Keys in Events:')
print(events.keys())
Events:
<TTree 'Events' (32 branches) at 0x7f8d9838f1c0>

Keys in Events:
['run', 'luminosityBlock', 'event', 'PV_npvs', 'PV_x', 'PV_y', 'PV_z', 'nMuon', 'Muon_pt', 'Muon_eta', 'Muon_phi', 'Muon_mass', 'Muon_charge', 'Muon_pfRelIso03_all', 'Muon_pfRelIso04_all', 'Muon_dxy', 'Muon_dxyErr', 'Muon_dz', 'Muon_dzErr', 'nElectron', 'Electron_pt', 'Electron_eta', 'Electron_phi', 'Electron_mass', 'Electron_charge', 'Electron_pfRelIso03_all', 'Electron_dxy', 'Electron_dxyErr', 'Electron_dz', 'Electron_dzErr', 'MET_pt', 'MET_phi']

uproot reads in the ROOT file as a TTree object, so we need to parse it into a form that is easier to work with. While not all the entries in the ROOT file naturally lend themselves to group/dataset breakdowns, some do. Let’s find those “automatically”, just to make it easier to write them to the hepfile.

[4]:
# Find groups
def make_groups_and_datasets(fields):

    groups = {}

    for field in fields:
        if '_' in field:

            # Split at the first underscore only, in case there is more than one
            idx = field.find('_')

            grp = field[0:idx]
            dset = field[idx+1:]

            if grp not in groups.keys():
                groups[grp] = [[field,dset]]
            else:
                groups[grp].append([field,dset])

    return groups


############################################################

groupings = make_groups_and_datasets(events.keys())

# Groupings gives us a nice mapping of the names from the ROOT file
# to how we're going to store them in our hepfile as
# group/datasets
print(groupings)
print()
print(groupings['Muon'])
{'PV': [['PV_npvs', 'npvs'], ['PV_x', 'x'], ['PV_y', 'y'], ['PV_z', 'z']], 'Muon': [['Muon_pt', 'pt'], ['Muon_eta', 'eta'], ['Muon_phi', 'phi'], ['Muon_mass', 'mass'], ['Muon_charge', 'charge'], ['Muon_pfRelIso03_all', 'pfRelIso03_all'], ['Muon_pfRelIso04_all', 'pfRelIso04_all'], ['Muon_dxy', 'dxy'], ['Muon_dxyErr', 'dxyErr'], ['Muon_dz', 'dz'], ['Muon_dzErr', 'dzErr']], 'Electron': [['Electron_pt', 'pt'], ['Electron_eta', 'eta'], ['Electron_phi', 'phi'], ['Electron_mass', 'mass'], ['Electron_charge', 'charge'], ['Electron_pfRelIso03_all', 'pfRelIso03_all'], ['Electron_dxy', 'dxy'], ['Electron_dxyErr', 'dxyErr'], ['Electron_dz', 'dz'], ['Electron_dzErr', 'dzErr']], 'MET': [['MET_pt', 'pt'], ['MET_phi', 'phi']]}

[['Muon_pt', 'pt'], ['Muon_eta', 'eta'], ['Muon_phi', 'phi'], ['Muon_mass', 'mass'], ['Muon_charge', 'charge'], ['Muon_pfRelIso03_all', 'pfRelIso03_all'], ['Muon_pfRelIso04_all', 'pfRelIso04_all'], ['Muon_dxy', 'dxy'], ['Muon_dxyErr', 'dxyErr'], ['Muon_dz', 'dz'], ['Muon_dzErr', 'dzErr']]
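
The same split can be written more compactly with `str.partition`, which cuts at the first underscore only. This is a minimal sketch and not part of hepfile; the function name `group_fields` and the shortened field list are illustrative.

```python
from collections import defaultdict


def group_fields(fields):
    """Map each prefix to [full_name, dataset_name] pairs, split at the first underscore."""
    groups = defaultdict(list)
    for field in fields:
        prefix, sep, dataset = field.partition("_")
        if sep:  # skip fields with no underscore, such as 'run'
            groups[prefix].append([field, dataset])
    return dict(groups)


# A shortened, illustrative subset of the branch names printed above
sample_fields = ["run", "nMuon", "Muon_pt", "Muon_pfRelIso03_all", "MET_phi"]
print(group_fields(sample_fields))
```

Because only the first underscore is used, a name like `Muon_pfRelIso03_all` keeps `pfRelIso03_all` as its dataset name, matching the output above.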

The datasets that do not fit in this group/dataset structure can be written as singletons.

[5]:
# There are some others. These will be SINGLETONS that we pass in separately.
# 'run',
# 'luminosityBlock',
# 'event',

print(events['run'].array())
print(events['luminosityBlock'].array())
print(events['event'].array())
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..., 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
[156, 156, 156, 156, 156, 156, 156, 156, ..., 996, 996, 996, 996, 996, 996, 996]
[46501, 46502, 46503, 46504, 46505, ..., 298796, 298797, 298798, 298799, 298800]
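
The singleton names could also be found automatically instead of being listed by hand. The sketch below assumes, as the key list above suggests, that per-group counters are named `n` plus the group name (`nMuon`, `nElectron`); `find_singletons` is a hypothetical helper, not a hepfile function.

```python
def find_singletons(fields):
    """Return fields that belong to no group and are not per-group counters."""
    # Group names are the prefixes before the first underscore
    group_names = {f.split("_", 1)[0] for f in fields if "_" in f}
    # Assumption: counters follow the 'n<Group>' naming seen in this file
    counters = {"n" + name for name in group_names}
    return [f for f in fields if "_" not in f and f not in counters]


# A shortened, illustrative subset of the branch names printed above
sample_fields = ["run", "luminosityBlock", "event", "nMuon", "Muon_pt",
                 "nElectron", "Electron_pt", "MET_pt"]
print(find_singletons(sample_fields))
```

A file with a different naming scheme would need a different rule for spotting counters, so it is worth checking the result against `events.keys()` before relying on it.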

We can then pack this data into a hepfile data dictionary using hf.awkward_tools.pack_multiple_awkward_arrays. First, we need to initialize a data dictionary:

[6]:
# Initialize the data dictionary
data = hf.initialize()

Then, we pack the groups and dataset pairs into data:

[7]:
# Pack these groups of awkward arrays

# This is what it would look like "by hand"
# A dictionary with the name of the dataset as it is to appear inside the hepfile
# and then the actual awkward array (not just the Branch object returned by uproot)

# Here I'm packing all the data that are groups/datasets
for groups_to_write in ['Muon', 'Electron', 'MET', 'PV']:
    ak_arrays = {}
    for grouping in groupings[groups_to_write]:
        ak_arrays[grouping[1]] = events[grouping[0]].array()

    hf.awkward_tools.pack_multiple_awkward_arrays(data, ak_arrays, group_name=groups_to_write)

Then, we can pack the singletons into data.

[8]:
# Now the SINGLETONS
ak_arrays = {"run": events['run'].array(),
             "luminosityBlock": events['luminosityBlock'].array(),
             "event": events['event'].array()}

# Note that there is no group name passed in.
hf.awkward_tools.pack_multiple_awkward_arrays(data, ak_arrays)

Let’s take a look at the keys in data and see how we did!

[9]:
print(data.keys())
dict_keys(['_GROUPS_', '_MAP_DATASETS_TO_COUNTERS_', '_LIST_OF_COUNTERS_', '_SINGLETONS_GROUP_/COUNTER', '_MAP_DATASETS_TO_DATA_TYPES_', '_META_', 'Muon/pt', 'Muon/nMuon', 'Muon/eta', 'Muon/phi', 'Muon/mass', 'Muon/charge', 'Muon/pfRelIso03_all', 'Muon/pfRelIso04_all', 'Muon/dxy', 'Muon/dxyErr', 'Muon/dz', 'Muon/dzErr', 'Electron/pt', 'Electron/nElectron', 'Electron/eta', 'Electron/phi', 'Electron/mass', 'Electron/charge', 'Electron/pfRelIso03_all', 'Electron/dxy', 'Electron/dxyErr', 'Electron/dz', 'Electron/dzErr', 'MET/pt', 'MET/nMET', 'MET/phi', 'PV/npvs', 'PV/nPV', 'PV/x', 'PV/y', 'PV/z', 'run', 'luminosityBlock', 'event'])
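
With this many keys, it can help to sort them by kind before checking them. The sketch below relies only on the naming pattern visible in the printed keys (a leading underscore for bookkeeping entries, a `/` for group/dataset entries); that pattern is an observation from this output, not a documented guarantee, and `classify_keys` is a hypothetical helper.

```python
def classify_keys(keys):
    """Sort data-dictionary keys into bookkeeping, group/dataset, and singleton entries."""
    kinds = {"bookkeeping": [], "datasets": [], "singletons": []}
    for key in keys:
        if key.startswith("_"):
            # e.g. '_GROUPS_' or '_SINGLETONS_GROUP_/COUNTER'
            kinds["bookkeeping"].append(key)
        elif "/" in key:
            # e.g. 'Muon/pt'
            kinds["datasets"].append(key)
        else:
            # e.g. 'run'
            kinds["singletons"].append(key)
    return kinds


# A shortened, illustrative subset of the keys printed above
sample_keys = ["_GROUPS_", "_SINGLETONS_GROUP_/COUNTER", "Muon/pt",
               "Muon/nMuon", "MET/phi", "run", "event"]
for kind, names in classify_keys(sample_keys).items():
    print(kind, names)
```

Passing `data.keys()` instead of `sample_keys` would apply the same check to the full dictionary.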

It looks good! So, finally, we write this data to a hepfile!

[10]:
# Try it with no compression
hf.write_to_file('root_to_hepfile.h5', data, verbose=False)
[10]:
<Closed HDF5 file>

This is a rudimentary example and you can imagine making this process more automated, especially if you need to do this on lots of files. But, for now, this is an efficient way to convert a large ROOT file into a hepfile!
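
As one possible starting point for that automation, the steps above can be wrapped in a function. This is a sketch under stated assumptions: it reuses only the uproot and hepfile calls shown in this tutorial, assumes every file has an `Events` tree with the same branch layout as this one, and the names `convert` and `output_name` are hypothetical.

```python
from pathlib import Path


def output_name(root_path, suffix=".h5"):
    """Return the hepfile path for a ROOT file: same location, new extension."""
    return Path(root_path).with_suffix(suffix)


def convert(root_path,
            groups=("Muon", "Electron", "MET", "PV"),
            singletons=("run", "luminosityBlock", "event")):
    """Convert one ROOT file using the same calls as the cells above."""
    # Imported here so output_name can be used without these packages
    import hepfile as hf
    import uproot

    # Assumption: the tree is called 'Events', as in this tutorial's file
    events = uproot.open(root_path)["Events"]
    data = hf.initialize()

    # Pack each group, stripping the 'Group_' prefix from the branch names
    for group in groups:
        prefix = group + "_"
        ak_arrays = {key[len(prefix):]: events[key].array()
                     for key in events.keys() if key.startswith(prefix)}
        hf.awkward_tools.pack_multiple_awkward_arrays(data, ak_arrays, group_name=group)

    # Pack the singletons with no group name
    ak_arrays = {name: events[name].array() for name in singletons}
    hf.awkward_tools.pack_multiple_awkward_arrays(data, ak_arrays)

    out = output_name(root_path)
    hf.write_to_file(str(out), data, verbose=False)
    return out


print(output_name("SMHiggsToZZTo4L.root"))
```

Calling `convert(path)` in a loop over a list of ROOT files would then produce one hepfile per input, written next to the original.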