Reading hepfiles

Note: If you have not yet run through write_hepfile_from_scratch.ipynb, do that first. It generates the output file that is used as the input here!

Reading the Entire File

[1]:
import os

# import the load function
from hepfile import load

We begin by checking that the input file exists and then load it, which returns two dictionaries, data and event:

[2]:
infile = 'output_from_scratch.hdf5'
if not os.path.exists(infile):
    raise FileNotFoundError('Make sure you ran through `write_hepfile_from_scratch.ipynb` before this tutorial!')
else:
    data, event = load(infile)

data is a dictionary containing counters, indices, and data for all the features we care about. event is an empty dictionary (every value is None) waiting to be filled by data from some new event.

[3]:
print(data.keys())
dict_keys(['_MAP_DATASETS_TO_COUNTERS_', '_MAP_DATASETS_TO_INDEX_', '_LIST_OF_COUNTERS_', '_LIST_OF_DATASETS_', '_META_', '_NUMBER_OF_BUCKETS_', '_SINGLETONS_GROUP_', '_SINGLETONS_GROUP_/COUNTER', '_SINGLETONS_GROUP_/COUNTER_INDEX', 'jet/njet', 'jet/njet_INDEX', 'muons/nmuon', 'muons/nmuon_INDEX', 'METpx', 'METpy', 'jet/algorithm', 'jet/e', 'jet/px', 'jet/py', 'jet/pz', 'jet/words', 'muons/e', 'muons/px', 'muons/py', 'muons/pz', '_GROUPS_', '_MAP_DATASETS_TO_DATA_TYPES_', '_PROTECTED_NAMES_'])
[4]:
print(event)
{'METpx': None, 'METpy': None, '_SINGLETONS_GROUP_/COUNTER': None, 'jet/algorithm': None, 'jet/e': None, 'jet/njet': None, 'jet/px': None, 'jet/py': None, 'jet/pz': None, 'jet/words': None, 'muons/e': None, 'muons/nmuon': None, 'muons/px': None, 'muons/py': None, 'muons/pz': None}
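To make the role of the counters concrete, the sketch below uses small made-up lists rather than the real file. It assumes that a counter such as jet/njet stores the number of jets in each event and that a flat dataset such as jet/px concatenates all events in order, so that the starting offset of each event is the running sum of the counter (presumably what the jet/njet_INDEX entry holds). The helper name event_slice is hypothetical and not part of hepfile.

```python
from itertools import accumulate

# Toy stand-ins for data['jet/njet'] and data['jet/px'] (made-up values).
njet = [2, 0, 3]                      # number of jets in events 0, 1, 2
px = [0.1, 0.2, 0.3, 0.4, 0.5]        # all jets, concatenated event by event

# Starting offset of each event in the flat list: running sum of the counter.
starts = [0] + list(accumulate(njet))[:-1]

def event_slice(flat, counter, offsets, i):
    """Return the entries of `flat` that belong to event `i` (hypothetical helper)."""
    return flat[offsets[i]:offsets[i] + counter[i]]

per_event_px = [event_slice(px, njet, starts, i) for i in range(len(njet))]
print(per_event_px)  # [[0.1, 0.2], [], [0.3, 0.4, 0.5]]
```

An event with no jets simply yields an empty list, which is why a counter is needed in addition to the flat data.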

Reading Part of a File

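If you only want to read part of a file, you can load only certain groups and only a range of events. This is especially useful for very large datasets.

To do this, use the desired_groups and subset arguments to load. In the example below, only the jet group is read and subset=(5,10) selects five events (events 5 through 9 of the file):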

[5]:
data, event = load(infile, desired_groups=['jet'], subset=(5, 10))
[6]:
print(data.keys())
dict_keys(['_MAP_DATASETS_TO_COUNTERS_', '_MAP_DATASETS_TO_INDEX_', '_LIST_OF_COUNTERS_', '_LIST_OF_DATASETS_', '_META_', '_NUMBER_OF_BUCKETS_', '_SINGLETONS_GROUP_', '_SINGLETONS_GROUP_/COUNTER', '_SINGLETONS_GROUP_/COUNTER_INDEX', 'jet/njet', 'jet/njet_INDEX', 'muons/nmuon', 'muons/nmuon_INDEX', 'jet/algorithm', 'jet/e', 'jet/px', 'jet/py', 'jet/pz', 'jet/words', '_GROUPS_', '_MAP_DATASETS_TO_DATA_TYPES_', '_PROTECTED_NAMES_'])
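A quick way to check which datasets were actually loaded is to filter the keys by group prefix. The sketch below works on a list of key names copied from the output above (bookkeeping entries that start with an underscore are left out); the helper name datasets_in_group is hypothetical and not part of hepfile.

```python
# Key names copied from the output above, without the '_..._' bookkeeping entries.
keys = ['jet/njet', 'jet/njet_INDEX', 'muons/nmuon', 'muons/nmuon_INDEX',
        'jet/algorithm', 'jet/e', 'jet/px', 'jet/py', 'jet/pz', 'jet/words']

def datasets_in_group(keys, group):
    """Names under `group/`, skipping the *_INDEX entries (hypothetical helper)."""
    prefix = group + '/'
    return [k for k in keys if k.startswith(prefix) and not k.endswith('_INDEX')]

jet_keys = datasets_in_group(keys, 'jet')
muon_keys = datasets_in_group(keys, 'muons')
print(jet_keys)   # the jet counter plus all jet datasets
print(muon_keys)  # only the muon counter; the muon datasets were not loaded
```

With the real dictionary you would pass list(data.keys()) instead of the hand-copied list.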

Reading into Awkward Arrays

Awkward arrays are a fast data type for nested, variable-length datasets. Reading hepfiles into them is straightforward: just pass return_type='awkward' to load. Note: the returned event will still be a simple dictionary.

[7]:
data, event = load(infile, return_type='awkward')
[8]:
data.show() # display data
print()
data['jet'].show() # display just the jet data
print()
data.jet.px.show() # display the px data from the jet dataset
[{METpx: 0.797, METpy: 0.286, jet: {...}, muons: {...}},
 {METpx: 0.92, METpy: 0.322, jet: {...}, muons: {e: [], ...}},
 {METpx: 0.97, METpy: 0.428, jet: {...}, muons: {e: [], ...}},
 {METpx: 0.67, METpy: 0.554, jet: {...}, muons: {e: [], ...}},
 {METpx: 0.537, METpy: 0.124, jet: {...}, muons: {...}},
 {METpx: 0.967, METpy: 0.903, jet: {...}, muons: {...}},
 {METpx: 0.0275, METpy: 0.594, jet: {...}, muons: {...}},
 {METpx: 0.0261, METpy: 0.937, jet: {...}, muons: {...}},
 {METpx: 0.0508, METpy: 0.457, jet: {...}, muons: {...}},
 {METpx: 0.239, METpy: 0.157, jet: {...}, muons: {...}},
 ...,
 {METpx: 0.366, METpy: 0.548, jet: {...}, muons: {...}},
 {METpx: 0.551, METpy: 0.106, jet: {...}, muons: {...}},
 {METpx: 0.303, METpy: 0.00645, jet: {...}, muons: {...}},
 {METpx: 0.678, METpy: 0.283, jet: {...}, muons: {...}},
 {METpx: 0.462, METpy: 0.882, jet: {...}, muons: {...}},
 {METpx: 0.389, METpy: 0.819, jet: {...}, muons: {...}},
 {METpx: 0.747, METpy: 0.134, jet: {...}, muons: {...}},
 {METpx: 0.869, METpy: 0.257, jet: {...}, muons: {...}},
 {METpx: 0.286, METpy: 0.957, jet: {...}, muons: {...}}]

[{algorithm: [-1, 0, -1, -1, ..., -1, 0, 0], e: [0.235, ...], px: [...], ...},
 {algorithm: [0, -1, 0, 0, ..., 0, -1, -1], e: [0.776, ...], px: [...], ...},
 {algorithm: [-1, 0, 0, 0, ..., 0, 0, -1], e: [0.753, ...], px: [...], ...},
 {algorithm: [-1, 0, 0, -1, ..., 0, 0, -1], e: [0.812, ...], px: [...], ...},
 {algorithm: [0, 0, -1, -1, ..., 0, -1, 0], e: [0.73, ...], px: [...], ...},
 {algorithm: [-1, 0, -1, 0, ..., 0, 0, 0], e: [0.715, ...], px: [...], ...},
 {algorithm: [0, -1, -1, -1, ..., 0, 0, -1], e: [0.57, ...], px: [...], ...},
 {algorithm: [-1, 0, -1, ..., 0, -1, -1], e: [0.833, ...], px: [...], ...},
 {algorithm: [0, -1, 0, 0, ..., 0, -1, -1], e: [0.986, ...], px: [...], ...},
 {algorithm: [-1, -1, -1, ..., 0, -1, -1], e: [0.551, ...], px: [...], ...},
 ...,
 {algorithm: [0, 0, 0, 0, ..., 0, 0, -1], e: [0.668, ...], px: [...], ...},
 {algorithm: [-1, 0, 0, 0, ..., 0, 0, 0, 0], e: [0.89, ...], px: [...], ...},
 {algorithm: [0, -1, -1, 0, ..., 0, -1, 0], e: [0.0858, ...], px: [...], ...},
 {algorithm: [0, 0, 0, -1, ..., 0, -1, 0], e: [0.56, ...], px: [...], ...},
 {algorithm: [0, 0, -1, 0, ..., 0, 0, 0], e: [0.201, ...], px: [...], ...},
 {algorithm: [0, -1, 0, 0, ..., -1, -1, 0], e: [0.059, ...], px: [...], ...},
 {algorithm: [0, 0, -1, 0, ..., -1, 0, 0], e: [0.0394, ...], px: [...], ...},
 {algorithm: [0, -1, 0, 0, ..., 0, -1, 0], e: [0.133, ...], px: [...], ...},
 {algorithm: [-1, -1, -1, ..., -1, 0, 0], e: [0.072, ...], px: [...], ...}]

[[0.377, 0.354, 0.385, 0.837, 0.67, ..., 0.827, 0.47, 0.606, 0.91, 0.481],
 [0.807, 0.348, 0.685, 0.589, 0.867, ..., 0.1, 0.985, 0.949, 0.194, 0.213],
 [0.501, 0.205, 0.792, 0.203, 0.547, ..., 0.82, 0.0124, 0.496, 0.259, 0.796],
 [0.85, 0.0258, 0.724, 0.578, 0.539, ..., 0.829, 0.411, 0.652, 0.463, 0.279],
 [0.635, 0.627, 0.266, 0.875, 0.808, ..., 0.569, 0.518, 0.407, 0.678, 0.829],
 [0.368, 0.133, 0.207, 0.0936, 0.658, ..., 0.735, 0.0749, 0.765, 0.842, 0.42],
 [0.815, 0.794, 0.473, 0.222, 0.792, ..., 0.785, 0.399, 0.725, 0.267, 0.0654],
 [0.849, 0.425, 0.171, 0.663, 0.763, ..., 0.395, 0.0242, 0.381, 0.387, 0.347],
 [0.372, 0.318, 0.906, 0.974, 0.825, ..., 0.78, 0.598, 0.314, 0.56, 0.381],
 [0.107, 0.629, 0.55, 0.256, 0.361, ..., 0.152, 0.42, 0.795, 0.0826, 0.296],
 ...,
 [0.565, 0.268, 0.167, 0.701, 0.676, ..., 0.639, 0.93, 0.873, 0.152, 0.889],
 [0.548, 0.311, 0.467, 0.0944, 0.0539, ..., 0.434, 0.211, 0.153, 0.833, 0.0633],
 [0.239, 0.968, 0.263, 0.634, 0.393, ..., 0.0982, 0.772, 0.743, 0.0389, 0.152],
 [0.68, 0.26, 0.198, 0.655, 0.413, ..., 0.889, 0.29, 0.604, 0.126, 0.869],
 [0.416, 0.241, 0.838, 0.028, 0.308, ..., 0.748, 0.922, 0.345, 0.0838, 0.449],
 [0.244, 0.123, 0.146, 0.664, 0.801, ..., 0.244, 0.275, 0.905, 0.868, 0.124],
 [0.906, 0.0939, 0.99, 0.0212, 0.0361, ..., 0.0558, 0.886, 0.264, 0.687, 0.315],
 [0.644, 0.296, 0.218, 0.154, 0.575, ..., 0.694, 0.212, 0.55, 0.14, 0.805],
 [0.613, 0.0443, 0.586, 0.253, 0.319, ..., 0.52, 0.617, 0.412, 0.477, 0.76]]
[9]:
event
[9]:
{'METpx': None,
 'METpy': None,
 '_SINGLETONS_GROUP_/COUNTER': None,
 'jet/algorithm': None,
 'jet/e': None,
 'jet/njet': None,
 'jet/px': None,
 'jet/py': None,
 'jet/pz': None,
 'jet/words': None,
 'muons/e': None,
 'muons/nmuon': None,
 'muons/px': None,
 'muons/py': None,
 'muons/pz': None}

With return_type='awkward', you can still select a subset of the data in the same way!

[10]:
data, event = load(infile, return_type='awkward', desired_groups=['jet'], subset=(5, 10))
[11]:
data.show() # display data
print()
data['jet'].show() # display just the jet data
print()
data.jet.px.show() # display the px data from the jet dataset
[{jet: {algorithm: [-1, 0, -1, ..., 0, 0], e: [...], px: [...], ...}},
 {jet: {algorithm: [0, -1, -1, ..., 0, -1], e: [...], px: [...], ...}},
 {jet: {algorithm: [-1, 0, -1, ..., -1, -1], e: [...], px: [...], ...}},
 {jet: {algorithm: [0, -1, 0, ..., -1, -1], e: [...], px: [...], ...}},
 {jet: {algorithm: [-1, -1, ..., -1, -1], e: [...], px: [...], ...}}]

[{algorithm: [-1, 0, -1, 0, ..., 0, 0, 0], e: [0.715, ...], px: [...], ...},
 {algorithm: [0, -1, -1, -1, ..., 0, 0, -1], e: [0.57, ...], px: [...], ...},
 {algorithm: [-1, 0, -1, ..., 0, -1, -1], e: [0.833, ...], px: [...], ...},
 {algorithm: [0, -1, 0, 0, ..., 0, -1, -1], e: [0.986, ...], px: [...], ...},
 {algorithm: [-1, -1, -1, ..., 0, -1, -1], e: [0.551, ...], px: [...], ...}]

[[0.368, 0.133, 0.207, 0.0936, 0.658, ..., 0.735, 0.0749, 0.765, 0.842, 0.42],
 [0.815, 0.794, 0.473, 0.222, 0.792, ..., 0.785, 0.399, 0.725, 0.267, 0.0654],
 [0.849, 0.425, 0.171, 0.663, 0.763, ..., 0.395, 0.0242, 0.381, 0.387, 0.347],
 [0.372, 0.318, 0.906, 0.974, 0.825, ..., 0.78, 0.598, 0.314, 0.56, 0.381],
 [0.107, 0.629, 0.55, 0.256, 0.361, ..., 0.152, 0.42, 0.795, 0.0826, 0.296]]
[12]:
event
[12]:
{'jet/algorithm': None,
 'jet/e': None,
 'jet/njet': None,
 'jet/px': None,
 'jet/py': None,
 'jet/pz': None,
 'jet/words': None}

Reading into a Dictionary of Pandas DataFrames

To read into a dictionary of pandas DataFrames, where each DataFrame holds the data for a different group, all we need to do is pass return_type='pandas' to load.

[13]:
data, event = load(infile, return_type='pandas')
[14]:
print(f'Group Names: {data.keys()}')
Group Names: dict_keys(['_SINGLETONS_GROUP_', 'jet', 'muons'])
[15]:
print('jet information:')
print(data['jet'])
jet information:
       algorithm         e        px        py        pz     words  event_num
0             -1  0.235265  0.377286  0.013758  0.281951  b'aloha'          0
1              0  0.716575  0.353589  0.301027  0.192722     b'hi'          0
2             -1  0.540661  0.384670  0.169350  0.680987   b'ciao'          0
3             -1  0.565090  0.837155  0.872730  0.316655  b'aloha'          0
4              0  0.276296  0.669866  0.650584  0.834869  b'aloha'          0
...          ...       ...       ...       ...       ...       ...        ...
16995          0  0.777277  0.519738  0.584161  0.512100    b'bye'        999
16996         -1  0.762006  0.616540  0.758635  0.266155  b'aloha'        999
16997         -1  0.880660  0.411910  0.524687  0.899915    b'bye'        999
16998          0  0.680288  0.476806  0.177154  0.999250    b'bye'        999
16999          0  0.785057  0.760136  0.653502  0.202684     b'hi'        999

[17000 rows x 7 columns]

Once again, we can load a subset of the data with specific groups. However, note how the event numbers get reset to 0-4 when we use a subset of 5 events. If this is a problem, consider converting the default output of load to a dictionary of pandas DataFrames by hand using the hepfile.df_tools.hepfile_to_df function.

[16]:
data, event = load(infile, return_type='pandas', desired_groups=['jet'], subset=(5, 10))
[17]:
print(data['jet'])
    algorithm         e        px        py        pz     words  event_num
0          -1  0.714961  0.368254  0.305177  0.703897   b'ciao'          0
1           0  0.905348  0.132723  0.118752  0.267158   b'ciao'          0
2          -1  0.992758  0.206741  0.666685  0.884588     b'hi'          0
3           0  0.951870  0.093641  0.923420  0.872557  b'aloha'          0
4           0  0.620971  0.658409  0.721826  0.350951   b'ciao'          0
..        ...       ...       ...       ...       ...       ...        ...
80          0  0.051709  0.151747  0.109117  0.555908     b'hi'          4
81          0  0.874053  0.419595  0.305876  0.975816     b'hi'          4
82          0  0.446853  0.794686  0.824664  0.775379     b'hi'          4
83         -1  0.527434  0.082579  0.447411  0.370318  b'aloha'          4
84         -1  0.597947  0.296388  0.991767  0.248695   b'ciao'          4

[85 rows x 7 columns]
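If all you need is the original event numbering, one simple workaround is to shift the subset-relative numbers by the start of the subset. The sketch below shows this on a made-up list of event numbers; it assumes that subset=(start, stop) selects the contiguous events start through stop-1, as the output above suggests. The helper name restore_event_numbers is hypothetical and not part of hepfile.

```python
def restore_event_numbers(event_nums, subset):
    """Shift subset-relative event numbers back to file-level numbers.

    Assumes `subset=(start, stop)` selected the contiguous events
    start..stop-1, so relative event 0 is event `start` in the file.
    """
    start, stop = subset
    restored = [n + start for n in event_nums]
    # Sanity check: nothing should fall outside the requested range.
    if any(not (start <= n < stop) for n in restored):
        raise ValueError('event number outside the requested subset')
    return restored

relative = [0, 0, 1, 2, 2, 4]   # made-up event_num values from a subset load
original = restore_event_numbers(relative, subset=(5, 10))
print(original)  # [5, 5, 6, 7, 7, 9]
```

For the DataFrame above, the equivalent shift would be data['jet']['event_num'] + 5.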