Reading hepfiles#

Note: If you have not yet run through the write_hepfile tutorial, do that first to generate its output file. That output file is used as the input here!

Reading the Entire File#

[1]:

# import the load function
from hepfile import load

We begin with a file and load its contents into a data dictionary:

[2]:

infile = 'output_from_scratch.hdf5'
data, event = load(infile)

data is a dictionary containing counters, indices, and data for all the features we care about. event is an empty dictionary waiting to be filled by data from some new event.

[3]:

print(data)

{'_MAP_DATASETS_TO_COUNTERS_': {'_SINGLETONS_GROUP_': '_SINGLETONS_GROUP_/COUNTER', 'jet': 'jet/njet', 'muons': 'muons/nmuon', 'jet/e': 'jet/njet', 'jet/px': 'jet/njet', 'jet/py': 'jet/njet', 'jet/pz': 'jet/njet', 'jet/algorithm': 'jet/njet', 'jet/words': 'jet/njet', 'muons/e': 'muons/nmuon', 'muons/px': 'muons/nmuon', 'muons/py': 'muons/nmuon', 'muons/pz': 'muons/nmuon', 'METpx': '_SINGLETONS_GROUP_/COUNTER', 'METpy': '_SINGLETONS_GROUP_/COUNTER'}, '_MAP_DATASETS_TO_INDEX_': {'_SINGLETONS_GROUP_': '_SINGLETONS_GROUP_/COUNTER_INDEX', 'jet': 'jet/njet_INDEX', 'muons': 'muons/nmuon_INDEX', 'jet/e': 'jet/njet_INDEX', 'jet/px': 'jet/njet_INDEX', 'jet/py': 'jet/njet_INDEX', 'jet/pz': 'jet/njet_INDEX', 'jet/algorithm': 'jet/njet_INDEX', 'jet/words': 'jet/njet_INDEX', 'muons/e': 'muons/nmuon_INDEX', 'muons/px': 'muons/nmuon_INDEX', 'muons/py': 'muons/nmuon_INDEX', 'muons/pz': 'muons/nmuon_INDEX', 'METpx': '_SINGLETONS_GROUP_/COUNTER_INDEX', 'METpy': '_SINGLETONS_GROUP_/COUNTER_INDEX'}, '_LIST_OF_COUNTERS_': ['_SINGLETONS_GROUP_/COUNTER', 'jet/njet', 'muons/nmuon'], '_LIST_OF_DATASETS_': ['METpx', 'METpy', '_SINGLETONS_GROUP_', '_SINGLETONS_GROUP_/COUNTER', 'jet', 'jet/algorithm', 'jet/e', 'jet/njet', 'jet/px', 'jet/py', 'jet/pz', 'jet/words', 'muons', 'muons/e', 'muons/nmuon', 'muons/px', 'muons/py', 'muons/pz'], '_META_': {}, '_NUMBER_OF_BUCKETS_': 10000, '_SINGLETONS_GROUP_': array(['METpx', 'METpy'], dtype='<U5'), '_SINGLETONS_GROUP_/COUNTER': array([1, 1, 1, ..., 1, 1, 1]), '_SINGLETONS_GROUP_/COUNTER_INDEX': array([   0,    1,    2, ..., 9997, 9998, 9999]), 'jet/njet': array([17, 17, 17, ..., 17, 17, 17]), 'jet/njet_INDEX': array([     0,     17,     34, ..., 169949, 169966, 169983]), 'muons/nmuon': array([0, 0, 0, ..., 0, 0, 0]), 'muons/nmuon_INDEX': array([0, 0, 0, ..., 0, 0, 0]), 'METpx': array([0.3857817 , 0.12295702, 0.86312515, ..., 0.03597181, 0.16764084,
       0.02867685], dtype=float32), 'METpy': array([0.21301576, 0.92661446, 0.17768492, ..., 0.05196636, 0.3594888 ,
       0.80965936], dtype=float32), 'jet/algorithm': array([-1,  0,  0, ...,  0,  0, -1]), 'jet/e': array([0.51882523, 0.06649859, 0.8128549 , ..., 0.27712795, 0.40421703,
       0.99385214], dtype=float32), 'jet/px': array([0.72576076, 0.16795638, 0.55153126, ..., 0.49512222, 0.76267385,
       0.88863903], dtype=float32), 'jet/py': array([0.13361722, 0.6626456 , 0.12324467, ..., 0.7238448 , 0.06445312,
       0.40084764], dtype=float32), 'jet/pz': array([0.4981057 , 0.4544982 , 0.05688357, ..., 0.48145318, 0.6470511 ,
       0.5942736 ], dtype=float32), 'jet/words': array([b'aloha', b'hi', b'ciao', ..., b'bye', b'bye', b'bye'],
      dtype=object), 'muons/e': array([], dtype=float32), 'muons/px': array([], dtype=float32), 'muons/py': array([], dtype=float32), 'muons/pz': array([], dtype=float32), '_GROUPS_': {'_SINGLETONS_GROUP_': ['METpx', 'METpy'], 'jet': ['algorithm', 'e', 'njet', 'px', 'py', 'pz', 'words'], 'muons': ['e', 'nmuon', 'px', 'py', 'pz']}, '_MAP_DATASETS_TO_DATA_TYPES_': {'METpx': dtype('float32'), 'METpy': dtype('float32'), '_SINGLETONS_GROUP_': dtype('<U5'), '_SINGLETONS_GROUP_/COUNTER': dtype('int64'), 'jet/algorithm': dtype('int64'), 'jet/e': dtype('float32'), 'jet/njet': dtype('int64'), 'jet/px': dtype('float32'), 'jet/py': dtype('float32'), 'jet/pz': dtype('float32'), 'jet/words': dtype('O'), 'muons/e': dtype('float32'), 'muons/nmuon': dtype('int64'), 'muons/px': dtype('float32'), 'muons/py': dtype('float32'), 'muons/pz': dtype('float32')}, '_PROTECTED_NAMES_': {'_SINGLETONS_GROUP_/COUNTER', '_PROTECTED_NAMES_', '_META_', '_SINGLETONSGROUPFORSTORAGE_', '_MAP_DATASETS_TO_DATA_TYPES_', '_GROUPS_', '_SINGLETONS_GROUP_', '_MAP_DATASETS_TO_COUNTERS_', '_LIST_OF_COUNTERS_', '_HEADER_'}}

[4]:

print(event)

{'METpx': None, 'METpy': None, '_SINGLETONS_GROUP_/COUNTER': None, 'jet/algorithm': None, 'jet/e': None, 'jet/njet': None, 'jet/px': None, 'jet/py': None, 'jet/pz': None, 'jet/words': None, 'muons/e': None, 'muons/nmuon': None, 'muons/px': None, 'muons/py': None, 'muons/pz': None}
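The counter and index arrays printed above are what tie the flat dataset arrays back to individual events. The following is a minimal sketch of that bookkeeping. It uses a small hand-made dictionary rather than a real hepfile, and it assumes (as the printed values 0, 17, 34, ... suggest) that jet/njet_INDEX holds the offset of each event's first jet in the flat arrays. The helper jets_in_event is hypothetical and not part of hepfile.

```python
import numpy as np

# Toy stand-in for the dictionary returned by load(); all values are made up.
toy = {
    "jet/njet": np.array([2, 0, 3]),        # number of jets in each event
    "jet/njet_INDEX": np.array([0, 2, 2]),  # assumed: offset of each event's first jet
    "jet/px": np.array([0.1, 0.2, 0.3, 0.4, 0.5], dtype=np.float32),
}


def jets_in_event(data, name, i):
    """Hypothetical helper: slice the flat array `name` down to event i."""
    start = data["jet/njet_INDEX"][i]
    stop = start + data["jet/njet"][i]
    return data[name][start:stop]


px_event0 = jets_in_event(toy, "jet/px", 0)  # two jets
px_event1 = jets_in_event(toy, "jet/px", 1)  # no jets, so an empty array
px_event2 = jets_in_event(toy, "jet/px", 2)  # three jets
print(px_event0, px_event1, px_event2)
```

An event with no entries, like the muons in this file, simply yields an empty slice.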

Reading Part of a File#

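If you only want to read part of a file, you can load only certain groups, only a range of events, or both. This is especially useful for very large datasets.

To do this, use the desired_groups and subset arguments to load. Here desired_groups lists the groups to read, and subset=(5,10) restricts the read to a range of events: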

[5]:

data,event = load(infile,desired_groups=['jet'],subset=(5,10))

[6]:

print(data.keys())

dict_keys(['_MAP_DATASETS_TO_COUNTERS_', '_MAP_DATASETS_TO_INDEX_', '_LIST_OF_COUNTERS_', '_LIST_OF_DATASETS_', '_META_', '_NUMBER_OF_BUCKETS_', '_SINGLETONS_GROUP_', '_SINGLETONS_GROUP_/COUNTER', '_SINGLETONS_GROUP_/COUNTER_INDEX', 'jet/njet', 'jet/njet_INDEX', 'muons/nmuon', 'muons/nmuon_INDEX', 'jet/algorithm', 'jet/e', 'jet/px', 'jet/py', 'jet/pz', 'jet/words', '_GROUPS_', '_MAP_DATASETS_TO_DATA_TYPES_', '_PROTECTED_NAMES_'])
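Judging from the outputs elsewhere in this tutorial, subset=(5,10) appears to behave like a half-open range that selects events 5 through 9. The sketch below illustrates what that implies for the flat arrays. It uses toy counters built with NumPy to mimic this file (10000 events with 17 jets each) and does not call hepfile; the variable names are illustrative only.

```python
import numpy as np

# Toy counters mimicking the tutorial file: 10000 events with 17 jets each.
njet = np.full(10000, 17)
# Offset of each event's first jet in the flat arrays (exclusive cumulative sum).
njet_index = np.concatenate(([0], np.cumsum(njet)[:-1]))

lo, hi = 5, 10  # same bounds as subset=(5,10), assumed half-open

first = njet_index[lo]                    # first flat entry belonging to event 5
last = njet_index[hi - 1] + njet[hi - 1]  # one past the last flat entry of event 9

n_events = hi - lo
n_jets = last - first
print(n_events, n_jets)
```

Under these assumptions the subset covers 5 events and 85 jets, which agrees with the 85 rows of the subset jet DataFrame shown at the end of this tutorial.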

Reading into Awkward Arrays#

Awkward arrays are a fast data type for heterogeneous datasets. Reading hepfiles into them is straightforward: all you need to do is pass return_type='awkward' to load. Note: the returned event will still be a simple dictionary.

[7]:

data,event = load(infile, return_type='awkward')

[8]:

data.show() # display data
print()
data['jet'].show() # display just the jet data
print()
data.jet.px.show() # display the px data from the jet dataset

[{METpx: 0.386, METpy: 0.213, jet: {...}, muons: {...}},
 {METpx: 0.123, METpy: 0.927, jet: {...}, muons: {...}},
 {METpx: 0.863, METpy: 0.178, jet: {...}, muons: {...}},
 {METpx: 0.0628, METpy: 0.754, jet: {...}, muons: {...}},
 {METpx: 0.161, METpy: 0.408, jet: {...}, muons: {...}},
 {METpx: 0.217, METpy: 0.853, jet: {...}, muons: {...}},
 {METpx: 0.539, METpy: 0.761, jet: {...}, muons: {...}},
 {METpx: 0.631, METpy: 0.723, jet: {...}, muons: {...}},
 {METpx: 0.376, METpy: 0.846, jet: {...}, muons: {...}},
 {METpx: 0.091, METpy: 0.517, jet: {...}, muons: {...}},
 ...,
 {METpx: 0.0554, METpy: 0.0152, jet: {...}, muons: {...}},
 {METpx: 0.404, METpy: 0.41, jet: {...}, muons: {e: [], ...}},
 {METpx: 0.825, METpy: 0.688, jet: {...}, muons: {...}},
 {METpx: 0.392, METpy: 0.177, jet: {...}, muons: {...}},
 {METpx: 0.187, METpy: 0.311, jet: {...}, muons: {...}},
 {METpx: 0.873, METpy: 0.356, jet: {...}, muons: {...}},
 {METpx: 0.036, METpy: 0.052, jet: {...}, muons: {...}},
 {METpx: 0.168, METpy: 0.359, jet: {...}, muons: {...}},
 {METpx: 0.0287, METpy: 0.81, jet: {...}, muons: {...}}]

[{algorithm: [-1, 0, 0, -1, ..., 0, -1, 0], e: [0.519, ...], px: [...], ...},
 {algorithm: [-1, 0, -1, -1, ..., -1, 0, 0], e: [0.214, ...], px: [...], ...},
 {algorithm: [0, -1, 0, -1, ..., 0, 0, -1], e: [0.0682, ...], px: [...], ...},
 {algorithm: [0, 0, 0, 0, ..., -1, -1, 0], e: [0.518, ...], px: [...], ...},
 {algorithm: [-1, -1, 0, 0, ..., 0, 0, 0], e: [0.0882, ...], px: [...], ...},
 {algorithm: [-1, 0, -1, 0, ..., 0, -1, 0], e: [0.667, ...], px: [...], ...},
 {algorithm: [0, -1, 0, 0, ..., 0, 0, 0], e: [0.75, ...], px: [...], ...},
 {algorithm: [0, 0, 0, 0, ..., 0, -1, 0], e: [0.564, ...], px: [...], ...},
 {algorithm: [0, -1, 0, -1, ..., -1, 0, 0], e: [0.344, ...], px: [...], ...},
 {algorithm: [0, 0, 0, 0, ..., -1, -1, 0], e: [0.289, ...], px: [...], ...},
 ...,
 {algorithm: [0, -1, -1, 0, ..., 0, -1, 0], e: [0.132, ...], px: [...], ...},
 {algorithm: [-1, -1, -1, 0, ..., 0, -1, 0], e: [0.951, ...], px: [...], ...},
 {algorithm: [-1, 0, -1, 0, ..., 0, -1, 0], e: [0.124, ...], px: [...], ...},
 {algorithm: [0, 0, 0, 0, ..., -1, 0, -1], e: [0.935, ...], px: [...], ...},
 {algorithm: [-1, 0, -1, 0, ..., 0, 0, -1], e: [0.406, ...], px: [...], ...},
 {algorithm: [-1, 0, 0, 0, ..., 0, -1, 0], e: [0.321, ...], px: [...], ...},
 {algorithm: [-1, -1, -1, 0, ..., -1, 0, 0], e: [0.946, ...], px: [...], ...},
 {algorithm: [0, -1, -1, -1, ..., -1, 0, 0], e: [0.766, ...], px: [...], ...},
 {algorithm: [0, -1, 0, -1, ..., 0, 0, -1], e: [0.453, ...], px: [...], ...}]

[[0.726, 0.168, 0.552, 0.459, 0.521, ..., 0.4, 0.597, 0.735, 0.908, 0.135],
 [0.582, 0.0818, 0.678, 0.416, 0.555, ..., 0.57, 0.233, 0.202, 0.543, 0.586],
 [0.224, 0.952, 0.747, 0.64, 0.0682, ..., 0.648, 0.796, 0.483, 0.546, 0.558],
 [0.939, 0.043, 0.0537, 0.326, 0.188, ..., 0.433, 0.829, 0.315, 0.217, 0.967],
 [0.0541, 0.392, 0.289, 0.738, 0.00133, ..., 0.699, 0.154, 0.552, 0.984, 0.798],
 [0.891, 0.684, 0.719, 1, 0.843, 0.973, ..., 0.74, 0.663, 0.204, 0.787, 0.666],
 [0.867, 0.442, 0.319, 0.476, 0.162, ..., 0.285, 0.848, 0.907, 0.99, 0.627],
 [0.47, 0.867, 0.454, 0.656, 0.66, ..., 0.381, 0.67, 0.232, 0.814, 0.883],
 [0.769, 0.144, 0.452, 0.582, 0.921, ..., 0.513, 0.978, 0.605, 0.103, 0.788],
 [0.584, 0.447, 0.955, 0.419, 0.613, ..., 0.688, 0.236, 0.206, 0.87, 0.596],
 ...,
 [0.61, 0.415, 0.617, 0.395, 0.282, ..., 0.282, 0.0597, 0.916, 0.809, 0.181],
 [0.0727, 0.447, 0.205, 0.481, 0.0703, ..., 0.735, 0.507, 0.891, 0.665, 0.753],
 [0.932, 0.0572, 0.802, 0.535, 0.87, ..., 0.682, 0.424, 0.708, 0.168, 0.122],
 [0.184, 0.374, 0.975, 0.0478, 0.616, ..., 0.55, 0.737, 0.537, 0.585, 0.989],
 [0.924, 0.544, 0.229, 0.996, 0.531, ..., 0.917, 0.961, 0.642, 0.081, 0.0539],
 [0.352, 0.569, 0.966, 0.96, 0.986, ..., 0.673, 0.964, 0.839, 0.187, 0.715],
 [0.732, 0.823, 0.751, 0.523, 0.934, ..., 0.339, 0.706, 0.499, 0.0917, 0.904],
 [0.605, 0.777, 0.791, 0.568, 0.073, ..., 0.0806, 0.614, 0.808, 0.152, 0.00686],
 [0.833, 0.473, 0.0947, 0.419, 0.0112, ..., 0.863, 0.846, 0.495, 0.763, 0.889]]

[9]:

event

[9]:

{'METpx': None,
 'METpy': None,
 '_SINGLETONS_GROUP_/COUNTER': None,
 'jet/algorithm': None,
 'jet/e': None,
 'jet/njet': None,
 'jet/px': None,
 'jet/py': None,
 'jet/pz': None,
 'jet/words': None,
 'muons/e': None,
 'muons/nmuon': None,
 'muons/px': None,
 'muons/py': None,
 'muons/pz': None}

With return_type='awkward', you can still select a subset of the data in the same way!

[10]:

data,event = load(infile, return_type='awkward', desired_groups=['jet'], subset=(5,10))

[11]:

data.show() # display data
print()
data['jet'].show() # display just the jet data
print()
data.jet.px.show() # display the px data from the jet dataset

[{jet: {algorithm: [-1, 0, -1, ..., -1, 0], e: [...], px: [...], ...}},
 {jet: {algorithm: [0, -1, 0, ..., 0, 0, 0], e: [...], px: [...], ...}},
 {jet: {algorithm: [0, 0, 0, ..., 0, -1, 0], e: [...], px: [...], ...}},
 {jet: {algorithm: [0, -1, 0, ..., 0, 0], e: [...], px: [...], ...}},
 {jet: {algorithm: [0, 0, 0, ..., -1, 0], e: [...], px: [...], ...}}]

[{algorithm: [-1, 0, -1, 0, ..., 0, -1, 0], e: [0.667, ...], px: [...], ...},
 {algorithm: [0, -1, 0, 0, ..., 0, 0, 0], e: [0.75, ...], px: [...], ...},
 {algorithm: [0, 0, 0, 0, ..., 0, -1, 0], e: [0.564, ...], px: [...], ...},
 {algorithm: [0, -1, 0, -1, ..., -1, 0, 0], e: [0.344, ...], px: [...], ...},
 {algorithm: [0, 0, 0, 0, ..., -1, -1, 0], e: [0.289, ...], px: [...], ...}]

[[0.891, 0.684, 0.719, 1, 0.843, 0.973, ..., 0.74, 0.663, 0.204, 0.787, 0.666],
 [0.867, 0.442, 0.319, 0.476, 0.162, ..., 0.285, 0.848, 0.907, 0.99, 0.627],
 [0.47, 0.867, 0.454, 0.656, 0.66, ..., 0.381, 0.67, 0.232, 0.814, 0.883],
 [0.769, 0.144, 0.452, 0.582, 0.921, ..., 0.513, 0.978, 0.605, 0.103, 0.788],
 [0.584, 0.447, 0.955, 0.419, 0.613, ..., 0.688, 0.236, 0.206, 0.87, 0.596]]

[12]:

event

[12]:

{'jet/algorithm': None,
 'jet/e': None,
 'jet/njet': None,
 'jet/px': None,
 'jet/py': None,
 'jet/pz': None,
 'jet/words': None}

Reading into a Dictionary of Pandas DataFrames#

To read into a dictionary of pandas dataframes, where each dataframe holds the data for a different group, all we need to do is pass return_type='pandas' to load.

[13]:

data, event = load(infile, return_type='pandas')

[14]:

print(f'Group Names: {data.keys()}')

Group Names: dict_keys(['_SINGLETONS_GROUP_', 'jet', 'muons'])

[15]:

print('jet information:')
data['jet']

jet information:

[15]:

	algorithm	e	px	py	pz	words	event_num
0	-1	0.518825	0.725761	0.133617	0.498106	b'aloha'	0
1	0	0.066499	0.167956	0.662646	0.454498	b'hi'	0
2	0	0.812855	0.551531	0.123245	0.056884	b'ciao'	0
3	-1	0.292169	0.459002	0.758781	0.953022	b'ciao'	0
4	0	0.512365	0.520725	0.240334	0.485343	b'bye'	0
...	...	...	...	...	...	...	...
169995	0	0.625886	0.863305	0.751234	0.550784	b'aloha'	9999
169996	0	0.069708	0.845667	0.879986	0.359886	b'bye'	9999
169997	0	0.277128	0.495122	0.723845	0.481453	b'bye'	9999
169998	0	0.404217	0.762674	0.064453	0.647051	b'bye'	9999
169999	-1	0.993852	0.888639	0.400848	0.594274	b'bye'	9999

170000 rows × 7 columns
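Because every row carries an event_num column, per-event quantities can be recovered with an ordinary pandas groupby. The following is a small sketch on a hand-made DataFrame that stands in for data['jet']; the values are made up and only the column names follow the table above.

```python
import pandas as pd

# Toy stand-in for data['jet']; values are made up.
jets = pd.DataFrame(
    {
        "px": [0.7, 0.2, 0.5, 0.9, 0.1],
        "py": [0.1, 0.6, 0.1, 0.3, 0.4],
        "event_num": [0, 0, 0, 1, 1],
    }
)

# Number of jets in each event.
jets_per_event = jets.groupby("event_num").size()
# Summed px of the jets in each event.
sum_px = jets.groupby("event_num")["px"].sum()

print(jets_per_event.to_dict())
print(sum_px.to_dict())
```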

Once again, we can load a subset of the data with specific groups. Note, however, that the event numbers are reset to 0-4 when we use a subset of 5 events. If this is a problem, consider converting the default output of load to a dictionary of pandas dataframes by hand using the hf.df_tools.hepfile_to_df method (here hf refers to the hepfile package, as in import hepfile as hf).

[16]:

data,event = load(infile, return_type='pandas', desired_groups=['jet'], subset=(5,10))

[17]:

data['jet']

[17]:

	algorithm	e	px	py	pz	words	event_num
0	-1	0.667102	0.891190	0.718513	0.768162	b'bye'	0
1	0	0.146582	0.683707	0.756508	0.472253	b'ciao'	0
2	-1	0.865275	0.718874	0.927169	0.794849	b'hi'	0
3	0	0.776318	0.999773	0.350176	0.440168	b'hi'	0
4	0	0.462614	0.843460	0.351398	0.929219	b'hi'	0
...	...	...	...	...	...	...	...
80	-1	0.544807	0.688257	0.273543	0.637789	b'ciao'	4
81	-1	0.239300	0.235639	0.579898	0.609811	b'bye'	4
82	-1	0.971387	0.206133	0.797268	0.155473	b'aloha'	4
83	-1	0.576240	0.870437	0.513300	0.039285	b'bye'	4
84	0	0.126184	0.596369	0.360929	0.931362	b'bye'	4

85 rows × 7 columns
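If only the numbering matters, one lightweight workaround is to shift event_num by the start of the subset. This is a sketch under the assumption that the subset start (5 here) is exactly the offset between the local and the original numbering. It uses a toy DataFrame in place of data['jet'], and the column name original_event_num is made up for illustration.

```python
import pandas as pd

start, stop = 5, 10  # the bounds passed as subset=(5,10)

# Toy stand-in for the subset data['jet']; event_num restarts at 0.
jets = pd.DataFrame(
    {
        "px": [0.89, 0.68, 0.87, 0.47],
        "event_num": [0, 0, 1, 4],
    }
)

# Shift the local numbering back to the numbering used in the full file.
# Assumption: local event 0 corresponds to event `start` of the full file.
jets = jets.assign(original_event_num=jets["event_num"] + start)
print(jets["original_event_num"].tolist())
```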
