Reading hepfiles#
Note: If you have not run through the write_hepfile_from_scratch.ipynb
do that first to generate the output file from that. That output file will be used as the input here!
Reading the Entire File#
[1]:
# import the load function
import os
from hepfile import load
We begin with a file, and load it into an empty data dictionary:
[2]:
infile = 'output_from_scratch.hdf5'
if not os.path.exists(infile):
raise FileNotFoundError('Make sure you ran through `write_hepfile_from_scratch.ipynb` before this tutorial!')
else:
data, event = load(infile)
data is a dictionary containing counters, indices, and data for all the features we care about. event is an empty dictionary waiting to be filled by data from some new event.
[3]:
print(data.keys())
dict_keys(['_MAP_DATASETS_TO_COUNTERS_', '_MAP_DATASETS_TO_INDEX_', '_LIST_OF_COUNTERS_', '_LIST_OF_DATASETS_', '_META_', '_NUMBER_OF_BUCKETS_', '_SINGLETONS_GROUP_', '_SINGLETONS_GROUP_/COUNTER', '_SINGLETONS_GROUP_/COUNTER_INDEX', 'jet/njet', 'jet/njet_INDEX', 'muons/nmuon', 'muons/nmuon_INDEX', 'METpx', 'METpy', 'jet/algorithm', 'jet/e', 'jet/px', 'jet/py', 'jet/pz', 'jet/words', 'muons/e', 'muons/px', 'muons/py', 'muons/pz', '_GROUPS_', '_MAP_DATASETS_TO_DATA_TYPES_', '_PROTECTED_NAMES_'])
[4]:
print(event)
{'METpx': None, 'METpy': None, '_SINGLETONS_GROUP_/COUNTER': None, 'jet/algorithm': None, 'jet/e': None, 'jet/njet': None, 'jet/px': None, 'jet/py': None, 'jet/pz': None, 'jet/words': None, 'muons/e': None, 'muons/nmuon': None, 'muons/px': None, 'muons/py': None, 'muons/pz': None}
Reading Part of a File#
If you only want to read part of a file, you can load only certain groups. This is especially useful for very large datasets.
To do this, you can use the desired_groups
and subset
arguments to load:
[5]:
data,event = load(infile,desired_groups=['jet'],subset=(5,10))
[6]:
print(data.keys())
dict_keys(['_MAP_DATASETS_TO_COUNTERS_', '_MAP_DATASETS_TO_INDEX_', '_LIST_OF_COUNTERS_', '_LIST_OF_DATASETS_', '_META_', '_NUMBER_OF_BUCKETS_', '_SINGLETONS_GROUP_', '_SINGLETONS_GROUP_/COUNTER', '_SINGLETONS_GROUP_/COUNTER_INDEX', 'jet/njet', 'jet/njet_INDEX', 'muons/nmuon', 'muons/nmuon_INDEX', 'jet/algorithm', 'jet/e', 'jet/px', 'jet/py', 'jet/pz', 'jet/words', '_GROUPS_', '_MAP_DATASETS_TO_DATA_TYPES_', '_PROTECTED_NAMES_'])
Reading into Awkward Arrays#
Awkward arrays are a very fast datatype for heterogeneous datasets. It is relatively easy to read hepfiles into them, all you need to do is add the flag return_type='awkward'
to load
. Note: the event return will still just be a simple dictionary.
[7]:
data,event = load(infile, return_type='awkward')
[8]:
data.show() # display data
print()
data['jet'].show() # display just the jet data
print()
data.jet.px.show() # display the px data from the jet dataset
[{METpx: 0.797, METpy: 0.286, jet: {...}, muons: {...}},
{METpx: 0.92, METpy: 0.322, jet: {...}, muons: {e: [], ...}},
{METpx: 0.97, METpy: 0.428, jet: {...}, muons: {e: [], ...}},
{METpx: 0.67, METpy: 0.554, jet: {...}, muons: {e: [], ...}},
{METpx: 0.537, METpy: 0.124, jet: {...}, muons: {...}},
{METpx: 0.967, METpy: 0.903, jet: {...}, muons: {...}},
{METpx: 0.0275, METpy: 0.594, jet: {...}, muons: {...}},
{METpx: 0.0261, METpy: 0.937, jet: {...}, muons: {...}},
{METpx: 0.0508, METpy: 0.457, jet: {...}, muons: {...}},
{METpx: 0.239, METpy: 0.157, jet: {...}, muons: {...}},
...,
{METpx: 0.366, METpy: 0.548, jet: {...}, muons: {...}},
{METpx: 0.551, METpy: 0.106, jet: {...}, muons: {...}},
{METpx: 0.303, METpy: 0.00645, jet: {...}, muons: {...}},
{METpx: 0.678, METpy: 0.283, jet: {...}, muons: {...}},
{METpx: 0.462, METpy: 0.882, jet: {...}, muons: {...}},
{METpx: 0.389, METpy: 0.819, jet: {...}, muons: {...}},
{METpx: 0.747, METpy: 0.134, jet: {...}, muons: {...}},
{METpx: 0.869, METpy: 0.257, jet: {...}, muons: {...}},
{METpx: 0.286, METpy: 0.957, jet: {...}, muons: {...}}]
[{algorithm: [-1, 0, -1, -1, ..., -1, 0, 0], e: [0.235, ...], px: [...], ...},
{algorithm: [0, -1, 0, 0, ..., 0, -1, -1], e: [0.776, ...], px: [...], ...},
{algorithm: [-1, 0, 0, 0, ..., 0, 0, -1], e: [0.753, ...], px: [...], ...},
{algorithm: [-1, 0, 0, -1, ..., 0, 0, -1], e: [0.812, ...], px: [...], ...},
{algorithm: [0, 0, -1, -1, ..., 0, -1, 0], e: [0.73, ...], px: [...], ...},
{algorithm: [-1, 0, -1, 0, ..., 0, 0, 0], e: [0.715, ...], px: [...], ...},
{algorithm: [0, -1, -1, -1, ..., 0, 0, -1], e: [0.57, ...], px: [...], ...},
{algorithm: [-1, 0, -1, ..., 0, -1, -1], e: [0.833, ...], px: [...], ...},
{algorithm: [0, -1, 0, 0, ..., 0, -1, -1], e: [0.986, ...], px: [...], ...},
{algorithm: [-1, -1, -1, ..., 0, -1, -1], e: [0.551, ...], px: [...], ...},
...,
{algorithm: [0, 0, 0, 0, ..., 0, 0, -1], e: [0.668, ...], px: [...], ...},
{algorithm: [-1, 0, 0, 0, ..., 0, 0, 0, 0], e: [0.89, ...], px: [...], ...},
{algorithm: [0, -1, -1, 0, ..., 0, -1, 0], e: [0.0858, ...], px: [...], ...},
{algorithm: [0, 0, 0, -1, ..., 0, -1, 0], e: [0.56, ...], px: [...], ...},
{algorithm: [0, 0, -1, 0, ..., 0, 0, 0], e: [0.201, ...], px: [...], ...},
{algorithm: [0, -1, 0, 0, ..., -1, -1, 0], e: [0.059, ...], px: [...], ...},
{algorithm: [0, 0, -1, 0, ..., -1, 0, 0], e: [0.0394, ...], px: [...], ...},
{algorithm: [0, -1, 0, 0, ..., 0, -1, 0], e: [0.133, ...], px: [...], ...},
{algorithm: [-1, -1, -1, ..., -1, 0, 0], e: [0.072, ...], px: [...], ...}]
[[0.377, 0.354, 0.385, 0.837, 0.67, ..., 0.827, 0.47, 0.606, 0.91, 0.481],
[0.807, 0.348, 0.685, 0.589, 0.867, ..., 0.1, 0.985, 0.949, 0.194, 0.213],
[0.501, 0.205, 0.792, 0.203, 0.547, ..., 0.82, 0.0124, 0.496, 0.259, 0.796],
[0.85, 0.0258, 0.724, 0.578, 0.539, ..., 0.829, 0.411, 0.652, 0.463, 0.279],
[0.635, 0.627, 0.266, 0.875, 0.808, ..., 0.569, 0.518, 0.407, 0.678, 0.829],
[0.368, 0.133, 0.207, 0.0936, 0.658, ..., 0.735, 0.0749, 0.765, 0.842, 0.42],
[0.815, 0.794, 0.473, 0.222, 0.792, ..., 0.785, 0.399, 0.725, 0.267, 0.0654],
[0.849, 0.425, 0.171, 0.663, 0.763, ..., 0.395, 0.0242, 0.381, 0.387, 0.347],
[0.372, 0.318, 0.906, 0.974, 0.825, ..., 0.78, 0.598, 0.314, 0.56, 0.381],
[0.107, 0.629, 0.55, 0.256, 0.361, ..., 0.152, 0.42, 0.795, 0.0826, 0.296],
...,
[0.565, 0.268, 0.167, 0.701, 0.676, ..., 0.639, 0.93, 0.873, 0.152, 0.889],
[0.548, 0.311, 0.467, 0.0944, 0.0539, ..., 0.434, 0.211, 0.153, 0.833, 0.0633],
[0.239, 0.968, 0.263, 0.634, 0.393, ..., 0.0982, 0.772, 0.743, 0.0389, 0.152],
[0.68, 0.26, 0.198, 0.655, 0.413, ..., 0.889, 0.29, 0.604, 0.126, 0.869],
[0.416, 0.241, 0.838, 0.028, 0.308, ..., 0.748, 0.922, 0.345, 0.0838, 0.449],
[0.244, 0.123, 0.146, 0.664, 0.801, ..., 0.244, 0.275, 0.905, 0.868, 0.124],
[0.906, 0.0939, 0.99, 0.0212, 0.0361, ..., 0.0558, 0.886, 0.264, 0.687, 0.315],
[0.644, 0.296, 0.218, 0.154, 0.575, ..., 0.694, 0.212, 0.55, 0.14, 0.805],
[0.613, 0.0443, 0.586, 0.253, 0.319, ..., 0.52, 0.617, 0.412, 0.477, 0.76]]
[9]:
event
[9]:
{'METpx': None,
'METpy': None,
'_SINGLETONS_GROUP_/COUNTER': None,
'jet/algorithm': None,
'jet/e': None,
'jet/njet': None,
'jet/px': None,
'jet/py': None,
'jet/pz': None,
'jet/words': None,
'muons/e': None,
'muons/nmuon': None,
'muons/px': None,
'muons/py': None,
'muons/pz': None}
With the return_type=awkward
flag, you can still select a subset of the data in the same way!
[10]:
data,event = load(infile, return_type='awkward', desired_groups=['jet'], subset=(5,10))
[11]:
data.show() # display data
print()
data['jet'].show() # display just the jet data
print()
data.jet.px.show() # display the px data from the jet dataset
[{jet: {algorithm: [-1, 0, -1, ..., 0, 0], e: [...], px: [...], ...}},
{jet: {algorithm: [0, -1, -1, ..., 0, -1], e: [...], px: [...], ...}},
{jet: {algorithm: [-1, 0, -1, ..., -1, -1], e: [...], px: [...], ...}},
{jet: {algorithm: [0, -1, 0, ..., -1, -1], e: [...], px: [...], ...}},
{jet: {algorithm: [-1, -1, ..., -1, -1], e: [...], px: [...], ...}}]
[{algorithm: [-1, 0, -1, 0, ..., 0, 0, 0], e: [0.715, ...], px: [...], ...},
{algorithm: [0, -1, -1, -1, ..., 0, 0, -1], e: [0.57, ...], px: [...], ...},
{algorithm: [-1, 0, -1, ..., 0, -1, -1], e: [0.833, ...], px: [...], ...},
{algorithm: [0, -1, 0, 0, ..., 0, -1, -1], e: [0.986, ...], px: [...], ...},
{algorithm: [-1, -1, -1, ..., 0, -1, -1], e: [0.551, ...], px: [...], ...}]
[[0.368, 0.133, 0.207, 0.0936, 0.658, ..., 0.735, 0.0749, 0.765, 0.842, 0.42],
[0.815, 0.794, 0.473, 0.222, 0.792, ..., 0.785, 0.399, 0.725, 0.267, 0.0654],
[0.849, 0.425, 0.171, 0.663, 0.763, ..., 0.395, 0.0242, 0.381, 0.387, 0.347],
[0.372, 0.318, 0.906, 0.974, 0.825, ..., 0.78, 0.598, 0.314, 0.56, 0.381],
[0.107, 0.629, 0.55, 0.256, 0.361, ..., 0.152, 0.42, 0.795, 0.0826, 0.296]]
[12]:
event
[12]:
{'jet/algorithm': None,
'jet/e': None,
'jet/njet': None,
'jet/px': None,
'jet/py': None,
'jet/pz': None,
'jet/words': None}
Reading into a Dictionary of Pandas DataFrames#
To read into a dictionary of pandas dataframes where each dataframe represents data on a different group all we need to do is provide return_type='pandas'
to load
.
[13]:
data, event = load(infile, return_type='pandas')
[14]:
print(f'Group Names: {data.keys()}')
Group Names: dict_keys(['_SINGLETONS_GROUP_', 'jet', 'muons'])
[15]:
print('jet information:')
print(data['jet'])
jet information:
algorithm e px py pz words event_num
0 -1 0.235265 0.377286 0.013758 0.281951 b'aloha' 0
1 0 0.716575 0.353589 0.301027 0.192722 b'hi' 0
2 -1 0.540661 0.384670 0.169350 0.680987 b'ciao' 0
3 -1 0.565090 0.837155 0.872730 0.316655 b'aloha' 0
4 0 0.276296 0.669866 0.650584 0.834869 b'aloha' 0
... ... ... ... ... ... ... ...
16995 0 0.777277 0.519738 0.584161 0.512100 b'bye' 999
16996 -1 0.762006 0.616540 0.758635 0.266155 b'aloha' 999
16997 -1 0.880660 0.411910 0.524687 0.899915 b'bye' 999
16998 0 0.680288 0.476806 0.177154 0.999250 b'bye' 999
16999 0 0.785057 0.760136 0.653502 0.202684 b'hi' 999
[17000 rows x 7 columns]
Once again, we can use a subset of the data with specific groups. However, note how the event numbers get reset to 0-4 when we use a subset with 5 rows. If this is a problem, you should look at converting the default output of load
to a dictionary of pandas dataframes by hand using the hf.df_tools.hepfile_to_df
method.
[16]:
data,event = load(infile, return_type='pandas', desired_groups=['jet'], subset=(5,10))
[17]:
print(data['jet'])
algorithm e px py pz words event_num
0 -1 0.714961 0.368254 0.305177 0.703897 b'ciao' 0
1 0 0.905348 0.132723 0.118752 0.267158 b'ciao' 0
2 -1 0.992758 0.206741 0.666685 0.884588 b'hi' 0
3 0 0.951870 0.093641 0.923420 0.872557 b'aloha' 0
4 0 0.620971 0.658409 0.721826 0.350951 b'ciao' 0
.. ... ... ... ... ... ... ...
80 0 0.051709 0.151747 0.109117 0.555908 b'hi' 4
81 0 0.874053 0.419595 0.305876 0.975816 b'hi' 4
82 0 0.446853 0.794686 0.824664 0.775379 b'hi' 4
83 -1 0.527434 0.082579 0.447411 0.370318 b'aloha' 4
84 -1 0.597947 0.296388 0.991767 0.248695 b'ciao' 4
[85 rows x 7 columns]
[ ]: