Working with Pandas DataFrames

We can also convert hepfile’s dictionary structure to a dictionary of pandas DataFrames, with one DataFrame per group in the hepfile. This is possible because all datasets within a group have the same length, even though datasets in different groups do not necessarily have the same length.
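To make the equal-length property concrete, here is a minimal sketch that checks it on a hand-written flat dictionary with `group/dataset` keys. The dictionary contents and the helper name `lengths_by_group` are made up for illustration and are not part of hepfile.

```python
from collections import defaultdict

# Hypothetical flat dictionary with 'group/dataset' keys (illustrative values only).
flat = {
    'jet/px': [1.0, 2.0, 3.0, 4.0],
    'jet/py': [5.0, 6.0, 7.0, 8.0],
    'muon/e': [10.0, 11.0, 12.5],
    'muon/q': [1, -1, 1],
}

def lengths_by_group(data):
    """Map each group name to the set of lengths of its datasets."""
    lengths = defaultdict(set)
    for key, values in data.items():
        group, _, _dataset = key.partition('/')
        lengths[group].add(len(values))
    return dict(lengths)

group_lengths = lengths_by_group(flat)
print(group_lengths)

# One distinct length per group means each group fits in a single table,
# even though the two groups have different numbers of rows.
tabular = all(len(sizes) == 1 for sizes in group_lengths.values())
print(tabular)
```

Each group ends up with exactly one length, which is what lets its datasets become the columns of one DataFrame.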

First, make sure that you have installed hepfile using one of the following commands. The base installation does not have these pandas tools built in!

1. python -m pip install hepfile[all], or
2. python -m pip install hepfile[pandas]

[1]:
import hepfile as hf
import pandas as pd

Hepfiles from Pandas DataFrames

Let’s create a hepfile from a dictionary of pandas DataFrames. The dictionary keys are the group names you want in the hepfile, and the columns of each DataFrame are the dataset names for that group. All of the singletons should go in a table stored under the key ‘_SINGLETONS_GROUP_’ so that they are stored properly. Finally, every DataFrame also needs a column with the event number of each row so that rows can be matched to their events.

Here are the three groups/singletons we will use:

[2]:
group1 = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'event_num': [0, 0, 1, 1]
})

group2 = pd.DataFrame({
    'z': [10.0, 11.0, 12.5],
    'w': [1600, 25, 16],
    'event_num': [0, 0, 1]
})

singletons = pd.DataFrame({
    'some_singleton': [1, 2],
    'event_num': [0, 1]
})

Now we pack these into a single dictionary where the keys are the group names:

[3]:
indata = {
    'group_name_1': group1,
    'group_name_2': group2,
    '_SINGLETONS_GROUP_': singletons
}

indata
[3]:
{'group_name_1':    x  y  event_num
 0  1  5          0
 1  2  6          0
 2  3  7          1
 3  4  8          1,
 'group_name_2':       z     w  event_num
 0  10.0  1600          0
 1  11.0    25          0
 2  12.5    16          1,
 '_SINGLETONS_GROUP_':    some_singleton  event_num
 0               1          0
 1               2          1}
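Since every table must carry an event-number column, it can be worth checking the dictionary before converting it. The sketch below uses only pandas; the helper `missing_event_column` and the example tables are hypothetical and not part of hepfile.

```python
import pandas as pd

def missing_event_column(df_dict, event_num_col='event_num'):
    """Return the names of groups whose DataFrame lacks the event-number column."""
    return [name for name, df in df_dict.items() if event_num_col not in df.columns]

# Illustrative tables only; 'bad_group' deliberately has no event_num column.
example = {
    'good_group': pd.DataFrame({'x': [1, 2], 'event_num': [0, 1]}),
    'bad_group': pd.DataFrame({'z': [10.0, 11.0]}),
}

problems = missing_event_column(example)
print(problems)
```

An empty list means every group has the expected column. For the `indata` dictionary above the helper would return an empty list, because all three tables include `event_num`.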

Once we have this data, we can pass it into hf.df_tools.df_to_hepfile to convert it to a hepfile. But first let’s check out the options for this function:

[4]:
help(hf.df_tools.df_to_hepfile)
Help on function df_to_hepfile in module hepfile.df_tools:

df_to_hepfile(df_dict: 'dict[pd.DataFrame]', outfile: 'str' = None, event_num_col='event_num', write_hepfile: 'bool' = True) -> 'dict'
    Converts a list of dataframes of group data to a hepfile. The opposite of
    hepfile_to_df. Must have an event_num column!

    Args:
        df_dict (dict): dictionary of pandas DataFrame groups to write to a hepfile
        outfile (str): output file name, required if write_hepfile is True
        event_num_col (str): name of a column in the pd.DataFrame to group by
        write_hepfile (bool): should we write the hepfile data to a hepfile?

    Returns:
        dict: hepfile data dictionary

    Raises:
        InputError: If something is wrong with the specific input.

So this function takes the dictionary of pandas DataFrames that we created. The optional inputs are outfile, the output path, which is required if we want to write the data to a hepfile; write_hepfile, a boolean that controls whether a file is written at all; and event_num_col, which defaults to ‘event_num’ but can be set to a different column name.

[5]:
data = hf.df_tools.df_to_hepfile(indata, event_num_col='event_num', write_hepfile=True, outfile='pandas_test.h5')
data
[5]:
{'_GROUPS_': {'_SINGLETONS_GROUP_': ['COUNTER', 'event_num', 'some_singleton'],
  'group_name_1': ['ngroup_name_1', 'x', 'y'],
  'group_name_2': ['ngroup_name_2', 'z', 'w']},
 '_MAP_DATASETS_TO_COUNTERS_': {'_SINGLETONS_GROUP_': '_SINGLETONS_GROUP_/COUNTER',
  'group_name_1': 'group_name_1/ngroup_name_1',
  'group_name_1/x': 'group_name_1/ngroup_name_1',
  'group_name_1/y': 'group_name_1/ngroup_name_1',
  'event_num': '_SINGLETONS_GROUP_/COUNTER',
  'group_name_2': 'group_name_2/ngroup_name_2',
  'group_name_2/z': 'group_name_2/ngroup_name_2',
  'group_name_2/w': 'group_name_2/ngroup_name_2',
  'some_singleton': '_SINGLETONS_GROUP_/COUNTER'},
 '_LIST_OF_COUNTERS_': ['_SINGLETONS_GROUP_/COUNTER',
  'group_name_1/ngroup_name_1',
  'group_name_2/ngroup_name_2'],
 '_SINGLETONS_GROUP_/COUNTER': [1, 1],
 '_MAP_DATASETS_TO_DATA_TYPES_': {'_SINGLETONS_GROUP_/COUNTER': int,
  'group_name_1/ngroup_name_1': int,
  'group_name_1/x': numpy.int64,
  'group_name_1/y': numpy.int64,
  'event_num': numpy.int64,
  'group_name_2/ngroup_name_2': int,
  'group_name_2/z': numpy.float64,
  'group_name_2/w': numpy.int64,
  'some_singleton': numpy.int64},
 '_META_': {},
 'group_name_1/ngroup_name_1': [2, 2],
 'group_name_1/x': [1, 2, 3, 4],
 'group_name_1/y': [5, 6, 7, 8],
 'event_num': [0, 1],
 'group_name_2/ngroup_name_2': [2, 1],
 'group_name_2/z': [10.0, 11.0, 12.5],
 'group_name_2/w': [1600, 25, 16],
 'some_singleton': [1, 2]}
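The counter entries in this dictionary, such as 'group_name_1/ngroup_name_1': [2, 2] and 'group_name_2/ngroup_name_2': [2, 1], give the number of rows each event contributes to a group. The sketch below reproduces those numbers with a pandas groupby. It illustrates the idea only and is not hepfile's implementation; the helper `rows_per_event` is a hypothetical name.

```python
import pandas as pd

# Same values as group1 and group2 defined earlier.
group1 = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [5, 6, 7, 8], 'event_num': [0, 0, 1, 1]})
group2 = pd.DataFrame({'z': [10.0, 11.0, 12.5], 'w': [1600, 25, 16], 'event_num': [0, 0, 1]})

def rows_per_event(df, event_num_col='event_num'):
    """Count the rows belonging to each event, ordered by event number."""
    return df.groupby(event_num_col).size().sort_index().tolist()

counts1 = rows_per_event(group1)
counts2 = rows_per_event(group2)
print(counts1, counts2)  # [2, 2] [2, 1]
```

Note that this simple count skips events that have no rows in a group, so it would need extra handling for events that are empty in a given group.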

Pandas DataFrames from hepfiles

Now that we have this data stored in a hepfile, we can load it into a Pandas DataFrame in two ways. First, the more restrictive (but easier) option is to pass return_type='pandas' to hf.load:

[6]:
dfs, bucket = hf.load('pandas_test.h5', return_type='pandas')
dfs
[6]:
{'_SINGLETONS_GROUP_':    event_num  some_singleton
 0          0               1
 1          1               2,
 'group_name_1':    x  y  event_num
 0  1  5          0
 1  2  6          0
 2  3  7          1
 3  4  8          1,
 'group_name_2':       w     z  event_num
 0  1600  10.0          0
 1    25  11.0          0
 2    16  12.5          1}

As you can see, this easily loads the hepfile into the same dictionary of dataframes that we wrote to it (only the column order within a table may differ)! But there are two limitations to this method:

1. We can’t select a subset of the data (like a subset of groups or a subset of event numbers) from the hepfile.
2. The event number column must be called event_num.

To get around these limitations, we can instead load the hepfile into a hepfile dictionary object and then convert it using the hf.df_tools.hepfile_to_df function. This function takes a hepfile dictionary and, optionally, a list (or string) of group names and a list (or int) of event numbers.

[7]:
data, bucket = hf.load('pandas_test.h5', return_type='dictionary')
dfs = hf.df_tools.hepfile_to_df(data) # this loads all the data
dfs
[7]:
{'_SINGLETONS_GROUP_':    event_num  some_singleton
 0          0               1
 1          1               2,
 'group_name_1':    x  y  event_num
 0  1  5          0
 1  2  6          0
 2  3  7          1
 3  4  8          1,
 'group_name_2':       w     z  event_num
 0  1600  10.0          0
 1    25  11.0          0
 2    16  12.5          1}

Now say that we only want group 1. In that case we use the groups option. Notice that it returns a dataframe directly, rather than a dictionary, when only one group is specified!

[8]:
df = hf.df_tools.hepfile_to_df(data, groups='group_name_1') # this loads just group 1
print(df)
   x  y  event_num
0  1  5          0
1  2  6          0
2  3  7          1
3  4  8          1

Finally, say we only want values associated with event 1:

[9]:
df = hf.df_tools.hepfile_to_df(data, groups='group_name_1', events=1) # this loads just group 1 and event 1
print(df)
   x  y  event_num
2  3  7          1
3  4  8          1
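For a table that has already been loaded in full, an equivalent selection can be sketched with ordinary pandas boolean indexing. This uses pandas only and does not call hepfile; the names `df_all` and `wanted_events` are illustrative.

```python
import pandas as pd

# Same values as the group_name_1 table shown above.
df_all = pd.DataFrame({'x': [1, 2, 3, 4], 'y': [5, 6, 7, 8], 'event_num': [0, 0, 1, 1]})

# Keep only the rows whose event number is in the wanted list.
wanted_events = [1]
df_event1 = df_all[df_all['event_num'].isin(wanted_events)]
print(df_event1)
```

As in the hepfile_to_df output, the selected rows keep their original index labels (2 and 3).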

Working with awkward arrays of hepfiles and pandas dataframes

We can also convert awkward arrays of hepfile data to pandas dataframes. (Note: this is essentially a wrapper around the official awkward array to dataframe method, so use that method directly for better performance and more options!)

Let’s start by converting data from above to an awkward array:

[10]:
awk = hf.awkward_tools.hepfile_to_awkward(data)
print(awk)
[{event_num: 0, some_singleton: 1, group_name_1: {...}, ...}, {...}]

Now we can convert that awkward array to a pandas dataframe:

[11]:
dfs = hf.df_tools.awkward_to_df(awk)
print(dfs)
{'group_name_1':                 x  y  event_num
entry subentry
0     0         1  5          0
      1         2  6          0
1     0         3  7          1
      1         4  8          1, 'group_name_2':                    w     z  event_num
entry subentry
0     0         1600  10.0          0
      1           25  11.0          0
1     0           16  12.5          1, '_SINGLETONS_GROUP_':    event_num  some_singleton
0          0               1
1          1               2}

It’s that easy to navigate between the different data types! One important thing to note is that the group dataframes returned by awkward_to_df have nested (entry, subentry) indexes, while the dataframes returned by hepfile_to_df have a single flat index.
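If a flat index is needed for a table with a nested index, pandas can flatten it. The sketch below builds a stand-in for the 'group_name_1' table printed above by hand, so it does not depend on hepfile or awkward; the variable names are illustrative.

```python
import pandas as pd

# Hand-built stand-in for the 'group_name_1' table printed by awkward_to_df above.
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (1, 0), (1, 1)], names=['entry', 'subentry']
)
nested = pd.DataFrame(
    {'x': [1, 2, 3, 4], 'y': [5, 6, 7, 8], 'event_num': [0, 0, 1, 1]}, index=index
)

# Drop both index levels and renumber the rows 0..n-1.
flat = nested.reset_index(drop=True)
print(flat)

# Alternatively, keep the two levels as ordinary columns.
with_levels = nested.reset_index()
print(list(with_levels.columns))
```

Dropping the levels loses nothing here because event_num already records which event each row belongs to. Keeping them as columns preserves the position of each row within its event.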