Source Code Documentation#

hepfile.read#

Functions to assist in reading and accessing information in hepfiles.

hepfile.read.get_file_header(filename: str, return_type: str = 'dict') dict#

Get the file header and return it as a dictionary or dataframe

Parameters:
  • filename (string) – HDF5 file to open and read the header information.

  • return_type (string) – If ‘dict’ return the header information as a dictionary. If ‘df’ or ‘dataframe’, return the information as a pandas dataframe.

Returns:

Dictionary with the header information.

Return type:

dict

Raises:

HeaderNotFound – If there is no header in filename

hepfile.read.get_file_metadata(filename: str) dict#

Get the file metadata and return it as a dictionary

Parameters:

filename (str) – The hepfile to open and read the metadata from.

Returns:

Dictionary of the hepfile’s metadata.

Return type:

dict

Raises:

MetadataNotFound – If there is no metadata in filename

hepfile.read.get_nbuckets_in_data(data: dict) int#

Get the number of buckets in the data dictionary.

This is useful in case you’ve only pulled out subsets of the data

Parameters:

data (dict) – data dictionary of hepfile data to get the number of buckets in

Returns:

number of buckets in data

Return type:

int

Raises:
  • InputError – If data is not a dictionary

  • AttributeError – If the _NUMBER_OF_BUCKETS_ key is not in data. This can happen if the hepfile the data was read from is corrupted, or if you pass in a data dictionary before packing the buckets properly, since the number of buckets is not calculated until the hepfile data dictionary is actually packed.

hepfile.read.get_nbuckets_in_file(filename: str) int#

Get the number of buckets in the file.

Parameters:

filename (str) – filename to count the number of buckets in

Returns:

number of buckets in filename

Return type:

int

Raises:

InputError – if something is wrong with the input filename

hepfile.read.load(filename: str, verbose: bool = False, desired_groups: list[str] = None, subset: int = None, return_type: str = 'dictionary') tuple[dict, dict]#

Reads all, or a subset, of the data from the HDF5 file into a data dictionary. Also returns an empty bucket dictionary to be filled later with data from individual buckets.

Parameters:
  • filename (string) – Name of the input file

  • verbose (boolean) – True if debug output is required

  • desired_groups (list) – Groups to be read from the input file.

  • subset (int) – Number of buckets to be read from input file

  • return_type (str) – Type to return. Options are ‘dictionary’, ‘awkward’, and ‘pandas’. Default is ‘dictionary’. Note: the ‘awkward’ option requires hepfile to be installed with the awkward or all option, and the ‘pandas’ option requires hepfile to be installed with the pandas or all option!

Returns:

Selected data from the HDF5 file, and an empty bucket dictionary to be filled with data from selected buckets

Return type:

tuple(dict, dict)

Raises:
  • InputError – If something is wrong with the specified input.

  • MissingOptionalDependency – If return_type=’awkward’ and awkward isn’t installed or return_type=’pandas’ and pandas isn’t installed.

  • RangeSubsetError – Something is wrong with the input subset range (e.g. the minimum is greater than the maximum)

  • Warning – Usually because something is unexpected in the way the hepfile is stored

hepfile.read.print_file_header(filename: str) str#

Pretty print the file header

Parameters:

filename (str) – filename to retrieve the header from.

Returns:

String representation of the header information, if it exists.

Return type:

str

Raises:

HeaderNotFound – If there is no header in filename

hepfile.read.print_file_metadata(filename: str) str#

Pretty print the file metadata

Parameters:

filename (str) – hepfile to read and print the metadata from.

Returns:

String representation of the hepfile’s metadata.

Return type:

str

Raises:

MetadataNotFound – If there is no metadata in filename

hepfile.read.unpack(bucket: dict, data: dict, entry_num: int = 0)#

Fills the bucket dictionary with selected rows from the data dictionary.

Parameters:
  • bucket (dict) – bucket dictionary to be filled

  • data (dict) – Data dictionary used to fill the bucket dictionary

  • entry_num (integer) – 0 by default. Which entry should be pulled out of the data dictionary and inserted into the bucket dictionary.
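The counter/bucket mechanics behind unpack can be illustrated in plain Python. This is a conceptual sketch, not hepfile's implementation; the "jet", "njet", and "px" names are invented for illustration. A packed data dictionary stores flat arrays plus a per-event counter, and unpack slices one event's rows out into the bucket dictionary.

```python
# Illustrative sketch only: key names are made up, and real hepfile data
# dictionaries contain additional bookkeeping entries.
data = {
    "jet/njet": [2, 1],           # counter: number of jets in each event
    "jet/px":   [1.0, 2.0, 3.0],  # flat array spanning all events
}

def unpack_sketch(data, entry_num=0):
    # The counter tells us where this event's rows start and how many there are.
    start = sum(data["jet/njet"][:entry_num])
    count = data["jet/njet"][entry_num]
    return {"jet/px": data["jet/px"][start:start + count]}

bucket = unpack_sketch(data, entry_num=1)  # pull out the second event
```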

hepfile.write#

Functions to assist in writing a hepfile “from scratch”

hepfile.write.add_meta(data: dict, name: str, meta_data: list)#

Create metadata for a group, singleton, or dataset and add it to data

Parameters:
  • data (dict) – a data object returned by hf.initialize()

  • name (str) – name of either a group, singleton, or dataset the metadata corresponds to. if passing a dataset name, make sure it is the full path (group/dataset)!

  • meta_data (list) – list of metadata to write to that group/dataset/singleton

Raises:

Warning – If the metadata is already in the hepfile data

hepfile.write.clear_bucket(bucket: dict) None#

Clears the data from the bucket dictionary

Parameters:

bucket (dict) – The dictionary to be cleared. This is designed to clear the data from the lists in the bucket dictionary but, in principle, it will clear the lists from any dictionary.

hepfile.write.create_dataset(data: dict, dset_name: list, group: str = None, dtype: type = <class 'float'>, verbose=False, ignore_protected=False)#

Adds a dataset to a group in a dictionary. If the group does not exist, it will be created.

Parameters:
  • data (dict) – Dictionary that contains the group

  • dset_name (list/str) – Name of the dataset, or list of dataset names, to be added to the group.

  • group (string) – Name of group the dataset will be added to. None by default

  • dtype (type) – The data type of the dataset. float by default.

Raises:
  • InputError – If the dataset name is protected

  • Warning – If the code is doing something the user won’t expect (see the message)

hepfile.write.create_group(data: dict, group_name: str, counter: str = None, verbose=False, ignore_protected=False)#

Adds a group to the dictionary

Parameters:
  • data (dict) – Dictionary to which the group will be added

  • group_name (string) – Name of the group to be added

  • counter (string) – Name of the counter key. None by default

Raises:
  • InputError – If the group_name is protected

  • Warning – Usually if the code is doing something to the hepfile the user won’t expect

hepfile.write.create_single_bucket(data: dict) dict#

Creates a bucket dictionary that will be used to collect data and then packed into the master data dictionary.

Parameters:

data (dict) – Data dictionary that will hold all the data from the bucket.

Returns:

The new bucket dictionary with keys and no bucket information

Return type:

dict

hepfile.write.initialize() dict#

Creates an empty data dictionary

Returns:

An empty data dictionary

Return type:

dict

hepfile.write.pack(data: dict, bucket: dict, AUTO_SET_COUNTER: bool = True, EMPTY_OUT_BUCKET: bool = True, STRICT_CHECKING: bool = False, verbose: bool = False)#

Takes the data from a bucket and packs it into the data dictionary, intelligently, so that it can be stored and extracted efficiently. (This is analogous to the ROOT TTree::Fill() member function.)

Note: The data and bucket dictionaries can be made up of either lists or NumPy arrays, but because of how NumPy arrays are saved, this pack function is more time- and space-efficient with Python lists.

Parameters:
  • data (dict) – Data dictionary that holds the entire dataset.

  • bucket (dict) – bucket to be packed into data.

  • EMPTY_OUT_BUCKET (bool) – If True, empty out the bucket container in preparation for the next iteration. Users used to have to do this “by hand”, but it is now done automatically by default; it can be disabled for debugging.

Raises:
  • DatasetSizeDiscrepancy – If STRICT_CHECKING is True and two datasets in a single group have different lengths

  • MissingSingletonValue – If the bucket is missing a singleton value for data. Because of the nature of singletons, every event is expected to have a row in the singleton dataset.
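The bookkeeping that pack performs can be sketched in a few lines of plain Python. This is not the real implementation; the "jet" group and "njet" counter names are invented, and the real function handles counters and singletons for arbitrary groups.

```python
# Minimal conceptual sketch of pack: record the event's counter value, append
# the bucket's rows to flat storage, then empty the bucket (EMPTY_OUT_BUCKET).
def pack_sketch(data, bucket, empty_out_bucket=True):
    data["jet/njet"].append(len(bucket["jet/px"]))  # this event's jet count
    data["jet/px"].extend(bucket["jet/px"])         # flat storage across events
    if empty_out_bucket:
        bucket["jet/px"].clear()                    # ready for the next event

data = {"jet/px": [], "jet/njet": []}
bucket = {"jet/px": [1.0, 2.0]}
pack_sketch(data, bucket)
```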

hepfile.write.write_file_header(filename: str, mydict: dict, verbose: bool = False) File#

Writes header data to a protected group in an HDF5 file.

If there is already header information, it is overwritten by this function.

Parameters:
  • filename (string) – Name of file to write to (file should already exist and the group will be appended to it.)

  • mydict (dictionary) – Header data passed in by user

  • verbose (bool) – True to print out info as it runs

Returns:

Returns the HDF5 file with new metadata

Return type:

h5py.File

Raises:

InputError – If the header data in mydict is nonexistent or the header data can not be converted to a numpy array.

hepfile.write.write_file_metadata(filename: str, mydict: dict = None, write_default_values: bool = True, append: bool = True, verbose: bool = False) File#

Writes file metadata in the attributes of an HDF5 file

Parameters:
  • filename (string) – Name of output file

  • mydict (dictionary) – Metadata desired by user

  • write_default_values (boolean) – True if the user wants to write/update the default metadata (date, hepfile version, h5py version, NumPy version, and Python version); False otherwise.

  • append (boolean) – True if user wants to keep older metadata, false otherwise.

  • verbose (boolean) – True to print out statements as it goes

Returns:

HDF5 File with new metadata

Return type:

h5py.File

hepfile.write.write_to_file(filename: str, data: dict, comp_type: str = None, comp_opts: list = None, force_single_precision: bool = True, verbose: bool = False) File#

Writes the selected data to an HDF5 file

Parameters:
  • filename (string) – Name of output file

  • data (dictionary) – Data to be written into output file

  • comp_type (string) – Type of compression

  • force_single_precision (boolean) – True if data should be written in single precision

Returns:

HDF5 File to which the data has been written

Return type:

h5py.File

Raises:

Warning – If two counters have a different number of entries. This usually means something is wrong with the data dictionary you are trying to write.
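Putting the write functions above together, a typical workflow looks like the following. This is a sketch under the assumption that hepfile is installed; the group/dataset names ("jet", "njet", "px", "py") and the output path "example.h5" are invented. The snippet is guarded with try/except so it degrades gracefully when hepfile is absent.

```python
# Hedged usage sketch of the hepfile.write API documented above.
try:
    import hepfile

    data = hepfile.write.initialize()
    hepfile.write.create_group(data, "jet", counter="njet")
    hepfile.write.create_dataset(data, ["px", "py"], group="jet", dtype=float)

    bucket = hepfile.write.create_single_bucket(data)
    for px, py in [(1.0, 0.5), (2.0, 0.1)]:   # one event with two jets
        bucket["jet/px"].append(px)
        bucket["jet/py"].append(py)
    hepfile.write.pack(data, bucket)           # also empties the bucket by default

    hepfile.write.write_to_file("example.h5", data)
    wrote_file = True
except ImportError:
    wrote_file = False  # hepfile not installed; nothing was written
```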

hepfile.dict_tools#

Functions to help convert dictionaries into hepfiles

hepfile.dict_tools.append(ak_dict: ak.Record, new_dict: dict) ak.Record#

Append a new event to an existing awkward dictionary with events

Note: This tool requires awkward to be installed. Make sure you installed with either:

  1. python -m pip install hepfile[awkward] or,

  2. python -m pip install hepfile[all]

Parameters:
  • ak_dict (ak.Record) – awkward Record of data

  • new_dict (dict) – Dictionary of values to append to ak_dict. All keys must match ak_dict!

Returns:

Awkward Record of awkward arrays with the new_dict appended

Return type:

ak.Record

hepfile.dict_tools.dictlike_to_hepfile(dict_list: list[dict], outfile: str = None, how_to_pack='classic', **kwargs) dict#

This wraps hepfile.awkward_tools.awkward_to_hepfile and writes a list of dictionaries to a hepfile.

Writes a list of dict-like objects to a hepfile. The input must have a specific format:

  • each dict-like object is an “event”

  • first level of dict keys are the groups

  • second level of dict keys are the datasets

  • entries in the second level of the dict object are the data (awkward array or list)

  • data entries in the first level of the dict are singleton objects

Parameters:
  • dict_list (list) – list of dictionaries or dataframes where each dictionary/df holds information on an event

  • outfile (str) – path to write output hepfile to

  • how_to_pack (str) – how to pack the input dataset. Options are ‘awkward’ or ‘classic’; ‘awkward’ calls awkward_to_hepfile while ‘classic’ packs it in the traditional way. Default is ‘classic’. To use how_to_pack=’awkward’, make sure you installed hepfile with the ‘awkward’ or ‘all’ optional dependency!

  • **kwargs – passed to hepfile.write.write_to_file if ‘awkward’. Can only be ‘write_to_hepfile’ and ‘ignore_protected’ if ‘classic’.

Returns:

if how_to_pack=’awkward’ it is an ak.Array; if how_to_pack=’classic’ it is a hepfile-style dictionary

Return type:

ak.Array or dict
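The required input shape for dictlike_to_hepfile can be shown with a plain Python list; the "jet"/"event_weight" group and dataset names here are invented for illustration. Each dict is one event, first-level keys become groups (or singletons for scalar values), and second-level keys become datasets.

```python
# Illustrative event list in the shape dictlike_to_hepfile expects.
events = [
    {"jet": {"px": [1.0, 2.0], "py": [0.5, 0.1]}, "event_weight": 1.0},
    {"jet": {"px": [3.0],      "py": [0.2]},      "event_weight": 0.9},
]
# With hepfile installed, this could then be written out with something like:
# hepfile.dict_tools.dictlike_to_hepfile(events, "out.h5")
```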

hepfile.awkward_tools#

These are tools to make working with and translating between awkward arrays and hepfile data objects easier.

Note: The base installation package does not contain these tools! You must have installed hepfile with either

  1. python -m pip install hepfile[awkward], or

  2. python -m pip install hepfile[all]

hepfile.awkward_tools.awkward_to_hepfile(ak_array: Array, outfile: str = None, write_hepfile: bool = True, **kwargs) dict#

Write an awkward array with depth <= 2 to a hepfile

Parameters:
  • ak_array (ak.Array) – awkward array with fields of groups/singletons. Under the group fields are the dataset fields.

  • outfile (str) – path to where the hepfile should be written. Default is None and can only be None if write_hepfile=False.

  • write_hepfile (bool) – if True, write the hepfile and return the data dictionary. If False, just return the data dictionary without writing. Default is True.

  • **kwargs – passed to hepfile.write_to_file

Returns:

Data dictionary in the hepfile

Return type:

dict

Raises:
  • AwkwardStructureError – If the input awkward array is not formatted properly.

  • InputError – If something is wrong with the specified input

  • Warning – If write_hepfile is False but an output path is still given

hepfile.awkward_tools.hepfile_to_awkward(data: dict, groups: list = None, datasets: list = None) Record#

Converts all (or a subset of) the output data from hepfile.read.load to a dictionary of awkward arrays.

Parameters:
  • data (dict) – Output data dictionary from the hepfile.read.load function.

  • groups (list) – list of groups to pull from data and convert to awkward arrays.

  • datasets (list) – list of full dataset paths (ex. ‘jet/px’ not ‘px’) to pull from data and include in the awkward arrays.

Returns:

dictionary of awkward arrays with the data.

Return type:

dict

Raises:
  • AwkwardStructureError – If something is wrong with the awkward array being outputted

  • Warning – If it is returning an awkward Record rather than an awkward Array

hepfile.awkward_tools.pack_multiple_awkward_arrays(data: dict, arr: Array, group_name: str = None, group_counter_name: str = None) None#

Pack an awkward array of arrays into group_name or the singletons group

Parameters:
  • data (dict) – hepfile data dictionary that is returned from hepfile.initialize()

  • arr (ak.Array) – Awkward array of the group in a set of data

  • group_name (str) – Name of the group to pack arr into; if None (default), it is packed into the singletons group

Raises:

InputError – If the input awkward array doesn’t have any fields; if this happens, consider using pack_single_awkward_array. Also raised if the input is not an awkward array and cannot be converted to one.

hepfile.awkward_tools.pack_single_awkward_array(data: dict, arr: Array, dset_name: str, group_name: str = None, counter: str = None) None#

Packs a 1D awkward array as a dataset or singleton, depending on whether group_name is given

Parameters:
  • data (dict) – data dictionary created by hepfile.initialize()

  • arr (ak.Array) – 1D awkward array to pack as either a dataset or a group. If group_name is None the arr is packed as a singleton

  • dset_name (str) – Full path to the dataset.

  • group_name (str) – name of the group to pack the arr under, default is None

  • counter (str) – name of the counter in the hepfile for this dataset

Raises:

InputError – If it can not extract the datatype of the values in arr

hepfile.df_tools#

Tools to work with Pandas DataFrames and Hepfile data

Note: The base installation package does not contain these tools! You must have installed hepfile with either

  1. python -m pip install hepfile[pandas], or

  2. python -m pip install hepfile[all]

hepfile.df_tools.awkward_to_df(ak_array: ak.Array, groups: list[str] = None, events: list[int] = None) dict[pd.DataFrame]#

Converts an awkward array of hepfile data to a dataframe. Does the same thing as hepfile_to_df but given an awkward array.

Note: You must have installed with python -m pip install hepfile[all] to use this tool!

Parameters:
  • ak_array (ak.Array) – awkward array in the format of a hepfile

  • groups (list) – groups to include, None (default) means include all groups

  • events (list) – list of event indexes to include

Returns:

Dictionary of requested groups as dataframes, where the keys are the group names. If only one group is requested, just that group’s dataframe is returned.

Return type:

dict[pd.DataFrame]

hepfile.df_tools.df_to_hepfile(df_dict: dict[pandas.core.frame.DataFrame], outfile: str = None, event_num_col='event_num', write_hepfile: bool = True) dict#

Converts a dictionary of dataframes of group data to a hepfile. The opposite of hepfile_to_df. Each dataframe must have an event_num column!

Parameters:
  • df_dict (dict) – dictionary of pandas DataFrame groups to write to a hepfile

  • outfile (str) – output file name, required if write_hepfile is True

  • event_num_col (str) – name of a column in the pd.DataFrame to group by

  • write_hepfile (bool) – should we write the hepfile data to a hepfile?

Returns:

hepfile data dictionary

Return type:

dict

Raises:

InputError – If something is wrong with the specified input.

hepfile.df_tools.groups_to_events(df_dict: dict[pandas.core.frame.DataFrame], event_num_col: str = 'event_num') dict#

Converts a dictionary of group dataframes to a dictionary of event dataframes

Parameters:
  • df_dict (dict) – dictionary of group dataframes to convert to a dictionary of events

  • event_num_col (str) – column to group each group by

Returns:

dictionary of pandas dataframes organized by events

Return type:

dict[pd.DataFrame]

Raises:

InputError – Something is wrong with the input dictionary
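The event_num regrouping idea behind groups_to_events can be sketched without pandas; the "px" column name here is invented. Rows belonging to a group table are bucketed by their event_num value, turning a per-group layout into a per-event one.

```python
# Pure-Python sketch of regrouping group rows by event_num (column names invented).
rows = [
    {"event_num": 0, "px": 1.0},
    {"event_num": 0, "px": 2.0},
    {"event_num": 1, "px": 3.0},
]
events = {}
for row in rows:
    # Collect every row with the same event_num under one event entry.
    events.setdefault(row["event_num"], []).append(row["px"])
```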

hepfile.df_tools.hepfile_to_df(data: dict, groups: list[str] = None, events: list[int] = None) dict[pandas.core.frame.DataFrame]#

Converts hepfile data to dataframes, where each group is in its own dataframe and an extra column called ‘event_num’ is added. Singletons get their own dataframe.

Parameters:
  • data (dict) – data object either loaded from a hepfile or about to be written to a hepfile.

  • groups (list) – groups to include, None (default) means include all groups

  • events (list) – list of event indexes to include

Returns:

Dictionary of requested groups as dataframes, where the keys are the group names. If only one group is requested, just that group’s dataframe is returned.

Return type:

dict[pd.DataFrame]

Raises:

InputError – Something is wrong with the specified input

hepfile.csv_tools#

Tools to help with managing csvs with hepfile

Note: The base installation package does not contain these tools! You must have installed hepfile with either

  1. python -m pip install hepfile[pandas], or

  2. python -m pip install hepfile[all]

hepfile.csv_tools.csv_to_hepfile(csvpaths: list[str], common_key: str, outfile: str | None = None, group_names: list | None = None, write_hepfile: bool = True) tuple[str, dict]#

Convert a list of csvs to a hepfile

This is helpful for converting database-like csvs to a hepfile where each input csv has a common key and can be combined into a large table.

Parameters:
  • csvpaths (list[str]) – list of absolute paths to the csvs to convert to a hepfile

  • common_key (str) – The csvs listed above should share a common column; give the name of that column

  • outfile (str) – The output file name, if None data is written to the first filepath in csvpaths with ‘csv’ replaced with ‘h5’

  • group_names (list) – the names for the groups in the hepfile. Default is None and the groups are based on the filenames

  • write_hepfile (bool) – if True, write the hepfile. Default is True.

Returns:

path to the output hepfile, Dictionary of hepfile data

Return type:

Tuple(str, dict)

Raises:

InputError – If something is wrong with the specified input.
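The "database-like csvs with a common key" idea can be shown with the standard library alone; the file contents and the "id" column name below are invented. csv_to_hepfile combines tables like these on common_key, so the key column is what lines rows up across files.

```python
import csv
import io

# Two database-like CSVs sharing the common key column "id" (contents invented).
people_csv = "id,name\n1,ada\n2,lin\n"
visits_csv = "id,n_visits\n1,3\n2,7\n"

people = list(csv.DictReader(io.StringIO(people_csv)))
visits = list(csv.DictReader(io.StringIO(visits_csv)))

# The shared key values are what allow the tables to be combined:
common_ids = {row["id"] for row in people} & {row["id"] for row in visits}
```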

hepfile.errors#

Custom exception messages

exception hepfile.errors.AwkwardStructureError#

Thrown when the structure of an Awkward Array is not appropriate for further processing.

exception hepfile.errors.DatasetSizeDiscrepancy#

Thrown when two datasets under one group do not have the same length. This is usually not appropriate for hepfiles.

exception hepfile.errors.DictStructureError#

Thrown when the structure of a dictionary is not appropriate for further processing.

exception hepfile.errors.HeaderNotFound#

Thrown when there is no header found for a hepfile even though the user has requested it.

exception hepfile.errors.InputError#

General error to describe when the input value of a function in the module is either the wrong type or incorrectly formatted. Note: This replaces Python’s built-in IOError, which is now deprecated in favor of the more general and less informative OSError.

exception hepfile.errors.MetadataNotFound#

Thrown when there is no metadata found for a hepfile even though the user has requested it.

exception hepfile.errors.MissingOptionalDependency(module)#

Thrown when the user tries to use an optional part of the package that was not installed.

Parameters:

module (str) – name of the missing optional dependency that needs to be installed by the user.

exception hepfile.errors.MissingSingletonValue#

Thrown when we try to pack a bucket into a hepfile data dictionary and no singleton value is found in the new bucket.

exception hepfile.errors.RangeSubsetError#

Thrown when the input range is incorrectly formatted. See the error for more details about what exactly is incorrect.