Source Code Documentation#

hepfile.read#

Functions to assist in reading and accessing information in hepfiles.

hepfile.read.get_file_header(filename: str, return_type: str = 'dict') dict#

Get the file header and return it as a dictionary or dataframe

Parameters:
  • filename (string) – HDF5 file to open and read the header information.

  • return_type (string) – If ‘dict’ return the header information as a dictionary. If ‘df’ or ‘dataframe’, return the information as a pandas dataframe.

Returns:

Dictionary with the header information.

Return type:

dict

Raises:

HeaderNotFound – If there is no header in filename

hepfile.read.get_file_metadata(filename: str) dict#

Get the file metadata and return it as a dictionary

Parameters:

filename (str) – The hepfile to open and read the metadata from.

Returns:

Dictionary of the hepfile’s metadata.

Return type:

dict

Raises:

MetadataNotFound – If there is no metadata in filename

hepfile.read.get_nbuckets_in_data(data: dict) int#

Get the number of buckets in the data dictionary.

This is useful in case you’ve only pulled out subsets of the data

Parameters:

data (dict) – data dictionary of hepfile data to get the number of buckets in

Returns:

number of buckets in data

Return type:

int

Raises:
  • InputError – If data is not a dictionary

  • AttributeError – If the _NUMBER_OF_BUCKETS_ key is not in data. This can happen if the hepfile the data was read from is corrupted, or if you pass in a data dictionary before packing the buckets properly, since the number of buckets is not calculated until the hepfile data dictionary is actually packed.

hepfile.read.get_nbuckets_in_file(filename: str) int#

Get the number of buckets in the file.

Parameters:

filename (str) – filename to count the number of buckets in

Returns:

number of buckets in filename

Return type:

int

Raises:

InputError – if something is wrong with the input filename

hepfile.read.load(filename: str, verbose: bool = False, desired_groups: list[str] = None, subset: int = None, return_type: str = 'dictionary') tuple[dict, dict]#

Reads all, or a subset, of the data from the HDF5 file into a data dictionary. Also returns an empty bucket dictionary to be filled later with data from individual buckets.

Parameters:
  • filename (string) – Name of the input file

  • verbose (boolean) – True if debug output is required

  • desired_groups (list) – Groups to be read from the input file.

  • subset (int) – Number of buckets to be read from input file

  • return_type (str) – Type to return. Options are ‘dictionary’, ‘awkward’, and ‘pandas’. Default is ‘dictionary’. Note: the ‘awkward’ option requires hepfile to be installed with the awkward or all option, and the ‘pandas’ option requires hepfile to be installed with the pandas or all option!

Returns:

Selected data from the HDF5 file, and an empty bucket dictionary to be filled with data from selected buckets

Return type:

tuple(dict, dict)

Raises:
  • InputError – If something is wrong with the specified input.

  • MissingOptionalDependency – If return_type=’awkward’ and awkward isn’t installed or return_type=’pandas’ and pandas isn’t installed.

  • RangeSubsetError – Something is wrong with the input subset range (e.g. the minimum is greater than the maximum)

  • Warning – Usually because something is unexpected in the way the hepfile is stored

hepfile.read.print_file_header(filename: str) str#

Pretty print the file header

Parameters:

filename (str) – filename to retrieve the header from.

Returns:

String representation of the header information, if it exists.

Return type:

str

Raises:

HeaderNotFound – If there is no header in filename

hepfile.read.print_file_metadata(filename: str) str#

Pretty print the file metadata

Parameters:

filename (str) – hepfile to read and print the metadata from.

Returns:

String representation of the hepfile’s metadata.

Return type:

str

Raises:

MetadataNotFound – If there is no metadata in filename

hepfile.read.unpack(bucket: dict, data: dict, entry_num: int = 0)#

Fills the bucket dictionary with selected rows from the data dictionary.

Parameters:
  • bucket (dict) – bucket dictionary to be filled

  • data (dict) – Data dictionary used to fill the bucket dictionary

  • entry_num (integer) – 0 by default. Which entry should be pulled out of the data dictionary and inserted into the bucket dictionary.
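The counter/bucket mechanics behind unpack can be illustrated in plain Python. This is a conceptual sketch, not hepfile's implementation; the "jet", "njet", and "px" names are invented for illustration. A packed data dictionary stores flat arrays plus a per-event counter, and unpack slices one event's rows out into the bucket dictionary.

```python
# Illustrative sketch only: key names are made up, and real hepfile data
# dictionaries contain additional bookkeeping entries.
data = {
    "jet/njet": [2, 1],           # counter: number of jets in each event
    "jet/px":   [1.0, 2.0, 3.0],  # flat array spanning all events
}

def unpack_sketch(data, entry_num=0):
    # The counter tells us where this event's rows start and how many there are.
    start = sum(data["jet/njet"][:entry_num])
    count = data["jet/njet"][entry_num]
    return {"jet/px": data["jet/px"][start:start + count]}

bucket = unpack_sketch(data, entry_num=1)  # pull out the second event
```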

hepfile.write#

Functions to assist in writing a hepfile “from scratch”

hepfile.write.add_meta(data: dict, name: str, meta_data: list)#

Create metadata for a group, singleton, or dataset and add it to data

Parameters:
  • data (dict) – a data object returned by hf.initialize()

  • name (str) – name of either a group, singleton, or dataset the metadata corresponds to. if passing a dataset name, make sure it is the full path (group/dataset)!

  • meta_data (list) – list of metadata to write to that group/dataset/singleton

Raises:

Warning – If the metadata is already in the hepfile data

hepfile.write.clear_bucket(bucket: dict) None#

Clears the data from the bucket dictionary

Parameters:

bucket (dict) – The dictionary to be cleared. This is designed to clear the data from the lists in the bucket dictionary but, in principle, it will clear the lists from any dictionary.

hepfile.write.create_dataset(data: dict, dset_name: list, group: str = None, dtype: type = <class 'float'>, verbose=False, ignore_protected=False)#

Adds a dataset to a group in a dictionary. If the group does not exist, it will be created.

Parameters:
  • data (dict) – Dictionary that contains the group

  • dset_name (list/str) – Name of the dataset, or list of dataset names, to be added to the group.

  • group (string) – Name of group the dataset will be added to. None by default

  • dtype (type) – The data type of the dataset. float by default.

Raises:
  • InputError – If the dataset name is protected

  • Warning – If the code is doing something the user won’t expect (see the message)

hepfile.write.create_group(data: dict, group_name: str, counter: str = None, verbose=False, ignore_protected=False)#

Adds a group to the dictionary

Parameters:
  • data (dict) – Dictionary to which the group will be added

  • group_name (string) – Name of the group to be added

  • counter (string) – Name of the counter key. None by default

Raises:
  • InputError – If the group_name is protected

  • Warning – Usually if the code is doing something to the hepfile the user won’t expect

hepfile.write.create_single_bucket(data: dict) dict#

Creates a bucket dictionary that will be used to collect data and then packed into the master data dictionary.

Parameters:

data (dict) – Data dictionary that will hold all the data from the bucket.

Returns:

The new bucket dictionary with keys and no bucket information

Return type:

dict

hepfile.write.initialize() dict#

Creates an empty data dictionary

Returns:

An empty data dictionary

Return type:

dict

hepfile.write.pack(data: dict, bucket: dict, AUTO_SET_COUNTER: bool = True, EMPTY_OUT_BUCKET: bool = True, STRICT_CHECKING: bool = False, verbose: bool = False)#

Takes the data from a bucket and packs it into the data dictionary, intelligently, so that it can be stored and extracted efficiently. (This is analogous to the ROOT TTree::Fill() member function.)

Note: The data and bucket dictionaries can be made up of either lists or NumPy arrays, but because of how NumPy arrays are saved, this pack function is more time- and space-efficient with Python lists.

Parameters:
  • data (dict) – Data dictionary that holds the entire dataset.

  • bucket (dict) – bucket to be packed into data.

  • EMPTY_OUT_BUCKET (bool) – If True, empty out the bucket container in preparation for the next iteration. Users used to have to do this “by hand”, but it is now done automatically by default; it can be disabled for debugging.

Raises:
  • DatasetSizeDiscrepancy – If STRICT_CHECKING is True and two datasets in a single group have different lengths

  • MissingSingletonValue – If the bucket is missing a singleton value for data. Because of the nature of singletons, every event is expected to have a row in the singleton dataset.
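The bookkeeping that pack performs can be sketched in a few lines of plain Python. This is not the real implementation; the "jet" group and "njet" counter names are invented, and the real function handles counters and singletons for arbitrary groups.

```python
# Minimal conceptual sketch of pack: record the event's counter value, append
# the bucket's rows to flat storage, then empty the bucket (EMPTY_OUT_BUCKET).
def pack_sketch(data, bucket, empty_out_bucket=True):
    data["jet/njet"].append(len(bucket["jet/px"]))  # this event's jet count
    data["jet/px"].extend(bucket["jet/px"])         # flat storage across events
    if empty_out_bucket:
        bucket["jet/px"].clear()                    # ready for the next event

data = {"jet/px": [], "jet/njet": []}
bucket = {"jet/px": [1.0, 2.0]}
pack_sketch(data, bucket)
```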

hepfile.write.write_file_header(filename: str, mydict: dict, verbose: bool = False) File#

Writes header data to a protected group in an HDF5 file.

If there is already header information, it is overwritten by this function.

Parameters:
  • filename (string) – Name of file to write to (file should already exist and the group will be appended to it.)

  • mydict (dictionary) – Header data passed in by user

  • verbose (bool) – True to print out info as it runs

Returns:

Returns the HDF5 file with new metadata

Return type:

h5py.File

Raises:

InputError – If the header data in mydict is nonexistent or the header data can not be converted to a numpy array.

hepfile.write.write_file_metadata(filename: str, mydict: dict = None, write_default_values: bool = True, append: bool = True, verbose: bool = False) File#

Writes file metadata in the attributes of an HDF5 file

Parameters:
  • filename (string) – Name of output file

  • mydict (dictionary) – Metadata desired by user

  • write_default_values (boolean) – True if the user wants to write/update the default metadata (date, hepfile version, h5py version, NumPy version, and Python version); False otherwise.

  • append (boolean) – True if user wants to keep older metadata, false otherwise.

  • verbose (boolean) – True to print out statements as it goes

Returns:

HDF5 File with new metadata

Return type:

h5py.File

hepfile.write.write_to_file(filename: str, data: dict, comp_type: str = None, comp_opts: list = None, force_single_precision: bool = True, verbose: bool = False) File#

Writes the selected data to an HDF5 file

Parameters:
  • filename (string) – Name of output file

  • data (dictionary) – Data to be written into output file

  • comp_type (string) – Type of compression

  • force_single_precision (boolean) – True if data should be written in single precision

Returns:

HDF5 File to which the data has been written

Return type:

h5py.File

Raises:

Warning – If two counters have a different number of entries. This usually means something is wrong with the data dictionary you are trying to write.
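Putting the write functions above together, a typical workflow looks like the following. This is a sketch under the assumption that hepfile is installed; the group/dataset names ("jet", "njet", "px", "py") and the output path "example.h5" are invented. The snippet is guarded with try/except so it degrades gracefully when hepfile is absent.

```python
# Hedged usage sketch of the hepfile.write API documented above.
try:
    import hepfile

    data = hepfile.write.initialize()
    hepfile.write.create_group(data, "jet", counter="njet")
    hepfile.write.create_dataset(data, ["px", "py"], group="jet", dtype=float)

    bucket = hepfile.write.create_single_bucket(data)
    for px, py in [(1.0, 0.5), (2.0, 0.1)]:   # one event with two jets
        bucket["jet/px"].append(px)
        bucket["jet/py"].append(py)
    hepfile.write.pack(data, bucket)           # also empties the bucket by default

    hepfile.write.write_to_file("example.h5", data)
    wrote_file = True
except ImportError:
    wrote_file = False  # hepfile not installed; nothing was written
```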

hepfile.dict_tools#

Functions to help convert dictionaries into hepfiles

hepfile.dict_tools.append(ak_dict: ak.Record, new_dict: dict) ak.Record#

Append a new event to an existing awkward dictionary with events

Note: This tool requires awkward to be installed. Make sure you installed with either:

  1. python -m pip install hepfile[awkward] or,

  2. python -m pip install hepfile[all]

Parameters:
  • ak_dict (ak.Record) – awkward Record of data

  • new_dict (dict) – Dictionary of values to append to ak_dict. All keys must match ak_dict!

Returns:

Awkward Record of awkward arrays with the new_dict appended

Return type:

ak.Record

hepfile.dict_tools.dictlike_to_hepfile(dict_list: list[dict], outfile: str = None, how_to_pack='classic', **kwargs) dict#

This wraps hepfile.awkward_tools.awkward_to_hepfile and writes a list of dictionaries to a hepfile.

Writes a list of dict-like objects to a hepfile. The input must have a specific format:

  • each dict-like object is an “event”

  • first level of dict keys are the groups

  • second level of dict keys are the datasets

  • entries in the second level of the dict object are the data (awkward array or list)

  • data entries in the first level of the dict are singleton objects

Parameters:
  • dict_list (list) – list of dictionaries or dataframes where each dictionary/df holds information on an event

  • outfile (str) – path to write output hepfile to

  • how_to_pack (str) – how to pack the input dataset. Options are ‘awkward’ or ‘classic’; ‘awkward’ calls awkward_to_hepfile while ‘classic’ packs it in the traditional way. Default is ‘classic’. To use how_to_pack=’awkward’, make sure you installed hepfile with the ‘awkward’ or ‘all’ optional dependency!

  • **kwargs – passed to hepfile.write.write_to_file if ‘awkward’. Can only be ‘write_to_hepfile’ and ‘ignore_protected’ if ‘classic’.

Returns:

if how_to_pack=’awkward’ it is an ak.Array; if how_to_pack=’classic’ it is a hepfile-style dictionary

Return type:

ak.Array or dict
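The required input shape for dictlike_to_hepfile can be shown with a plain Python list; the "jet"/"event_weight" group and dataset names here are invented for illustration. Each dict is one event, first-level keys become groups (or singletons for scalar values), and second-level keys become datasets.

```python
# Illustrative event list in the shape dictlike_to_hepfile expects.
events = [
    {"jet": {"px": [1.0, 2.0], "py": [0.5, 0.1]}, "event_weight": 1.0},
    {"jet": {"px": [3.0],      "py": [0.2]},      "event_weight": 0.9},
]
# With hepfile installed, this could then be written out with something like:
# hepfile.dict_tools.dictlike_to_hepfile(events, "out.h5")
```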

hepfile.awkward_tools#

These are tools to make working with and translating between awkward arrays and hepfile data objects easier.

Note: The base installation package does not contain these tools! You must have installed hepfile with either

  1. python -m pip install hepfile[awkward], or

  2. python -m pip install hepfile[all]

hepfile.awkward_tools.awkward_to_hepfile(ak_array: Array, outfile: str = None, write_hepfile: bool = True, **kwargs) dict#

Write an awkward array with depth <= 2 to a hepfile

Parameters:
  • ak_array (ak.Array) – awkward array with fields of groups/singletons. Under the group fields are the dataset fields.

  • outfile (str) – path to where the hepfile should be written. Default is None and can only be None if write_hepfile=False.

  • write_hepfile (bool) – if True, write the hepfile and return the data dictionary. If False, just return the data dictionary without writing. Default is True.

  • **kwargs – passed to hepfile.write_to_file

Returns:

Data dictionary in the hepfile

Return type:

dict

Raises:
  • AwkwardStructureError – If the input awkward array is not formatted properly.

  • InputError – If something is wrong with the specified input

  • Warning – If write_hepfile is False but an output path is still given

hepfile.awkward_tools.hepfile_to_awkward(data: dict, groups: list = None, datasets: list = None) Record#

Converts all (or a subset of) the output data from hepfile.read.load to a dictionary of awkward arrays.

Parameters:
  • data (dict) – Output data dictionary from the hepfile.read.load function.

  • groups (list) – list of groups to pull from data and convert to awkward arrays.

  • datasets (list) – list of full dataset paths (ex. ‘jet/px’ not ‘px’) to pull from data and include in the awkward arrays.

Returns:

dictionary of awkward arrays with the data.

Return type:

dict

Raises:
  • AwkwardStructureError – If something is wrong with the awkward array being outputted

  • Warning – If it is returning an awkward Record rather than an awkward Array

hepfile.awkward_tools.pack_multiple_awkward_arrays(data: dict, arr: Array, group_name: str = None, group_counter_name: str = None) None#

Pack an awkward array of arrays into group_name or the singletons group

Parameters:
  • data (dict) – hepfile data dictionary that is returned from hepfile.initialize()

  • arr (ak.Array) – Awkward array of the group in a set of data

  • group_name (str) – Name of the group to pack arr into; if None (default), it is packed into the singletons group

Raises:

InputError – If the input awkward array doesn’t have any fields; if this happens, consider using pack_single_awkward_array. Also raised if the input is not an awkward array and cannot be converted to one.

hepfile.awkward_tools.pack_single_awkward_array(data: dict, arr: Array, dset_name: str, group_name: str = None, counter: str = None) None#

Packs a 1D awkward array as a dataset or singleton, depending on whether group_name is given

Parameters:
  • data (dict) – data dictionary created by hepfile.initialize()

  • arr (ak.Array) – 1D awkward array to pack as either a dataset or a group. If group_name is None the arr is packed as a singleton

  • dset_name (str) – Full path to the dataset.

  • group_name (str) – name of the group to pack the arr under, default is None

  • counter (str) – name of the counter in the hepfile for this dataset

Raises:

InputError – If it can not extract the datatype of the values in arr

hepfile.df_tools#

Tools to work with Pandas DataFrames and Hepfile data

Note: The base installation package does not contain these tools! You must have installed hepfile with either

  1. python -m pip install hepfile[pandas], or

  2. python -m pip install hepfile[all]

hepfile.df_tools.awkward_to_df(ak_array: ak.Array, groups: list[str] = None, events: list[int] = None) dict[pd.DataFrame]#

Converts an awkward array of hepfile data to a dataframe. Does the same thing as hepfile_to_df but given an awkward array.

Note: You must have installed with python -m pip install hepfile[all] to use this tool!

Parameters:
  • ak_array (ak.Array) – awkward array in the format of a hepfile

  • groups (list) – groups to include, None (default) means include all groups

  • events (list) – list of event indexes to include

Returns:

Dictionary of requested groups as dataframes, where the keys are the group names. If only one group is requested, just that group’s dataframe is returned.

Return type:

dict[pd.DataFrame]

hepfile.df_tools.df_to_hepfile(df_dict: dict[pandas.core.frame.DataFrame], outfile: str = None, event_num_col='event_num', write_hepfile: bool = True) dict#

Converts a dictionary of dataframes of group data to a hepfile. The opposite of hepfile_to_df. Each dataframe must have an event_num column!

Parameters:
  • df_dict (dict) – dictionary of pandas DataFrame groups to write to a hepfile

  • outfile (str) – output file name, required if write_hepfile is True

  • event_num_col (str) – name of a column in the pd.DataFrame to group by

  • write_hepfile (bool) – should we write the hepfile data to a hepfile?

Returns:

hepfile data dictionary

Return type:

dict

Raises:

InputError – If something is wrong with the specified input.

hepfile.df_tools.groups_to_events(df_dict: dict[pandas.core.frame.DataFrame], event_num_col: str = 'event_num') dict#

Converts a dictionary of group dataframes to a dictionary of event dataframes

Parameters:
  • df_dict (dict) – dictionary of group dataframes to convert to a dictionary of events

  • event_num_col (str) – column to group each group by

Returns:

dictionary of pandas dataframes organized by events

Return type:

dict[pd.DataFrame]

Raises:

InputError – Something is wrong with the input dictionary
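The event_num regrouping idea behind groups_to_events can be sketched without pandas; the "px" column name here is invented. Rows belonging to a group table are bucketed by their event_num value, turning a per-group layout into a per-event one.

```python
# Pure-Python sketch of regrouping group rows by event_num (column names invented).
rows = [
    {"event_num": 0, "px": 1.0},
    {"event_num": 0, "px": 2.0},
    {"event_num": 1, "px": 3.0},
]
events = {}
for row in rows:
    # Collect every row with the same event_num under one event entry.
    events.setdefault(row["event_num"], []).append(row["px"])
```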

hepfile.df_tools.hepfile_to_df(data: dict, groups: list[str] = None, events: list[int] = None) dict[pandas.core.frame.DataFrame]#

Converts hepfile data to dataframes, where each group is in its own dataframe and an extra column called ‘event_num’ is added. Singletons get their own dataframe.

Parameters:
  • data (dict) – data object either loaded from a hepfile or about to be written to a hepfile.

  • groups (list) – groups to include, None (default) means include all groups

  • events (list) – list of event indexes to include

Returns:

Dictionary of requested groups as dataframes, where the keys are the group names. If only one group is requested, just that group’s dataframe is returned.

Return type:

dict[pd.DataFrame]

Raises:

InputError – Something is wrong with the specified input

hepfile.csv_tools#

Tools to help with managing csvs with hepfile

Note: The base installation package does not contain these tools! You must have installed hepfile with either

  1. python -m pip install hepfile[pandas], or

  2. python -m pip install hepfile[all]

hepfile.csv_tools.csv_to_hepfile(csvpaths: list[str], common_key: str, outfile: str | None = None, group_names: list | None = None, write_hepfile: bool = True) tuple[str, dict]#

Convert a list of csvs to a hepfile

This is helpful for converting database-like csvs to a hepfile where each input csv has a common key and can be combined into a large table.

Parameters:
  • csvpaths (list[str]) – list of absolute paths to the csvs to convert to a hepfile

  • common_key (str) – The csvs listed above should share a common column; give the name of that column

  • outfile (str) – The output file name, if None data is written to the first filepath in csvpaths with ‘csv’ replaced with ‘h5’

  • group_names (list) – the names for the groups in the hepfile. Default is None and the groups are based on the filenames

  • write_hepfile (bool) – if True, write the hepfile. Default is True.

Returns:

path to the output hepfile, Dictionary of hepfile data

Return type:

Tuple(str, dict)

Raises:

InputError – If something is wrong with the specified input.
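The "database-like csvs with a common key" idea can be shown with the standard library alone; the file contents and the "id" column name below are invented. csv_to_hepfile combines tables like these on common_key, so the key column is what lines rows up across files.

```python
import csv
import io

# Two database-like CSVs sharing the common key column "id" (contents invented).
people_csv = "id,name\n1,ada\n2,lin\n"
visits_csv = "id,n_visits\n1,3\n2,7\n"

people = list(csv.DictReader(io.StringIO(people_csv)))
visits = list(csv.DictReader(io.StringIO(visits_csv)))

# The shared key values are what allow the tables to be combined:
common_ids = {row["id"] for row in people} & {row["id"] for row in visits}
```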

hepfile.errors#

Custom exception messages

exception hepfile.errors.AwkwardStructureError#

Thrown when the structure of an Awkward Array is not appropriate for further processing.

exception hepfile.errors.DatasetSizeDiscrepancy#

Thrown when two datasets under one group do not have the same length. This is usually not appropriate for hepfiles.

exception hepfile.errors.DictStructureError#

Thrown when the structure of a dictionary is not appropriate for further processing.

exception hepfile.errors.HeaderNotFound#

Thrown when there is no header found for a hepfile even though the user has requested it.

exception hepfile.errors.InputError#

General error to describe when the input value of a function in the module is either the wrong type or incorrectly formatted. Note: This replaces Python’s built-in IOError, which is now deprecated in favor of the more general and less informative OSError.

exception hepfile.errors.MetadataNotFound#

Thrown when there is no metadata found for a hepfile even though the user has requested it.

exception hepfile.errors.MissingOptionalDependency(module)#

Thrown when the user tries to use an optional part of the package that was not installed.

Parameters:

module (str) – name of the missing optional dependency that needs to be installed by the user.

exception hepfile.errors.MissingSingletonValue#

Thrown when we try to pack a bucket into a hepfile data dictionary and no singleton value is found in the new bucket.

exception hepfile.errors.RangeSubsetError#

Thrown when the input range is incorrectly formatted. See the error for more details about what exactly is incorrect.