Real Example: Housing Data#

This is a real example using Housing Data and demonstrates the hepfile.csv_tools module!

[1]:
import hepfile as hf
import pandas as pd

Before moving on with the tutorial, make sure you have downloaded the following datasets using the wget command. This only needs to be run once.

Also, make sure you review the following link on the hepfile readthedocs page to get some context: https://hepfile.readthedocs.io/en/latest/introduction.html#overview-of-use-case

[2]:
!wget -nc -O 'People.csv' 'https://raw.githubusercontent.com/mattbellis/hepfile/main/docs/example_nb/People.csv'
!wget -nc -O 'Vehicles.csv' 'https://raw.githubusercontent.com/mattbellis/hepfile/main/docs/example_nb/Vehicles.csv'
!wget -nc -O 'Residences.csv' 'https://raw.githubusercontent.com/mattbellis/hepfile/main/docs/example_nb/Residences.csv'
File ‘People.csv’ already there; not retrieving.
File ‘Vehicles.csv’ already there; not retrieving.
File ‘Residences.csv’ already there; not retrieving.

The next step is to define a list of all of these filepaths

[3]:
filepaths = ['People.csv', 'Vehicles.csv', 'Residences.csv']

For the sake of completeness, let’s take a look at these datasets

[4]:
import pandas as pd
for f in filepaths:
    print(f + ':\n')
    print(pd.read_csv(f).to_markdown())
    print()
People.csv:

|    |   Household ID | First name   | Last name   | Gender ID   |   Age |   Height |   Yearly income | Highest degree/grade   |
|---:|---------------:|:-------------|:------------|:------------|------:|---------:|----------------:|:-----------------------|
|  0 |              0 | blah         | blah        | M           |    54 |      159 |           75000 | BS                     |
|  1 |              0 | blah         | blah        | F           |    52 |      140 |           80000 | MS                     |
|  2 |              0 | blah         | blah        | NB          |    18 |      168 |               0 | 12                     |
|  3 |              0 | blah         | blah        | F           |    14 |      150 |               0 | 9                      |
|  4 |              1 | blah         | blah        | M           |    32 |      159 |           49000 | BS                     |
|  5 |              1 | blah         | blah        | M           |    27 |      140 |           40000 | BS                     |
|  6 |              2 | blah         | blah        | F           |    65 |      140 |           40000 | BS                     |
|  7 |              3 | blah         | blah        | M           |    54 |      159 |           75000 | BS                     |
|  8 |              3 | blah         | blah        | F           |    52 |      140 |           80000 | MS                     |
|  9 |              3 | blah         | blah        | NB          |    18 |      168 |               0 | 12                     |
| 10 |              3 | blah         | blah        | F           |    14 |      150 |               0 | 9                      |
| 11 |              3 | blah         | blah        | M           |    12 |      159 |               0 | 8                      |
| 12 |              3 | blah         | blah        | M           |    11 |      140 |               0 | 7                      |
| 13 |              3 | blah         | blah        | F           |    65 |      140 |               0 | 12                     |

Vehicles.csv:

|    |   Household ID | Type of vehicle   |   # of riders | Gas/electric/human powered   |   Year |   Cost |
|---:|---------------:|:------------------|--------------:|:-----------------------------|-------:|-------:|
|  0 |              0 | Car               |             4 | Gas                          |   2005 |  25000 |
|  1 |              0 | Car               |             5 | Electric                     |   2018 |  40000 |
|  2 |              0 | Bike              |             1 | Human                        |   2015 |    500 |
|  3 |              0 | Bike              |             1 | Human                        |   2015 |    500 |
|  4 |              0 | Bike              |             1 | Human                        |   2015 |    500 |
|  5 |              0 | Bike              |             1 | Human                        |   2015 |    500 |
|  6 |              2 | Car               |             7 | Gas                          |   2012 |  45000 |
|  7 |              3 | Car               |             2 | Gas                          |   2005 |  25000 |
|  8 |              3 | Car               |             7 | Gas                          |   2018 |  40000 |

Residences.csv:

|    |   Household ID | House/apartment/condo   |   # of bedrooms |   # of bathrooms |   Square footage |   Year built |   Estimate |
|---:|---------------:|:------------------------|----------------:|-----------------:|-----------------:|-------------:|-----------:|
|  0 |              0 | House                   |               4 |              2.5 |             1500 |         1955 |     250000 |
|  1 |              1 | Apartment               |               2 |              2   |             1200 |         2002 |       1400 |
|  2 |              2 | Condo                   |               2 |              1   |             1000 |         2014 |     325000 |
|  3 |              3 | House                   |               6 |              4.5 |             4500 |         1998 |     500000 |

So there is a lot of different columns in these three csvs but it looks like they are all connected by the common key Household ID. This is similar to a database structure where each csv has a different length but are connected by a common ID. This makes these files perfect for being stored in a hepfile!

If we want to go straight to writing a hepfile instead of just creating an awkward array of the data, we can use the hepfile.csv_tools.csv_to_hepfile method. This takes a list of csv filepaths and a common key to merge by.

[5]:
outfilename, hepfile = hf.csv_tools.csv_to_hepfile(filepaths, common_key='Household ID', group_names=['People', 'Vehicles', 'Residences'])
print()
print('#########################################')
print(f'Output File Name: {outfilename}')

#########################################
Output File Name: People.h5
/home/nfranz/research/hepfile/src/hepfile/write.py:266: UserWarning: ----------------------------------------------------
Slashes / are not allowed in dataset names
Replacing / with - in dataset name Highest degree/grade
The new name will be Highest degree-grade
----------------------------------------------------
  warnings.warn(warning)
/home/nfranz/research/hepfile/src/hepfile/write.py:266: UserWarning: ----------------------------------------------------
Slashes / are not allowed in dataset names
Replacing / with - in dataset name Gas/electric/human powered
The new name will be Gas-electric-human powered
----------------------------------------------------
  warnings.warn(warning)
/home/nfranz/research/hepfile/src/hepfile/write.py:266: UserWarning: ----------------------------------------------------
Slashes / are not allowed in dataset names
Replacing / with - in dataset name House/apartment/condo
The new name will be House-apartment-condo
----------------------------------------------------
  warnings.warn(warning)

Notice how the outfile name is the name of the first csv file with csv replaced with h5. Sometimes, this works but other times you may want to provide a more specific output file name. Use the outfile flag to do this.

[6]:
outfilename, hepfile = hf.csv_tools.csv_to_hepfile(filepaths, common_key='Household ID', outfile='test.h5', group_names=['People', 'Vehicles', 'Residences'])
print()
print('#########################################')
print(f'Output File Name: {outfilename}')

#########################################
Output File Name: test.h5
/home/nfranz/research/hepfile/src/hepfile/write.py:266: UserWarning: ----------------------------------------------------
Slashes / are not allowed in dataset names
Replacing / with - in dataset name Highest degree/grade
The new name will be Highest degree-grade
----------------------------------------------------
  warnings.warn(warning)
/home/nfranz/research/hepfile/src/hepfile/write.py:266: UserWarning: ----------------------------------------------------
Slashes / are not allowed in dataset names
Replacing / with - in dataset name Gas/electric/human powered
The new name will be Gas-electric-human powered
----------------------------------------------------
  warnings.warn(warning)
/home/nfranz/research/hepfile/src/hepfile/write.py:266: UserWarning: ----------------------------------------------------
Slashes / are not allowed in dataset names
Replacing / with - in dataset name House/apartment/condo
The new name will be House-apartment-condo
----------------------------------------------------
  warnings.warn(warning)