# Real Example: Housing Data

This example uses housing data to demonstrate the `hepfile.csv_tools` module.

In [1]:
import hepfile as hf
import pandas as pd

Before moving on with the tutorial, download the following datasets with the `wget` commands below. This only needs to be done once.

For context, also review the overview of this use case on the hepfile Read the Docs page: https://hepfile.readthedocs.io/en/latest/introduction.html#overview-of-use-case

In [2]:
!wget -nc -O 'People.csv' 'https://raw.githubusercontent.com/mattbellis/hepfile/main/docs/example_nb/People.csv'
!wget -nc -O 'Vehicles.csv' 'https://raw.githubusercontent.com/mattbellis/hepfile/main/docs/example_nb/Vehicles.csv'
!wget -nc -O 'Residences.csv' 'https://raw.githubusercontent.com/mattbellis/hepfile/main/docs/example_nb/Residences.csv'

File ‘People.csv’ already there; not retrieving.
File ‘Vehicles.csv’ already there; not retrieving.
File ‘Residences.csv’ already there; not retrieving.


The next step is to define a list of these file paths.

In [3]:
filepaths = ['People.csv', 'Vehicles.csv', 'Residences.csv']

For the sake of completeness, let's take a look at these datasets.

In [4]:
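# pandas was already imported as pd in the first cell.
# DataFrame.to_markdown() requires the optional `tabulate` package.
for f in filepaths:
    print(f + ':\n')
    print(pd.read_csv(f).to_markdown())
    print()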

People.csv:

|    |   Household ID | First name   | Last name   | Gender ID   |   Age |   Height |   Yearly income | Highest degree/grade   |
|---:|---------------:|:-------------|:------------|:------------|------:|---------:|----------------:|:-----------------------|
|  0 |              0 | blah         | blah        | M           |    54 |      159 |           75000 | BS                     |
|  1 |              0 | blah         | blah        | F           |    52 |      140 |           80000 | MS                     |
|  2 |              0 | blah         | blah        | NB          |    18 |      168 |               0 | 12                     |
|  3 |              0 | blah         | blah        | F           |    14 |      150 |               0 | 9                      |
|  4 |              1 | blah         | blah        | M           |    32 |      159 |           49000 | BS                     |
|  5 |              1 | blah         | blah        | M           |    27 |      140 

There are many different columns in these three CSV files, but they are all connected by the common key `Household ID`. This is similar to a database structure, where each table has a different length but the tables are linked by a common ID. That makes these files a good fit for storage in a hepfile!
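To make the idea of a common key concrete, here is a minimal sketch that groups rows from two small in-memory tables by `Household ID` using only the standard library. The rows are made-up toy data rather than the contents of the downloaded files, and `group_by_key` is a hypothetical helper, not part of hepfile.

```python
import csv
import io
from collections import defaultdict

# Toy stand-ins for two of the CSV files (made-up rows, not the real data).
PEOPLE_CSV = """Household ID,Age
0,54
0,52
1,32
"""
VEHICLES_CSV = """Household ID,Type
0,Car
1,Car
1,Bike
"""


def group_by_key(text, key):
    """Return {key value: [rows]} for CSV text (hypothetical helper)."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        groups[row[key]].append(row)
    return dict(groups)


people = group_by_key(PEOPLE_CSV, "Household ID")
vehicles = group_by_key(VEHICLES_CSV, "Household ID")

# Each household owns a different number of rows in each table.
for household in sorted(people):
    n_people = len(people[household])
    n_vehicles = len(vehicles.get(household, []))
    print(f"Household {household}: {n_people} people, {n_vehicles} vehicles")
```

The per-household row counts differ from table to table, which is the ragged structure that a single flat table cannot hold without padding or duplication.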

If we want to write a hepfile directly, rather than first creating an Awkward Array of the data, we can use the `hepfile.csv_tools.csv_to_hepfile` function. It takes a list of CSV file paths and a common key to merge by.

In [5]:
outfilename, hepfile = hf.csv_tools.csv_to_hepfile(filepaths, common_key='Household ID', group_names=['People', 'Vehicles', 'Residences'])
print()
print('#########################################')
print(f'Output File Name: {outfilename}')


#########################################
Output File Name: People.h5


Slashes / are not allowed in dataset names
Replacing / with - in dataset name Highest degree/grade
The new name will be Highest degree-grade
----------------------------------------------------
Slashes / are not allowed in dataset names
Replacing / with - in dataset name Gas/electric/human powered
The new name will be Gas-electric-human powered
----------------------------------------------------
Slashes / are not allowed in dataset names
Replacing / with - in dataset name House/apartment/condo
The new name will be House-apartment-condo
----------------------------------------------------
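The renaming reported in these messages can be sketched in a few lines. HDF5 uses `/` as the path separator between groups and datasets, which is presumably why slashes are rejected in dataset names. The helper `sanitize_name` below is hypothetical and only mirrors the messages above; it is not hepfile's actual implementation.

```python
def sanitize_name(name: str) -> str:
    """Replace '/' with '-' (hypothetical helper mirroring the messages above)."""
    return name.replace("/", "-")


# Column names taken from the messages above, plus one that needs no change.
columns = [
    "Highest degree/grade",
    "Gas/electric/human powered",
    "House/apartment/condo",
    "Age",
]

renamed = {col: sanitize_name(col) for col in columns}

# Report only the columns whose names actually changed.
for old, new in renamed.items():
    if old != new:
        print(f"{old!r} -> {new!r}")
```

When reading the file back, the columns have to be looked up under the new names, for example `Highest degree-grade` rather than `Highest degree/grade`.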


Notice that the output file name is the name of the first CSV file with the `csv` extension replaced by `h5`. Sometimes this default is fine, but at other times you may want a more specific output file name. Use the `outfile` argument to set one.

In [6]:
outfilename, hepfile = hf.csv_tools.csv_to_hepfile(filepaths, common_key='Household ID', outfile='test.h5', group_names=['People', 'Vehicles', 'Residences'])
print()
print('#########################################')
print(f'Output File Name: {outfilename}')


#########################################
Output File Name: test.h5


Slashes / are not allowed in dataset names
Replacing / with - in dataset name Highest degree/grade
The new name will be Highest degree-grade
----------------------------------------------------
Slashes / are not allowed in dataset names
Replacing / with - in dataset name Gas/electric/human powered
The new name will be Gas-electric-human powered
----------------------------------------------------
Slashes / are not allowed in dataset names
Replacing / with - in dataset name House/apartment/condo
The new name will be House-apartment-condo
----------------------------------------------------
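The default naming seen in the first conversion, where `People.csv` produced `People.h5`, can be sketched with `pathlib`. This is an assumption based only on the output shown in this tutorial; `default_outfile` is a hypothetical helper, and hepfile may derive the name differently internally.

```python
from pathlib import Path


def default_outfile(filepaths):
    """Guess the default output name: first path with its suffix set to .h5.

    Hypothetical helper; assumes the behavior observed in this tutorial.
    """
    return str(Path(filepaths[0]).with_suffix(".h5"))


filepaths = ["People.csv", "Vehicles.csv", "Residences.csv"]
print(default_outfile(filepaths))
```

Because the default depends on the order of the list, passing `outfile` explicitly keeps the output name stable if the list is ever reordered.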
