Transform Data with Python

Categories: import, transform, pandas, polars, janitor
Author: Umair Durrani
Published: January 21, 2024

Note

This is the second post in the NGSIM data analysis series. Previous post:

After completing this post, you will be able to:
  • Read parquet files
  • Do a quick clean-up of column names
  • Convert integer time to datetime format
  • Sort data by columns
  • Create new columns
  • Remove undesired columns
  • Filter data

This post shows code in Python. You can see the same post written in R here.

In the previous post, we partitioned the Interstate 80 (I80) vehicle trajectories data by time and then saved it to disk.

We now use the parquet file in the period=first directory to learn how to transform data. Since this file is part of the partitioned data, I saved it as a separate parquet file (i80_period1.parquet) in a different location.

You may also read the partitioned parquet files directly, but note that R and Python load partitioned data slightly differently.
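For instance, polars can read several partitioned files at once with a glob pattern. A minimal sketch, assuming the partitioned files live under data/i80_partitioned/ (a hypothetical path):

import polars as pl

# Read every parquet file in the 'period=first' partition via a glob
# (the directory layout here is hypothetical)
data_first = pl.read_parquet("data/i80_partitioned/period=first/*.parquet")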

Now, we read the i80_period1.parquet file:

Load parquet file

Reading the parquet file requires the polars package:

import polars as pl

path_to_first_period_file = "data/i80_period1.parquet"
data_py = pl.read_parquet(path_to_first_period_file)

data_py.head()
shape: (5, 24)
Vehicle_ID Frame_ID Total_Frames Global_Time Local_X Local_Y Global_X Global_Y v_length v_Width v_Class v_Vel v_Acc Lane_ID O_Zone D_Zone Int_ID Section_ID Direction Movement Preceding Following Space_Headway Time_Headway
i32 i32 i32 i64 f64 f64 f64 f64 f64 f64 i32 f64 f64 i32 i32 i32 i32 i32 i32 i32 i32 i32 f64 f64
3027 8493 813 1113433984200 53.115 363.266 6.0428e6 2.1334e6 15.3 7.4 2 21.46 -8.14 5 null null null null null null 3014 3032 64.23 2.99
3214 9115 708 1113434046400 67.931 655.629 6.0428e6 2.1337e6 13.8 6.3 2 15.37 11.2 6 null null null null null null 3221 3229 31.71 2.06
3199 9329 575 1113434067800 17.026 1237.592 6.0427e6 2.1343e6 14.4 5.9 2 39.3 0.0 2 null null null null null null 3188 3206 72.36 1.84
3159 8919 572 1113434026800 16.541 306.905 6.0428e6 2.1334e6 16.4 5.9 2 14.71 3.61 2 null null null null null null 3152 3171 51.9 3.53
3314 9324 616 1113434067300 28.846 65.807 6.0429e6 2.1331e6 14.8 6.4 2 36.24 0.0 3 null null null null null null 3301 0 103.26 2.85

Clean dataframe names

As you can see above, the column names are already in good shape, i.e., without any spaces. However, it is easier to type all-lowercase names than mixed-case ones. So, we use the clean_names function from the janitor package to make all column names lowercase. If the column names contained spaces or periods, clean_names would replace them with underscores.
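To see what clean_names does to messier names, here is a toy example (the column names are made up for illustration):

import pandas as pd
import janitor  # registers .clean_names() on pandas dataframes

toy = pd.DataFrame({"Vehicle ID": [3027], "Local Y": [363.266]})
print(toy.clean_names().columns.tolist())
# ['vehicle_id', 'local_y']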

Using pyjanitor requires the data to be a pandas dataframe. But we loaded the data from a parquet file using the polars package, so we first need to convert the polars dataframe to a pandas dataframe. This conversion additionally requires the pyarrow package:

# Installation:
# pip install pyjanitor
# pip install pyarrow
# pip install pandas

import pandas as pd
import pyarrow
import janitor
data_py = data_py.to_pandas()
data_py = data_py.clean_names()

data_py.head()
vehicle_id frame_id total_frames global_time local_x local_y global_x global_y v_length v_width ... o_zone d_zone int_id section_id direction movement preceding following space_headway time_headway
0 3027 8493 813 1113433984200 53.115 363.266 6042839.372 2133434.927 15.3 7.4 ... NaN NaN NaN NaN NaN NaN 3014 3032 64.23 2.99
1 3214 9115 708 1113434046400 67.931 655.629 6042817.933 2133726.884 13.8 6.3 ... NaN NaN NaN NaN NaN NaN 3221 3229 31.71 2.06
2 3199 9329 575 1113434067800 17.026 1237.592 6042683.444 2134296.961 14.4 5.9 ... NaN NaN NaN NaN NaN NaN 3188 3206 72.36 1.84
3 3159 8919 572 1113434026800 16.541 306.905 6042810.096 2133374.273 16.4 5.9 ... NaN NaN NaN NaN NaN NaN 3152 3171 51.90 3.53
4 3314 9324 616 1113434067300 28.846 65.807 6042851.778 2133136.617 14.8 6.4 ... NaN NaN NaN NaN NaN NaN 3301 0 103.26 2.85

5 rows × 24 columns

How to create a time column?

Since vehicle trajectories change over time, it is useful to see these changes over different time periods. However, the global_time column in this dataset contains integers (milliseconds since the Unix epoch) rather than human-readable timestamps. So, we create a new column called actual_time by dividing global_time by 1000 (to get seconds) and converting it to a datetime object.

In Python, this can be done via pandas.to_datetime. Note that we specify the time zone to be America/Los_Angeles.

data_py['actual_time'] = pd.to_datetime(data_py['global_time'] / 1000,
                                        unit='s', origin='1970-01-01', utc=True)
data_py['actual_time'] = data_py['actual_time'].dt.tz_convert('America/Los_Angeles')

data_py.head()
vehicle_id frame_id total_frames global_time local_x local_y global_x global_y v_length v_width ... d_zone int_id section_id direction movement preceding following space_headway time_headway actual_time
0 3027 8493 813 1113433984200 53.115 363.266 6042839.372 2133434.927 15.3 7.4 ... NaN NaN NaN NaN NaN 3014 3032 64.23 2.99 2005-04-13 16:13:04.200000048-07:00
1 3214 9115 708 1113434046400 67.931 655.629 6042817.933 2133726.884 13.8 6.3 ... NaN NaN NaN NaN NaN 3221 3229 31.71 2.06 2005-04-13 16:14:06.400000095-07:00
2 3199 9329 575 1113434067800 17.026 1237.592 6042683.444 2134296.961 14.4 5.9 ... NaN NaN NaN NaN NaN 3188 3206 72.36 1.84 2005-04-13 16:14:27.799999952-07:00
3 3159 8919 572 1113434026800 16.541 306.905 6042810.096 2133374.273 16.4 5.9 ... NaN NaN NaN NaN NaN 3152 3171 51.90 3.53 2005-04-13 16:13:46.799999952-07:00
4 3314 9324 616 1113434067300 28.846 65.807 6042851.778 2133136.617 14.8 6.4 ... NaN NaN NaN NaN NaN 3301 0 103.26 2.85 2005-04-13 16:14:27.299999952-07:00

5 rows × 25 columns
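Note the stray fractions such as .099999905 in the timestamps above: they come from representing milliseconds as floating-point seconds. A variant that avoids this rounding (a sketch, not the method used above) passes the integer milliseconds directly with unit='ms':

# Alternative: pass integer milliseconds directly to avoid float rounding
data_py['actual_time'] = (
    pd.to_datetime(data_py['global_time'], unit='ms', utc=True)
      .dt.tz_convert('America/Los_Angeles')
)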

Note that the data is not in the correct order. The vehicle_ids in the first two rows are 3027 and 3214. However, we know that each vehicle was observed for several seconds, so a given vehicle_id should be repeated over multiple consecutive rows. We therefore sort by vehicle_id and frame_id:

## First: Sort by Vehicle ID and Time
data_py = data_py.sort_values(by = ["vehicle_id", "frame_id"])

data_py.head()
vehicle_id frame_id total_frames global_time local_x local_y global_x global_y v_length v_width ... d_zone int_id section_id direction movement preceding following space_headway time_headway actual_time
1142997 1 12 884 1113433136100 16.884 48.213 6042842.116 2133117.662 14.3 6.4 ... NaN NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.099999905-07:00
1043434 1 13 884 1113433136200 16.938 49.463 6042842.012 2133118.909 14.3 6.4 ... NaN NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.200000048-07:00
272197 1 14 884 1113433136300 16.991 50.712 6042841.908 2133120.155 14.3 6.4 ... NaN NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.299999952-07:00
771912 1 15 884 1113433136400 17.045 51.963 6042841.805 2133121.402 14.3 6.4 ... NaN NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.400000095-07:00
1036275 1 16 884 1113433136500 17.098 53.213 6042841.701 2133122.649 14.3 6.4 ... NaN NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.500000-07:00

5 rows × 25 columns

When we want to compare several vehicle trajectories, e.g., how their speeds change over time regardless of when they were observed, we'd want a common time scale. The NGSIM documentation states that vehicles were observed at a resolution of 0.1 seconds. So, we can create a time variable for each vehicle that starts at 0 and ends at (n-1)/10, where n is the number of rows for which a vehicle_id is repeated.

We first define a function calculate_time_elapsed that takes a dataframe and returns it with a new column time containing the sequence from 0 to (n-1)/10 in steps of 0.1. Then we apply this function to each vehicle's rows in the pandas dataframe:

def calculate_time_elapsed(group_df):
    num_rows = len(group_df)
    group_df['time'] = [i / 10.0 for i in range(num_rows)]
    return group_df

# Add the time elapsed column to the DataFrame within each group
data_py = data_py.groupby('vehicle_id').apply(calculate_time_elapsed)

data_py.head()
vehicle_id frame_id total_frames global_time local_x local_y global_x global_y v_length v_width ... int_id section_id direction movement preceding following space_headway time_headway actual_time time
vehicle_id
1 1142997 1 12 884 1113433136100 16.884 48.213 6042842.116 2133117.662 14.3 6.4 ... NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.099999905-07:00 0.0
1043434 1 13 884 1113433136200 16.938 49.463 6042842.012 2133118.909 14.3 6.4 ... NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.200000048-07:00 0.1
272197 1 14 884 1113433136300 16.991 50.712 6042841.908 2133120.155 14.3 6.4 ... NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.299999952-07:00 0.2
771912 1 15 884 1113433136400 17.045 51.963 6042841.805 2133121.402 14.3 6.4 ... NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.400000095-07:00 0.3
1036275 1 16 884 1113433136500 17.098 53.213 6042841.701 2133122.649 14.3 6.4 ... NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.500000-07:00 0.4

5 rows × 26 columns
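As an aside, the same result can be obtained without apply by using groupby.cumcount, which numbers the rows from 0 to n-1 within each group. A vectorized sketch, assuming the dataframe is already sorted by vehicle_id and frame_id:

# cumcount() numbers rows 0..n-1 within each vehicle_id group;
# dividing by 10 gives the elapsed time in seconds
data_py['time'] = data_py.groupby('vehicle_id').cumcount() / 10.0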

How to create variables for the preceding vehicle?

We often need the velocity, acceleration, and other variables of the preceding vehicle (the vehicle in front of the subject vehicle). These variables are useful in car-following modeling.

In this dataset, the preceding vehicle's variables do not exist as separate columns. The only relevant column is preceding, which is the identifier of the preceding vehicle. But the data also contains the subject vehicle identifier vehicle_id, along with these variables:

  • local_y: longitudinal position of the front end of the subject vehicle (feet)
  • local_x: lateral position of the front end of the subject vehicle (feet)
  • v_length and v_width are the length and width of the subject vehicle (feet)
  • v_class is the class of the subject vehicle, i.e., 1 = motorcycle, 2 = car, and 3 = heavy vehicle (bus/truck)
  • v_vel and v_acc are the velocity (ft/s) and acceleration (ft/s/s) of the subject vehicle

Our goal now is to create new columns containing the above variables for the preceding vehicle. To this end, we look for the value of preceding in the vehicle_id column at a given value of frame_id (the identifier of the time frame) and then grab the value of a variable, e.g., v_vel, at that frame_id. In Python, we achieve this by joining a few columns of the dataset with the full dataset, using the columns vehicle_id and preceding as the join keys:

data_py = data_py.reset_index(drop=True)

## Create new cols for the preceding vehicle
data_py = data_py.merge(
    data_py.loc[:, ['frame_id', 'vehicle_id', 'local_x', 'local_y', 'v_length',
                    'v_width', 'v_class', 'v_vel', 'v_acc']],
    left_on=['frame_id', 'preceding'],
    right_on=['frame_id', 'vehicle_id'],
    how='left',
    suffixes=['', '_preceding']
)
data_py = data_py.drop(['vehicle_id_preceding'], axis='columns')

data_py.head()
Notes on the code above:
  1. merge is used for joining pandas dataframes.
  2. The right dataframe is the subset of data_py that includes the subject-vehicle variables we want to create for the preceding vehicle.
  3. data_py is the “left” dataframe and the selected columns form the “right” dataframe. At a given frame_id, local_x, local_y, v_length, v_width, v_class, v_vel, and v_acc are joined with data_py using the preceding and vehicle_id columns.
  4. The type of join is “left”.
  5. A suffix _preceding is added to show that these variables are for the preceding vehicle.
  6. The join creates a redundant preceding vehicle ID column, which we drop here.
vehicle_id frame_id total_frames global_time local_x local_y global_x global_y v_length v_width ... time_headway actual_time time local_x_preceding local_y_preceding v_length_preceding v_width_preceding v_class_preceding v_vel_preceding v_acc_preceding
0 1 12 884 1113433136100 16.884 48.213 6042842.116 2133117.662 14.3 6.4 ... 0.0 2005-04-13 15:58:56.099999905-07:00 0.0 NaN NaN NaN NaN NaN NaN NaN
1 1 13 884 1113433136200 16.938 49.463 6042842.012 2133118.909 14.3 6.4 ... 0.0 2005-04-13 15:58:56.200000048-07:00 0.1 NaN NaN NaN NaN NaN NaN NaN
2 1 14 884 1113433136300 16.991 50.712 6042841.908 2133120.155 14.3 6.4 ... 0.0 2005-04-13 15:58:56.299999952-07:00 0.2 NaN NaN NaN NaN NaN NaN NaN
3 1 15 884 1113433136400 17.045 51.963 6042841.805 2133121.402 14.3 6.4 ... 0.0 2005-04-13 15:58:56.400000095-07:00 0.3 NaN NaN NaN NaN NaN NaN NaN
4 1 16 884 1113433136500 17.098 53.213 6042841.701 2133122.649 14.3 6.4 ... 0.0 2005-04-13 15:58:56.500000-07:00 0.4 NaN NaN NaN NaN NaN NaN NaN

5 rows × 33 columns
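Incidentally, had we stayed in polars, the same left join could be written natively with DataFrame.join. A sketch, where data_pl stands for a hypothetical polars version of the dataframe:

# Native polars version of the same left join (data_pl is hypothetical)
preceding_cols = data_pl.select(
    ["frame_id", "vehicle_id", "local_x", "local_y", "v_length",
     "v_width", "v_class", "v_vel", "v_acc"]
)
data_pl = data_pl.join(
    preceding_cols,
    left_on=["frame_id", "preceding"],
    right_on=["frame_id", "vehicle_id"],
    how="left",
    suffix="_preceding",
)
# Unlike pandas, polars drops the right-hand join key automatically,
# so there is no redundant ID column to remove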

A NaN or null value indicates a missing value. In this dataset, NaN/null means that no preceding vehicle was observed. For vehicle_id 1 we can see this is true because the preceding value is 0.
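A quick sanity check of this claim (a sketch; it should print True as long as no vehicle has ID 0):

# Rows with no leader (preceding == 0) should have missing preceding values
no_leader = data_py['preceding'] == 0
print(data_py.loc[no_leader, 'v_vel_preceding'].isna().all())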

To keep the column names consistent with the result in the same post with R code, we rename the preceding vehicle variables:

data_py = pl.from_pandas(data_py)

data_py = data_py.rename({
    "local_y_preceding":"preceding_local_y", 
    "v_length_preceding":"preceding_length", 
    "v_width_preceding":"preceding_width", 
    "v_class_preceding":"preceding_class", 
    "v_vel_preceding":"preceding_vel", 
    "v_acc_preceding":"preceding_acc"
    })
    
data_py.columns
Notes on the code above:
  1. Convert the pandas dataframe to a polars dataframe.
  2. Use the rename method of the polars dataframe to rename the columns.
['vehicle_id',
 'frame_id',
 'total_frames',
 'global_time',
 'local_x',
 'local_y',
 'global_x',
 'global_y',
 'v_length',
 'v_width',
 'v_class',
 'v_vel',
 'v_acc',
 'lane_id',
 'o_zone',
 'd_zone',
 'int_id',
 'section_id',
 'direction',
 'movement',
 'preceding',
 'following',
 'space_headway',
 'time_headway',
 'actual_time',
 'time',
 'local_x_preceding',
 'preceding_local_y',
 'preceding_length',
 'preceding_width',
 'preceding_class',
 'preceding_vel',
 'preceding_acc']

How to remove undesired columns?

Several columns in this dataset are not needed: some are entirely devoid of values (all null), and others are not relevant to our analysis. So we remove them:
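You can verify which columns are entirely empty by inspecting the per-column null counts (a quick sketch):

# A column whose null count equals the row count carries no information
print(data_py.null_count())
print(data_py.height)  # total number of rows, for comparison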

data_py = data_py.drop(["total_frames", "global_x", "global_y", "following",
                        "o_zone", "d_zone", "int_id", "section_id",
                        "direction", "movement"])
             
data_py.head()
shape: (5, 23)
vehicle_id frame_id global_time local_x local_y v_length v_width v_class v_vel v_acc lane_id preceding space_headway time_headway actual_time time local_x_preceding preceding_local_y preceding_length preceding_width preceding_class preceding_vel preceding_acc
i32 i32 i64 f64 f64 f64 f64 i32 f64 f64 i32 i32 f64 f64 datetime[ns, America/Los_Angeles] f64 f64 f64 f64 f64 f64 f64 f64
1 12 1113433136100 16.884 48.213 14.3 6.4 2 12.5 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.099999905 PDT 0.0 null null null null null null null
1 13 1113433136200 16.938 49.463 14.3 6.4 2 12.5 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.200000048 PDT 0.1 null null null null null null null
1 14 1113433136300 16.991 50.712 14.3 6.4 2 12.5 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.299999952 PDT 0.2 null null null null null null null
1 15 1113433136400 17.045 51.963 14.3 6.4 2 12.5 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.400000095 PDT 0.3 null null null null null null null
1 16 1113433136500 17.098 53.213 14.3 6.4 2 12.5 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.500 PDT 0.4 null null null null null null null

How to transform multiple columns?

Metric units

As discussed before, the variables in this dataset use imperial units (feet, ft/s, etc.). You may want to transform these values to metric units; the feet-to-meters conversion factor is 0.3048 (e.g., local_y = 48.213 ft × 0.3048 ≈ 14.70 m, as seen in the output below). Here, we use the polars.with_columns function to take all the desired columns (cols_to_convert_to_metric), apply the conversion factor, and round to 2 decimal places:

## convert to metric
cols_to_convert_to_metric = ['local_x', 'local_y', 'v_length', 'v_width',
                             'v_vel', 'v_acc', 'space_headway',
                             'preceding_local_y', 'preceding_length',
                             'preceding_width', 'preceding_vel',
                             'preceding_acc']

data_py = data_py.with_columns((pl.col(cols_to_convert_to_metric) * .3048).round(2))

data_py.head()
shape: (5, 23)
vehicle_id frame_id global_time local_x local_y v_length v_width v_class v_vel v_acc lane_id preceding space_headway time_headway actual_time time local_x_preceding preceding_local_y preceding_length preceding_width preceding_class preceding_vel preceding_acc
i32 i32 i64 f64 f64 f64 f64 i32 f64 f64 i32 i32 f64 f64 datetime[ns, America/Los_Angeles] f64 f64 f64 f64 f64 f64 f64 f64
1 12 1113433136100 5.15 14.7 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.099999905 PDT 0.0 null null null null null null null
1 13 1113433136200 5.16 15.08 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.200000048 PDT 0.1 null null null null null null null
1 14 1113433136300 5.18 15.46 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.299999952 PDT 0.2 null null null null null null null
1 15 1113433136400 5.2 15.84 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.400000095 PDT 0.3 null null null null null null null
1 16 1113433136500 5.21 16.22 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.500 PDT 0.4 null null null null null null null

Convert numbers/strings to categorical data type

Moreover, we know that some variables should be treated as categorical (qualitative) rather than as numbers or strings. For instance, lane_id has values 1-7, and we know that these are identifiers for lanes. Similarly, the class of a vehicle is encoded as 1, 2, and 3, but these numbers do not carry any quantitative information; rather, they are categories.

In polars, categorical data is encoded as polars.Categorical data type:

## change the data type to categorical
cols_to_convert_to_categorical = ['vehicle_id', 'v_class', 'lane_id',
                                  'preceding', 'preceding_class']
data_py = data_py.with_columns(
    pl.col(cols_to_convert_to_categorical).cast(pl.String).cast(pl.Categorical)
)

data_py.head()
Note: the columns are first cast to the string data type and then to the categorical data type, since polars does not cast numeric columns directly to Categorical.
shape: (5, 23)
vehicle_id frame_id global_time local_x local_y v_length v_width v_class v_vel v_acc lane_id preceding space_headway time_headway actual_time time local_x_preceding preceding_local_y preceding_length preceding_width preceding_class preceding_vel preceding_acc
cat i32 i64 f64 f64 f64 f64 cat f64 f64 cat cat f64 f64 datetime[ns, America/Los_Angeles] f64 f64 f64 f64 f64 cat f64 f64
"1" 12 1113433136100 5.15 14.7 4.36 1.95 "2" 3.81 0.0 "2" "0" 0.0 0.0 2005-04-13 15:58:56.099999905 PDT 0.0 null null null null null null null
"1" 13 1113433136200 5.16 15.08 4.36 1.95 "2" 3.81 0.0 "2" "0" 0.0 0.0 2005-04-13 15:58:56.200000048 PDT 0.1 null null null null null null null
"1" 14 1113433136300 5.18 15.46 4.36 1.95 "2" 3.81 0.0 "2" "0" 0.0 0.0 2005-04-13 15:58:56.299999952 PDT 0.2 null null null null null null null
"1" 15 1113433136400 5.2 15.84 4.36 1.95 "2" 3.81 0.0 "2" "0" 0.0 0.0 2005-04-13 15:58:56.400000095 PDT 0.3 null null null null null null null
"1" 16 1113433136500 5.21 16.22 4.36 1.95 "2" 3.81 0.0 "2" "0" 0.0 0.0 2005-04-13 15:58:56.500 PDT 0.4 null null null null null null null

Visualization with one vehicle

Cool! We are almost done with transforming our dataset, so it is time for some visualization. The last transformation we cover is filtering the data to keep only one vehicle:

data_py_veh = data_py.filter(pl.col('vehicle_id') == "2")

And now we use lets-plot, a Python implementation of the ggplot2 grammar of graphics, to plot velocity over time: subject vehicle in blue, preceding vehicle in orange.

from lets_plot import *
LetsPlot.setup_html()

(
    ggplot(data=data_py_veh)
    + geom_path(mapping=aes(x='time', y='v_vel'), color='blue')
    + geom_path(mapping=aes(x='time', y='preceding_vel'), color='orange')
    + labs(x="Time (s)", y="Velocity (m/s)",
           title="Velocity of vehicle # 2 and its preceding vehicle")
    + theme_minimal()
)
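If you want to keep the figure, lets-plot can export it with ggsave. A minimal sketch (the plot object and file name are illustrative):

# Save a plot to an HTML file (the file name is made up)
p = ggplot(data=data_py_veh) + \
    geom_path(mapping=aes(x='time', y='v_vel'), color='blue')
ggsave(p, 'vehicle_2_velocity.html')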


Citation

BibTeX citation:
@online{durrani2024,
  author = {Durrani, Umair},
  title = {Transform {Data} with {Python}},
  date = {2024-01-21},
  langid = {en}
}
For attribution, please cite this work as:
Durrani, Umair. 2024. “Transform Data with Python.” January 21, 2024.