Transform Data with Python

Categories: import, transform, pandas, polars, janitor
Author: Umair Durrani
Published: January 21, 2024

Note

This is the second post in the NGSIM data analysis series. Previous post:

After completing this post, you will be able to:
  • Read parquet files
  • Do a quick clean-up of column names
  • Convert integer time to datetime format
  • Sort data by columns
  • Create new columns
  • Remove undesired columns
  • Filter data

This post shows code in Python. You can see the same post written in R here.

In the previous post, we partitioned the Interstate 80 (I80) vehicle trajectories data by time and then saved it to disk.

We now use the parquet file in the period=first directory to learn how to transform data. Since this file is part of the partitioned data, I saved it as a separate parquet file (i80_period1.parquet) in a different location.

You may also read the partitioned parquet files directly, but note that R and Python load partitioned data slightly differently.
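For instance, polars can read several partitioned files at once with a glob pattern. A minimal sketch, assuming the partitioned files live under data/i80_partitioned/ (a hypothetical path):

import polars as pl

# Read every parquet file in the 'period=first' partition via a glob
# (the directory layout here is hypothetical)
data_first = pl.read_parquet("data/i80_partitioned/period=first/*.parquet")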

Now, we read the i80_period1.parquet file:

Load parquet file

Reading the parquet file requires the polars package:

import polars as pl

path_to_first_period_file = "data/i80_period1.parquet"
data_py = pl.read_parquet(path_to_first_period_file)

data_py.head()
shape: (5, 24)
Vehicle_ID Frame_ID Total_Frames Global_Time Local_X Local_Y Global_X Global_Y v_length v_Width v_Class v_Vel v_Acc Lane_ID O_Zone D_Zone Int_ID Section_ID Direction Movement Preceding Following Space_Headway Time_Headway
i32 i32 i32 i64 f64 f64 f64 f64 f64 f64 i32 f64 f64 i32 i32 i32 i32 i32 i32 i32 i32 i32 f64 f64
3027 8493 813 1113433984200 53.115 363.266 6.0428e6 2.1334e6 15.3 7.4 2 21.46 -8.14 5 null null null null null null 3014 3032 64.23 2.99
3214 9115 708 1113434046400 67.931 655.629 6.0428e6 2.1337e6 13.8 6.3 2 15.37 11.2 6 null null null null null null 3221 3229 31.71 2.06
3199 9329 575 1113434067800 17.026 1237.592 6.0427e6 2.1343e6 14.4 5.9 2 39.3 0.0 2 null null null null null null 3188 3206 72.36 1.84
3159 8919 572 1113434026800 16.541 306.905 6.0428e6 2.1334e6 16.4 5.9 2 14.71 3.61 2 null null null null null null 3152 3171 51.9 3.53
3314 9324 616 1113434067300 28.846 65.807 6.0429e6 2.1331e6 14.8 6.4 2 36.24 0.0 3 null null null null null null 3301 0 103.26 2.85

Clean dataframe names

As you can see above, the column names are already in good shape, i.e., without any spaces. However, it is easier to type all-lowercase names than mixed-case ones. So, we use the clean_names function from the janitor package to make all column names lowercase. If the column names contained spaces or periods, clean_names would replace them with underscores.
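To see what clean_names does to messier names, here is a toy example (the column names are made up for illustration):

import pandas as pd
import janitor  # registers .clean_names() on pandas dataframes

toy = pd.DataFrame({"Vehicle ID": [3027], "Local Y": [363.266]})
print(toy.clean_names().columns.tolist())
# ['vehicle_id', 'local_y']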

Using pyjanitor requires the data to be a pandas dataframe. But we loaded the data from a parquet file using the polars package, so we first need to convert the polars dataframe to a pandas dataframe. This conversion additionally requires the pyarrow package:

# Installation:
# pip install pyjanitor
# pip install pyarrow
# pip install pandas

import pandas as pd
import pyarrow
import janitor
data_py = data_py.to_pandas()
data_py = data_py.clean_names()

data_py.head()
vehicle_id frame_id total_frames global_time local_x local_y global_x global_y v_length v_width ... o_zone d_zone int_id section_id direction movement preceding following space_headway time_headway
0 3027 8493 813 1113433984200 53.115 363.266 6042839.372 2133434.927 15.3 7.4 ... NaN NaN NaN NaN NaN NaN 3014 3032 64.23 2.99
1 3214 9115 708 1113434046400 67.931 655.629 6042817.933 2133726.884 13.8 6.3 ... NaN NaN NaN NaN NaN NaN 3221 3229 31.71 2.06
2 3199 9329 575 1113434067800 17.026 1237.592 6042683.444 2134296.961 14.4 5.9 ... NaN NaN NaN NaN NaN NaN 3188 3206 72.36 1.84
3 3159 8919 572 1113434026800 16.541 306.905 6042810.096 2133374.273 16.4 5.9 ... NaN NaN NaN NaN NaN NaN 3152 3171 51.90 3.53
4 3314 9324 616 1113434067300 28.846 65.807 6042851.778 2133136.617 14.8 6.4 ... NaN NaN NaN NaN NaN NaN 3301 0 103.26 2.85

5 rows × 24 columns

How to create a time column?

Since vehicle trajectories change over time, it is useful to see these changes over different time periods. However, the global_time column in this dataset contains integers (milliseconds since the Unix epoch) rather than human-readable timestamps. So, we create a new column called actual_time by dividing global_time by 1000 (to get seconds) and converting it to a datetime object.

In Python, this can be done via pandas.to_datetime. Note that we specify the time zone to be America/Los_Angeles.

data_py['actual_time'] = pd.to_datetime(data_py['global_time'] / 1000,
                                        unit='s', origin='1970-01-01', utc=True)
data_py['actual_time'] = data_py['actual_time'].dt.tz_convert('America/Los_Angeles')

data_py.head()
vehicle_id frame_id total_frames global_time local_x local_y global_x global_y v_length v_width ... d_zone int_id section_id direction movement preceding following space_headway time_headway actual_time
0 3027 8493 813 1113433984200 53.115 363.266 6042839.372 2133434.927 15.3 7.4 ... NaN NaN NaN NaN NaN 3014 3032 64.23 2.99 2005-04-13 16:13:04.200000048-07:00
1 3214 9115 708 1113434046400 67.931 655.629 6042817.933 2133726.884 13.8 6.3 ... NaN NaN NaN NaN NaN 3221 3229 31.71 2.06 2005-04-13 16:14:06.400000095-07:00
2 3199 9329 575 1113434067800 17.026 1237.592 6042683.444 2134296.961 14.4 5.9 ... NaN NaN NaN NaN NaN 3188 3206 72.36 1.84 2005-04-13 16:14:27.799999952-07:00
3 3159 8919 572 1113434026800 16.541 306.905 6042810.096 2133374.273 16.4 5.9 ... NaN NaN NaN NaN NaN 3152 3171 51.90 3.53 2005-04-13 16:13:46.799999952-07:00
4 3314 9324 616 1113434067300 28.846 65.807 6042851.778 2133136.617 14.8 6.4 ... NaN NaN NaN NaN NaN 3301 0 103.26 2.85 2005-04-13 16:14:27.299999952-07:00

5 rows × 25 columns
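Note the stray fractions such as .099999905 in the timestamps above: they come from representing milliseconds as floating-point seconds. A variant that avoids this rounding (a sketch, not the method used above) passes the integer milliseconds directly with unit='ms':

# Alternative: pass integer milliseconds directly to avoid float rounding
data_py['actual_time'] = (
    pd.to_datetime(data_py['global_time'], unit='ms', utc=True)
      .dt.tz_convert('America/Los_Angeles')
)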

Note that the data is not in the correct order. The vehicle_ids in the first two rows are 3027 and 3214. However, we know that each vehicle was observed for several seconds, so a given vehicle_id should be repeated over multiple consecutive rows. We therefore sort by vehicle_id and frame_id:

## First: Sort by Vehicle ID and Time
data_py = data_py.sort_values(by = ["vehicle_id", "frame_id"])

data_py.head()
vehicle_id frame_id total_frames global_time local_x local_y global_x global_y v_length v_width ... d_zone int_id section_id direction movement preceding following space_headway time_headway actual_time
1142997 1 12 884 1113433136100 16.884 48.213 6042842.116 2133117.662 14.3 6.4 ... NaN NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.099999905-07:00
1043434 1 13 884 1113433136200 16.938 49.463 6042842.012 2133118.909 14.3 6.4 ... NaN NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.200000048-07:00
272197 1 14 884 1113433136300 16.991 50.712 6042841.908 2133120.155 14.3 6.4 ... NaN NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.299999952-07:00
771912 1 15 884 1113433136400 17.045 51.963 6042841.805 2133121.402 14.3 6.4 ... NaN NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.400000095-07:00
1036275 1 16 884 1113433136500 17.098 53.213 6042841.701 2133122.649 14.3 6.4 ... NaN NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.500000-07:00

5 rows × 25 columns

When we want to compare several vehicle trajectories, e.g., how their speeds change over time regardless of when they were observed, we'd want a common time scale. The NGSIM documentation states that vehicles were observed at a resolution of 0.1 seconds. So, we can create a time variable for each vehicle that starts at 0 and ends at (n-1)/10, where n is the number of rows for which a vehicle_id is repeated.

We first define a function calculate_time_elapsed that takes a dataframe and returns it with a new column time containing the sequence from 0 to (n-1)/10 in steps of 0.1. Then we apply this function to each vehicle's rows in the pandas dataframe:

def calculate_time_elapsed(group_df):
    num_rows = len(group_df)
    group_df['time'] = [i / 10.0 for i in range(num_rows)]
    return group_df

# Add the time elapsed column to the DataFrame within each group
data_py = data_py.groupby('vehicle_id').apply(calculate_time_elapsed)

data_py.head()
vehicle_id frame_id total_frames global_time local_x local_y global_x global_y v_length v_width ... int_id section_id direction movement preceding following space_headway time_headway actual_time time
vehicle_id
1 1142997 1 12 884 1113433136100 16.884 48.213 6042842.116 2133117.662 14.3 6.4 ... NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.099999905-07:00 0.0
1043434 1 13 884 1113433136200 16.938 49.463 6042842.012 2133118.909 14.3 6.4 ... NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.200000048-07:00 0.1
272197 1 14 884 1113433136300 16.991 50.712 6042841.908 2133120.155 14.3 6.4 ... NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.299999952-07:00 0.2
771912 1 15 884 1113433136400 17.045 51.963 6042841.805 2133121.402 14.3 6.4 ... NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.400000095-07:00 0.3
1036275 1 16 884 1113433136500 17.098 53.213 6042841.701 2133122.649 14.3 6.4 ... NaN NaN NaN NaN 0 0 0.0 0.0 2005-04-13 15:58:56.500000-07:00 0.4

5 rows × 26 columns
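As an aside, the same result can be obtained without apply by using groupby.cumcount, which numbers the rows from 0 to n-1 within each group. A vectorized sketch, assuming the dataframe is already sorted by vehicle_id and frame_id:

# cumcount() numbers rows 0..n-1 within each vehicle_id group;
# dividing by 10 gives the elapsed time in seconds
data_py['time'] = data_py.groupby('vehicle_id').cumcount() / 10.0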

How to create variables for the preceding vehicle?

We often need the velocity, acceleration, and other variables of the preceding vehicle (the vehicle in front of the subject vehicle). These variables are useful in car-following modeling.

In this dataset, the preceding vehicle's variables do not exist as separate columns. The only relevant column is preceding, which is the identifier of the preceding vehicle. But the data also contains the subject vehicle identifier vehicle_id, along with these variables:

  • local_y: longitudinal position of the front end of the subject vehicle (feet)
  • local_x: lateral position of the front end of the subject vehicle (feet)
  • v_length and v_width are the length and width of the subject vehicle (feet)
  • v_class is the class of the subject vehicle, i.e., 1 = motorcycle, 2 = car, and 3 = heavy vehicle (bus/truck)
  • v_vel and v_acc are the velocity (ft/s) and acceleration (ft/s/s) of the subject vehicle

Our goal now is to create new columns containing the above variables for the preceding vehicle. To this end, we look for the value of preceding in the vehicle_id column at a given value of frame_id (the identifier of the time frame) and then grab the value of a variable, e.g., v_vel, at that frame_id. In Python, we achieve this by joining a few columns of the dataset with the full dataset, using the columns vehicle_id and preceding as the join keys:

data_py = data_py.reset_index(drop=True)

## Create new cols for the preceding vehicle
data_py = data_py.merge(
    data_py.loc[:, ['frame_id', 'vehicle_id', 'local_x', 'local_y', 'v_length',
                    'v_width', 'v_class', 'v_vel', 'v_acc']],
    left_on=['frame_id', 'preceding'],
    right_on=['frame_id', 'vehicle_id'],
    how='left',
    suffixes=['', '_preceding']
)
data_py = data_py.drop(['vehicle_id_preceding'], axis='columns')

data_py.head()
Notes on the code above:
  1. merge is used for joining pandas dataframes.
  2. The right dataframe is the subset of data_py that includes the subject-vehicle variables we want to create for the preceding vehicle.
  3. data_py is the “left” dataframe and the selected columns form the “right” dataframe. At a given frame_id, local_x, local_y, v_length, v_width, v_class, v_vel, and v_acc are joined with data_py using the preceding and vehicle_id columns.
  4. The type of join is “left”.
  5. A suffix _preceding is added to show that these variables are for the preceding vehicle.
  6. The join creates a redundant preceding vehicle ID column, which we drop here.
vehicle_id frame_id total_frames global_time local_x local_y global_x global_y v_length v_width ... time_headway actual_time time local_x_preceding local_y_preceding v_length_preceding v_width_preceding v_class_preceding v_vel_preceding v_acc_preceding
0 1 12 884 1113433136100 16.884 48.213 6042842.116 2133117.662 14.3 6.4 ... 0.0 2005-04-13 15:58:56.099999905-07:00 0.0 NaN NaN NaN NaN NaN NaN NaN
1 1 13 884 1113433136200 16.938 49.463 6042842.012 2133118.909 14.3 6.4 ... 0.0 2005-04-13 15:58:56.200000048-07:00 0.1 NaN NaN NaN NaN NaN NaN NaN
2 1 14 884 1113433136300 16.991 50.712 6042841.908 2133120.155 14.3 6.4 ... 0.0 2005-04-13 15:58:56.299999952-07:00 0.2 NaN NaN NaN NaN NaN NaN NaN
3 1 15 884 1113433136400 17.045 51.963 6042841.805 2133121.402 14.3 6.4 ... 0.0 2005-04-13 15:58:56.400000095-07:00 0.3 NaN NaN NaN NaN NaN NaN NaN
4 1 16 884 1113433136500 17.098 53.213 6042841.701 2133122.649 14.3 6.4 ... 0.0 2005-04-13 15:58:56.500000-07:00 0.4 NaN NaN NaN NaN NaN NaN NaN

5 rows × 33 columns
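Incidentally, had we stayed in polars, the same left join could be written natively with DataFrame.join. A sketch, where data_pl stands for a hypothetical polars version of the dataframe:

# Native polars version of the same left join (data_pl is hypothetical)
preceding_cols = data_pl.select(
    ["frame_id", "vehicle_id", "local_x", "local_y", "v_length",
     "v_width", "v_class", "v_vel", "v_acc"]
)
data_pl = data_pl.join(
    preceding_cols,
    left_on=["frame_id", "preceding"],
    right_on=["frame_id", "vehicle_id"],
    how="left",
    suffix="_preceding",
)
# Unlike pandas, polars drops the right-hand join key automatically,
# so there is no redundant ID column to remove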

A NaN or null value indicates a missing value. In this dataset, NaN/null means that no preceding vehicle was observed. For vehicle_id 1 we can see this is true because the preceding value is 0.
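A quick sanity check of this claim (a sketch; it should print True as long as no vehicle has ID 0):

# Rows with no leader (preceding == 0) should have missing preceding values
no_leader = data_py['preceding'] == 0
print(data_py.loc[no_leader, 'v_vel_preceding'].isna().all())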

To keep the column names consistent with the result in the same post with R code, we rename the preceding vehicle variables:

data_py = pl.from_pandas(data_py)

data_py = data_py.rename({
    "local_y_preceding":"preceding_local_y", 
    "v_length_preceding":"preceding_length", 
    "v_width_preceding":"preceding_width", 
    "v_class_preceding":"preceding_class", 
    "v_vel_preceding":"preceding_vel", 
    "v_acc_preceding":"preceding_acc"
    })
    
data_py.columns
Notes on the code above:
  1. Convert the pandas dataframe to a polars dataframe.
  2. Use the rename method of the polars dataframe to rename the columns.
['vehicle_id',
 'frame_id',
 'total_frames',
 'global_time',
 'local_x',
 'local_y',
 'global_x',
 'global_y',
 'v_length',
 'v_width',
 'v_class',
 'v_vel',
 'v_acc',
 'lane_id',
 'o_zone',
 'd_zone',
 'int_id',
 'section_id',
 'direction',
 'movement',
 'preceding',
 'following',
 'space_headway',
 'time_headway',
 'actual_time',
 'time',
 'local_x_preceding',
 'preceding_local_y',
 'preceding_length',
 'preceding_width',
 'preceding_class',
 'preceding_vel',
 'preceding_acc']

How to remove undesired columns?

Several columns in this dataset are not needed: some are entirely devoid of values (all null), and others are not relevant to our analysis. So we remove them:
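You can verify which columns are entirely empty by inspecting the per-column null counts (a quick sketch):

# A column whose null count equals the row count carries no information
print(data_py.null_count())
print(data_py.height)  # total number of rows, for comparison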

data_py = data_py.drop(["total_frames", "global_x", "global_y", "following",
                        "o_zone", "d_zone", "int_id", "section_id",
                        "direction", "movement"])
             
data_py.head()
shape: (5, 23)
vehicle_id frame_id global_time local_x local_y v_length v_width v_class v_vel v_acc lane_id preceding space_headway time_headway actual_time time local_x_preceding preceding_local_y preceding_length preceding_width preceding_class preceding_vel preceding_acc
i32 i32 i64 f64 f64 f64 f64 i32 f64 f64 i32 i32 f64 f64 datetime[ns, America/Los_Angeles] f64 f64 f64 f64 f64 f64 f64 f64
1 12 1113433136100 16.884 48.213 14.3 6.4 2 12.5 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.099999905 PDT 0.0 null null null null null null null
1 13 1113433136200 16.938 49.463 14.3 6.4 2 12.5 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.200000048 PDT 0.1 null null null null null null null
1 14 1113433136300 16.991 50.712 14.3 6.4 2 12.5 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.299999952 PDT 0.2 null null null null null null null
1 15 1113433136400 17.045 51.963 14.3 6.4 2 12.5 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.400000095 PDT 0.3 null null null null null null null
1 16 1113433136500 17.098 53.213 14.3 6.4 2 12.5 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.500 PDT 0.4 null null null null null null null

How to transform multiple columns?

Metric units

As discussed before, the variables in this dataset use imperial units (feet, ft/s, etc.). You may want to transform these values to metric units; the feet-to-meters conversion factor is 0.3048 (e.g., local_y = 48.213 ft × 0.3048 ≈ 14.70 m, as seen in the output below). Here, we use the polars.with_columns function to take all the desired columns (cols_to_convert_to_metric), apply the conversion factor, and round to 2 decimal places:

## convert to metric
cols_to_convert_to_metric = ['local_x', 'local_y', 'v_length', 'v_width',
                             'v_vel', 'v_acc', 'space_headway',
                             'preceding_local_y', 'preceding_length',
                             'preceding_width', 'preceding_vel',
                             'preceding_acc']

data_py = data_py.with_columns((pl.col(cols_to_convert_to_metric) * .3048).round(2))

data_py.head()
shape: (5, 23)
vehicle_id frame_id global_time local_x local_y v_length v_width v_class v_vel v_acc lane_id preceding space_headway time_headway actual_time time local_x_preceding preceding_local_y preceding_length preceding_width preceding_class preceding_vel preceding_acc
i32 i32 i64 f64 f64 f64 f64 i32 f64 f64 i32 i32 f64 f64 datetime[ns, America/Los_Angeles] f64 f64 f64 f64 f64 f64 f64 f64
1 12 1113433136100 5.15 14.7 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.099999905 PDT 0.0 null null null null null null null
1 13 1113433136200 5.16 15.08 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.200000048 PDT 0.1 null null null null null null null
1 14 1113433136300 5.18 15.46 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.299999952 PDT 0.2 null null null null null null null
1 15 1113433136400 5.2 15.84 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.400000095 PDT 0.3 null null null null null null null
1 16 1113433136500 5.21 16.22 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13 15:58:56.500 PDT 0.4 null null null null null null null

Convert numbers/strings to categorical data type

Moreover, we know that some variables should be treated as categorical (qualitative) rather than as numbers or strings. For instance, lane_id has values 1-7, and we know that these are identifiers for lanes. Similarly, the class of a vehicle is encoded as 1, 2, and 3, but these numbers do not carry any quantitative information; rather, they are categories.

In polars, categorical data is encoded as polars.Categorical data type:

## change the data type to categorical
cols_to_convert_to_categorical = ['vehicle_id', 'v_class', 'lane_id',
                                  'preceding', 'preceding_class']
data_py = data_py.with_columns(
    pl.col(cols_to_convert_to_categorical).cast(pl.String).cast(pl.Categorical)
)

data_py.head()
Note: the columns are first cast to the string data type and then to the categorical data type, since polars does not cast numeric columns directly to Categorical.
shape: (5, 23)
vehicle_id frame_id global_time local_x local_y v_length v_width v_class v_vel v_acc lane_id preceding space_headway time_headway actual_time time local_x_preceding preceding_local_y preceding_length preceding_width preceding_class preceding_vel preceding_acc
cat i32 i64 f64 f64 f64 f64 cat f64 f64 cat cat f64 f64 datetime[ns, America/Los_Angeles] f64 f64 f64 f64 f64 cat f64 f64
"1" 12 1113433136100 5.15 14.7 4.36 1.95 "2" 3.81 0.0 "2" "0" 0.0 0.0 2005-04-13 15:58:56.099999905 PDT 0.0 null null null null null null null
"1" 13 1113433136200 5.16 15.08 4.36 1.95 "2" 3.81 0.0 "2" "0" 0.0 0.0 2005-04-13 15:58:56.200000048 PDT 0.1 null null null null null null null
"1" 14 1113433136300 5.18 15.46 4.36 1.95 "2" 3.81 0.0 "2" "0" 0.0 0.0 2005-04-13 15:58:56.299999952 PDT 0.2 null null null null null null null
"1" 15 1113433136400 5.2 15.84 4.36 1.95 "2" 3.81 0.0 "2" "0" 0.0 0.0 2005-04-13 15:58:56.400000095 PDT 0.3 null null null null null null null
"1" 16 1113433136500 5.21 16.22 4.36 1.95 "2" 3.81 0.0 "2" "0" 0.0 0.0 2005-04-13 15:58:56.500 PDT 0.4 null null null null null null null

Visualization with one vehicle

Cool! We are almost done with transforming our dataset, so it is time for some visualization. The last transformation we cover is filtering the data to keep only one vehicle:

data_py_veh = data_py.filter(pl.col('vehicle_id') == "2")

And now we use lets-plot, a Python implementation of the ggplot2 grammar of graphics, to plot velocity over time: subject vehicle in blue, preceding vehicle in orange.

from lets_plot import *
LetsPlot.setup_html()

(
    ggplot(data=data_py_veh)
    + geom_path(mapping=aes(x='time', y='v_vel'), color='blue')
    + geom_path(mapping=aes(x='time', y='preceding_vel'), color='orange')
    + labs(x="Time (s)", y="Velocity (m/s)",
           title="Velocity of vehicle # 2 and its preceding vehicle")
    + theme_minimal()
)
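If you want to keep the figure, lets-plot can export it with ggsave. A minimal sketch (the plot object and file name are illustrative):

# Save a plot to an HTML file (the file name is made up)
p = ggplot(data=data_py_veh) + \
    geom_path(mapping=aes(x='time', y='v_vel'), color='blue')
ggsave(p, 'vehicle_2_velocity.html')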


Citation

BibTeX citation:
@online{durrani2024,
  author = {Durrani, Umair},
  title = {Transform {Data} with {Python}},
  date = {2024-01-21},
  langid = {en}
}
For attribution, please cite this work as:
Durrani, Umair. 2024. “Transform Data with Python.” January 21, 2024.