This post shows code in Python. You can see the same post written in Rhere.
In the previous post, we partitioned the Interstate 80 (I80) vehicle trajectories data by time and then saved it on disk:
We now make use of the parquet file in the period=first directory for learning to transform data. Since this file is part of partitioned data, I saved it as a separate parquet file (i80_period1.parquet) in a different location.
You may read the partitioned parquet file also, but R and Python load it slightly differently.
Now, we read the i80_period1.parquet file:
Load parquet file
Reading the parquet file requires the polars package:
import polars as plpath_to_first_period_file ="data/i80_period1.parquet"data_py = pl.read_parquet(path_to_first_period_file)data_py.head()
shape: (5, 24)
Vehicle_ID
Frame_ID
Total_Frames
Global_Time
Local_X
Local_Y
Global_X
Global_Y
v_length
v_Width
v_Class
v_Vel
v_Acc
Lane_ID
O_Zone
D_Zone
Int_ID
Section_ID
Direction
Movement
Preceding
Following
Space_Headway
Time_Headway
i32
i32
i32
i64
f64
f64
f64
f64
f64
f64
i32
f64
f64
i32
i32
i32
i32
i32
i32
i32
i32
i32
f64
f64
3027
8493
813
1113433984200
53.115
363.266
6.0428e6
2.1334e6
15.3
7.4
2
21.46
-8.14
5
null
null
null
null
null
null
3014
3032
64.23
2.99
3214
9115
708
1113434046400
67.931
655.629
6.0428e6
2.1337e6
13.8
6.3
2
15.37
11.2
6
null
null
null
null
null
null
3221
3229
31.71
2.06
3199
9329
575
1113434067800
17.026
1237.592
6.0427e6
2.1343e6
14.4
5.9
2
39.3
0.0
2
null
null
null
null
null
null
3188
3206
72.36
1.84
3159
8919
572
1113434026800
16.541
306.905
6.0428e6
2.1334e6
16.4
5.9
2
14.71
3.61
2
null
null
null
null
null
null
3152
3171
51.9
3.53
3314
9324
616
1113434067300
28.846
65.807
6.0429e6
2.1331e6
14.8
6.4
2
36.24
0.0
3
null
null
null
null
null
null
3301
0
103.26
2.85
Clean dataframe names
As you can see above, the column names are in good shape i.e., without any spaces. However, it is easier typing lowercase letters than the sentence case. So, we use the clean_names function from the janitor package to make all column names lowercase. If the column names have spaces or periods in them, clean_names would replace them with underscores.
Using pyjanitor requires the data to be a pandas dataframe. But we loaded the data from a parquet file using the polars package, so we need to first convert the polars dataframe to a pandas dataframe. This further requires importing the pyarrow package:
Since vehicle trajectories change over time, it is nice to see these changes over different time periods. However, the gloabl_time column in this dataset contains integers rather than the actual time. So, we create a new column called actual_time by dividing the global_time by 1000 and converting it to a datetime object.
In Python, this can be done via pandas.to_datetime. Note that we specify the time zone to be America/Los_Angeles.
Note that the data is not in the correct order. The vehicle_ids in the first two rows are 3027 and 3214. However, we know that the same vehicle was observed for several seconds. This means that we should see a given vehicle_id repeated over multiple rows consecutively. We therefore sort by vehicle_id and frame_id:
## First: Sort by Vehicle ID and Timedata_py = data_py.sort_values(by = ["vehicle_id", "frame_id"])data_py.head()
vehicle_id
frame_id
total_frames
global_time
local_x
local_y
global_x
global_y
v_length
v_width
...
d_zone
int_id
section_id
direction
movement
preceding
following
space_headway
time_headway
actual_time
1142997
1
12
884
1113433136100
16.884
48.213
6042842.116
2133117.662
14.3
6.4
...
NaN
NaN
NaN
NaN
NaN
0
0
0.0
0.0
2005-04-13 15:58:56.099999905-07:00
1043434
1
13
884
1113433136200
16.938
49.463
6042842.012
2133118.909
14.3
6.4
...
NaN
NaN
NaN
NaN
NaN
0
0
0.0
0.0
2005-04-13 15:58:56.200000048-07:00
272197
1
14
884
1113433136300
16.991
50.712
6042841.908
2133120.155
14.3
6.4
...
NaN
NaN
NaN
NaN
NaN
0
0
0.0
0.0
2005-04-13 15:58:56.299999952-07:00
771912
1
15
884
1113433136400
17.045
51.963
6042841.805
2133121.402
14.3
6.4
...
NaN
NaN
NaN
NaN
NaN
0
0
0.0
0.0
2005-04-13 15:58:56.400000095-07:00
1036275
1
16
884
1113433136500
17.098
53.213
6042841.701
2133122.649
14.3
6.4
...
NaN
NaN
NaN
NaN
NaN
0
0
0.0
0.0
2005-04-13 15:58:56.500000-07:00
5 rows × 25 columns
When we want to compare several vehicle trajectories, e.g., how their speeds change over time regardless of when they were observed, we’d want a common time scale. The NGSIM documentation describes that vehicles were observed at a resolution of 0.1 seconds. So, we can create atime variable for each vehicle that starts at 0 and ends at (n-1)/10 where n = number of rows for which a vehicle_id is repeated.
We first define a function calculate_time_elapsed that takes in a dataframe and returns the sequence from 0 to (n-1)/10 with a step size of 0.1 as a new column. Then we apply this function on the pandas dataframe:
def calculate_time_elapsed(group_df): num_rows =len(group_df) group_df['time'] = [i /10.0for i inrange(num_rows)]return group_df# Add the time elapsed column to the DataFrame within each groupdata_py = data_py.groupby('vehicle_id').apply(calculate_time_elapsed)data_py.head()
vehicle_id
frame_id
total_frames
global_time
local_x
local_y
global_x
global_y
v_length
v_width
...
int_id
section_id
direction
movement
preceding
following
space_headway
time_headway
actual_time
time
vehicle_id
1
1142997
1
12
884
1113433136100
16.884
48.213
6042842.116
2133117.662
14.3
6.4
...
NaN
NaN
NaN
NaN
0
0
0.0
0.0
2005-04-13 15:58:56.099999905-07:00
0.0
1043434
1
13
884
1113433136200
16.938
49.463
6042842.012
2133118.909
14.3
6.4
...
NaN
NaN
NaN
NaN
0
0
0.0
0.0
2005-04-13 15:58:56.200000048-07:00
0.1
272197
1
14
884
1113433136300
16.991
50.712
6042841.908
2133120.155
14.3
6.4
...
NaN
NaN
NaN
NaN
0
0
0.0
0.0
2005-04-13 15:58:56.299999952-07:00
0.2
771912
1
15
884
1113433136400
17.045
51.963
6042841.805
2133121.402
14.3
6.4
...
NaN
NaN
NaN
NaN
0
0
0.0
0.0
2005-04-13 15:58:56.400000095-07:00
0.3
1036275
1
16
884
1113433136500
17.098
53.213
6042841.701
2133122.649
14.3
6.4
...
NaN
NaN
NaN
NaN
0
0
0.0
0.0
2005-04-13 15:58:56.500000-07:00
0.4
5 rows × 26 columns
How to create variables for the preceding vehicle?
We’d often need velocity, acceleration, and other variables for the preceding vehicle (vehicle in front of the subject vehicle). These variables can be useful in car-following modeling.
In this dataset, preceding vehicle variables do not exist as separate columns. The only relevant column is preceding which is the identifier of the preceding vehicle. But the data also contains the subject vehicle identifier vehicle_id, along with these variables:
local_y: longitudinal position of the front end of the subject vehicle (feet)
local_x: lateral position of the front end of the subject vehicle (feet)
v_length and v_width are the length and width of the subject vehicle (feet)
v_class is the class of the subject vehicle, i.e., 1 = motorcycle, 2 = car, and 3 = heavy vehicle (bus/truck)
v_vel and v_acc are the velocity (ft/s) and acceleration (ft/s/s) of the subject vehicle
Our goal now is to create new columns of the above variables for the preceding vehicle. To this end, we look for the value of preceding in the vehicle_id column at a given value of frame_id (identifier of time frame) and then grab the value of variable e.g., v_vel at that frame_id. In Python, we achieve this by joining a few columns of the dataset with the full dataset while using the columns vehicle_id and preceding for the join:
This is part of the data_py dataframe. It includes those variables of the subject vehicle that we want to create for the preceding vehicle.
3
data_py is the “left” dataframe and data_py.loc[:, ['frame_id', 'vehicle_id', 'local_x', 'local_y', 'v_length', 'v_width', 'v_class', 'v_vel', 'v_acc']] is the “right” dataframe. At a given frame_id, ‘local_x’, ‘local_y’, ‘v_length’, ‘v_width’, ‘v_class’, ‘v_vel’, ‘v_acc’ are joined with data_py using the preceding and vehicle_id columns.
4
The type of join is “left”.
5
A suffix _preceding is added to show that these variables are for the preceding vehicle.
6
This operation created a redundant preceding vehicle ID that we drop here.
vehicle_id
frame_id
total_frames
global_time
local_x
local_y
global_x
global_y
v_length
v_width
...
time_headway
actual_time
time
local_x_preceding
local_y_preceding
v_length_preceding
v_width_preceding
v_class_preceding
v_vel_preceding
v_acc_preceding
0
1
12
884
1113433136100
16.884
48.213
6042842.116
2133117.662
14.3
6.4
...
0.0
2005-04-13 15:58:56.099999905-07:00
0.0
NaN
NaN
NaN
NaN
NaN
NaN
NaN
1
1
13
884
1113433136200
16.938
49.463
6042842.012
2133118.909
14.3
6.4
...
0.0
2005-04-13 15:58:56.200000048-07:00
0.1
NaN
NaN
NaN
NaN
NaN
NaN
NaN
2
1
14
884
1113433136300
16.991
50.712
6042841.908
2133120.155
14.3
6.4
...
0.0
2005-04-13 15:58:56.299999952-07:00
0.2
NaN
NaN
NaN
NaN
NaN
NaN
NaN
3
1
15
884
1113433136400
17.045
51.963
6042841.805
2133121.402
14.3
6.4
...
0.0
2005-04-13 15:58:56.400000095-07:00
0.3
NaN
NaN
NaN
NaN
NaN
NaN
NaN
4
1
16
884
1113433136500
17.098
53.213
6042841.701
2133122.649
14.3
6.4
...
0.0
2005-04-13 15:58:56.500000-07:00
0.4
NaN
NaN
NaN
NaN
NaN
NaN
NaN
5 rows × 33 columns
A NaN or null value indicates missing value. In this dataset, NaN / null indicates that the value is missing because there was no preceding vehicle observed. For vehicle_id 1 we can see this is true because the preceding value is 0.
To keep the column names consistent with the result in the same post with R code, we rename the preceding vehicle variables:
As discussed before, variables in this dataset have imperial units (feet, ft/s, etc.). You may want to transform the values of these variables to metric units. The conversion factor is 0.3048. Here, we utilize the polars.with_columns function to take all the desired columns (cols_to_convert_to_metric) and apply the conversion factor along with rounding to 2 decimal places:
Moreover, we know that there are variables that should be treated as categorical (qualitative) rather than numbers or strings. For instance, lane_id has values 1-7 and we know that these are identifiers for lanes. Similarly, the class of a vehicle is encoded as 1, 2, and 3 but we know that these numbers do not have any quantitaive information, rather they are categories.
In polars, categorical data is encoded as polars.Categorical data type:
## change the data type to categoricalcols_to_convert_to_categorical = ['vehicle_id', 'v_class', 'lane_id', 'preceding', 'preceding_class']data_py = data_py.with_columns(pl.col(cols_to_convert_to_categorical).cast(pl.String).cast(pl.Categorical))data_py.head()
1
Columns are first converted to string data type and then the strings are converted to categorical data type.
shape: (5, 23)
vehicle_id
frame_id
global_time
local_x
local_y
v_length
v_width
v_class
v_vel
v_acc
lane_id
preceding
space_headway
time_headway
actual_time
time
local_x_preceding
preceding_local_y
preceding_length
preceding_width
preceding_class
preceding_vel
preceding_acc
cat
i32
i64
f64
f64
f64
f64
cat
f64
f64
cat
cat
f64
f64
datetime[ns, America/Los_Angeles]
f64
f64
f64
f64
f64
cat
f64
f64
"1"
12
1113433136100
5.15
14.7
4.36
1.95
"2"
3.81
0.0
"2"
"0"
0.0
0.0
2005-04-13 15:58:56.099999905 PDT
0.0
null
null
null
null
null
null
null
"1"
13
1113433136200
5.16
15.08
4.36
1.95
"2"
3.81
0.0
"2"
"0"
0.0
0.0
2005-04-13 15:58:56.200000048 PDT
0.1
null
null
null
null
null
null
null
"1"
14
1113433136300
5.18
15.46
4.36
1.95
"2"
3.81
0.0
"2"
"0"
0.0
0.0
2005-04-13 15:58:56.299999952 PDT
0.2
null
null
null
null
null
null
null
"1"
15
1113433136400
5.2
15.84
4.36
1.95
"2"
3.81
0.0
"2"
"0"
0.0
0.0
2005-04-13 15:58:56.400000095 PDT
0.3
null
null
null
null
null
null
null
"1"
16
1113433136500
5.21
16.22
4.36
1.95
"2"
3.81
0.0
"2"
"0"
0.0
0.0
2005-04-13 15:58:56.500 PDT
0.4
null
null
null
null
null
null
null
Visualization with one vehicle
Cool! We are almost done with transforming our dataset. It is time to do some visualization. The last transformation we learn is to filter the data to keep only one vehicle:
And now we use ggplot2 to create a plot of velocity over time. Subject vehicle in blue and preceding vehicle in orange.
from lets_plot import*LetsPlot.setup_html()(ggplot(data = data_py_veh) +\ geom_path(mapping = aes(x ='time', y ='v_vel'), color ='blue') +\ geom_path(mapping = aes(x ='time', y ='preceding_vel'), color ='orange') +\ labs(x ="Time (s)", y ="Velocity (m/s)", title ="Velocity of vehicle # 2 and its preceding vehicle") +\ theme_minimal())