Transform Data with R

import
transform
dplyr
arrow
janitor
Author
Published

January 21, 2024

Note

This is the second post in the NGSIM data analysis series. Previous post:

After completing this post, you will be able to:
  • Read parquet files
  • Do a quick clean-up of column names
  • Convert integer time to datetime fromat
  • Sort data by columns
  • Create new columns
  • Remove undesired columns
  • Filter data
  • Create this little animation of a car following another car:

This post shows code in R. You can see the same post written in Python here.

In the previous post, we partitioned the Interstate 80 (I80) vehicle trajectories data by time and then saved it on disk:

We now make use of the parquet file in the period=first directory for learning to transform data. Since this file is part of partitioned data, I saved it as a separate parquet file in a different location as shown below:

You may read the partitioned parquet file also, but R and Python load it slightly differently.
library(arrow)
library(dplyr)

path_to_partitioned_first_period_file <- "D:/ngsim/posts/import/data/I80/period=first/part-0.parquet"

open_dataset(path_to_partitioned_first_period_file) |>
  dplyr::collect() |>
  arrow::write_parquet(sink = "data/i80_period1.parquet")
1
Open the part-0.parquet.
2
Load it into memory.
3
Save it as i80_period1.parquet under the data directory.

Now, we read the i80_period1.parquet file:

Load parquet file

Reading the parquet file requires the arrow package:

library(arrow)
library(dplyr)

path_to_first_period_file <- "data/i80_period1.parquet"
data_R <- arrow::read_parquet(file = path_to_first_period_file)

head(data_R)

Clean dataframe names

As you can see above, the column names are in good shape i.e., without any spaces. However, it is easier typing lowercase letters than the sentence case. So, we use the clean_names function from the janitor package to make all column names lowercase. If the column names have spaces or periods in them, clean_names would replace them with underscores.

library(janitor)
data_R <- janitor::clean_names(data_R) 

head(data_R)

How to create a time column?

Since vehicle trajectories change over time, it is nice to see these changes over different time periods. However, the gloabl_time column in this dataset contains integers rather than the actual time. So, we create a new column called actual_time by dividing the global_time by 1000 and converting it to a datetime object.

In R, this can be done via lubridate::as_datetime. Note that we specify the time zone to be America/Los_Angeles.

library(lubridate)

data_R <- data_R |>
  dplyr::mutate(actual_time = lubridate::as_datetime(global_time / 1000, 
    origin = "1970-01-01",
    tz = "America/Los_Angeles"
  ))

head(data_R)

Note that the data is not in the correct order. The vehicle_ids in the first two rows are 3027 and 3214. However, we know that the same vehicle was observed for several seconds. This means that we should see a given vehicle_id repeated over multiple rows consecutively. We therefore sort by vehicle_id and frame_id:

## First: Sort by Vehicle ID and Time
data_R <- data_R |> 
  dplyr::arrange(vehicle_id, frame_id)

head(data_R)

When we want to compare several vehicle trajectories, e.g., how their speeds change over time regardless of when they were observed, we’d want a common time scale. The NGSIM documentation describes that vehicles were observed at a resolution of 0.1 seconds. So, we can create atime variable for each vehicle that starts at 0 and ends at (n-1)/10 where n = number of rows for which a vehicle_id is repeated:

data_R <- data_R |> 
  dplyr::group_by(vehicle_id) |>
  dplyr::mutate(time = (0:(dplyr::n()-1))/10) |>
  dplyr::ungroup()

head(data_R)
1
Group by vehicle_id so that time is calculated for each vehicle separately.
2
dplyr::n() gives the group size.
3
Don’t forget to ungroup.

How to create variables for the preceding vehicle?

We’d often need velocity, acceleration, and other variables for the preceding vehicle (vehicle in front of the subject vehicle). These variables can be useful in car-following modeling.

In this dataset, preceding vehicle variables do not exist as separate columns. The only relevant column is preceding which is the identifier of the preceding vehicle. But the data also contains the subject vehicle identifier vehicle_id, along with these variables:

  • local_y: longitudinal position of the front end of the subject vehicle (feet)
  • local_x: lateral position of the front end of the subject vehicle (feet)
  • v_length and v_width are the length and width of the subject vehicle (feet)
  • v_class is the class of the subject vehicle, i.e., 1 = motorcycle, 2 = car, and 3 = heavy vehicle (bus/truck)
  • v_vel and v_acc are the velocity (ft/s) and acceleration (ft/s/s) of the subject vehicle

Our goal now is to create new columns of the above variables for the preceding vehicle. To this end, we look for the value of preceding in the vehicle_id column at a given value of frame_id (identifier of time frame) and then grab the value of variable e.g., v_vel at that frame_id:

## Create new cols
data_R <- data_R |> 
  dplyr::group_by(frame_id) |>
  dplyr::mutate(preceding_local_y = local_y[match(preceding, vehicle_id)],
                preceding_local_x = local_x[match(preceding, vehicle_id)],
                preceding_length = v_length[match(preceding, vehicle_id)],
                preceding_width = v_width[match(preceding, vehicle_id)],
                preceding_class = v_class[match(preceding, vehicle_id)],
                preceding_vel = v_vel[match(preceding, vehicle_id)],
                preceding_acc = v_acc[match(preceding, vehicle_id)]) |> 
  dplyr::ungroup()

head(data_R)
1
Grouping by time using frame_id.
2
match function matches the values of preceding to vehicle_id and provide the positions of these matches. Then we use the <VAR>[] syntax to grab the value of a variable (where <VAR> means the variable of interest e.g., v_vel).

A NA value indicates missing value. In this dataset, NA indicates that the value is missing because there was no preceding vehicle observed. For vehicle_id 1 we can see this is true because the preceding value is 0.

How to remove undesired columns?

There are several variables in this dataset that we don’t need as they are completely devoid of any value. So we remove them:

data_R <- data_R |>
  select(-c(total_frames, starts_with("global"), following, 
            o_zone, d_zone, int_id, section_id, direction, 
            movement), global_time) 

head(data_R)

How to transform multiple columns?

Metric units

As discussed before, variables in this dataset have imperial units (feet, ft/s, etc.). You may want to transform the values of these variables to metric units. The conversion factor is 0.3048. Here, we utilize the across function to take all the desired columns and apply the conversion factor along with rounding to 2 decimal places:

## metric units
data_R <- data_R |> 
  dplyr::mutate(dplyr::across(
    .cols = c(dplyr::starts_with("local"), dplyr::starts_with("v_"), space_headway, dplyr::starts_with("preceding"), -preceding, -preceding_class, -v_class),
    .fns = ~ round(.x * .3048, 2)
  ))

head(data_R)

Convert numbers/strings to categorical data type

Moreover, we know that there are variables that should be treated as categorical (qualitative) rather than numbers or strings. For instance, lane_id has values 1-7 and we know that these are identifiers for lanes. Similarly, the class of a vehicle is encoded as 1, 2, and 3 but we know that these numbers do not have any quantitaive information, rather they are categories.

In R, categorical data is encoded as factor data type. So, we use the as.factor function to convert numbers/strings to factor data type:

## factor type
data_R <- data_R |> 
  mutate(across(
  .cols = c(vehicle_id, v_class, lane_id, preceding, preceding_class),
  .fns = ~ as.factor(.x)
))

head(data_R)

Visualization with one vehicle

Cool! We are almost done with transforming our dataset. It is time to do some visualization. The last transformation we learn is to filter the data to keep only one vehicle:

data_R_veh <- data_R |> 
  dplyr::filter(vehicle_id == "2")

And now we use ggplot2 to create a plot of velocity over time. Subject vehicle in blue and preceding vehicle in orange.

library(ggplot2)
ggplot(data = data_R_veh) +
  geom_path(mapping = aes(x = time, y = v_vel), color = "blue") +
  geom_path(mapping = aes(x = time, y = preceding_vel), color = "orange") +
  labs(x = "Time (s)", y = "Velocity (m/s)",
       title = "Velocity of vehicle # 2 and its preceding vehicle") +
  theme_minimal()

As you see, the lead vehicle speed is not seen after about 17 seconds. This is because the lead vehicle changed lanes. This can also be seen in this animation:

library(gganimate)
ggplot(data = data_R_veh) +
  geom_rect(aes(fill = "Preceding Veh", 
                xmin = preceding_local_y - preceding_length,
                xmax = preceding_local_y,
                ymin = preceding_local_x - (preceding_width/2),
                ymax = preceding_local_x + (preceding_width/2))) +
  geom_rect(aes(fill = "Subject Veh",
                xmin = local_y - v_length,
                xmax = local_y,
                ymin = local_x - (v_width/2),
                ymax = local_x + (v_width/2))) +
  geom_hline(yintercept = 3.6, linetype = "longdash") +
  coord_fixed(ratio=3.5) +
  labs(fill = NULL) +
  theme_minimal() + 
  theme(legend.position = "none") +
  transition_reveal(time)

Reuse

Citation

BibTeX citation:
@online{durrani2024,
  author = {Durrani, Umair},
  title = {Transform {Data} with {R}},
  date = {2024-01-21},
  langid = {en}
}
For attribution, please cite this work as:
Durrani, Umair. 2024. “Transform Data with R.” January 21, 2024.