Transform Data with Julia

import
transform
Tidier.jl
Author
Published

March 10, 2024

Note

This is the second post in the NGSIM data analysis series. Previous post:

After completing this post, you will be able to:
  • Read parquet files
  • Do a quick clean-up of column names
  • Convert integer time to datetime fromat
  • Sort data by columns
  • Create new columns
  • Remove undesired columns
  • Filter data

This post shows code in Julia. You can see the same post written in Python here and in R here.

In the previous post, we partitioned the Interstate 80 (I80) vehicle trajectories data by time and then saved it on disk:

We now make use of the parquet file in the period=first directory for learning to transform data. Since this file is part of partitioned data, I saved it as a separate parquet file (i80_period1.parquet) in a different location.

Load parquet file

Reading the parquet file requires the Parquet2 package:

# ] add Parquet2
using DataFrames
using Parquet2: Dataset
path_to_first_period_file = "data/i80_period1.parquet"
data_jl = DataFrame(Dataset(path_to_first_period_file));

first(data_jl, 5)
5×24 DataFrame
Row Vehicle_ID Frame_ID Total_Frames Global_Time Local_X Local_Y Global_X Global_Y v_length v_Width v_Class v_Vel v_Acc Lane_ID O_Zone D_Zone Int_ID Section_ID Direction Movement Preceding Following Space_Headway Time_Headway
Int32? Int32? Int32? Int64? Float64? Float64? Float64? Float64? Float64? Float64? Int32? Float64? Float64? Int32? Int32? Int32? Int32? Int32? Int32? Int32? Int32? Int32? Float64? Float64?
1 3027 8493 813 1113433984200 53.115 363.266 6.04284e6 2.13343e6 15.3 7.4 2 21.46 -8.14 5 missing missing missing missing missing missing 3014 3032 64.23 2.99
2 3214 9115 708 1113434046400 67.931 655.629 6.04282e6 2.13373e6 13.8 6.3 2 15.37 11.2 6 missing missing missing missing missing missing 3221 3229 31.71 2.06
3 3199 9329 575 1113434067800 17.026 1237.59 6.04268e6 2.1343e6 14.4 5.9 2 39.3 0.0 2 missing missing missing missing missing missing 3188 3206 72.36 1.84
4 3159 8919 572 1113434026800 16.541 306.905 6.04281e6 2.13337e6 16.4 5.9 2 14.71 3.61 2 missing missing missing missing missing missing 3152 3171 51.9 3.53
5 3314 9324 616 1113434067300 28.846 65.807 6.04285e6 2.13314e6 14.8 6.4 2 36.24 0.0 3 missing missing missing missing missing missing 3301 0 103.26 2.85

Clean dataframe names

As you can see above, the column names are in good shape i.e., without any spaces. However, it is easier typing lowercase letters than the sentence case. So, we use the clean_names macro from the TidierData package to make all column names lower case. If the column names have spaces or periods in them, clean_names would replace them with underscores.

using TidierData
data_jl = @clean_names(data_jl)

@glimpse(data_jl)
Rows: 1250932
Columns: 24
.vehicle_id    Union{Missing, Int32}3027, 3214, 3199, 3159, 3314, 3128, 3031, 29
.frame_id      Union{Missing, Int32}8493, 9115, 9329, 8919, 9324, 9052, 9030, 86
.total_frames  Union{Missing, Int32}813, 708, 575, 572, 616, 603, 958, 987, 760,
.global_time   Union{Missing, Int64}1113433984200, 1113434046400, 1113434067800,
.local_x       Union{Missing, Float64}53.115, 67.931, 17.026, 16.541, 28.846, 17
.local_y       Union{Missing, Float64}363.266, 655.629, 1237.592, 306.905, 65.80
.global_x      Union{Missing, Float64}6.042839372e6, 6.042817933e6, 6.042683444e
.global_y      Union{Missing, Float64}2.133434927e6, 2.133726884e6, 2.134296961e
.v_length      Union{Missing, Float64}15.3, 13.8, 14.4, 16.4, 14.8, 16.8, 15.3, 
.v_width       Union{Missing, Float64}7.4, 6.3, 5.9, 5.9, 6.4, 8.5, 7.4, 5.8, 8.
.v_class       Union{Missing, Int32}2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
.v_vel         Union{Missing, Float64}21.46, 15.37, 39.3, 14.71, 36.24, 35.69, 2
.v_acc         Union{Missing, Float64}-8.14, 11.2, 0.0, 3.61, 0.0, 11.2, 10.8, -
.lane_id       Union{Missing, Int32}5, 6, 2, 2, 3, 2, 4, 4, 5, 5, 4, 6, 6, 2, 4,
.o_zone        Union{Missing, Int32}missing, missing, missing, missing, missing,
.d_zone        Union{Missing, Int32}missing, missing, missing, missing, missing,
.int_id        Union{Missing, Int32}missing, missing, missing, missing, missing,
.section_id    Union{Missing, Int32}missing, missing, missing, missing, missing,
.direction     Union{Missing, Int32}missing, missing, missing, missing, missing,
.movement      Union{Missing, Int32}missing, missing, missing, missing, missing,
.preceding     Union{Missing, Int32}3014, 3221, 3188, 3152, 3301, 3125, 0, 2973,
.following     Union{Missing, Int32}3032, 3229, 3206, 3171, 0, 3132, 3032, 2987,
.space_headway Union{Missing, Float64}64.23, 31.71, 72.36, 51.9, 103.26, 90.81, 
.time_headway  Union{Missing, Float64}2.99, 2.06, 1.84, 3.53, 2.85, 2.54, 0.0, 3

The @glimpse macro shows that all variable names are now in lower case now. But variables have the Missing type common. This is perhaps because some of the columns have missing values. Let’s look at 2 variables:

count(x -> ismissing(x), data_jl.frame_id)
0
count(x -> ismissing(x), data_jl.o_zone)
1250932

We therefore remove all the columns that have missing values (completely missing!). Even after that, one of the type of the remaining variables is Missing. So, we replace missing values with zero just to change the type:

data_jl = @select(data_jl, Not(:o_zone, :d_zone, :int_id, :section_id, :direction, :movement))
for col in names(data_jl)
    data_jl[!, col] .= coalesce.(data_jl[!, col], 0)
end

How to create a time column?

Since vehicle trajectories change over time, it is nice to see these changes over different time periods. However, the gloabl_time column in this dataset contains integers rather than the actual time. So, we create a new column called actual_time by dividing the global_time by 1000 and converting it to a datetime object.

In Julia, this can be done via Dates package. The integer to datetime data conversion will report time in UTC. But the data were collected in Los Angeles, so we want to specify the America/Los_Angeles time zone. Thus, we subtract 7 hours as UTC is 7 hours ahead of PDT:

using Dates

data_jl.actual_time = DateTime.(Dates.unix2datetime.(data_jl.global_time ./ 1000))
data_jl.actual_time = data_jl.actual_time .- Hour(7)

first(data_jl, 5)
5×19 DataFrame
Row vehicle_id frame_id total_frames global_time local_x local_y global_x global_y v_length v_width v_class v_vel v_acc lane_id preceding following space_headway time_headway actual_time
Int32 Int32 Int32 Int64 Float64 Float64 Float64 Float64 Float64 Float64 Int32 Float64 Float64 Int32 Int32 Int32 Float64 Float64 DateTime
1 3027 8493 813 1113433984200 53.115 363.266 6.04284e6 2.13343e6 15.3 7.4 2 21.46 -8.14 5 3014 3032 64.23 2.99 2005-04-13T16:13:04.200
2 3214 9115 708 1113434046400 67.931 655.629 6.04282e6 2.13373e6 13.8 6.3 2 15.37 11.2 6 3221 3229 31.71 2.06 2005-04-13T16:14:06.400
3 3199 9329 575 1113434067800 17.026 1237.59 6.04268e6 2.1343e6 14.4 5.9 2 39.3 0.0 2 3188 3206 72.36 1.84 2005-04-13T16:14:27.800
4 3159 8919 572 1113434026800 16.541 306.905 6.04281e6 2.13337e6 16.4 5.9 2 14.71 3.61 2 3152 3171 51.9 3.53 2005-04-13T16:13:46.800
5 3314 9324 616 1113434067300 28.846 65.807 6.04285e6 2.13314e6 14.8 6.4 2 36.24 0.0 3 3301 0 103.26 2.85 2005-04-13T16:14:27.300

Note that the data is not in the correct order. The vehicle_ids in the first two rows are 3027 and 3214. However, we know that the same vehicle was observed for several seconds. This means that we should see a given vehicle_id repeated over multiple rows consecutively. We therefore sort by vehicle_id and frame_id:

## First: Sort by Vehicle ID and Time
data_jl = @arrange(data_jl, vehicle_id, frame_id)
first(data_jl, 5)
5×19 DataFrame
Row vehicle_id frame_id total_frames global_time local_x local_y global_x global_y v_length v_width v_class v_vel v_acc lane_id preceding following space_headway time_headway actual_time
Int32 Int32 Int32 Int64 Float64 Float64 Float64 Float64 Float64 Float64 Int32 Float64 Float64 Int32 Int32 Int32 Float64 Float64 DateTime
1 1 12 884 1113433136100 16.884 48.213 6.04284e6 2.13312e6 14.3 6.4 2 12.5 0.0 2 0 0 0.0 0.0 2005-04-13T15:58:56.100
2 1 13 884 1113433136200 16.938 49.463 6.04284e6 2.13312e6 14.3 6.4 2 12.5 0.0 2 0 0 0.0 0.0 2005-04-13T15:58:56.200
3 1 14 884 1113433136300 16.991 50.712 6.04284e6 2.13312e6 14.3 6.4 2 12.5 0.0 2 0 0 0.0 0.0 2005-04-13T15:58:56.300
4 1 15 884 1113433136400 17.045 51.963 6.04284e6 2.13312e6 14.3 6.4 2 12.5 0.0 2 0 0 0.0 0.0 2005-04-13T15:58:56.400
5 1 16 884 1113433136500 17.098 53.213 6.04284e6 2.13312e6 14.3 6.4 2 12.5 0.0 2 0 0 0.0 0.0 2005-04-13T15:58:56.500

When we want to compare several vehicle trajectories, e.g., how their speeds change over time regardless of when they were observed, we’d want a common time scale. The NGSIM documentation describes that vehicles were observed at a resolution of 0.1 seconds. So, we can create atime variable for each vehicle that starts at 0 and ends at (n-1)/10 where n = number of rows for which a vehicle_id is repeated:

data_jl = @chain data_jl begin
    @group_by(vehicle_id)
    @mutate(time = (0:(first(n()) - 1))/10)
    @ungroup()
end;
1
Group by vehicle_id so that time is calculated for each vehicle separately.
2
n() gives the group size.
3
Don’t forget to ungroup.

How to create variables for the preceding vehicle?

We’d often need velocity, acceleration, and other variables for the preceding vehicle (vehicle in front of the subject vehicle). These variables can be useful in car-following modeling.

In this dataset, preceding vehicle variables do not exist as separate columns. The only relevant column is preceding which is the identifier of the preceding vehicle. But the data also contains the subject vehicle identifier vehicle_id, along with these variables:

  • local_y: longitudinal position of the front end of the subject vehicle (feet)
  • local_x: lateral position of the front end of the subject vehicle (feet)
  • v_length and v_width are the length and width of the subject vehicle (feet)
  • v_class is the class of the subject vehicle, i.e., 1 = motorcycle, 2 = car, and 3 = heavy vehicle (bus/truck)
  • v_vel and v_acc are the velocity (ft/s) and acceleration (ft/s/s) of the subject vehicle

Our goal now is to create new columns of the above variables for the preceding vehicle. To this end, we look for the value of preceding in the vehicle_id column at a given value of frame_id (identifier of time frame) and then grab the value of variable e.g., v_vel at that frame_id. In Julia, we achieve this by joining a few columns of the dataset with the full dataset while using the columns vehicle_id and preceding for the join:

# Join
data_jl = leftjoin(
           data_jl,
           data_jl[:, [:frame_id, :vehicle_id, :local_y, :v_length, :v_width, :v_class, :v_vel, :v_acc]],
           on = [:frame_id => :frame_id, :preceding => :vehicle_id],
           makeunique = true, 
           renamecols = "" => "_preceding"
       )

# Arrange again
data_jl = @arrange(data_jl, vehicle_id, frame_id)


# Rename columns to match with naming in R and Python posts
data_jl = @chain data_jl begin
    @rename(preceding_local_y = local_y_preceding,
            preceding_length =  v_length_preceding,
            preceding_width = v_width_preceding,
            preceding_class = v_class_preceding,
            preceding_vel = v_vel_preceding,
            preceding_acc = v_acc_preceding)
end

first(data_jl, 5)
5×26 DataFrame
Row vehicle_id frame_id total_frames global_time local_x local_y global_x global_y v_length v_width v_class v_vel v_acc lane_id preceding following space_headway time_headway actual_time time preceding_local_y preceding_length preceding_width preceding_class preceding_vel preceding_acc
Int32 Int32 Int32 Int64 Float64 Float64 Float64 Float64 Float64 Float64 Int32 Float64 Float64 Int32 Int32 Int32 Float64 Float64 DateTime Float64 Float64? Float64? Float64? Int32? Float64? Float64?
1 1 12 884 1113433136100 16.884 48.213 6.04284e6 2.13312e6 14.3 6.4 2 12.5 0.0 2 0 0 0.0 0.0 2005-04-13T15:58:56.100 0.0 missing missing missing missing missing missing
2 1 13 884 1113433136200 16.938 49.463 6.04284e6 2.13312e6 14.3 6.4 2 12.5 0.0 2 0 0 0.0 0.0 2005-04-13T15:58:56.200 0.1 missing missing missing missing missing missing
3 1 14 884 1113433136300 16.991 50.712 6.04284e6 2.13312e6 14.3 6.4 2 12.5 0.0 2 0 0 0.0 0.0 2005-04-13T15:58:56.300 0.2 missing missing missing missing missing missing
4 1 15 884 1113433136400 17.045 51.963 6.04284e6 2.13312e6 14.3 6.4 2 12.5 0.0 2 0 0 0.0 0.0 2005-04-13T15:58:56.400 0.3 missing missing missing missing missing missing
5 1 16 884 1113433136500 17.098 53.213 6.04284e6 2.13312e6 14.3 6.4 2 12.5 0.0 2 0 0 0.0 0.0 2005-04-13T15:58:56.500 0.4 missing missing missing missing missing missing

In this dataset, missing indicates that the value is missing because there was no preceding vehicle observed. For vehicle_id 1 we can see this is true because the preceding value is 0.

How to remove undesired columns?

There are several variables in this dataset that we don’t need as they are completely devoid of any value. We have removed those variables already. We further remove variables that we don’t need for analysis:

data_jl = @select(data_jl, Not(:total_frames, starts_with("global"), :following))
first(data_jl, 5)
5×21 DataFrame
Row vehicle_id frame_id local_x local_y v_length v_width v_class v_vel v_acc lane_id preceding space_headway time_headway actual_time time preceding_local_y preceding_length preceding_width preceding_class preceding_vel preceding_acc
Int32 Int32 Float64 Float64 Float64 Float64 Int32 Float64 Float64 Int32 Int32 Float64 Float64 DateTime Float64 Float64? Float64? Float64? Int32? Float64? Float64?
1 1 12 16.884 48.213 14.3 6.4 2 12.5 0.0 2 0 0.0 0.0 2005-04-13T15:58:56.100 0.0 missing missing missing missing missing missing
2 1 13 16.938 49.463 14.3 6.4 2 12.5 0.0 2 0 0.0 0.0 2005-04-13T15:58:56.200 0.1 missing missing missing missing missing missing
3 1 14 16.991 50.712 14.3 6.4 2 12.5 0.0 2 0 0.0 0.0 2005-04-13T15:58:56.300 0.2 missing missing missing missing missing missing
4 1 15 17.045 51.963 14.3 6.4 2 12.5 0.0 2 0 0.0 0.0 2005-04-13T15:58:56.400 0.3 missing missing missing missing missing missing
5 1 16 17.098 53.213 14.3 6.4 2 12.5 0.0 2 0 0.0 0.0 2005-04-13T15:58:56.500 0.4 missing missing missing missing missing missing

How to transform multiple columns?

Metric units

As discussed before, variables in this dataset have imperial units (feet, ft/s, etc.). You may want to transform the values of these variables to metric units. The conversion factor is 0.3048. Here, we write a function to take a column and apply the conversion factor along with rounding to 2 decimal places:

## metric units
function convert_and_round(x)
    return round(x * 0.3048, digits=2)
end

cols_to_mutate = ["local_x", "local_y", "v_length", "v_width", 
                 "v_vel", "v_acc", "space_headway",
                 "preceding_local_y", "preceding_length", "preceding_width",
                 "preceding_vel", "preceding_acc"]

for col in cols_to_mutate
    data_jl[!, col] .= convert_and_round.(data_jl[!, col])
end

first(data_jl, 5)
5×21 DataFrame
Row vehicle_id frame_id local_x local_y v_length v_width v_class v_vel v_acc lane_id preceding space_headway time_headway actual_time time preceding_local_y preceding_length preceding_width preceding_class preceding_vel preceding_acc
Int32 Int32 Float64 Float64 Float64 Float64 Int32 Float64 Float64 Int32 Int32 Float64 Float64 DateTime Float64 Float64? Float64? Float64? Int32? Float64? Float64?
1 1 12 5.15 14.7 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13T15:58:56.100 0.0 missing missing missing missing missing missing
2 1 13 5.16 15.08 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13T15:58:56.200 0.1 missing missing missing missing missing missing
3 1 14 5.18 15.46 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13T15:58:56.300 0.2 missing missing missing missing missing missing
4 1 15 5.2 15.84 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13T15:58:56.400 0.3 missing missing missing missing missing missing
5 1 16 5.21 16.22 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13T15:58:56.500 0.4 missing missing missing missing missing missing

Convert numbers to string data type

Moreover, we know that there are variables that should be treated as categorical (qualitative) rather than numbers. For instance, lane_id has values 1-7 and we know that these are identifiers for lanes. Similarly, the class of a vehicle is encoded as 1, 2, and 3 but we know that these numbers do not have any quantitaive information, rather they are categories.

We use the string function to convert numbers to string data type:

data_jl[!,:vehicle_id] = string.(data_jl[!,:vehicle_id])
data_jl[!,:v_class] = string.(data_jl[!,:v_class])
data_jl[!,:lane_id] = string.(data_jl[!,:lane_id])
data_jl[!,:preceding] = string.(data_jl[!,:preceding])
data_jl[!,:preceding_class] = string.(data_jl[!,:preceding_class])

first(data_jl, 5)
5×21 DataFrame
Row vehicle_id frame_id local_x local_y v_length v_width v_class v_vel v_acc lane_id preceding space_headway time_headway actual_time time preceding_local_y preceding_length preceding_width preceding_class preceding_vel preceding_acc
String Int32 Float64 Float64 Float64 Float64 String Float64 Float64 String String Float64 Float64 DateTime Float64 Float64? Float64? Float64? String Float64? Float64?
1 1 12 5.15 14.7 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13T15:58:56.100 0.0 missing missing missing missing missing missing
2 1 13 5.16 15.08 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13T15:58:56.200 0.1 missing missing missing missing missing missing
3 1 14 5.18 15.46 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13T15:58:56.300 0.2 missing missing missing missing missing missing
4 1 15 5.2 15.84 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13T15:58:56.400 0.3 missing missing missing missing missing missing
5 1 16 5.21 16.22 4.36 1.95 2 3.81 0.0 2 0 0.0 0.0 2005-04-13T15:58:56.500 0.4 missing missing missing missing missing missing

Visualization with one vehicle

Cool! We are almost done with transforming our dataset. It is time to do some visualization. The last transformation we learn is to filter the data to keep only one vehicle:

data_jl_veh = @filter(data_jl, vehicle_id == "2");

And now we use TidierPlots to create a plot of velocity over time. Subject vehicle in blue and preceding vehicle in orange.

using TidierPlots
ggplot(data = data_jl_veh) +
  geom_path(aes(x = "time", y = "v_vel"), color = "blue") +
  geom_path(aes(x = "time", y = "preceding_vel"), color = "orange") +
  labs(x = "Time (s)", y = "Velocity (m/s)",
       title = "Velocity of vehicle # 2 and its preceding vehicle") +
  theme_minimal()
height: 400
x: Time (s)
title: Velocity of vehicle # 2 and its preceding vehicle
width: 600
y: Velocity (m/s)

geom_path
data: inherits from plot
x: time 
y: v_vel 

geom_path
data: inherits from plot
x: time 
y: preceding_vel 

ggplot options
data: A DataFrame (415 rows, 21 columns)

As you see, the lead vehicle speed is not seen after about 17 seconds. This is because the lead vehicle changed lanes.

Reuse

Citation

BibTeX citation:
@online{durrani2024,
  author = {Durrani, Umair},
  title = {Transform {Data} with {Julia}},
  date = {2024-03-10},
  langid = {en}
}
For attribution, please cite this work as:
Durrani, Umair. 2024. “Transform Data with Julia.” March 10, 2024.