I have created a dummy sample of ~10_000 rows with two columns (pickup and dropoff location points, encoded as strings).
I read the sample into a Polars DataFrame with the following command:
df = pl.read_csv("./taxi_coordinates.csv")
I would like to efficiently compute the distance between those points using geodesic from geopy (from geopy.distance import geodesic).
Please note that I am trying to discover the most efficient approach because my original sample is over 30 million rows.
My approach using map_rows()
def compute_coordinates_v2(df: pl.DataFrame, col: str) -> pl.DataFrame:
    target_col: str = 'pu_polygon_centroid' if col == 'pickup' else 'do_polygon_centroid'
    location_data: str = f'{col}_location_cleaned'
    coordinates: str = f'{col}_coordinates'
    df = df.with_columns(
        pl.col(target_col).str.replace_all(r'POINT \(|\)', '').alias(location_data)
    ).with_columns(
        pl.col(location_data).str.split(' ').alias(coordinates)
    )
    return df
df = compute_coordinates_v2(df, 'pickup')
df = compute_coordinates_v2(df, 'dropoff')
The above operations generate two columns of type list[str]:
shape: (5, 1)
┌───────────────────────────────────┐
│ pickup_coordinates │
│ --- │
│ list[str] │
╞═══════════════════════════════════╡
│ ["-73.95701169835736", "40.78043… │
│ ["-73.95701169835736", "40.78043… │
│ ["-73.95701169835736", "40.78043… │
│ ["-73.9656345353807", "40.768615… │
│ ["-73.9924375369761", "40.748497… │
└───────────────────────────────────┘
shape: (5, 1)
┌───────────────────────────────────┐
│ dropoff_coordinates │
│ --- │
│ list[str] │
╞═══════════════════════════════════╡
│ ["-73.9656345353807", "40.768615… │
│ ["-73.95701169835736", "40.78043… │
│ ["-73.95701169835736", "40.78043… │
│ ["-73.9924375369761", "40.748497… │
│ ["-74.007879708664", "40.7177727… │
└───────────────────────────────────┘
Now, to compute the distance, I use the following function:
def compute_centroid_distance_v2(row):
    if row[0][0] and row[0][1] and row[1][0] and row[1][1]:
        centroid_distance = geodesic(
            (row[0][1], row[0][0]),  # (latitude, longitude)
            (row[1][1], row[1][0])
        ).kilometers
    else:
        centroid_distance = 0.0
    return centroid_distance
df = df.with_columns(
    df.select(["pickup_coordinates", "dropoff_coordinates"])
      .map_rows(compute_centroid_distance_v2)
      .rename({'map': "centroid_distance"})
)
On a benchmark of 30 million rows, the map_rows() approach took approximately 1.5 hours.
Obviously, something like

df = df.with_columns(
    pl.col("pickup_coordinates").list.first().cast(pl.Float32).alias('pickup_longitude'),
    pl.col("pickup_coordinates").list.last().cast(pl.Float32).alias('pickup_latitude'),
    pl.col("dropoff_coordinates").list.first().cast(pl.Float32).alias('dropoff_longitude'),
    pl.col("dropoff_coordinates").list.last().cast(pl.Float32).alias('dropoff_latitude')
).with_columns(
    coords = geodesic(
        (pl.col("pickup_latitude"), pl.col('pickup_longitude')),
        (pl.col("dropoff_latitude"), pl.col('dropoff_longitude'))
    ).kilometers
)
didn't work, because geodesic is an ordinary Python function: it receives the tuple (pl.col("pickup_latitude"), pl.col("pickup_longitude")) rather than numeric values, and Polars fails when a logical operation is applied to those expressions.
Thus, I would like to understand whether map_rows()/map_elements() is my only option, or if there is a different work-around that could speed up the computation.
(An approximation would also be acceptable, e.g. a great-circle distance computed with the mean Earth radius R = 6371.009 instead.)