I have created a dummy sample of ~10,000 rows with two columns (pickup and dropoff location points, encoded as strings).

I read the sample into a Polars DataFrame with the following command:

import polars as pl

df = pl.read_csv("./taxi_coordinates.csv")

I would like to efficiently compute the distance between those points using geodesic from geopy.distance.
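
For reference, geodesic() takes two (latitude, longitude) tuples and returns a Distance object. A quick sanity check with approximate values taken from the sample data shown below (note that the WKT points store longitude first, so the order has to be swapped):

from geopy.distance import geodesic

# geodesic expects (latitude, longitude); the CSV stores 'POINT (lon lat)'
geodesic((40.78043, -73.95701), (40.768615, -73.965634)).kilometers
# ~1.5 km, matching the first row of the map_rows output further down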

Please note that I am looking for the most efficient approach, because my original sample is over 30 million rows.

My approach using map_rows()

def compute_coordinates_v2(df: pl.DataFrame, col: str) -> pl.DataFrame:
    target_col: str = 'pu_polygon_centroid' if col == 'pickup' else 'do_polygon_centroid'
    location_data: str = f'{col}_location_cleaned'
    coordinates: str = f'{col}_coordinates'
    df = df.with_columns(
        # strip the WKT wrapper: 'POINT (lon lat)' -> 'lon lat'
        pl.col(target_col).str.replace_all(r'POINT \(|\)', '').alias(location_data)
    ).with_columns(
        # split into a list[str] of [longitude, latitude]
        pl.col(location_data).str.split(' ').alias(coordinates)
    )
    return df

df = compute_coordinates_v2(df, 'pickup')
df = compute_coordinates_v2(df, 'dropoff')

The above operation generates two columns of list type:

shape: (5, 1)
┌───────────────────────────────────┐
│ pickup_coordinates                │
│ ---                               │
│ list[str]                         │
╞═══════════════════════════════════╡
│ ["-73.95701169835736", "40.78043… │
│ ["-73.95701169835736", "40.78043… │
│ ["-73.95701169835736", "40.78043… │
│ ["-73.9656345353807", "40.768615… │
│ ["-73.9924375369761", "40.748497… │
└───────────────────────────────────┘
shape: (5, 1)
┌───────────────────────────────────┐
│ dropoff_coordinates               │
│ ---                               │
│ list[str]                         │
╞═══════════════════════════════════╡
│ ["-73.9656345353807", "40.768615… │
│ ["-73.95701169835736", "40.78043… │
│ ["-73.95701169835736", "40.78043… │
│ ["-73.9924375369761", "40.748497… │
│ ["-74.007879708664", "40.7177727… │
└───────────────────────────────────┘

Now, to compute the distance, I use the following function:

def compute_centroid_distance_v2(row):
    # row[0] = pickup [lon, lat], row[1] = dropoff [lon, lat] (both as strings)
    if row[0][0] and row[0][1] and row[1][0] and row[1][1]:
        centroid_distance = geodesic(
            (row[0][1], row[0][0]),  # (latitude, longitude)
            (row[1][1], row[1][0])
        ).kilometers
    else:
        centroid_distance = 0.0
    return centroid_distance

df = df.with_columns(
    df.select(["pickup_coordinates", "dropoff_coordinates"])
      .map_rows(compute_centroid_distance_v2)
      .rename({"map": "centroid_distance"})
)

On a benchmark of 30 million rows, map_rows() took approximately 1.5 hours.

Obviously something like

df = df.with_columns(
    pl.col("pickup_coordinates").list.first().cast(pl.Float32).alias("pickup_longitude"),
    pl.col("pickup_coordinates").list.last().cast(pl.Float32).alias("pickup_latitude"),
    pl.col("dropoff_coordinates").list.first().cast(pl.Float32).alias("dropoff_longitude"),
    pl.col("dropoff_coordinates").list.last().cast(pl.Float32).alias("dropoff_latitude")
).with_columns(
    coords = geodesic(
        (pl.col("pickup_latitude"), pl.col("pickup_longitude")),
        (pl.col("dropoff_latitude"), pl.col("dropoff_longitude"))
    ).kilometers
)

didn't work, because geodesic() is an ordinary Python function that expects concrete coordinate values; when it is handed expression objects such as (pl.col("pickup_latitude"), pl.col("pickup_longitude")), Polars ends up trying to apply a logical operation on them.

Thus, I would like to understand whether map_rows/map_elements is my only option, or whether there is a workaround that could speed up the computation.


2 Answers


As in the answer at https://stackoverflow.com/a/76265233/, you can attempt to replicate geodesic() using Polars expressions.

Another potential option could be DuckDB, which can input/output Polars DataFrames easily.

DuckDB has a SPATIAL extension: https://duckdb.org/2023/04/28/spatial.html

duckdb.sql("install spatial") # needed once

If I scale up your example for a simple comparison:

df_big = pl.concat([df] * 10)

Using your map_rows approach:

(df_big
  .select("pickup_coordinates", "dropoff_coordinates")
  .map_rows(compute_centroid_distance_v2)
  .rename({"map": "centroid_distance"})
)
shape: (98_510, 1)
┌───────────────────┐
│ centroid_distance │
│ ---               │
│ f64               │
╞═══════════════════╡
│ 1.50107           │
│ 0.0               │
│ 0.0               │
│ 3.18019           │
│ 3.652772          │
│ …                 │
│ 2.376629          │
│ 1.440797          │
│ 4.583181          │
│ 0.53954           │
│ 2.589928          │
└───────────────────┘

Elapsed time: 4.52725 seconds

Using duckdb:

duckdb.sql("load spatial")
duckdb.sql("""
from df_big
select
   st_distance_spheroid(
      st_point(
         pickup_coordinates[2]::float, -- NOTE: array indexing is 1-based
         pickup_coordinates[1]::float
      ),
      st_point(
         dropoff_coordinates[2]::float,
         dropoff_coordinates[1]::float
      )
   ) as geodesic
""") .pl()
shape: (98_510, 1)
┌─────────────┐
│ geodesic    │
│ ---         │
│ f64         │
╞═════════════╡
│ 1501.364    │
│ 0.0         │
│ 0.0         │
│ 3180.189287 │
│ 3652.673199 │
│ …           │
│ 2376.786018 │
│ 1440.740571 │
│ 4583.039701 │
│ 539.144276  │
│ 2590.085087 │
└─────────────┘

Elapsed time: 0.10821 seconds

I don't know too much about spatial data, so I'm not entirely sure why there are slight differences in the outputs. (Note that st_distance_spheroid returns metres while geopy returns kilometres; the remaining small discrepancies may come from the ::float casts in the query, which are single precision.)

It also seems like you could use st_read() to load the data directly into DuckDB instead of having to manually chop it up with Polars first.
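
As a rough, untested sketch of that idea (using read_csv_auto and st_geomfromtext rather than st_read, and the original WKT column names pu_polygon_centroid / do_polygon_centroid from the question), DuckDB could parse the WKT points itself:

duckdb.sql("load spatial")
duckdb.sql("""
from read_csv_auto('./taxi_coordinates.csv')
select
   st_distance_spheroid(
      st_point(
         st_y(st_geomfromtext(pu_polygon_centroid)),  -- latitude
         st_x(st_geomfromtext(pu_polygon_centroid))   -- longitude
      ),
      st_point(
         st_y(st_geomfromtext(do_polygon_centroid)),
         st_x(st_geomfromtext(do_polygon_centroid))
      )
   ) as geodesic -- metres, as above
""").pl()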


I have computed the haversine distance for my data points, based on the solution provided here (in case someone wants to implement this):

import numpy as np

def compute_haversine_distance(df: pl.DataFrame, R: np.float64, coordinates: dict) -> pl.DataFrame:
    pl.Config.set_fmt_float("full")
    multiplier: float = np.pi / 180  # degrees -> radians
    # the coordinate lists are stored as [longitude, latitude],
    # so list.last() is the latitude and list.first() the longitude
    rad_lat1: pl.Expr = pl.col(coordinates["pickup_points"]).list.last().cast(pl.Float64) * multiplier
    rad_lat2: pl.Expr = pl.col(coordinates["dropoff_points"]).list.last().cast(pl.Float64) * multiplier
    rad_lng1: pl.Expr = pl.col(coordinates["pickup_points"]).list.first().cast(pl.Float64) * multiplier
    rad_lng2: pl.Expr = pl.col(coordinates["dropoff_points"]).list.first().cast(pl.Float64) * multiplier
    # haversine term: sin^2(dlat / 2) + cos(lat1) * cos(lat2) * sin^2(dlng / 2)
    haversin: pl.Expr = (
        (rad_lat2 - rad_lat1).truediv(2).sin().pow(2) +
        ((rad_lat1.cos() * rad_lat2.cos()) * (rad_lng2 - rad_lng1).truediv(2).sin().pow(2))
    ).cast(pl.Float64)
    df = df.with_columns(
        # distance = 2 * R * asin(sqrt(haversin))
        (2 * R * haversin.sqrt().arcsin()).cast(pl.Float64).alias("haversine_centroid_distance")
    )
    return df
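
A usage example with the coordinate columns created in the question (assuming R is the mean Earth radius in kilometres, ≈ 6371):

df = compute_haversine_distance(
    df,
    R=np.float64(6371.0),  # mean Earth radius in km
    coordinates={
        "pickup_points": "pickup_coordinates",
        "dropoff_points": "dropoff_coordinates",
    },
)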

However, I get some differences in the final results compared to this calculator here. Even though my formula is the same as the one used by the calculator, I get slightly different results. For example, the first pair of points:

  • lat1: 0.7117528862292272 (radians)
  • lat2: 0.7115465664360616
  • lng1: -1.2907933590722993
  • lng2: -1.2909438559692195

gives a distance of 1.34 km based on my calculations, while the calculator computes a distance of 1.501 km (closer to geodesic).

Fixed the mismatching result: the formula needs cos(lat1)*cos(lat2), but I was mistakenly using cos(lng1)*cos(lng2), i.e. computing the cosine of the longitudes instead of the latitudes. The result now matches the calculator site I posted above.

Benchmark results:

  • 3 million rows with map_rows() ~ 3 minutes
  • 3 million rows with the Polars API ~ 0.05 seconds
