wherobots
diff --git a/‎Analyzing_Data/GPS_Map_Matching.ipynb‎
Lines changed: 74 additions & 194 deletions b/‎Analyzing_Data/GPS_Map_Matching.ipynb‎
Lines changed: 74 additions & 194 deletions
@@ -5,25 +5,13 @@
    "id": "a6418743-3dd6-48e4-83ab-b1bb0c0689d0",
    "metadata": {},
    "source": [
-    "![](https://wherobots.com/wp-content/uploads/2023/12/Inline-Blue_Black_onWhite@3x.png)\n",
+    "![Wherobots logo](../assets/img/header-logo.png)\n",
     "\n",
-    "# WherobotsAI Map Matching Example\n",
+    "# WherobotsAI Map Matching\n",
     "\n",
-    "In this notebook we introduce Wherobots Map Matching, a library for creating map applications with large scale geospatial data, and explore the task of matching noisy GPS trajectory data to underlying road segments using OpenStreetMap road network data. [Read more about Wherobots Map Matching in the Wherobots documentation.](https://docs.wherobots.com/latest/tutorials/sedonamaps/introduction/)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "58006583",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import json\n",
-    "from shapely.geometry import LineString\n",
-    "from pyspark.sql.window import Window\n",
-    "from pyspark.sql.functions import col, expr, udf, collect_list, struct, row_number, lit\n",
-    "from sedona.spark import *"
+    "In this notebook we introduce Wherobots Map Matching, a library for creating map applications with large scale geospatial data. Map matching is a crucial step in many transportation analyses, aligning a sequence of observed user positions (usually from GPS) onto a digital map. This identifies the most likely sequence of roads that a vehicle has traversed. \n",
+    "\n",
+    "We will explore matching noisy GPS trajectory data to road segments using OpenStreetMap (OSM) road network data. [Read more about Wherobots Map Matching in the Wherobots documentation.](https://docs.wherobots.com/latest/tutorials/wherobotsai/map-matching/map_matching/)\n"
    ]
   },
   {
@@ -41,31 +29,24 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "import json\n",
+    "from shapely.geometry import LineString\n",
+    "from pyspark.sql.window import Window\n",
+    "from pyspark.sql.functions import col, expr, udf, collect_list, struct, row_number, lit\n",
+    "from sedona.spark import *\n",
+    "\n",
     "config = SedonaContext.builder().getOrCreate()\n",
     "sedona = SedonaContext.create(config)"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "id": "b6305680",
-   "metadata": {},
-   "source": [
-    "## Map Matching\n",
-    "Map matching is a crucial step in many transportation analyses. It involves aligning a sequence of observed user positions (usually from GPS) onto a digital map, identifying the most likely path or sequence of roads that a user has traversed. \n",
-    "\n",
-    "In this section, we will use Wherobots Map Matching for our map matching tasks."
-   ]
-  },
   {
    "cell_type": "markdown",
    "id": "da17a10b",
    "metadata": {},
    "source": [
-    "### Load Ann Arbor, Michigan Road Network Data from OSM File into Spatial Dataframe\n",
-    "We are utilizing the OpenStreetMap (OSM) data specific to the Ann Arbor, Michigan region to provide the foundational road network for our analysis. OpenStreetMap offers detailed and open-sourced road network data, making it a prime choice for transportation studies.\n",
+    "## Load OpenStreetMap road data\n",
     "\n",
-    "The step load_OSM is executed only once to load this road network data. Given the granularity and detail of OSM datasets, this process might take some time.\n",
-    "<br><br>"
+    "We will call `load_osm` for the road network we will match against. Whereobots Map Matcher uses OSM's XML file format to load detailed, open source road network data. We've got a sample dataset for the Ann Arbor, Michigan area that we will use. The `[car]` parameter tells `matcher` to filter anything out of the network that is not big enough for motor vehicle traffic."
    ]
   },
   {
@@ -76,22 +57,30 @@
    "outputs": [],
    "source": [
     "from wherobots import matcher\n",
-    "dfEdge = matcher.load_osm(\"s3://wherobots-examples/data/osm_AnnArbor_large.xml\", \"[car]\")\n",
-    "dfEdge.show(5)"
+    "\n",
+    "roads_df = matcher.load_osm(\"s3://wherobots-examples/data/osm_AnnArbor_large.xml\", \"[car]\")\n",
+    "\n",
+    "roads_df.show(10)"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "2d293de7",
    "metadata": {},
    "source": [
-    "### Load GPS Tracks Data from VED Dataset\n",
-    "For this analysis, we're leveraging the Vehicle Energy Dataset (VED). VED is a comprehensive dataset capturing GPS trajectories of 383 vehicles (including gasoline vehicles, HEVs, and PHEV/EVs) in Ann Arbor, Michigan, USA, from Nov 2017 to Nov 2018. The data spans ~374,000 miles and includes details on fuel, energy, speed, and auxiliary power usage. Driving scenarios cover diverse conditions, from highways to traffic-dense downtown areas, across different seasons.\n",
+    "### Load sample GPS tracking data from VED\n",
+    "\n",
+    "For this analysis, we're leveraging the [Vehicle Energy Dataset (VED)](https://github.com/gsoh/VED). VED is a comprehensive dataset capturing one year of GPS trajectories for 383 vehicles (including gasoline vehicles, HEVs, and PHEV/EVs) in the Ann Arbor area. The data spans about 374,000 miles/600,000 km and includes details on fuel, energy, speed, and auxiliary power usage. Driving scenarios cover diverse conditions, from highways to traffic-dense downtown areas and across four seasons.\n",
     "\n",
-    "Source: \"Vehicle Energy Dataset (VED), A Large-scale Dataset for Vehicle Energy Consumption Research\" by Geunseob (GS) Oh, David J. LeBlanc, Huei Peng. Published in IEEE Transactions on Intelligent Transportation Systems (T-ITS), 2020.\n",
+    "> Source: \"Vehicle Energy Dataset (VED), A Large-scale Dataset for Vehicle Energy Consumption Research\" by Geunseob (GS) Oh, David J. LeBlanc, Huei Peng. Published in IEEE Transactions on Intelligent Transportation Systems (T-ITS), 2020.\n",
     "\n",
-    "GitHub: https://github.com/gsoh/VED\n",
-    "<br><br>"
+    "Each row in the dataset represents a spatial-temporal point of one vehicle's journey. We are going to use these five columns:\n",
+    "\n",
+    "- VehId — Vehicle ID\n",
+    "- Trip — Trip ID; unique per vehicle\n",
+    "- Timestamp(ms)\n",
+    "- Latitude[deg]\n",
+    "- Longitude[deg]"
    ]
   },
   {
@@ -101,63 +90,26 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "df = sedona.read.csv(\"s3://wherobots-examples/data/VED_171101_week.csv\", header=True, inferSchema=True)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "c935a3e9-7989-4d49-9edb-d2b521c5ab62",
-   "metadata": {},
-   "source": [
-    "<br>For the purpose of this analysis, we are specifically extracting the columns representing the vehicle id, trip id, timestamp, latitude, and longitude. Each row in the dataset represents a spatial-temporal point of a vehicle's journey, with columns detailing:\n",
+    "gps_tracks_df = sedona.read.csv(\"s3://wherobots-examples/data/VED_171101_week.csv\", header=True, inferSchema=True)\n",
+    "gps_tracks_df = gps_tracks_df.select(['VehId', 'Trip', 'Timestamp(ms)','Latitude[deg]', 'Longitude[deg]'])\n",
     "\n",
-    "**VehId**: Vehicle Identifier.<br>\n",
-    "**Trip**: Trip Identifier for a vehicle. It helps distinguish between different journeys of the same vehicle.<br>\n",
-    "**Timestamp(ms)**: Timestamp of the data point, typically represented in milliseconds.<br>\n",
-    "**Latitude[deg]**: Latitude coordinate of the vehicle at the given timestamp.<br>\n",
-    "**Longitude[deg]**: Longitude coordinate of the vehicle at the given timestamp.\n",
-    "<br><br>"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "2f60a128-e3b4-4c3b-8fff-3283f3e375b2",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "df = df.select(['VehId', 'Trip', 'Timestamp(ms)','Latitude[deg]', 'Longitude[deg]'])"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "07e545a6-7ab7-462a-a1c6-834a74029014",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "df.show(10)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "742540f5-e5f6-41b8-bf02-04fc90b108bc",
-   "metadata": {},
-   "source": [
-    "<br>The combination of VehId and Trip together form a unique key for our dataset. This combination allows us to isolate individual vehicle trajectories. Every unique pair signifies a specific trajectory of a vehicle. Raw GPS points, while valuable, can be scattered, redundant, and lack context when viewed independently. By organizing these individual points into coherent trajectories represented by Linestrings, we enhance our ability to interpret, analyze, and apply the data in meaningful ways."
+    "gps_tracks_df.show(10)"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "1b515bc7-0a40-4914-a647-b3645718a560",
    "metadata": {},
    "source": [
-    "### Create LineString Geometries from GPS tracks\n",
+    "## Aggregate GPS points into LineString geometries\n",
+    "\n",
+    "The combination of VehId and Trip together form a unique key for our dataset. This combination allows us to isolate individual vehicle trajectories. Every unique pair signifies a specific trajectory of a vehicle. Raw GPS points, while valuable, can be scattered, redundant, and lack context when viewed independently. By organizing these individual points into coherent trajectories represented by LineString geometries, we enhance our ability to interpret, analyze, and apply the data in meaningful ways.\n",
     "\n",
-    "A groupBy operation is performed on 'VehId' and 'Trip' columns to isolate individual trajectories. The resulting LineString essentially captures the responding vehicle's trajectory over time. The rows are first sorted by their timestamps to ensure the LineString follows the chronological order of the GPS data points.\n",
+    "A `groupBy` operation on 'VehId' and 'Trip' isolates each trip, a LineString representing the vehicle's course. We sort the rows by timestamps so the LineString follows the correct order of the GPS data points.\n",
     "\n",
-    "A User Defined Function (UDF) is created for Spark that utilizes the function below to process Spatial DataFrame rows into LineString geometries.\n",
-    "<br><br>"
+    "We'll write a `rows_to_linestring` function for Spark to process Sedona DataFrame rows into LineString geometries, then collect them in a new DataFrame, `trips_df`.\n",
+    "\n",
+    "Finally, we'll give each trip a unique ID using `row_number`."
    ]
   },
   {
@@ -173,53 +125,35 @@
     "    linestring = LineString(coords)\n",
     "    return linestring\n",
     "\n",
-    "linestring_udf = udf(rows_to_linestring, GeometryType())"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "5207571f-9928-4645-b96c-f2aee95ad7f4",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Group by VehId and Trip and aggregate\n",
-    "dfPath = (df\n",
-    "          .groupBy(\"VehId\", \"Trip\")\n",
-    "          .agg(collect_list(struct(\"Timestamp(ms)\", \"Latitude[deg]\", \"Longitude[deg]\")).alias(\"coords\"))\n",
-    "          .withColumn(\"geometry\", linestring_udf(\"coords\"))\n",
-    "         )"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "b12b3607-ba5b-468b-9b64-5d21ef3bdbcc",
-   "metadata": {},
-   "source": [
-    "### Create a Spatial DataFrame of GPS Tracks"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "452ac443-2c60-4497-a892-74cecf7ef3a1",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Using row_number to generate unique IDs\n",
-    "window_spec = Window.partitionBy(lit(5)).orderBy(\"VehId\", \"Trip\")  # Ordering by existing columns to provide some deterministic order\n",
-    "dfPath = dfPath.withColumn(\"ids\", row_number().over(window_spec) - 1)\n",
-    "dfPath = dfPath.filter(dfPath['ids'] < 10)\n",
-    "dfPath = dfPath.select(\"ids\", \"VehId\", \"Trip\", \"coords\", \"geometry\")\n",
-    "dfPath.show()"
+    "linestring_udf = udf(rows_to_linestring, GeometryType())\n",
+    "\n",
+    "trips_df = (gps_tracks_df\n",
+    "            .groupBy(\"VehId\", \"Trip\")\n",
+    "            .agg(collect_list(struct(\"Timestamp(ms)\", \"Latitude[deg]\", \"Longitude[deg]\")).alias(\"coords\"))\n",
+    "            .withColumn(\"geometry\", linestring_udf(\"coords\"))\n",
+    "           )\n",
+    "\n",
+    "window_spec = Window.partitionBy(lit(5)).orderBy(\"VehId\", \"Trip\")\n",
+    "trips_df = trips_df.withColumn(\"ids\", row_number().over(window_spec) - 1)\n",
+    "trips_df = trips_df.filter(trips_df['ids'] < 100) # Filter to 100 trips because this is an example notebook; no need to be exhaustive\n",
+    "trips_df = trips_df.select(\"ids\", \"VehId\", \"Trip\", \"coords\", \"geometry\")\n",
+    "\n",
+    "trips_df.show()"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "9176442c-3e34-4ca1-97b7-229c95a87b59",
    "metadata": {},
    "source": [
-    "## Perform Map Matching"
+    "## Perform Map Matching\n",
+    "\n",
+    "Finally, we will pass the road network and the aggregated trips into `matcher`, and tell it the name of the relevant columns (`geometry` in both tables).\n",
+    "\n",
+    "- **ids**: A unique identifier for each trajectory, representing a distinct vehicle journey.\n",
+    "- **observed_points**: Represents the original GPS trajectories. These are the linestrings formed from the raw GPS points collected during each vehicle journey.\n",
+    "- **matched_points**: The processed trajectories post map-matching. These linestrings are aligned onto the actual road network, correcting for any GPS inaccuracies.\n",
+    "- **matched_nodes**: A list of node identifiers from the road network that the matched trajectory passes through. These nodes correspond to intersections, turns, or other significant points in the road network."
    ]
   },
   {
@@ -233,93 +167,39 @@
     "sedona.conf.set(\"wherobots.tools.mm.maxdistinit\", \"100\")\n",
     "sedona.conf.set(\"wherobots.tools.mm.obsnoise\", \"40\")\n",
     "\n",
-    "dfMmResult = matcher.match(dfEdge, dfPath, \"geometry\", \"geometry\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "a6c65c9b-b15a-40bd-a9bb-f59c41a2ad9d",
-   "metadata": {},
-   "source": [
-    "<br>The dataframe showcases the results of a map matching process on GPS trajectories:\n",
+    "matched_routes_df = matcher.match(roads_df, trips_df, \"geometry\", \"geometry\")\n",
     "\n",
-    "**ids**: A unique identifier for each trajectory, representing a distinct vehicle journey.<br>\n",
-    "**observed_points**: Represents the original GPS trajectories. These are the linestrings formed from the raw GPS points collected during each vehicle journey.<br>\n",
-    "**matched_points**: The processed trajectories post map-matching. These linestrings are aligned onto the actual road network, correcting for any GPS inaccuracies.<br>\n",
-    "**matched_nodes**: A list of node identifiers from the road network that the matched trajectory passes through. These nodes correspond to intersections, turns, or other significant points in the road network.\n",
-    "<br><br>"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "8707a991-d327-4c90-88e3-7e4df3e46128",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "dfMmResult.show()"
+    "matched_routes_df.show()"
    ]
   },
   {
    "cell_type": "markdown",
    "id": "2f54472c-d309-4e95-ae8a-44a5a61a0e5b",
    "metadata": {},
    "source": [
-    "## Visualize the result using SedonaKepler"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "6440a667-09b0-41c0-b53d-d44cfa0b8306",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "with open('assets/conf/map_config.json', 'r') as file:\n",
-    "    map_config = json.load(file)"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "edb95cfd-4dcb-4e94-bb8e-b6dfcc198adc",
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "mapAll = SedonaKepler.create_map()\n",
+    "## Visualize the result using SedonaKepler\n",
     "\n",
-    "SedonaKepler.add_df(mapAll, dfEdge, name=\"Road Network\")\n",
-    "SedonaKepler.add_df(mapAll, dfMmResult.selectExpr(\"observed_points AS geometry\"), name=\"Observed Points\")\n",
-    "SedonaKepler.add_df(mapAll, dfMmResult.selectExpr(\"matched_points AS geometry\"), name=\"Matched Points\")\n",
-    "mapAll.config = map_config\n",
-    "\n",
-    "mapAll"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "adf3d9e6-40fd-4d51-912e-33babe98f90e",
-   "metadata": {},
-   "source": [
-    "<br>In this visualization, we are focusing on displaying the data corresponding to 'id' value 2. To visualize data for a different 'id' value, simply change the filter condition to the desired 'id' value.\n",
-    "<br><br>"
+    "The `map_config.json` file specifies the bounding box and how to draw the road network and the source and matched routes."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "49d05638-b35e-4079-83ad-ba433b7d5fc9",
+   "id": "edb95cfd-4dcb-4e94-bb8e-b6dfcc198adc",
    "metadata": {},
    "outputs": [],
    "source": [
-    "mapFil = SedonaKepler.create_map()\n",
+    "with open('assets/conf/map_config.json', 'r') as file:\n",
+    "    map_config = json.load(file)\n",
+    "    \n",
+    "viz = SedonaKepler.create_map()\n",
     "\n",
-    "SedonaKepler.add_df(mapFil, dfEdge, name=\"Road Network\")\n",
-    "SedonaKepler.add_df(mapFil, dfMmResult.filter(dfMmResult['ids']==2).selectExpr(\"observed_points AS geometry\"), name=\"Observed Points\")\n",
-    "SedonaKepler.add_df(mapFil, dfMmResult.filter(dfMmResult['ids']==2).selectExpr(\"matched_points AS geometry\"), name=\"Matched Points\")\n",
-    "mapFil.config = map_config\n",
+    "SedonaKepler.add_df(viz, roads_df.select(\"geometry\"), name=\"Road Network\")\n",
+    "SedonaKepler.add_df(viz, matched_routes_df.selectExpr(\"observed_points AS geometry\", \"ids AS trip_id\"), name=\"Observed Points\")\n",
+    "SedonaKepler.add_df(viz, matched_routes_df.selectExpr(\"matched_points AS geometry\", \"ids AS trip_id\"), name=\"Matched Points\")\n",
+    "viz.config = map_config\n",
     "\n",
-    "mapFil"
+    "viz"
    ]
   }
  ],