pypsa-eur/scripts/clean_osm_data.py

# -*- coding: utf-8 -*-
# SPDX-FileCopyrightText: : 2020-2024 The PyPSA-Eur Authors
#
# SPDX-License-Identifier: MIT
"""
This script is used to clean OpenStreetMap (OSM) data for creating a PyPSA-Eur
ready network.
The script performs various cleaning operations on the OSM data, including:
- Cleaning voltage, circuits, cables, wires, and frequency columns
- Splitting semicolon-separated cells into new rows
- Distributing values to circuits based on the number of splits
- Adding line endings to substations based on line data
"""
import json
import logging
import os
import re
import geopandas as gpd
import numpy as np
import pandas as pd
from _helpers import configure_logging, set_scenario_config
from shapely.geometry import LineString, MultiLineString, Point, Polygon
from shapely.ops import linemerge, unary_union
logger = logging.getLogger(__name__)
def _create_linestring(row):
"""
Create a LineString object from the given row.
Args:
row (dict): A dictionary containing the row data.
Returns:
LineString: A LineString object representing the geometry.
"""
coords = [(coord["lon"], coord["lat"]) for coord in row["geometry"]]
return LineString(coords)
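# Illustrative usage (documentation sketch, not executed by the pipeline; the
# input mirrors the Overpass JSON geometry format of lon/lat dictionaries):
#   _create_linestring({"geometry": [{"lon": 7.0, "lat": 51.0},
#                                    {"lon": 7.1, "lat": 51.1}]})
#   # -> LINESTRING (7 51, 7.1 51.1)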
def _create_polygon(row):
"""
Create a Shapely Polygon from a list of coordinate dictionaries.
    Parameters:
        row (dict): A dictionary containing the row data, with row["geometry"]
            holding a list of dictionaries with 'lat' and 'lon' keys.
Returns:
shapely.geometry.Polygon: The constructed polygon object.
"""
# Extract coordinates as tuples
point_coords = [(coord["lon"], coord["lat"]) for coord in row["geometry"]]
# Ensure closure by repeating the first coordinate as the last coordinate
if point_coords[0] != point_coords[-1]:
point_coords.append(point_coords[0])
# Create Polygon object
polygon = Polygon(point_coords)
return polygon
def _find_closest_polygon(gdf, point):
"""
Find the closest polygon in a GeoDataFrame to a given point.
Parameters:
gdf (GeoDataFrame): A GeoDataFrame containing polygons.
point (Point): A Point object representing the target point.
Returns:
int: The index of the closest polygon in the GeoDataFrame.
"""
# Compute the distance to each polygon
gdf["distance"] = gdf["geometry"].apply(lambda geom: point.distance(geom))
# Find the index of the closest polygon
closest_idx = gdf["distance"].idxmin()
    return closest_idx
def _clean_voltage(column):
"""
Function to clean the raw voltage column: manual fixing and drop nan values
Args:
- column: pandas Series, the column to be cleaned
Returns:
- column: pandas Series, the cleaned column
"""
logger.info("Cleaning voltages.")
column = column.copy()
    column = (
        column.astype(str)
        .str.lower()
        # Patterns must be lowercase here, since .str.lower() is applied first
        .str.replace("400/220/110 kv'", "400000;220000;110000")
        .str.replace("400/220/110/20_kv", "400000;220000;110000;20000")
        .str.replace("2x25000", "25000;25000")
        .str.replace("é", ";")
    )
    column = (
        column.astype(str)
        .str.lower()
        # Drop NA placeholders first, before "<" and other characters are rewritten
        .str.replace("<na>", "")
        .str.replace("nan", "")
        .str.replace("(temp 150000)", "")
        .str.replace("low", "1000")
        .str.replace("minor", "1000")
        .str.replace("medium", "33000")
        .str.replace("med", "33000")
        .str.replace("m", "33000")
        .str.replace("high", "150000")
        .str.replace("23000-109000", "109000")
        .str.replace("380000>220000", "380000;220000")
        .str.replace(":", ";")
        .str.replace("<", ";")
        .str.replace(",", ";")
        # "kva" must be replaced before "kv"; otherwise it could never match
        .str.replace("kva", "000")
        .str.replace("kv", "000")
        .str.replace("/", ";")
    )
# Remove all remaining non-numeric characters except for semicolons
column = column.apply(lambda x: re.sub(r"[^0-9;]", "", str(x)))
column.dropna(inplace=True)
return column
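# Illustrative mapping of raw OSM voltage tags to cleaned values
# (documentation sketch, not executed by the pipeline):
#   _clean_voltage(pd.Series(["380 kV", "medium", "110000;220000"]))
#   # -> "380000", "33000", "110000;220000"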
def _clean_circuits(column):
"""
Function to clean the raw circuits column: manual fixing and drop nan
values
Args:
- column: pandas Series, the column to be cleaned
Returns:
- column: pandas Series, the cleaned column
"""
logger.info("Cleaning circuits.")
column = column.copy()
column = (
column.astype(str)
.str.replace("partial", "")
.str.replace("1operator=RTE operator:wikidata=Q2178795", "")
.str.lower()
.str.replace("1,5", "3")
.str.replace("1/3", "1")
.str.replace("<na>", "")
.str.replace("nan", "")
)
# Remove all remaining non-numeric characters except for semicolons
column = column.apply(lambda x: re.sub(r"[^0-9;]", "", x))
column.dropna(inplace=True)
return column.astype(str)
def _clean_cables(column):
"""
Function to clean the raw cables column: manual fixing and drop nan values
Args:
- column: pandas Series, the column to be cleaned
Returns:
- column: pandas Series, the cleaned column
"""
logger.info("Cleaning cables.")
column = column.copy()
column = (
column.astype(str)
.str.lower()
.str.replace("1/3", "1")
.str.replace("3x2;2", "3")
.str.replace("<na>", "")
.str.replace("nan", "")
)
# Remove all remaining non-numeric characters except for semicolons
column = column.apply(lambda x: re.sub(r"[^0-9;]", "", x))
column.dropna(inplace=True)
return column.astype(str)
def _clean_wires(column):
"""
Function to clean the raw wires column: manual fixing and drop nan values
Args:
- column: pandas Series, the column to be cleaned
Returns:
- column: pandas Series, the cleaned column
"""
logger.info("Cleaning wires.")
column = column.copy()
column = (
column.astype(str)
.str.lower()
.str.replace("?", "")
.str.replace("trzyprzewodowe", "3")
.str.replace("pojedyńcze", "1")
.str.replace("single", "1")
.str.replace("double", "2")
.str.replace("triple", "3")
.str.replace("quad", "4")
.str.replace("fivefold", "5")
.str.replace("yes", "3")
.str.replace("1/3", "1")
.str.replace("3x2;2", "3")
.str.replace("_", "")
.str.replace("<na>", "")
.str.replace("nan", "")
)
# Remove all remaining non-numeric characters except for semicolons
column = column.apply(lambda x: re.sub(r"[^0-9;]", "", x))
column.dropna(inplace=True)
return column.astype(str)
def _check_voltage(voltage, list_voltages):
"""
Check if the given voltage is present in the list of allowed voltages.
Parameters:
voltage (str): The voltage to check.
list_voltages (list): A list of allowed voltages.
Returns:
bool: True if the voltage is present in the list of allowed voltages,
False otherwise.
"""
voltages = voltage.split(";")
for v in voltages:
if v in list_voltages:
return True
return False
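# Illustrative usage (documentation sketch, not executed by the pipeline):
#   _check_voltage("110000;380000", ["220000", "380000"])  # -> True
#   _check_voltage("110000", ["220000", "380000"])         # -> False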
def _clean_frequency(column):
"""
Function to clean the raw frequency column: manual fixing and drop nan
values
Args:
- column: pandas Series, the column to be cleaned
Returns:
- column: pandas Series, the cleaned column
"""
logger.info("Cleaning frequencies.")
column = column.copy()
column = (
column.astype(str)
.str.lower()
.str.replace("16.67", "16.7")
.str.replace("16,7", "16.7")
.str.replace("?", "")
.str.replace("hz", "")
.str.replace(" ", "")
.str.replace("<NA>", "")
.str.replace("nan", "")
)
# Remove all remaining non-numeric characters except for semicolons
column = column.apply(lambda x: re.sub(r"[^0-9;.]", "", x))
column.dropna(inplace=True)
return column.astype(str)
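# Illustrative mapping of raw OSM frequency tags (documentation sketch, not
# executed by the pipeline):
#   _clean_frequency(pd.Series(["50 Hz", "16,7", "0"]))
#   # -> "50", "16.7", "0"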
def _clean_rating(column):
"""
Function to clean and sum the rating columns:
Args:
- column: pandas Series, the column to be cleaned
Returns:
- column: pandas Series, the cleaned column
"""
logger.info("Cleaning ratings.")
column = column.copy()
column = column.astype(str).str.replace("MW", "")
# Remove all remaining non-numeric characters except for semicolons
column = column.apply(lambda x: re.sub(r"[^0-9;]", "", x))
    # Sum up all ratings if there are multiple entries; skip empty fragments
    # left over from the cleaning above, which would break int()
    column = column.str.split(";").apply(
        lambda x: sum([int(i) for i in x if i.isdigit()])
    )
column.dropna(inplace=True)
return column.astype(str)
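# Illustrative usage (documentation sketch, not executed by the pipeline;
# multiple semicolon-separated ratings are summed):
#   _clean_rating(pd.Series(["600 MW", "350;350"]))
#   # -> "600", "700"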
def _split_cells(df, cols=["voltage"]):
"""
Split semicolon separated cells i.e. [66000;220000] and create new
identical rows.
Parameters
----------
df : dataframe
Dataframe under analysis
cols : list
List of target columns over which to perform the analysis
Example
-------
Original data:
row 1: '66000;220000', '50'
After applying split_cells():
row 1, '66000', '50', 2
row 2, '220000', '50', 2
"""
if df.empty:
return df
    # Create a dictionary to store the suffix count for each original ID
    suffix_counts = {}
# Split cells and create new rows
x = df.assign(**{col: df[col].str.split(";") for col in cols})
x = x.explode(cols, ignore_index=True)
# Count the number of splits associated with each original ID
num_splits = x.groupby("id").size().to_dict()
# Update the 'split_elements' column
x["split_elements"] = x["id"].map(num_splits)
# Function to generate the new ID with suffix and update the number of
# splits
def generate_new_id(row):
original_id = row["id"]
if row["split_elements"] == 1:
return original_id
else:
suffix_counts[original_id] = suffix_counts.get(original_id, 0) + 1
return f"{original_id}-{suffix_counts[original_id]}"
# Update the ID column with the new IDs
x["id"] = x.apply(generate_new_id, axis=1)
return x
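# Illustrative behaviour (documentation sketch, not executed by the pipeline):
#   df = pd.DataFrame({"id": ["way/1"], "voltage": ["110000;220000"]})
#   _split_cells(df)
#   # -> rows ("way/1-1", "110000") and ("way/1-2", "220000"),
#   #    each with split_elements == 2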
def _distribute_to_circuits(row):
"""
Distributes the number of circuits or cables to individual circuits based
on the given row data.
Parameters:
- row: A dictionary representing a row of data containing information about
circuits and cables.
    Returns:
    - single_circuit (str): The number of circuits assigned to each individual
      split element.
"""
if row["circuits"] != "":
circuits = int(row["circuits"])
else:
cables = int(row["cables"])
circuits = cables / 3
single_circuit = int(max(1, np.floor_divide(circuits, row["split_elements"])))
single_circuit = str(single_circuit)
return single_circuit
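# Illustrative usage (documentation sketch, not executed by the pipeline):
#   _distribute_to_circuits({"circuits": "4", "cables": "", "split_elements": 2})
#   # -> "2"
#   _distribute_to_circuits({"circuits": "", "cables": "6", "split_elements": 2})
#   # -> "1"  (6 cables / 3 = 2 circuits, floor-divided over 2 split elements)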
def _add_line_endings_to_substations(
df_substations,
gdf_lines,
path_country_shapes,
path_offshore_shapes,
prefix,
):
"""
Add line endings to substations.
This function takes two pandas DataFrames, `substations` and `lines`, and
adds line endings to the substations based on the information from the
lines DataFrame.
    Parameters:
    - df_substations (pandas DataFrame): DataFrame containing information
      about substations.
    - gdf_lines (GeoDataFrame): GeoDataFrame containing information about
      lines (or links).
    - path_country_shapes (str): Path to the country shapes file.
    - path_offshore_shapes (str): Path to the offshore shapes file.
    - prefix (str): Prefix for the bus IDs of the added line endings.
Returns:
- buses (pandas DataFrame): DataFrame containing the updated information
about substations with line endings.
"""
if gdf_lines.empty:
return df_substations
logger.info("Adding line endings to substations")
# extract columns from df_substations
bus_s = pd.DataFrame(columns=df_substations.columns)
bus_e = pd.DataFrame(columns=df_substations.columns)
# TODO pypsa-eur: fix country code to contain single country code
# Read information from gdf_lines
bus_s[["voltage", "country"]] = gdf_lines[["voltage", "country"]]
bus_s.loc[:, "geometry"] = gdf_lines.geometry.boundary.map(
lambda p: p.geoms[0] if len(p.geoms) >= 2 else None
)
bus_s.loc[:, "lon"] = bus_s["geometry"].map(lambda p: p.x if p != None else None)
bus_s.loc[:, "lat"] = bus_s["geometry"].map(lambda p: p.y if p != None else None)
bus_s.loc[:, "dc"] = gdf_lines["dc"]
bus_e[["voltage", "country"]] = gdf_lines[["voltage", "country"]]
bus_e.loc[:, "geometry"] = gdf_lines.geometry.boundary.map(
lambda p: p.geoms[1] if len(p.geoms) >= 2 else None
)
bus_e.loc[:, "lon"] = bus_e["geometry"].map(lambda p: p.x if p != None else None)
bus_e.loc[:, "lat"] = bus_e["geometry"].map(lambda p: p.y if p != None else None)
bus_e.loc[:, "dc"] = gdf_lines["dc"]
bus_all = pd.concat([bus_s, bus_e], ignore_index=True)
    # Group bus_all by voltage and geometry (dropping duplicates)
bus_all = bus_all.groupby(["voltage", "lon", "lat", "dc"]).first().reset_index()
bus_all = bus_all[df_substations.columns]
bus_all.loc[:, "bus_id"] = bus_all.apply(
lambda row: f"{prefix}/{row.name + 1}", axis=1
)
# Initialize default values
bus_all["station_id"] = None
    # Assume substations of existing lines are fully built (not under construction)
bus_all["under_construction"] = False
bus_all["tag_area"] = None
bus_all["symbol"] = "substation"
# TODO: this tag may be improved, maybe depending on voltage levels
bus_all["tag_substation"] = "transmission"
bus_all["tag_source"] = prefix
buses = pd.concat([df_substations, bus_all], ignore_index=True)
buses.set_index("bus_id", inplace=True)
# Fix country codes
# TODO pypsa-eur: Temporary solution as long as the shapes have a low,
# incomplete resolution (cf. 2500 meters for buffering)
bool_multiple_countries = buses["country"].str.contains(";")
gdf_offshore = gpd.read_file(path_offshore_shapes).set_index("name")["geometry"]
gdf_offshore = gpd.GeoDataFrame(
gdf_offshore, geometry=gdf_offshore, crs=gdf_offshore.crs
)
gdf_countries = gpd.read_file(path_country_shapes).set_index("name")["geometry"]
    # Rebuild GeoDataFrame to ensure a proper geometry column with CRS
gdf_countries = gpd.GeoDataFrame(geometry=gdf_countries, crs=gdf_countries.crs)
gdf_union = gdf_countries.merge(
gdf_offshore, how="outer", left_index=True, right_index=True
)
gdf_union["geometry"] = gdf_union.apply(
lambda row: gpd.GeoSeries([row["geometry_x"], row["geometry_y"]]).union_all(),
axis=1,
)
gdf_union = gpd.GeoDataFrame(geometry=gdf_union["geometry"], crs=crs)
gdf_buses_tofix = gpd.GeoDataFrame(
buses[bool_multiple_countries], geometry="geometry", crs=crs
)
joined = gpd.sjoin(
gdf_buses_tofix, gdf_union.reset_index(), how="left", predicate="within"
)
    # For all remaining rows where the country/index_right column is NaN,
    # find the closest polygon index
joined.loc[joined["name"].isna(), "name"] = joined.loc[
joined["name"].isna(), "geometry"
].apply(lambda x: _find_closest_polygon(gdf_union, x))
joined.reset_index(inplace=True)
joined = joined.drop_duplicates(subset="bus_id")
joined.set_index("bus_id", inplace=True)
buses.loc[bool_multiple_countries, "country"] = joined.loc[
bool_multiple_countries, "name"
]
return buses.reset_index()
def _import_lines_and_cables(path_lines):
"""
Import lines and cables from the given input paths.
Parameters:
- path_lines (dict): A dictionary containing the input paths for lines and
cables data.
Returns:
- df_lines (DataFrame): A DataFrame containing the imported lines and
cables data.
"""
columns = [
"id",
"bounds",
"nodes",
"geometry",
"country",
"power",
"cables",
"circuits",
"frequency",
"voltage",
"wires",
]
df_lines = pd.DataFrame(columns=columns)
logger.info("Importing lines and cables")
for key in path_lines:
logger.info(f"Processing {key}...")
for idx, ip in enumerate(path_lines[key]):
if (
os.path.exists(ip) and os.path.getsize(ip) > 400
): # unpopulated OSM json is about 51 bytes
country = os.path.basename(os.path.dirname(path_lines[key][idx]))
logger.info(
f" - Importing {key} {str(idx+1).zfill(2)}/{str(len(path_lines[key])).zfill(2)}: {ip}"
)
with open(ip, "r") as f:
data = json.load(f)
df = pd.DataFrame(data["elements"])
df["id"] = df["id"].astype(str)
df["country"] = country
col_tags = [
"power",
"cables",
"circuits",
"frequency",
"voltage",
"wires",
]
tags = pd.json_normalize(df["tags"]).map(
lambda x: str(x) if pd.notnull(x) else x
)
for ct in col_tags:
if ct not in tags.columns:
tags[ct] = pd.NA
tags = tags.loc[:, col_tags]
df = pd.concat([df, tags], axis="columns")
df.drop(columns=["type", "tags"], inplace=True)
df_lines = pd.concat([df_lines, df], axis="rows")
else:
logger.info(
f" - Skipping {key} {str(idx+1).zfill(2)}/{str(len(path_lines[key])).zfill(2)} (empty): {ip}"
)
continue
logger.info("---")
return df_lines
def _import_links(path_links):
"""
Import links from the given input paths.
Parameters:
- path_links (dict): A dictionary containing the input paths for links.
Returns:
- df_links (DataFrame): A DataFrame containing the imported links data.
"""
columns = [
"id",
"bounds",
"nodes",
"geometry",
"country",
"circuits",
"frequency",
"rating",
"voltage",
]
df_links = pd.DataFrame(columns=columns)
logger.info("Importing links")
for key in path_links:
logger.info(f"Processing {key}...")
for idx, ip in enumerate(path_links[key]):
if (
os.path.exists(ip) and os.path.getsize(ip) > 400
): # unpopulated OSM json is about 51 bytes
country = os.path.basename(os.path.dirname(path_links[key][idx]))
logger.info(
f" - Importing {key} {str(idx+1).zfill(2)}/{str(len(path_links[key])).zfill(2)}: {ip}"
)
with open(ip, "r") as f:
data = json.load(f)
df = pd.DataFrame(data["elements"])
df["id"] = df["id"].astype(str)
df["id"] = df["id"].apply(lambda x: (f"relation/{x}"))
df["country"] = country
col_tags = [
"circuits",
"frequency",
"rating",
"voltage",
]
tags = pd.json_normalize(df["tags"]).map(
lambda x: str(x) if pd.notnull(x) else x
)
for ct in col_tags:
if ct not in tags.columns:
tags[ct] = pd.NA
tags = tags.loc[:, col_tags]
df = pd.concat([df, tags], axis="columns")
df.drop(columns=["type", "tags"], inplace=True)
df_links = pd.concat([df_links, df], axis="rows")
else:
logger.info(
f" - Skipping {key} {str(idx+1).zfill(2)}/{str(len(path_links[key])).zfill(2)} (empty): {ip}"
)
continue
logger.info("---")
logger.info("Dropping lines without rating.")
len_before = len(df_links)
df_links = df_links.dropna(subset=["rating"])
len_after = len(df_links)
logger.info(
f"Dropped {len_before-len_after} elements without rating. "
+ f"Imported {len_after} elements."
)
return df_links
def _create_single_link(row):
"""
Create a single link from multiple rows within a OSM link relation.
Parameters:
- row: A row of OSM data containing information about the link.
Returns:
- single_link: A single LineString representing the link.
    This function takes a row of OSM data and extracts the relevant information
    to create a single link. It filters out elements (substations, electrodes)
    with invalid roles, deduplicates parallel segments that share the same
    (rounded) endpoints, and merges the remaining segments. If the merged
    result is a MultiLineString, the longest of its linestrings is returned.
"""
valid_roles = ["line", "cable"]
df = pd.json_normalize(row["members"])
df = df[df["role"].isin(valid_roles)]
df.loc[:, "geometry"] = df.apply(_create_linestring, axis=1)
df.loc[:, "length"] = df["geometry"].apply(lambda x: x.length)
    list_endpoints = []
    for _, member in df.iterrows():
        endpoint = sorted([member["geometry"].coords[0], member["geometry"].coords[-1]])
        # Round to 3 decimals so that parallel segments map to the same key
        endpoint = (
            round(endpoint[0][0], 3),
            round(endpoint[0][1], 3),
            round(endpoint[1][0], 3),
            round(endpoint[1][1], 3),
        )
        list_endpoints.append(endpoint)
    df.loc[:, "endpoints"] = list_endpoints
    # Keep one (the shortest) representative segment per endpoint pair
    df_unique = df.loc[df.groupby("endpoints")["length"].idxmin()]
    single_link = linemerge(df_unique["geometry"].values.tolist())
    # If the merged result is a MultiLineString, extract its longest linestring
if isinstance(single_link, MultiLineString):
# Find connected components
components = list(single_link.geoms)
# Find the longest connected linestring
single_link = max(components, key=lambda x: x.length)
return single_link
def _drop_duplicate_lines(df_lines):
"""
    Drop duplicate lines from the given dataframe. Duplicates are usually
    cross-border lines that appear in the datasets of both neighbouring
    countries, or lines slightly outside the country border of focus.
Parameters:
- df_lines (pandas.DataFrame): The dataframe containing lines data.
Returns:
- df_lines (pandas.DataFrame): The dataframe with duplicate lines removed
and cleaned data.
    This function drops duplicate lines from the given dataframe based on the
    'id' column. It groups the duplicate rows by 'id' and aggregates the
    'country' column into a semicolon-separated string, as duplicates appear
    in multiple country datasets. One representative row per duplicate group
    is kept. Finally, the updated dataframe without duplicates is returned.
"""
logger.info("Dropping duplicate lines.")
duplicate_rows = df_lines[df_lines.duplicated(subset=["id"], keep=False)].copy()
# Group rows by id and aggregate the country column to a string split by semicolon
grouped_duplicates = (
duplicate_rows.groupby("id")["country"].agg(lambda x: ";".join(x)).reset_index()
)
duplicate_rows.drop_duplicates(subset="id", inplace=True)
duplicate_rows.drop(columns=["country"], inplace=True)
duplicate_rows = duplicate_rows.join(
grouped_duplicates.set_index("id"), on="id", how="left"
)
len_before = len(df_lines)
# Drop duplicates and update the df_lines dataframe with the cleaned data
df_lines = df_lines[~df_lines["id"].isin(duplicate_rows["id"])]
df_lines = pd.concat([df_lines, duplicate_rows], axis="rows")
len_after = len(df_lines)
logger.info(
f"Dropped {len_before - len_after} duplicate elements. "
+ f"Keeping {len_after} elements."
)
return df_lines
def _filter_by_voltage(df, min_voltage=200000):
"""
Filter rows in the DataFrame based on the voltage in V.
Parameters:
- df (pandas.DataFrame): The DataFrame containing the substations or lines data.
- min_voltage (int, optional): The minimum voltage value to filter the
rows. Defaults to 200000 [unit: V].
Returns:
- filtered df (pandas.DataFrame): The filtered DataFrame containing
the lines or substations above min_voltage.
- list_voltages (list): A list of unique voltage values above min_voltage.
The type of the list elements is string.
"""
if df.empty:
return df, []
logger.info(
f"Filtering dataframe by voltage. Only keeping rows above and including {min_voltage} V."
)
list_voltages = df["voltage"].str.split(";").explode().unique().astype(str)
# Keep numeric strings
list_voltages = list_voltages[np.vectorize(str.isnumeric)(list_voltages)]
list_voltages = list_voltages.astype(int)
list_voltages = list_voltages[list_voltages >= int(min_voltage)]
list_voltages = list_voltages.astype(str)
bool_voltages = df["voltage"].apply(_check_voltage, list_voltages=list_voltages)
len_before = len(df)
df = df[bool_voltages]
len_after = len(df)
logger.info(
f"Dropped {len_before - len_after} elements with voltage below {min_voltage}. "
+ f"Keeping {len_after} elements."
)
return df, list_voltages
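# Illustrative behaviour (documentation sketch, not executed by the pipeline):
#   df = pd.DataFrame({"voltage": ["110000;380000", "110000"]})
#   _filter_by_voltage(df, min_voltage=200000)
#   # -> keeps only the first row (380000 >= 200000) and returns
#   #    list_voltages == ["380000"]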
def _clean_substations(df_substations, list_voltages):
"""
Clean the substation data by performing the following steps:
- Split cells in the dataframe.
- Filter substation data based on specified voltages.
- Update the frequency values based on the split count.
- Split cells in the 'frequency' column.
- Set remaining invalid frequency values that are not in ['0', '50']
to '50'.
Parameters:
- df_substations (pandas.DataFrame): The input dataframe containing
substation data.
- list_voltages (list): A list of voltages above min_voltage to filter the
substation data.
Returns:
- df_substations (pandas.DataFrame): The cleaned substation dataframe.
"""
df_substations = df_substations.copy()
df_substations = _split_cells(df_substations)
bool_voltages = df_substations["voltage"].apply(
_check_voltage, list_voltages=list_voltages
)
df_substations = df_substations[bool_voltages]
df_substations.loc[:, "split_count"] = df_substations["id"].apply(
lambda x: x.split("-")[1] if "-" in x else "0"
)
df_substations.loc[:, "split_count"] = df_substations["split_count"].astype(int)
bool_split = df_substations["split_elements"] > 1
bool_frequency_len = (
df_substations["frequency"].apply(lambda x: len(x.split(";")))
== df_substations["split_elements"]
)
op_freq = lambda row: row["frequency"].split(";")[row["split_count"] - 1]
df_substations.loc[bool_frequency_len & bool_split, "frequency"] = (
df_substations.loc[bool_frequency_len & bool_split,].apply(op_freq, axis=1)
)
df_substations = _split_cells(df_substations, cols=["frequency"])
bool_invalid_frequency = df_substations["frequency"].apply(
lambda x: x not in ["50", "0"]
)
df_substations.loc[bool_invalid_frequency, "frequency"] = "50"
return df_substations
def _clean_lines(df_lines, list_voltages):
"""
Cleans and processes the `df_lines` DataFrame heuristically based on the
information available per respective line and cable. Further checks to
ensure data consistency and completeness.
Parameters
----------
df_lines : pandas.DataFrame
The input DataFrame containing line information with columns such as
'voltage', 'circuits', 'frequency', 'cables', 'split_elements', 'id',
etc.
list_voltages : list
A list of unique voltage values above a certain threshold. (type: str)
Returns
-------
df_lines : pandas.DataFrame
The cleaned DataFrame with updated columns 'circuits', 'frequency', and
'cleaned' to reflect the applied transformations.
Description
-----------
This function performs the following operations:
- Initializes a 'cleaned' column with False, step-wise updates to True
following the respective cleaning step.
- Splits the voltage cells in the DataFrame at semicolons using a helper
function `_split_cells`.
- Filters the DataFrame to only include rows with valid voltages.
- Sets circuits of remaining lines without any applicable heuristic equal
to 1.
The function ensures that the resulting DataFrame has consistent and
complete information for further processing or analysis while maintaining
the data of the original OSM data set wherever possible.
"""
logger.info("Cleaning lines and determining circuits.")
# Initiate boolean with False, only set to true if all cleaning steps are
# passed
df_lines = df_lines.copy()
df_lines["cleaned"] = False
df_lines["voltage_original"] = df_lines["voltage"]
df_lines["circuits_original"] = df_lines["circuits"]
df_lines = _split_cells(df_lines)
bool_voltages = df_lines["voltage"].apply(
_check_voltage, list_voltages=list_voltages
)
df_lines = df_lines[bool_voltages]
bool_ac = df_lines["frequency"] != "0"
bool_dc = ~bool_ac
valid_frequency = ["50", "0"]
bool_invalid_frequency = df_lines["frequency"].apply(
lambda x: x not in valid_frequency
)
bool_noinfo = (df_lines["cables"] == "") & (df_lines["circuits"] == "")
# Fill in all values where cables info and circuits does not exist. Assuming 1 circuit
df_lines.loc[bool_noinfo, "circuits"] = "1"
df_lines.loc[bool_noinfo & bool_invalid_frequency, "frequency"] = "50"
df_lines.loc[bool_noinfo, "cleaned"] = True
# Fill in all values where cables info exists and split_elements == 1
bool_cables_ac = (
(df_lines["cables"] != "")
& (df_lines["split_elements"] == 1)
& (df_lines["cables"] != "0")
& (df_lines["cables"].apply(lambda x: len(x.split(";")) == 1))
& (df_lines["circuits"] == "")
& (df_lines["cleaned"] == False)
& bool_ac
)
df_lines.loc[bool_cables_ac, "circuits"] = df_lines.loc[
bool_cables_ac, "cables"
].apply(lambda x: str(int(max(1, np.floor_divide(int(x), 3)))))
df_lines.loc[bool_cables_ac, "frequency"] = "50"
df_lines.loc[bool_cables_ac, "cleaned"] = True
bool_cables_dc = (
(df_lines["cables"] != "")
& (df_lines["split_elements"] == 1)
& (df_lines["cables"] != "0")
& (df_lines["cables"].apply(lambda x: len(x.split(";")) == 1))
& (df_lines["circuits"] == "")
& (df_lines["cleaned"] == False)
& bool_dc
)
df_lines.loc[bool_cables_dc, "circuits"] = df_lines.loc[
bool_cables_dc, "cables"
].apply(lambda x: str(int(max(1, np.floor_divide(int(x), 2)))))
df_lines.loc[bool_cables_dc, "frequency"] = "0"
df_lines.loc[bool_cables_dc, "cleaned"] = True
# Fill in all values where circuits info exists and split_elements == 1
bool_lines = (
(df_lines["circuits"] != "")
& (df_lines["split_elements"] == 1)
& (df_lines["circuits"] != "0")
& (df_lines["circuits"].apply(lambda x: len(x.split(";")) == 1))
& (df_lines["cleaned"] == False)
)
df_lines.loc[bool_lines & bool_ac, "frequency"] = "50"
df_lines.loc[bool_lines & bool_dc, "frequency"] = "0"
df_lines.loc[bool_lines, "cleaned"] = True
# Clean those values where number of voltages split by semicolon is larger
# than no cables or no circuits
bool_cables = (
(df_lines["voltage_original"].apply(lambda x: len(x.split(";")) > 1))
& (df_lines["cables"].apply(lambda x: len(x.split(";")) == 1))
& (df_lines["circuits"].apply(lambda x: len(x.split(";")) == 1))
& (df_lines["cleaned"] == False)
)
df_lines.loc[bool_cables, "circuits"] = df_lines[bool_cables].apply(
_distribute_to_circuits, axis=1
)
df_lines.loc[bool_cables & bool_ac, "frequency"] = "50"
df_lines.loc[bool_cables & bool_dc, "frequency"] = "0"
df_lines.loc[bool_cables, "cleaned"] = True
# Clean those values where multiple circuit values are present, divided by
# semicolon
has_multiple_circuits = df_lines["circuits"].apply(lambda x: len(x.split(";")) > 1)
circuits_match_split_elements = df_lines.apply(
lambda row: len(row["circuits"].split(";")) == row["split_elements"],
axis=1,
)
is_not_cleaned = df_lines["cleaned"] == False
bool_cables = has_multiple_circuits & circuits_match_split_elements & is_not_cleaned
df_lines.loc[bool_cables, "circuits"] = df_lines.loc[bool_cables].apply(
lambda row: str(row["circuits"].split(";")[int(row["id"].split("-")[-1]) - 1]),
axis=1,
)
df_lines.loc[bool_cables & bool_ac, "frequency"] = "50"
df_lines.loc[bool_cables & bool_dc, "frequency"] = "0"
df_lines.loc[bool_cables, "cleaned"] = True
# Clean those values where multiple cables values are present, divided by
# semicolon
has_multiple_cables = df_lines["cables"].apply(lambda x: len(x.split(";")) > 1)
cables_match_split_elements = df_lines.apply(
lambda row: len(row["cables"].split(";")) == row["split_elements"],
axis=1,
)
is_not_cleaned = df_lines["cleaned"] == False
bool_cables = has_multiple_cables & cables_match_split_elements & is_not_cleaned
df_lines.loc[bool_cables, "circuits"] = df_lines.loc[bool_cables].apply(
lambda row: str(
max(
1,
np.floor_divide(
int(row["cables"].split(";")[int(row["id"].split("-")[-1]) - 1]), 3
),
)
),
axis=1,
)
df_lines.loc[bool_cables & bool_ac, "frequency"] = "50"
df_lines.loc[bool_cables & bool_dc, "frequency"] = "0"
df_lines.loc[bool_cables, "cleaned"] = True
# All remaining lines to circuits == 1
bool_leftover = df_lines["cleaned"] == False
if sum(bool_leftover) > 0:
str_id = "; ".join(str(id) for id in df_lines.loc[bool_leftover, "id"])
logger.info(f"Setting circuits of remaining {sum(bool_leftover)} lines to 1...")
logger.info(f"Lines affected: {str_id}")
df_lines.loc[bool_leftover, "circuits"] = "1"
df_lines.loc[bool_leftover & bool_ac, "frequency"] = "50"
df_lines.loc[bool_leftover & bool_dc, "frequency"] = "0"
df_lines.loc[bool_leftover, "cleaned"] = True
return df_lines
def _create_substations_geometry(df_substations):
"""
    Copies the original substation geometries into a separate 'polygon' column.
Parameters:
df_substations (DataFrame): The input DataFrame containing the substations
data.
Returns:
df_substations (DataFrame): A new DataFrame with the
polygons ["polygon"] of the substations geometries.
"""
logger.info("Creating substations geometry.")
df_substations = df_substations.copy()
    # Keep the original polygons in a dedicated column
df_substations.loc[:, "polygon"] = df_substations["geometry"]
return df_substations
def _create_substations_centroid(df_substations):
"""
Creates centroids from geometries and keeps the original polygons.
Parameters:
df_substations (DataFrame): The input DataFrame containing the substations
data.
Returns:
df_substations (DataFrame): A new DataFrame with the centroids ["geometry"]
and polygons ["polygon"] of the substations geometries.
"""
logger.info("Creating substations geometry.")
df_substations = df_substations.copy()
df_substations.loc[:, "geometry"] = df_substations["polygon"].apply(
lambda x: x.centroid
)
df_substations.loc[:, "lon"] = df_substations["geometry"].apply(lambda x: x.x)
df_substations.loc[:, "lat"] = df_substations["geometry"].apply(lambda x: x.y)
return df_substations
def _create_lines_geometry(df_lines):
"""
Create line geometry for the given DataFrame of lines.
Parameters:
- df_lines (pandas.DataFrame): DataFrame containing lines data.
Returns:
- df_lines (pandas.DataFrame): DataFrame with transformed 'geometry'
column (type: shapely LineString).
Notes:
- This function transforms 'geometry' column in the input DataFrame by
applying the '_create_linestring' function to each row.
- It then drops rows where the geometry has equal start and end points,
as these are usually not lines but outlines of areas.
"""
logger.info("Creating lines geometry.")
df_lines = df_lines.copy()
df_lines.loc[:, "geometry"] = df_lines.apply(_create_linestring, axis=1)
bool_circle = df_lines["geometry"].apply(lambda x: x.coords[0] == x.coords[-1])
df_lines = df_lines[~bool_circle]
return df_lines
def _add_bus_centroid_to_line(linestring, point):
"""
Adds the centroid of a substation to a linestring by extending the
linestring with a new segment.
Parameters:
linestring (LineString): The original linestring to extend.
point (Point): The centroid of the bus.
Returns:
merged (LineString): The extended linestring with the new segment.
"""
start = linestring.coords[0]
end = linestring.coords[-1]
dist_to_start = point.distance(Point(start))
dist_to_end = point.distance(Point(end))
if dist_to_start < dist_to_end:
new_segment = LineString([point.coords[0], start])
else:
new_segment = LineString([point.coords[0], end])
merged = linemerge([linestring, new_segment])
return merged
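# Illustrative usage (documentation sketch, not executed by the pipeline):
#   line = LineString([(0, 0), (1, 0)])
#   _add_bus_centroid_to_line(line, Point(0.1, 0.5))
#   # -> one LINESTRING connecting (0.1 0.5), (0 0) and (1 0); the new segment
#   #    is attached to the closer endpoint (here the start) and merged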
def _finalise_substations(df_substations):
"""
Finalises the substations column types.
Args:
df_substations (pandas.DataFrame): The input DataFrame
containing substations data.
Returns:
        df_substations (pandas.DataFrame): The DataFrame with finalised column
types and transformed data.
"""
logger.info("Finalising substations column types.")
df_substations = df_substations.copy()
# rename columns
df_substations.rename(
columns={
"id": "bus_id",
"power": "symbol",
"substation": "tag_substation",
},
inplace=True,
)
# Initiate new columns for subsequent build_osm_network step
df_substations.loc[:, "symbol"] = "substation"
df_substations.loc[:, "tag_substation"] = "transmission"
df_substations.loc[:, "dc"] = False
df_substations.loc[df_substations["frequency"] == "0", "dc"] = True
df_substations.loc[:, "under_construction"] = False
df_substations.loc[:, "station_id"] = None
df_substations.loc[:, "tag_area"] = None
df_substations.loc[:, "tag_source"] = df_substations["bus_id"]
    # Only include needed columns
df_substations = df_substations[
[
"bus_id",
"symbol",
"tag_substation",
"voltage",
"lon",
"lat",
"dc",
"under_construction",
"station_id",
"tag_area",
"country",
"geometry",
"polygon",
"tag_source",
]
]
# Substation data types
df_substations["voltage"] = df_substations["voltage"].astype(int)
return df_substations
def _finalise_lines(df_lines):
"""
Finalises the lines column types.
Args:
df_lines (pandas.DataFrame): The input DataFrame containing lines data.
Returns:
        df_lines (pandas.DataFrame): The DataFrame with finalised column types
and transformed data.
"""
logger.info("Finalising lines column types.")
df_lines = df_lines.copy()
# Rename columns
df_lines.rename(
columns={
"id": "line_id",
"power": "tag_type",
"frequency": "tag_frequency",
},
inplace=True,
)
# Initiate new columns for subsequent build_osm_network step
df_lines.loc[:, "bus0"] = None
df_lines.loc[:, "bus1"] = None
df_lines.loc[:, "length"] = None
df_lines.loc[:, "underground"] = False
df_lines.loc[df_lines["tag_type"] == "line", "underground"] = False
df_lines.loc[df_lines["tag_type"] == "cable", "underground"] = True
df_lines.loc[:, "under_construction"] = False
df_lines.loc[:, "dc"] = False
df_lines.loc[df_lines["tag_frequency"] == "50", "dc"] = False
df_lines.loc[df_lines["tag_frequency"] == "0", "dc"] = True
# Only include needed columns
df_lines = df_lines[
[
"line_id",
"circuits",
"tag_type",
"voltage",
"tag_frequency",
"bus0",
"bus1",
"length",
"underground",
"under_construction",
"dc",
"country",
"geometry",
]
]
df_lines["circuits"] = df_lines["circuits"].astype(int)
df_lines["voltage"] = df_lines["voltage"].astype(int)
df_lines["tag_frequency"] = df_lines["tag_frequency"].astype(int)
return df_lines
def _finalise_links(df_links):
"""
Finalises the links column types.
Args:
df_links (pandas.DataFrame): The input DataFrame containing links data.
Returns:
        df_links (pandas.DataFrame): The DataFrame with finalised column types
and transformed data.
"""
logger.info("Finalising links column types.")
df_links = df_links.copy()
# Rename columns
df_links.rename(
columns={
"id": "link_id",
"rating": "p_nom",
},
inplace=True,
)
# Initiate new columns for subsequent build_osm_network step
df_links["bus0"] = None
df_links["bus1"] = None
df_links["length"] = None
df_links["underground"] = True
df_links["under_construction"] = False
df_links["dc"] = True
# Only include needed columns
df_links = df_links[
[
"link_id",
"voltage",
"p_nom",
"bus0",
"bus1",
"length",
"underground",
"under_construction",
"dc",
"country",
"geometry",
]
]
df_links["p_nom"] = df_links["p_nom"].astype(int)
df_links["voltage"] = df_links["voltage"].astype(int)
return df_links
def _import_substations(path_substations):
"""
Import substations from the given input paths. This function imports both
substations from OSM ways as well as relations that contain nested
information on the substations shape and electrical parameters. Ways and
relations are subsequently concatenated to form a single DataFrame
containing unique bus ids.
Args:
path_substations (dict): A dictionary containing input paths for
substations.
Returns:
pd.DataFrame: A DataFrame containing the imported substations data.
"""
cols_substations_way = [
"id",
"geometry",
"country",
"power",
"substation",
"voltage",
"frequency",
]
cols_substations_relation = [
"id",
"country",
"power",
"substation",
"voltage",
"frequency",
]
df_substations_way = pd.DataFrame(columns=cols_substations_way)
df_substations_relation = pd.DataFrame(columns=cols_substations_relation)
logger.info("Importing substations")
for key in path_substations:
logger.info(f"Processing {key}...")
for idx, ip in enumerate(path_substations[key]):
if (
os.path.exists(ip) and os.path.getsize(ip) > 400
): # unpopulated OSM json is about 51 bytes
country = os.path.basename(os.path.dirname(path_substations[key][idx]))
logger.info(
f" - Importing {key} {str(idx+1).zfill(2)}/{str(len(path_substations[key])).zfill(2)}: {ip}"
)
with open(ip, "r") as f:
data = json.load(f)
df = pd.DataFrame(data["elements"])
df["id"] = df["id"].astype(str)
# new string that adds "way/" to id
df["id"] = df["id"].apply(
lambda x: (
f"way/{x}" if key == "substations_way" else f"relation/{x}"
)
)
df["country"] = country
col_tags = ["power", "substation", "voltage", "frequency"]
tags = pd.json_normalize(df["tags"]).map(
lambda x: str(x) if pd.notnull(x) else x
)
for ct in col_tags:
if ct not in tags.columns:
tags[ct] = pd.NA
tags = tags.loc[:, col_tags]
df = pd.concat([df, tags], axis="columns")
if key == "substations_way":
df.drop(columns=["type", "tags", "bounds", "nodes"], inplace=True)
df_substations_way = pd.concat(
[df_substations_way, df], axis="rows"
)
elif key == "substations_relation":
df.drop(columns=["type", "tags", "bounds"], inplace=True)
df_substations_relation = pd.concat(
[df_substations_relation, df], axis="rows"
)
else:
logger.info(
f" - Skipping {key} {str(idx+1).zfill(2)}/{str(len(path_substations[key])).zfill(2)} (empty): {ip}"
)
continue
logger.info("---")
df_substations_way.drop_duplicates(subset="id", keep="first", inplace=True)
df_substations_relation.drop_duplicates(subset="id", keep="first", inplace=True)
df_substations_way["geometry"] = df_substations_way.apply(_create_polygon, axis=1)
# Normalise the members column of df_substations_relation
cols_members = ["id", "type", "ref", "role", "geometry"]
df_substations_relation_members = pd.DataFrame(columns=cols_members)
for index, row in df_substations_relation.iterrows():
col_members = ["type", "ref", "role", "geometry"]
df = pd.json_normalize(row["members"])
for cm in col_members:
if cm not in df.columns:
df[cm] = pd.NA
df = df.loc[:, col_members]
df["id"] = str(row["id"])
df["ref"] = df["ref"].astype(str)
df = df[df["type"] != "node"]
df = df.dropna(subset=["geometry"])
df = df[~df["role"].isin(["", "incoming_line", "substation", "inner"])]
df_substations_relation_members = pd.concat(
[df_substations_relation_members, df], axis="rows"
)
df_substations_relation_members.reset_index(inplace=True)
df_substations_relation_members["linestring"] = (
df_substations_relation_members.apply(_create_linestring, axis=1)
)
df_substations_relation_members_grouped = (
df_substations_relation_members.groupby("id")["linestring"]
.apply(lambda x: linemerge(x.tolist()))
.reset_index()
)
df_substations_relation_members_grouped["geometry"] = (
df_substations_relation_members_grouped["linestring"].apply(
lambda x: x.convex_hull
)
)
df_substations_relation = (
df_substations_relation.join(
df_substations_relation_members_grouped.set_index("id"), on="id", how="left"
)
.drop(columns=["members", "linestring"])
.dropna(subset=["geometry"])
)
# reorder columns and concatenate
df_substations_relation = df_substations_relation[cols_substations_way]
df_substations = pd.concat(
[df_substations_way, df_substations_relation], axis="rows"
)
return df_substations
def _remove_lines_within_substations(gdf_lines, gdf_substations_polygon):
"""
Removes lines that are within substation polygons from the given
GeoDataFrame of lines. These are not needed to create network (e.g. bus
bars, switchgear, etc.)
Parameters:
- gdf_lines (GeoDataFrame): A GeoDataFrame containing lines with 'line_id'
and 'geometry' columns.
- gdf_substations_polygon (GeoDataFrame): A GeoDataFrame containing
substation polygons.
Returns:
GeoDataFrame: A new GeoDataFrame without lines within substation polygons.
"""
logger.info("Identifying and removing lines within substation polygons...")
gdf = gpd.sjoin(
gdf_lines[["line_id", "geometry"]],
gdf_substations_polygon,
how="inner",
predicate="within",
)["line_id"]
logger.info(
f"Removed {len(gdf)} lines within substations of original {len(gdf_lines)} lines."
)
gdf_lines = gdf_lines[~gdf_lines["line_id"].isin(gdf)]
return gdf_lines
def _merge_touching_polygons(df):
"""
Merge touching polygons in a GeoDataFrame.
Parameters:
- df: pandas.DataFrame or geopandas.GeoDataFrame
The input DataFrame containing the polygons to be merged.
Returns:
- gdf: geopandas.GeoDataFrame
The GeoDataFrame with merged polygons.
"""
gdf = gpd.GeoDataFrame(df, geometry="polygon", crs=crs)
combined_polygons = unary_union(gdf.geometry)
if combined_polygons.geom_type == "MultiPolygon":
gdf_combined = gpd.GeoDataFrame(
geometry=[poly for poly in combined_polygons.geoms], crs=crs
)
else:
gdf_combined = gpd.GeoDataFrame(geometry=[combined_polygons], crs=crs)
gdf.reset_index(drop=True, inplace=True)
for i, combined_geom in gdf_combined.iterrows():
mask = gdf.intersects(combined_geom.geometry)
gdf.loc[mask, "polygon_merged"] = combined_geom.geometry
gdf.drop(columns=["polygon"], inplace=True)
gdf.rename(columns={"polygon_merged": "polygon"}, inplace=True)
return gdf
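# Illustrative behaviour (documentation sketch, not executed by the pipeline;
# assumes shapely.geometry.box and the module-level crs defined in __main__):
#   df = pd.DataFrame({"polygon": [box(0, 0, 1, 1), box(1, 0, 2, 1)]})
#   _merge_touching_polygons(df)
#   # -> both rows receive the same merged 2x1 polygon in their
#   #    "polygon" column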
def _add_endpoints_to_line(linestring, polygon_dict):
"""
Adds endpoints to a line by removing any overlapping areas with polygons.
Parameters:
linestring (LineString): The original line to add endpoints to.
polygon_dict (dict): A dictionary of polygons, where the keys are bus IDs and the values are the corresponding polygons.
Returns:
LineString: The modified line with added endpoints.
"""
if not polygon_dict:
return linestring
polygon_centroids = {
bus_id: polygon.centroid for bus_id, polygon in polygon_dict.items()
}
    polygon_unary = unary_union(list(polygon_dict.values()))
    # Cut away the parts of the line overlapping the substation polygons
    linestring_new = linestring.difference(polygon_unary)
    if isinstance(linestring_new, MultiLineString):
# keep the longest line in the multilinestring
linestring_new = max(linestring_new.geoms, key=lambda x: x.length)
for p in polygon_centroids:
linestring_new = _add_bus_centroid_to_line(linestring_new, polygon_centroids[p])
return linestring_new
def _get_polygons_at_endpoints(linestring, polygon_dict):
"""
Get the polygons that contain the endpoints of a given linestring.
Parameters:
linestring (LineString): The linestring for which to find the polygons at the endpoints.
polygon_dict (dict): A dictionary containing polygons as values, with bus_ids as keys.
Returns:
dict: A dictionary containing bus_ids as keys and polygons as values, where the polygons contain the endpoints of the linestring.
"""
# Get the endpoints of the linestring
start_point = Point(linestring.coords[0])
end_point = Point(linestring.coords[-1])
# Initialize dictionary to store bus_ids as keys and polygons as values
bus_id_polygon_dict = {}
for bus_id, polygon in polygon_dict.items():
if polygon.contains(start_point) or polygon.contains(end_point):
bus_id_polygon_dict[bus_id] = polygon
return bus_id_polygon_dict
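# Illustrative usage (documentation sketch, not executed by the pipeline;
# assumes shapely.geometry.box):
#   line = LineString([(0, 0), (5, 0)])
#   _get_polygons_at_endpoints(line, {"bus/1": box(-1, -1, 1, 1),
#                                     "bus/2": box(10, 10, 11, 11)})
#   # -> {"bus/1": <polygon>}; only polygons containing an endpoint are kept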
def _extend_lines_to_substations(gdf_lines, gdf_substations_polygon):
"""
Extends the lines in the given GeoDataFrame `gdf_lines` to the centroid of
the nearest substations represented by the polygons in the
`gdf_substations_polygon` GeoDataFrame.
Parameters:
gdf_lines (GeoDataFrame): A GeoDataFrame containing the lines to be extended.
gdf_substations_polygon (GeoDataFrame): A GeoDataFrame containing the polygons representing substations.
Returns:
GeoDataFrame: A new GeoDataFrame with the lines extended to the substations.
"""
gdf = gpd.sjoin(
gdf_lines,
gdf_substations_polygon.drop_duplicates(subset="polygon", inplace=False),
how="left",
lsuffix="line",
rsuffix="bus",
predicate="intersects",
).drop(columns="index_bus")
# Group by 'line_id' and create a dictionary mapping 'bus_id' to 'geometry_bus', excluding the grouping columns
gdf = (
gdf.groupby("line_id")
.apply(
lambda x: x[["bus_id", "geometry_bus"]]
.dropna()
.set_index("bus_id")["geometry_bus"]
.to_dict(),
include_groups=False,
)
.reset_index()
)
gdf.columns = ["line_id", "bus_dict"]
gdf["intersects_bus"] = gdf.apply(lambda row: len(row["bus_dict"]) > 0, axis=1)
gdf.loc[:, "line_geometry"] = gdf.join(
gdf_lines.set_index("line_id")["geometry"], on="line_id"
)["geometry"]
# Polygons at the endpoints of the linestring
gdf["bus_endpoints"] = gdf.apply(
lambda row: _get_polygons_at_endpoints(row["line_geometry"], row["bus_dict"]),
axis=1,
)
gdf.loc[:, "line_geometry_new"] = gdf.apply(
lambda row: _add_endpoints_to_line(row["line_geometry"], row["bus_endpoints"]),
axis=1,
)
gdf.set_index("line_id", inplace=True)
gdf_lines.set_index("line_id", inplace=True)
gdf_lines.loc[:, "geometry"] = gdf["line_geometry_new"]
return gdf_lines
if __name__ == "__main__":
if "snakemake" not in globals():
from _helpers import mock_snakemake
snakemake = mock_snakemake("clean_osm_data")
configure_logging(snakemake)
set_scenario_config(snakemake)
# Parameters
crs = "EPSG:4326" # Correct crs for OSM data
min_voltage_ac = 200000 # [unit: V] Minimum voltage value to filter AC lines.
min_voltage_dc = 150000 # [unit: V] Minimum voltage value to filter DC links.
logger.info("---")
logger.info("SUBSTATIONS")
# Input
path_substations = {
"substations_way": snakemake.input.substations_way,
"substations_relation": snakemake.input.substations_relation,
}
# Cleaning process
df_substations = _import_substations(path_substations)
df_substations["voltage"] = _clean_voltage(df_substations["voltage"])
df_substations, list_voltages = _filter_by_voltage(
df_substations, min_voltage=min_voltage_ac
)
df_substations["frequency"] = _clean_frequency(df_substations["frequency"])
df_substations = _clean_substations(df_substations, list_voltages)
df_substations = _create_substations_geometry(df_substations)
# Merge touching polygons
df_substations = _merge_touching_polygons(df_substations)
df_substations = _create_substations_centroid(df_substations)
df_substations = _finalise_substations(df_substations)
# Create polygon GeoDataFrame to remove lines within substations
gdf_substations_polygon = gpd.GeoDataFrame(
df_substations[["bus_id", "polygon", "voltage"]],
geometry="polygon",
crs=crs,
)
gdf_substations_polygon["geometry"] = gdf_substations_polygon.polygon.copy()
logger.info("---")
logger.info("LINES AND CABLES")
path_lines = {
"lines": snakemake.input.lines_way,
"cables": snakemake.input.cables_way,
}
# Cleaning process
df_lines = _import_lines_and_cables(path_lines)
df_lines = _drop_duplicate_lines(df_lines)
df_lines.loc[:, "voltage"] = _clean_voltage(df_lines["voltage"])
df_lines, list_voltages = _filter_by_voltage(df_lines, min_voltage=min_voltage_ac)
df_lines.loc[:, "circuits"] = _clean_circuits(df_lines["circuits"])
df_lines.loc[:, "cables"] = _clean_cables(df_lines["cables"])
df_lines.loc[:, "frequency"] = _clean_frequency(df_lines["frequency"])
df_lines.loc[:, "wires"] = _clean_wires(df_lines["wires"])
df_lines = _clean_lines(df_lines, list_voltages)
# Drop DC lines, will be added through relations later
len_before = len(df_lines)
df_lines = df_lines[df_lines["frequency"] == "50"]
len_after = len(df_lines)
logger.info(
f"Dropped {len_before - len_after} DC lines. Keeping {len_after} AC lines."
)
df_lines = _create_lines_geometry(df_lines)
df_lines = _finalise_lines(df_lines)
# Create GeoDataFrame
gdf_lines = gpd.GeoDataFrame(df_lines, geometry="geometry", crs=crs)
gdf_lines = _remove_lines_within_substations(gdf_lines, gdf_substations_polygon)
gdf_lines = _extend_lines_to_substations(gdf_lines, gdf_substations_polygon)
logger.info("---")
logger.info("HVDC LINKS")
path_links = {
"links": snakemake.input.links_relation,
}
df_links = _import_links(path_links)
df_links = _drop_duplicate_lines(df_links)
df_links.loc[:, "voltage"] = _clean_voltage(df_links["voltage"])
df_links, list_voltages = _filter_by_voltage(df_links, min_voltage=min_voltage_dc)
# Keep only highest voltage of split string
df_links.loc[:, "voltage"] = df_links["voltage"].apply(
lambda x: str(max(map(int, x.split(";"))))
)
df_links.loc[:, "frequency"] = _clean_frequency(df_links["frequency"])
df_links.loc[:, "rating"] = _clean_rating(df_links["rating"])
df_links.loc[:, "geometry"] = df_links.apply(_create_single_link, axis=1)
df_links = _finalise_links(df_links)
gdf_links = gpd.GeoDataFrame(df_links, geometry="geometry", crs=crs).set_index(
"link_id"
)
# Add line endings to substations
path_country_shapes = snakemake.input.country_shapes
path_offshore_shapes = snakemake.input.offshore_shapes
df_substations = _add_line_endings_to_substations(
df_substations,
gdf_lines,
path_country_shapes,
path_offshore_shapes,
prefix="line-end",
)
df_substations = _add_line_endings_to_substations(
df_substations,
gdf_links,
path_country_shapes,
path_offshore_shapes,
prefix="link-end",
)
# Drop polygons and create GDF
gdf_substations = gpd.GeoDataFrame(
df_substations.drop(columns=["polygon"]), geometry="geometry", crs=crs
)
output_substations_polygon = snakemake.output["substations_polygon"]
output_substations = snakemake.output["substations"]
output_lines = snakemake.output["lines"]
output_links = snakemake.output["links"]
logger.info(
f"Exporting clean substations with polygon shapes to {output_substations_polygon}"
)
gdf_substations_polygon.drop(columns=["geometry"]).to_file(
output_substations_polygon, driver="GeoJSON"
)
logger.info(f"Exporting clean substations to {output_substations}")
gdf_substations.to_file(output_substations, driver="GeoJSON")
logger.info(f"Exporting clean lines to {output_lines}")
gdf_lines.to_file(output_lines, driver="GeoJSON")
logger.info(f"Exporting clean links to {output_links}")
gdf_links.to_file(output_links, driver="GeoJSON")
logger.info("Cleaning OSM data completed.")