Introduction

According to United Nations projections, the global trend of urbanization is accelerating, and by 2050, 68% of the world’s population is expected to live in urban areas (United Nations 2018). This shift increases the demand for housing, transportation, and public services, putting pressure on social and economic systems (Glaeser 2011). Identifying the functions of urban buildings provides insights into space utilization, enabling planners to optimize infrastructure, allocate resources efficiently, and anticipate future needs, thereby supporting strategic decisions for sustainable urban development (Lin et al. 2021; Platt 2014).

Building function classification involves assigning buildings to specific semantic categories, such as residential, commercial, administrative and public facilities, and educational services (Du et al. 2015). The availability of this data is typically limited, even in official records, and when it does exist, it is rarely accessible to the public (Xu et al. 2022). Additionally, building function is primarily determined through on-site surveys conducted by government agencies, which are labor-intensive and prone to subjective interpretations. Furthermore, the data is often aggregated by block rather than provided for each individual building, limiting its applicability (Hecht et al. 2015). To address these challenges, researchers have developed methods to automatically classify building functions.

The classification of building functions has traditionally relied on analyzing their physical characteristics (such as area and perimeter), morphological characteristics (shape complexity metrics such as squareness and compactness), and spatial characteristics (neighborhood-related metrics such as adjacency and shared-wall ratio), derived from Remote Sensing Images (RSIs), Street-View Images (SVIs), and vector building data. These features are processed using a combination of expert-defined rules and Machine Learning (ML) models, as documented in recent studies (Du et al. 2015; Hoffmann et al. 2023; Kong et al. 2024; Steiniger et al. 2008). Additionally, buildings have been categorized based on the density and categories of nearby Points of Interest (POIs), which reflect human activity patterns (Lin et al. 2021; Liu et al. 2018; Zhang X et al. 2023).

For cases where POI categories do not readily align with building functions, or when the textual data involves complex human activities such as interactions on social media, researchers have recently used Natural Language Processing (NLP) techniques. Existing studies have employed text similarity, topic modeling, and, most recently, word embedding methods (Chen et al. 2020; Häberle et al. 2022). While word embeddings have improved building function classification, they struggle to capture deep semantic context and polysemous words. In geotagged text analysis, this limits their ability to interpret the subtle meanings that arise from diverse human activities; for instance, the word 'bank' might be misclassified as a river bank instead of a financial institution. Transformer-based Large Language Models (LLMs) overcome these limitations by generating contextual embeddings that consider the broad context of words. Our study proposes a novel multi-source building classification approach that combines building features, spatial arrangement, and local POI density with contextual embeddings generated from OpenStreetMap (OSM) building tags using the BERT-based General Text Embeddings (GTE) model. The models are developed and applied across six cities in the United States and Europe to classify residential, commercial, industrial, and institutional building functions. The contribution of feature types to model accuracy is explored through ablation analysis, and the GTE model's performance is compared to traditional methods such as FastText for building tag interpretation.

The remainder of this paper is structured as follows: Section 2 provides an overview of related studies on building function classification and NLP techniques. Section 3 introduces the experimental data and study areas. Section 4 describes the proposed building function classification method. Section 5 presents the experimental results. Finally, Section 6 presents a discussion and Section 7 the conclusions.

Background

Classification of building functions based on physical, morphological, and spatial attributes

The function of a building determines its need for structural strength, energy efficiency, and visual design. These requirements are met through the choice of building materials and architectural styles, each characterized by distinct spectral, textural, and geometric properties (Du et al. 2015). To capture these characteristics, some studies have employed semantic segmentation methods on remote sensing images to delineate building outlines and extract relevant metrics. Notable features extracted include roof materials (Xie and Zhou 2017), as well as the size, shape, and orientation of buildings (Yan et al. 2019). For example, taller structures are often used for commercial or residential high-rises, while circular buildings frequently house public arenas or theaters (Bachman 2004). In some cases, researchers utilize GIS databases that already contain vector representations of building outlines, along with useful attributes such as construction materials. Researchers have subsequently classified buildings into different functions using machine learning and rule-based models based on these features (Arunplod et al. 2017; Lu et al. 2014; Memduhoglu and Basaraner 2024). Additionally, SVIs, panoramic images taken continuously along streets, provide a ground-level perspective of urban spaces, revealing objects that are not visible from the airborne perspective (Li et al. 2017). Convolutional Neural Network (CNN) models have been utilized primarily with Google SVIs to identify building properties and classify buildings into categories (Kang et al. 2018; Srivastava et al. 2018; Taoufiq et al. 2020).

The spatial arrangement and morphological characteristics of buildings within urban environments also indicate their function, as these environments are often intentionally designed for specific uses (Yan et al. 2019). For instance, commercial districts typically feature buildings that are closely spaced, with wide frontages and ample parking areas to accommodate frequent public access, while residential areas might prioritize privacy with detached houses and landscaped yards that foster more secluded environments (Davis 2009). The spatial relationship between buildings is assessed using measures such as adjacency and shared wall ratios, which can indicate the density typical of residential or commercial environments (Fleischmann 2019; Hamaina et al. 2012). Inter-building distances are utilized to identify open spaces or less densely populated areas, which are often indicative of industrial or institutional functions (Caruso et al. 2017). Street alignment refers to the orientation of buildings in relation to road networks, highlighting how placement can enhance or inhibit accessibility and connectivity—important considerations in the location of commercial and institutional buildings (Fleischmann 2019). Morphological properties are analyzed by grouping buildings into divisions such as neighborhoods and blocks (Yan et al. 2019). This includes metrics related to the shape and layout of these units. For example, larger blocks might suggest a lower density of development, which is often associated with suburban or industrial areas where more space is available or required.

Textual data and building function classification

The classification approaches previously discussed focus on the physical attributes of buildings and their spatial organization. Studies have enriched these models by incorporating information about human activity around the buildings of interest, including data from POIs, social media, and floating car trajectories. Researchers have developed methods to quantify the density of activity and patterns of vehicle movement at different times, utilizing datasets such as smart card usage in public transportation, taxi GPS (Global Positioning System) trajectories, and geolocated social media messages. This activity data serves as a proxy to infer the functions of nearby buildings. For example, high activity levels and vehicle traffic in the morning are often indicative of commercial areas or office buildings, while similar patterns in the evening might suggest residential zones (Liu et al. 2018; Zhong et al. 2014).

More recently, the focus has shifted towards leveraging textual data from POIs to determine building functions. Atwal et al. (2022) utilized structured OSM building tags, encoding each building’s attributes as one-hot vectors, where each tag is represented as a binary indicator (0 or 1) based on its presence or absence, and used this data in a decision tree model to predict residential and non-residential building functions. Other studies have mapped categories of POIs to specific functions, categorizing areas densely populated with food-related establishments as commercial zones and those with entertainment outlets as leisure areas (Lin et al. 2021; Liu et al. 2018; Zhang X et al. 2023). In contrast, other studies have applied NLP techniques to interpret complex textual data that is not easily mappable to building functions. Chen et al. (2020) assigned building types to unclassified POIs by comparing their names to those already classified, using the Jaro-Winkler text similarity measurement, which assesses short strings and names by considering character-level differences and common prefixes. For POIs without similar names, they employed the TF-IDF term-weighting method, which quantifies a word's importance in a document based on its frequency and rarity across a corpus, to identify key terms and match them to the most relevant building types. Although this marks an advancement over the direct mapping of POI class to building function, these foundational NLP techniques do not capture deep contexts or semantic meanings (Ramos 2003).

Häberle et al. (2022) analyzed geotagged Tweets, converting them into a machine-readable format using FastText embeddings. These embeddings map words to dense, low-dimensional vectors, positioning similar elements closer together and dissimilar ones further apart. They then used these embeddings to train a neural network to classify the functions of buildings. Word embeddings such as Word2Vec and FastText are preferable to simpler vector representations such as one-hot encoding, which are sparse, high-dimensional, and carry no inherent semantic information. They also improve on traditional NLP methods by learning dense vector representations of words from their contexts in large corpora, capturing semantic and syntactic similarities through surrounding words. However, these models are limited by their use of a single static vector per word, which restricts their ability to represent polysemous words (Mikolov et al. 2013; Pennington et al. 2014). This limitation may reduce their effectiveness in analyzing geotagged text that reflects diverse human activities, as they cannot fully capture the dynamic and nuanced meanings that arise from varying contexts.

Transformer-based LLMs such as BERT (Devlin et al. 2018), OpenAI's GPT (Brown et al. 2020), and Llama-2 (Touvron et al. 2023) overcome the limitations of traditional word embeddings by generating contextual embeddings that consider the full context of words from both the preceding and following text in a sentence. This approach utilizes extensive datasets to capture the nuances of human language at both the word and sentence levels, offering deeper semantic representations and a richer understanding of language subtleties compared to earlier techniques. LLM embeddings have been applied to unstructured data across various fields, from predicting protein structures from genetic sequences (Sadeghi et al. 2024) to analyzing customer sentiment from reviews for marketing purposes (Zhou et al. 2023). Recent work on the application of LLMs in geography has illustrated their ability to effectively translate geospatial tasks into procedural operations and has demonstrated some capability in spatial reasoning, although they face challenges in precise numerical reasoning and in managing abstract spatial tasks (Fulman et al. 2024a; Fulman et al. 2024b; Li and Ning 2023; Mai et al. 2022; Roberts et al. 2023; Zhang Y et al. 2023). Leveraging these innovations, our study proposes a novel multi-source approach that integrates the physical and morphological features, spatial arrangement, and surrounding POI density of buildings with LLM-generated vector representations of OSM building tags.

Study area and data

Study area

We selected six cities for building function classification: Fairfax, VA; Mecklenburg, NC; and Boulder, CO in the United States; Berlin, Germany; Madrid, Spain; and Liberec, Czechia in Europe (Fig. 1). Our analysis considers buildings within the boundaries of these cities. We selected these cities based on the availability of detailed building function data, their geographic and cultural diversity, and the completeness of their building datasets within OpenStreetMap.

Fig. 1 Boundaries of the selected cities. Buildings within these boundaries are represented as black polygons

Ground truth building functions

The administrations of Fairfax County (Fairfax County Government 2024), Mecklenburg County (Mecklenburg County Government 2024), and the City of Boulder (The City of Boulder Government 2024) provide GIS layers of buildings, including their functions. The datasets for Berlin, Madrid, and Liberec were gathered from EUBUCCO, which also provides detailed information about building functions (Milojevic-Dupont et al. 2023). Detailed information about these cities and their metrics can be found in Table 1.

Table 1 Detailed Information about the cities selected for the study

Building height

The World Settlement Footprint 3D (WSF3D) building height dataset is provided by the German Aerospace Center (DLR) (Esch et al. 2022). It maps the average building height of settlements worldwide on a 90-meter grid.

Population density

Population density data were obtained from the WorldPop dataset for 2010-2011, provided in GeoTIFF format at a resolution of 3 arc-seconds (approximately 100 meters at the equator) (Bondarenko et al. 2020).

OSM building tags

OSM is a geographic information platform that offers volunteered open-source geospatial data for the entire planet. OSM provides geometric information as well as semantic details about buildings, contributed by users and presented as tags in a key:value format. For example, a tag such as 'building:apartment' may indicate that the building serves a residential function. While OSM provides a wiki-style guide for using predefined keys and values for certain types of features, users have the flexibility to create and apply their own tags. OSM data generally exhibits heterogeneity in terms of spatial distribution, the information provided by the tags, and their completeness. We collected OSM buildings data using the Ohsome (2024) API provided by HeiGIT (2024), a non-profit organization specializing in humanitarian aid, smart mobility and climate change research. We selected only those OSM buildings that overlapped with buildings from the official administrative datasets. Overall, we gathered an average of 4 tags per building from the OSM building attributes.
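To make this retrieval step concrete, the sketch below shows one way to request OSM building footprints with their tags through the ohsome API using the Python requests library. The bounding box, snapshot date, and filter expression are illustrative assumptions, not the exact queries used in this study.

```python
# Hedged sketch: fetching OSM building polygons with tags via the ohsome API.
import requests
import geopandas as gpd

OHSOME_URL = "https://api.ohsome.org/v1/elements/geometry"

params = {
    "bboxes": "13.29,52.46,13.51,52.57",          # hypothetical bounding box
    "time": "2024-01-01",                          # snapshot date (assumption)
    "filter": "building=* and geometry:polygon",   # building polygons only
    "properties": "tags",                          # return the full tag dictionary per feature
}

response = requests.post(OHSOME_URL, data=params)
response.raise_for_status()

# The response is a GeoJSON FeatureCollection; load it into a GeoDataFrame.
buildings = gpd.GeoDataFrame.from_features(response.json()["features"], crs="EPSG:4326")
print(len(buildings), "building polygons retrieved")
```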

POI data

In addition to building data, OSM contains data on POIs, such as restaurants, shops, amenities, and more. Although these POIs are often associated with specific buildings, they are distinctly represented as point geometries in OSM. We utilized a filter through the Ohsome API to selectively extract POIs that involve human activities. This selection includes a diverse range of practical amenities (e.g., cafes, banks, cinemas, restaurants), healthcare facilities, historic places, leisure places (e.g. ice rinks, amusement parks, swimming pools), and various types of shops. Note that in our analysis, we focus on their spatial distribution and do not use the tag data associated with these POIs.

Methods

The methodology for building classification involves two main steps (Fig. 2). First, 32 building metrics are calculated from various data sources. Second, a machine learning model is used to classify the building functions.

Fig. 2 The flowchart of the methodology

Calculating building metrics

In this paper, the terms 'metrics' and 'features' are used interchangeably. For each city, the building functions from administrative datasets were first transferred to the corresponding OSM datasets via a spatial join operation. These OSM datasets were then used to calculate all metrics for the ML model. The building metrics were categorized into four groups: Physical attributes, shape complexity, spatial relationships, and text embeddings.
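As a sketch of the label-transfer step, the snippet below performs a spatial join between hypothetical administrative and OSM building layers with GeoPandas (version >= 0.10 assumed for the `predicate` argument); file paths and the "function" column name are placeholders.

```python
# Hedged sketch: transferring ground-truth functions to OSM footprints via a spatial join.
import geopandas as gpd

admin = gpd.read_file("admin_buildings.gpkg")   # hypothetical path
osm = gpd.read_file("osm_buildings.gpkg")       # hypothetical path

# Work in a common CRS so the overlay predicate behaves as expected.
admin = admin.to_crs(osm.crs)

# Attach the administrative "function" label to every overlapping OSM footprint.
labelled = gpd.sjoin(osm, admin[["function", "geometry"]],
                     how="inner", predicate="intersects")
labelled = labelled.drop(columns="index_right")
```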

Physical attributes

Building size metrics

Initially, height information was derived from the WSF3D raster, which provides building heights for settlements globally. The raster was overlaid onto the buildings, and a mean height value was calculated for each building. Subsequently, five size features pertaining to individual buildings were calculated: area, height, perimeter, volume, and longest axis length. These size metrics and their descriptions are detailed in Table 2.
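A minimal sketch of this step is shown below, using rasterstats for the zonal mean height and momepy (class-based API, versions < 1.0) for the longest axis length. File names are hypothetical, and the buildings and raster are assumed to share a projected CRS so that area and length are in meters.

```python
# Hedged sketch: mean WSF3D height per footprint and the size metrics of Table 2.
import geopandas as gpd
import momepy
from rasterstats import zonal_stats

buildings = gpd.read_file("osm_buildings_labelled.gpkg")   # hypothetical path, projected CRS assumed

# Mean WSF3D height per footprint (raster assumed to share the buildings' CRS).
stats = zonal_stats(buildings, "wsf3d_height.tif", stats=["mean"], nodata=0)
buildings["height"] = [s["mean"] if s["mean"] is not None else 0.0 for s in stats]

# Size metrics of the individual buildings.
buildings["area"] = buildings.geometry.area
buildings["perimeter"] = buildings.geometry.length
buildings["volume"] = buildings["area"] * buildings["height"]   # footprint area x mean height
buildings["longest_axis"] = momepy.LongestAxisLength(buildings).series
```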

Table 2 Size metrics of the individual buildings

Shape complexity features

The shape complexity features pertain to individual buildings and building blocks. Building blocks were delineated using the Momepy library, an open-source Python library designed for the quantitative analysis and visualization of urban form and morphology (Fleischmann 2019). Momepy creates Voronoi diagrams around buildings to delineate blocks. These diagrams are constrained by a distance threshold, in our case 100 meters around each building, and are further clipped to ensure that blocks do not extend across major roads or other barriers (Fig. 3).
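The block delineation described above can be sketched with momepy's tessellation tools, assuming the class-based API of momepy < 1.0; clipping by major roads and other barriers, as described in the text, is omitted here for brevity.

```python
# Hedged sketch: Voronoi-based tessellation cells constrained to 100 m around each building.
import momepy

buildings = buildings.reset_index(drop=True)
buildings["uID"] = range(len(buildings))            # unique id required by momepy

# Buffer-constrained limit so cells do not extend more than 100 m from any building.
limit = momepy.buffered_limit(buildings, buffer=100)

tess = momepy.Tessellation(buildings, unique_id="uID", limit=limit)
blocks = tess.tessellation                          # one Voronoi-based cell per building
```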

Fig. 3 Voronoi tessellation-based blocks created for block-based calculations

Ten shape complexity features pertaining to individual buildings (circular compactness, convexity, elongation, equivalent rectangular index (ERI), fractal dimension, orientation, rectangularity, roughness index, square compactness, and squareness) and seven block-based metrics (block area, block perimeter, block convexity, block elongation, block ERI, block fractal dimension, and block squareness) were calculated using the formulas provided in Table 3. Some of these formulas were already implemented in the Momepy library, while others were implemented manually in Python. Block-based metrics were calculated using the same equations as those for buildings, but with block boundaries substituted for building footprints.
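A few of these metrics are sketched below using momepy's class-based API (versions < 1.0); the remaining metrics in Table 3 follow the same pattern, and block-level versions can be obtained by passing the block polygons instead of the building footprints.

```python
# Hedged sketch: a subset of the shape complexity metrics via momepy.
import momepy

buildings["circular_compactness"] = momepy.CircularCompactness(buildings).series
buildings["elongation"] = momepy.Elongation(buildings).series
buildings["rectangularity"] = momepy.Rectangularity(buildings).series
buildings["orientation"] = momepy.Orientation(buildings).series
buildings["eri"] = momepy.EquivalentRectangularIndex(buildings).series

# Block-level variants reuse the same classes on the block geometries, e.g.:
blocks["block_elongation"] = momepy.Elongation(blocks).series
```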

Table 3 Metrics for Shape Complexity with Descriptions

Spatial relationship metrics

The spatial relationship metrics describe buildings in relation to their surrounding buildings and streets. This category also includes the population density around the building, which is an indicator of human activity and service demand. High-density areas might indicate residential zones or commercial hubs, while lower-density areas could correspond to industrial or institutional regions (Luo et al. 2019). Finally, we applied the Gaussian Kernel Density Estimation (KDE) method to estimate POI density values across city boundaries. This non-parametric method transforms discrete POI data into a continuous density surface, revealing spatial patterns of amenities and infrastructure such as central places or hot spots within the urban environment (Lin et al. 2021; Miao et al. 2021). The KDE was calculated as a 10 m resolution raster for each city dataset (Fig. 4); this resolution balances detail and computational efficiency. The KDE values were then intersected with building footprints to compute a mean density value for each building. For spatial relationships, nine metrics were calculated: adjacency, inter-building distance, neighbor distance, shared walls ratio, street alignment, block building count, block building density, block population density, and KDE mean. These metrics, along with their equations and descriptions, are detailed in Table 4.
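The sketch below illustrates a Gaussian KDE surface over POI points with SciPy and a per-building value; `pois` and `buildings` are assumed to be GeoDataFrames in a projected CRS, the bandwidth follows SciPy's default (Scott's rule), and for brevity the per-building value is taken at the footprint centroid rather than as a zonal mean. For a full city at 10 m resolution, the grid would in practice be evaluated in tiles or written to a raster.

```python
# Hedged sketch: Gaussian KDE of POIs on a 10 m grid plus a per-building density value.
import numpy as np
from scipy.stats import gaussian_kde

# POI coordinates in a projected CRS (meters).
xy = np.vstack([pois.geometry.x.values, pois.geometry.y.values])
kde = gaussian_kde(xy)                       # bandwidth via Scott's rule by default

# Evaluate the density on a 10 m grid covering the city extent.
xmin, ymin, xmax, ymax = pois.total_bounds
xs = np.arange(xmin, xmax, 10.0)
ys = np.arange(ymin, ymax, 10.0)
gx, gy = np.meshgrid(xs, ys)
density = kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)

# Per-building density, here approximated at the footprint centroid.
cx = buildings.geometry.centroid
buildings["kde_mean"] = kde(np.vstack([cx.x.values, cx.y.values]))
```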

Fig. 4 KDE rasters for each city

Table 4 Spatial relationship metrics

Text embeddings

Although the OSM wiki provides tag usage guidelines, users often generate a diverse array of tags, deviating from these recommendations and complicating interpretation. Using LLM word embeddings, tags are converted to a machine-readable format that preserves their semantic and syntactic features by placing them within a multi-dimensional space, positioning similar elements closer together and dissimilar ones further apart (Zhou et al. 2023).

We utilized the GTE-large model to generate embeddings from OSM building tags (Fig. 5). Developed by Alibaba DAMO Academy, the GTE models are primarily based on the BERT framework and come in three sizes: GTE-large, GTE-base, and GTE-small (Li et al. 2023). These models are trained on an extensive corpus of relevant text pairs spanning a diverse array of domains and scenarios. This comprehensive training allows GTE models to be effectively utilized in various downstream tasks involving text embeddings, such as information retrieval, semantic textual similarity, and text reranking. The GTE-large model has a size of 0.67 GB and generates a 1024-dimensional vector for each input. A token can be any meaningful unit in a text sequence, such as a word, punctuation mark, or emoji. The model primarily caters to English texts and truncates lengthy inputs to a maximum sequence length of 512 tokens.

Fig. 5 An illustration of the OSM tags and embeddings

We chose GTE-large because, at the time of writing, it was the leading model for text classification in benchmarks used to assess LLMs (Li et al. 2023). It consistently outperformed other models, including significantly larger ones, on the Massive Text Embedding Benchmark (MTEB), which tests models across diverse tasks such as text retrieval, classification, and semantic similarity. GTE-large also excelled in zero-shot text classification on the SST-2 sentiment analysis task, which evaluates a model's ability to classify text without task-specific fine-tuning.

As a preliminary step, we cleaned the OSM tags by implementing a series of targeted filters to retain only the most pertinent data for building function classification. This process involved iterating through each key-value pair in the OSM tags dictionary and selectively excluding entries that did not contribute meaningful information for classification. First, we excluded tags whose key contains the substring 'wiki', as these tags often reference external sources or metadata that are not directly relevant to functional classification. Similarly, tags with keys prefixed by 'source' were removed, since these typically indicate the origin of the information rather than intrinsic characteristics of the feature itself. We also filtered out tags with values of 'yes' or 'no', as these binary indicators lack the specificity required for distinguishing between building functions. Additionally, tags with purely numeric values, whether representing quantities, codes, house numbers, or other numerical data, were excluded, as they generally do not offer descriptive information useful for functional classification. The remaining key:value pairs were then tokenized and embedded using the GTE-large model.
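The filtering and embedding step is sketched below. The use of the sentence-transformers library and the Hugging Face model id "thenlper/gte-large" are assumptions about tooling rather than the study's exact implementation, and `osm_tags` is a hypothetical tag dictionary.

```python
# Hedged sketch: tag cleaning followed by GTE-large embedding of the remaining tags.
from sentence_transformers import SentenceTransformer

def clean_tags(tags: dict) -> dict:
    """Drop tags that carry no functional information."""
    cleaned = {}
    for key, value in tags.items():
        if "wiki" in key or key.startswith("source"):
            continue                                   # metadata / provenance tags
        if value in ("yes", "no"):
            continue                                   # binary indicators are too unspecific
        if str(value).replace(".", "", 1).isdigit():
            continue                                   # purely numeric values (counts, house numbers, ...)
        cleaned[key] = value
    return cleaned

model = SentenceTransformer("thenlper/gte-large")      # assumed model id on Hugging Face

osm_tags = {"building": "apartments", "building:levels": "5",
            "source": "survey", "addr:housenumber": "12"}   # hypothetical example
text = " ".join(f"{k}:{v}" for k, v in clean_tags(osm_tags).items())
embedding = model.encode(text)                          # 1024-dimensional vector
```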

To assess the capabilities of LLM text embeddings, we compare models that use them with models that use one-hot encoding, which lacks inherent semantic information (Mokhtarani 2021), and with NLP models such as GloVe, Word2Vec, and FastText, which are limited in capturing deep contextual relationships.
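One way to construct the one-hot (direct tag) baseline is sketched below with scikit-learn; the per-building tag lists are hypothetical examples.

```python
# Hedged sketch: one-hot encoding of cleaned key:value tag strings as the direct-tag baseline.
from sklearn.preprocessing import MultiLabelBinarizer

tag_lists = [
    ["building:apartments", "building:levels:5"],   # hypothetical building 1
    ["building:retail", "shop:supermarket"],         # hypothetical building 2
]

mlb = MultiLabelBinarizer()
one_hot = mlb.fit_transform(tag_lists)   # shape: (n_buildings, n_distinct_tags)
print(mlb.classes_)
```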

Building metric completeness

To maximize the effectiveness of the ML model, we ensured that all values for each building were complete. We employed a K-nearest neighbor (KNN) imputer to fill in missing values by averaging the values of the four nearest neighboring buildings that have such values. For heights and volumes, which do not exhibit spatial dependence and hence should not be interpolated using KNN, zeros were used as fill values. Similarly, missing tag vectors were filled with zero vectors.
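A sketch of this step with scikit-learn is shown below. Note that KNNImputer measures proximity in feature space; a strictly spatial variant would query neighbors by building coordinates instead, so treat the choice as an illustrative assumption. `feature_cols` is a hypothetical list of metric column names.

```python
# Hedged sketch: zero fills for height/volume, KNN imputation for the remaining metrics.
import numpy as np
from sklearn.impute import KNNImputer

zero_fill_cols = ["height", "volume"]
knn_cols = [c for c in feature_cols if c not in zero_fill_cols]   # feature_cols: hypothetical list

buildings[zero_fill_cols] = buildings[zero_fill_cols].fillna(0.0)

imputer = KNNImputer(n_neighbors=4)
buildings[knn_cols] = imputer.fit_transform(buildings[knn_cols])

# Buildings without usable OSM tags receive a zero embedding, e.g. np.zeros(1024).
```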

Building function classification

Function categories

Building functions were categorized into four main types for simplicity: residential, commercial, industrial, and institutional. The classes and related sub-types included in each of these categories are detailed in Table 5. To maintain simplicity, we excluded mixed-type buildings from our analysis, as their inclusion would have introduced additional functional complexity.

Table 5 Classes and Sub-types of Building Functions

Each building’s 1024-dimensional vector embedding makes predicting building functions computationally expensive. To allow classification within reasonable computational resources, we employed eXtreme Gradient Boosting (XGBoost) for its speed and efficiency in handling large datasets and high-dimensional feature spaces (Chen and Guestrin 2016). XGBoost is an ensemble learning method that constructs an optimal predictive model from multiple decision trees using a gradient boosting framework. In this iterative learning process, the decision trees are grown sequentially: the first tree is trained and evaluated, and incorrectly predicted data points are given more weight in subsequent iterations. This method enhances model performance by focusing on the errors of previous trees. We compared the performance and computation time of XGBoost to those of other models: Random Forest (Breiman 2001), Decision Tree (Quinlan 1986), KNN (Cover and Hart 1967), and Logistic Regression (Cox 1958), all of which were expected to yield lower accuracy.

The dataset was split into 80% for training and 20% for testing. Residential buildings dominate the data distribution, resulting in imbalanced data. Accuracy, while potentially higher, can be misleading in scenarios with imbalanced class distributions because it does not account for disparities in class size (Branco et al. 2016; Galar et al. 2011). For this reason, the macro-averaged F1-score was used for performance evaluation. The F1-score is the harmonic mean of precision, the proportion of true positives among all predicted positives, and recall, the proportion of true positives among all actual positives (Equation 1). This score was averaged across all classes to provide the macro average.

$$F1\text{-}score = \frac{2\times TP}{2\times TP + FP + FN} = \frac{2\times precision\times recall}{precision + recall}$$
(1)

We employed the XGBoost algorithm with its default parameters: a learning rate of 0.3, a maximum tree depth of 6, and 100 estimators, optimizing the multi-class logarithmic loss (mlogloss). This loss function evaluates model accuracy by penalizing deviations from the true class labels. The experiments were conducted in a Python (v3.10) environment on macOS (v14.5) with an Apple M1 chip. For NLP processing, Apple’s MPS (Metal Performance Shaders) device was utilized.
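The classification set-up described above is sketched below: an 80/20 split, an XGBoost classifier with the parameters quoted in the text, and the macro-averaged F1-score. `X` and `y` are assumed to be the assembled feature matrix and encoded function labels, and the stratified split and fixed random seed are assumptions for reproducibility rather than details reported in the study.

```python
# Hedged sketch: 80/20 split, XGBoost with the stated parameters, macro F1 evaluation.
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)   # stratified split (assumption)

clf = XGBClassifier(
    learning_rate=0.3,
    max_depth=6,
    n_estimators=100,
    objective="multi:softprob",
    eval_metric="mlogloss",
)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Macro F1:", f1_score(y_test, y_pred, average="macro"))
```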

To analyze the impacts of each group of building features on classification accuracy, we conducted an ablation analysis. Ablation analysis is a systematic method of removing or "ablating" individual features or groups of features from a model to assess their impact on the model's performance and identify which components are most important for predictions (Girshick et al. 2014; Meyes et al. 2019).
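One simple variant of this procedure is sketched below, continuing from the previous snippet (X_train, X_test, y_train, y_test as DataFrames/arrays): each feature group is dropped in turn, the model is retrained, and the macro F1-score is recorded. The group-to-column mapping is a hypothetical placeholder for the four groups described above.

```python
# Hedged sketch: leave-one-group-out ablation over the four feature groups.
feature_groups = {
    "physical": physical_cols,          # hypothetical column lists
    "shape_complexity": shape_cols,
    "spatial": spatial_cols,
    "text_embeddings": embedding_cols,
}

scores = {}
for name, cols in feature_groups.items():
    keep = [c for c in X_train.columns if c not in cols]
    clf_abl = XGBClassifier(eval_metric="mlogloss")
    clf_abl.fit(X_train[keep], y_train)
    scores[name] = f1_score(y_test, clf_abl.predict(X_test[keep]), average="macro")

for name, score in scores.items():
    print(f"without {name}: macro F1 = {score:.3f}")
```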

Finally, we assessed the contribution of contextual understanding provided by text embeddings to model accuracy. We compared the performance of the model with GTE-Large embeddings to the use of one-hot encodings (direct tag approach), which lack contextual understanding, and to established NLP models such as GloVe, Word2Vec, and FastText, which offer limited contextual capabilities. The latter evaluation focused solely on the embeddings themselves, excluding physical and spatial attributes of the buildings.

Results

A total of 32 features, as detailed earlier in this paper, were prepared for training the XGBoost model. Although some of these features are correlated with each other, the XGBoost model managed to extract valuable information from these correlations, enhancing its predictive capabilities. Feature selection and optimization techniques were considered to further improve the model and reduce computational costs, but they were not implemented because they yielded only minimal or negligible improvements. Ultimately, all features were used as input for the model and applied to the six city datasets. The results are presented in Fig. 6. The model's performance reaches a low F1 score of 67.80% in Madrid and peaks at 91.59% in Liberec. These levels and the variation between them align with state-of-the-art benchmarks (Atwal et al. 2022; Häberle et al. 2022).

Fig. 6 Building function classification results across the city datasets

XGBoost demonstrated better accuracy and lower computation time compared to other models that were tested, across all cities in the study. Figure 7 presents the results from different models using their default parameters, applied to the full set of 32 features for the Fairfax dataset.

Fig. 7 Comparison of model performance

The ablation results for each city are shown in Fig. 8. Without text embeddings, Boulder, Mecklenburg, and Fairfax grouped together with F1 scores ranging from 72.44% to 84.82%. Conversely, Liberec, Madrid, and Berlin had lower scores, between 52.59% and 59.49%. Incorporating text embeddings significantly improved classification accuracy across all cities; Fairfax saw the smallest increase at 3.86% and Madrid the highest at 9.66%, excluding Liberec, which showed a dramatic improvement of 39%.

Fig. 8 Ablation analysis results. The rows representing the models that use all physical features with (first row from the top) and without (fourth row from the top) LLM embeddings are highlighted

The results of the comparison of the direct tag approach to LLM-generated text embeddings are presented in Fig. 9. As expected, direct tags enhance results in all cases, adding between 1.8% in Fairfax and 8.6% in Madrid, with Liberec as an outlier at 32.2%. Yet language model embeddings provide additional improvements across all locations except for Boulder, where direct tags outperform language model embeddings by 0.9%. The range of improvement from language model embeddings varies from 6.8% in Liberec to a minimum of 1% in Berlin.

Fig. 9 Comparison of direct tags with embeddings

Comparing the performance of the GTE-Large embeddings with GloVe, Word2Vec, and FastText, the GTE-Large embeddings demonstrated superior performance across most evaluations, with an average improvement of 6.1% over the next best approach (Fig. 10). The improvement ranged from 0.9% in Berlin to 11.0% in Fairfax. The only exception was in Liberec, where the performance was comparable across all embedding models, including GTE-Large.

Fig. 10 LLM embedding performance compared to state-of-the-art models

The observed relationship between F1-scores and population density across cities drew our attention, and we therefore conducted a correlation analysis between these variables. The Pearson correlation coefficient between population density and F1-score is approximately -0.858, indicating that F1-scores tend to decrease as population density increases.

Discussion

Studies in the field of urban planning and geographic information science have increasingly focused on introducing new types of metrics, such as spatial and text embeddings, and implementing advanced methodologies that fuse various data sources. To make significant qualitative advances, we should study the discrepancies in model performance between cities, and within the same cities, when utilizing different attributes. Through this analysis, it may become possible to identify innovative features and appropriate modeling approaches for generating more generalized models that will be applicable across various urban settings.

In our case, we observed dramatic variations in model performance across different cities, both with and without the inclusion of text embeddings in the models. In addition, text embeddings lead to exceptionally high accuracy improvements in one city compared to the others. These variations in model performance might stem from differences in tag quality and availability, but also from the spatial properties of building functions and urban layout. Our observation that cities with lower population density are easier to predict accurately than those with higher densities suggests a potential direction for investigation. A possible explanation could be that, in densely built-up cities, the physical composition is optimized for efficiency and thus does not reflect function as clearly as in less dense environments. A more thorough analysis is required to understand the underlying causes of these disparities.

The generalizability of future research could benefit from extending the analysis to include cities from more diverse regions. Our study focused on six cities in the United States, Germany, Spain, and Czechia, all of which feature well-structured urban planning. This focus may overlook the complexities and irregularities of more diverse urban morphologies. Expanding this research faces several challenges. First, OSM tags have poor coverage in certain locations, particularly in underdeveloped countries. Our approach can incorporate various forms of geotagged text, allowing for broader application beyond OSM tags. Häberle et al. (2022) demonstrated the effectiveness of using FastText embeddings for analyzing geotagged tweets in urban environments. Flickr images could also serve as a potentially useful source of geotagged text. Applying LLMs to such textual data, which is typically longer and more complex than OSM tags, may improve classification accuracy since LLMs excel at capturing deeper semantic contexts and managing polysemy compared to simpler models.

An additional challenge that arises in less developed regions is that more tags may be in languages other than English, which are less represented in LLMs, especially in GTE-Large (Li et al. 2023). This could potentially lead to biases in processing linguistic data. To address this, future research could test more advanced LLM models as they become available. As LLMs evolve, a potential, though highly futuristic, approach could involve directly prompting an LLM to predict a building's function using its training data, based on physical attributes and OSM tags or other textual data provided in the prompt. However, at present, LLMs are not typically trained on specific physical metrics of buildings and their surroundings, which are essential for this task.

Conclusion

This study has demonstrated that integrating text embeddings with traditional spatial and physical metrics can significantly enhance the accuracy of building function classification. The application of LLMs to OpenStreetMap tags has proven particularly effective compared with traditional NLP approaches, revealing the potential of advanced NLP techniques in geographic information science and urban planning. Variations in model performance across different cities suggest that factors such as population density and the physical characteristics of cities influence classification outcomes. These insights indicate that a deeper analysis of these factors could lead to more generalized models that are adaptable to various urban settings. Moving forward, continued exploration of multilingual capabilities and the integration of additional geotagged data sources are recommended to further enhance the accuracy of these classification systems.