Introduction

Urbanization contributes to economic growth by attracting investors and creating entrepreneurial opportunities in urban areas (UN-Habitat, 2016). Urbanization is also associated with an increase in urban population, projected to be 68% of the world's population by 2050 (United Nations, 2018). However, this rate of urbanization has brought challenges linked to spatial socio-economic disparities, which manifest across various dimensions that denote the quality of life of urban dwellers and can be represented by multiple factors such as income, wealth, deprivation rate, educational attainment, employment rate, living conditions, and environment (Clark et al., 2022; United Nations, 2020). The urban socio-economic disparities are translated into social inequalities, poverty, and insecurity (United Nations, 2018). Urban socio-economic disparity is worse, particularly in the Global South, where the number of urban poor is increasing, leading to severe deprivation of access to safe water, sanitation facilities, infrastructure, and decent housing, forcing urban poor to live in overcrowded areas with poor living conditions and insecure land tenure (UN-Habitat, 2016). However, a comprehensive understanding of such disparities requires access to adequate data, which is scarce, inadequate, or lacking in most Global South.

Data scarcity is a challenge in most of the global south, where data at granular spatial scales, such as socio-economic conditions of urban neighborhoods, is neither available from government agencies nor commercial entities (Yeh et al., 2020). This is the data required to support urban development initiatives and policies aimed at promoting equitable and just cities (Kuffer et al., 2018). Surveys and censuses are widely used to obtain socio-economic data (McCallum et al., 2022). However, they are expensive and laborious, and many countries in the Global South do not have enough funds to continuously collect and update socio-economic data (Skinner, 2018). Where these data exist, they are aggregated at country or regional levels, while socio-economic conditions can significantly vary at a very fine spatial scale. Moreover, strict controls on data generation and dissemination by governments are sometimes criticized for potential alteration and forgery of data to cover deficits in implementing development strategies, which raises questions about their validity and reliability (Skinner, 2018). Unfortunately, despite the above biases, many development strategies that accept and use fail to capture the actual experiences of urban dwellers' socio-economic conditions, leading to ineffective policy implementation (Kuffer et al., 2018).

Nevertheless, the advent of technology is providing new and alternative data sources to tackle urbanization challenges (Luca et al., 2023). In this regard, the availability of satellite images has been a prominent data source and provides new opportunities for obtaining data for various uses, such as environmental and urban monitoring (Head et al., 2017; Jean et al., 2016). Recently, various researchers have benefited from improvements in image analysis techniques, such as machine learning-based fields like deep learning and computer vision, to extract meaningful information from satellite images (Steele et al., 2017; Yeh et al., 2020). These techniques use algorithms that follow a logical sequence inspired by how the human brain develops a high-level understanding of patterns depicted in digital images (Shelhamer et al., 2017). Similarly, the development of state-of-the-art Convolutional Neural Network (CNN) has propelled the field of computer vision forward, with considerable progress in image analysis (Lecun et al., 2015). This has influenced the field of remote sensing in terms of satellite image analysis, whereby CNN models have gained popularity in satellite image processing over other machine learning models, such as Support Vector Machine and Random Forest, due to their ability to produce efficient results (Sameen et al., 2018). Moreover, they possess capabilities for detecting spatial characteristics, complex spatial context, texture features, and intra-class spectral variability, which are essential for characterizing urban areas based on their morphology (Persello & Kuffer, 2020).

The application of CNN models on satellite images to extract information on socio-economic conditions yields significant results (Yeh et al., 2020). However, practical implementation faces notable challenges. Firstly, the success of CNN models relies heavily on the availability of large training data (Yeh et al., 2020). This poses a significant barrier, particularly in regions with limited or no ground truth data. Secondly, the inherent black-box nature of CNN models, lacking interpretability, hinders their application to address real-world problems (Gulum et al., 2021; Hall et al., 2022). Additionally, the computational complexity of these models and dependence on extensive computational resources present a barrier (Gevaert, 2022; Thompson et al., 2020), especially in regions with limited computing infrastructure.

Therefore, this study proposed and applied a CNN method to infer neighbourhood level urban socio-economic disparity maps in a data and resource scarce urban environment. The study first highlights the challenge of applying DHS wealth index data to predict wealth at the neighborhood level. Then, it proposes a method that uses human intelligence and CNN-based intelligence to address the challenges of limited training data, interpretability, and computational complexity for predicting wealth. This would bridge the gap in data and resource-scarce regions and enhance the adoption of such approaches among practitioners in urban planning and management. The study could contribute to enhanced monitoring efforts for implementing the Sustainable Development Goal, specifically Goal 11, which promotes safe and inclusive cities and human settlements (United Nations, 2015).

The rest of this paper is organized as follows: The next section presents CNN and satellite imagery for urban development and socio-economic analysis. The following section provides an overview of the study area and data. This is followed by the methods section, which explains the data preparation, model development process, and model application to predict wealth from satellite imagery. Finally, the results and discussion section present the findings, their corresponding discussions, and limitations, followed by a concluding section.

CNN models and satellite imagery for urban development and socio-economic analysis

Satellites capture information about urban areas, such as structural quality of housing, building density, and environmental conditions, which reflect the socio-economic conditions of these areas (Duque et al., 2015; Li et al., 2022; Vaishnnave et al., 2019). They have been used in mapping urban deprived areas based on access to essential infrastructure and services (Kuffer et al., 2020), deriving information about socio-economic status and wealth (Arribas-Bel et al., 2017), and modeling wealth (Georganos et al., 2019). Advances in computer vision, particularly the application of CNN models, have recently gained notable research enthusiasm in employing them to analyze satellite imagery for a comprehensive understanding of socio-economic conditions. Yeh et al. (2020) employed CNN to predict asset wealth across the African continent using multispectral satellite imagery. They successfully predicted wealth measurements matching ground data estimates, whereby their model was able to capture 70% of the variation in wealth. Castro and Álvarez (2023) utilized daytime and night-time satellite imagery and transfer learning to estimate average income, Gross Domestic Product per capita, and water index in Brasilia at the city level, achieving a 64% prediction accuracy. Li et al. (2022) presented an approach for predicting multi-level socio-economic indicators based on satellite images and street view images, outperforming baseline models by over 10% in accuracy. Hu et al. (2022) introduced a model for village-level poverty prediction using machine learning, high-resolution images, and geospatial data. Jean et al. (2016) developed a deep neural network model that leverages daytime and night-time satellite images to estimate poverty levels in Sub-Saharan countries. They established a link between image features and economic activities, with illuminated areas as key indicators.

Moreover, Hofer et al. (2020) applied medium-resolution satellite images to predict and map poverty statistics in the Philippines and Thailand, whereby their predictions were comparable to the official poverty estimates. Daoud et al. (2023) combined satellite data with census data to measure health and living standards in India. Using transfer learning, they have predicted a set of human well-being with an R2 of around 0.6 for 70 out of 93 measured health outcomes. In addition, several studies in Rwanda have leveraged CNN models and satellite imagery to investigate economic development. Uwizera et al. (2022) utilized satellite images and mobile phone data to assess road conditions by ranking and prioritizing roads for construction based on mobility trends. Mugiraneza et al. (2022) demonstrated the capability of CNN models in monitoring urbanization and its environmental impacts in the City of Kigali. Their study demonstrated that using satellite images, CNN can help monitor detailed urbanization and associated environmental impacts. Dufitimana and Niyonzima (2023) automated slum mapping in the City of Kigali using CNN on very-high resolution satellite images. Their approach differentiated slum areas from non-slum areas based on their physical characteristics, such as high building density and small-sized unplanned buildings.

Despite these advancements in the application of CNNs for measuring and tracking socio-economic development, few studies have specifically focused on urban socio-economic disparities in Rwanda. Head et al. (2017) used satellite images and CNN to predict poverty in Rwanda and three other countries using indicators such as electricity, mobile phone ownership, and child nutrition. However, their approach was sensitive to hyperparameter tuning and does not trivially generalize to predict other development aspects, such as access to drinking water. Jean et al. (2016) proposed a CNN method for poverty prediction in five African countries, including Rwanda, using survey and satellite data. Their proposed CNN method identified image features that explain 75% of the variation of economic outcomes at the country level using publicly available satellite images. However, these studies were based on the availability of reliable ground truth data and provided results that lack granularity for understanding socio-economic disparities at a small spatial scale level. Consequently, there is a pressing need for new approaches to address these gaps; hence, this study.

Study area and data

Study area

The study was conducted in Kigali, the capital of Rwanda, located in the eastern region of Africa. Administratively, Rwanda is divided into four provinces and the City of Kigali. Kigali has three administrative districts, as illustrated in Fig. 1. These districts are further subdivided into 35 sectors, encompassing 161 cells and 1,176 villages, with the village representing the smallest administrative unit/neighborhood (NISR, 2018).

Fig. 1
figure 1

Location of the City of Kigali in Rwanda (Data source: National Land Authority)

The city covers approximately 730 km2 and accommodates about 1.7 million residents (NISR, 2022). It also comprises land-use types, mainly commercial, residential, industrial, infrastructure, wetlands, agriculture, and forests (City of Kigali, 2019). There is a noticeable trend of land use change towards built-up areas, particularly in the core area under urbanization, where continuous densification and modernization are evident through modern buildings and upgraded road networks. The peri-urban areas have also been undergoing spatial transformations, reflected in the emerging modern residential neighborhoods around the old informal settlements and peri-urban zones, which still portray rural characteristics in significant areas under urbanization. Rapid population growth and considerable rural-to-urban migration significantly shape the shifting patterns of land use (NISR, 2022). Despite these changes, a substantial portion of the city's territory remains rural (Manirakiza et al., 2019).

The City of Kigali was selected for this study because it has witnessed unplanned urbanization over the past few decades, resulting in significant spatial disparities among its residents (Finn, 2018). This has led to the proliferation of unplanned/informal settlements, which house approximately 60% of households, many of which are located in hazardous and disaster-prone areas like steep slopes inclined as much as 50% and low-lying areas with less than 2% elevation (City of Kigali, 2019).

The urban landscape in Kigali is characterized by a mix of planned neighborhoods, informal settlements, and transitional mixed neighborhoods (Manirakiza et al., 2023; Uwizeye et al., 2022). Planned neighborhoods are characterized by clear spatial organization, well-structured road networks, and improved access to urban amenities and services. Higher-income residents often inhabit these neighborhoods. In contrast, informal settlements consist of small, overcrowded dwelling units without clear spatial organization. Low-income people are highly occupied with limited access to essential urban amenities and services. Mixed neighborhoods represent a fusion of the planned and informal, which emerge through expanding planned neighborhoods or gradually improving informal settlements.

The urban landscape of Kigali highlights the urbanization challenges, which underscore the urgent need for effective urban governance to achieve socio-economic development, reduce inequality, and promote inclusivity and equity. In response, the government of Rwanda has made efforts to address these urbanization challenges through crafting and implementing urban redevelopment schemes, which have been relatively successful. The most recent master plan, approved in 2020 through the review of the master plan released in 2013, is one of the schemes guiding the Kigali city redevelopment whereby the current focus is the provision of urban infrastructure, including upgraded roads, green spaces, and new housing (City of Kigali, 2019).

Based on spatial transformations driven by the implementation of the previous master plan, the City of Kigali is now recognized as an emerging model of urban development in Africa (Baffoe et al., 2020). However, these schemes face criticism for their lack of inclusiveness and equity. For instance, the city residents have witnessed more informal settlement clearance and relocation cases, which infringed upon the livelihoods of the affected urban dwellers who are primarily employed in the informal sector (Esmail & Corburn, 2020; Finn, 2018). Consequently, many of them, such as the poor and vulnerable populations, can no longer afford to live in the urban fabric, which results in their displacement towards rural settings outside of Kigali (Finn, 2018; Nikuze et al., 2020).

Therefore, the above highly diversified socio-economic environment positions the City of Kigali as a representative case for rapidly urbanizing cities, and the findings are likely to hold relevance for numerous similar cities in the Global South.

Data and data source

The study employed Very High-Resolution satellite images covering the City of Kigali as the primary input data. The images were sourced from Google Maps via QGIS, a free and open-source Geographic Information System software, with a zoom level of 15, sufficient to achieve a spatial resolution of 50 cm. This resolution is enough to identify various features in the images, such as buildings and roads. These images were a set of mosaic and mostly cloud-free image tiles captured between 2020 and 2022 (see detailed information at https://developers.google.com/maps/documentation/tile/satellite), and they consisted of Red, Green, and Blue spectral bands. Additionally, the research used geolocated Rwandan Demographic and Health Surveys (DHS) data for 2019–2020 obtained from https://dhsprogram.com/data/dataset/Rwanda_Standard-DHS_2019.cfm?flag=0. The survey was conducted on 61 geographical clusters across the city, which are enumeration areas corresponding to the village level or a part of it in Rwanda, through which a random selection of households involved in the survey was conducted.

The DHS generally collects data on a diverse range of socio-economic indicators related to household wealth and well-being (Steele et al., 2017; Yeh et al., 2020). This study focused specifically on the wealth index, which is a variable included in Rwandan DHS data and represents a score that ranks each household's wealth based on ownership of assets such as television, bicycle, or car and housing characteristics such as roof materials, source of drinking water and toilet facilities (Fatehkia et al., 2020; Georganos et al., 2019). This index is widely used for assessing and modeling the socio-economic conditions of areas (Fatehkia et al., 2020; Head et al., 2017; Jean et al., 2016; Yeh et al., 2020). Thus, the index fits for this study for understanding socio-economic disparities. The index ranges from 1 to 5, denoting the lowest to the highest wealth households, respectively, with a mean of 4.37 and a standard deviation of 0.96 across the City of Kigali. This study also employed village boundaries of the City of Kigali, the smallest administrative level in Rwanda, as the neighborhood level that denotes the analysis unity. Villages data were obtained from the National Land Authority (NLA).

Methods

Data preparation

The data preparation involved labeling satellite image patches through two distinctive methods: data-driven labeling and human-guided labeling. In the data-driven approach, we utilized the DHS wealth index data to generate training data, drawing inspiration from previous studies (Head et al., 2017; Jean et al., 2016; Steele et al., 2017; Yeh et al., 2020) that employed similar techniques to model socio-economic disparities. On the other hand, human-guided image labeling employed human intelligence for satellite image annotation. This method is crucial in the proposed approach to assess urban socio-economic disparities using high-resolution satellite images, especially when baseline statistical data is unavailable or unreliable. These data labeling approaches allowed a comprehensive understanding of further analysis of urban socio-economic dynamics.

Data-driven labelling

The data-driven labeling process undertaken in this study aimed to establish associations between satellite images and the DHS wealth index across 61 geolocated survey clusters. Initially, the wealth index of each survey cluster was represented by its frequent value (statistical mode) to label training images. This approach was necessitated by the availability of survey cluster geographical coordinates. However, a significant challenge emerged due to the random displacement of DHS cluster locations as a privacy protection measure employed by the DHS program. This led to variations ranging from 2 km in urbanized areas to 5–10 km in rural areas. This displacement, outlined as a potential drawback by Watmough et al. (2019), negatively affects the analysis using DHS geolocations at smaller geographic scales. In response to this challenge, the study adopted a method proposed by Perez-Heydrich et al. (2016) and recommended by the DHS program. This technique addresses the negative impact of random displacement by generating additional points within neighborhood buffers ranging from 0 to 5 km. Accordingly, eight duplicate random location points were generated for each original DHS cluster point location, considering the four main directions: east, west, north, and south, based on buffers of 500 and 1000 m (four points for each buffer). The selection of these buffer sizes considered the complexity and heterogeneity of urban areas across the City of Kigali. This could cause points to shift between areas with significantly different urban morphology.

Consequently, the original 61 DHS survey points resulted in the generation of 488 additional points, considering the buffer areas and directions. Combined with the actual DHS survey points, these points formed the basis for labeling 549 image patches (Fig. 2). Each image patch measured 512 by 512 pixels, corresponding to a size of 256 by 256 m, a scale assumed to be enough for capturing most spatial features of a given area based on the spatial resolution of the satellite images. The process of refining the dataset underwent further enhancements, which involved the exclusion of data points located beyond the city boundary.

Fig. 2
figure 2

Points representing DHS clusters (red) and generated points (yellow) across the study area (Imagery data: Google imagery)

Additionally, efforts were directed toward resolving instances where multiple points were assigned to the same image patch. In these cases, the image patch received a label corresponding to the most frequently occurring wealth index among the associated points. To ensure a balanced dataset, image patches labeled with an index of 5, which were few, were merged with those labeled as index 4, indicating high wealth. The resultant dataset comprised a total of 513 labeled images distributed across wealth categories as follows: 143 were labeled as high wealth, 114 as medium wealth, 120 as low wealth, and 136 as very low wealth.

It is crucial to acknowledge potential biases introduced in this data preparation process. Such biases could emerge from data displacement and the utilization of a data-driven methodology and contribute to the introduction of random noise, imposing constraints on the performance of the developed model. Consequently, this study has stimulated the exploration of methods applicable in scenarios lacking baseline statistical data or facing questions regarding reliability. This step was essential in the relentless pursuit of accurate and reliable insights. Therefore, by recognizing these challenges, this study attempts to enhance the robustness and validity of input data to the model by using human intelligence to interpret and label satellite images, as discussed in the following section.

Human-guided image labeling

Human-guided image labeling serves as the foundation for this study, where the core idea involves integrating human expertise into the data labeling process to produce labeled image patches for training the model. The underlying hypothesis is that satellite images capture essential features such as roads, buildings, and outdoor spaces, which provide insights into spatial patterns of human activities and can reflect conditions related to resource allocation and socio-economic development (Han et al., 2020; Li et al., 2022).

In a departure from conventional reliance on statistical data, this study benefited from the support of four graduate students to label image patches. They specialize in urban studies and comprehensively understand the local context. A briefing session was conducted to establish consistency and mitigate potential biases. This session aimed to reach a consensus on labeling standards, drawing insights from relevant literature (e.g., Duque et al., 2015; Han et al., 2020; Li et al., 2022), as outlined in Table 1.

Table 1 Labelling standard for manual expert rating-based image labeling. Developed based on a synthesis of the literature on earth observation for urban studies, including Kuffer et al. (2018), Ahn et al. (2023), Duque et al. (2015), Han et al. (2020), Li et al. (2022), Steele et al. (2017), Yeh et al. (2020) and Head et al. (2017)

To further ensure the robustness of the labeling process, the labeled image patches underwent a careful comparative analysis whereby any contradictions in labeling were excluded. This process labeled 806 image patches, categorized into 211 as high, 187 as medium, 199 as low, and 209 as very low wealth. This methodology was adopted based on the recommendations by Head et al. (2017), indicating that manually labeled features from satellites would yield good results for predicting human development. The methodology was also drawn from prior approaches used to generate scores for urbanization level (Han et al., 2020) and assess the quality of the urban environment (Liu et al., 2017). By employing human expertise with a carefully crafted process, this approach ensures the accuracy of labeled data and addresses potential biases, ultimately enhancing the reliability and relevance of model training.

The process of manually labeling satellite images based on features capturing human activities and reflecting economic development adheres to defined standards, as outlined in Table 1. This standardization ensures consistency and reliability in the labeling process, which is crucial for generating accurate and meaningful data. For example, areas with high wealth signify the prevalence of features associated with wealthy neighborhoods, such as fashionable buildings, well-maintained outdoor spaces, and improved road networks. The significance of this methodology lies in its alignment with recent research findings, which have demonstrated a strong correlation between visual features captured in satellite imagery and various socio-economic indicators in urban settings (Head et al., 2017; Jean et al., 2016; Yeh et al., 2020).

By leveraging visual features such as building size and arrangement, environmental quality, and infrastructure density, this approach offers a good understanding of socio-economic dynamics within cities. Importantly, it distinguishes the current study from others that rely on the conventional assumption of readily available baseline socio-economic data from census and surveys for training machine learning models (e.g., Castro & Álvarez, 2023; Daoud et al., 2023; Jean et al., 2016). Unlike traditional methods, which may be constrained by the availability and accuracy of such data, satellite imagery provides a more comprehensive and real perspective on urban landscapes.

Moreover, this approach presents a viable alternative to mitigate potential biases caused by DHS cluster location displacement, a common challenge in the application of DHS data. By directly analyzing visual indicators present in satellite images, the study avoided these limitations, which would also enhance the adaptability of the study to regions lacking reliable statistical data. This adaptability is particularly advantageous in contexts where official statistics may be outdated, incomplete, or inaccessible, allowing for a more holistic and inclusive analysis of socio-economic conditions.

Figure 3 showcases examples derived from human-guided labeling, offering tangible insights into the practical application of the approach.

Fig. 3
figure 3

Expert rating-based image labeling examples (Data source: Google Imagery)

Model development process

Overview

Given a dataset of labeled image patches \(\text{D= }{\left\{\left({\text{x}}_{\text{i}}\text{, }{\text{y}}_{\text{i}}\right)\right\}}_{\text{i=1}}^{\text{N}}\), where xi represents the image patch, and yi is the corresponding label indicating the wealth class. The goal was to propose a model f: x ↦ y that learns from the sequence of image patches to the corresponding label using a CNN architecture, which minimizes the loss function over the entire training dataset by solving the optimization problem as shown in Alzubaidi et al. (2021) following the Eq. 1:

$$\underset{\theta }{\text{min}}\text{F} = \frac{1}{{\text{N}}}\sum_{\text{i=1}}^{\text{N}}\text{L(}{\text{y}}_{\text{i}}\text{ , f(}{\text{x}}_{\text{i}}\text{))}$$
(1)

Here, L(yi, ƒ(xi)) expresses the loss function that measures variation between the actual and predicted class labels, and N is the number of image patches in the dataset. The study employed categorical cross-entropy loss, a standard loss function that quantifies the difference between predictions and actual classes, due to its computational efficiency (Wang et al., 2022). From Eq. 1, it is defined in Eq. 2 as follows:

$${\text{L}}\left({\text{y}}_{\text{i }}\text{, f}\left({\text{x}}_{\text{i}}\right)\right)\text{ = -}\sum_{\text{j=1}}^{\left|{\text{Y}}\right|}\text{[}{\text{y}}_{\text{i}}^{\text{j}}\text{ . log (f(}{{\text{x}}_{\text{i }}\text{)}}^{\text{j}}\text{]}$$
(2)

Here, |Y| represents the number of classes in Y, yij is the actual class for class j, and ƒ(xi​)j is the predicted class that the i-th image patch belongs to class j. The solution involves a series of steps, including model setup, data augmentation and feature extraction, model training, validation and inferencing, and interpretation, as described in the following sections.

Model setup

Three CNN model architectures, built based on essential building blocks of CNNs (see Alzubaidi et al., 2021), were explored (Fig. 4). The first model architecture, termed Base-CNN, comprises four convolutional layers using 3 × 3 filters and a Rectified Linear Unit (ReLU) activation function to introduce non-linearity to the network by replacing negative pixel values with zero (max (0, x)). Max-pooling layers with a 2 × 2 pool size follow the convolutional layers, downsampling feature maps and identifying image features like edges and corners. A flattening layer converts the pooled feature maps into a linear vector, followed by two fully connected layers. The initial fully connected layer contains 128 units with ReLU activation. The final fully connected layer is equipped with a SoftMax activation for multiple classes and assigns probabilities to each class based on the output distribution. The second model architecture, the Transfer-CNN model, employs transfer learning based on the VGG16 architecture, chosen for its balance of compactness and high accuracy in the ImageNet image classification challenge (Russakovsky et al., 2015). The top fully connected layers of VGG16 were removed to form a feature extractor with pre-trained weights. A new classifier is added, including a flattening layer, a dense layer with 128 units with ReLU activation, and a final layer similar to the Base-CNN model above. The last model architecture, the Custom-CNN model, combines transfer learning using VGG16 with four additional convolutional layers. The VGG16 layers are frozen to preserve their features. The four convolutional layers are similar to those in the Base-CNN model, and the last layers are similar to those of the previous architectures, with Max-pooling added to enhance feature extraction. The output is then flattened and connected to fully connected layers for classification. All models are optimized using the Adam optimizer and cross-entropy loss function (see Eq. 2).

Fig. 4
figure 4

Proposed model architectures for predicting wealth index and urban socio-economic development based on a synthesis of literature including Alzubaidi et al. (2021), (Wang et al., 2022), and (Russakovsky et al., 2015)

Data augmentation and feature extraction

The study used data augmentation to generate more variations of the training dataset (Taylor & Nitschke, 2018), following the popular data augmentation techniques of geometric transformations with small random horizontal and vertical shifts that create mirror images of image patches (Han et al., 2018). On the other hand, the study used a pre-trained VGG-16 network as a feature extractor since it has the capacity to identify low-level image features (Russakovsky et al., 2015). This feature extraction technique is known as transfer learning and is advantageous when working with limited training data, as highlighted by Suel et al. (2019). This technique has been previously used by Castro and Álvarez (2023) to predict socio-economic indicators and by Xie et al., (2016) to predict poverty using night-time lights and daytime satellite imagery.

Model training, validation, and mapping of socio-economic disparities

The model training and validation processes followed a recommendation by Alzubaidi et al. (2021). The batch size, which indicates the number of image patches simultaneously fed into the model, was fixed at 32, and the models underwent 30 epochs, striking a balance between minimizing training time and reducing computational resources. This setup reflects a consideration of the trade-off between model accuracy and computational efficiency. The input data were split into training (70%), validation (15%), and test (15%) sets, aligning with the approach suggested by Nguyen et al. (2021) to ensure a well-distributed dataset for effective training and evaluation. This data splitting was also influenced by the limited data available for this study.

Furthermore, the research conducted a sensitivity analysis. This followed a one-at-a-time sensitivity analysis approach whereby only input to the model was modified at a time while keeping the model architectures constant (Borgonovo et al., 2022). This analysis involved testing the performance of the model trained using the entire DHS dataset, followed by iterative augmentation with 5% of human-guided data in each subsequent iteration until all human-guided data were integrated. Subsequently, the predictive performance of the model trained with both DHS and human-guided data was compared to that of the model trained solely on DHS data. This analysis relied on the concept that integrating satellite imagery, machine learning, and community knowledge could facilitate the mapping and comprehension of spatial disparities on a large scale (Head et al., 2017). Furthermore, DHS data is often collected at a large scale and covers extensive demographic and health-related variables across different (Steele et al., 2017; Yeh et al., 2020). Integrating this DHS data with Human-guided data could enhance the performance and extend its relevance to diverse geographic locations, thus increasing its scalability and applicability.

Accuracy, recall, and F1-Score evaluation metrics were applied to the validation set, following the methodology proposed by Nivaggioli and Randrianarivo (2019), to objectively validate the performance of the models. Furthermore, the comparison of model predictions with actual DHS data provided a practical benchmark for the model's real-world performance. Moreover, the deployment of the trained model involved a mapping process, following the approach used by Hofer et al. (2020), where the average predicted probabilities within each village were calculated. Thus, a spatial map M was created to visualize socio-economic disparities in the City of Kigali following Eq. 3 below:

$${\text{M}}_{\text{ij}}\text{=}\frac{{\sum }_{\text{x }\in \, {\text{neighborhood}}_{\text{i}}}\prod \left({\text{f}}\left({\text{x}}\right)= \text{j} \right)}{{\text{N}}_{\text{i}}}$$
(3)

Here, x represents the image patch, Π (f(x) = j) is the product notation, and Π represents the product of the indicator function f(x) = j for each image patch x in the neighborhood. The indicator function yields a value of 1 if the predicted wealth class of image patch x is j and 0 otherwise. Ni presents the total number of image patches within the neighborhood i.

Model interpretability

Attribution maps were used to visually examine the areas and associated features in input images that were most important for the model to make a particular class prediction, as described by Hall et al. (2022). These maps are commonly used in computer vision applications to understand and interpret the basis of CNN model predictions (Montavon et al., 2018). The attribution maps have been instrumental in medical imaging, identifying visual patterns learned by CNN models dedicated to predicting various diseases (Caruana et al., 2015; Gulum et al., 2021). The methodology employed for the computation of attribution maps is Gradient Class Activation Mapping (Grad-CAM). This technique facilitates the visualization of decision-influencing areas as heat maps (Selvaraju et al., 2017). This method was employed by initially calculating the Gradients of the highest predicted class for the input image relative to the activation of the final convolutional layer. This was followed by a calculation of the importance of each channel in the feature map of the last convolutional layer based on the gradient. Lastly, the Grad-CAM was obtained by multiplying each channel in the last convolutional layers element-wise by its corresponding weight W, as illustrated in Eq. 4.

$$\text{Grad-CAM = g(x) }\bullet \, \frac{1}{{\text{C}}} \, \sum_{\text{i=1}}^{\text{c}} \, \left(\frac{\partial {\text{S}}}{\partial {\text{g}}\left({\text{x}}\right)}\right)$$
(4)

where g(x) represents the output of the final convolutional layer (before fully connected layers), \(\frac{\partial {\text{S}}}{\partial {\text{g}}\left({\text{x}}\right)}\) is the gradient of the score with respect to the feature maps, and C represents the number of channels in the feature map.

The experiments were conducted using TensorFlow on Google Collaboratory notebooks with a standard GPU on an ordinary personal computer. This facilitated the process for optimal performance and efficiency.

Results and discussion

Performance of the models

The initial findings present evaluation metrics for assessing models trained with DHS wealth index data and human-guided data to predict wealth at the village level throughout the City of Kigali. The results obtained from the experiment are presented in Table 2, with the most effective model architecture highlighted in bold.

Table 2 Performance of models on DHS- wealth index and Human-guided data (Data source: Data analysis)

The Transfer-CNN model architecture exhibits a moderate performance with an accuracy of 0.524, a recall of 0.434, and an F1-Score of 0.487 compared to the Base-CNN and Custom-CNN model architectures trained on the DHS wealth index dataset. This modest performance may stem from introducing random noise during data preparation and the inherent disturbances caused by the displacement and regeneration of survey cluster locations, negatively impacting the model's effectiveness, as also highlighted by Perez-Heydrich et al. (2016). In contrast, the Custom-CNN model architecture, trained on human-guided data, excels with an accuracy of 0.886, a recall of 0.880, and an F1-Score of 0.880. It surpasses both the Base-CNN and Transfer-CNN model architectures on the same dataset. This performance is attributed to the inclusion of human knowledge in data preparation, enabling the model to adeptly learn and identify features present in the image.

Moreover, the Transfer-CNN model architecture demonstrates efficiency compared to others, requiring only 15 min for the DHS wealth index and 31 min for human-guided data. This notable reduction in training time can be attributed to the utilization of pre-trained weights from the VGG-16 model, which speeds up the learning process, as highlighted by Castro and Álvarez (2023), Han et al. (2018) and Xie et al., (2016). In contrast, the Custom-CNN necessitates a longer training duration, falling between the Transfer-CNN and the Base-CNN. Notably, the Base-CNN, trained from scratch, incurs the lengthiest training times for both datasets. These results highlight the inherent trade-offs between model complexity and training time, underlining the advantages of employing transfer learning in scenarios where time efficiency is paramount in model development, as emphasized by Thompson et al. (2020). Moreover, the model selection and technique employed were chosen, considering constraints such as limited computational resources, as outlined by Gevaert (2022). This approach aligns with a practical computational resource-based strategy for predicting wealth and ensuring applicability in resource-constrained settings, particularly in cities with limited computational resources.

Further assessment of the model performance included quantitative evaluation of the relationship between the predictions generated by the model trained on the DHS wealth index and human-guided data and the actual DHS wealth index at the cluster level. This analysis employed Spearman correlation, due to the ordinal nature of the data (de Winter et al., 2016), to provide crucial insights into the model's accuracy in predicting wealth. The performance yielded an R2 of 0.70 for predicted wealth based on the DHS wealth index and an R2 of 0.65 for predicted wealth based on human-guided data (Fig. 5). This indicates that the model trained on the Wealth index was able to explain 70% of the variation of the cluster-level wealth index. In contrast, the model trained on human-guided data was able to capture 65% of the variation of the cluster-level wealth index. This demonstrates the predictive power of the model, highlighting its usefulness in assessing the socio-economic well-being of different areas. This result is comparable to that of Jean et al. (2016), Steele et al. (2017), and Yeh et al. (2020) that outperformed previous studies such as Blumenstock et al. (2015), Pokhriyal and Jacques (2017) and Tusting et al. (2019).

Fig. 5
figure 5

DHS wealth index (a) and socio-economic development (b) predicted at the cluster level (The size of the circles and the number represent the DHS survey cluster and their numbers) (Source: Data analysis)

Model sensitivity analysis

Experimental analyses assessing sensitivity involved initially training the model using the entire DHS dataset and subsequently augmenting it with portions of human-guided data in each iteration until all human-guided data were utilized. The results (Fig. 6) highlight the training performance of the model trained with a different subset of both DHS and human-guided data.

Fig. 6
figure 6

Performance by the amount of human-guided added to DHS training data used: a Accuracy, b Recall, c F1-Score (Source: Data analysis)

The results compare the performance of the Base-CNN, Transfer-CNN, and Custom-CNN, as they are increasingly supplemented with human-guided data. Among these, the Base-CNN architecture demonstrates the lowest level of performance, indicating potential limitations in its capacity to leverage additional data for improvement. In contrast, the Transfer-CNN shows a substantial enhancement in performance when compared to the Base-CNN, suggesting that it benefits more significantly from integrating human-guided data. Notably, the Custom-CNN surpasses the performance of the other two models and exhibits a progressively more significant advantage as more human-guided data is added. This observed trend strongly supports the conclusion that the Custom-CNN architecture is particularly well-suited to effectively utilize larger volumes of human-guided data. These findings underscore the importance of choosing an appropriate model architecture like the Custom-CNN to maximize the efficacy of machine learning systems with extensive human-guided data inputs.

Furthermore, the research compared the prediction results of the model trained using a combination of DHS and Human-guided data to the original DHS wealth data at the cluster level through Spearman correlation (see Fig. 7). The model trained on the combined datasets achieved an R2 of 0.73, indicating that it could explain 73% of the variation in the DHS wealth index at the cluster level. This performance exceeds the model trained solely on DHS and Human-guided data. It is highly comparable to the performance reported by Yeh et al. (2020), where their model explained 70% of the variation in ground-measured village wealth. It is also comparable to the findings of Steele et al. (2017), who achieved 78% of the variation in poverty.

Fig. 7
figure 7

DHS wealth index predicted at the cluster level using DHS data combined with Human-guided data (The size of the circles and the number represent the DHS survey cluster and their numbers). (Source: Data analysis)

Urban socio-economic disparities maps of Kigali

The spatial patterns of wealth inferred from the DHS wealth index (Fig. 8) offer insights into the socio-economic landscape of the City of Kigali. Notably, the central core of the city emerges as a focal point of wealthy areas, representing its socioeconomically advanced standing. This concentration of wealth mirrors a global urbanization trend observed in many cities, where the central business district and wealthy residential areas gravitate towards the heart of the urban area, as reported by the United Nations (2020).

Fig. 8
figure 8

Wealth predicted at the village level (based on DHS Wealth Index) across the City of Kigali (Source: Data analysis)

On the contrary, the outskirts of the city, characterized by peri-urban and dominant rural settings, exhibit low and the lowest wealth, respectively. This contrast underscores the prevalent urban–rural divide, shedding light on the challenges faced by peri-urban areas and rural communities in accessing urban resources, as also highlighted by Carter (2018). Particularly noteworthy is the southeastern side of the city, which stands out as a focal point undergoing gentrification, marked by wealthy areas. This phenomenon in this region can be attributed to strategic urban development initiatives, which involve significant relocations of residences from the urban center to more affordable areas, coupled with substantial investments in infrastructure, notably in industries and housing. Consequently, this transformation is gradually altering the landscape of what were previously underdeveloped areas toward wealthy areas across the city boundary and its surroundings, as underlined by Baffoe et al. (2020), Nikuze et al. (2020), and Uwayezu and de Vries (2019).

However, the results indicate a limited representation of areas demonstrating medium wealth. Typically, these areas are characterized by moderately sized and modern buildings, improved building roof structures, and reasonably paved or arranged roads, as detailed in Table 1. The model did not adequately capture these areas, and this deficiency can be attributed to the model's performance, which was negatively impacted by the use of noisy data during training, as discussed earlier. This limitation implies that the inferred results should be interpreted with caution. Overreliance on such information for decision-making could lead to suboptimal outcomes, as it may overlook the needs and challenges faced by communities with low or medium wealth indices.

In the search to address the weaknesses of the model to adequately predict wealth disparities when trained on the DHS wealth index, a crucial shift was made in the data labeling process. Human knowledge was introduced to create a training dataset, providing an alternative source of information without solely using non-reliable statistical data. Furthermore, the research conducted experiments involving training the model using a combination of DHS and human-guided data. Consequently, the study pursued to refine the model's understanding of visual features, which can be associated with socio-economic conditions in urban areas. This approach was built upon the foundational research of Yeh et al. (2020), Jean et al. (2016) and Xie et al., (2016). However, this research differentiates from its predecessors, which primarily focused on predicting poverty, wealth, and economic well-being at coarse levels such as the country, regional, and continental scale based on readily available statistical data. The research introduces this approach, which uses human-guided labeled data to enhance the model's comprehension of socio-economic conditions within specific contexts. Unlike conventional models that rely on pre-defined visual features associated with socio-economic development, this approach allows the model to learn from all visible features in images. This imitates the human process of identifying wealthy or poor urban areas. Consequently, this could facilitate scalability across diverse regions, enabling the derivation of crucial maps depicting socio-economic disparities.

The outcomes of this approach are presented in Fig. 9, illustrating the efficacy of the model in leveraging human-guided labeled data, and Fig. 10, showcasing the combination of DHS wealth data and human-guided data to enhance its capacity for recognizing and interpreting visual indicators of socio-economic conditions.

These findings (Figs. 9 and 10) enhance the value of the previous results (Fig. 8) by comprehensively representing spatial socio-economic disparities aggregated to the smallest administrative units. The findings align with previous observations made by Baffoe et al. (2020) and Manirakiza et al. (2019). Furthermore, the results reveal that wealthy villages are characterized by residents enjoying enhanced access to urban infrastructure and services. Villages with medium wealth also exhibit less access to infrastructure and services at a moderate rate than wealthy villages. Conversely, villages with low wealth consist of a mixture of planned and informal settlements with limited urban infrastructure. These findings are supported by another study by Uwizeye et al., (2022), who revealed that some residents of informal settlements in the City of Kigali have inadequate access to education and healthcare services despite the existing policies, which support the provision of water and efforts to urban informal settlement upgrading.

Fig. 9
figure 9

Wealth predicted at the village level (based on human-guided data) across the City of Kigali (Source: Data analysis)

Fig. 10
figure 10

Wealth predicted at the village level (based on DHS data combined with human-guided data) across the City of Kigali (Source: Data analysis)

On the other hand, areas with very low wealth exhibit scarce access to urban infrastructure, such as transportation facilities, and predominantly exist in rural settings, which are less attractive for urban development. The results affirm the existence of urban socio-economic inequalities in the City of Kigali, as observed in previous studies by Carter (2018) and Manirakiza et al., (2019, 2023), highlighting significant socio-economic disparities across the city based on income sources, employment, and living conditions. According to these studies, higher-income groups typically have self-employment or formal sector jobs and live in planned and well-served neighborhoods. In contrast, low-income groups often work in casual or seasonal employment and reside in informal settlement areas. Additionally, lower-income groups often depend on farming and live in rural settings. This complex interplay of spatial wealth distribution underscores the challenges faced by different regions, considering not only urban cores but also taking into the peri-urban and rural areas.

Model interpretability

The activation maps generated by the Grad-CAM were used for visual interpretation of the basis of the model, which is highlighted and overlaid on top of sample images in yellow (Fig. 11). They indicate features in the images that contributed most to model's basis to make the decision for prediction. The areas depicted by the model trained on the DHS wealth index show that the basis for the prediction made by the model trained on DHS wealth index data is difficult to interpret using human observation since no clear visual pattern can guide the visual judgment. The model trained on a combination of DHS wealth data and Human-guided data reveals slight patterns of features that served as the basis for the model decision. Differently, the areas depicted by the model trained on human-guided data show clear patterns that serve as the basis for model predictions whereby the model learned most of the visible features for the respective class, as described in Table 1. The model learned features such as road quality, buildings, and outdoor environment to make predictions. Thus, the model focused on areas with key features that help to differentiate classes more accurately. This improves the interpretability of the CNN model, which sheds more light on visually interpreting the model's basis for prediction, even for non-machine learning experts. Therefore, this would facilitate understanding how the model works and allow urban planners and other stakeholders who do not necessarily have machine learning skills to integrate this approach as a tool to guide decision-making. These results show that the study enhanced the interpretability of CNN models for predicting wealth based on satellite images, which might cut the barrier of these models for acting as black-box functions, as mentioned by Gulum et al. (2021) and Hall et al. (2022).

Fig. 11
figure 11

Heat maps extracted by Grad-CAM display parts of the images (in yellow), contributing to the model prediction (Source: Data analysis)

Limitation

The study highlights the challenges of employing DHS wealth index data to predict socio-economic disparities at the village level. It proposes the use of human-labeled data to enhance the accuracy of predictions. However, the involvement of human knowledge in data labeling may potentially lead to bias linked to socio-cultural perspectives of socio-economic development. Moreover, though the proposed approach provided meaningful predictions, socio-economic disparities may vary within the village and household levels. Thus, the proposed approach could not capture intra-village or house-to-house socio-economic disparities. Still, village-level estimates are believed to be efficient for policy implementation since they reflect the smallest administrative areas (in the context of the city of Kigali). Additionally, the predictions are at a very fine-grained spatial scale compared to the fine-grained 10 × 10 km grid level used in most research (see Xie et al., 2016). Nevertheless, the use of a grid can be beneficial when scaling this research to other cities where the smallest administrative unit size is bigger, or data about the boundaries of small administrative units are not available (see Ahn et al., 2023).

This study introduces the approach that does not rely on the availability of statistical data; it focuses on visual features captured in satellite images. These features exhibit correlations with the socio-economic conditions prevalent among urban residents. However, it is essential to acknowledge that this approach might not capture the entire range of socio-economic conditions as comprehensively as traditional censuses or surveys. For instance, widely utilized tools like DHS typically capture asset ownership, including possession of items like mobile phones, radios, and televisions, commonly employed to measure wealth and poverty. Tracking such possessions through satellite imagery is challenging. Nevertheless, it is worth questioning the appropriateness of such material ownership as an indicator of socio-economic status in contemporary times, where many urban populations may own such assets while still struggling with poverty.

Scalability

This study serves as the initial step in profiling city-scale urban socio-economic disparities at the smallest administrative level and addressing the scarcity and unreliability of baseline statistical data in the Global South. Future directions arising from this study involve extending this research across different cities for a cross-city comparison of urban socio-disparities and consideration of multiple aspects of socio-economic status such as income, education, and health for analysis. This expansion would necessitate investigating the transferability of trained models to cities with similar and distinct urban morphologies. By doing so, the application of the findings of this study can be scaled across diverse geographical contexts by leveraging machine learning and local knowledge. In addition, scalability also pivots on the availability and integration of diverse data, such as points of interest and street networks, for enriching predictive capabilities and providing a more holistic understanding of urban socio-economic disparities.

Moreover, scalability also entails considerations of data accessibility and inclusivity. While DHS data might be readily available in certain areas, there are regions where such datasets are either incomplete or outdated. In such cases, the integration of human-guided data offers a valuable complement, providing localized insights that conventional sources may not capture. However, manually labeling satellite images is labor-intensive, time-consuming, and costly. To address this issue, crowdsourcing platforms should be considered, which enable a large pool of contributors to complete distributed annotation tasks. The annotation process can be expedited by harnessing the collective efforts of volunteers or paid workers, thereby reducing both time and cost burdens associated with data labeling.

Conclusion

This study employed a CNN-based method to map urban socio-economic disparities at the village level, shedding light on the challenges associated with utilizing DHS wealth index data. The inherent issue of location displacement within DHS data introduces noise into the training dataset, affecting the accuracy of model predictions and potentially misleading decision-making processes. The resulting map generated by the model trained solely on DHS wealth index data often fails to adequately capture wealth disparities, particularly among medium and low-income classes, thus highlighting the limitations of relying solely on statistical data.

To address these shortcomings, the study introduced a CNN method incorporating human-guided labeled satellite images to predict village-level wealth and map spatial socio-economic disparities. This approach overcomes the limitations of traditional methods by combining statistical data with local knowledge, resulting in improved predictive accuracy compared to using DHS data alone. Importantly, this study achieves these results with relatively low computational cost settings, making it accessible and practical for resource and data-scarce urban settings. Furthermore, the study found that visual interpretation of the model's basis for prediction was challenging when using only DHS wealth index data but improved with the incorporation of human-guided data. This enhancement in interpretability facilitates a deeper understanding of the model's predictions and enhances its utility for decision-makers and urban planners.

The findings underscore the potential of this approach as an alternative method for generating comprehensive socio-economic disparity estimates in urban settings with limited data availability. However, it is crucial to note that this study does not intend to replace ground-based surveys and censuses but rather to amplify and complement existing approaches. This study complements existing methods in providing cities with valuable information to assess the outcomes of implemented urban development policies and regulations and track the trends in achieving the Sustainable Development Goals (SDGs). Thus, this approach contributes to accountability and efforts to address inequality, ultimately fostering more inclusive and sustainable urban environments.