Introduction

One of the most frequent mortality and morbidity etiologies is cerebrovascular accident with 6.2 million cases each year [1,2,3]. The changing demographics of the populations with more senior composition is expected to introduce more pressure on the healthcare system related to stroke and its sequels [1, 3]. Stroke can be classified into two major subgroups according to the mechanism which are ischemic stroke which accounts 87% of all stroke patients and hemorrhagic stroke [3, 4]. For the determination of stroke mechanism imaging studies are mandatory. Bamford classification is the most used classification of stroke based on clinical presentation [5]. In Bamford classification, the stroke patients are classified into 4 types which are total anterior, partial anterior, posterior and lacunar stroke based on the deficits detected in neurological examination. Determining the stroke territory is the usual one of the first steps to understand the mechanism of the stroke and to formulate possible etiologies. However due to the complexity of brain anatomy, both the vascular feeding zones and spatial localization of functional clusters of the brain are not fixed. Tao et al. assessed the diagnostic value of symptoms and signs for determining stroke territory for anterior or posterior stroke and found clinical neurological deficits were insufficient for accurate and reliable determination of stroke territory [6].

Non contrast CT coupled with CT angiography for large vessel occlusion and CT perfusion for at risk tissue evaluation are usually the main imaging modalities for acute stroke imaging. However, this largely depends on the available professional human resource and advanced medical imaging equipment. The initial evaluation with DWI with a shortened 5 min acute ischemia protocol for stroke confirmation and fast referral to a stroke center is also a feasible option. Additionally, DWI is gradually being accepted as the gold standard for the diagnosis of hyperacute ischemic stroke with a sensitivity of 73–92% within first 3 h of onset [1, 7].

Machine learning and deep learning methods gain considerable importance among medical imaging community for detection, segmentation, and classification tasks [8,9,10,11]. These methods help to decrease errors due to intense workload, decreased attention times and to compensate shortage of specialists to some extent. The fast and accurate diagnosis of stroke as anterior or posterior territory infarct can help for faster triage and formulation of the next steps in the absence of radiologists or other specialized staff to interpret the DWI scans.

In this study our main purpose was developing a deep learning based acute stroke classifier in DWI into anterior and posterior circulation infarcts, using voxel level information and dense prediction approach.

Materials and methods

Patient selection and image acquisition

Institutional Review Board approval was obtained from the local ethics committee for this retrospective study. Informed patient consent was waived by the board. DWIs of consecutive 291 acute ischemic stroke patients with first imaging from January 2017 to April 2020 in our hospital were reviewed. Follow up images were not included in this study. After excluding cases with severe artefact, with < 1 cm3 lesion volume and with watershed infarcts, 218 patients were remained (Fig. 1). The posterior stroke samples were composed of all the strokes arising from vertebro-basilary system including small thalamic and brain stem infarcts. Although the strokes related to fetal PCA technically arises from the anterior circulation the usual feeding region is similar to PCA arising from vertebro-basilary system. Since we did not have angiographic confirmation of the lesions, we couldn’t differentiate strokes arising from fetal PCA and included them in posterior stroke region. This dataset was used in another study of the authors [12]. However, both the methodology, targets and outcomes were completely different. In the previous study the images were classified slice-wise into middle cerebral artery, posterior circulation and watershed infarcts by transfer learning leveraging off the shelf pretrained classifiers which were EfficientNetB0 and MobileNetV2 [13, 14]. Furthermore, as the principal method, 3D volumetric assessment of the patients instead of 2D slice-wise classification approach and voxel wise dense classification that converted the multiclass segmentation results to multiclass classification results instead of weak slice-wise classification approach of the patients from the previous study were used (Fig. 2). Images were acquired at 1.5 T MR (MAGNETOM, Siemens Healthcare, Erlangen, Germany). Two diffusion protocols both with b = 1000 and 5 mm slice thickness were used. TR, TE, FOV and Matrix values for protocol 1 were 6600, 89, 230 × 230 and 192 × 192 while these values for protocol 2 were 5880, 61, 221 × 221 and 168 × 168 respectively. Protocol 1 was mainly used with more unstable patients to decrease movement artefacts by increasing the number of excitation (NEX). NEX was 3 for protocol 1 and 1 for protocol 2.

Fig. 1
figure 1

Overview of patient selection. After initial screening 303 eligible patients were detected. 11 patients were excluded due to severe artefacts interfering with diagnostic evaluation. 65 patients were further excluded since they were not anterior or posterior circulation infarcts. Finally, 9 patients were excluded which had very small infarct volumes less than 1 cm3. 218 patients with 141 anterior and 77 with posterior stroke were used for this study

Fig. 2
figure 2

Model architecture and study workflow for the classification pipeline. U-Net like model with 4 down sampling and 4 up sampling steps was built. Both single class stroke segmentation network and multiclass segmentation network were built with similar architecture. Ground truth for multi class model included two different masks representing anterior or posterior stroke regions. Predictions of multiclass U-Net model were further processed by majority voting. The most frequent class label for the voxels predicted by the multiclass U-Net (excluding background pixels) was assigned as patient level label for the affected territory by stroke. Initial input size of the model was 320 × 224 and initial number of filters was 64. Zero padding was applied in order to keep the dimensions of the feature map during convolution operations

Image preprocessing and mask preparation

3D Slicer (version 4.10.2) was used for preparing segmentation masks [15]. Infarct areas in the images were segmented using threshold and brush functions of 3D Slicer slice by slice. All 3D extent of the infarct zones was annotated volumetrically. A radiologist with 5 years of neuroradiology experience (YKÇ) segmented the images, labeled the territorial class and a senior neuroradiologist with 27 years of neuroradiology experience (FG) made a second evaluation. Their initial interrater agreement before consensus annotation was 0.86. The accepted segmentation masks and final ground truth class label assignments were determined by consensus. Two sets of masks were constructed. Initially the mask labels were annotated as a single class denoted by 1. For the main objective of this study which was voxel wise dense classification of stroke territory based on segmentation annotations, a mask transformation approach was utilized. The mask annotations of anterior territory infarcts were remained as they were, and the mask annotations of posterior territory infarcts were transformed into integer 2. By this way, the single segmentation masks of infarct zones were essentially turned into distinct mask objects representing different stroke regions (Fig. 3).

Fig. 3
figure 3

Overview of dataset building strategy. Mask transformations. Upper row: 55 year, male with left anterior stroke. Lower row: 59 year, male with left posterior stroke Red: Class 1 Green: Class 2. a Single class mask for all stroke lesions. In single class experiments, there was just one class, and model need to predict a binary decision either stroke or not. Therefore, all the stroke lesions were denoted as class 1 (red). b Multi class mask based on territory of stroke lesion. In multiclass experiments each stroke territory was assigned a class number, and each class number was denoted with a color. Red was for class 1 which was anterior stroke and green was class 2 which was posterior stroke. The conversion of masks into different classes based on the stroke location which it represents, rendered them as different segmentation targets. The multiclass segmentation model would predict class 0 (background) 1 (anterior stroke) or 2 (posterior stroke) for each voxel

Three sets of images were obtained for benchmarking purposes. First set was the original images. Second set was N4 Bias corrected version, and third set was registered version of N4 Bias corrected images to a common template of MNI [16, 17]. All three image sets were exposed to the same preprocessing steps which was provided in supplementary material preprocessing details.

Model architecture and training

2D U-Net like architecture which consists of an encoder part that transforms the image into low level feature vector as a bottleneck layer and a decoder which takes the bottleneck and transforms it into ground truth segmentation maps was employed [18]. 4 down sampling and 4 up sampling steps with 64 initial filter numbers were used. The details of the common U-Net model, model architecture and training were provided in the supplementary material model training Sect. 10% of the dataset was reserved for validation and 10% for testing (19 for each). The remaining 180 volumes were used for both single class stroke region segmentation and multiclass stroke territory-based segmentation objects with separate models. Dice scores were reported as a measure of segmentation efficiency of single class model. We also built multiclass segmentation model using the transformed masks with the same architecture and pipeline. Each mask label represented different stroke area in this input. We calculated the mean dice score averaged over all predicted stroke areas and over all patients for benchmarking purposes. The main aim of building this multiclass segmentation model was leveraging its outputs for the classification of affected territory by stroke. After getting 3D multiclass segmentation maps for each patient, majority voting to the predictions was applied to evaluate the most frequent class in each volume. Majority voting was utilized by counting the number of voxels for each class. The most frequent class which was denoted by the largest mask was assigned as final territory class of each patient. Accuracy, sensitivity, precision and f1 scores were reported for the classification.

Human readers’ study

Two radiologists with 14 years (ND) and 15 years (MS) of experience in neuroradiology were asked to relabel the images. The correlation of the assigned classes between readers were calculated using Cohen’s Kappa.

Subgroup analysis

We further analyzed the effect of lesion size on the classification performance of the models. In the posterior stroke group, the largest lesion size was 91 cm3 whereas in the anterior circulation group that was 371 cm3. We rearranged the dataset into 3 subgroups. The lesions with greater than 50 cm3, 20 cm3 and 10 cm3 were selected as subgroups and the classification performance of the model were analyzed in these subgroups.

Statistics

Phyton (Version 3.7.3 https://www.python.org) with standard modules and matplotlib package for visualization were used. Numpy package for matrix manipulations, pandas package for dataset manipulation and scikit learn package for statistics were used. TensorFlow coupled with Keras wrapper were used for deep learning implementation [19, 20]. Student’s t test was used for comparison of population means of ages and infarct volumes among two classes with p < 0.05 as threshold for statistical significance. Dice scores of ground truth masks and segmentation masks were reported. Classification report function of scikit-learn package was used for reporting the performance of classification model. Precision, recall, f1 score and accuracy were selected as classification metrics. Macro average of metrics was reported.

Results

Patients

Of the 218 patients, 141 (65%) were anterior stroke, and 77 were posterior stroke (35%). Of the 141 cases in the anterior stroke group, 74 (52%) were male and 67 (48%) were female and in the posterior stroke group 43 (55%) were male and 34 (35%) were female. The median age of the patients in the anterior stroke group was 65 (IQR: 29, the lowest: 32—the highest: 94) and in the posterior stroke group was 68 (IQR:24, the lowest: 42- the highest: 89). There were no statistical differences in two groups either for age or gender (p values were 0.73 and 0.44 respectively). All the demographic data, and the stroke volume statistics were shown in Table 1 and the histogram distribution of stroke volumes in Fig. 4. The mean stroke volume of the population was 32.29 cc (SD: 41.25) with greater volumes in anterior circulation (Mean: 41.32, SD: 46.75) than posterior circulation (Mean: 15.63, SD:19.65) (p < 0.01).

Table 1 Patient demographics and stroke volume distribution
Fig. 4
figure 4

Histogram of stroke volumes in anterior stroke, posterior stroke and whole dataset. The anterior stroke lesions had statistically significantly larger mean volume compared to posterior stroke lesions. Horizontal axis: Stroke volumes. Vertical axis: Number of patients

Human readers’ study

The agreement of both radiologist1 and radiologist 2 with the ground truth was good to excellent with 0.81 and 0.84 kappa values respectively. All disagreements were about small lesions residing at the intersection of anterior and posterior stroke regions. Although the watershed infarcts were excluded from the study, small lesions at the periphery of anterior and posterior stroke regions were included provided that the ground truth determining radiologists agreed on anterior or posterior class label for these lesions. Actually, these lesions had inherent uncertainty due to their small size and critical location. The ground truth determining radiologists (YKÇ and FG) had the clinical information, opportunity to discuss the evolution of the lesions with their colleagues and to see the follow up imaging studies. Therefore, their consensus opinion was used as ground truth for model development and baseline for comparison with the human reader study.

Segmentation models

The mean dice score of the segmentation model was 0.71 with a slightly better performance in anterior stroke segmentations (Mean dice: 0.73) than posterior stroke segmentations (Mean dice: 0.68), however without statistically significant difference (p = 0.2) (Table 2, Fig. 5). Scatter plot of Dice scores against stroke volumes clearly demonstrated that almost all of the Dice scores for anterior strokes larger than 50 cm3 were in the 0.7–0.9 range whereas almost all of the Dice scores for posterior strokes larger than 20 cm3 were in the 0.6–0.8 range (Fig. 6).

Table 2 Dice scores of ground truth segmentations and model predictions for ordinary single class model for the test set
Fig. 5
figure 5

Histogram of Dice scores of ordinary segmentation models for anterior stroke, posterior stroke and total dataset. Horizontal axis: Dice scores. Vertical axis: Number of patients

Fig. 6
figure 6

Scatter plot of stroke volumes against dice scores

For the multiclass segmentation the overall dice scores were lower, 0.46 for original images, 0.49 for N4 Bias corrected version and 0.51 for N4 Bias correction followed by registration to a common space version. However, the majority of voxel wise predictions of all three models were still sufficient to infer the territorial class.

Classification models based on aggregation of voxel wise predictions

After converting the voxel wise dense predictions of multi class segmentation results into territorial class labels by applying majority voting to the masks, we computed confusion matrices of final predictions.

The model built with original images reached 0.76 precision, 0.98 recall, 0.86 f1 score and 0.77 accuracy whereas the model built with N4 bias corrected images reached 0.79, 0.97, 0.87, 0.80 and the model built with images which were N4 bias corrected and then registered into a common space reached 0.81, 0.98, 0.89, and 0.83 values respectively. The performance metrics of the built models were provided in Table 3.

Table 3 Classification metrics of the proposed approach

Subgroup analysis

48 patients had stroke volume larger than 50 cm3, and 104 patients had stroke volume larger than 20 cm3. The number of patients who had stroke volumes larger than 10 cm3 was 137. Majority of the patients in our dataset had relatively small stroke volume, which affected the performance of the model. The scatterplot of dice scores against the lesion volumes was provided in Fig. 6. In the largest lesion group with greater than 50 cm3 stroke volume, the accuracy of the model reached 0.90. In two other subgroups which included stroke volumes more than 10 cm3 and more than 20 cm3, the accuracy was 0.87.

Discussion

We developed a fully automated, highly accurate, deep learning network for determining the territory of acute stroke as anterior or posterior circulation in DWI. Our network could determine the stroke territory end to end. This eliminates potential failures from necessary user interactions in semiautomatic models. Previous reports were based on slice wise prediction which might introduce some problems into the system. For the slice wise approach initially, the pathologic slices should be selected by the user which makes the system impractical for triage purpose. A fully automatic system obviates this requirement which makes the system more suitable into clinical translation. Additionally in slice wise systems the model can predict different classes for individual slices of the same patient. This can introduce additional uncertainties into the final predictions derived from different set of selected slices by different users. A fully automated end to end system can prevent this uncertainty and make its predictions more robust. To the best of our knowledge there are only 2 studies on prediction of stroke territory class in DWI including our previous study [1, 12]. Both studies approached the problem slice-wise and applied transfer learning by using off the shelf pretrained networks. This is a necessity to leverage the already learnt image prior features embedded in the weights of these networks with small medical imaging datasets. The authors reached 93% overall accuracy for prediction of middle cerebral artery, posterior circulation, or watershed infarcts in one of these and 86% for prediction of anterior stroke, posterior stroke or normal slices in the other one [1, 12]. However, in the study of Lee et al., since the majority of slices were usually derived from normal class even in already diseased patients, it is unclear how would they aggregate the prediction results for final patient level classification for clinically meaningful applications [1].

In this study we used a voxel wise prediction approach to derive final patient level prediction with the accuracy of 0.83 with our best model. The accuracy of this model reached 0.90 when it was evaluated on the lesions larger than 50 cm3 volume indicating its efficiency on more serious lesions. In medical imaging tasks, different levels of predictions might be possible as voxel wise, ROI wise, slice wise, modality wise and patient wise. Voxel wise prediction is the densest approach which decreases along the increasing patch size that is used for the prediction. To the best of our knowledge the only example of patient level prediction based on voxel wise predictions was used by Yogananda et al. for the prediction of MGMT methylation status of the glial tumors [21]. The authors reached 0.93 mean ROC_AUC which reflected the success of this approach over slice wise and modality wise approaches by more than 0.10% margin. Actually, using the full volume of the images of the patient to reach a final class is not straightforward for an image element which has no fixed shape, no fixed boundary properties, no fixed color or no distinct heterogeneity properties driven by underlying complex biology. The only determining feature for an acute stroke lesion in DWI is the location of the high signal intensity voxels conforming to a priori vascular feeding zone. Therefore, additional guidance driven by neighborhood voxels and collective behavior of voxels located in a similar region helped the voxel wise approach to perform better.

In convolutional neural networks, the training actually means learning the correct weights of the convolution filters in a broader sense. Although in lifelong learning approaches there is some dynamism of the values of these weights, in usual applications these convolutional filters are fixed structures, which implies that they affect every part of the image in a similar way. In a dataset like ours when the target could be anywhere, but the spatial location dictates the actual class label, a great challenge is introduced to the neural network. One of the main strengths of neural networks is the translation invariance of the convolution filters, that is the target object can be found in anywhere in the image. However, if the spatial location shift changes the class label, this property could be detrimental for the performance of the model. This was evident for the lower performance of our model before N4 Bias correction. The local magnetic field inhomogeneities and the magnetic properties of the scanned tissue might introduce some noise which is also affected by the distance to the isocenter of the receiving elements of the coil. This is known as bias field and usually evident as brighter regions in the scanned tissue which is not related to actual bright tissue instead is the consequence of the bias field [16]. There are numerous studies which did not care about the effects of the bias field, since the neural networks can learn to eliminate it, while focusing on the target. However, in a task like ours, when the target is just high signal in some location and the location dictates the class label, this effect could be detrimental as shown in Fig. 7. The model learned to predict both anterior and posterior circulation infarcts and had filters which would be activated when it has encountered bright signal at the corresponding locations. Therefore, although the model predicted correct regions as the anterior circulation infarct, the bright signal derived from bias field in the cerebellum lead to activation of posterior circulation infarct detector filters and a large false positive area for posterior circulation infarct was predicted which was eliminated after bias field correction (Fig. 7). This point is very import according to us because it emphasizes strongly that knowing the target domain and the target data while developing these systems has utmost importance.

Fig. 7
figure 7

Predictions of the models before and after bias field correction and registration. Lower row: images, Upper row: Images with mask overlay. a 55 year, male anterior stroke patient. The predictions which were displayed on the upper row correctly predicted most of the voxels for anterior stroke segmentation. However, there were great many false positive predictions for the posterior circulation due to high signal induced by the bias field. b Same patient after bias field correction. The bias field corrected images displayed darker signal now which was particularly evident in both cerebellar areas. Red: Predictions of model as class 1 (anterior stroke voxels) Green: Predictions of model as class 2 (posterior stroke voxels)

Our model demonstrated very good performance of 0.90 accuracy with the lesions greater than 50 cm3 and 0.87 accuracy with the lesions greater than 10 cm3. Although determining lesion origin for a large lesion can be a relatively easy task, this can turn into a challenge for small lesions near the edge of expected territories which was also evident by 0.81–0.84 interrater agreement for our overall dataset. The relatively high performance with 0.83 accuracy of our models including these small lesions was additional evidence of its usefulness under extreme conditions. The knowledge about the origin of the lesion can provide insights for the possible etiology, expected prognosis and the best management strategy based on the expected etiology and prognosis, therefore our model can be translated into practical clinical use [22, 23].

The medical image datasets are usually small which necessitates extensive data augmentation to alleviate this scarcity [24]. Actually, all deep learning studies benefit from data augmentation, but this requirement is stronger in medical imaging datasets due to this small dataset problem. The usual minimal set of data augmentation operations are rotation, translation, scaling and left right flip. Elastic deformations are also very helpful however it is not used frequently in medical imaging manuscripts. However similar to the above discussion, if the target class label strongly depends on spatial location, the rotation operation which will change the position of the bright signal in the cartesian coordinate system can change the predicted class label, since the filters activated in that location is preconditioned on the actual class that should reside there (Supplementary material S Fig. 1). Therefore, minimal rotations in the range of a few degrees could be tolerated however more extensive rotations should be avoided. In a similar manner, the rotated samples in our dataset should be registered to a standard space in order to have better results which was evident by the performance gain of our models after registration. Other groups have also proposed deep learning methods for noninvasive, image-based stroke lesion segmentation and its extensions in DWI and other modalities, but each of these has several limitations. For ischemic stroke segmentation, Praveen et al. built a stacked sparse autoencoder which performed perfectly on the Ischemic Stroke Lesion Segmentation (ISLES) 2015 dataset, with an average Dice score of 0.94. [25]. Pinto et al. built restricted Boltzmann machine to extract features from lesions and perfusion from different sequences to predict the final stroke lesion. They achieved 0.38 Dice score on ISLES 2017 [26]. DWI-ASPECTS can be thought as the MRI equivalent of CT-ASPECTS [25,26,27,28,29]. Do et al. built recurrent residual convolutional neural network for predicting ASPECTS score in acute stroke patients in DWI with an AUC of 0.94 [27]. These studies are the evidence of high performance of deep learning models for segmentation of stroke lesions in DWI which can be translated into class prediction by aggregation into multiclass class labels. Our study is a proof of concept of this idea which translated the multiclass stroke territory segmentation outputs to the classification outputs by employing majority voting.

The proposed solution including N4 bias field correction, registration to MNI space, signal normalization into 0–1 range and input size rearrangement into 320 × 224 could be carried out under 1 min automatically. Additional 1 min is required for the multiclass inference of the DWI image volume and subsequent majority voting. Therefore, the proposed pipeline and model could be incorporated into PACS for almost real time inference of affected territory by stroke.

There were some limitations of our study. One of the limitations was limited type of the data. Although we used the images from 2 different MR protocols, the images were from single center. In a future work we plan to validate our findings across multiple centers in order to explore generalizability of our approach. Another limitation was lack of external validation set. The third limitation was the constrained number of classes. In a future study with sufficient number of samples we plan to include all major arterial feeding zones for the classification. Additionally for a model with triage purposes it would be better to include normal cases as well. In a future study we plan to include normal classes also.

Conclusion

In conclusion voxel wise dens prediction based multiclass segmentation coupled with aggregation of predictions to predict patient level class label for acute stroke territory determination as anterior or posterior territory in DWI yielded good performance. This approach can be extended to other medical image classification tasks particularly which suffer from small number of samples.