Although research has demonstrated the predictive validity of personality, and especially conscientiousness, in employment selection contexts, personality inventories remain a weaker predictor of job performance than cognitive ability tests (Barrick & Mount, 1991; Furnham et al., 2009; Guion & Gottier, 1965; Sackett et al., 2021; Tett et al., 1991). As such, there has been interest in identifying methods of increasing the predictive validity of personality measures, with contextualization emerging as a promising option (Schmit et al., 1995). Contextualization offers an easy and low-cost means of improving the prediction of desired outcomes (e.g., job performance) by limiting a test-taker’s frame of reference to the desired context (e.g., their conscientiousness in work settings).

To date, research mostly provides evidence of the utility of contextualization as a whole, by comparing contextualized measures to non-contextualized measures. However, very few investigations have directly compared contextualization methods to one another and only one study has directly compared the efficacy of the two simplest and most popular forms of contextualization, instruction-level contextualization (i.e., modifying instructions to evoke the desired frame of reference) and tag-level contextualization (i.e., adding contextual tags to items). Although both of these contextualization methods have benefits compared to non-contextualized personality measures and are less resource-intensive than complete contextualization (i.e., fully rewriting items in the intended frame of reference mirroring common scale development practices including validation efforts with subject matter experts), researchers and practitioners currently have little guidance regarding which of these two methods should be preferred. We address this gap by comparing the utility of instruction-level and tag-level contextualized personality measures for predicting context-specific outcomes.

The present research boasts several strengths. Because most contextualization research has not directly compared instruction- and tag-level contextualization, this study represents an important contribution. Further, as illustrated in Table 1, relevant past research is marked by a variety of different design factors and inconsistent findings, which limit the conclusions that can be drawn regarding instruction- and tag-level contextualization. For example, several studies have contextualized personality to the school setting, using either instructions or tags, and presented correlations between conscientiousness and grade point average (GPA) (e.g., Bing et al., 2004; Lievens et al., 2008; Reddock et al., 2011; Schmit et al., 1995). However, a closer look at these studies does not provide a consistent picture supporting either instruction- or tag-level contextualization over the other method. Although one study has directly compared instruction- and tag-level contextualization (i.e., Swift & Peterson, 2019), generalizations from its findings (i.e., tags outperformed instruction-contextualization in one of four comparisons) may be impeded by its cross-sectional design and limited predictor-criterion relationships.

Table 1 Brief overview of methodological contextualization studies

Therefore, our comparison of instruction-level and tag-level contextualization advances the literature and provides useful guidance to researchers and practitioners alike. In addition, we utilize a diverse working sample and leverage a time-lagged, within-person design that is more methodologically rigorous than most contextualization research. By utilizing a time-lagged design rather than a cross-sectional design, we are better able to control for potential carryover effects associated with taking multiple personality assessment forms in one sitting and to reduce the influence of common method variance (Podsakoff et al., 2003). In doing so, we increase confidence in the replicability of previous results regarding the general efficacy of contextualization. By adopting a within-person design as opposed to a between-person design, we are able to use the most common analyses in similar methodological comparisons in the literature (i.e., correlation comparison and hierarchical regression); see Table 1. In addition to these common analyses, we employ relative weights analysis (RWA) to account for the high multicollinearity expected in a hierarchical linear regression comparing the separate methods (Tonidandel & LeBreton, 2015). We examine work-relevant outcomes associated with each of the big five personality factors, rather than focusing solely on conscientiousness as a predictor. Nonetheless, a comparison of instruction- and tag-level contextualization may leave readers wondering if they should simply leverage both. In some situations, this may be appropriate. However, there may be scenarios in which researchers and practitioners are wary of altering a scale’s instructions and potentially distracting from other meaningful aspects of the instructions (e.g., a time frame, a target individual).

In sum, the current manuscript aims to apply rigorous methodology to (a) replicate past findings regarding the efficacy of contextualization generally and (b) compare the efficacy of the two most common and accessible contextualization methods.

Background

Contextualization of Personality

For many decades, a growing contingent of psychologists has challenged the traditional trait approach to understanding human personality (Mischel, 1973). Specifically, the cognitive-affective system theory of personality suggests that personality is conditional on situational cues as well as individual factors such as a person’s life experience; as such, behavior can only be expected to be consistent across situations to the extent that those situations elicit similar feelings, motives, and interpretations (Mischel & Shoda, 1995). For example, a person might be more consistently agreeable and extraverted amongst friends than they are at work. Drawing from this thinking, contextualized personality measures strive to limit a test-taker to a specific context as they answer items about their characteristic ways of thinking and behaving (Bing et al., 2004). For example, if a personality measure is being used to predict job performance, the test-taker would be encouraged to answer items according to how they act in work settings. Thus, contextualized measures provide a frame of reference in order to assess personality within a given context, rather than general personality.

In support of this thinking, Lievens and colleagues (2008) explored how contextualization increases criterion-related validity, finding that contextualization reduces both between-person variability and within-person inconsistency. In other words, contextualization reduces the likelihood of two different test-takers considering different life domains (e.g., work and social settings) in responding to the same personality inventory. It also reduces the likelihood of each test-taker drawing on different life domains in responding to individual items on the inventory. Beyond these explanations, the symmetry principle has also been leveraged to explain how contextualization improves criterion-related validity. In short, the symmetry principle suggests that when the specificity of a predictor and outcome are matched or symmetrical, their relationship is stronger than when the specificity is unmatched or asymmetrical (Ajzen, 2005; Schulze et al., 2021).

The formal study of contextualized personality measures dates back only a few decades (Schmit et al., 1995). However, in this time, a number of studies have examined the relationship between contextualized personality and work outcomes (Bing et al., 2014; Bowling & Burns, 2010; Hunthausen et al., 2003; Pace & Brannick, 2010; Robie et al., 2000, 2001; Schmit et al., 1995; Swift & Peterson, 2019). Meta-analytic findings suggest that validity for predicting supervisor-rated job performance can be improved for all of the big five personality factors through contextualization (Shaffer & Postlethwaite, 2012). Overall, Shaffer and Postlethwaite (2012) found a mean validity increase from 0.11 for non-contextualized personality measures to 0.24 for contextualized personality measures.

Importantly, contextualized personality measures also boast other advantages over non-contextualized personality measures. In selection contexts, non-contextualized personality measures are often viewed as less job-relevant than other assessments (Hausknecht et al., 2004). As a result, applicants respond less favorably to non-contextualized personality measures than they do to interviews, work samples, or even cognitive ability tests (Hausknecht et al., 2004). Contextualizing personality inventories increases their face validity and perceived predictive validity (Holtrop et al., 2014; Robie et al., 2017), improving the test-taker experience (Ployhart et al., 2003). This is a worthy goal as face validity and perceived predictive validity are associated with a variety of applicant reactions to the selection process, including perceptions of justice and intentions to accept job offers (Hausknecht et al., 2004). Thus, the contextualization of personality measures can not only increase predictive validity, allowing organizations to make better selection decisions, but also improve the applicant experience, making applicants more likely to accept employment offers.

Methods of Contextualization

Although a great deal has been learned about the value of contextualized personality measures over the last several decades, there is much more left to uncover related to how contextualization should be approached. In past research, personality items have been contextualized through one of three methods (Holtrop et al., 2014). The first method, instruction-level contextualization, manipulates a scale’s instructions to direct individuals to answer items according to how they typically behave in a given context (e.g., “To what extent do the following items reflect your tendencies at work?”). The second method, tag-level contextualization, adds contextual tags (e.g., “at work,” “at school”) to the individual items of the scale. The third method, complete contextualization, provides a frame of reference by fully rewriting the items of the scale to align with the target context (Holtrop et al., 2014). For example, a generic emotional stability item reading “I remain calm during emergencies” could be replaced by a work-contextualized item reading “I handle pressing work tasks with steady nerves” (Bing et al., 2014, p. 171).

Importantly, researchers have utilized all three of these methods to effectively match the frame of reference of personality measures to that of relevant outcomes. Evidence suggests that instruction-level contextualized personality measures better predict relevant outcomes than non-contextualized personality measures (Hunthausen et al., 2003; Lievens et al., 2008). Tag-level contextualized personality is also more effective than non-contextualized personality (Bing et al., 2004; Bowling & Burns, 2010; Reddock et al., 2011). Finally, complete contextualization also seems to increase predictive validity of personality measures compared to non-contextualized ones (Bing et al., 2014; Pace & Brannick, 2010; Swift & Peterson, 2019). Overall, meta-analytic findings support the conclusion that contextualized measures are more valid predictors of context-relevant outcomes than are non-contextualized measures (Shaffer & Postlethwaite, 2012).

While all three contextualization methods can improve the predictive validity of personality, the extant literature provides little guidance on which contextualization method to prefer. In fact, only a few studies have compared contextualization methods to one another. A study by Holtrop and colleagues (2014) demonstrated that personality measures with both tag-level and complete contextualizations tended to explain more variance in criteria than did non-contextualized personality measures; further, completely contextualized scales tended to outperform tag-contextualized scales. Robie and colleagues (2017) reported that tag-contextualized and completely contextualized personality measures generally outperformed non-contextualized measures, but only found partial support for the advantage of complete contextualization over tag-level contextualization. The one study to date that has compared instruction- and tag-level contextualization reported inconclusive findings from their cross-sectional study, with tag contextualization outperforming instruction contextualization in one comparison and no significant difference in the other three comparisons (Swift & Peterson, 2019). Of note, the instruction-level contextualization employed in the study only outperformed base personality in one of the four comparisons (Swift & Peterson, 2019).

Although there is evidence that completely rewriting items to fit the target context can produce strong results (Holtrop et al., 2014; Robie et al., 2017), there are also serious disadvantages to complete contextualization. First, the process is far more time- and labor-intensive than instruction- or tag-level contextualization. Complete contextualization requires a lengthy process, such as “(1) generating examples, (2) developing a preliminary list of items, (3) back-translation, (4) revision, and (5) a final check by two experts on personality ratings who assigned the completely contextualized items to the facet scales used in the inventory” (Robie et al., 2017, p. 59). Pace and Brannick (2010) described similar efforts, including an additional data collection to aid in development and validation of the contextualized scale. Even with these efforts, additional concerns have been expressed about the effects of contextualization on the psychometric properties of scales (Robie & Risavy, 2016). Specifically, complete contextualization may alter the content so extensively that the items no longer represent the intended domain; validation efforts are rarely pursued and, when they are, questions remain about whether they sufficiently address such concerns (see Heggestad et al., 2019). In short, complete contextualization constitutes the creation of a new scale, requiring extensive validation efforts to ensure psychometric validity. As a result, instruction-level and tag-level contextualizations have been significantly more common in the research literature and represent more accessible options for practitioners (Holtrop et al., 2014). For an overview of previous methodological approaches to contextualization research, refer to Table 1.

The purpose of the current study is twofold. We first aim to replicate previous findings regarding the general efficacy of contextualized personality measures with a more rigorous study design than has typically been utilized in contextualization research. Previous studies mostly used cross-sectional designs. We leverage a within-person, time-lagged design using a diverse, working sample. Using the multi-wave design and assessing lagged relationships with temporally separated variables should limit the risk of common method variance, which may have altered the nature of relationships observed in past research (Johnson et al., 2011; Podsakoff et al., 2003, 2012). Our design also limits the risk of potential carryover effects or demand characteristics that may occur when participants respond to contextualized and non-contextualized personality measures back-to-back (e.g., Schmit et al., 1995). With this more rigorous study design, we expect that, regardless of contextualization method, contextualized measures will outperform non-contextualized measures. We assess the efficacy of contextualization utilizing three analyses in order to explore (a) the relationships between personality and various work outcomes by comparing correlational strengths, (b) the predictive ability of the measures via hierarchical regression, and (c) the contribution of each method in explaining variance in the outcome with relative weights analysis.

  • Hypothesis 1: Personality scales employing any form of contextualization (i.e., instruction or tag) will outperform non-contextualized personality scales in terms of the (1a) strengths of correlations with criteria, (1b) incremental predictive validity, and (1c) relative importance of the predictors.
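As a rough illustration of the incremental validity logic in Hypothesis 1b, the change in R² from adding a contextualized score to a model that already contains the base score can be computed from three correlations when there are only two predictors. The sketch below uses the textbook two-predictor formula; the function names and the correlation values are hypothetical and are not drawn from the study.

```python
def r_squared_two_predictors(r_y1: float, r_y2: float, r_12: float) -> float:
    """R^2 for a criterion regressed on two predictors, from correlations only.

    r_y1, r_y2: correlations of each predictor with the criterion.
    r_12: correlation between the two predictors (must not be +/-1).
    """
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)


def incremental_r_squared(r_y_base: float, r_y_ctx: float, r_base_ctx: float) -> float:
    """Delta R^2 when the contextualized score is entered after the base score."""
    full = r_squared_two_predictors(r_y_base, r_y_ctx, r_base_ctx)
    step_one = r_y_base**2  # R^2 of the base-only model
    return full - step_one


# Hypothetical correlations for illustration only (not the study's data).
delta = incremental_r_squared(r_y_base=0.20, r_y_ctx=0.30, r_base_ctx=0.70)
print(f"Delta R^2 = {delta:.3f}")
```

Because the base and contextualized versions of a scale are expected to correlate highly, the increment can be small even when the contextualized score is the stronger predictor, which is one reason a relative weights analysis is a useful complement.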

Further, we seek to extend current knowledge in this field by directly comparing the two most common and accessible methods of contextualization: instruction-level and tag-level contextualization. There is limited empirical foundation for hypothesizing that one form of contextualization will outperform the other, with the only direct comparison finding no significant difference between instruction- and tag-level contextualization in the majority of their comparisons (Swift & Peterson, 2019). However, there is some theoretical rationale to prefer tag-level contextualization. As mentioned, the theoretical foundation for contextualizing personality measures has typically been the cognitive-affective system theory of personality, which suggests that personality is conditional on situational cues (Mischel & Shoda, 1995). As a result, this theory posits that behavior will be consistent across situations to the extent that those situations present similar cues (Mischel & Shoda, 1995). Because tag-level contextualization repeats the frame of reference in each item, it is reasonable to expect that it will more successfully reinforce the proper context than instruction-level contextualization, which lists the target context only once. Further, previous research on survey methodology suggests that participants are less likely to read instructions than item text (Oppenheimer et al., 2009; Shamon & Berning, 2020). As a result, researchers have suggested that critical information should be provided in items, rather than instructions, whenever possible (Shamon & Berning, 2020). Thus, we expect that the repetition of the desired frame of reference in each item should lead tag-level contextualization to outperform instruction-level contextualization.

  • Hypothesis 2: Tag-level contextualized personality scales will outperform instruction-level contextualized personality scales in terms of the (2a) strengths of correlations with criteria, (2b) incremental predictive validity, and (2c) relative importance of the predictors.

Pilot Study Method

First, a pilot study was conducted to explore the placement of contextual tags within items. Just as past research provides little guidance for whether instruction- or tag-contextualization should be preferred, there is also no research guiding how tag-level contextualization should be implemented. Past research utilizing tag-level contextualization has typically added tags to the ends of items (e.g., Bowling & Burns, 2010; Holtrop et al., 2014; Lievens et al., 2008; Robie et al., 2017) or a mix of beginnings and ends of items (e.g., Holtz et al., 2005; Reddock et al., 2011; Robie et al., 2000; Schmit et al., 1995). To our knowledge, however, no researchers have provided rationale for the locations of these contextual tags beyond basing their decisions on grammatical fit (e.g., Holtrop et al., 2014; Robie et al., 2017).

The location of item tags deserves greater attention. When individuals are presented with information to memorize, both primacy (i.e., items early in the list are remembered better) and recency effects (i.e., items late in the list are remembered better) are reliably demonstrated (Kelley et al., 2013). Considering the cognitive effects associated with primacy and recency, tag location may impact the salience of the frame of reference in working memory. Greater salience could lead individuals to more carefully consider their behavior in the target context, leading to greater criterion-related validity. Importantly, evidence suggests that response-order effects can have meaningful impacts on survey research (Holbrook et al., 2007; Krosnick & Alwin, 1987). Due to limited theoretical rationale or empirical evidence about tag locations, we explored the effects of three possible locations (i.e., beginnings of items, ends of items, or split between beginnings and ends of items) in the pilot study in order to inform tag usage in our primary study.

Sample and Procedure

The pilot study contextualized conscientiousness to an academic setting. Specifically, data were collected in the spring of 2020 from undergraduate students at a large university in the USA. This study focused on the relationship between conscientiousness and academic performance, operationalized as college grade point average (GPA). Data were removed for any participant who was unable to provide an official college GPA. Overall, the sample (N = 257) was 55.6% female and had an average GPA of 3.44 (SD = 0.46).

Participants were randomly assigned to one of four conditions. In the base condition, participants completed the conscientiousness scale described below without any manipulations; the instructions prompted them, “To what extent do the following items reflect your tendencies?” For each of the three tag conditions, the participants viewed the same instructions as the base condition, but each item of the scale had the words “at school” added to the beginning or the end of the item. The three tag conditions will be referred to as primacy (in which the “at school” tag is added to the beginning of each item), recency (in which the “at school” tag is added to the end of each item), and split (in which the “at school” tag alternates between the beginning and end of each item). See Appendix A for the contextualized items. Descriptive information across each of the conditions can be found in Table 2.
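The three tag placements can be sketched in a few lines of code. This is a minimal illustration only: the function names are hypothetical, the comma after an initial tag is an assumption about punctuation, and the actual item wordings are those listed in Appendix A.

```python
def add_tag(item: str, tag: str, at_start: bool) -> str:
    """Attach a contextual tag to the beginning or end of one item."""
    if at_start:
        # Assumed formatting: capitalized tag, comma, then the original item.
        return f"{tag.capitalize()}, {item}"
    return f"{item} {tag}"


def contextualize(items: list[str], tag: str, condition: str) -> list[str]:
    """Build the item list for one condition: base, primacy, recency, or split."""
    if condition == "base":
        return list(items)
    if condition == "primacy":
        return [add_tag(item, tag, at_start=True) for item in items]
    if condition == "recency":
        return [add_tag(item, tag, at_start=False) for item in items]
    if condition == "split":
        # Alternate placements, starting with the beginning of the first item.
        return [add_tag(item, tag, at_start=(i % 2 == 0))
                for i, item in enumerate(items)]
    raise ValueError(f"unknown condition: {condition}")


sample_items = ["I am always prepared", "I like order"]
for condition in ("base", "primacy", "recency", "split"):
    print(condition, contextualize(sample_items, "at school", condition))
```

A purely mechanical rule like this one ignores grammatical fit, so tagged items would still need to be reviewed by hand before use.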

Table 2 Descriptives across the pilot study conditions

Measures

Conscientiousness

The 20-item Big-Five Factor Markers conscientiousness scale from the International Personality Item Pool was used as the general measure of conscientiousness (Goldberg, 1992). Sample items include “I am always prepared” and “I like order.” Responses were recorded on a seven-point Likert scale (1 = Strongly Disagree; 7 = Strongly Agree). See Table 3 for internal consistency across conditions.

Table 3 Correlations and scale statistics in pilot study

Academic Performance

The criterion variable of academic performance was operationalized as participants’ grade point averages (GPAs). Participants were asked to self-report their current college GPA at the end of the survey. Evidence suggests that undergraduate research participants provide highly accurate self-reports of their college GPAs (Caskie et al., 2014; Cassady, 2000; Gray & Watson, 2002; Noftle & Robins, 2007).

Pilot Study Results

Correlations were calculated to examine relationships between the different versions of the conscientiousness measure and academic performance; see Table 3. Collapsed across conditions, conscientiousness correlated positively with GPA. This correlation aligns with meta-analytic findings on the relationship between the Big-Five Factor Markers conscientiousness scale and GPA (McAbee & Oswald, 2013). Interestingly, the correlation between conscientiousness and GPA was significant in the split tag condition, but not in the primacy or recency conditions. Following the z-test method for comparing correlations from independent samples (Cohen et al., 2013; Eid et al., 2011), we tested the differences between the strengths of the correlations. The correlation between conscientiousness and GPA in the split condition was not significantly stronger than that of either the primacy condition (z = 0.49, p = 0.311) or the recency condition (z = 1.00, p = 0.158). The split condition did, however, evidence a significantly stronger correlation between conscientiousness and GPA than that of the non-contextualized condition (z = 1.80, p = 0.036), which was not the case for the other tag conditions. Note that the above p-values represent one-tailed tests.
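To make the comparison concrete, the following minimal sketch implements a z-test for two correlations from independent samples, assuming the standard Fisher r-to-z transformation. The function names and the example correlations and group sizes are hypothetical placeholders, not values from Table 3.

```python
from math import atanh, sqrt
from statistics import NormalDist


def one_tailed_p(z: float) -> float:
    """Upper-tail probability of a standard normal deviate."""
    return 1.0 - NormalDist().cdf(z)


def compare_independent_correlations(r1: float, n1: int, r2: float, n2: int):
    """Test whether r1 (sample 1) exceeds r2 (sample 2).

    Assumes the Fisher r-to-z approach: each correlation is transformed
    with atanh, and the difference is divided by its standard error.
    Returns the z statistic and its one-tailed p-value.
    """
    z1, z2 = atanh(r1), atanh(r2)
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    return z, one_tailed_p(z)


# Hypothetical inputs for illustration only (not the study's data).
z, p = compare_independent_correlations(r1=0.30, n1=65, r2=0.05, n2=64)
print(f"z = {z:.2f}, one-tailed p = {p:.3f}")
```

Applied to a reported statistic, `one_tailed_p(1.80)` returns approximately 0.036, which matches the one-tailed value given for the comparison between the split and non-contextualized conditions.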

Pilot Study Discussion

The preliminary findings suggest that split tag contextualization should be preferred over primacy or recency tags. There are at least two potential explanations for why a split approach toward tag contextualization may be more effective. First, the flexibility of placing tags at either the beginning or the end of an item allows researchers to conform more closely to grammatical conventions, thus reducing the potential cognitive load for participants. Additionally, providing contextual information at different locations within each item introduces more variety, potentially staving off cognitive shortcuts participants may engage in when reading items. In other words, when the same phrase is consistently repeated in the same location of each item, participants may be more likely to ignore this phrase and only attend to the unique parts of each item. Based on the pilot study, the split tags were then used as the form of tag contextualization in the primary study.

An important note regarding the pilot study is that these data were collected between January 28 and April 10, 2020. In response to the COVID-19 pandemic, all courses at the institution attended by this student sample were moved online beginning Monday, March 16, 2020. We expect that this salient disruption may be partially responsible for the rather low correlation observed between conscientiousness and GPA in the base condition. For those participants responding to a non-contextualized conscientiousness inventory during the early days of COVID-19, the pandemic may have been the most salient context available and participants may have self-imposed this context in their responses. Although the unique timing of our data collection likely limits the generalizability of the correlation between base conscientiousness and GPA, it also emphasizes the importance of contextualization, especially in situations when a different, undesired context may be more salient than the desired context. Further highlighting the importance of contextualization, the mean conscientiousness observed in the base condition was notably lower than the mean conscientiousness observed in each of the contextualized conditions (see Table 2 for effect sizes).

Primary Study Method

The primary study compared instruction-contextualized, tag-contextualized, and non-contextualized personality within a working sample. A multi-wave data collection allowed us to collect three versions of the personality measures with minimal potential spillover effects between conditions and provide a lagged test of their predictive validity. This study examined the predictive validity of each of the big five personality traits in predicting established criteria associated with the work context. Specifically, extraversion has been shown to relate to job satisfaction (Judge et al., 2002), agreeableness to perpetrated incivility and aggression (Taylor & Kluemper, 2012; Welbourne et al., 2020), conscientiousness to job performance (Dudley et al., 2006), openness to experience to creative job performance (Pace & Brannick, 2010; Schilpzand et al., 2011), and neuroticism to emotional exhaustion (Kammeyer-Mueller et al., 2016; Sosnowska et al., 2019).

Data were collected in four waves on Amazon Mechanical Turk (MTurk) using the CloudResearch Toolkit in the spring of 2021. To ensure high-quality data, our study was only available to CloudResearch-approved participants who had completed more than 10,000 MTurk human intelligence tasks (HITs) with an overall approval rating higher than 98%. We also followed MTurk best practice recommendations for increasing data quality, including the use of captcha verification and attention checks (Aguinis et al., 2021). Participants were screened for eligibility (i.e., working on average 35 h per week or more, employed at their current organization for at least three months, not self-employed, at least 21 years of age, residing in the USA, interacting with coworkers and/or supervisors on a weekly basis) and then compensated for the successful completion of each of four surveys. Each survey was administered 1 month apart with personality measures collected at the first, second, and third time points and the outcome measures collected at the fourth time point. To reduce participant fatigue and potential spillover effects between conditions and to encourage more thoughtful responses, participants responded to only one version of the personality measure at each of the first three time points. To control for potential order effects, the order of administration (e.g., base then instruction then tag; tag then base then instruction) was randomized and counterbalanced across participants. In total, there were six orders of administration; each was completed by 60–74 participants of the final retained sample. All outcomes (i.e., job satisfaction, perpetrated incivility, job performance, creative job performance, and emotional exhaustion) were measured at the fourth and final time point.
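One way to implement this kind of counterbalancing is sketched below. It is an illustration under stated assumptions rather than the procedure actually used: the condition labels, the function name, and the round-robin assignment after shuffling are all hypothetical.

```python
import random
from itertools import permutations

CONDITIONS = ("base", "instruction", "tag")


def assign_orders(participant_ids, seed=0):
    """Randomly assign each participant one of the 3! = 6 administration orders.

    Participants are shuffled and then dealt across the six orders in turn,
    so the orders are used as evenly as the sample size allows.
    """
    orders = list(permutations(CONDITIONS))  # all six possible orders
    rng = random.Random(seed)                # seeded for reproducibility
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {pid: orders[i % len(orders)] for i, pid in enumerate(ids)}


assignment = assign_orders(range(12))
for pid in sorted(assignment):
    print(pid, " -> ".join(assignment[pid]))
```

Even with balanced assignment at the outset, attrition across waves can leave the orders unevenly represented in a retained sample.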

Participants

A total of 534 participants met the eligibility requirements and completed the wave 1 survey. In total, 465 participants completed the wave 2 survey, 438 completed the wave 3 survey, and 406 completed the wave 4 survey. Thus, 76.0% of those who completed the first survey also completed the final survey. Two participants were removed from the analytic sample for failing two attention checks at wave 1, and one participant was removed for failing two attention checks at wave 2. Participants who did not have complete data for all variables of interest or whose data could not be matched across waves (i.e., due to a missing ID) were excluded from analyses. The final analytic sample consisted of 399 participants who ranged in age from 23 to 78 years of age (M = 43.2, SD = 11.7). Roughly half (53.4%) of the sample identified as female. The majority (82.2%) identified as Caucasian or white, with the next largest group (7.0%) identifying as Asian or Asian American. Participants were employed in a variety of industries with education (13.3%), health care or social assistance (12.3%), and professional, scientific, or technical services (11.5%) most highly represented.

Independent samples t-tests and crosstabs with pairwise z-tests using Bonferroni-corrected p-values were used to compare participants included in the analytic sample to those who completed the wave 1 survey but were not included in the final sample. Participants included in the final sample did not differ significantly from those excluded after the wave 1 survey in terms of age, race, or gender identity. Compared to those who were not included in the final sample, participants in the analytic sample were less likely to report working in the Information industry (8.3% vs. 14.1%). There were no other differences in industry composition. We also compared the personality measures completed on the wave 1 survey by these two groups. The average score on the tag-contextualized conscientiousness scale was higher for those included in the final sample (M = 4.31, SD = 0.60) compared to those who were not included (M = 4.06, SD = 0.71, t = −2.22, p = 0.03). The other 14 personality comparisons yielded non-significant differences between groups.

Measures

Base Personality

We assessed non-contextualized extraversion, agreeableness, conscientiousness, openness to experience, and neuroticism using the mini-IPIP (Donnellan et al., 2006). The scale consists of four items assessing each personality trait, for a total of 20 items. Minor edits were made to two items to remove a specific frame of reference. Specifically, an extraversion item reading “I talk to a lot of different people at parties” was edited to read “I talk to a lot of different people.” Similarly, a conscientiousness item reading “I get chores done right away” was edited to read “I get tasks done right away.” The instructions for this condition read, “To what extent do the following items reflect your tendencies?” Participants responded to all three personality measures on a scale from 1 (Strongly Disagree) to 5 (Strongly Agree). Internal consistency was acceptable for extraversion (α = 0.90), agreeableness (α = 0.85), conscientiousness (α = 0.76), openness to experience (α = 0.83), and neuroticism (α = 0.81).

Instruction-Contextualized Personality

We adapted the base personality measure to measure work-specific personality by manipulating the instructions. Participants read the following instructions before responding to items: “To what extent do the following items reflect your tendencies at work?” Other than this change to instructions, the items were the same as those in the base condition. Internal consistency was acceptable for extraversion (α = 0.88), agreeableness (α = 0.84), conscientiousness (α = 0.79), openness to experience (α = 0.82), and neuroticism (α = 0.80).

Tag-Contextualized Personality

Because split tags were the most effective form of tag contextualization in the pilot study, this study applied tag contextualization in the same manner. Specifically, the base personality measure was adapted by adding “at work” tags to either the beginning or end of an item. We applied the “at work” tags such that items made grammatical sense and items were presented in an alternating order (i.e., the first item started with the “at work” tag, the next item ended with the “at work” tag, and so on). See Appendix A for the items. The individual items were presented in the same order in all three conditions and items representing each trait were grouped together (e.g., four extraversion items followed by four agreeableness items, and so on). Similar to the base condition, the instructions for this condition read, “To what extent do the following items reflect your tendencies?” Internal consistency was acceptable for extraversion (α = 0.86), agreeableness (α = 0.86), conscientiousness (α = 0.73), openness to experience (α = 0.81), and neuroticism (α = 0.78). See Appendix B for information about the factor structure of the three versions of the personality scale.
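The alternating ("split") tag procedure described above can be sketched programmatically. This is an illustrative approximation under stated assumptions: the helper name `apply_split_tags` is hypothetical, the two example items are the edited base items quoted earlier in this section, and a purely mechanical rule cannot guarantee grammatical sense for every item. The authoritative item wording is the one given in Appendix A, which may differ from this output.

```python
def apply_split_tags(items, tag="at work"):
    """Alternate a context tag between the start and end of successive items.

    Even-indexed items (first, third, ...) begin with the tag; odd-indexed
    items end with it. This mirrors the alternating order described in the
    text but does not check grammaticality.
    """
    tagged = []
    for index, item in enumerate(items):
        stem = item.rstrip(".")
        if index % 2 == 0:
            # Keep the pronoun "I" capitalized; lowercase any other first letter.
            if stem.startswith("I ") or stem.startswith("I'"):
                body = stem
            else:
                body = stem[0].lower() + stem[1:]
            tagged.append(f"{tag.capitalize()}, {body}.")
        else:
            tagged.append(f"{stem} {tag}.")
    return tagged


# The two edited base items quoted in the Base Personality section.
base_items = [
    "I talk to a lot of different people.",
    "I get tasks done right away.",
]
tagged_items = apply_split_tags(base_items)
```

In practice, each generated item would still need to be read by a person, since the text notes that tags were placed so that items made grammatical sense.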

Job Satisfaction

Job satisfaction was assessed using the three-item subscale from the Michigan Organizational Assessment Questionnaire (Bowling & Hammond, 2008; Cammann et al., 1983) on a scale from 1 (Strongly Disagree) to 5 (Strongly Agree). A sample item includes “All in all I am satisfied with my job.” The measure evidenced high internal consistency (α = 0.91).

Perpetrated Incivility

Participants were asked how frequently they had engaged in incivility toward their supervisor or coworkers over the past month (Cortina et al., 2001) on a scale from 1 (Never) to 5 (Always). A sample item includes “made demeaning or derogatory remarks about them.” The four-item scale demonstrated high internal consistency (α = 0.91).

Job Performance

Participants were asked to recall their most recent performance evaluation and estimate how they were rated relative to their coworkers on five criteria (Pearce & Porter, 1986). A sample criterion reads “overall performance.” Participants indicated on a sliding scale what percentile (10th percentile–90th percentile, using increments of 10) they believed represented their relative performance on each criterion. The measure evidenced high internal consistency (α = 0.98).

Creative Job Performance

Participants were asked to report the extent to which they produce original or novel work (Oldham & Cummings, 1996) on a scale from 1 (Strongly Disagree) to 5 (Strongly Agree). A sample item includes “The work I produce is creative.” These three items evidenced high internal consistency (α = 0.94).

Emotional Exhaustion

Participants completed the emotional exhaustion subscale from the Maslach Burnout Inventory (Maslach & Jackson, 1981) on a scale from 1 (Strongly Disagree) to 5 (Strongly Agree). A sample item includes “I feel emotionally drained from my work.” This nine-item measure had high internal consistency (α = 0.96).

Analytical Approach

In line with previous studies comparing contextualized and non-contextualized measures (Bing et al., 2004, 2014; Bowling & Burns, 2010; Pathki et al., 2022), we assess the strength of relationships through correlation and the incremental predictive validity of the measures through hierarchical regression. Further, we advance the literature by utilizing relative weights analysis to compare the relative importance of predictors.

Primary Study Results

Base, instruction-contextualized, and tag-contextualized personality measures were correlated with established outcomes over time. All correlations were in the expected directions, supporting the use of these established criteria. Specifically, extraversion, conscientiousness, openness to experience, and neuroticism correlated positively with job satisfaction, job performance, creative job performance, and emotional exhaustion, respectively; agreeableness correlated negatively with perpetrated incivility (see Table 4).

Table 4 Descriptives, correlations, and scale reliabilities in primary study

Hypothesis 1 posited that contextualized personality measures would outperform non-contextualized measures. Specifically, this hypothesis was assessed by (H1a) comparing correlational strengths through Steiger’s (1980) z-tests, (H1b) examining improvements in criterion prediction through hierarchical regression analyses, and (H1c) evaluating the relative contributions of predictors through relative weight analyses. First, correlations between base personality, contextualized personality, and their associated outcomes were compared using Steiger’s (1980) z-tests (see also Eid et al., 2011). See Table 5. The instruction-contextualized version of a personality trait measure demonstrated a significantly stronger association with its criterion than the base version in one case (i.e., openness to experience). The tag-contextualized version of a personality trait measure had a significantly stronger association with its criterion than the base version in three cases (i.e., extraversion, openness to experience, and neuroticism). No base personality measure demonstrated a significantly stronger association with its criterion than contextualized measures.
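To make the logic of the dependent-correlation comparison concrete, the sketch below implements one common formulation of Steiger's (1980) z-test for two correlations that share a variable (here, the criterion). It is a hedged illustration rather than the authors' analysis code: the text does not state which variant or software was used, the function name `steiger_z` is hypothetical, and the input correlations and sample size are invented.

```python
import math


def steiger_z(r_jk, r_jh, r_kh, n):
    """Steiger-type z-test for two dependent, overlapping correlations.

    r_jk: correlation of criterion j with predictor k (e.g., tag version)
    r_jh: correlation of criterion j with predictor h (e.g., base version)
    r_kh: correlation between the two predictors
    n:    sample size
    Returns (z, two_tailed_p).
    """
    # Fisher r-to-z transformation of the two correlations being compared.
    z_jk = math.atanh(r_jk)
    z_jh = math.atanh(r_jh)
    # Pooled correlation used in the covariance term.
    r_bar_sq = ((r_jk + r_jh) / 2.0) ** 2
    covariance = (
        r_kh * (1 - 2 * r_bar_sq)
        - 0.5 * r_bar_sq * (1 - 2 * r_bar_sq - r_kh ** 2)
    ) / (1 - r_bar_sq) ** 2
    z = (z_jk - z_jh) * math.sqrt(n - 3) / math.sqrt(2 - 2 * covariance)
    # Two-tailed p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Invented values for illustration only (not taken from Table 5).
z_stat, p_value = steiger_z(r_jk=0.5, r_jh=0.3, r_kh=0.6, n=200)
```

Because the two versions of a trait measure are themselves strongly correlated, the test gains power relative to treating the correlations as independent, which is why it suits this within-person design.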

Table 5 Comparisons of base and work-contextualized personality's relationships with hypothesized criteria in primary study

Hierarchical regression analyses were then used to evaluate the incremental predictive validity of contextualized measures over and above non-contextualized measures. Specifically, the base version of a personality trait measure (e.g., non-contextualized conscientiousness) was entered in step 1 predicting the associated outcome (e.g., job performance); then, the work-contextualized version of that personality trait measure (e.g., instruction- or tag-contextualized conscientiousness) was entered in step 2. A significant R2 change (ΔR2) indicates that the addition of the work-contextualized personality trait measure significantly improved prediction of the outcome over the base personality trait measure. Results can be found in Table 6. Instruction-level contextualization demonstrated incremental validity for associated outcomes for agreeableness, openness to experience, and neuroticism. Tag-level contextualization demonstrated incremental validity for associated outcomes for extraversion, agreeableness, openness to experience, and neuroticism.
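The two-step logic above can be illustrated with a short computation. With only one predictor per step, the step 1 and step 2 R2 values follow algebraically from the three pairwise correlations, so the sketch below works from correlations rather than raw scores. This is an assumption made for brevity: the study presumably fit the regressions to raw data, the function name `incremental_validity` is hypothetical, and the input values are invented.

```python
def incremental_validity(r_y_base, r_y_ctx, r_base_ctx, n):
    """Delta R^2 and F-change for adding one contextualized predictor.

    r_y_base:   correlation of the criterion with the base measure (step 1)
    r_y_ctx:    correlation of the criterion with the contextualized measure
    r_base_ctx: correlation between base and contextualized measures
    n:          sample size
    """
    r2_step1 = r_y_base ** 2
    # Standard closed form for R^2 with two correlated predictors.
    r2_step2 = (
        r_y_base ** 2 + r_y_ctx ** 2 - 2 * r_y_base * r_y_ctx * r_base_ctx
    ) / (1 - r_base_ctx ** 2)
    delta_r2 = r2_step2 - r2_step1
    # One predictor added; residual df = n - (2 predictors) - 1.
    df_residual = n - 3
    f_change = delta_r2 / ((1 - r2_step2) / df_residual)
    return {
        "r2_step1": r2_step1,
        "r2_step2": r2_step2,
        "delta_r2": delta_r2,
        "f_change": f_change,
        "df": (1, df_residual),
    }


# Invented values for illustration only (not taken from Table 6).
result = incremental_validity(r_y_base=0.3, r_y_ctx=0.5, r_base_ctx=0.6, n=200)
```

The F-change statistic would be evaluated against an F distribution with the returned degrees of freedom; the p-value is omitted here because the Python standard library does not provide that distribution.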

Table 6 Hierarchical regression analyses examining the incremental validity of work-contextualized personality with hypothesized criteria in primary study

To evaluate hypothesis 1c, relative weight analysis (RWA) was utilized to examine the relative importance or contribution of contextualized and non-contextualized personality trait measures toward the total predicted criterion variance. RWA is particularly appropriate for comparing the relative importance of predictors in situations where the predictors are correlated with one another (Tonidandel & LeBreton, 2015). Thus, five separate RWAs were conducted using RWA Web (Tonidandel & LeBreton, 2015). The three predictors (i.e., base, instruction-contextualized, and tag-contextualized versions of a personality trait) were included in one model for predicting each criterion. Table 7 reports the raw relative weights associated with each predictor (i.e., an additive decomposition of the model R2) and their statistical significance based on bias corrected and accelerated confidence intervals (see Tonidandel et al., 2009), rescaled relative weights (i.e., the raw relative weights rescaled to reflect the percentage of predicted variance in the criterion that can be attributed to each predictor), and a comparison of the predictors (i.e., comparing the raw relative weights of the instruction-contextualized and tag-contextualized scales with the base scale). The tag-contextualized version of a scale predicted significantly more variance in the related criterion in three cases (i.e., extraversion, openness to experience, and neuroticism). The instruction-contextualized version of a scale did not predict significantly more variance in the related criterion in any cases. All in all, this series of analyses provided support for hypotheses 1a, 1b, and 1c.

Table 7 Relative weight analyses assessing hypothesis 1

Hypothesis 2 posited that tag-contextualized personality measures would outperform instruction-contextualized measures. Similar to hypothesis 1, this hypothesis was assessed using Steiger’s (1980) z-tests (H2a), hierarchical regression analyses (H2b), and relative weight analyses (H2c). First, the correlations between instruction-contextualized and tag-contextualized personality measures with associated outcomes were compared using Steiger’s (1980) z-tests (see also Eid et al., 2011). Results are described in Table 5. The tag-contextualized version of a personality trait measure had a significantly stronger association with its criterion than the instruction-contextualized version in three cases (i.e., extraversion, openness to experience, and neuroticism). These analyses are generally supportive of hypothesis 2a.

Next, hierarchical linear regression was leveraged to assess hypothesis 2b. The instruction-contextualized version of each personality measure (e.g., conscientiousness) was entered in step 1 of a regression predicting the associated outcome (e.g., job performance); then, the tag-contextualized version of the measure was entered in step 2. Results are depicted in Table 8. Tag-contextualized measures demonstrated incremental validity over instruction-contextualized measures for extraversion, openness to experience, and neuroticism.

Table 8 Hierarchical regression analyses examining the incremental validity of tag-contextualized personality with hypothesized criteria in primary study

Finally, relative weight analyses were conducted, which included only the instruction-contextualized and tag-contextualized versions of the predictors and directly compared their relative contributions toward predicting associated criteria. Tag-contextualized measures accounted for significantly more variance in the criteria in three cases (i.e., extraversion, openness to experience, and neuroticism). Results are summarized in Table 9. In no instance did instruction-level contextualization account for significantly more variance in the criteria than tag-level contextualization. This was true in both these analyses and those including all three predictors (i.e., base, instruction-contextualized, and tag-contextualized personality). Overall, these analyses provide support for hypotheses 2a, 2b, and 2c.
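For the two-predictor case used in these comparisons, raw relative weights can be written in closed form, which helps show how the model R2 is partitioned between correlated predictors. The sketch below follows the general relative weights approach of transforming the predictors to their closest orthogonal counterparts, specialized to two predictors. It is an assumption-laden illustration: it is not RWA Web, it omits the bootstrapped confidence intervals used for significance testing, the function name `relative_weights_two` is hypothetical, and the correlations are invented.

```python
import math


def relative_weights_two(r_y1, r_y2, r_12):
    """Raw and rescaled relative weights for two correlated predictors.

    r_y1, r_y2: correlations of the criterion with predictors 1 and 2
    r_12:       correlation between the two predictors
    """
    # Entries of the symmetric square root of the 2x2 predictor correlation
    # matrix, [[a, b], [b, a]], from its eigenvalues 1 + r_12 and 1 - r_12.
    a = (math.sqrt(1 + r_12) + math.sqrt(1 - r_12)) / 2
    b = (math.sqrt(1 + r_12) - math.sqrt(1 - r_12)) / 2
    # Standardized weights of the criterion on the orthogonalized predictors.
    determinant = math.sqrt(1 - r_12 ** 2)  # equals a**2 - b**2
    beta_1 = (a * r_y1 - b * r_y2) / determinant
    beta_2 = (a * r_y2 - b * r_y1) / determinant
    # Raw weights: an additive decomposition of the model R^2.
    raw_1 = a ** 2 * beta_1 ** 2 + b ** 2 * beta_2 ** 2
    raw_2 = b ** 2 * beta_1 ** 2 + a ** 2 * beta_2 ** 2
    r_squared = raw_1 + raw_2
    rescaled = (100 * raw_1 / r_squared, 100 * raw_2 / r_squared)
    return (raw_1, raw_2), rescaled, r_squared


# Invented values: predictor 1 = instruction version, predictor 2 = tag version.
raw, rescaled, r_squared = relative_weights_two(r_y1=0.3, r_y2=0.5, r_12=0.6)
```

The raw weights sum to the model R2, and the rescaled weights express each predictor's share of that predicted variance as a percentage.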

Table 9 Relative weight analyses assessing hypothesis 2

Supplemental Analyses

Although the focus of this work was on the effect of contextualization in established predictor-criterion relationships, examining all of the relationships in the data can provide a more comprehensive understanding of contextualization. Thus, as supplemental analyses, we examined the effects of contextualization on the non-hypothesized relationships (e.g., extraversion predicting perpetrated incivility, job performance, creative job performance, and emotional exhaustion) using the same methods as above. All tables related to these supplemental analyses (i.e., Tables S3–S7) are available in Appendix C.

Correlations between base personality, contextualized personality, and the non-hypothesized outcomes were compared using Steiger’s (1980) z-tests; see Table S3. There were no significant differences between the base and instruction-contextualized versions of personality. Out of the 20 total comparisons, tag-contextualized personality demonstrated a stronger association with criteria than base personality in ten cases. Hierarchical regression analyses are presented in Table S4. The addition of the instruction-contextualized personality measure significantly improved prediction of outcomes over the base personality measure in nine cases out of 20. Tag-level contextualization demonstrated incremental validity over base personality in 14 of 20 cases. RWAs comparing base, instruction-contextualized, and tag-contextualized personality measures for the non-hypothesized outcomes are presented in Table S5. Across 20 comparisons, instruction-contextualized personality never predicted significantly more variance in an outcome than base personality. The tag-contextualized version of a scale predicted significantly more variance in criteria in eight of 20 cases. Although the primary analyses focused on effects of contextualization in established predictor-criterion relationships, these supplemental analyses largely mirrored the results related to hypotheses 1a, 1b, and 1c. With the exception of conscientiousness, contextualization generally improved the relationships between personality and these work-related outcomes.

Turning to direct comparison of instruction- and tag-contextualized measures among the non-hypothesized outcomes, tag-contextualized personality evidenced significantly stronger correlations with outcomes than instruction-contextualized personality in six cases (see Table S3). In only one comparison did instruction-contextualized personality have a significantly stronger correlation with an outcome than tag-contextualized personality. Hierarchical regression analyses comparing instruction- and tag-contextualization among non-hypothesized outcomes are presented in Table S6. The addition of the tag-contextualized personality measure significantly improved prediction of the outcome over the instruction-contextualized personality measure in 14 of 20 cases. Table S7 displays the RWAs comparing instruction- and tag-level contextualization among non-hypothesized outcomes. Instruction-contextualized personality never predicted significantly more variance in an outcome than tag-contextualized personality. The tag-contextualized version of a scale predicted significantly more variance in outcomes in five of 20 cases. Although the primary analyses for hypotheses 2a, 2b, and 2c focused on the effects of tag- and instruction-contextualization in established predictor-criterion relationships, these supplemental analyses also support the superiority of tag-level contextualization over instruction-level contextualization.

Discussion

Based on these findings, we are better able to understand how various personality factors are differentially impacted by contextualization. Specifically, we assessed the five-factor model of personality and found that contextualization generally improves the strength of relationships and prediction between personality factors and work-relevant outcomes. These results replicate previous findings highlighting the benefits of utilizing contextualized measures to improve context-relevant prediction (Shaffer & Postlethwaite, 2012). Further, we examined the utility of the two most common and accessible forms of contextualization by comparing instruction-level and tag-level contextualizations in a within-person, multi-wave design. Our findings suggest tag-level contextualization outperformed instruction-level contextualization for extraversion, neuroticism, and openness to experience. Based on the pilot and primary study, we provide an explicit recommendation that if researchers and practitioners wish to contextualize measures, they should prioritize the use of tags. More specifically, initial evidence supports implementing tag-level contextualization by alternating between starting and ending items with tags. Because the addition of instruction contextualization would have virtually no cost, researchers and practitioners may wish to employ contextualized instructions in addition to tags, assuming that this addition would not make for overly long or grammatically confusing instructions.

Theoretical Implications

The current study provides a methodologically rigorous replication and extension of previous findings demonstrating the utility of contextualization. The direct comparison of instruction-level and tag-level contextualization further advances the literature. Additionally, this study provides evidence for the inclusion of less frequently represented factors of personality (i.e., extraversion, openness to experience, neuroticism) in contextualization research. Especially in selection contexts, conscientiousness has been the dominant personality predictor (Barrick & Mount, 1991). Supporting the tenets of Cognitive-Affective Personality System theory, the current findings demonstrate the value of contextualizing multiple personality factors in order to predict a wide variety of valued workplace outcomes (Mischel & Shoda, 1995). Openness is often singled out as the personality trait that is the least work-relevant (Barrick & Mount, 1991). The current study demonstrated that openness predicts work-related outcomes better than generally assumed when contextualization is used. Also, our research utilizes more comprehensive and advanced analytical approaches to assess the utility of contextualized personality measures by directly comparing their relative contributions with relative weight analysis.

Based on the primary study’s results, contextualization did not appear to significantly improve the relationship between conscientiousness and self-reported job performance. Further, the supplemental analyses demonstrated that contextualization did not improve the relationship between conscientiousness and other work-related outcomes. One potential explanation considered in previous research is that lay interpretations of general conscientiousness may overlap heavily with lay representations of work (Shaffer & Postlethwaite, 2012). This pattern has also been observed empirically in previous research (Heller et al., 2009). This conceptual overlap, in combination with Trait Activation Theory (Tett & Burnett, 2003) and Cognitive-Affective Personality System theory (Mischel & Shoda, 1995), suggests that individuals are already inclined to think of the work context when reading general conscientiousness items because the work context activates this trait. However, contextualization did significantly improve the relationship between conscientiousness and GPA in the pilot study, and a markedly lower mean conscientiousness score was observed in the base, non-contextualized group of the pilot study. This could be the result of differences between a student sample and a working sample. However, the unique timing of the pilot study data collection at the beginning of the COVID-19 pandemic suggests that, under certain conditions, unintended contexts may become particularly salient (Ansell et al., 2010; Ng et al., 2021). In other words, participants responding to the base conscientiousness items may have self-imposed a “pandemic” context. Thus, even if a particular personality factor is generally associated with a specific context (i.e., conscientiousness may be generally associated with work and/or school), contextualization may still prove beneficial to ensure this connection is made.

On the other hand, contextualization consistently improved the relationships and predictive ability of openness to experience, which is often viewed as the least work-relevant of the big five personality traits. Taken together, these findings suggest that some constructs (e.g., conscientiousness) may be more strongly, inherently associated with certain contexts (e.g., work) than other constructs (e.g., openness to experience). Along these lines, recent research has begun to explore the characteristics of items that may influence contextualization with the inception of “hidden framings” or implicit frames of reference that originate from item word choice or situational context (Schulze et al., 2021). Future research could help inform which personality traits or other constructs are more likely to benefit from contextualization. In line with Cognitive-Affective Personality System theory and Trait Activation Theory, the current contextualization research emphasizes the importance of psychological characteristics of current situations in impacting behavior (Mischel & Shoda, 1995; Tett & Burnett, 2003). However, Cognitive-Affective Personality System theory also posits that genes and early developmental history play important roles in determining behavior (Mischel & Shoda, 1995). Although less directly related to contextualization, future research should investigate the behavioral impacts of other predictors described by this theory.

Practical Implications

The current research provides actionable best practices for researchers and practitioners alike. First, we provide additional evidence that simple forms of contextualization do improve the predictive ability of common personality assessments. Practically speaking, there is little to no cost associated with contextualizing personality measures through adding tags or altering instructions. The benefits, however, can be significant. This is in contrast to complete contextualization, which may provide strong results but also requires extensive time and resources. Next, we provide initial evidence that tag-level contextualization should generally be preferred over instruction-level contextualization. Finally, although future research should seek to replicate these findings, the results of our pilot study support the use of an alternating approach when applying tags to contextualize items.

As has been discussed in the extant literature, contextualized personality provides a potent predictor of important work and academic outcomes. At the same time, contextualized personality also provides better face validity for applicants when compared to general personality measures (Holtrop et al., 2014; Robie et al., 2017). Although we recommend the use of contextualized personality measures, one caveat should be considered. Generally speaking, the preponderance of evidence suggests that the best practice is to contextualize personality measures when the criteria of interest relate to the specific frame of reference (e.g., school, work). Thus, the bandwidth-fidelity dilemma should be considered in the decision to use contextualized personality measures. In other words, one should be sure to align the specificity of one’s predictors with the specificity of one’s outcomes.

Limitations and Future Directions

Although this study provides meaningful contributions to the extant literature and best practices around contextualized personality, there are several limitations worth considering. As mentioned, one of the downsides to complete contextualization is the validation work necessary to confirm the content domain has not changed due to the modifications. However, that is not to say this problem does not exist in less intrusive forms of contextualization. In fact, previous researchers have questioned whether tag-level contextualization impacts the psychometric properties of a scale (e.g., Robie & Risavy, 2016), although the current study showed support for the psychometric properties of the tag-level contextualized personality measures. In sum, although we endorse the utilization of tag-level contextualization, we also encourage researchers and practitioners to utilize these methods responsibly. We echo Heggestad and colleagues’ (2019) recommendation that authors explicitly describe any changes they make to adjust a scale’s context. Whenever possible, authors should provide their full list of contextualized items and/or contextualized instructions in an appendix or through an online supplement (e.g., housed on the Open Science Framework).

Next, we leveraged relatively stringent requirements to ensure high-quality data from our MTurk sample. These criteria may have had the effect of screening out less conscientious potential participants. Indeed, our participants reported relatively high average conscientiousness compared to samples in past research utilizing the mini-IPIP (Baldasaro et al., 2013; Donnellan et al., 2006). Thus, our findings related to conscientiousness may be partially attributable to an unusually conscientious sample. While our results support the use of tag-level contextualization in research, the generalizability of our findings from a voluntary research study to a high-stakes selection scenario is an open question. To the extent that tags may be more effective than instruction-level contextualization because survey participants pay less attention to instructions, job applicants would likely be more motivated to carefully read instructions; thus, instruction-level contextualization may perform just as well as tags in such situations. Although previous research has supported the resilience of tag-level contextualization in hypothetical high-stakes scenarios, future research should examine whether our findings hold in selection settings (Bing et al., 2004; Schmit et al., 1995).

The last major limitation of the current study is the reliance on self-report data. Previous research has investigated potential common method variance effects that can impact observed effect sizes (Spector & Brannick, 2010). One key issue when considering common method variance is assessing whether the method in question (self-report) was chosen with a clear rationale. Previous research has supported the accuracy of self-reported college GPA as used in our pilot study (Caskie et al., 2014; Cassady, 2000) and job performance as used in the primary study (Williams & Levy, 1992), although other researchers have raised justifiable concerns about self-reported job performance (e.g., Donaldson & Grant-Vallone, 2002). Aside from empirical support, there is also the question of whether the constructs conceptually should be assessed through a particular method. We argue that self-report assessment is clearly most appropriate for personality, emotional exhaustion, and job satisfaction. However, other forms of assessment may be more appropriate for job performance and incivility. Ultimately, future research should seek to replicate our findings with other operationalizations, such as supervisor-reported job performance or coworker-reported incivility.

In addition, new questions are worth considering regarding the way in which tags are implemented. In our pilot, the split tag condition outperformed the non-contextualized condition, while neither the primacy nor recency tag conditions significantly differed from the non-contextualized condition. As this is the first known research that compared placements of contextualized tags, future research should seek to replicate this finding. Further, although we provide possible explanations for the superiority of split tags, these explanations remain untested. We expect certain items may be clearer with a tag appended to the beginning as opposed to the end, or vice versa. Future studies could investigate our grammar hypothesis by conducting think-aloud protocols with participants in order to assess whether there is noticeable conscious cognitive load associated with grammatically unusual or incorrect items (see Charters, 2003, for an introduction to think-aloud protocols). Alternatively, our unconscious scanning explanation could be tested using eye-tracking software and tracking participant attention when presented with multiple items using repetitive tags.

Conclusion

This paper provides an in-depth investigation of accessible modifications practitioners and researchers can utilize to contextualize personality measures to evoke desired frames of reference. Specifically, instruction-level and tag-level contextualizations are directly compared in a within-person, time-lagged design. We provide best practice recommendations for the contextualization of personality measures. Specifically, the results generally support the use of tag-level contextualization and suggest that tags should be added to items in an alternating manner.