A classification of response scale characteristics that affect data quality: a literature review

Quality & Quantity, Jul 2017

Quite a lot of research is available on the relationships between survey response scales' characteristics and the quality of responses. However, it is often difficult to extract practical rules for questionnaire design from the wide and often mixed body of empirical evidence. The aim of this study is to provide, first, a classification of the characteristics of response scales mentioned in the literature that should be considered when developing a scale, and, second, a summary of the main conclusions extracted from the literature regarding the impact these characteristics have on data quality. Thus, this paper provides an updated and detailed classification of the design decisions that matter in questionnaire development, and a summary of what is said in the literature about their impact on data quality. It distinguishes between characteristics that have been demonstrated to have an impact, characteristics for which no impact has been found, and characteristics for which research is still needed to reach a conclusion.


Anna DeCastellarnau

RECSM-Universitat Pompeu Fabra, Ramon Trias Fargas, 25-27, Mercè Rodoreda Building, 08003 Barcelona, Spain
Tilburg University, Tilburg, The Netherlands

Keywords: Data quality; Measurement error; Response scale characteristics; Classification; Literature review

1 Introduction

A challenge for questionnaire designers is to create survey measurement instruments (from now on called: survey questions) that capture the true responses from the population. To do so, they need to create survey questions that not only capture the theoretical concept under evaluation, but that also minimize the impact of their design characteristics on the quality of the responses. Deciding about the right characteristics of a survey question is not a straightforward task.
For instance, 'What is the optimal number of response options to use?' or 'Shall I label all options in the scale?' are recurrent questions without a clear answer in the field of questionnaire design and survey methodology. However, making the right decisions is crucial if one wants to minimize their impact on the survey's data quality (Alwin 2007; Dolnicar 2013; Krosnick 1999; Krosnick and Presser 2010; De Leeuw et al. 2008; Saris and Gallhofer 2014; Schuman and Presser 1981). Within the Total Survey Error framework (Groves et al. 2009), the way a survey question is designed has a direct influence on the responses given to that question, and impacts the overall survey's data quality. The observational gap between the ideal measurement and the response obtained is defined as measurement error. Studies assessing the influence of questions' characteristics on measurement error show that these characteristics explain between 36 and 85% of its variance (Andrews 1984; Rodgers et al. 1992; Saris and Gallhofer 2007; Scherpenzeel and Saris 1997). Saris and Revilla (2016, p. 4) state that if measurement errors are ignored, "one runs the risk of very wrong conclusions with respect to relationships between variables and differences in relationships across countries". Among the wide range of components that influence the design of a survey question, the choice of the response scale is often the most important decision to assure good measurement properties. For instance, Andrews (1984) showed that the number of categories had the biggest effect on measurement quality, followed by the provision or not of an explicit "don't know" option. Moreover, the design of the scale is often the most complex in terms of the number of decisions that influence the way respondents interpret the options provided. The literature on how to design scales is wide.
Most research is directed to the study of a specific set of design characteristics, like the optimal number of points (Preston and Colman 2000; Revilla et al. 2014) or the kind of labels to use (Eutsler and Lang 2015; Moors et al. 2014; Weijters et al. 2010). Some literature reviews have been conducted to summarize all these findings (e.g. Dolnicar 2013; Krosnick and Fabrigar 1997; Krosnick and Presser 2010). However, these summaries focus on the most commonly used characteristics and do not provide an accurate guide to all the design decisions that developing a scale can require. Moreover, one can get quite lost because of the different classification strategies and the different terms researchers use to refer to the same aspects. In this paper, I aim to provide an updated and detailed classification of characteristics to be used in the development of scales, in combination with their influence on data quality. Specifically, I focus on closed, ordinal, forced-choice response scales because, in contrast to multiple-choice, open and nominal scales, many more subjective design decisions can take place. To make such a classification, I conducted a review of the literature with two main objectives: (1) classify the characteristics of response scales, and (2) assess whether evidence has been found in the literature regarding the impact of those characteristics on data quality. The remainder of this paper is organized as follows: Sect. 2 presents the methodological procedure followed to review the literature and make the classification. Section 3 presents the findings from the literature review following the classification. Finally, Sect. 4 concludes with the main findings of this research.

2 Methodological procedure

I conducted a review of the literature looking for evidence about the relevance of the characteristics of closed and ordinal response scales.
As a starting point, I took the list of characteristics developed by Saris and Gallhofer (2007) and further updated in Saris and Gallhofer (2014). They structured this list in characteristics which group different mutually-exclusive choices. For instance, the characteristic 'labels of categories' groups three possible choices: no labels, partially-labelled or fully-labelled. In total, they considered more than 280 possible choices, among which 40 are related to the design of the scale and belong to 17 characteristics. Table 2 in the Appendix provides the list of response scales' characteristics and the choices considered by these authors. This list covers most characteristics used in the development of scales for face-to-face surveys that use showcards as a visual aid for the respondent. Its major drawback is that it lacks characteristics specific to the design possibilities offered by other modes of survey administration, such as the different formats of scales' visual presentation available in web surveys. From this preliminary list, I conducted an in-depth search for publications that mention these 40 design choices in academic journals or book chapters. While revising the literature I focused, on the one hand, on identifying other characteristics and design choices; on the other hand, I searched for empirical evidence and/or theoretical arguments in the literature that assess whether these design choices have an impact on data quality or not. In relation to the empirical evidence, it is often difficult to extract general conclusions since studies differ in the type of questions under examination, the sample characteristics, the mode of administration, and especially the type of quality indicators used. Moreover, there are clear dependencies between characteristics.
However, in this paper my goal is to identify whether there is any kind of empirical evidence in the literature; thus, I will not differentiate on the study characteristics, on the sign of the effect found, or on the kind of indicators. In fact, a wide range of measurement quality indicators, or their complement, measurement error, are considered in the literature. Hereafter, I considered different types of response style bias, like extreme and middle responding and acquiescence, item non-response, and satisficing bias as indicators of measurement error. Furthermore, I considered different measures of reliability and validity as indicators of measurement quality. The revised literature often uses different terms for the same types of design choices. To provide a clear summary of the literature review, an initial step is to harmonize the terminology. When necessary, I therefore renamed characteristics and added more possible design choices. I thereby also identified the gaps, i.e. non-studied variations that should also be considered. Subsequently, as illustrated in Fig. 1, I group similar sets of related characteristics within families, and within a characteristic the different mutually-exclusive choices one could take. Next, using this classification, I summarize the results of the literature review.

3 The findings from the literature review

By the end of this process, I had reviewed 140 publications, of which I used 88, and identified 83 different design choices related to the design of response scales, i.e. 43 more than Saris and Gallhofer's preliminary list. First, I classified those mutually-exclusive choices into 23 different characteristics. Finally, I classified these into four main families of related characteristics.
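The response-style indicators used above as measures of measurement error (extreme and middle responding, acquiescence, item non-response) are straightforward to compute from raw data. A minimal sketch in Python; the function and threshold choices are illustrative, not taken from the reviewed literature:

```python
def response_style_indicators(responses, scale_min=1, scale_max=5, agree_codes=(4, 5)):
    """Compute simple response-style shares for one respondent's answers
    to a battery of items on the same scale. None marks item non-response."""
    answered = [r for r in responses if r is not None]
    n = len(answered)
    midpoint = (scale_min + scale_max) / 2  # only meaningful for odd-length scales
    return {
        # share of answers at either scale endpoint
        "extreme": sum(r in (scale_min, scale_max) for r in answered) / n,
        # share of answers at the neutral midpoint
        "middle": sum(r == midpoint for r in answered) / n,
        # share of agreeing answers: a crude acquiescence proxy
        "acquiescence": sum(r in agree_codes for r in answered) / n,
        # share of skipped items
        "item_nonresponse": (len(responses) - n) / len(responses),
    }

print(response_style_indicators([5, 4, 3, None, 1, 5]))
```

In practice such shares are compared across scale designs (e.g. fully-labelled vs endpoint-labelled) rather than interpreted in isolation.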
Table 1 presents this classification and provides information on the four possible scenarios regarding its impact on data quality: (1) whether a characteristic has been empirically demonstrated to have an impact on data quality (Yes); (2) whether it has been shown not to impact data quality (No); (3) whether it has not been studied (NS); or (4) whether its impact is not yet clear enough to make a conclusion (NC). In what follows, a detailed description of each characteristic and its design choices, together with the findings related to their influence on data quality, is provided using the classification presented in Table 1. The description below follows the detailed summary provided in Table 3 in the Appendix, which also provides all the theoretical and empirical references used, as well as the indicators used to assess the impact on data quality for each study.

3.1 The scales' conceptualization

3.1.1 Scales' evaluative dimension

The evaluative dimension of the scale comes from the theoretical underlying concept that is intended to be measured by the survey question. The basic distinction is between agree–disagree and item- (or construct-) specific scales. Agree–disagree scales can be used to evaluate the level of agreement or disagreement towards a statement or a stimulus. For instance, asking "Do you agree or disagree that your health is good?" and providing the respondents with the options "agree" and "disagree". This type of scale has received a lot of attention from researchers. These scales are simple to design (Brown 2004; Schaeffer and Presser 2003) but they require a major cognitive effort from respondents (Kunz 2015). Empirical evidence has shown presence of acquiescence bias, i.e. the propensity to agree, in such scales (Billiet and McClendon 2000). Item-specific scales can be used to measure variables for which the scale options directly refer to the theoretical concept under evaluation.
For instance, when asking "How good or bad is your health?" an item-specific scale would provide the respondents with the options "good" and "bad". Comparing item-specific with agree–disagree scales, studies have shown that item-specific scales provide higher measurement quality (Alwin 2007; Krosnick 1991; Revilla and Ochoa 2015; Saris et al. 2010; Saris and Gallhofer 2014). The choice of the scale's evaluative dimension therefore has an impact on data quality.

3.1.2 Scales' polarity

Every concept has a theoretical range of polarity, which can be either bipolar or unipolar. While bipolar constructs range from positive to negative with a neutral midpoint, unipolar constructs range from zero to some maximum level with no neutral midpoint. Scales' polarity refers to the conceptual extremes of the labels used in the scale. A bipolar scale uses the two theoretical poles of the bipolar concept being measured at the scale's extremes, for instance, "satisfied" and "dissatisfied". A unipolar scale uses only one pole of the concept being measured for one extreme and its zero point for the other, for instance, "important" and "not important at all". This distinction is relevant because, in case a unipolar scale is used to measure a bipolar concept, the scale would be one-sided towards the positive or the negative pole. Moreover, it is important to consider, since specific characteristics, like the use of a midpoint or of a symmetric scale, depend on whether the scale is provided as unipolar or bipolar. While bipolar scales ask about the neutrality, the direction and the intensity of an opinion, unipolar scales only ask about the extremity or intensity.
Moreover, bipolar scales have the disadvantage that some respondents are reluctant to choose negative responses (Kunz 2015), and reliability is somewhat higher in unipolar than in bipolar scales (Alwin 2007). However, I have not found more studies assessing the impact of the scales' polarity on data quality. Thus, more research is needed to confirm its relevance.

3.1.3 Concept-scale polarity agreement

The distinction between the concept's and the scale's polarity is key, since the non-differentiation between bipolar and unipolar attributes has resulted in "misinterpretations of the empirical findings" (Rossiter 2011, p. 105). Even so, when designing survey questions, this characteristic has received quite little attention compared to other aspects of survey questions. It has been shown that this characteristic has an impact on response styles (van Doorn et al. 1982) but no clear impact on measurement quality (Saris and Gallhofer 2007). Thus, more research is needed about its impact on data quality. Following the classification of Saris and Gallhofer (2007), the design of concept-scale polarity can be: both bipolar, both unipolar, or bipolar concept with a unipolar scale. In practice, even if, theoretically, unipolar concepts should be designed using unipolar scales, we also find bipolar scales. For instance, a scale ranging from "Completely unimportant" to "Completely important" would be a unipolar concept with a bipolar scale. So far, it has not been studied whether this has an impact or not, and whether the formulation of these scales affects their interpretation, but we should account for this reality. I therefore propose to add this choice to the classification.

3.2 The type of scale and its length

3.2.1 Types of scales

There are multiple types of continuous scales.
I distinguish four main types: (1) absolute open-ended quantifiers, a type of numerical text input scale, used to ask respondents for an open and numerical answer; (2) relative open-ended quantifiers, a similar type of numerical text input scale, which require a previous specification of the meaning of a standard value; (3) relative metric scales, a kind of scale that also requires the specification of a standard to give relative evaluations; in this case, however, respondents are asked to draw a line relative to the standard provided instead of giving a numerical answer; and (4) absolute metric scales, where respondents should select a point on a continuum, typically presented as a straight horizontal or vertical line with specified anchors at each endpoint. Rounding is the major problem of continuous numeric options: it has been shown that respondents create their own grouped response categories, often using exact multiples of 5 (Liu and Conrad 2016; Tourangeau et al. 2000). This does not apply to relative metric scales which, in contrast, require the lines' lengths to be measured later (Saris and Gallhofer 2014). Relative scales are argued to be more burdensome to respondents, who must give not an absolute evaluation but a relative answer given the standard value specified (Krosnick and Fabrigar 1997). Moreover, the specification of an appropriate standard is sometimes hard, since it is important to use a standard that is "part of actual experience for all respondents" and "perceived as distinct from the 0 point" (Schaeffer and Bradburn 1989, p. 412). The impact on measurement error of using these types of scales has been studied by comparing absolute open-ended quantifiers with absolute metric scales, with mixed results: Liu and Conrad (2016) find non-significant differences in item non-response, and Couper et al. (2006) find higher item non-response for the metric scale. Scales can also provide a limited number of categorical options.
I distinguish four main types of categorical scales: (1) dichotomous scales, which only provide two substantive response options; typical dichotomous scales are yes–no and true–false; (2) rating scales, which provide three or more categorical options; (3) closed quantifiers, which are mainly used for objective variables such as the frequency of activities; omitting the response alternatives, such a scale becomes an open-ended quantifier; and (4) branching scales, which are used to simplify the respondents' task when answering long bipolar scales. Branching scales consist of dividing the response task into two steps. First, the respondents are asked about the direction of their judgment, i.e. the neutral alternative versus the extreme sides of the bipolar scale. Second, they are asked about the extremity or intensity of their judgement on the selected side. Rating scales require more interpretative effort, which may harm the consistency of the responses compared to dichotomous scales (Krosnick et al. 2005), whereas branching scales have been argued to be useful to explore the neutral alternatives and to provide large fully-labelled scales without a visual presentation (Schaeffer and Presser 2003). A handicap of closed quantifiers, compared to open quantifiers, is that the specified ranges inform respondents about the researcher's knowledge of (or expectations about) the real world (Schwarz et al. 1985; Sudman and Bradburn 1983). In this direction, Revilla (2015, p. 236) recommends, for sensitive questions, providing "answer categories with high enough labels such that respondents do not feel that their behaviour is not normal", and, for non-sensitive questions, to "use labels following the expected population distributions such that respondents can use the middle of the scale as a reference point as to what is the norm, and evaluate their own behaviour as lower or higher than the average".
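The two-step branching task described above can be sketched in a few lines of Python. The mapping of direction plus intensity onto a single bipolar code is my own illustrative convention, not a scheme from the reviewed studies:

```python
def branching_answer(direction, intensity=None):
    """Combine the two branching steps into one bipolar code in -3..+3.

    Step 1: direction is "negative", "neutral" or "positive".
    Step 2: intensity (1=slightly, 2=somewhat, 3=extremely) is asked
            only when a non-neutral side was chosen in step 1.
    """
    if direction == "neutral":
        return 0  # neutral respondents skip step 2 entirely
    if intensity not in (1, 2, 3):
        raise ValueError("intensity 1-3 required for a non-neutral direction")
    # sign encodes the direction, magnitude the intensity
    return intensity if direction == "positive" else -intensity

print(branching_answer("neutral"))
print(branching_answer("negative", 2))
```

The design choice this illustrates is that the respondent never sees the full 7-point scale at once: each step presents at most three options, which is why branching can deliver long, fully-labelled bipolar scales even without a visual presentation.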
Looking at its impact on measurement quality, 2-point scales usually perform worse than scales with more categories, with the exception of 3-point scales (Krosnick 1991; Lundmark et al. 2016; Preston and Colman 2000). Only Alwin (2007) reports that dichotomous scales provide higher reliabilities than rating scales and absolute metric scales. In contrast, some studies find evidence of branching scales producing higher measurement quality than rating scales (Krosnick 1991; Krosnick and Berent 1993). When rating scales are compared to continuous scales, like absolute metric scales or open-ended quantifiers, evidence is mixed: continuous scales are more reliable in Saris and Gallhofer (2007), but in Couper et al. (2001) and Miethe (1985) they provided higher item non-response and lower reliability, respectively, than rating scales, and no differences between the two on measurement quality have been found by Koskey et al. (2013). Comparing rating to metric scales, the latter appeared less reliable and led to higher item non-response in the studies of Cook et al. (2001), Couper et al. (2006) and Krosnick (1991); however, others find comparable impact between the two (Alwin 2007; Funke and Reips 2012; McKelvie 1978). Finally, Al Baghal (2014b) compares closed with open-ended quantifiers, showing non-significant differences in measurement quality. Overall, the decision on the type of scale to provide has an impact on data quality and should be considered carefully when designing survey questions.

3.2.2 Scales' length

The length of the scale is one of the key issues in scale development. As Krosnick and Presser (2010, p. 269) say, "the length of scales can impact the process by which people map their attitudes onto the response alternatives". The minimum and maximum possible values are used to evaluate the length of continuous scales. This characteristic has received fairly little study.
Reips and Funke (2008) argue that differences in the length of metric scales may depend on the device's screen size and resolution, while Saris and Gallhofer (2007) find a significant effect of the maximum possible value to answer in continuous scales on measurement quality. The number of categories is used to evaluate the length of categorical scales. Among the characteristics of categorical scales, the number of categories is one of the most studied and complex design decisions: while a two-point scale allows only the assessment of the direction of the attitude, a three-point scale with a midpoint allows the assessment of both the direction and the neutrality, and even more categories allow the assessment of its intensity or extremity. Furthermore, while too few categories can fail to discriminate between respondents with different underlying opinions, too many categories may reduce the clarity of the meaning of the options and limit the capacity of respondents to make clear distinctions between them (Krosnick and Fabrigar 1997; Schaeffer and Presser 2003). The results regarding its impact on data quality are mixed. Most evidence suggests using more than 2 points to increase measurement quality (e.g. Andrews 1984). Some find evidence in favour of using 5–7 points (Komorita and Graham 1965; Rodgers et al. 1992; Scherpenzeel and Saris 1997). Others argue that options from 7 up to 10 points should be preferred (Alwin and Krosnick 1991; Lundmark et al. 2016; Preston and Colman 2000). Some others argue that even more categories, i.e. 11 points, can provide better measurements (Alwin 1997; Revilla and Ochoa 2015; Saris and Gallhofer 2007). Finally, others do not find differences across different numbers of points (Aiken 1983; Bendig 1954; Jacoby and Matell 1971; Matell and Jacoby 1971; McKelvie 1978). More recently, research has looked at the specific circumstances of the questions when evaluating the impact of the number of points.
Some find, when distinguishing between item-specific and agree–disagree scales, that quality does not improve for agree–disagree scales with more than 5 points (Revilla et al. 2014; Weijters et al. 2010), while for item-specific scales it goes up between 7 and 11 points (Alwin and Krosnick 1991; Revilla and Ochoa 2015). Similarly, Alwin (2007) argues that the optimal number of points in a scale should be considered in relation to the scale's polarity, and shows that the use of 4-point scales improved the reliability of unipolar scales, while 2-, 3- and 5-point scales improved the reliability of bipolar scales. This summary has clearly shown that the length of the scale is a characteristic to consider.

3.3 The scales' labels

3.3.1 Verbal labels

Verbal labels are words used as a reference to clarify the meanings of the different scale points and their interval nature, and to reduce ambiguity (Alwin 2007; Krosnick and Presser 2010), although it has been found that fully labelling all points increases the cognitive effort of reading and processing all options (Krosnick and Fabrigar 1997; Kunz 2015). Studies about its effects on response style bias show that acquiescence is higher and extreme responding is lower with fully-labelled scales (Eutsler and Lang 2015; Moors et al. 2014; Weijters et al. 2010). Other studies about its impact show higher reliability of endpoint-labelled scales compared to fully-labelled scales (Andrews 1984; Rodgers et al. 1992), while the majority show that labelling all points in the scale has a positive impact on reliability (Alwin 2007; Alwin and Krosnick 1991; Krosnick and Berent 1993; Menold et al. 2014; Saris and Gallhofer 2007). Thus, the impact on data quality is clear. Usually a distinction between fully-labelled, partially-labelled and not at all labelled is made. However, there are multiple ways to design a partially-labelled scale and these should also be considered when assessing its effects on data quality.
Thus, I propose the following distinction to cover the possible design choices in surveys: scales not at all labelled, only labelled at the end-points, labelled at the end- and the midpoints, labelled at the end- and more points but not all, and fully-labelled.

3.3.2 Verbal labels' information

Verbal labels can provide different lengths and amounts of information. The more information is provided in the labels, the less information is needed in the request. Saris and Gallhofer (2007) distinguish between short labels and complete sentences and conclude that reliability improved when short labels instead of sentences are used. But still, more research is needed to assess the impact of this characteristic on data quality. The length of a label does not actually provide sufficient advice on how to design them. For instance, even if using complete sentences may improve reliability, are very long labels still preferable? It is for this reason that I believe what affects data quality may be the amount of information provided in the label rather than its length. Thus, I propose the following differentiation. Non-conceptual labels require a previous specification of the type of measurement concept. For instance, the labels "Not at all" and "Completely" cannot be used without a previous specification of the concept, like in the form of a question: "How satisfied are you with your job?". Scales can otherwise provide conceptual labels like "Not at all satisfied". Verbal labels can also provide information about the object and/or the subject under evaluation. An example of an objective label would be "Not at all satisfied with my job", and of a subjective label, "I am not at all satisfied". Finally, a full-informative label would be "I am not at all satisfied with my job".

3.3.3 Quantifier labels

Two types of labels for closed quantifier scales can be distinguished. First, vague quantifier labels, which are known to be prone to different interpretations, e.g.
"often" can mean "once a week" for one respondent and "once a day" for another (Pohl 1981; Saris and Gallhofer 2014). In terms of its impact on data quality, no clear conclusions can be extracted so far: Al Baghal (2014b) shows that measurement quality is not affected with vague labels for closed quantifiers compared to open-ended responses, while Al Baghal (2014a) finds higher levels of validity than in open-ended scales. Second, closed-range (or interval) quantifier labels, compared to vague quantifiers, are argued to be more precise and less prone to different interpretations (Saris and Gallhofer 2014). However, when providing closed-range quantifiers, respondents may use the frame of reference provided by the scale in estimating their own behaviour (Schwarz et al. 1985). Selecting unbiased ranges that allow respondents to use the middle of the scale as a reference point is preferable (Revilla 2015). More research is needed to shed light on whether the use of vague or closed-range quantifiers impacts data quality or not.

3.3.4 Fixed reference points

Fixed reference points are verbal labels used in a scale to prevent variations in the response functions and leave no doubt about the position of the reference point in the subjective mind of the respondent (Saris 1988; Saris and Gallhofer 2014). For instance, "always" and "never" can be fixed reference points on objective scales, and the words "not at all", "completely", "absolutely" and "extremely" on subjective scales. Usually, these are provided at the end-points of a scale. However, with closed-range quantifiers usually all labels are fixed reference points (e.g. "from 1 to 2 h"), and in bipolar scales, the midpoint alternative is also a fixed reference point. The use of fixed reference labels makes the scale the same and comparable for all respondents (Saris and De Rooij 1988).
Moreover, it has been shown to have a positive impact on measurement quality (Revilla and Ochoa 2015; Saris and Gallhofer 2007), and that when fixed reference points are not provided, respondents use different scales (Saris and De Rooij 1988).

3.3.5 Order of verbal labels

The ordering of verbal labels can be from negative (or passive)-to-positive (or active) or from positive-to-negative. The order of the verbal labels is an important characteristic since it provides an additional source of information to the respondents (Christian et al. 2007a). Moreover, scales ordered from positive-to-negative tend to elicit quicker responses, which increases the chance that respondents do not process all options consciously (Kunz 2015). Studies find that the order does impact measurement error and response style bias.

3.3.6 Nonverbal labels

Nonverbal labels are numbers, letters or symbols instead of words attached to the options in the scale. The most commonly used are numbers and symbols, e.g. radio and checkbox buttons. Krosnick and Fabrigar (1997) suggest combining numerical and verbal labels. Similarly, others suggest that numbers may help respondents to decide whether the scale is supposed to be unipolar or bipolar (Schwarz et al. 1991; Tourangeau et al. 2007). However, respondents may take longer to submit an answer when numerical labels are provided, since they are an additional source of information to process (Christian et al. 2009). Regarding its effect on data quality: Moors et al. (2014) show that scales without numbers and with only verbal end-labels evoked more extreme responses than those with numbers, while Christian et al. (2009) and Tourangeau et al. (2000) conclude that response style is unaffected by the use or not of numbers in the scale. Thus, slightly more evidence points toward the choice of nonverbal labels not affecting data quality.
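The labelling characteristics reviewed so far could be recorded together as questionnaire metadata, which makes the mutually-exclusive choices explicit and machine-checkable. A minimal sketch; all field and method names are hypothetical, not a standard from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class ScaleSpec:
    """Illustrative record of response-scale labelling characteristics
    (field names are my own, not an established coding scheme)."""
    polarity: str                    # "unipolar" or "bipolar"
    n_points: int                    # scale length, for categorical scales
    verbal_labels: dict = field(default_factory=dict)   # point -> label text
    numerical_labels: list = field(default_factory=list)
    fixed_reference_points: list = field(default_factory=list)

    def labelling(self):
        """Classify the verbal-labelling choice from the labels recorded."""
        if not self.verbal_labels:
            return "not labelled"
        if len(self.verbal_labels) == self.n_points:
            return "fully-labelled"
        return "partially-labelled"

# an 11-point unipolar scale with fixed reference points at the end-points
spec = ScaleSpec(
    polarity="unipolar", n_points=11,
    verbal_labels={0: "Not at all satisfied", 10: "Completely satisfied"},
    numerical_labels=list(range(11)),
    fixed_reference_points=["Not at all satisfied", "Completely satisfied"],
)
print(spec.labelling())
```

A fuller version would also encode the finer partial-labelling variants proposed above (end- and midpoints, end- and more points but not all).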
3.3.7 Order of numerical labels

The order of numerical labels can be from low-to-high or from high-to-low. Among the few studies found on its impact on response style, two conclude that, when negative numerical labels are provided compared to when all numbers are positive, the differences in the response distributions are significant (Schwarz et al. 1991; Tourangeau et al. 2007), while Reips (2002) concludes that it does not influence the answering behaviour of participants. Since there is no existing classification, I propose the following distinction to account for the different choices in surveys: numerical labels ordered from negative-to-positive, from positive-to-negative, from 0-to-positive, from 0-to-negative, from positive-to-0, from negative-to-0, from 1 (or higher)-to-positive, or from positive-to-1 (or higher).

3.3.8 Correspondence between numerical and verbal labels

The order of numerical labels is of special relevance when these are combined with verbal labels. Correspondence between numerical and verbal labels refers to the extent to which the order of the numerical labels matches the order of the verbal labels. Numerical labels should reinforce the meaning and the polarity of verbal labels (Krosnick 1999; Krosnick and Fabrigar 1997; O'Muircheartaigh et al. 1995; Schaeffer 1991; Schwarz et al. 1991). However, it should be considered that a more negative connotation is given to a label associated with a negative number (Amoo and Friedman 2001; Schwarz and Hippler 1995). Following Saris and Gallhofer (2007), the level of correspondence is classified into: high correspondence, which refers to combinations of numerical and verbal labels that match perfectly, e.g.
a bipolar scale where numbers are ordered from −5 to +5 and verbal labels range from "Extremely bad" to "Extremely good", or a unipolar scale where numbers range from 0 to 10 and labels from "Not at all" to "Completely"; low correspondence, which refers to combinations where the lower numbers are related to positive verbal labels or vice versa, e.g. a scale numbered from 0 to 10 and labelled from "Good" to "Bad"; and medium correspondence, which refers to any other combination of numerical and verbal labels in which the order of the labels matches (negative/low and positive/high) but not perfectly. Among the limited empirical evidence found, only one study concludes that low correspondence does not impact the distribution of responses (Christian et al. 2007a), while two conclude that reliability improves with high correspondence between the verbal and the numerical labels in the scale (Rammstedt and Krebs 2007; Saris and Gallhofer 2007), i.e. there is an impact.

3.3.9 Scales' symmetry

Symmetry is a specific characteristic of bipolar scales. Symmetric scales ensure that the number of labels is the same on the positive and on the negative side. Asymmetric scales assume previous knowledge about the population; otherwise they would be biased (Saris and Gallhofer 2014). However, the impact on measurement error is not clear: while Scherpenzeel and Saris (1997) find no effect (or a very small one) of symmetric scales on reliability and validity, Saris and Gallhofer (2007) find a positive effect.

3.3.10 Neutral alternative

The neutral alternative is also a characteristic of bipolar scales, whereby respondents are not forced to make a choice in a specific direction. Neutral alternatives can be provided implicitly or explicitly. Explicit neutral alternatives are usually labelled, e.g. "neither A nor B", while implicit neutral alternatives do not need to be labelled for their neutral connotation to be understood, i.e.
in a bipolar scale with an uneven number of points, the midpoint will be considered neutral even if it is not labelled. Some argue that providing a neutral alternative can increase the risk of survey satisficing (Bishop 1987; Kulas and Stachowski 2009). Others argue that not providing a neutral point forces respondents to select an option that does not reflect their true attitudinal position (Saris and Gallhofer 2014; Sturgis et al. 2014). Finally, Tourangeau et al. (2004) argue that the neutral point of a scale can be interpreted as the most typical position and used to make relative judgements. Regarding the impact on response styles, studies find that including a neutral point increases acquiescence and lowers the propensity towards extreme responding (Schuman and Presser 1981; Weijters et al. 2010). In terms of the impact on measurement quality, most evidence suggests that providing the neutral alternative impacts measurement quality (Alwin and Krosnick 1991; Malhotra et al. 2009; Saris and Gallhofer 2007; Scherpenzeel and Saris 1997). Only Andrews (1984) finds that the effect is very small.

3.3.11 "Don't know" option

The "don't know" (or "no opinion") option is a non-substantive response alternative. It can also be implicit or explicit. An implicit "don't know" option is an admissible answer not explicitly offered to the respondent, which requires an interviewer to record it. An explicit "don't know" option is directly provided to the respondent as an additional response alternative. Providing an explicit "don't know" option depends on whether researchers believe that respondents may truly have no opinion on the issue in question (Dolnicar 2013; Kunz 2015). However, many authors argue that providing the "don't know" option leads to incomplete, less valid and less informative data (Alwin and Krosnick 1991; Gilljam and Granberg 1993; Krosnick et al. 2002, 2005; Saris and Gallhofer 2014).
Whether providing a "don't know" option explicitly or implicitly impacts data quality is not clear: some authors show that providing it explicitly impacts data quality (Andrews 1984; De Leeuw et al. 2016; McClendon 1991; Rodgers et al. 1992), while others conclude that there is no support for this impact (Alwin 2007; McClendon and Alwin 1993; Saris and Gallhofer 2007; Scherpenzeel and Saris 1997).

3.4 The scales' visual presentation

3.4.1 Types of visual response requirement

The type of visual presentation requires higher or lower effort from the respondent when answering. The following types of visual response requirements are distinguished in the literature: (1) point-selection is the most standard way to present scales: either a continuous line or categorical options are provided, from which the respondent should point and select the desired choice; (2) the slider is a type of linear implementation in which the respondent moves a marker to give a rating; (3) the text-box input is a typing space where respondents can type in their answer; (4) the drop-down menu shows the list of response options only after clicking on a rectangular box, i.e. before clicking the respondent does not see the whole list of options, and sometimes respondents have to scroll down to select the desired option; and (5) drag-and-drop refers to the technique where respondents need to drag an element (e.g. the item or the response) to the desired position. Comparing point-selection to sliders, the former is less demanding but also less fun and engaging (Funke et al. 2011; Roster et al. 2015). In this line, Cook et al. (2001) and Roster et al. (2015) compare sliders with radio buttons and find non-significant differences in reliability and item-nonresponse, respectively. The text-box format is closer to how questions are asked on the telephone, but does not provide a clear sense of the range of the options (Buskirk et al. 2015; Christian et al. 2009).
Comparing the use of text-box input with point-selection or sliders, some demonstrate that item-nonresponse and response style are comparable across the three types (Christian et al. 2007b), while others show that there is an impact on item-nonresponse and response style between the three (Buskirk et al. 2015; Christian et al. 2009; Couper et al. 2006). Christian et al. (2007b) argue that drop-down menus are more cumbersome than text-box input when a large number of options is listed. In this line, other authors argue that drop-down menus are more burdensome to respondents because they require the added effort of clicking and scrolling (Couper et al. 2004; Dillman and Bowker 2001; De Leeuw et al. 2008; Reips 2002). Liu and Conrad (2016) compare drop-down menus with sliders and text-box input and find that item-nonresponse is not significantly different. Similarly, when drop-down menus are compared to point-selection, comparable results in terms of response style and item-nonresponse are found (Couper et al. 2004; Reips 2002). Finally, drag-and-drop yields higher item-nonresponse than point-selection, but it is argued to prevent systematic response tendencies, since respondents need more time to process the task they are required to do (Kunz 2015). Overall, the evidence provided by these studies suggests that the type of visual response requirement has no impact on data quality.

3.4.2 Sliders' marker position

The slider marker position is a specific characteristic of sliders. Markers can be placed at the top or left side, at the bottom or right side, in the middle, or outside of the slider. A challenge when designing a slider is how to handle the starting position of the marker and identify non-respondents (Funke 2016).
The impact of this characteristic on measurement error is not yet clear, since only one study looks at its effect on data quality, finding that higher nonresponse and higher response style bias occur when the marker is placed in the middle or at the right side of the slider compared to the left side (Buskirk et al. 2015).

3.4.3 Scales' illustrative format

Sometimes scales are presented in an illustrative format instead of the traditional one. Usual illustrative formats are ladders (or pyramids), to indicate levels of some aspect, and thermometers, to indicate degrees of feelings. Other illustrative formats are clocks, to indicate the timing of things, or dials, to enter numerical values. These types of scales usually require lengthy introductions and not all points can be labelled, but they are useful to visually present numerical scales with many points (Alwin 2007; Krosnick and Presser 2010; Sudman and Bradburn 1983). The few studies available suggest that this characteristic has an impact on data quality: thermometer scales provide lower measurement quality than ladders or radio button scales (Andrews and Withey 1976; Krosnick 1991), ladder scales provide better measurement quality than traditional scales (Levin and Currie 2014) but lower validity than other illustrative formats (Andrews and Crandall 1975), and responses differ significantly depending on whether a pyramid or an onion format is used (Schwarz et al. 1998).

3.4.4 Scales' layout display

The layout display of the answer options can be horizontal, vertical or nonlinear. Nonlinear scales can, for instance, present the answer options in different columns. Tourangeau et al. (2004, p. 372) argue that, in vertically oriented scales, respondents usually expect the positive points to appear first at the top. However, Toepoel et al. (2009, p. 522) argue that respondents read more naturally in a horizontal format.
Two studies looked at the effect of the scales' layout display on response styles, and both find that presenting the scales in a horizontal, vertical or nonlinear layout produces significant differences in the responses (Christian et al. 2009; Toepoel et al. 2009), i.e. it has an impact.

3.4.5 Overlap between verbal and numerical labels

Overlap between labels is a characteristic considered by Saris and Gallhofer (2014) for which no relevance has been found while reviewing the literature. This characteristic is intended to indicate whether the verbal labels used in a horizontal scale are clearly connected to one nonverbal label each, or whether they overlap with several of them. More research is needed on this characteristic to assess whether or not it is relevant when designing visually presented scales.

3.4.6 Labels' visual separation

Labels can be visually separated by adding more space between them, by separating lines, or by placing the options in boxes. The aim is to provide a visual distinction between the labels of the scale. For instance, researchers may want to visually separate the "don't know" option from the substantive responses to make a clear differentiation. However, Christian et al. (2009) and Tourangeau et al. (2004) argue that visually separating some of the labels may encourage respondents to select them more often. The impact on data quality is clear: De Leeuw et al. (2016) show that separating the non-substantive option reduces item-nonresponse and provides higher reliability, and Christian et al. (2009) and Tourangeau et al. (2004) show that separating the non-substantive option leads to significant differences in the responses, while this does not happen when the midpoint is separated. The current distinction in Saris and Gallhofer (2014) is whether or not the labels are separated in different boxes.
However, given that I found more choices in the literature, I propose to distinguish between visually separating the non-substantive option, the neutral option, the end-points, all points, or none of the points of the scale.

3.4.7 Labels' illustrative images

Illustrative nonverbal labels can be used instead of, or in combination with, verbal and numerical labels when these are provided visually to the respondent. The usual illustrative labels are feeling faces (also called smileys), which attach images of different facial expressions (e.g. from sad to happy). They are easy to format and they attract the attention of respondents (Emde and Fuchs 2013). Moreover, they have the advantage of being easier for respondents to identify than verbal labels, because they eliminate the barrier of mapping feelings into words (Kunin 1998). The evidence on their effect on data quality indicates that there is no impact: while Derham (2011) shows that nonresponse is significantly higher for faces scales than for sliders and point-selection scales, Andrews and Crandall (1975) and Emde and Fuchs (2013) show that the differences in the responses between smiley scales and radio buttons are non-significant. For the sake of completeness, and to capture the different formats found in the literature, I propose to distinguish two other types of labels' illustrative images: other human symbols, like thumbs and manikins, and other nonhuman symbols, like stars or hearts.

4 Conclusions

This paper provides a complete and updated classification of the characteristics, and their possible design choices, considered in the literature when designing forced-choice, closed and ordinal response scales. This classification is summarized in Table 1 together with the main conclusion of the literature review, which indicates whether evidence of each characteristic's impact on data quality has been shown in the literature.
Three main limitations of this study should be kept in mind. First, to assess whether or not there is an impact on data quality, I did not consider the different sample sizes or the power of the studies; I considered the absolute number of studies. Further research could assign weights to the different studies. Second, publication bias in favour of studies that found an effect of a certain characteristic is likely present, i.e. the number of characteristics with an impact may be overestimated. Third, I did not aim to provide information to improve the design of response scales; thus, the results on the impact are reported independently of whether the effect is positive or negative. From Table 1 the following main conclusions can be extracted. Eleven characteristics have an impact on data quality: the scales' evaluative dimension, the type of scale, the length of the scales, the use of verbal labels, the use of fixed reference points, the order of numerical labels, the correspondence between numerical and verbal labels, the use of a neutral alternative, the scales' illustrative format, the visual layout display of the scales, and the labels' visual separation. Four characteristics do not have an impact on data quality: the order of the verbal labels, the use of nonverbal labels, the type of visual response requirement, and the labels' illustrative images. Further research is needed for eight characteristics, to know whether or not the scales' polarity, the agreement between the concept and the scale's polarity, the information provided by verbal labels, the quantifier labels, the scales' symmetry, the use of a "don't know" option, the slider marker position, and the overlap between verbal and numerical labels have an impact on data quality. What is clear from the large body of research presented here, and its often mixed results, is that characteristics interact with each other, e.g. scales with more points are usually only partially labelled.
Thus, researchers should account for the effects driven by the overall design of the survey question when assessing how to optimally decide upon a characteristic. That is in line with what Cox III (1980, p. 418) already concluded for the optimal number of categories: "there is no single number of response alternatives for a scale which is appropriate under all circumstances". The results presented in this paper provide, on the one hand, a source for researchers who want a complete list of characteristics and their possible design choices for closed and ordinal scales, and on the other hand, a detailed summary of the literature on the impact of each characteristic on data quality. Finally, further research should provide the same kind of summary for other characteristics related to the design of survey questions, such as the design of the request for an answer or the overall visual presentation of the survey question.

Acknowledgements I would like to show my gratitude to Melanie Revilla, Wiebke Weber and Willem E. Saris for their fruitful comments and feedback on an earlier version of the manuscript, although any errors are my own and should not tarnish the reputations of these esteemed persons.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix

See Tables 2 and 3.

[Legible table fragment (design choices): more than 2 categories scale; two-category scale; numerical open-ended scale; magnitude estimation; line production; more steps procedures.]
[Tables 2 and 3 are too garbled in this extraction to reconstruct. Legible row labels include: number of categories (categorical); maximum possible value (continuous); correspondence between numerical and verbal labels; range of the used scale; range correspondence; symmetry of the response scale; neutral category; number of fixed reference points; "don't know" option; horizontal or vertical scale; overlap between verbal and numerical labels; numbers or letters before answer categories; scale with only numbers or numbers in boxes.]
lll:ltir)rrsssaaaeaecaeaedn197dndd5ndobndCw tttiiltllttf[rrrssssaeecaeaecnohhydopyvoonuCw itil]aydvSEY?il:ilitiitiilrfrrr)rsssaeeaaccaeby1991hghoknong tilttttr[rr-rffrseeeeeeeaacgnhhopnoonhoduPmm ttltttit]rrrsssceeeeaeo–onnoSEYmm?i:tlliittrrr)rrseeeaaceaeeee1042huddovpdnnvddbC iitlttltiliilrr[rrsssssacaaeaeceaeeadyvodnynhohbonP iilttitl]rrrsceeaaaecnnovdngnodyvoSEY?.:ltiltiitrffr)rrsssseaeeaeaczaceen9891pognnhydnfiw iititrf[rrrssssaaaaeeeeenodnoypoonduhhponRwmm ititiitltr]rrsssacebudonhgupoonohySEYm? .:lttiliitltr)rrsssseaeeaeaap9002onnoonnnhyou littiitiltffrrrrsaceeecaeeacevodngndyponfiwm tlttiiitrr[sssseeedohughyubonnopR i]rsacnopoSEYm?.:ltttitili)rsseeaeeeea900p2hgonnnopoopn iltllltitititrrffrrrrsssscaeaeeeazeeyoovuundnoohnpon tltitiitititirr[rsssssseeeypononouhhgbdubduonR i]rsaconpoSEYm? S B E ( A K L S ( ( ( ( T ( N rse ify m o s eh ed m n t re io :) g )83 and th tcu 10 irn 9 a d u 1 ls ts 20 sa rn ia in o g r r e n t u d o i n se p r i m d b e e s e d n y o thy reP ood lleb r h a ff a o n : g B p m ) n d g a e 7 e n e l d l 0 l a b eb an ,te tih 0 s w k to t r i n o an rse se 2 e c n i i u sn y n m d l w eq ro a an d d a l r m c u la sc A K S ( ( ( t a e v i t a r t s u l l i C ( t to rs fi re g n y o ill eh m i t n ta o n ev p n e s z th :) to p o re ir 4 e :) ho and 00 th 2 ta 9 0 e 20 th lly .la sn .l in tan te ito a n o u p te ito irz a o l p o eg ev e o h n i o a t p d d r i e a a u so o re re o p T T ( ( y a l p s i d t u o y a l S N t n e to s l a s b l r e e b v la n l e a e c i w r t e e b m u p n a l r d e n v a O s e c r i e o m te h to p c t o m n o T e ed r o isg t/fB t/h ldd tis edd rem reh eno e e ig i u a h t l a r t n l a e o a n iz tic li r r n o e o D L R M O L T O N H V N O T n d e t e titepohoKD tiissaaadgnn saendpon tti-ssanubnvno ,ssepon iitscaaengondfi ittiirsnoubd tti-ssaenvubno [rssseeeponRw SEY? 
References

Aiken, L.R.: Number of response categories and statistics on a teacher rating scale. Educ. Psychol. Meas. 43, 397–401 (1983). doi:10.1177/001316448304300209
Alwin, D.F.: Feeling thermometers versus 7-point scales: Which are better? Sociol. Methods Res. 25, 318–340 (1997). doi:10.1177/0049124197025003003
Alwin, D.F.: Margins of Error: A Study of Reliability in Survey Measurement. Wiley, Hoboken (2007)
Alwin, D.F., Krosnick, J.A.: The reliability of survey attitude measurement: the influence of question and respondent attributes. Sociol. Methods Res. 20, 139–181 (1991). doi:10.1177/0049124191020001005
Amoo, T., Friedman, H.H.: Do numeric values influence subjects' responses to rating scales? J. Int. Mark. Marking Res. 26, 41–46 (2001)
Andrews, F.M.: Construct validity and error components of survey measures: a structural modelling approach. Public Opin. Q. 48, 409–442 (1984). doi:10.1086/268840
Andrews, F.M., Crandall, R.: The validity of measures of self-reported well-being. Soc. Indic. Res. 3, 1–19 (1975)
Andrews, F.M., Withey, S.B.: Social Indicators of Well-Being: Americans' Perceptions of Life Quality. Plenum Press, New York (1976)
Al Baghal, T.: Numeric estimation and response options: an examination of the accuracy of numeric and vague quantifier responses. J. Methods Meas. Soc. Sci. 6, 58–75 (2014a). doi:10.2458/azu_jmmss.v5i2.18476
Al Baghal, T.: Is vague valid? The comparative predictive validity of vague quantifiers and numeric response options. Surv. Res. Methods 8, 169–179 (2014b). doi:10.18148/srm/2014.v8i3.5813
Bendig, A.W.: Reliability and the number of rating-scale categories. J. Appl. Psychol. 38, 38–40 (1954). doi:10.1037/h0055647
Billiet, J., McClendon, M.J.: Modeling acquiescence in measurement models for two balanced sets of items. Struct. Equ. Model. A Multidiscip. J. 7, 608–628 (2000). doi:10.1207/S15328007SEM0704_5
Bishop, G.F.: Experiments with the middle response alternative in survey questions. Public Opin. Q. 51, 220–232 (1987). doi:10.1086/269030
Brown, G.T.L.: Measuring attitude with positively packed self-report ratings: comparison of agreement and frequency scales. Psychol. Rep. 94, 1015–1024 (2004). doi:10.2466/pr0.94.3.1015-1024
Buskirk, T.D., Saunders, T., Michaud, J.: Are sliders too slick for surveys? An experiment comparing slider and radio button scales for smartphone, tablet and computer based surveys. Methods Data Anal. 9, 229–260 (2015). doi:10.12758/mda.2015.013
Christian, L.M., Dillman, D.A., Smyth, J.D.: Helping respondents get it right the first time: the influence of words, symbols, and graphics in web surveys. Public Opin. Q. 71, 113–125 (2007a). doi:10.1093/poq/nfl039
Christian, L.M., Dillman, D.A., Smyth, J.D.: The effects of mode and format on answers to scalar questions in telephone and web surveys. In: Lepkowski, J.M., Tucker, C., Brick, M., De Leeuw, E.D., Japec, L., Lavrakas, P.J., Link, M.W., Sangster, R.L. (eds.) Advances in Telephone Survey Methodology, pp. 250–275. Wiley, Hoboken (2007b)
Christian, L.M., Parsons, N.L., Dillman, D.A.: Designing scalar questions for web surveys. Sociol. Methods Res. 37, 393–425 (2009). doi:10.1177/0049124108330004
Cook, C., Heath, F., Thompson, R.L., Thompson, B.: Score reliability in web- or internet-based surveys: unnumbered graphic rating scales versus Likert-type scales. Educ. Psychol. Meas. 61, 697–706 (2001). doi:10.1177/00131640121971356
Couper, M.P., Tourangeau, R., Conrad, F.G., Crawford, S.D.: What they see is what we get: response options for web surveys. Soc. Sci. Comput. Rev. 22, 111–127 (2004). doi:10.1177/0894439303256555
Couper, M.P., Tourangeau, R., Conrad, F.G., Singer, E.: Evaluating the effectiveness of visual analog scales: a web experiment. Soc. Sci. Comput. Rev. 24, 227–245 (2006). doi:10.1177/0894439305281503
Couper, M.P., Traugott, M.W., Lamias, M.J.: Web survey design and administration. Public Opin. Q. 65, 230–253 (2001). doi:10.1086/322199
Cox III, E.P.: The optimal number of response alternatives for a scale. J. Mark. Res. 17, 407–422 (1980). doi:10.2307/3150495
De Leeuw, E.D., Hox, J.J., Dillman, D.A.: International Handbook of Survey Methodology. Routledge, New York (2008)
De Leeuw, E.D., Hox, J.J., Boeve, A.: Handling do-not-know answers: exploring new approaches in online and mixed-mode surveys. Soc. Sci. Comput. Rev. 34, 116–132 (2016). doi:10.1177/0894439315573744
Derham, P.A.J.: Using preferred, understood or effective scales? How scale presentations effect online survey data collection. Australas. J. Mark. Soc. Res. 19, 13–26 (2011)
Lundmark, S., Gilljam, M., Dahlberg, S.: Measuring generalized trust: an examination of question wording and the number of scale points. Public Opin. Q. 80, 26–43 (2016). doi:10.1093/poq/nfv042
Malhotra, N., Krosnick, J.A., Thomas, R.K.: Optimal design of branching questions to measure bipolar constructs. Public Opin. Q. 73, 304–324 (2009). doi:10.1093/poq/nfp023
Matell, M.S., Jacoby, J.: Is there an optimal number of alternatives for Likert scale items? Study I: reliability and validity. Educ. Psychol. Meas. 31, 657–674 (1971). doi:10.1177/001316447103100307
McClendon, M.J.: Acquiescence and recency response-order effects in interview surveys. Sociol. Methods Res. 20, 60–103 (1991). doi:10.1177/0049124191020001003
McClendon, M.J., Alwin, D.F.: No-opinion filters and attitude measurement reliability. Sociol. Methods Res. 21, 438–464 (1993). doi:10.1177/0049124193021004002
McKelvie, S.J.: Graphic rating scales: how many categories? Br. J. Psychol. 69, 185–202 (1978). doi:10.1111/j.2044-8295.1978.tb01647.x
Menold, N., Kaczmirek, L., Lenzner, T., Neusar, A.: How do respondents attend to verbal labels in rating scales? Field Methods 26, 21–39 (2014). doi:10.1177/1525822X13508270
Miethe, T.D.: The validity and reliability of value measurements. J. Psychol. 119, 441–453 (1985). doi:10.1080/00223980.1985.10542914
Moors, G., Kieruj, N.D., Vermunt, J.K.: The effect of labeling and numbering of response scales on the likelihood of response bias. Sociol. Methodol. 44, 369–399 (2014). doi:10.1177/0081175013516114
O'Muircheartaigh, C., Gaskell, G., Wright, D.B.: Weighing anchors: verbal and numeric labels for response scales. J. Off. Stat. 11, 295–307 (1995)
Pohl, N.F.: Scale considerations in using vague quantifiers. J. Exp. Educ. 49, 235–240 (1981). doi:10.1080/00220973.1981.11011790
Preston, C.C., Colman, A.M.: Optimal number of response categories in rating scales: reliability, validity, discriminating power, and respondent preferences. Acta Psychol. (Amst.) 104, 1–15 (2000). doi:10.1016/S0001-6918(99)00050-5
Rammstedt, B., Krebs, D.: Does response scale format affect the answering of personality scales? Eur. J. Psychol. Assess. 23, 32–38 (2007). doi:10.1027/1015-5759.23.1.32
Reips, U.-D.: Context effects in web-surveys. In: Batnic, B., Reips, U.-D., Bosnjak, M. (eds.) Online Social Sciences, pp. 69–79. Hogrefe & Huber, Cambridge (2002)
Reips, U.-D., Funke, F.: Interval-level measurement with visual analogue scales in Internet-based research: VAS Generator. Behav. Res. Methods 40, 699–704 (2008). doi:10.3758/BRM.40.3.699
Revilla, M.: Effect of using different labels for the scales in a web survey. Int. J. Mark. Res. 57, 225–238 (2015). doi:10.2501/IJMR-2014-028
Revilla, M., Ochoa, C.: Quality of different scales in an online survey in Mexico and Colombia. J. Polit. Lat. Am. 7, 157–177 (2015)
Revilla, M., Saris, W.E., Krosnick, J.A.: Choosing the number of categories in agree-disagree scales. Sociol. Methods Res. 43, 73–97 (2014). doi:10.1177/0049124113509605
Rodgers, W.L., Andrews, F.M., Herzog, A.R.: Quality of survey measures: a structural modeling approach. J. Off. Stat. 8, 251–275 (1992)
Rossiter, J.R.: Measurement for the Social Sciences: The C-OAR-SE Method and Why It Must Replace Psychometrics. Springer, New York (2011)
Roster, C.A., Lucianetti, L., Albaum, G.: Exploring slider vs. categorical response formats in web-based surveys. J. Res. Pract. 11 (2015). http://jrp.icaap.org/index.php/jrp/article/view/509/413
Saris, W.E.: Variation in Response Functions: A Source of Measurement Error in Attitude Research. Sociometric Research Foundation, Amsterdam (1988)
Saris, W.E., Gallhofer, I.N.: Design, Evaluation, and Analysis of Questionnaires for Survey Research. Wiley, Hoboken (2007)
Saris, W.E., Gallhofer, I.N.: Design, Evaluation, and Analysis of Questionnaires for Survey Research. Wiley, Hoboken (2014)
Saris, W.E., Revilla, M.: Correction for measurement errors in survey research: necessary and possible. Soc. Indic. Res. 127, 1005–1020 (2016). doi:10.1007/s11205-015-1002-x
Saris, W.E., Revilla, M., Krosnick, J.A., Shaeffer, E.M.: Comparing questions with agree/disagree response options to questions with item-specific response options. Surv. Res. Methods 4, 61–79 (2010). doi:10.18148/srm/2010.v4i1.2682
Saris, W.E., De Rooij, K.: What kind of terms should be used for reference points. In: Saris, W.E. (ed.) Variations in Response Functions: A Source of Measurement Error in Attitude Research, pp. 188–219. Sociometric Research Foundation, Amsterdam (1988)
Schaeffer, N.C.: Hardly ever or constantly? Group comparisons using vague quantifiers. Public Opin. Q. 55, 395–423 (1991). doi:10.1086/269270
Schaeffer, N.C., Bradburn, N.M.: Respondent behavior in magnitude estimation. J. Am. Stat. Assoc. 84, 402–413 (1989). doi:10.2307/2289923
Schaeffer, N.C., Presser, S.: The science of asking questions. Annu. Rev. Sociol. 29, 65–88 (2003). doi:10.1146/annurev.soc.29.110702.110112
Scherpenzeel, A.C., Saris, W.E.: The validity and reliability of survey questions: a meta-analysis of MTMM studies. Sociol. Methods Res. 25, 341–383 (1997)
Schuman, H., Presser, S.: Questions and Answers in Attitude Surveys: Experiments on Question Form, Wording and Context. Sage Publications, Thousand Oaks (1981)
Schwarz, N., Grayson, C.E., Knäuper, B.: Formal features of rating scales and their interpretation of question meaning. Int. J. Public Opin. Res. 10, 177–183 (1998). doi:10.1093/ijpor/10.2.177
Schwarz, N., Hippler, H.-J.: The numeric values of rating scales: a comparison of their impact in mail surveys and telephone interviews. Int. J. Public Opin. Res. 7, 72–74 (1995). doi:10.1093/ijpor/7.1.72
Schwarz, N., Hippler, H.-J., Deutsch, B., Strack, F.: Response scales: effects of category range on reported behavior and comparative judgments. Public Opin. Q. 49, 388–395 (1985). doi:10.1086/268936
Schwarz, N., Knäuper, B., Hippler, H.-J., Noelle-Neumann, E., Clark, L.: Rating scales: numeric values may change the meaning of scale labels. Public Opin. Q. 55, 570–582 (1991). doi:10.1086/269282
Sturgis, P., Roberts, C., Smith, P.: Middle alternatives revisited: how the neither/nor response acts as a way of saying "I don't know"? Sociol. Methods Res. 43, 15–38 (2014). doi:10.1177/0049124112452527
Sudman, S., Bradburn, N.M.: Asking Questions: A Practical Guide to Questionnaire Design. Jossey-Bass, San Francisco (1983)
Toepoel, V., Das, M., van Soest, A.: Design of web questionnaires: the effect of layout in rating scales. J. Off. Stat. 25, 509–528 (2009)
Tourangeau, R., Couper, M.P., Conrad, F.: Spacing, position, and order: interpretive heuristics for visual features of survey questions. Public Opin. Q. 68, 368–393 (2004). doi:10.1093/poq/nfh035
Tourangeau, R., Couper, M.P., Conrad, F.: Color, labels, and interpretive heuristics for response scales. Public Opin. Q. 71, 91–112 (2007). doi:10.1093/poq/nfl046
Tourangeau, R., Rips, L.J., Rasinski, K.: The Psychology of Survey Response. Cambridge University Press, Cambridge (2000)
van Doorn, L.J., Saris, W.E., Lodge, M.: The measurement of issue-variables: positions of respondents, candidates and parties. In: Middendorp, C.P., Niemöller, B., Saris, W.E. (eds.) Het Tweed Sociometric Congress, pp. 229–250. Dutch Sociometric Society, Amsterdam (1982)
Weijters, B., Cabooter, E., Schillewaert, N.: The effect of rating scale format on response styles: the number of response categories and response category labels. Int. J. Res. Mark. 27, 236–247 (2010). doi:10.1016/j.ijresmar.2010.02.004


Anna DeCastellarnau. A classification of response scale characteristics that affect data quality: a literature review, Quality & Quantity, 2017, 1-37, DOI: 10.1007/s11135-017-0533-4