Prospective educators as consumers of empirical research: an authentic assessment approach to make their competencies visible

Empirical Research in Vocational Education and Training, Apr 2017

Background Educators today have to be able to make current empirical research results usable for everyday practice. Consequently, there are increasing endeavors to develop and assess competencies in consuming empirical research (CCER) on an academic level. However, problems with regard to recruiting and motivating test participants—rooted in the prevalence of low-stakes testing conditions—could limit confidence in the validity of the findings. The current study presents a structure and proficiency level modeling for CCER under high-stakes conditions. Method The sample comprises N = 155 bachelor students of Human Resource Education and Management. The assessment design of the 26 items complied with demanding standards for designing tests (such as Evidence-Centered Design and authenticity). Results The results are as follows: (1) We were able to confirm our expected structural model which consists of two dimensions (‘conceptual competencies’ and ‘statistical competencies’) instead of one overarching dimension. (2) The test items are of a high quality. (3) Three levels of CCER could be defined according to two task characteristics (cognitive processes and complexity) which explain nearly 100% of the prospective educators’ CCER abilities. Conclusion The results of the study show that we succeeded in designing a reliable and valid test instrument for assessing (prospective) VET-educators’ competencies in consuming empirical research.


Keywords: High-stakes testing; Item response theory; Research competencies

Background

Educators in vocational education and training (VET) should act as consumers of empirical research

Undertaking empirical research stimulates profitable innovation (Egeln et al. 2002). Natural scientists, for instance, have long taken it for granted that they should base their practical actions on current scientific research. In the healthcare sector, for example, hardly anybody wants to be treated by a doctor who refers to outdated research findings (Jahed et al. 2012). In many professions it is now common that practitioners are obliged to know about the latest relevant research results. Educators too have to be familiar with the principles of empirical research in so far as they are able to reflect on and critically question the findings of scientific research. Correspondingly, it has been suggested that evidence-based practice should prompt educational professionals to be aware of recent advances in their area of work (Darling-Hammond and Bransford 2005, pp. 15-16; Weber and Achtenhagen 2009). This enables them to monitor whether their educational activities are successful. But the sector of science also benefits if the latest research results are applied in economic practice. On the other hand, science can take up the research interests postulated by practice (Wuttke 2001, p. 40; Zurstrassen 2009, p. 41).
Slavin provides a pithy summary of the situation: "Educators need to be sophisticated consumers of research, regardless of whether they are also producers of research" (2007, p. 2). Referring to Slavin (2007) and further authors (e.g. Stark and Mandl 2001; Schweizer et al. 2011), research-methodological tasks comprise two key challenges: (a) reviewing empirical academic literature, and (b) independently performing empirical research projects. Within this study, we consider educators as consumers, not as producers, of research, who must be able to avail themselves of scientific results, in the form of scientific studies, in their everyday practice. Therefore, we suggest, they need to have competencies in consuming empirical research (CCER). Slavin's (2007) as well as Darling-Hammond's and Bransford's (2005) requirements originally referred only to teachers. We claim that the active use of research findings is relevant for all people responsible for education, because teaching and learning take place in various settings, not only in schools. We chose VET-educators [Human Resource Education and Management students of the Ludwig-Maximilians-University in Munich (LMU)] as the target group for our study, because they cover polyvalent professional areas. They are typically employed in various workplaces, for example (1) as teachers or trainers in schools and companies or in organizations for further education, (2) as organizers of vocational training, human resource management, and professional development within enterprises, (3) as administrative educators or politicians within chambers, associations, or ministries, or (4) as consultants or coaches in different educational environments. A teacher employed in a vocational school, for instance, needs CCER for aligning his/her instructional methods to the latest research results on efficient teaching. A further example is employees operating within a company's apprenticeship department. They need CCER in order to design workplace learning processes according to current scientific findings within this field.

Research-based training of VET-educators

The described evidence-based orientation has manifested itself within the many international and national professionalization standards for educators. Scientific studies on the effectiveness of teacher education and the corresponding professionalization efforts exist in the field of general (e.g. Baumert and Kunter 2006; Blömeke et al. 2008; Blömeke et al. 2011) and vocational (Bouley et al. 2015) education. Within the current scientific literature, the professionalization standards are often the starting point for measuring competencies. They reflect, among other competencies, the importance of educators' research-methodological competencies: (1) all ten international 'Core Teaching Standards' modeled by the InTASC (Interstate Teacher Assessment and Support Consortium 2011) imply skills linked to the field of practices in consuming research; (2) concerning the twelve Swiss standards formulated by Oser (1997), nearly all of these standards include practices in consuming research implicitly; (3) within the national German KMK standards ('Kultusministerkonferenz'; Ministers of Education and the Arts Conference) (KMK 2004), competence number ten is assigned to the area of innovating and postulates that teachers should understand their profession as a lifelong learning task.
In the light of the continuously decreasing half-life of knowledge, this is an essential claim that is only feasible if teachers master practices in consuming research. Accordingly, competence number ten stresses the aims and methods of educational research as well as the interpretation and application of its results as one central curricular focus. The corresponding standards emphasize, inter alia, that graduates of teacher study programs must be able to receive and evaluate results from educational research and to use these results to optimize their educational activities (KMK 2004, pp. 5, 12). From the present authors' perspective, these requirements are transferable and necessary for all people responsible for education. Furthermore, it has to be guaranteed that the evidence-based orientation is also embedded within the corresponding instructional processes (e.g. Slavin 2008, pp. 5-14; Fichten 2010, p. 159).

Assessing prospective VET-educators' competencies in consuming research

The competencies defined within the curriculum and implemented within the instruction program have to be translated into an operationalized form in order to make them measurable. Older studies on research-methodological competencies by Stark and Mandl (2001), Schweizer et al. (2011), as well as Wagner and Maree (2007) focused more on the development, implementation, and evaluation of training programs for promoting these competencies. Comparable with our intention of modeling and measuring competencies in consuming empirical research, only the recently published AHELO project by the OECD (Tremblay et al. 2012) and the LeScEd (Learning the Science of Education) project (Groß Ophoff et al. 2014, 2015) exist. They aim at assessing competencies that are relevant in the field of working scientifically in higher education. Both initiatives address the application of research concepts and the adequate use of statistical tools. The LeScEd project deals with modeling and measuring the educational research literacy of students within the field of educational sciences and therefore adopts a similar approach to our work. But, in contrast to our study, within both existing initiatives (LeScEd and AHELO) testing is performed under low-stakes conditions. As Wise and DeMars (2005, 2006) show, in some cases low-stakes testing conditions can lead to fundamentally biased test results. These effects could be intensified in the field of higher education because of the lack of compulsory attendance (Wise and DeMars 2006; Wolf et al. 2015). Students who organize their studies independently (which is explicitly desired) could tend to neglect low-stakes tests. Due to competing obligations during their studies, they will rather prioritize high-stakes tests that bear serious consequences for their academic progress. Consequently, low-stakes testing in higher education could lead to a higher probability of self-selection effects, as well as to a lower motivation for participating in the respective test. This can threaten the representativeness of the sample and raise the number of unanswered tasks (missing values). In addition, for some types of learners the application of statistical tools represents a serious obstacle, due to their anxiety over such formal methods. Such individuals are often unable to cope with corresponding tasks (Onwuegbuzie 2001). This phenomenon can reinforce the number of missing values and is likely to be intensified if low-stakes testing conditions prevail.
This can entail large losses of data points and bias the calculated estimators as well as the identified competence structure. Further, if we consider the missing values as representing participants who lack statistical competencies, this could result in a biased underestimation of these competencies. Therefore, we focus on modeling and measuring prospective VET-educators' competencies in consuming empirical research under high-stakes testing conditions. We aimed at developing an appropriate performance measurement instrument, including authentic test tasks, which meets the ambitious standards for designing tests. The following approaches guided our assessment development: the Collegiate Learning Assessment (Shavelson 2008), Evidence-Centered Assessment Design (e.g. Mislevy and Haertel 2006), and authentic assessment (Janesick 2006). In line with the described overarching goal, the primary objectives of our study are: (1) to develop a structural model for CCER and to test this model empirically by using Item Response Theory (IRT) (Hartig and Frey 2013); (2) to investigate whether the 26 test items meet the central Rasch-modeling assumption of equal item discriminability and whether they allow for reliable and valid measurement; (3) to define a proficiency level model for CCER in order to make a statement about the prospective VET-educators' degree of competence.

Theoretical background and research questions

The underlying concept of competence

In accordance with the discussion of modeling and measuring professional competencies (Blömeke et al. 2015; Shavelson 2010), we use a holistic (complex) concept of competence which integrates analytical as well as behavior-related aspects. Our understanding therefore corresponds with the conception proposed by Blömeke et al. (2015), who focus on "the latent cognitive and affective-motivational underpinning of domain-specific performance in varying situations" (p. 3), as well as with Weinert (2001), the Curriculum-Instruction-Assessment Triad (Pellegrino et al. 2001), and the corresponding Evidence-Centered Design (Mislevy and Haertel 2006; Bley 2017). Holistic competence models comprise a horizontal and a vertical layer. The horizontal competence structure (width of competence), shaping the structural model (Hartig and Klieme 2006, p. 132), represents theoretically assumed sub-dimensions of the particular construct, which are specified by internal cognitive and non-cognitive dispositions (National Research Council 2012, p. Sum-3). These internal dispositions required for performing situation-specific actions within a domain are not directly observable. They are only measurable through external observable behavior (= performance), which is evoked by test items reflecting workplace situations that depict the competence sub-dimensions. If a student is, for example, able to interpret the values presented in an SPSS output of a correlation analysis correctly (= performance), the skill to judge outputs from relevant statistical software is attributed to this person. The test person's external response behavior results from combining internal cognitive and affective-motivational dispositions. Although we prefer a holistic concept of competence, we start from and therefore focus on the cognitive dispositions of CCER (skills) within this study, because there are limited robust results in this field.
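For reference, the following standard equations (not reproduced in the original article) formalize how the IRT scaling referred to in objective (1) links the latent dispositions just described to observable item responses: the dichotomous Rasch model and its polytomous extension, the partial credit model (Masters 1982), which is used for the CCER scaling below.

```latex
% Dichotomous Rasch model: probability that person v solves item i
P(X_{vi}=1 \mid \theta_v, \sigma_i)
  = \frac{\exp(\theta_v - \sigma_i)}{1 + \exp(\theta_v - \sigma_i)}

% Partial credit model (Masters 1982): probability of response category x on item i
% (\delta_{ij} are the step parameters, m_i is the maximum score of item i,
% and the empty sum for x = 0 is defined as zero)
P(X_{vi}=x \mid \theta_v)
  = \frac{\exp\!\left(\sum_{j=1}^{x}(\theta_v - \delta_{ij})\right)}
         {\sum_{k=0}^{m_i}\exp\!\left(\sum_{j=1}^{k}(\theta_v - \delta_{ij})\right)},
  \qquad x = 0, 1, \ldots, m_i
```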
Furthermore, non-cognitive affective-motivational dispositions, such as achievement motivation, are not explicated in our model, because they are not separately measurable (Shavelson 2012). However, they are implicitly covered by the actions that are required to solve our authentic test tasks (cf. chapter 3.2). To facilitate a differentiated statement regarding varying proficiency levels of students, the test items have to differ with respect to their level of difficulty. For the vertical competence structure (depth of competence), shaping the proficiency level model (Hartig and Klieme 2006, p. 133), various competence profiles are assumed. Through focusing on the particular degree of achievement, it provides information on the difficulty level of situational challenges and reflects the different proficiency levels of the particular construct. In line with internationally proven assessment standards, we assumed that competencies are malleable and that the formation of competencies proceeds along a linear continuum (Blömeke et al. 2015, p. 7; Hartig 2007; Wilson 2005). For relating both the horizontal and the vertical modeling perspective to CCER, see sections "The domain of CCER: development of a structural model" and "Scaling CCER: development of a proficiency level model".

The domain of CCER: development of a structural model

Standards for the design of assessments suggest that relevant and representative observable evidence for typical research-methodological reviewing activities has to be identified in order to develop tasks that are valid as regards the contents on which they focus. This was performed through a domain analysis (Mislevy and Haertel 2006). In order to understand (a) which substantial content areas CCER refers to in detail and which challenges (prospective) VET-educators are expected to master in the field of research methods, as well as (b) which competence dimensions are relevant to managing the respective challenges (Wiethe-Körprich and Trost 2013), a systematic literature review was conducted. Furthermore, within focus groups, experts who lecture on a course on research methods for Human Resource Education and Management students at the LMU were consulted, and students who had attended this course before our test participants were asked to state typical research-methodological challenges/tasks, the abilities required to master these challenges, and tasks which were particularly difficult to solve. The following "big ideas" (Pellegrino 2010, pp. 17-18) were derived from the domain analysis:

(a) Content areas and typical challenges

One crucial result is that research-driven learning is commonly structured alongside the typical scientific research process. This is pointed out in detail by several authors (e.g., Rost 2007; Bühner 2011). Hence, the domain-typical contents that students should possess in the field of research methods can be classified into four central categories: (1) problem definition; (2) methodology used to investigate the research question(s) of interest; (3) analysis, depiction, and interpretation of the results; and (4) discussion and conclusions derived from the research findings. Furthermore, experts stated that the main challenges can be divided into 'working with research-methodological conceptual procedures' (such as capturing a study's statement from the abstract) and 'working with statistical issues' (such as interpreting statistical representations; for more examples see Table 1).
Based on an analysis of the module descriptions for courses on research methods of all German university study programs in the field of Human Resource Education and Management (N = 42), we identified that the curricular emphasis is on scientific literature that deals with research questions answered by applying quantitative (as opposed to qualitative) research methods. As a consequence, in this study we focus on quantitative research methods for defining CCER.

(b) Definition and dimensions of CCER

Inspired by Schweizer et al. (2011), we derived the following definition of competencies in consuming empirical research (CCER): CCER include competencies which enable an individual to reflect, interpret, and critically evaluate empirical quantitative studies, based on educational-psychological as well as sociological research questions, with regard to the quality of their theoretical foundation, their research questions and design, their methodical procedures, and their results, including their practical relevance. Influential conceptual frameworks which serve for analyzing educational research competencies are generally based on the following two concepts: (1) the SDDS model (Scientific Discovery as Dual Search model; Klahr and Dunbar 1988) defines that the process of gaining scientific knowledge requires three main components: searching for hypotheses, developing research designs, and evaluating empirical evidence (including the interpretation of statistical data analysis); (2) the EBR model (Evidence-Based Reasoning model), which is used, for example, by the LeScEd group, differentiates the three steps of analyzing, interpreting, and applying (Groß Ophoff et al. 2014; Brown et al. 2010). The first step (analyzing) is of particular interest with regard to instructional research on statistical literacy within the field of mathematics (Groth 2007). On the basis of first results, it is questionable whether the three components of these approaches really address different latent sub-competencies of educational research competencies and, therefore, whether they are actually empirically distinguishable. The LeScEd group shows that a one-dimensional model fits the data better than a three-dimensional one (Groß Ophoff et al. 2014). Summarizing, it is noticeable that all approaches define a specific 'statistical' component while the other aspects are summarized in different variations. This separation of a statistics dimension also becomes evident in practical approaches in the instructional field of educational research methodology. They often distinguish between two main dimensions: research methods and statistics (e.g., Renkl 1994; Onwuegbuzie 2001; Dunn et al. 2007). As discussed by these authors, the use of statistical procedures to answer research-methodological questions frequently constitutes a difficulty for prospective educators. The researchers point out a negative attitude towards statistical contents, manifesting itself in statistics anxieties, emotional hurdles, and mental stress, that prevents students from solving statistical tasks. This goes along with the assumption that the differentiation between statistical and further elements of research methods is related to the underlying interests and talents of the target group (prospective educators), which are shaped more socially than analytically (Holland 1959).
Based on these considerations, we expect two central content-related dimensions of competencies in consuming empirical academic studies: Research-methodological conceptual competencies (DIM1) is the ability to reflect, interpret, and critically evaluate empirical academic literature with regard to the research-methodological categories applied to a study's structure, theoretical foundation, and research questions, the selected research design, as well as the description and interpretation of the results, including their practical relevance. Research-methodological statistical competencies (DIM2) is the ability to reflect, interpret, and critically evaluate the choice and the application of central statistical procedures which are used to answer research questions or to test hypotheses deduced from scientific problems of empirical research (Stark and Mandl 2001, pp. 5-6).

We suggest that prospective VET-educators should be able to review a holistic study. To master the relevant challenges that occur within the different research-methodological content areas, various situation-specific skills are required. In line with the Evidence-Centered Assessment Design approach (Mislevy and Haertel 2006), Table 1 further operationalizes the two dimensions of CCER and presents a selection of the evidence a student has to adduce in order to demonstrate that he or she has accomplished the respective research-methodological skill (Mislevy and Haertel 2006).

Table 1 Situation-specific skills and corresponding evidences of dimension 1 and dimension 2

DIM1: research-methodological conceptual competencies
1.1 The student can understand and differentiate basic research-methodological terms, concepts, and models. Evidence: the common quality criteria for tests (objectivity, reliability, and validity) are explained correctly (including different kinds of these criteria).
1.2 The student can scrutinize studies' structure, rigor, and relevance. Evidence: the research questions/hypotheses of a study are identified appropriately.
1.3 The student can assess the appropriateness of research designs critically. Evidence: the justification provided for the data collection method selected by the author(s) of a study is convincing in relation to the research question.
1.4 The student can make and work with interpretations, causal explanations, and predictions. Evidence: a scientific paper's results and conclusions are analyzed critically regarding their practical and scientific significance.

DIM2: research-methodological statistical competencies
2.1 The student can justify the selection of statistical routines. Evidence: the author's decision to apply a correlation and a regression analysis, respectively, in relation to the scientific question is assessed correctly.
2.2 The student can express the relevance of central quality criteria for procedures of statistical testing. Evidence: quality criteria for factor analyses (eigenvalue, explained proportion of total variance, and specificity) are identified correctly.
2.3 The student can judge outputs from relevant statistical software. Evidence: the important impact factors based on a presented output of a regression analysis are identified and interpreted correctly (based on significant β values).

Both the LeScEd project and our study on CCER include dimensions of conceptualization and statistics in order to operationalize latent sub-dimensions of the competence model. While LeScEd differentiates three dimensions, 'information literacy', 'statistical literacy', and 'critical thinking' (Groß Ophoff et al. 2014, p. 254), statistical and conceptual competencies are differentiated in our interpretation of the domain analysis. Conceptual competencies therefore cover aspects of 'information literacy' and 'critical thinking'.

Scaling CCER: development of a proficiency level model

For scaling a competence scale in proficiency levels, a continuous competence dimension, as is used for CCER, is divided into discrete, ordinal categories (Fleischer et al. 2013, p. 8). Only if the items are distinguishable by a varying degree of difficulty can a differentiation between diverse proficiency levels be effected (Embretson 2002). The selection of the task features used to scale CCER was made in the light of the characteristics which had turned out to be significant determinants of item difficulty in previous studies on measuring situation-specific skills using stage-oriented models, drawing notably on Blum et al. (2003), Kauertz and Fischer (2008), and Winther and Achtenhagen (2009). According to these authors, the following three criteria are assumed to have a relevant impact on the difficulty of solving research-methodological tasks: (I) kind of cognitive process according to the Cognitive System of the 'New Taxonomy of Educational Objectives' by Marzano and Kendall (2007, 2008); (II) complexity concerning the number of content-related elements; (III) degree of familiarity. A detailed description of these criteria, including their operationalization and examples, is presented in the Appendix (Table 6).

Research questions

RQ 1 (Structural model): Are the two theoretically modeled dimensions of CCER empirically distinguishable?
RQ 2 (Quality of the test instrument): Do the empirical quality measures concerning the performance test instrument indicate
RQ 2a) that the central Rasch-modeling assumption of equal discriminability regarding all test items is met?
RQ 2b) that the test items allow a reliable and valid measurement of CCER?
RQ 3 (Level model): Which levels of CCER can be defined by task characteristics that significantly determine the item difficulty?

Methods

Target group, course structure, and sample

As the target group, undergraduates of the Human Resource Education and Management (HRE&M; in German: 'Wirtschaftspädagogik') study program at the LMU were chosen. Their polyvalent educational profile prepares them for the various workplace settings where VET-educators are typically employed. The students are offered a small-group course on empirical research methods which integrates essential research-methodological content. Through focusing on empirical research methods, the course follows the trend towards an empirical research orientation which prevails in the field of educational sciences (Gesellschaft für Empirische Bildungsforschung 2012). It aims at two superordinate learning objectives that correspond to the two key challenges of research methodology: (a) reviewing empirical academic literature and (b) independently performing an empirical research project. In order to develop these competencies, an innovative instructional design consisting of different course elements is provided. The course is offered every semester. With reference to learning objective (b), an independent research project has to be performed and the results have to be presented in a short research paper. The test designed for our study addressed whether learning objective (a) is being achieved. The total grade for the course is composed of both performance measures.
Participation is compulsory for undergraduates who intend to write their bachelor theses in the field of human resource education. Alternatively, they attend a different course and address their theses to the field of business administration. The test on CCER (for answering RQ1 and RQ2) took place at the end of the respective semester of the HRE&M study program at the LMU. Within our cross-sectional research study, test data are available for a total of 155 students. They were derived from full surveys of four consecutive semesters (n1 = 54, n2 = 30, n3 = 23, n4 = 48; see Footnote 1), starting in the winter term 2011/12. The students are on average in their sixth semester (SD = 1.34) and two-thirds of them are female (see Footnote 2).

Footnote 1: The tasks were at no time accessible to students, so that participants of earlier semesters did not have systematic disadvantages (ANOVA regarding the total scores of the four groups: F value = 1.087, p = .357 > .05).
Footnote 2: Test data were analyzed anonymously. Biographical data were derived from course registration information.

Intended information for the high-stakes assessment and test design

Our test of CCER was designed as a real 60-min exam under high-stakes conditions. Therefore, the test result has considerable consequences for the respective student: it decides whether the test person has passed or failed the course, and it enters the student's final grade for the study program. Compared with a voluntary survey without important consequences, high-stakes testing situations lead to higher motivation and a significantly lower probability of guessing and of skipping test tasks. Thus, the number of missing responses is minimized. Furthermore, there is no sample self-selection effect. Students commonly dedicate little effort to low-stakes assessments as an act of prioritization and to save their energy for meaningful academic tasks. These points reduce the score validity of low-stakes testing approaches in the most basic sense (Wise and DeMars 2005, 2006; Wolf et al. 2015). Missing responses for omitted items are usually not random. This may lead to biased estimates of item and person parameters (Mislevy and Wu 1996). However, at least for low-stakes assessments, several authors propose ignoring missing responses instead of scoring them as incorrect (de Ayala et al. 2001). But if this results in an unequal distribution of omitted items across the different competence dimensions (e.g. relatively more statistical questions are skipped), the consequence may be a misjudgment of the competence structure. In a high-stakes power test, such as the one we intended to design, it can be expected that omission occurs when participants do not know the answer, and therefore missing at random is less plausible (Mislevy and Wu 1996). Instead, there is a significant correlation between ability and the number of missing responses (Pohl et al. 2014).
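The claim that omissions under high-stakes conditions carry information about ability can be checked directly in the response data. The following is a minimal R sketch of such a check (the file and variable names are hypothetical, not taken from the article); the corresponding result for the actual data is reported in the section on the handling of missing responses below.

```r
# Hypothetical response matrix: rows = students, columns = the 26 items,
# entries = credit points earned, NA = omitted response
resp <- as.matrix(read.csv("ccer_responses.csv"))

n_missing <- rowSums(is.na(resp))        # number of omitted items per student
raw_score <- rowSums(resp, na.rm = TRUE) # raw test score per student

# If omissions reflect a lack of ability, the rank correlation should be clearly negative
cor.test(raw_score, n_missing, method = "kendall")

# Scoring decision taken for the high-stakes test: treat omissions as incorrect (0)
resp_scored <- resp
resp_scored[is.na(resp_scored)] <- 0
```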
Despite all its benefits, high-stakes testing imposes, with regard to assessment development, several restrictions concerning the number of test tasks and the opportunities for administering them. That means: (a) with regard to local conditions, we had to use a paper-and-pencil test instead of a more realistic technology-based approach; (b) the number of test tasks was limited by the test time, because each participant had to receive exactly the same items; and (c) a substantial number of easy tasks had to be implemented, because a student had to reach 50% of the maximum score to pass the test. With regard to the high-stakes testing conditions and the intended assessment information, as well as on the basis of our competence model and the identified evidences, we developed 26 paper-and-pencil test items. They were designed along the lines of the typical research process set out by Rost (2007, p. 26). Each content area of the research process was covered by different situations in the form of items which depict the two theoretically expected dimensions of CCER (DIM1: 11 items; DIM2: 15 items). In order to cover the whole spectrum of proficiencies, and taking into account that undergraduates need 50% of the maximum score to pass the test, we constructed tasks of all degrees of difficulty.

Development requirements originating from (i) content-related instructional science (in German: 'Fachdidaktik'), (ii) cognitive psychology, and (iii) psychometrics were considered when constructing the items. From the perspective of (i) content-related instructional design, the standards for designing authentic assessments had to be fulfilled, such as realistic illustrations, orientation towards real professional circumstances/environments, permission of judgements and reflections, a focus on actions and the comprehension of these actions, replication or simulation of tasks which originate from the occupational routine, and inspirations for further learning (Janesick 2006, p. 4; Mislevy and Haertel 2006; Weber et al. 2014). For this reason, all test items are based on one empirical study (published in the Journal of Pedagogical Psychology) entitled 'Personal responsibility for academic achievement: Dimensions and correlatives' by Koch (2006). This research study deals with the effect of the latent construct 'personal responsibility' on academic achievement. The implementation of different authentic situations requiring CCER within a superordinate context, the real study by Koch (2006), offers various advantages: Koch's article was selected due to the probable attractiveness that the topic would hold for the students as well as the students' involvement triggered by the topic. It can be expected that the study's context constitutes equally relevant issues for all test persons and that all students have comparable interest, previous knowledge, and experience concerning the addressed content area. Consequently, situation-specific affective-motivational effects are largely negligible. Furthermore, this paper was chosen since its statistical sophistication is consistent with the consumer abilities the participants can be expected to have acquired during the course. The study had not been used by the instructors during the course on empirical research methods or during other courses. Therefore, the tasks outline new situations. All tasks are presented using real data, text excerpts, and figures. They follow the judgment of a complete research process. Correspondingly, the test person, considered qua consumer of research, had to grasp central contents and information about the paper [cf. sample item 1; see Appendix (Table 5) for the whole item pool]; evaluate the sources and methods used for the survey as well as the methods for analyzing data (e.g. correlation and factor analyses) performed in the study; and analyze, interpret, and assess statistical diagrams [cf. sample item 20; see Appendix (Table 5)].
In addition, transfer cases based on so-called 'what-if tasks' had to be solved. These tasks go beyond the research-methodological situations covered by Koch's (2006) study. In one of the 'what-if tasks', the test persons had to create an experimental design for evaluating the effectiveness of a training program for strengthening personal responsibility for academic achievement, which is not part of Koch's (2006) study. Apart from a few matching tasks, the test items are designed primarily using open-ended formats in terms of performance tasks and analytic writing tasks (cf. the Collegiate Learning Assessment approach by Shavelson 2008). Two examples of test items are illustrated below: Item 1 is assigned to the content area of 'problem definition' and refers to skill 1.2 of the conceptual dimension of CCER (cf. Table 1). The abstract of Koch's (2006) article, illustrated in Fig. 1, is presented. Based on this, the student is prompted to identify the paper's two central research questions. Item 20, allocated to the content area of 'results' and depicting skill 2.3 of the statistical dimension of CCER (cf. Table 1), presents the output of a correlation analysis performed in Koch's (2006) study. To decrease extraneous cognitive load, some side notes and highlighting elements are integrated in the output (cf. Fig. 2). The item quotes the following statement which a researcher had framed based on the output: 'Final university examination grade and commitment to the studies are two independent criteria for success'. The test persons are requested to mark the value that gave rise to this statement within the presented output (e.g. by circling it) and to explain why the argument is derived correctly.

As all test items refer to the described study, the test persons do not have to become acquainted with a new context in every task [cognitive-psychological perspective, (ii)]. Additionally, appropriate linguistic complexity of the tasks' instructions, reasonable signaling, and the avoidance of redundancies were considered when constructing the test tasks [in accordance with Bley et al. (2015)]. Thereby, the extraneous cognitive load as well as the time for introducing tasks can be reduced (van Merriënboer and Kirschner 2013, p. 22). Despite the advantages of embedding the test items in one real study (e.g. authenticity), we are aware that this approach subsumes the test instrument under a single anchor. As a consequence, the assumption of local stochastic independence could be violated [psychometric perspective (iii); Koller et al. 2012]. Therefore, we made a great effort to provide all the information (e.g. text excerpts or statistical outputs from the study) relevant to solving each new situation. The result of the non-parametric T11 test (p value = .824; Ponocny 2001) shows that this procedure was successful.

Fig. 1 Abstract of the study by Koch (2006)
Fig. 2 Correlation matrix presented in the study by Koch (2006)

By discussing all items with seven experts, who are instructors in empirical research methods for students of HRE&M at the LMU, content as well as substantive validity were ensured. The experts were asked to evaluate the items with respect to the relevance of content-related aspects, the appropriateness of the tasks' scoring, as well as the students' cognitive solution processes that are intended to be activated by the test tasks [cf. Appendix (Table 6)]. Slight revisions of our tasks were made in accordance with the experts' assessment.
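The local-independence check reported above (the non-parametric T11 statistic; Ponocny 2001) is implemented in the R package eRm. A minimal sketch, under the assumption that the responses are supplied as a dichotomized 0/1 matrix (the quasi-exact tests in NPtest operate on binary data); the file name is hypothetical.

```r
library(eRm)

# Hypothetical dichotomized 155 x 26 response matrix (1 = item solved, 0 = not solved)
resp01 <- as.matrix(read.csv("ccer_dichotomized.csv"))

# Quasi-exact T11 test (Ponocny 2001): global check for local dependence based on
# deviations of observed from expected inter-item correlations; n = number of
# sampled matrices used to approximate the reference distribution
set.seed(1)
NPtest(resp01, n = 500, method = "T11")
```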
Handling of missing responses and coding

High objectivity in implementing the test can be assumed because of legally defined examination rules. No data set had to be eliminated. As expected, the number of missing responses is quite low (1.41%), and all of them can be classified as 'omitted responses', because all participants received exactly the same test and the missing responses were spread over the whole test, not only over the last items. Because the number of missing responses correlates significantly with persons' abilities (Kendall's τ = -.243, p = .000), we interpreted a missing response as an inability to answer the item. Our scoring guide includes a best-practice solution of the written exam as well as a description of each optional scoring category [see Appendix (Table 5)]. Twelve items were scored dichotomously, and for 14 items students could earn partial credit (three response categories). The Appendix (Table 5) explains the scoring rules for each item in detail. In correspondence with the scoring guidelines, two trained raters coded the students' answers independently. These raters are research and teaching assistants at an expert level who teach empirical research methods to students of HRE&M at the LMU. An interrater reliability (kappa; Fleiss and Cohen 1973) of .940 was attained, which indicates a high level of agreement.

Instrument for an expert-based rating of the tasks' difficulties

For developing the proficiency level model for CCER (RQ3), the seven experts introduced in chapter 3.2 were asked to evaluate the test items with respect to the three characteristics assumed to determine their difficulty. A written questionnaire was used for this rating. It was largely based on three- or four-point Likert scales [for the operationalization of the criteria for the tasks' difficulty see Appendix (Table 6)]. Before the rating, the experts participated in a training session explaining the design of the items and the meaning of the different degrees of the task criteria to be evaluated. The theoretical estimation of the test items' difficulty levels was carried out a priori and therefore independently of the performance test's results. In order to find consensual ratings, the responses given by each expert were discussed within the group of all raters and the research team (Kuckartz 2014; Wahl 1982). This procedure served to make sure that all experts correctly understood what the variables of the questionnaire aimed at. Slight adjustments of the original coding were made in response to the insights derived from the focus group discussion.

Methods of data analysis

For empirically validating the theoretically assumed structural model for CCER (RQ1) and for examining the test instrument's quality (RQ2), the students' written exams were analyzed using psychometric models belonging to IRT. RQ1: Two central Rasch models were applied, a one-dimensional and a two-dimensional Partial Credit Model (PCM; Masters 1982; Adams et al. 1997), using the software ConQuest 3.0 (Wu et al. 2007). The central advantage of Rasch models, namely that individuals' ability parameters are estimated independently of the tasks used to compare the individuals, only holds if all items of a test have equal discrimination values (RQ2a). To test this, median score-split analyses, namely Andersen likelihood-ratio tests (Andersen 1973) and Wald tests (Koller et al. 2012, pp.
77-79), were executed using the eRm package of the software R (version 3.1.2; Mair and Hatzinger 2007) (see Footnote 3). The quality of the test items (RQ2b) was investigated by calculating and evaluating (i) the scaling of the individuals' ability parameters as well as the items' difficulty parameters, (ii) the EAP/PV (expected a posteriori/plausible values) reliability, (iii) the curve of the total test information function, and (iv) the wMNSQ (weighted mean square) values. For the expert-based determination of proficiency levels (RQ3), following Hartig (2007), we chose an additive, linear regression model for the relation between the item features (independent variables) and the IRT-based item difficulty parameters (dependent variable). The 50%-thresholds resulting from the IRT scaling were used as item difficulty values.

Footnote 3: Score-split analyses [using Andersen likelihood-ratio tests (Andersen 1973) and Wald tests (Koller et al. 2012)] have a long tradition as well as high power for examining this assumption (for a detailed discussion see Rasch 1961 and/or Glas and Verhelst 1995).

Results and discussion

Empirical validation of the structural model: RQ 1

Based on the finding of the LeScEd study (Groß Ophoff et al. 2014, p. 266), where the one-dimensional model fitted the data best, a one-dimensional PCM was tested against a two-dimensional between-item-multidimensionality PCM in order to identify whether the two expected dimensions of CCER are empirically distinguishable. Thereby, DIM1, the conceptual dimension, is described by 11 test items (1-6, 17-19, 25-26) and DIM2, the statistical dimension, by 15 test items (7-16, 20-24). In line with our theoretical expectations, the information criteria BIC, AIC, and CAIC, which show lower values for the two-dimensional PCM, provide empirical evidence for a better fit of the two-dimensional model (cf. Table 2). This finding is confirmed by the likelihood-ratio test according to Martin-Löf (Glas and Verhelst 1995, pp. 86-89), which is significant at the 5% level (Chi square = 25.90; df = 2; p = .000). The moderate correlation between the two latent dimensions of CCER (r = .678; covariance = .226) supports a two-dimensional solution. For the following analyses, the estimated parameters of the better-fitting two-dimensional PCM are used.

Table 2 Fit statistics for the one-dimensional PCM in comparison with the two-dimensional PCM

                                 One-dimensional PCM   Two-dimensional PCM
Deviance (LR test)               6073.97               6048.07
Number of estimated parameters   41                    43
BIC                              6280.75               6264.93
AIC                              6155.97               6134.07
CAIC                             6321.75               6307.93

Note: the degrees of freedom result from the difference between the numbers of estimated parameters; difference in deviance = 25.90.

Quality of the test instrument: RQ 2a) and 2b)

RQ 2a) refers to investigating whether the central Rasch-modeling assumption of equal discrimination parameters is fulfilled for all items. For the Andersen likelihood-ratio tests and Wald tests we determined a significance level of 20%. Taking into account the alpha-error cumulation, a Bonferroni correction (Abdi 2007) was conducted for the Wald tests by dividing the defined significance level by the number of performed tests per dimension. The Andersen likelihood-ratio test is not significant for the conceptual dimension (p = .503 > .2) but significant for the statistical dimension (p = .003 < .2). However, the results of the Wald tests show that no z value is significant. Consequently, our test allows a separate statement regarding task difficulties and test persons' abilities.
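The two-dimensional PCM itself was estimated in ConQuest and is not reproduced here, but the information criteria in Table 2 follow directly from the reported deviances and parameter counts, and the unidimensional score-split checks can be run with eRm. A sketch under the assumption that the polytomous responses of one dimension are available as a data matrix (hypothetical file name):

```r
library(eRm)

# Information criteria from the reported deviances (reproduces Table 2 up to rounding):
# AIC = deviance + 2k, BIC = deviance + k*log(N), CAIC = deviance + k*(log(N) + 1)
N <- 155
ic <- function(dev, k) c(AIC  = dev + 2 * k,
                         BIC  = dev + k * log(N),
                         CAIC = dev + k * (log(N) + 1))
ic(6073.97, 41)  # one-dimensional PCM
ic(6048.07, 43)  # two-dimensional PCM

# Unidimensional partial credit model for one dimension (response categories 0, 1, 2)
dim2 <- as.matrix(read.csv("ccer_dim2_statistical.csv"))
pcm_fit <- PCM(dim2)

# Andersen likelihood-ratio test and item-wise Wald tests with a median raw-score split;
# compare the Wald p values against the Bonferroni-corrected alpha (.20 / tests per dimension)
LRtest(pcm_fit, splitcr = "median")
Waldtest(pcm_fit, splitcr = "median")
```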
RQ 2b) examines whether the test items allow a reliable and valid measurement of CCER. For all test persons and all test items, (i) the scaling of the individuals' abilities and of the items' difficulties can be illustrated by a Wright map for the two scales (cf. Fig. 3; Wilson 2005, pp. 90-98). Out of a maximum of 40, the test score achieved on average is 23.35 (SD = 5.69). The ability parameters (EAP/PV estimators) range from -.952 to .900 logits for the conceptual dimension and from -1.664 to 1.547 logits for the statistical dimension. They are normally distributed (Kolmogorov-Smirnov test; DIM1: p = .564; DIM2: p = .402). The difficulty parameters define the latent variable of conceptual competencies on a scale from -1.506 to .748 logits and the latent variable of statistical competencies on a scale from -1.809 to .663 logits. Correspondingly, there is a lack of items with a very high degree of difficulty. This effect was to be expected, because in order to regulate the failure rate of the CCER exam a considerable number of items of easy and moderate difficulty had to be included. With regard to the two dimensions, which are assumed based on the empirical analysis, the (ii) EAP/PV reliability, which is comparable with Cronbach's alpha (Adams and Wu 2002, p. 152), shows moderate values of .548 for the conceptual and .737 for the statistical dimension. It is assumed that the low number of items, which is a consequence of the explained high-stakes testing conditions, is responsible for the moderate reliability values. Since the reliability value only expresses how precise the measurement is with respect to the complete ability spectrum, the (iii) Wright map is considered additionally. Its advantage compared with the EAP/PV reliability value is that the Wright map indicates how accurate the measurement is in different ability areas. As Fig. 3 illustrates, with the exception of high ability parameters, the item difficulties and student abilities correspond well. This supports a precise measurement, as the test also covers the ability level of students with a very high and a very low degree of CCER in a differentiated way. In order to examine (iv) potential Differential Item Functioning (DIF) effects with respect to the test persons' gender, an Andersen likelihood-ratio test (Glas and Verhelst 1995) as well as Wald tests were calculated. The test results show that no DIF effect exists for any item. The (v) wMNSQ values of the 26 test items are located within a range of .89-1.15 [cf. Appendix (Table 7)]. Hence, all items show a good to very good fit, since they are situated within the strict interval of .80 ≤ wMNSQ ≤ 1.20 postulated for the PISA study (OECD 2014, p. 151). The corresponding t values range from -1.0 (> -1.96) to 1.7 (< 1.96) and are therefore non-significant. As all items are of high quality, no item has to be excluded from the test. Our study is oriented towards Messick's (1989, 1995) concept of validity, which integrates different validity evidences. Accordingly, the test instrument designed for measuring CCER meets the requirements of (1) content validity, as great importance was attached to the design of authentic test tasks derived from a real empirical research study. To ensure that the constructed tasks are valid with regard to content-related aspects, we discussed them with experts (lecturers) of the addressed course on research methods within focus groups.
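Comparable fit, distribution, and DIF checks can be obtained for the eRm model sketched above; the infit mean squares reported by eRm play the same role as ConQuest's wMNSQ values. The gender vector and file name below are hypothetical.

```r
library(eRm)

# Person parameters and item fit statistics (infit/outfit mean squares)
pp <- person.parameter(pcm_fit)
itemfit(pp)

# Normality of the ability estimates (Kolmogorov-Smirnov test)
theta <- coef(pp)
ks.test(scale(theta), "pnorm")

# DIF with respect to gender: Andersen LR test with an external split criterion
gender <- read.csv("ccer_students.csv")$gender  # hypothetical 0/1 vector
LRtest(pcm_fit, splitcr = gender)
```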
These focus groups were also carried out to ensure (2) substantive validity, by discussing with the experts the cognitive solution processes (including the tasks' scoring) intended to be activated by the test tasks. Furthermore, (3) psychometric validity is indicated by the good values of the presented fit indices. Finally, (4) as a first approach to the test instrument's external validity, we considered the relationship between the grades the HRE&M students achieved in their course on research methods and the IRT-based ability parameters of the one-dimensional model. The grades, determined by the course teachers (independently of the research group), describe a combination of reviewing empirical academic literature and autonomously performing empirical research projects. The Spearman rank correlation coefficient (r = -.756, p value = .00) indicates that performances within the course and performances measured by the CCER test point in the same direction. That means that students who attained a high (low) ability parameter in the test on CCER also displayed a good (weak) performance in terms of the course grade (whereby the smaller the number of the grade, the better the performance). The effect size of the relationship can be interpreted as meaningful. It has to be noted, however, that the validity criterion of the course grade cannot differentiate between performance on the conceptual and on the statistical dimension. Hence, it does not allow a separate validation of the two dimensions. To sum up, based on the examined quality measures (i)-(v) as well as on the considerations addressing the different aspects of validity of the instrument for measuring CCER, a successful test construction can be assumed.

Level model: RQ 3

Our last RQ focuses on defining levels of CCER by task characteristics that significantly determine the item difficulty. Therefore, in a first step we determined the level of agreement between the experts by using the intraclass correlation (ICC) (Shrout and Fleiss 1979). The ICC values confirm a strong agreement between the raters concerning all three characteristics (cognitive process: ICC = .871; complexity: ICC = .818; familiarity: ICC = .854). Table 3 outlines the estimation of the common predictive power of all task features (adjusted R2) and the identified predictive effects of each task characteristic based on a multiple regression analysis (all prerequisites are met; see Footnote 4).

Footnote 4: The Variance Inflation Factor (2.539-5.573) and Tolerance (.179-.394) indices for all predictors do not show any critical values. Therefore, no multicollinearity between any variables exists (Bühner and Ziegler 2009, pp. 681).

Fig. 3 Level model for CCER (based on the Wright map using the two-dimensional PCM); distribution of the person and item parameters (each X represents 1.6 test persons)

Table 3 Multiple regression of the item difficulty parameters on the task characteristics (adjusted R2 = .770)

Predictor                                       b (non-std.)   Standard error   Beta (std.)   t value   p value
Constant                                        -1.779         .315                           -5.643    .000
Cognitive process: comprehension                  .451         .252             .251           1.791    .090
Cognitive process: analysis                       .917         .349             .559           2.629    .017
Cognitive process: application                   1.226         .379             .682           3.239    .005
Complexity: 2 elements                            .125         .254             .069            .490    .630
Complexity: at least 3 elements                   .625         .328             .412           1.907    .073
Familiarity: moderate learning opportunities      .382         .299             .199           1.276    .218
Familiarity: few learning opportunities           .309         .283             .189           1.092    .289

N = 26 (number of test items for both dimensions of CCER); reference categories for the three characteristics: cognitive process: retrieval; complexity: one element has to be processed; familiarity: many learning opportunities are provided.

The adjusted R2 of .770 constitutes a high proportion of explained variance. The four task features with p values below .10 (the three cognitive process contrasts and the complexity contrast 'at least 3 elements') significantly influence the item difficulty at an alpha level of 10% and were therefore used for the level modeling. The 'familiarity' of the situations seems to be neither conducive to nor hindering of the task solution.
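The external-validity correlation and the proficiency level regression can be expressed in a few lines of R; the data frames and variable names below are hypothetical stand-ins for the grade records and the expert ratings described above.

```r
# External validity: rank correlation between course grades and IRT ability estimates
# (smaller grade = better performance, hence a negative coefficient is expected)
grades <- read.csv("ccer_students.csv")$grade
cor.test(grades, theta, method = "spearman")

# Proficiency level modeling: additive linear regression of the IRT 50%-thresholds on the
# expert-rated task features (factors with reference categories: retrieval, one content
# element, many learning opportunities), analogous to Table 3
items <- read.csv("ccer_item_features.csv")
fit <- lm(difficulty ~ cognitive_process + complexity + familiarity, data = items)
summary(fit)    # adjusted R^2 and per-feature effects
# car::vif(fit) # optional multicollinearity check (variance inflation factors)
```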
Three proficiency levels could be determined: (A) 'the ability to comprehend research-methodological terms and concepts', (B) 'the ability to analyze research-methodological situations by processing several content-related elements', and (C) 'the ability to apply research-methodological concepts and procedures' (cf. Fig. 3). For all three levels, items covering the conceptual dimension and items covering the statistical dimension of CCER were constructed successfully. In Table 4 the proficiency levels are defined by the respective logit values of the thresholds. Additionally, the table presents the proportional allocation of the test persons to the proficiency levels. It has to be emphasized that the empirically defined three-stage level model has explanatory power for 100% of the test persons concerning the conceptual dimension and for 98.71% concerning the statistical dimension.

Table 4 Allocation of the test persons (N = 155) to the proficiency levels

The following description of the levels is based on analyzing the specific requirements of the tasks. After having graduated from the course on research methods, almost 100% of the HRE&M students are able to comprehend essential concepts and procedures of empirical research methods (proficiency level A). We interpret this level as the criterion for passing the bachelor degree in HRE&M studies. Item 18 (item difficulty = -1.063 logits), for instance, addresses the ability to recognize and establish that the author's decision to apply a written survey with a closed-ended response format is reasonable (e.g., because a larger number of persons can be questioned when using a closed format). Hence, in accordance with Marzano and Kendall (2007, p. 40), the learner has to mix 'new knowledge', meaning the information contained within the presented extracts of the study, and 'old knowledge residing in the learner's permanent memory' to solve tasks of this level. The major part of the students (72.91% for the conceptual and 61.94% for the statistical dimension) even reaches the level of conducting analyses of research-methodological situations by linking a crucial number of content-related elements (proficiency level B). Examining the items assigned to this level, analytical processes such as 'specifying', 'matching', and 'classifying' (Marzano and Kendall 2008, pp. 18-19) are needed. For example, to solve item 20 (item difficulty = -.274 logits; outlined in section 'Research questions') correctly, two mental processes are relevant: (a) matching, as scientific authors'
The following description of the levels is based on analyzing the specific requirements of the tasks. After having graduated from the course on research methods, almost 100% of the HRE&M students are able to comprehend essential concepts and procedures of empirical research methods (proficiency level A). We interpret this level as the criterion for passing the bachelor degree in HRE&M studies. Item 18 (item difficulty = -1.063 logits), for instance, addresses the ability to recognize and establish that the author's decision to apply a written survey with a closed-ended response format is reasonable (e.g., because a larger number of persons can be questioned when using a closed format). Hence, in accordance with Marzano and Kendall (2007, p. 40), the learner has to mix 'new knowledge', meaning the information contained within the presented extracts of the study, and 'old knowledge residing in the learner's permanent memory' to solve tasks at this level.

The major part of the students (72.91% for the conceptual and 61.94% for the statistical dimension) even reaches the level of analyzing research-methodological situations by linking a crucial number of content-related elements (proficiency level B). Examining the items assigned to this level, analytical processes such as 'specifying', 'matching', and 'classifying' (Marzano and Kendall 2008, pp. 18-19) are needed. For example, to solve item 20 (item difficulty = -.274 logits, outlined in section 'Research questions') correctly, two mental processes are relevant: (a) matching, as scientific authors' statements have to be compared with statistical values; and (b) specifying, as the test persons have to 'identify [...] principles that apply to a specific situation' (Marzano and Kendall 2007, p. 50).

A proportion of 44.52% regarding both dimensions is even able to apply research-methodological concepts and procedures (proficiency level C). These students are able to apply mental processes of knowledge utilization such as 'experimenting', 'investigating', 'decision making', and 'problem solving' (Marzano and Kendall 2007, p. 51). Item 26 (item difficulty = .413 logits), for instance, prompts the students to create an experimental design, based on a follow-up research question, to evaluate the effectiveness of a responsibility training and its impact on academic achievement. Regarding the process of experimenting, this task requires 'testing hypotheses for the purpose of understanding some physical or psychological phenomenon' (Marzano and Kendall 2008, p. 20).

Conclusions

Discussion and limitations

The results of the study show that we succeeded in designing a reliable and valid test instrument for assessing (prospective) VET-educators' competencies in consuming empirical research. With regard to the competence structure, our results indicate that the two considered dimensions, which are frequently referred to in practical applications (e.g., Renkl 1994; Onwuegbuzie 2001; Dunn et al. 2007), also become empirically evident. However, with the approach used, it cannot be ruled out that a model with more than two dimensions would empirically fit the data better than the two-dimensional model. Existing findings of the LeScEd group suggest a one-dimensional solution. We explain this deviation mainly by the different test conditions (low-stakes vs. high-stakes testing) and a different handling of missing values. While under low-stakes testing conditions omitted items are often ignored (e.g., Groß Ophoff et al.), under high-stakes testing conditions we have evidence that omitted items can be attributed to the participant not knowing the answer.

Despite the strong restrictions of the high-stakes testing approach (small sample size and a limited number of test tasks), the quality measures (wMNSQ and t values, the assumption of equal discriminability, and the test information curve) can be interpreted as sound; only the EAP/PV reliability shows moderate values. Based on the measures of item quality, no item has to be excluded, and therefore our high standard of theoretically based content validity is fulfilled within the final item pool. In our opinion this positive result is attributable to our decision to follow ambitious standards for test design and validation (Curriculum-Instruction-Assessment Triad, Evidence-Centered Design, high-stakes testing). However, this decision is also linked to the limitation that the study lacks generalizability, which constitutes a further crucial criterion of validity according to Messick (1995) but is not primarily dealt with within this article. To provide a generalizable evaluation of prospective VET-educators' CCER, it would be interesting to analyze how the test participants perform in CCER follow-up tests. Additionally, the test on CCER has to be implemented as a high-stakes exam in other research-methodological training courses at different institutions and in different courses of study which prepare future VET-educators. These institutions should comprise selected universities whose module descriptions for courses on research methods we analyzed.
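The different handling of omitted responses mentioned above can be illustrated with a small sketch: under the low-stakes convention omits are kept as missing values and ignored during scaling, whereas under the high-stakes convention adopted here they are scored as incorrect, because under exam conditions omitting signals not knowing the answer. The response matrix below is an invented example, not the study's data handling code.

```python
# Minimal sketch (illustrative): two conventions for handling omitted items
# in a scored response matrix.
import numpy as np
import pandas as pd

# Hypothetical partial-credit responses of 4 persons on 5 items;
# NaN marks an omitted item.
responses = pd.DataFrame(
    [[2, 1, np.nan, 0, 1],
     [1, np.nan, np.nan, 2, 0],
     [2, 2, 1, 1, np.nan],
     [0, 1, 0, np.nan, 1]],
    columns=[f"item_{k}" for k in range(1, 6)],
)

# Low-stakes convention: omits stay missing and are ignored in the estimation
# (the scaling software treats NaN as "not administered").
low_stakes = responses.copy()

# High-stakes convention: an omitted item is scored as incorrect (0 points).
high_stakes = responses.fillna(0)

print("mean raw score, omits ignored:  ", low_stakes.mean(axis=1).round(2).tolist())
print("mean raw score, omits as wrong: ", high_stakes.mean(axis=1).round(2).tolist())
```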
The results of the CCER level model specification show that two of the three defined task characteristics (cognitive processes and complexity) are able to explain nearly 100% of the prospective VET-educators' CCER abilities.

Besides the generalizability aspect, further limitations are (1) a constrained pool of items, (2) limited criteria for external validity, (3) constrained statements regarding test fairness, and (4) the test's focus on quantitative research methods. (1) Only a constrained selection of situations requiring CCER could be presented within the test. This is due to the high-stakes testing conditions in the form of a real exam and to the limited test time. Future large-scale research designs should include additional CCER test situations that prompt further statistical procedures (e.g., cluster analyses), for instance in a multi-matrix design. Furthermore, to measure the abilities of very high-performing students sufficiently, items with a very high degree of difficulty have to be integrated into the test on CCER. (2) First indications for confirming the external validity of our test instrument are provided. However, in order to make a separate statement regarding the external validity of both scales (the conceptual and the statistical scale) for assessing CCER, external criteria for the two dimensions are required. (3) Due to constrained possibilities for collecting demographic information under high-stakes conditions and related reasons of anonymity, test fairness could only be examined for the covariate 'gender'. (4) So far, CCER focuses on quantitative research aspects. As a consequence, an operationalization of the qualitative part constitutes a crucial desideratum for further research.

Implications

From the perspective of capturing and promoting the development of CCER during prospective VET-educators' studies, it is relevant to scale this competence according to features that might determine the difficulty of corresponding tasks. As not all students achieved the learning goal of applying research-methodological concepts and procedures, there is a need for a stronger focus on teaching activities that inspire learning processes supporting the development of the abilities required to master application tasks. The identification of significant task characteristics can help to design learning environments as well as test tasks. The constructed test tasks vary systematically with regard to the identified characteristics determining item difficulty. Apart from applying them for assessing CCER, they can also be used as learning tasks in order to (further) develop CCER. As hardly any valid tests are available in the field of higher education, usually a minimum score of 50% of the overall achievable score is used as the criterion for passing an exam. On the basis of a substantial proficiency level model, a criterion for passing learning goals could be provided instead. As a consequence, grades could be based on such an a priori defined criterion rather than on a more arbitrarily defined minimum score. The superordinate objective must be to implement validated test instruments with defined criterion-based proficiency levels in the form of an adaptive test design.
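The contrast between a 50% minimum score and a criterion-based passing rule can be sketched as follows; the threshold of proficiency level A and the person values are invented for illustration, and only the decision logic (an a priori defined criterion on the ability scale instead of a minimum raw score) reflects the argument made here.

```python
# Minimal sketch: raw-score cut-off vs. criterion-based passing rule.
# All numbers are illustrative assumptions, not values from the study.

# Hypothetical person results: raw score share and ability estimate in logits.
persons = [
    {"id": "P1", "raw_share": 0.48, "theta": -0.30},
    {"id": "P2", "raw_share": 0.52, "theta": -1.20},
    {"id": "P3", "raw_share": 0.71, "theta": 0.45},
]

RAW_CUTOFF = 0.50          # conventional minimum score value
LEVEL_A_THRESHOLD = -1.00  # assumed logit threshold of proficiency level A

for p in persons:
    passed_raw = p["raw_share"] >= RAW_CUTOFF
    passed_criterion = p["theta"] >= LEVEL_A_THRESHOLD
    print(f'{p["id"]}: raw-score rule -> {"pass" if passed_raw else "fail"}, '
          f'criterion-based rule -> {"pass" if passed_criterion else "fail"}')
```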
Authors' contributions
Both authors contributed substantially to this work. They designed the study, modeled the test items, implemented them, and participated in drafting and discussing the manuscript at all stages. Both authors read and approved the final manuscript.

Authors' information
Michaela Wiethe-Körprich studied Human Resource Education and Management at Ludwig-Maximilians-University in Munich from 2007 to 2011. Since 2012 she has been a research and teaching assistant at the Institute for Human Resource Education and Management, Munich School of Management, Ludwig-Maximilians-University in Munich. Sandra Bley studied Human Resource Education and Management at Georg-August University in Göttingen from 2001 to 2006. She holds a Master of Business Research (MBR, 2008) and a doctoral degree (Dr. oec. publ., 2010) from Ludwig-Maximilians-University in Munich. Since 2011 she has been a senior researcher at the Institute for Human Resource Education and Management, Ludwig-Maximilians-University in Munich.

Availability of data and materials
The data supporting the authors' findings are provided by the authors on request.

Ethics approval and consent to participate
It is confirmed that the study was performed according to the ethical principles that are relevant for writing scientific studies.

See Tables 5, 6 and 7.

Table 5 Instructions and scoring rules of the CCER test items

Instruction: Please derive the two main research questions from the abstract of the study by Koch (2006)!
Scoring: 0 = no research question is derived correctly; 2 = both research questions are derived correctly

Instruction: You can find some selected sentences of Koch's study below. To which part of the introduction do these sentences refer (A) definition/defining the research topic; C) relevance of the problem)? Sentence 1: "A long duration of study, subject changes and dropouts are quite characteristic for university studies in Germany [...]." Sentence 3: "A person is self-determined if he takes responsibility for his own actions. We assume that self-determined students are more likely to demonstrate intrinsic motivation for their studies. This in turn seems to be desirable in pedagogical contexts [...]." Sentence 4: "Schlenker et al. [...] introduce a social-psychological model of personal responsibility and examine its application to study success."
Scoring: 0 = identified aspect is wrong

Instruction: What is the difficulty in operationalizing a latent construct? Please explain!
Scoring: 0 = no aspect of difficulty is explained correctly; 1 = one aspect of difficulty is explained correctly

Instruction: Which information is given to the reader by the marked numbers "1"/"2"/"3"/"4"?
Scoring: 1 = information is denoted correctly

Instruction: Please assess the information given by the marked numbers "1"/"2"/"3"/"4"!
Scoring: 1 = information is assessed correctly

Instruction: Factor analysis provides us with various quality measures to assess latent constructs. Please name two quality measures and explain under which conditions the measure is assessed as "well fulfilled" (an approximate value is sufficient if you want to specify numbers)!
Scoring: 0 = no measure is outlined correctly AND no condition is explained; one measure is outlined correctly BUT the associated condition is explained wrongly; 1 = one measure is outlined correctly AND the associated condition is correctly explained; two measures are outlined correctly BUT for neither of them the associated condition(s) are explained correctly; 2 = two measures are outlined correctly AND for both of them the associated condition(s) are explained correctly; two measures are outlined correctly BUT only for one of them the associated condition(s) are explained correctly
Instruction: The quality of the presented factor analysis cannot be assessed if the assessment is solely based on Table 1. Which aspects are missing for a complete assessment of the quality? Please outline two aspects!
Scoring: 0 = no aspect is outlined (correctly); 1 = one aspect is outlined correctly; 2 = two aspects are outlined correctly

Instruction: Data collection is carried out in Koch's (2006) study by means of a written survey with a closed answer format. Which other data collection methods do you know? Please name four additional methods!
Scoring: 0 = one or no method is outlined correctly; 1 = three or two methods are outlined correctly; 2 = four methods are outlined correctly

Instruction: How do you assess the decision in this study to use a written survey with a closed answer format? Please justify your assessment with two arguments!

Instruction: The summary (Koch 2006, p. 1) states: "It is claimed, that personal responsibility for study success [...] has a positive effect on study management and performance." Would you prefer a correlation or a regression analysis to examine this statement? Please justify your decision with two arguments!

Instruction: Based on the presented correlation matrix included in Koch's (2006) study, a researcher interpreted the following statements: (1) "Final university examination grade and commitment to the studies are two independent criteria for success." (2) "Personal responsibility for your own studies and study success are positively correlated." (3) "The two sub-dimensions 'clarity of purpose' and 'significance' are dependent factors." (4) "Measuring accuracy of the sub-dimension 'significance' can be assessed as good." Please mark the value that resulted in the given statement (1)-(4) within the presented output (e.g., by circling) and explain in one sentence why the argument is derived correctly!
Scoring: 0 = no or incorrectly marked value AND no or incorrect explanation; incorrectly marked value BUT correct explanation

Instruction: Correlation analyses were supplemented by regression analyses. These analyses also show a positive effect of personal responsibility on study success. Nevertheless, the author argues in the conclusion part that the results do not have any predictive character; that means, variance differences in study success cannot be traced back causally to variances in taking personal responsibility for one's own studies. Why could the author come to this conclusion? Please denote a central aspect and justify your explanation!
Scoring: 0 = wrong or no aspect is denoted and justified; 2 = aspect is denoted and justified correctly

Instruction: In a following research project, Koch and colleagues developed a training to strengthen the personal responsibility of students for their studies. Please sketch a suitable experimental plan in the usual matrix format for evaluating the effectiveness of the newly developed training!
Scoring: 0 = no or one column is sketched correctly; 1 = two or three columns are sketched correctly; 2 = all four columns are sketched correctly

Instruction: Please describe each component of a (quasi-)experimental design for evaluating the effectiveness of the newly developed training!
Scoring: 0 = no or one component is described correctly; 2 = four or five components are described correctly
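The scoring rules above are partial-credit rubrics that translate the number of correct response elements into 0, 1 or 2 points. As an illustration, the sketch below encodes the rubric of the item asking for four additional data collection methods; the list of accepted answers is an assumption made for demonstration purposes only.

```python
# Minimal sketch: partial-credit scoring of the "name four additional data
# collection methods" item following the rubric in Table 5.
# The set of accepted answers is an illustrative assumption.
ACCEPTED_METHODS = {"interview", "observation", "document analysis",
                    "experiment", "test", "think-aloud protocol"}

def score_data_collection_item(answers: list[str]) -> int:
    """Return 0, 1 or 2 points according to the number of correctly named methods."""
    correct = {a.strip().lower() for a in answers} & ACCEPTED_METHODS
    if len(correct) >= 4:
        return 2   # four methods are outlined correctly
    if len(correct) >= 2:
        return 1   # two or three methods are outlined correctly
    return 0       # one or no method is outlined correctly

# Example: two correct methods and one non-accepted answer yield one point.
print(score_data_collection_item(["Interview", "observation", "written survey"]))
```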
Table 6 Criteria for the tasks' difficulty, including their operationalization and examples

(I) Kind of cognitive process required. Examples: (1) A student who is able to reproduce the concept of 'reliability' has attained the level of retrieval. (2) A student who is able to decide which data collection method is suitable depending on the presented investigation context has achieved the level of comprehension. (3) A student located on the level of analysis is able to assign presented extracts from a study's problem definition to its typical elements. (4) A student who is able to set up a research design based on a presented research objective or question of a concrete study has attained the level of knowledge utilization.

(II) Complexity concerning the number of content-related/curricular elements (= solution-relevant variables) (Adams and Wu 2002). Three response categories: (1) only one isolated content-related element has to be processed; (2) two content-related elements have to be processed; (3) at least three content-related elements have to be processed. Examples: (1) A task that only demands describing what the term 'nominal scale level' means involves low complexity. (2) A task which requires deciding whether a correlation or a regression analysis fits better to answer a presented research question shows moderate complexity. (3) A task prompting to set up a context-related experimental design which has to include all relevant components (such as pretest, treatment, posttest, experimental group, and control group) and which requires considering the concept of randomization involves high complexity.

(III) Familiarity with the situation (amount of learning opportunities provided). Regarding each item, an index consisting of three criteria was calculated in order to assess the amount of learning opportunities: (a) How many lecture slides deal with the relevant content area (referring to the first element of the course, the lecture)? (b) How extensively was the respective content area treated within the instruction (referring to the first and the second element of the course, the lecture and the tutorial moderated by a lecturer)? (i) low instructional extent, (ii) high instructional extent. (c) Did the test persons have the opportunity to participate proactively in a case-related hands-on application concerning the respective content area (referring to the third element of the course, the project work in small groups supported by advanced students)? (i) no hands-on application was performed, (ii) a hands-on application was performed. Examples: (1) A large proportion of the course's instruction was spent on performing several in-depth and hence intensive exercises to handle situations belonging to the curricular area of evaluating the adequateness of methods for collecting data. (2) A moderate proportion of the course's instruction was spent on in-depth exercises with regard to interpreting statistical outputs for explorative factor analyses. (3) The curricular area of regression analysis was treated very superficially during the course (no hands-on applications and repetitions of the respective contents/methods were provided).

Assumption ad (I): the solving probability decreases with an increase in the kind of cognitive process which is necessary to master the respective task. Assumption ad (II): fewer test persons are able to solve an item addressing many different content-related/curricular elements that have to be linked than an item designed to capture only one or few elements of the complex structure of research methods. Assumption ad (III): the solving probability is lower for items which are directed at a quite unfamiliar situation compared to items that display familiar situations.
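All three assumptions state that the solving probability drops as the demands captured by a characteristic increase. In the Rasch-type scaling used in this study, that relation is carried by the logistic link between person ability and item difficulty; the short sketch below shows this link for a dichotomous item, using the item difficulties reported above for items 18, 20, and 26 as examples (the partial credit model applied to the polytomous items generalizes the same idea).

```python
# Minimal sketch: solving probability under the dichotomous Rasch model,
# P(correct) = exp(theta - delta) / (1 + exp(theta - delta)).
# The ability value is illustrative; the difficulties are those reported
# in the text for items 18, 20, and 26.
import math

def rasch_probability(theta: float, delta: float) -> float:
    """Probability that a person with ability theta solves an item of difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

theta = 0.0  # person of average ability
for delta in (-1.063, -0.274, 0.413):
    print(f"difficulty {delta:+.3f} logits -> P(solve) = {rasch_probability(theta, delta):.2f}")
```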
Table 7 Item fit statistics (wMNSQ = weighted mean square, CI = confidence interval)

References

Abdi H (2007) Bonferroni and Sidak corrections for multiple comparisons. In: Salkind NJ (ed) Encyclopedia of measurement and statistics. Sage, Thousand Oaks, pp 103-107
Adams RJ, Wu ML (2002) PISA 2000 technical report. OECD, Paris
Adams RJ, Wilson M, Wang W-C (1997) The multidimensional random coefficients multinomial logit model. Appl Psychol Meas 21(1): 1-23
Andersen EB (1973) A goodness of fit test for the Rasch model. Psychometrika 38(1): 123-140
Baumert J, Kunter M (2006) Stichwort: Professionelle Kompetenz von Lehrkräften. Z Erziehwiss 9(4): 469-520
Bley S (2017) Developing and validating a technology-based diagnostic assessment using the evidence-centered game design approach: an example of intrapreneurship competence. Empirical Res Voc Ed Train 9(6): 1-32. doi:10.1186/s40461-017-0049-0
Bley S, Wiethe-Körprich M, Weber S (2015) Formen kognitiver Belastung bei der Bewältigung technologiebasierter authentischer Testaufgaben - eine Validierungsstudie zur Abbildung von beruflicher Kompetenz. Zeitschrift für Berufs- und Wirtschaftspädagogik 111(2): 268-294
Blömeke S, Felbrich A, Müller C, Kaiser G, Lehmann R (2008) Effectiveness of teacher education: state of research, measurement issues and consequences for future studies. Int J Math Educ 40: 719-734
Blömeke S, Suhl U, Kaiser G (2011) Teacher education effectiveness: quality and equity of future primary teachers' mathematics and mathematics pedagogical content knowledge. J Teach Educ 62(2): 154-171
Blömeke S, Gustafsson J-E, Shavelson RJ (2015) Beyond dichotomies: competence viewed as a continuum. Z Psychol 223(1): 3-13
Blum W, Neubrand M, Ehmke T, Senkbeil M, Jordan KA, Ulfig F, Carstensen CH (2003) Mathematische Kompetenz. In: Prenzel M, Baumert J, Blum W, Lehmann R, Leutner D, Neubrand M, Pekrun R, Rolff H-G, Rost J, Schiefele U (eds) PISA 2003. Waxmann, Münster, pp 47-92
Bouley F, Wuttke E, Schnick-Vollmer K, Schmitz B, Berger S, Fritsch S, Seifried J (2015) Professional competence of prospective teachers in business and economics education - evaluation of a competence model using structural equation modelling. Peabody J Educ 90(4): 491-502
Brown NJS, Furtak EM, Timms M, Nagashima SO, Wilson M (2010) The evidence-based reasoning framework: assessing scientific reasoning. Educ Assess 15(3/4): 123-141
Bühner M (2011) Einführung in die Test- und Fragebogenkonstruktion. Pearson Studium, München
Bühner M, Ziegler M (2009) Statistik für Psychologen und Sozialwissenschaftler. Pearson, Hallbergmoos, pp 681-682
Darling-Hammond L, Bransford J (2005) Preparing teachers for a changing world: what teachers should learn and be able to do. Jossey-Bass, San Francisco
de Ayala RJ (2009) The theory and practice of item response theory. The Guilford Press, New York
de Ayala RJ, Plake BS, Impara JC (2001) The impact of omitted responses on the accuracy of ability estimation in item response theory. J Educ Meas 38(3): 213-234
Dunn D, Smith RA, Beins B (2007) Best practices for teaching statistics and research methods in the behavioral sciences. L. Erlbaum Associates, Mahwah
Egeln J, Gottschalk S, Rammer C, Spielkamp A (2002) Hohe Zahl an Spinoff-Gründungen aus der Wissenschaft. ZEW Gründungsreport 2(2): 3-4
Embretson SE (2002) Generating abstract reasoning items with cognitive theory. In: Irvine S, Kyllonen P (eds) Generating items for cognitive tests: theory and practice. Lawrence Erlbaum Associates, Mahwah, pp 35-60
Fichten W (2010) Forschendes Lernen in der Lehrerbildung. In: Eberhardt U (ed) Neue Impulse in der Hochschuldidaktik: Sprach- und Literaturwissenschaften. VS Verlag für Sozialwissenschaften, Wiesbaden, pp 127-182
Fleischer J, Koeppen K, Kenk M, Klieme E, Leutner D (2013) Kompetenzmodellierung: Struktur, Konzepte und Forschungszugänge des DFG-Schwerpunktprogramms. Z Erziehwiss 16(Sonderheft 18): 5-22
Fleiss JL, Cohen J (1973) The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Meas 33(3): 613-619
Gesellschaft für Empirische Bildungsforschung (2012) Satzung der Gesellschaft für Empirische Bildungsforschung. http://www.gebf-ev.de/über-die-gebf/satzung/. Accessed 1 Oct 2015
Glas CAW, Verhelst ND (1995) Testing the Rasch model. In: Fischer GH, Molenaar IW (eds) Rasch models: foundations, recent developments, and applications. Springer, New York, pp 69-95
Groß Ophoff J, Schladitz S, Lohrmann K, Wirtz MA (2014) Evidenzorientierung in bildungswissenschaftlichen Studiengängen. In: Drossel K, Strietholt R, Bos W (eds) Empirische Bildungsforschung und evidenzbasierte Reformen im Bildungswesen. Waxmann, Münster, pp 251-275
Groß Ophoff J, Schladitz S, Leuders J, Leuders T, Wirtz MA (2015) Assessing the development of educational research literacy: the effect of courses on research methods in studies of educational science. Peabody J Educ 90(4): 560-573
Groth RE (2007) Toward a conceptualization of statistical knowledge for teaching. J Res Math Educ 38(5): 427-437
Hartig J (2007) Skalierung und Definition von Kompetenzniveaus. In: Beck B, Klieme E (eds) Sprachliche Kompetenzen: Konzepte und Messung. DESI-Studie. Beltz, Weinheim, pp 83-99
Hartig J, Frey A (2013) Sind Modelle der Item-Response-Theorie (IRT) das "Mittel der Wahl" für die Modellierung von Kompetenzen? Z Erziehwiss 16(Sonderheft 18): 47-51
Hartig J, Klieme E (2006) Kompetenz und Kompetenzdiagnostik. In: Schweizer K (ed) Leistung und Leistungsdiagnostik. Springer, Berlin, pp 127-143
Holland JL (1959) A theory of vocational choice. J Couns Psychol 6: 35-45
Interstate Teacher Assessment and Support Consortium (2011) Model core teaching standards: a resource for state dialogue. Council of Chief State School Officers, Washington
Jahed J, Bengel J, Baumeister H (2012) Transfer von Forschungsergebnissen in die medizinische Praxis. Gesundheitswesen (Bundesverband der Ärzte des Öffentlichen Gesundheitsdienstes Germany) 74(11): 754-761
Janesick VJ (2006) Authentic assessment primer. Peter Lang, New York
Kauertz A, Fischer HE (2008) Schwierigkeitserzeugende Merkmale physikalischer Testaufgaben. In: Höttecke D (ed) Kompetenzen, Kompetenzmodelle, Kompetenzentwicklung: Jahrestagung in Essen 2007. Lit, Berlin, pp 218-220
Klahr D, Dunbar K (1988) Dual search space during scientific reasoning. Cogn Sci 12: 1-48
KMK (2004) Standards für die Lehrerbildung: Bildungswissenschaften (Beschluss der Kultusministerkonferenz vom 16.12.2004). http://www.kmk.org/fileadmin/veroeffentlichungen_beschluesse/2004/2004_12_16-Standards-Lehrerbildung.pdf. Accessed 21 Aug 2014
Koch S (2006) Persönliche Verantwortung für den Studienerfolg. Dimensionen und Korrelate. Z Pädag Psychol 20(4): 243-250
Koller I, Alexandrowicz R, Hatzinger R (2012) Das Rasch-Modell in der Praxis: Eine Einführung mit eRm. Facultas, Wien
Kuckartz U (2014) Qualitative Inhaltsanalyse. Methoden, Praxis, Computerunterstützung (2. Aufl.). Beltz/Juventa, Weinheim
Mair P, Hatzinger R (2007) Extended Rasch modeling: the eRm package for the application of IRT models in R. J Stat Softw 20(9): 1-20
Marzano RJ, Kendall JS (2007) The new taxonomy of educational objectives. Corwin Press, Thousand Oaks
Marzano RJ, Kendall JS (2008) Designing and assessing educational objectives: applying the new taxonomy. Corwin Press, Thousand Oaks
Masters GN (1982) A Rasch model for partial credit scoring. Psychometrika 47(2): 149-174
Messick S (1989) Validity. In: Linn RL (ed) Educational measurement. American Council on Education, New York, pp 13-103
Messick S (1995) Validity of psychological assessment: validation of inferences from persons' responses and performances as scientific inquiry into score meaning. Am Psychol 50(9): 741-749



Michaela Wiethe-Körprich, Sandra Bley. Prospective educators as consumers of empirical research: an authentic assessment approach to make their competencies visible, Empirical Research in Vocational Education and Training, 2017, 8, DOI: 10.1186/s40461-017-0052-5