Towards validation of a rational number instrument: An application of Rasch measurement theory

Venkat and Spaull (2015) reported that 79% of 401 South African Grade 6 mathematics teachers showed proficiency of content knowledge below Grade 6–7 level in a Southern and Eastern Africa Consortium for Monitoring Educational Quality (SACMEQ) 2007 mathematics teacher test. Universities recruit and receive students from some of the schools where these teachers are teaching. In previous years of teaching first-year students in the mathematics module in the Foundation Phase teacher development programme, we noticed that each cohort of prospective teachers comes with knowledge bases that are at different levels. Classes of students with such varied mathematics knowledge are difficult to teach unless the lecturer has some idea of their conceptual and procedural gaps. This variation is greatly magnified in the domain of rational numbers, in which prospective teachers are expected to be knowledgeable and confident in order to lay a good foundation in their future teaching. An instrument, functioning as a diagnostic and baseline test for the 2015 first-year Foundation Phase cohort, was constructed at university level on the fractions-decimals-percentages triad. The study aimed to gauge the level of students' cognitive understanding of rational numbers and to evaluate the validity of the instrument used to elicit their mathematical cognition. All the participants admitted into the Foundation Phase teacher training programme were tested on 93 items comprising multiple choice, short answer and constructed response formats that elicited both conceptual and procedural understanding.


Introduction
Venkat and Spaull (2015) reported that 79% of 401 South African Grade 6 mathematics teachers showed proficiency of content knowledge below Grade 6-7 level in a Southern and Eastern Africa Consortium for Monitoring Educational Quality (SACMEQ) 2007 mathematics teacher test. Universities recruit and receive students from some of the schools where these teachers are teaching. In previous years of teaching first-year students in the mathematics module in the Foundation Phase teacher development programme, we noticed that each cohort of prospective teachers comes with knowledge bases that are at different levels. Classes of students with such varied mathematics knowledge are difficult to teach unless the lecturer has some idea of their conceptual and procedural gaps. This variation is greatly magnified in the domain of rational numbers, in which prospective teachers are expected to be knowledgeable and confident in order to lay a good foundation in their future teaching. An instrument, functioning as a diagnostic and baseline test for the 2015 first-year Foundation Phase cohort, was constructed at university level on the fractions-decimals-percentages triad. The study aimed to gauge the level of students' cognitive understanding of rational numbers and to evaluate the validity of the instrument used to elicit their mathematical cognition. All the participants admitted into the Foundation Phase teacher training programme were tested on 93 items comprising multiple choice, short answer and constructed response formats that elicited both conceptual and procedural understanding.
Application of the Rasch model enabled a finer analysis of the test construct, the individual item and person measures, and the overall test functioning by making explicit the responses expected according to the model versus the actual responses of the students. In addition, the test as a whole was investigated for properties that are requirements of valid measurement, such as local independence, where each item functions independently of the other items.
The research questions were:
• To what extent does the test provide valid measures of student proficiency?
• How might the test be improved for greater efficiency of administration, and greater validity for estimating student proficiency?
The aims of the immediate analyses were to:
• Evaluate the assessment tool in terms of fit to the model, both item and person fit, thereby checking whether the tool was appropriate for this student cohort.
• Provide detailed descriptions of selected items in relation to the students taking the test.
The validity and reliability of the assessment tool were analysed through the Rasch model, incorporating both the dichotomous and the partial credit model, using Rasch Unidimensional Measurement Models (RUMM) software (see Andrich, Sheridan, & Luo, 2013). The processes of analysis and refinement, and the final outcome of this cycle, are described. As this test was used as a preliminary diagnostic instrument, we regard ongoing cycles of refinement as pertinent in the interests of informing the teaching of mathematics on fractions-decimals-percentages to preservice cohorts of teachers in our programme.

Literature review
In an attempt to clarify the assessment, how it was conducted and its purpose, we provide the justification for the exercise. It was critical to ensure that certain conditions were satisfied in order to safeguard the effectiveness of the assessment as well as the validity of the test items. Stiggins and Chappuis (2005) explained that assessment must be guided by a clear purpose and it must accurately reflect the learning expectations. Wiliam (2011) affirms that a method of assessment must be capable of reflecting the intended target and also act as a tool for gauging teaching proficiency. These were the core intentions of the assessment in this research, and therefore the validation of the test as a whole, and the validation of independent items was critical and appropriate.
The learning and teaching of rational number concepts is particularly complex. The fraction 6/25, represented as 6/25 = 0.24, has a meaning different from that of the whole numbers 6 and 25. The numbers 6 and 25 are called local values while together, as a single entity yielding 0.24, they constitute a global value and have a different meaning and value from 6 and 25 represented separately (Gabriel, Szucs, & Content, 2013; Sangwin, 2007). These authors found that it was not a simple process for either learners or adults to cross the bridge from whole numbers to fractions (global value form). Vosniadou (2007, 2010)  numbers. Given this complexity, the operations on rational numbers, for example addition, subtraction, multiplication and division, require procedures that may previously have been learned when working with natural numbers but that now appear to generate misconceptions and associated errors (Harvey, 2011; Pantziara & Philippou, 2012; Shalem, Smith & Sorto, 2014). In fact the operations on rational numbers are somewhat distinct, and require additional conceptual understanding together with the associated procedures.
Besides the features mentioned previously, there are different representational systems for rational numbers, namely common fractions, decimal fractions and percentages. While there is equivalence across the three systems within the triad fractions-decimals-percentages, this equivalence is not obvious at face value unless the student has understood the organising principles of each system. For instance, the denominator of a percentage representation is always 100, for common fractions the choice of denominator is infinite, while for decimal fractions, the denominator is 1 (one).
The apparent simplicity of the percentage because of its everyday use belies the complexity of this 'privileged proportion' (Parker & Leinhardt, 1995, p. 421). For example, an additive difference between two percentages may be confused with a ratio difference. Hiebert and Lefevre (1986, pp. 3-4) define conceptual knowledge as 'knowledge that is rich in relationship, that can be thought of as a connected web of knowledge, a network in which linking relationships are as prominent as the discrete pieces of information'. Such knowledge is described as that which is interconnected through relationships at various levels of abstraction. Conceptual knowledge is essential for conceptual understanding; in its absence, learners engage ineffectively in problem solving and follow incorrect procedures. Conceptual knowledge plays the more important role, although conceptual and procedural knowledge interactively support a solid knowledge foundation. Stacey et al. (2001) found that preservice primary school teachers had problems understanding the size of decimals in relation to zero, and had limited awareness of the misconception among learners that 'shorter is larger'. Ryan and Williams (2007) also highlight and explain misconceptions and associated errors that learners commonly make when adding and subtracting fractions, working with decimals, and interpreting place value, such as having problems with zero when subtracting smaller from larger digits. Huang, Liu and Lin (2009) report that preservice teachers in Taiwan displayed better procedural knowledge of fractions but lacked conceptual knowledge because of the way they had received this knowledge themselves. They recommended that these preservice teachers be given more opportunities to construct their conceptual knowledge before they graduate.
Pesek, Gray and Golding (1997) believe that clear understanding of rational numbers is one of the most foundational sections in the primary school curriculum and yet, presently, is one of the least understood by both teachers and learners. Identifying mathematical competence levels of incoming preservice teachers provides an opportunity for the timely remediation of at-risk students. The conceptual complexities that generate misconceptions and associated errors emerge from lack of conceptual understanding (Ryan & Williams, 2007; Charalambous & Pitta-Pantazi, 2005, 2007). Research shows that in most cases both teachers and learners appear to have instrumental understanding of fractions, but do not really know why the procedures are used (Post, Harel, Behr, & Lesh, 1991). Students tend to develop conceptual schemes and information processing capacities to master fractions, decimals and percentage concepts individually but they also need to understand the commonalities between the different representations in their interaction with each other (Kieren, 1980). The educational aim however is for these students to have a balanced ability to follow a procedure with conceptual or relational understanding as the two facets interactively support a solid knowledge foundation (Zhou, 2011).

Assessment and measurement
The rich theorising of and research into rational numbers provides the theoretical base for the assessment instrument, which therefore meets the requirement for measurement to define clearly what is to be tested (Wright & Stone, 1979, 1999). The next requirement is to outline the interrelationships between component parts of the construct; in the case of this study, the interrelationships between fraction, decimal and percentage representations. The third stage is the construction and selection of items that will operationalise the construct, keeping in mind its complexity, and which will provide the teacher with evidence of misconceptions that would need to be addressed in class. A final phase is the post hoc verification of the functioning of the test as a whole and of the individual items.

Research design (participants, measures and models)
The primary study (Maseko, 2019) investigated the extent to which the 2015 cohort had mastered and retained their procedural and conceptual knowledge from their school level mathematics. This prior study reports on the level of relational understanding in the triad of concepts fractions-decimals-percentages of the first-year Foundation Phase student teachers entering the education programme. This article reports on the appropriateness of the instrument designed to test the students' levels of understanding and conceptual knowledge as they entered the teacher education programme.
The assessment tool was administered to the whole population of students that were admitted into the Foundation Phase teacher training programme (N = 117). The test comprised 93 items that were designed to elicit prior knowledge at the beginning of the academic year.
The main research study comprised five conceptual categories that facilitated the analyses. The categories are understanding rational number concepts: definitions and conversions (14 items); manipulating symbols (operations) (17 items); comparing and sequencing rational numbers (15 items); alternate forms of rational number representation (35 items); as well as solving mathematical word problems with rational number elements (12 items). The items were drawn from selected projects, for example 'the rational number project' (Cramer, Behr, Post, & Lesh, 2009), and other such literature, and then adapted to post secondary school level.
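As a quick consistency check on the test blueprint, the five category counts can be totalled and expressed as shares of the whole test. The sketch below uses only the counts stated above; the shortened category labels and variable names are illustrative.

```python
# Item counts per conceptual category, as stated in the text.
category_items = {
    "concepts: definitions and conversions": 14,
    "manipulating symbols (operations)": 17,
    "comparing and sequencing": 15,
    "alternate forms of representation": 35,
    "word problems": 12,
}

total_items = sum(category_items.values())

# Share of the test devoted to each category, in percent.
shares = {
    name: round(100 * count / total_items, 1)
    for name, count in category_items.items()
}

print(f"total items: {total_items}")
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The total agrees with the 93 items reported for the test, and the category on alternate forms of representation carries the largest share.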
The items were primarily informed by the conceptual categories above, and could be identified according to the following requirements:
• The items demanded a demonstration of procedural as well as conceptual understanding.
• The items included fraction, decimal and percentage representations.
• Items were generated with the specific purpose of evoking misconceptions.
• The items were comprehensive, covering most concepts and sub-concepts within the three representational systems: fractions, decimal fractions and percentages.
• The format of the test item types included multiple choice items, short answer, as well as extended response items.
The reason for such a comprehensive selection of items was that the lecturers needed to identify the many difficulties and misconceptions the students could bring into their first semester mathematics class. A range of difficulty that would include learners of current low proficiency, and high proficiency, was also required. Also, at the time of setting the items, the instructors were not sure from which categories the difficulties would emerge.
The Rasch model was applied in this study in order to either confirm or challenge the theoretical base, to check the validity of the instrument, and to measure the students' cognition of rational number concepts. The hypothesis was that the assessment tool would function according to measurement principles. The Rasch model provided information on where the item functioning and student responses were unexpected. Possible explanations could then be inferred and presented, and these provided some indications for the refinement of the test instrument.

Ethical considerations
This study has been cleared by the University of Johannesburg Ethics Committee, with the ethical clearance number SEM 1 2018-021.

Findings
The first analysis showed the test instrument to have a sound conceptual base and to be well targeted to the cohort, with a range of items, such that the students of current lower proficiency could answer a set of questions with relative ease, while students of high proficiency would experience some challenging items. Table 1 shows summary statistics of the Rasch analysis. In this model, the item mean is set at zero, with items of greater and lesser difficulty calibrated against the mean. Person proficiency is then estimated against the item difficulty. The item standard deviation was 1.6302. The person mean location is estimated to be -0.4238 logits, and the person standard deviation is 0.9686, which shows fairly good targeting and spread. The person separation index of 0.9114 shows that the assessment tool was able to differentiate well between students' proficiencies and that the power of fit was excellent, in essence a high reliability.
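The person separation index is commonly described as the proportion of the observed variance in person estimates that is not attributable to measurement error. The sketch below assumes that definition; the person locations and standard errors are hypothetical values chosen for illustration, not the study's data, and the function name is not part of RUMM.

```python
import statistics

def person_separation_index(locations, standard_errors):
    """Share of observed person variance not due to measurement error.

    Assumes PSI = (observed variance - mean error variance) / observed variance,
    with the observed variance taken as the sample variance of the estimates.
    """
    observed_variance = statistics.variance(locations)
    mean_error_variance = statistics.fmean(se ** 2 for se in standard_errors)
    return (observed_variance - mean_error_variance) / observed_variance

# Hypothetical person estimates (logits) and their standard errors.
example_locations = [-1.5, -0.5, 0.0, 0.5, 1.5]
example_errors = [0.5, 0.5, 0.5, 0.5, 0.5]

psi = person_separation_index(example_locations, example_errors)
print(f"PSI = {psi:.3f}")
```

Under this definition, an index close to 1 indicates that the spread of person estimates is large relative to the error with which each estimate is measured.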
As observed in the person-item map (Figure 1), a range of items from easy to difficult was achieved, and the test is well targeted.
Easier items are located at the lower end of the map (Item 65 and Item 66), while the difficult items are located at the higher end (Item 27 and Item 28). Similarly, learners of high proficiency are located higher on the map, at 2.903 and 1.733, while learners of low proficiency are located at -2.159 and -2.143. The mathematical structure of the Rasch model is such that where a person's proficiency location is aligned with an item difficulty location, an individual of that proficiency level has a 50% probability of answering an item of that difficulty level correctly (Rasch, 1960/1980). From the model one is able to predict how a student in a particular location will perform against an item at, below or above their location on the scale.
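The 50% property follows from the form of the dichotomous Rasch model, in which the probability of a correct response depends only on the difference between the person location and the item location, both in logits: P = exp(β − δ) / (1 + exp(β − δ)). A minimal sketch, using the person locations quoted above; the function name is illustrative and the item location of 0 is simply the item mean.

```python
import math

def rasch_probability(person_location, item_location):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(person_location - item_location)))

# Person locations quoted in the text (logits).
person_locations = [-2.159, -2.143, 1.733, 2.903]

for beta in person_locations:
    aligned = rasch_probability(beta, beta)    # item at the person's location
    at_mean = rasch_probability(beta, 0.0)     # item at the item mean of zero
    print(f"person {beta:+.3f}: aligned item {aligned:.2f}, item at 0 {at_mean:.2f}")
```

A person aligned with an item always has a probability of 0.5, while the probability rises above 0.5 for easier items and falls below it for harder ones.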

Individual item analysis
The individual items when constructed were initially reviewed by the lecturers. The application of the Rasch model provided empirical output calibrating a relative location and giving the probability that a person located at a certain proficiency location will get the item correct within the instrument.
Item 63 (43. Fraction form of 0.21) at position -0.646 is shown on the category probability curve (Figure 2). Aligned with Item 63 are seven students (each represented by an ×, as shown on Figure 1). From their overall performance on the test as a whole, these students are estimated to have a 50% chance of answering Item 63 correctly. Each of the items shown by the category probability curves can be represented to show the item's unique characteristics in relation to the student cohort as a whole.
In Figure 2, depicting Item 63, the horizontal axis shows the student locations from -5 to +5. The vertical axis indicates the probability of getting a correct response. The item difficulty is calibrated at -0.646 (the dotted line shows the meeting point of the two curves). As stated previously, the seven students located at this point will have a 50% probability of answering the question correctly. Students located above -0.646 will have a greater than 50% probability of answering this question correctly. Students located below -0.646 will have a less than 50% chance of answering this question correctly. The light grey curve indicates the probability, according to the model, of a correct response. Inversely, the solid black curve shows the probability of getting an incorrect answer. Both curves plot either an increased or decreased probability of a correct response from a particular location of both a question item as well as a person responding to that item.
When an item is difficult or easy for the students, the curves show a shift of the meeting point away from the zero position (0) on the x-axis. Two items, Item 58, a relatively difficult item with an item location of about +3 logits (see Figure 3), and Item 39, a relatively easy item, with an item location of about -3, are presented (see Figure 4).
Very few students are to the right of position +3, implying that it was only students located at +3, or higher, that had a greater than 50% probability of answering the item correctly.
Item 39 (Figure 4) had a 50% or greater probability of being answered correctly even by students with relatively low proficiency. All those to the right of location -3 had a greater than 50% chance of providing the correct answer.
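The shift of the meeting point can be reproduced numerically. The sketch below evaluates the correct and incorrect curves of the dichotomous Rasch model on a grid from -5 to +5 and reports where they cross for each of the three items discussed. The locations for Item 58 and Item 39 are the approximate values of +3 and -3 given in the text, and the grid step of 0.1 is an arbitrary choice.

```python
import math

def p_correct(person_location, item_location):
    """Dichotomous Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(person_location - item_location)))

# Item locations from the text (Item 58 and Item 39 are approximate).
items = {"Item 63": -0.646, "Item 58": 3.0, "Item 39": -3.0}

# Person locations on the horizontal axis, from -5 to +5 in steps of 0.1.
grid = [x / 10 for x in range(-50, 51)]

def crossing_point(item_location):
    """Grid location where P(correct) and P(incorrect) are closest to equal."""
    return min(grid, key=lambda b: abs(2 * p_correct(b, item_location) - 1))

for name, delta in items.items():
    print(f"{name}: curves meet near {crossing_point(delta):+.1f} logits")
```

On this grid the curves for each item meet at the grid point nearest to the item's location, which is why a difficult item appears shifted to the right and an easy item to the left.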
In summary, applying the Rasch model to a data set is essentially testing a hypothesis that invariant measurement has been achieved. Where there are anomalies, the researcher is required to investigate the threat to valid measurement. The model enables the researchers to identify the items that did not contribute to the information being sought or those items that were deemed faulty in some respect. Likewise, where students' responses to the question were unexpected the researchers were also alerted. The Rasch model is to some extent premised on the Guttman pattern, which postulates that in addition to some difficult questions, a person of greater proficiency should answer all the items correctly that a person of lower proficiency answers correctly. Likewise, easier items should be answered correctly by low proficiency learners, and also by moderate proficiency and higher proficiency learners. While a strict Guttman pattern is not possible in practice, the principle is a good one (Dunne, Long, Craig, & Venter, 2012).
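One simple way to see how far a response string departs from a strict Guttman pattern is to count, for items ordered from easiest to hardest, the pairs in which the easier item is answered incorrectly while a harder item is answered correctly. The sketch below is illustrative only: the response patterns are hypothetical and this count is not the fit statistic reported by RUMM.

```python
def guttman_errors(responses):
    """Count pairs where an easier item is wrong but a harder item is right.

    `responses` is a list of 0/1 scores ordered from the easiest item
    to the hardest item.
    """
    errors = 0
    for position, easier in enumerate(responses):
        for harder in responses[position + 1:]:
            if easier == 0 and harder == 1:
                errors += 1
    return errors

# Hypothetical response patterns on five items, easiest first.
patterns = {
    "strict Guttman": [1, 1, 1, 0, 0],
    "one slip": [1, 0, 1, 0, 0],
    "reversed": [0, 0, 1, 1, 1],
}

for label, pattern in patterns.items():
    print(f"{label}: {guttman_errors(pattern)} Guttman error(s)")
```

A strict Guttman pattern gives a count of zero; larger counts flag response strings that merit a closer look.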
We briefly report on six students against four questions close enough to their locations to illustrate the relationship of person proficiency to item difficulty as seen through the Guttman pattern model.
The student of low proficiency (A, location -2.159) struggled with the range of items that included the easiest of the items. The other student categorised as of low proficiency (B, location -2.143), offered no response to these particular items. From the person-item map, we would expect students at these locations to have a 50% chance of answering correctly, meaning that if there were 100 students at that location approximately 50 could have answered the items correctly.
Of the two students in the moderate category, one (C, location 0.003) did not attempt the easiest item (location -2.234), giving a missing response, while the other (D, location 0.029) correctly answered this item, which lay far below the student's location. The next two items, which were above the two students' locations, were either not answered or answered incorrectly.
The two students in the high proficiency category are located at 1.733 logits (E) and 2.903 logits (F), more than a logit apart. We therefore deal with them separately. Student E answered the easiest item correctly, as was to be expected; however, the next easiest item was answered incorrectly. In theory the student should have had a greater than 50% probability of a correct response. The difficulty of the third item is aligned with the proficiency of Student E. In theory Student E has a 50% chance of answering Item 57 correctly. Item 58 has a greater difficulty by a large margin. One would expect the student to perhaps get this incorrect. Student F (location 2.903) answered three items correctly but was not able to answer Item 58 (location 2.903) correctly. According to the model the student had a 50% probability of answering this item correctly, as it is located at the same point on the scale. In the case of the most difficult item, Item 58, the requirement was to make decisions on converting the existing form before comparing and sorting the elements in ascending order. The cognitive demand required the students to connect their knowledge and make decisions in the process of working out the solution.

Problematic items
It was noted in the first analysis that there were two items that did not function as expected. These two items were removed from this analysis, although for future testing they may be refined. One multiple choice item was removed due to an error. The second item, Question 8A, was revised as shown below and was reserved for the next cycle. Item 88 (Question 8A) was found to be a misfit as the grammatical representation of the mathematical idea is confusing. The original and possible revised versions are briefly discussed below.

Original question:
Tell if the fraction on the left is less or greater than or equal to the fraction on the right. Use < or > or = for each case to make the statement true.
The responses to Item 8A produced the distribution displayed in Figure 6.
The black dots represent the means of the five class intervals into which the students were divided. The allocation to class intervals is decided by the researcher. The black dots representing students' mean responses did not follow the expected pattern according to the model. The expectation is that students of lower ability will be less likely to answer an item correctly than those of higher ability. The analysis revealed that learners of lower proficiency on the test as a whole (four × marks left of 0 logits) performed relatively better than the students of higher proficiency (one × mark at about 1 logit). This anomaly was investigated, and it was found that the grammar and length of the instructions appeared to have interfered with the understanding of the question. For the next three items in Question 8 the instructions did not seem to mislead the students. When the instructions were revised and reduced to 'Use < or > or = for each case to make the statement true', the whole question seemed clearer.
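The comparison of class-interval means with the model curve can be sketched as follows. Persons are sorted by location, divided into five groups of roughly equal size, and the observed proportion correct in each group is set against the mean probability expected under the dichotomous Rasch model. The responses are simulated from the model for an item at 0 logits, so the data, the seed and the equal-size grouping rule are all assumptions made for illustration.

```python
import math
import random

def p_correct(person_location, item_location):
    """Dichotomous Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(person_location - item_location)))

def class_interval_summary(locations, scores, item_location, n_intervals=5):
    """Mean location, observed and expected proportion correct per interval."""
    ordered = sorted(zip(locations, scores))
    n = len(ordered)
    if n < n_intervals:
        raise ValueError("need at least one person per class interval")
    rows = []
    for k in range(n_intervals):
        group = ordered[k * n // n_intervals:(k + 1) * n // n_intervals]
        mean_location = sum(b for b, _ in group) / len(group)
        observed = sum(s for _, s in group) / len(group)
        expected = sum(p_correct(b, item_location) for b, _ in group) / len(group)
        rows.append((mean_location, observed, expected))
    return rows

# Simulated cohort of 117 persons; responses are generated from the model.
random.seed(2015)
locations = [random.gauss(-0.42, 0.97) for _ in range(117)]
scores = [1 if random.random() < p_correct(b, 0.0) else 0 for b in locations]

rows = class_interval_summary(locations, scores, item_location=0.0)
for mean_location, observed, expected in rows:
    print(f"mean location {mean_location:+.2f}: observed {observed:.2f}, expected {expected:.2f}")
```

For a well-functioning item the observed proportions rise with the expected ones; an item such as 8A would show lower intervals outperforming higher ones.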

Local independence
A further check on the validity of the test required an investigation of local independence. In any test, one expects that each item would contribute some information to the test construct (Andrich & Kreiner, 2010). There may be cases of construct irrelevance, where items do not contribute to the construct, and may be testing other dimensions, or construct underrepresentation, where the construct is not fully represented (Messick, 1989). On the other hand, there may be cases where there is response dependency, where answering a second item correctly is dependent on answering the previous item correctly. Another threat to validity of the construct is where there are too many items targeting one aspect of the construct, for example five items asking for similar knowledge. In such a case the student who knows the concept is unduly advantaged, while a student who does not know the concept is unduly disadvantaged. High residual correlations between items can be resolved by forming a subset, essentially a super-item, where the two items contribute to the score (Andrich & Kreiner, 2010).
In this instrument analysis, we checked the residual correlations of the items and found high correlations, both positively correlated sets of items and negative correlations across some items. The implication of such a threat to local independence is that there are many items contributing the same information, as in a high positive correlation, while those with a negative correlation are 'pulling in the other direction'. A resolution of this threat is to remove the items that seem to test the same thing, or to create subtests of items that are highly correlated, by investigating both the item context and the statistics it conveys.
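A check of this kind can be sketched by computing, for each response, a standardised residual (observed score minus model probability, divided by the model standard deviation) and then correlating the residuals of each pair of items. The data below are hypothetical, with the first two items deliberately given identical responses, and the cut-off of 0.3 is an illustrative threshold rather than a value taken from the study.

```python
import math

def p_correct(beta, delta):
    """Dichotomous Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))

def standardised_residual(score, beta, delta):
    """(observed - expected) / sqrt(model variance) for one response."""
    p = p_correct(beta, delta)
    return (score - p) / math.sqrt(p * (1.0 - p))

def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def residual_correlations(responses, person_locations, item_locations):
    """Residual correlation for every item pair; responses[n][i] is 0 or 1."""
    columns = [
        [standardised_residual(row[i], beta, delta)
         for row, beta in zip(responses, person_locations)]
        for i, delta in enumerate(item_locations)
    ]
    return {
        (i, j): pearson(columns[i], columns[j])
        for i in range(len(columns))
        for j in range(i + 1, len(columns))
    }

# Hypothetical data: 8 persons, 3 items; items 0 and 1 have identical responses.
person_locations = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
item_locations = [0.0, 0.0, 1.0]
responses = [
    [0, 0, 0], [0, 0, 1], [1, 1, 0], [0, 0, 0],
    [1, 1, 0], [1, 1, 1], [1, 1, 0], [1, 1, 1],
]

correlations = residual_correlations(responses, person_locations, item_locations)
CUT_OFF = 0.3  # illustrative threshold, an assumption
flagged = [pair for pair, r in correlations.items() if abs(r) > CUT_OFF]
print(correlations)
print("pairs to inspect:", flagged)
```

Flagged pairs are candidates for removal or for combination into a subtest, subject to a review of what the items actually ask.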
In a second round, eight items were removed due to redundancy. In order to resolve response dependency, 18 subtests were created. These subtests were then checked for ordered or disordered thresholds. For illustrative purposes four sets of items are discussed.
Question 6: Item 6a, 'Draw a representation of fraction 2/5', and Item 6b, 'Explain the meaning of the following fraction: 2/5', were subsumed into a subtest. The subtest was structured in such a way that instead of having two items that were highly correlated, there was one partial credit item, for which the student could obtain a 0 for none correct, a 1 for one of the two questions correct, or a 2 for fully correct.
On investigating the subtest, Question 6 (combined a and b), which required students both to draw a representation and to explain the meaning of 2/5, it was observed that the common response to this partial credit item was either none correct or both correct. The middle category, for which one mark was awarded, was almost redundant. The solution was to re-score the item as a dichotomous item, and the resulting category probability curve showed improved functioning (Figure 7b).
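The two steps described here, forming a subtest and then rescoring it, can be sketched as below. The response vectors are hypothetical, and the rescoring rule (full marks become 1, everything else 0) is an assumption, since the text does not state how the rarely used middle category was collapsed.

```python
from collections import Counter

def make_subtest(item_a, item_b):
    """Combine two dichotomous items into one partial credit score (0, 1 or 2)."""
    return [a + b for a, b in zip(item_a, item_b)]

def rescore_dichotomous(subtest_scores, full_mark=2):
    """Collapse a subtest to 0/1; only full marks score 1 (an assumed rule)."""
    return [1 if score == full_mark else 0 for score in subtest_scores]

# Hypothetical responses of ten students to Item 6a and Item 6b.
item_6a = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
item_6b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]

subtest = make_subtest(item_6a, item_6b)
category_counts = Counter(subtest)
print("category frequencies:", dict(sorted(category_counts.items())))

rescored = rescore_dichotomous(subtest)
print("rescored item:", rescored)
```

A frequency table of the subtest categories is a quick way to spot a middle category that is hardly ever used before inspecting the category probability curves.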
Question 27 required the students to provide the fraction and percentage form for 0.75 as individual responses, but the correct answer depended on whether the student knew how to perform the conversions to both forms of fractions from the decimal form, that is, fraction and percentage form. This was the second set of items observed to be highly correlated and was subsumed into a subtest. For this subtest (see Figure 8) it was found that the category probability curves functioned appropriately. The three categories, 0, 1 and 2, corresponded to both incorrect, one correct and two correct. The group of students of middle proficiency were most likely to obtain 1 mark for being proficient in converting a decimal fraction to either a common fraction or a percentage, whereas the higher proficiency group obtained the full 2 marks, meaning that they were proficient in both conversions of the item.
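The behaviour of the three categories can be illustrated with the partial credit model, in which the probability of scoring in category x is proportional to the exponential of the sum of (β − τ_k) over the first x thresholds. The sketch below uses hypothetical thresholds, not the estimates for Question 27: an ordered pair (-1, 1) and, for contrast, a disordered pair (1, -1).

```python
import math

def pcm_probabilities(person_location, thresholds):
    """Category probabilities under the partial credit model."""
    exponents = [0.0]  # category 0 has an empty sum
    for tau in thresholds:
        exponents.append(exponents[-1] + (person_location - tau))
    weights = [math.exp(e) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def most_likely_category(person_location, thresholds):
    """Index of the category with the highest probability."""
    probabilities = pcm_probabilities(person_location, thresholds)
    return probabilities.index(max(probabilities))

ordered = [-1.0, 1.0]      # hypothetical ordered thresholds
disordered = [1.0, -1.0]   # hypothetical disordered thresholds

for beta in (-3.0, 0.0, 3.0):
    print(f"location {beta:+.1f}: ordered -> category {most_likely_category(beta, ordered)}, "
          f"disordered -> category {most_likely_category(beta, disordered)}")
```

With ordered thresholds each category is the most likely one over some range of proficiency, with the middle category modal for middle proficiency. With disordered thresholds the middle category is never the most likely, which is the kind of pattern that prompts rescoring.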
The next subtest was created by subsuming four items into one set. The four sections of the question asked similar questions, which were to convert from an improper fraction to a mixed fraction. These four items (11A: 16/5; 11B: 18/6; 11C: 19/4; and 11D: 24/5) appear to be testing only one skill because the distribution showed that students either answered all four items correctly or answered none correctly. The resulting category probability curve is shown in Figure 9a. There may be a case here for rescoring as 0, 1 or 2 (see Figure 9b).
The final subtest was made up of four different question items, where the requirement was to order a combination of the fractions-decimals-percentages representations in ascending or descending order (see Figure 10).
Here it appeared that although these items were highly correlated, they increased in complexity. This subtest functioned as expected in that the categories mark increases in proficiency, with a more clearly differentiated distribution of the curves (Figure 11).
As exhibited in the examples above, the investigation of specific subsets, from both a conceptual perspective and a statistical perspective, was conducted in order to ascertain which items could reasonably be subsumed into subtests. The subtests that functioned as expected were retained, but for those whose categories were for some conceptual reason not functioning according to measurement principles, the rescoring of the subtest items was implemented. The process reported in this article works together with a qualitative investigation that was done in the main study, and also formed part of improving the functioning of the instrument (Dunne et al., 2012; Maseko, 2019).
The outcome, after this final analysis, was a test with 50 items, including both dichotomous and polytomous items, 22 of which were multiple choice format and 28 constructed response format. Figure 12 shows a more compact distribution of both the test item difficulty locations and the students' proficiency levels. The easiest item (ST031) is by far the easiest, lying almost 2.5 logits below the next easiest items. There were two items at difficulty levels where no one had a 50%, or greater, chance of answering correctly. Table 3 shows the person mean for the refined test: the mean of -0.4172 in the initial analysis moved closer to zero, to 0.099, implying that resolving some of the test issues improved the targeting of the test to the learners. The item standard deviation in the initial analysis was rather large at 1.6302, but after the refinement of the test it moved closer to 1, at 1.4933. The person standard deviation after refinement was somewhat smaller, implying that the range of proficiency was narrower.

Conclusion and implications
As stated in the introduction, this article forms part of a larger study into student understanding of rational numbers: fractions, decimals and percentages. The purpose of the investigation was to gather information about how the cohort entering the Foundation Phase teacher development programme worked with rational numbers, especially fractions-decimals-percentages. This article reported on how the instrument was functioning to assess their knowledge level of work done at school. The assessment tool covered understanding rational number concepts, manipulating symbols (operations), comparing and sequencing rational numbers, alternate forms of rational number representation, as well as solving mathematical word problems with rational number elements. It is clear that the number of items does not impact the quality of the test. Beyond a certain amount, some of the items might be redundant. One has to check if the test instrument as a whole is fit for purpose. Beyond the total score obtained by each student in the test, the Rasch model indicates a position on a unidimensional scale where the student's proficiency level is differentiated. The power and usefulness of the Rasch model is that it supports the professional judgement of the subject expert in making decisions about the validity of items (Smith & Smith, 2004).
The Rasch model was applied in this study in order to confirm or challenge the theoretical base, to check the validity of the instrument and to quantify the students' cognition of rational number concepts.
The application of the Rasch measurement model enabled checking whether the test content was consistent with the construct under investigation, and supported expectations of a sharper understanding of these students in terms of proficiency level within a set of items in the test. The outcome showed the data to fit the model, the person separation index was high, and the target was appropriate, thereby confirming the theoretical work that supported the design of the test.