Towards quantification of the FCI's validity: the effect of false positives

Michael M. Hull,1 Jun-ichiro Yasuda,2 Masa-aki Taniguchi,3 and Naohiro Mae4
1 Department of Physical Sciences and Mathematics, Wayne State College, 1111 Main St, Wayne, NE, 68787
2 Institute of Arts and Sciences, Yamagata University, Yamagata, Yamagata 990-8560, Japan
3 Center for Teacher Education, Meijo University, Nagoya, Aichi 468-8502, Japan
4 Center for Development of Instructional System of Science and Engineering, Kansai University, Suita, Osaka 564-8680, Japan


I. INTRODUCTION
When it comes to assessing student understanding of Newtonian mechanics, the first survey that comes to the minds of many instructors and education researchers is the Force Concept Inventory (FCI) [1]. The multiple-choice instrument was first published in The Physics Teacher in 1992 and has since been translated into over 25 languages for international use. Perhaps in part because it is so widely used, legitimate concerns and criticisms about the FCI have been expressed. For example, question 16 of the FCI (Q.16) presents students with the situation of a car pushing a truck at constant speed and asks how the force from the truck on the car compares with the force from the car on the truck. The correct answer is "(A) the amount of force with which the car pushes on the truck is equal to that with which the truck pushes back on the car", and the correct explanation is that one force is the reaction to the other and hence, by Newton's Third Law, of equal magnitude. However, some interviewed students chose option A for the wrong reason: since the vehicles are moving at a constant speed, the force from the truck on the car must cancel out the force from the car on the truck [2]. Since the two forces act on different bodies, this is an incorrect application of Newton's First Law, one that is frequently seen among students taking introductory physics. This "false positive" (obtaining the correct answer despite not correctly understanding the content tested by the question) on Q.16 was observed by the FCI creators as well as by other researchers validating the FCI [3][4][5]. False positives are a source of systematic error, inflating student scores on the FCI. Regarding this issue, the FCI authors have acknowledged that "Newtonian choices for non-Newtonian reasons were fairly common. Therefore ... the Inventory score should be regarded as an upper bound on a student's Newtonian understanding" [1]. They do not, however, quantify to what degree the Inventory score (described in this paper as "raw score") is higher than a student's Newtonian understanding ("true score"). Our research aims to quantify that difference.
Much research has explored the validity of the FCI, including its wording and diagrams [1], its distractors [6,7], and whether or not the specific question content influences how students answer [8]. Item response theory (IRT) has been used to quantify the tendency of students to get an individual question correct, as a function of their overall understanding. For example, it was found that students with little understanding still got Q.16 correct [9]. However, IRT has not been used to estimate a student's true score given the raw score; indeed, it might not be suitable for such a task. We build upon this body of research and present here the first published report of a method to estimate a student's true score given the raw score.

A. Subquestions
Previous validation studies of the Japanese version of the FCI showed that, similar to the studies mentioned above, some students answer Q.16 correctly by inappropriately using Newton's First Law instead of the Third Law. Furthermore, it was also observed that some students answered Q.6 and Q.7 correctly with incorrect reasoning [2]. Question 7 involves a ball being swung around in a circle in a horizontal plane. Students must choose the direction in which the ball will travel when the string breaks. The correct answer is (B), which is an arrow drawn tangent to the circle. The correct reasoning involves recognizing that, when the string breaks, there will no longer be a force acting on the ball in the horizontal plane. However, several students who chose the correct answer incorrectly explained that, at the moment of release, the force being exerted on the ball is in the direction of motion, and that this force is what makes the ball continue in a straight path. Question 6 is identical to Q.7, except that the ball leaves a circular track instead of being released by a broken string.
Realizing that the FCI does not allow us to distinguish between students who answer Q.6, Q.7, and Q.16 correctly for correct reasons and those who do so for incorrect reasons or by guessing, we wished to introduce a minimum number of additional questions to the FCI that would allow us to make this distinction. To assess whether students were answering Q.16 correctly with the correct reasoning, we introduced two questions to the survey, one asking what force "balances" the force of the car on the truck (the correct selection being #2, the force of air resistance and friction on the truck) and the other asking what force is the "reaction" to the force of the car on the truck (#1, the force of the truck pushing on the car) [10]. Students who answer Q.16 correctly with the incorrect reasoning detected in interviews should answer these "subquestions" incorrectly. In particular, we would expect them to select #1 for the first subquestion, thinking that the force of the truck pushing on the car balances the force of the car on the truck. Students who answered Q.16 correctly, but answered at least one subquestion incorrectly, were coded as false positives for Q.16. Students who answered Q.16 correctly and got both subquestions correct were coded as true positives for Q.16. A similar process and analysis were conducted for Q.6 and Q.7, using three subquestions instead of two (one asking about the force on the ball, one asking about the ball's acceleration, and one asking about the ball's velocity). Only students who answered all three subquestions correctly were coded as true positives.
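The coding rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not the analysis code used in the study; the function name `code_response` and the label strings are hypothetical.

```python
from typing import Sequence


def code_response(fci_correct: bool, subquestions_correct: Sequence[bool]) -> str:
    """Code one student's answer to one FCI question.

    A correct FCI answer counts as a true positive only if every
    associated subquestion was also answered correctly; otherwise it
    is a false positive. Incorrect FCI answers are labelled "negative"
    (a label assumed here; the text only discusses positives).
    """
    if not fci_correct:
        return "negative"
    if all(subquestions_correct):
        return "true positive"
    return "false positive"


# Q.16 has two subquestions ("balances" and "reaction").
print(code_response(True, [True, True]))    # both subquestions correct
print(code_response(True, [False, True]))   # "balances" subquestion wrong
```

The same function applies to Q.6 and Q.7 by passing three subquestion results instead of two.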
We did not observe false positives for the remaining 27 FCI questions in validation interviews and so we consider them, in contrast to the other 3 questions, to be "valid-like" with comparably small systematic errors. Nevertheless, we considered that false positives could still exist on these questions, and we wished to find a way to estimate the size of their effect as well. Adding subquestions to all remaining FCI questions would make the inventory exceedingly long for students to complete. Thus, as a first step, we selected one question, Q.5, and repeated the process of making subquestions and checking for false positives.
We selected Q.5 to represent the remaining 26 questions because we could create subquestions for it in a straightforward manner and because its guessing parameter is relatively small [9]. Since the guessing parameters of the remaining 26 questions tend to be higher, choosing Q.5 prevents over-estimating the systematic error and thus constitutes a conservative approach.
In Q.5, students must select, from a limited set of options, the combination of forces acting on a ball traveling in a passage. Therefore, in the subquestions, we broke these combinations down into individual forces and asked students to determine whether or not each force was acting on the ball. If a student answered Q.5 correctly by correctly understanding that each of the forces in their selection is acting, they should get the subquestions correct. If, on the other hand, they had guessed randomly, had used process of elimination, etc., then we would expect them to get at least one of the subquestions wrong.
We placed the subquestions in such a way as to avoid giving students hints to the correct answers to the FCI. We put the subquestions for Q.6, Q.7, and Q.16 after the 30 FCI questions. On the other hand, because the Q.5 subquestions did not contain any cues to the correct answer to the FCI question, we put these subquestions before the 30 FCI questions. Students were instructed to answer the questions sequentially and not to return to already-answered questions.
We ensured the clarity of the wording and diagrams of the subquestions by interviewing a few students to confirm that they understood the intent of the questions. The survey instrument used in this study consisted of a total of 42 questions: the 30 original Japanese-language FCI questions [11] and 12 subquestions (4 for Q.5, 3 for Q.6, 3 for Q.7, and 2 for Q.16).

B. Data collection
We surveyed students at the beginning of introductory physics courses at one public university and three private universities in Japan in April 2015. These four universities are mid-ranked universities in Japan. The total number of survey responses was 513. From these, we excluded the responses of students who did not answer some of the questions, who wrote a letter that was not one of the choices available for a given question, or who wrote the same letter or consecutive letters repeatedly. In total, the number of valid responses was 503.

C. Analysis method
As described above, we are interested in estimating a student's true score on the FCI, given their raw score on the assessment. The raw score, $S_{\mathrm{raw}}$, is what the FCI measures directly: it is the number of correct answers on the 30 FCI questions. The true score, $S_{\mathrm{true}}$, in this study is taken to be the number of true positives, i.e., the questions answered correctly and with correct reasoning. For Q.5, Q.6, Q.7, and Q.16, we directly calculate the number of true positives using subquestions, as described above. For the remaining 26 questions, we estimate the true positives with the method described below. First, however, let us represent the true score for the 30 questions ($S_{\mathrm{true}}$, or, written more explicitly, $S^{30}_{\mathrm{true}}$) of an individual student as

$$S^{30}_{\mathrm{true}} = S^{4}_{\mathrm{true}} + S^{26}_{\mathrm{raw}} \cdot r_{\mathrm{tp}} \tag{1}$$

where $S^{4}_{\mathrm{true}}$ is the true score for the 4 questions (Q.5, Q.6, Q.7, Q.16) for which we created the subquestions, and $S^{26}_{\mathrm{raw}}$ is the raw score for the 26 questions for which we did not create the subquestions. $r_{\mathrm{tp}}$ is the true positive ratio for a respondent, which is defined by

$$r_{\mathrm{tp}} = \frac{S^{26}_{\mathrm{true}}}{S^{26}_{\mathrm{raw}}} \tag{2}$$

where $S^{26}_{\mathrm{true}}$ is the true score for the remaining 26 questions, which we are not measuring directly.
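As a worked illustration of Eq. (1), the following minimal Python sketch combines a directly measured four-question true score with an estimated true positive ratio. The function name `true_score_30` and the example numbers are hypothetical and are not taken from the study's data.

```python
def true_score_30(s4_true: int, s26_raw: int, r_tp: float) -> float:
    """Eq. (1): S30_true = S4_true + S26_raw * r_tp.

    s4_true : true positives among Q.5, Q.6, Q.7, Q.16 (0 to 4)
    s26_raw : correct answers among the other 26 questions (0 to 26)
    r_tp    : estimated true positive ratio for those 26 questions (0 to 1)
    """
    if not 0 <= s4_true <= 4:
        raise ValueError("s4_true must be between 0 and 4")
    if not 0 <= s26_raw <= 26:
        raise ValueError("s26_raw must be between 0 and 26")
    if not 0.0 <= r_tp <= 1.0:
        raise ValueError("r_tp must be between 0 and 1")
    return s4_true + s26_raw * r_tp


# Hypothetical student: 2 true positives on the four checked questions,
# 15 correct on the remaining 26, and an assumed ratio of 0.5.
print(true_score_30(2, 15, 0.5))
```

Because $r_{\mathrm{tp}}$ is at most 1, the result can never exceed the raw score, which is consistent with the raw score being an upper bound on the true score.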
Since $r_{\mathrm{tp}}$ includes $S^{26}_{\mathrm{true}}$, we cannot calculate this value directly; rather, we estimate it with the method that we now describe. The 26 questions for which estimation is necessary are the questions for which we have not found respondents answering correctly with clearly erroneous reasoning. Therefore, we modelled the tendency of these 26 questions to induce false positives on that of Q.5. We can think of $r_{\mathrm{tp}}$ of Eq. (2) as representing the likelihood of a respondent to get true positives on the 26 questions. At the same time, we can also think of the likelihood of respondents to get a true positive on Q.5 as being represented by $R^{Q5}_{\mathrm{tp}}$, where $R^{Q5}_{\mathrm{tp}}$ is defined by

$$R^{Q5}_{\mathrm{tp}} = \frac{N_{\text{true positives for Q.5}}}{N_{\text{raw positives for Q.5}}} \tag{3}$$

where $N$ represents the number of respondents. Note that $r_{\mathrm{tp}}$ is for a given respondent whereas $R^{Q5}_{\mathrm{tp}}$ is calculated from a group of respondents.
We assume that within a given group of students who all have the same $S^{26}_{\mathrm{raw}}$, the true positive ratio $r_{\mathrm{tp}}$ of an individual respondent can be approximated by the calculated value of the true positive ratio $\hat{R}^{Q5}_{\mathrm{tp}}$ of that student's group. Namely,

$$r_{\mathrm{tp}}(S^{26}_{\mathrm{raw}}) \approx \hat{R}^{Q5}_{\mathrm{tp}}(S^{26}_{\mathrm{raw}}) \tag{4}$$

In Eq. (4), we show the dependence on $S^{26}_{\mathrm{raw}}$ explicitly, in order to show that we can use the approximation only within each of the 27 groups (0, 1, 2, ..., 26) corresponding to the possible values of $S^{26}_{\mathrm{raw}}$. In other words, the $S^{26}_{\mathrm{raw}}$ of the individual respondent (l.h.s. of Eq. (4)) must be the same as that of the group (r.h.s.). The hat on $R^{Q5}_{\mathrm{tp}}$ denotes that we will be using the predicted value obtained by regression analysis, as described below. The approximation in Eq. (4) is an assumption that requires validation. Future research could attend to this, for example, by introducing subquestions for a random subset of the remaining 26 questions.
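The group-level ratio of Eq. (3) can be computed per value of $S^{26}_{\mathrm{raw}}$ as in the following minimal Python sketch. The record layout, the function name `group_true_positive_ratios`, and the toy data are assumptions made for illustration; the sketch returns the observed ratios only and does not perform the regression that yields the predicted values $\hat{R}^{Q5}_{\mathrm{tp}}$.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# One record per respondent (hypothetical layout):
# (S26_raw, answered Q.5 correctly, answered all Q.5 subquestions correctly)
Record = Tuple[int, bool, bool]


def group_true_positive_ratios(records: Iterable[Record]) -> Dict[int, float]:
    """Observed R^{Q5}_tp (Eq. 3) for each group sharing the same S26_raw."""
    n_raw = defaultdict(int)   # respondents who answered Q.5 correctly
    n_true = defaultdict(int)  # ... and also passed every Q.5 subquestion
    for s26_raw, q5_correct, subquestions_ok in records:
        if q5_correct:
            n_raw[s26_raw] += 1
            if subquestions_ok:
                n_true[s26_raw] += 1
    # Groups in which nobody answered Q.5 correctly have no defined ratio
    # and are omitted.
    return {s: n_true[s] / n_raw[s] for s in sorted(n_raw)}


# Toy data, not from the study.
toy = [(10, True, True), (10, True, False), (10, False, False), (20, True, True)]
print(group_true_positive_ratios(toy))
```

Groups with few Q.5-correct respondents give noisy ratios, which is one reason a weighted fit across groups is preferable to using the observed ratios directly.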

III. RESULTS
In Fig. 1, the percentage of students who obtained a true positive on Q.5, $R^{Q5}_{\mathrm{tp}}$, is plotted as a function of $S^{26}_{\mathrm{raw}}$. Note that at higher values of $S^{26}_{\mathrm{raw}}$, the true positive ratio increases, as we would expect. The curve passing through the data, $\hat{R}^{Q5}_{\mathrm{tp}}(S^{26}_{\mathrm{raw}})$, was obtained by performing weighted logistic regression analysis with SPSS. A Hosmer and Lemeshow test of goodness-of-fit was performed. We found our model's predictions fit the data at an acceptable level, $\chi^2(df = 8, N = 152) = 4.79$, $p > .05$ [12]. The function is

$$\hat{R}^{Q5}_{\mathrm{tp}}(S^{26}_{\mathrm{raw}}) = \frac{1}{1 + \exp\left[-\left(\beta_0 + \beta_1 S^{26}_{\mathrm{raw}}\right)\right]} \tag{5}$$

where $\beta_0 = -1.30 \pm .50$, $p < .05$; and $\beta_1 = .09 \pm .03$, $p < .05$. We chose to use weighted logistic regression analysis because the dependent variable, coded as either a true positive or a false positive, is binary [13]. Substituting this function for $r_{\mathrm{tp}}(S^{26}_{\mathrm{raw}})$ in Eq. (1) allows us to calculate values of $S^{30}_{\mathrm{true}}$ for each of our data points (recall that we are measuring $S^{4}_{\mathrm{true}}$ directly by use of the subquestions designed for those four questions). These values of $S^{30}_{\mathrm{true}}$ are plotted in Fig. 2 as a function of that respondent's $S^{30}_{\mathrm{raw}}$. There are multiple values of $S^{30}_{\mathrm{true}}$ for a given $S^{30}_{\mathrm{raw}}$ because, in part, $S^{4}_{\mathrm{true}}$ for a student can take on values from 0 to 4.

We performed one final regression analysis so as to obtain a function from the data shown in Fig. 2 to allow us to estimate a student's true score given his or her raw score. Using SPSS to fit the data to the function

$$\hat{S}^{30}_{\mathrm{true}} = \frac{S^{30}_{\mathrm{raw}}}{1 + \exp\left[-\left(\beta_0 + \beta_1 S^{30}_{\mathrm{raw}}\right)\right]} \tag{6}$$

we find that $\beta_0 = -1.88 \pm .12$ and $\beta_1 = .097 \pm .006$ for the mean logistic regression coefficients. Notice that the true score tends to be around 50% of the raw score. For example, $\hat{S}^{30}_{\mathrm{true}} \approx 10$ when $S^{30}_{\mathrm{raw}} = 20$.

FIG. 1. $R^{Q5}_{\mathrm{tp}}$ for each group of students with a given $S^{26}_{\mathrm{raw}}$. The trendline is $\hat{R}^{Q5}_{\mathrm{tp}}(S^{26}_{\mathrm{raw}})$. The error bars show the standard error of the dependent variable. The radius of the bubble corresponds to the number of students in each of the 26 groups who answered Q.5 correctly.

FIG. 2. $S^{30}_{\mathrm{true}}$ of each respondent as a function of that respondent's $S^{30}_{\mathrm{raw}}$.

IV. DISCUSSION AND CONCLUSIONS
In this paper, we described a novel approach to addressing invalidity of the FCI and of surveys in general. Rather than discarding the survey or a particular question on the survey (especially when so much data has already been collected worldwide!), the effect of systematic errors caused by false positives and the like can be calculated and corrected for after the fact with the method of subquestions.
However, there are still several issues with this method with which we are wrestling. There remains a need to find a systematic process for creating subquestions. Here, we only analyzed the effect of false positives and estimated the reduction of a true value from a raw value, namely, the negative part of the systematic error. It is necessary to analyze the effect of false negatives as well. Accounting for them could offset the reduction of the raw value to some extent. However, we do not expect that the effect of false negatives fully counters the effect of false positives, as Hestenes et al. wrote that the false negatives are "certainly less than ten percent" [14]. Nevertheless, for a still more accurate measure of a learner's understanding of Newtonian mechanics, subquestions could be created and used in future research to calculate the effect of false negatives.
Despite these limitations, we feel that this method of subquestions is useful for instructors in calculating more accurately the degree to which their students are Newtonian thinkers. Currently our data is restricted to four universities in Japan, and there is a need to determine how similar other populations are for a more accurate calculation of true score. We welcome educators interested in administering our modified FCI to contact the first author.
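For readers who want to evaluate the fitted curves numerically, the following is a minimal Python sketch using the central values of the reported coefficients and ignoring their uncertainties. Both functional forms are assumptions: the Q.5 curve is taken to be the standard two-parameter logistic function, and the true-score estimate is taken to be the raw score multiplied by a logistic function of the raw score, a form chosen because it reproduces the stated example of a true score of about 10 for a raw score of 20. The function names are hypothetical.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def r_q5_tp_hat(s26_raw: float, b0: float = -1.30, b1: float = 0.09) -> float:
    """Predicted Q.5 true positive ratio for a given S26_raw.

    Assumes a plain logistic curve with the reported central
    coefficient values (uncertainties are ignored).
    """
    return logistic(b0 + b1 * s26_raw)


def s30_true_hat(s30_raw: float, b0: float = -1.88, b1: float = 0.097) -> float:
    """Estimated 30-question true score for a given raw score.

    Assumed form: raw score times a logistic function of the raw score.
    """
    return s30_raw * logistic(b0 + b1 * s30_raw)


for raw in (10, 20, 30):
    print(raw, round(s30_true_hat(raw), 1))
```

Under these assumptions the estimated true score is roughly half of the raw score near the middle of the scale, with the fraction rising as the raw score increases.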