Machine learning for automated content analysis: characteristics of training data impact reliability
Natural language processing (NLP) has the capacity to increase the scale and efficiency of content analysis in Physics Education Research. One promise of this approach is the possibility of implementing coding schemes on large data sets taken from diverse contexts. Applying NLP has two main challenges, however. First, a large initial human-coded data set is needed for training, though it is not immediately clear how much training data are needed. Second, if new data are taken from a different context from the training data, automated coding may be impacted in unpredictable ways. In this study, we investigate the conditions necessary to address these two challenges for a survey question that probes students' perspectives on the reliability of physics experimental results. We use neural networks in conjunction with Bag of Words embedding to perform automated coding of student responses for two binary codes, meaning each code is either present (positive) or absent (negative) in a response. We find that i) substantial agreement is consistently achieved for our data when the training set exceeds 600 responses, with 80-100 responses containing each code and ii) it is possible to perform automated coding using training data from a disparate context, but variation in code frequencies (outcome balances) across specific contexts can affect the reliability of coding. We offer suggestions for best practices in automated coding. Other smaller-scale investigations across a diverse range of coding scheme types and data contexts are needed to develop generalized principles.
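As a rough illustration of two ingredients named in the abstract, the sketch below builds Bag of Words count vectors and scores agreement between human and automated binary codes. It rests on assumptions not stated above: that "substantial agreement" refers to Cohen's kappa (commonly read as roughly 0.61 to 0.80), that responses are tokenized by whitespace, and that the function names `bag_of_words` and `cohen_kappa` and the toy labels are hypothetical. The neural network classifier itself is omitted.

```python
import math
from collections import Counter


def bag_of_words(responses):
    """Map each response to a vector of word counts over a shared vocabulary.

    Assumption: lowercase whitespace tokenization; the study's actual
    preprocessing is not described in the abstract.
    """
    tokenized = [r.lower().split() for r in responses]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def cohen_kappa(human, machine):
    """Cohen's kappa for two equal-length lists of binary codes (0 or 1)."""
    if len(human) != len(machine) or not human:
        raise ValueError("need two non-empty lists of equal length")
    n = len(human)
    # Fraction of responses where the two coders agree.
    observed = sum(h == m for h, m in zip(human, machine)) / n
    # Agreement expected by chance, from each coder's positive rate.
    p_human = sum(human) / n
    p_machine = sum(machine) / n
    expected = p_human * p_machine + (1 - p_human) * (1 - p_machine)
    if expected == 1.0:
        return math.nan  # kappa is undefined when both coders use one label
    return (observed - expected) / (1 - expected)


# Toy data (hypothetical): both cases have 90% raw agreement,
# but the code is present in 50% versus 10% of responses.
balanced_human = [1] * 5 + [0] * 5
balanced_machine = [1] * 4 + [0] * 6
rare_human = [1] * 1 + [0] * 9
rare_machine = [0] * 10

kappa_balanced = cohen_kappa(balanced_human, balanced_machine)
kappa_rare = cohen_kappa(rare_human, rare_machine)
```

With these toy labels the balanced case gives a kappa of roughly 0.8 while the rare-code case gives roughly 0, even though raw agreement is identical. This is one way a shift in outcome balance between contexts could change measured reliability.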
Physics Education Research Conference 2022
Part of the PER Conference series. Grand Rapids, MI: July 13-14, 2022. Pages 194-199.
Record Link
Fussell, R, A. Mazrui, and N. Holmes. "Machine learning for automated content analysis: characteristics of training data impact reliability." Paper presented at the Physics Education Research Conference 2022, Grand Rapids, MI, July 13-14, 2022. https://www.compadre.org/portal/items/detail.cfm?ID=16231
AIP Format
R. Fussell, A. Mazrui, and N. Holmes, presented at the Physics Education Research Conference 2022, Grand Rapids, MI, 2022, WWW Document, (https://www.compadre.org/Repository/document/ServeFile.cfm?ID=16231&DocID=5600).
AJP/PRST-PER
R. Fussell, A. Mazrui, and N. Holmes, Machine learning for automated content analysis: characteristics of training data impact reliability, presented at the Physics Education Research Conference 2022, Grand Rapids, MI, 2022, <https://www.compadre.org/Repository/document/ServeFile.cfm?ID=16231&DocID=5600>.
APA Format
Fussell, R., Mazrui, A., & Holmes, N. (2022, July 13-14). Machine learning for automated content analysis: characteristics of training data impact reliability. Paper presented at Physics Education Research Conference 2022, Grand Rapids, MI. Retrieved October 6, 2024, from https://www.compadre.org/Repository/document/ServeFile.cfm?ID=16231&DocID=5600
Chicago Format
Fussell, R, A. Mazrui, and N. Holmes. "Machine learning for automated content analysis: characteristics of training data impact reliability." Paper presented at the Physics Education Research Conference 2022, Grand Rapids, MI, July 13-14, 2022. https://www.compadre.org/Repository/document/ServeFile.cfm?ID=16231&DocID=5600 (accessed 6 October 2024).
MLA Format
Fussell, Rebeckah K., Ali Mazrui, and Natasha G. Holmes. "Machine learning for automated content analysis: characteristics of training data impact reliability." Physics Education Research Conference 2022. Grand Rapids, MI: 2022. 194-199 of PER Conference. 6 Oct. 2024 <https://www.compadre.org/Repository/document/ServeFile.cfm?ID=16231&DocID=5600>.
BibTeX Export Format
@inproceedings{fussell2022machine,
Author = "Rebeckah K. Fussell and Ali Mazrui and Natasha G. Holmes",
Title = {Machine learning for automated content analysis: characteristics of training data impact reliability},
BookTitle = {Physics Education Research Conference 2022},
Pages = {194-199},
Address = {Grand Rapids, MI},
Series = {PER Conference},
Month = {July 13-14},
Year = {2022}
}
Refer Export Format
%A Rebeckah K. Fussell %A Ali Mazrui %A Natasha G. Holmes %T Machine learning for automated content analysis: characteristics of training data impact reliability %S PER Conference %D July 13-14 2022 %P 194-199 %C Grand Rapids, MI %U https://www.compadre.org/Repository/document/ServeFile.cfm?ID=16231&DocID=5600 %O Physics Education Research Conference 2022 %O July 13-14 %O application/pdf
EndNote Export Format
%0 Conference Proceedings %A Fussell, Rebeckah K. %A Mazrui, Ali %A Holmes, Natasha G. %D July 13-14 2022 %T Machine learning for automated content analysis: characteristics of training data impact reliability %B Physics Education Research Conference 2022 %C Grand Rapids, MI %P 194-199 %S PER Conference %8 July 13-14 %U https://www.compadre.org/Repository/document/ServeFile.cfm?ID=16231&DocID=5600