Machine learning for automated content analysis: characteristics of training data impact reliability
written by Rebeckah K. Fussell, Ali Mazrui, and N. G. Holmes
Natural language processing (NLP) has the capacity to increase the scale and efficiency of content analysis in Physics Education Research. One promise of this approach is the possibility of implementing coding schemes on large data sets taken from diverse contexts. Applying NLP has two main challenges, however. First, a large initial human-coded data set is needed for training, though it is not immediately clear how much training data are needed. Second, if new data are taken from a different context from the training data, automated coding may be impacted in unpredictable ways. In this study, we investigate the conditions necessary to address these two challenges for a survey question that probes students' perspectives on the reliability of physics experimental results. We use neural networks in conjunction with Bag of Words embedding to perform automated coding of student responses for two binary codes, meaning each code is either present (positive) or absent (negative) in a response. We find that i) substantial agreement is consistently achieved for our data when the training set exceeds 600 responses, with 80-100 responses containing each code and ii) it is possible to perform automated coding using training data from a disparate context, but variation in code frequencies (outcome balances) across specific contexts can affect the reliability of coding. We offer suggestions for best practices in automated coding. Other smaller-scale investigations across a diverse range of coding scheme types and data contexts are needed to develop generalized principles.
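The two ingredients named in the abstract, a Bag of Words representation and an agreement score between human and automated binary codes, can be illustrated with a minimal standard-library sketch. This is not the authors' pipeline: the neural network classifier is omitted, the function names `bag_of_words` and `cohen_kappa` are hypothetical, and the sketch assumes agreement is measured with Cohen's kappa (values of 0.61 to 0.80 are conventionally labeled "substantial"), which the abstract does not state explicitly. The example codes are synthetic, not data from the study.

```python
from collections import Counter


def bag_of_words(responses):
    """Map each response to a vector of word counts over a shared vocabulary."""
    tokenized = [r.lower().split() for r in responses]
    vocab = sorted({w for toks in tokenized for w in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


def cohen_kappa(human, machine):
    """Cohen's kappa for two equal-length lists of binary codes (0 or 1)."""
    n = len(human)
    observed = sum(h == m for h, m in zip(human, machine)) / n
    p_h = sum(human) / n      # fraction coded positive by the human
    p_m = sum(machine) / n    # fraction coded positive by the machine
    expected = p_h * p_m + (1 - p_h) * (1 - p_m)  # chance agreement
    if expected == 1.0:
        # Kappa is undefined when both coders are constant and identical;
        # this sketch treats that case as perfect agreement (an assumption).
        return 1.0
    return (observed - expected) / (1 - expected)


# Synthetic codes: both cases have 90% raw agreement (10 disagreements in 100).
balanced_human = [1] * 50 + [0] * 50
balanced_machine = [1] * 45 + [0] * 5 + [0] * 45 + [1] * 5

imbalanced_human = [1] * 10 + [0] * 90
imbalanced_machine = [1] * 5 + [0] * 5 + [0] * 85 + [1] * 5

kappa_balanced = cohen_kappa(balanced_human, balanced_machine)
kappa_imbalanced = cohen_kappa(imbalanced_human, imbalanced_machine)

vocab, vectors = bag_of_words(["the data agree", "the data the error"])
```

With these synthetic codes, the same raw agreement gives a kappa of 0.80 when the code appears in half of the responses and roughly 0.44 when it appears in a tenth of them, which is one way a shift in outcome balance between contexts could change measured reliability.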
Physics Education Research Conference 2022
Part of the PER Conference series
Grand Rapids, MI: July 13-14, 2022
Pages 194-199
Subjects:
Education Foundations
- Assessment
= Methods
- Research Design & Methodology
= Data
= Evaluation
= Statistics
= Validity
Education Practices
- Technology
Levels:
- Graduate/Professional
Resource Types:
- Reference Material
= Research study
Intended Users:
- Researchers
Formats:
- application/pdf

Mirror:
https://doi.org/10.1119/perc.2022…
Access Rights:
Free access
License:
This material is released under a Creative Commons Attribution 4.0 license. Further distribution of this work must maintain attribution to the published article's author(s), title, proceedings citation, and DOI.
Rights Holder:
American Association of Physics Teachers
DOI:
10.1119/perc.2022.pr.Fussell
NSF Number:
2000739
Keyword:
PERC 2022
Record Creator:
Metadata instance created September 7, 2022 by Lyle Barbato
Record Updated:
September 14, 2022 by Lyle Barbato
Last Update when Cataloged:
September 15, 2022
AIP Format
R. Fussell, A. Mazrui, and N. Holmes, presented at the Physics Education Research Conference 2022, Grand Rapids, MI, 2022, WWW Document, (https://www.compadre.org/Repository/document/ServeFile.cfm?ID=16231&DocID=5600).
AJP/PRST-PER
R. Fussell, A. Mazrui, and N. Holmes, Machine learning for automated content analysis: characteristics of training data impact reliability, presented at the Physics Education Research Conference 2022, Grand Rapids, MI, 2022, <https://www.compadre.org/Repository/document/ServeFile.cfm?ID=16231&DocID=5600>.
APA Format
Fussell, R., Mazrui, A., & Holmes, N. (2022, July 13-14). Machine learning for automated content analysis: characteristics of training data impact reliability. Paper presented at Physics Education Research Conference 2022, Grand Rapids, MI. Retrieved October 6, 2024, from https://www.compadre.org/Repository/document/ServeFile.cfm?ID=16231&DocID=5600
Chicago Format
Fussell, R, A. Mazrui, and N. Holmes. "Machine learning for automated content analysis: characteristics of training data impact reliability." Paper presented at the Physics Education Research Conference 2022, Grand Rapids, MI, July 13-14, 2022. https://www.compadre.org/Repository/document/ServeFile.cfm?ID=16231&DocID=5600 (accessed 6 October 2024).
MLA Format
Fussell, Rebeckah K., Ali Mazrui, and Natasha G. Holmes. "Machine learning for automated content analysis: characteristics of training data impact reliability." Physics Education Research Conference 2022. Grand Rapids, MI: 2022. 194-199 of PER Conference. 6 Oct. 2024 <https://www.compadre.org/Repository/document/ServeFile.cfm?ID=16231&DocID=5600>.
BibTeX Export Format
@inproceedings{Fussell2022,
  Author = "Rebeckah K. Fussell and Ali Mazrui and Natasha G. Holmes",
  Title = {Machine learning for automated content analysis: characteristics of training data impact reliability},
  BookTitle = {Physics Education Research Conference 2022},
  Pages = {194-199},
  Address = {Grand Rapids, MI},
  Series = {PER Conference},
  Month = {July 13-14},
  Year = {2022}
}
Refer Export Format

%A Rebeckah K. Fussell %A Ali Mazrui %A Natasha G. Holmes %T Machine learning for automated content analysis: characteristics of training data impact reliability %S PER Conference %D July 13-14 2022 %P 194-199 %C Grand Rapids, MI %U https://www.compadre.org/Repository/document/ServeFile.cfm?ID=16231&DocID=5600 %O Physics Education Research Conference 2022 %O July 13-14 %O application/pdf

EndNote Export Format

%0 Conference Proceedings %A Fussell, Rebeckah K. %A Mazrui, Ali %A Holmes, Natasha G. %D July 13-14 2022 %T Machine learning for automated content analysis: characteristics of training data impact reliability %B Physics Education Research Conference 2022 %C Grand Rapids, MI %P 194-199 %S PER Conference %8 July 13-14 %U https://www.compadre.org/Repository/document/ServeFile.cfm?ID=16231&DocID=5600

