Students’ productive strategies when generating graphical representations: An undergraduate laboratory case study

Generating graphical representations is an essential skill for productive student engagement in physics laboratory settings, and is a key component in developing representational competency (RC). As physics lab courses have been reformed to prioritize student engagement in authentic scientific skills and practices, students experience additional freedom to decide what data to include in graphs and what types of graph(s) would allow for appropriate sensemaking towards answering experimental questions. Despite this, there is a dearth of PER literature highlighting the strategies students use while working to generate graphs using their own experimental data. This paper presents a case study analysis of a student group’s lab investigation to call attention to how students enact various productive strategies when working towards generating graphical representations in an introductory physics laboratory course. Results of this case study analysis identify three productive strategies students enact when working to generate graphs in lab settings, each of which is related to aspects of representational competency (RC): 1) identifying (potential) covarying quantities; 2) choosing representative data subsets suitable for representation; and 3) iteratively reducing data and generating graphs to assess each graph’s viability in answering research questions. Our analysis also shows how students frequently refer back to their experimental goals and hypotheses when deciding what strategies to enact to generate graphs.

I. INTRODUCTION
Visually representing scientific data is a central component of scientific inquiry [1,2]. Stakeholders across STEM disciplines describe representing experimental data as an integral component of laboratory experimentation (e.g., Refs. [3][4][5]). Students should gain representational competency (RC) in multiple aspects of experimental data representation, including generating graphs and diagrams, identifying relevant features, and sensemaking with representations [5][6][7][8]. Generating graphical representations, a component of RC, is a scientific practice commonly utilized by professional physicists and is an essential skill associated with "thinking like a physicist." While a significant body of literature in the PER community has historically focused on student interpretation and sensemaking of graphical representations in lecture/studio settings (e.g., Refs. [9][10][11][12][13]), less scholarship focuses on how students generate graphs in laboratory settings using self-collected data [14]. To more effectively guide students in developing skills associated with generating graphical representations, instructors and researchers jointly require additional insight into the productive strategies students enact when working to generate graphical representations in laboratory courses, settings most closely associated with authentic scientific experimentation. In this paper, we ask the following research question: What productive strategies might students enact when working to generate graphical representations of self-collected data in physics laboratory course settings?

A. Generating Representations: A Component of Representational Competency
The ability to generate appropriate graphical representations is one component of representational competency (RC), which is defined as the "ability to appropriately interpret and produce a set of disciplinary-accepted representations of real-world phenomena and link these to formalised scientific concepts" [15]. Summarized from Kozma and Russell (2005), students should be able to generate appropriate representations and effectively describe and use representations for a specific scientific purpose [16].
The ability to appropriately generate graphical representations has been shown to have numerous benefits to student learning of concepts and skills, though the extent of these benefits is still under scrutiny. For example, generating representations has been shown to increase conceptual learning and transfer in mathematics more than simple interaction with pre-generated representations [17]. Likewise, several studies have shown that generating representations within scientific domains leads to more productive mental model formations of the domain, leading to greater scientific inferencing and reasoning (e.g., [18,19]). Conversely, the results of Nitz et al. (2014) suggest a negative gain relationship between students generating representations and building conceptual knowledge [20], which contradicts earlier studies (e.g., [21,22]). There is thus no conclusive understanding of how student-generated representations impact students' conceptual and technical learning in science [23][24][25][26]. Due to this lack of clarity, in this study we treat development of graph generation RC as an individual component of learning to "think like a physicist", distinguishable from learning other RC components or scientific concepts [20].

B. Generating Representations in PER
Historically, the PER community has focused on identifying and understanding how students interpret and sensemake with pre-generated representations (e.g., [9][10][11]). For example, McDermott, Rosenquist, and van Zee (1987) highlighted how undergraduate physics students commonly experience difficulty connecting graphs to physics concepts and to the real world [9]. More recently, relevant PER studies have broadened to focus on how students engage with multiple representations (e.g., [27,28]), how students' use of representations varies in specific learning contexts (e.g., [29,30]), or how students choose and shift between different modes of representations (e.g., [31][32][33][34]). However, few studies in PER have focused explicitly on understanding how students generate graphical representations, either manually (i.e., paper and pencil) or with computer software, even though this scientific practice is paramount to the field of physics. Eshach (2020) used intuitive rules theory [35] to develop a conceptual framework to understand challenges students encounter when generating graphical representations of kinematic phenomena [36]. They showed that students use simple intuitive rules, such as "same A-same B," to identify salient features of existing representations to make new representations for different purposes. Most closely related, Nixon et al. (2016) studied students' abilities to manually generate (by hand with paper and pencil) and interpret graphs during lab instruction [14]. Researchers scored students' hand-drawn graphs from lab activities to assess their quality and interpretation via best-fit lines. Their analysis showed that students in introductory physics lab courses could successfully generate and interpret graphs using best-fit lines, though this often occurred without connection to underlying physics concepts.
Our study moves beyond prior PER studies in several ways. First, to highlight a lesser-studied aspect of students' RC, we investigate students' generation of graphical representations using self-collected data, rather than their interpretations of pre-generated graphs. By situating this study observationally in a laboratory course setting, we aim to better understand students' graph generation RC as it would naturally occur in authentic scientific inquiry. Second, our study occurs in a learning setting where students collect and maintain a large data corpus and use spreadsheet software to organize, manipulate, and represent their data, rather than using manual graphing techniques. Use of computer software for visual representation is a more common representational technique for students and professionals alike.

II. CASE STUDY: SELECTION AND METHODOLOGY
We provide a case study analysis of a student group's activity in a Fall 2019 (in-person) introductory physics for life sciences (IPLS) lab course at a research-intensive university in the western United States. In this course, students are expected to generate research questions and conduct two- or three-week independent investigations with minimal direct instruction from teaching or learning assistants (TAs or LAs, respectively). This case study comes from a larger project investigating the nature of student engagement with experimental data in physics laboratory settings [37]. To identify this case study group, we reviewed previously collected research data, including: 1) observational data from student groups, which included screen capture, video, and audio data; 2) students' submitted pre-investigation design plans, where they outline their plans for conducting their investigations; and 3) students' individual lab reports. The chosen group comprises four students: Pam, Andy, Neesha, and Chloe [38]. All four were non-freshmen students majoring in life or behavioral sciences and intended to enroll in post-graduate health science programs. We chose this group for several reasons. First, the group exhibited consistent verbal discussion related to graphical representations throughout their investigation. Second, students' interactions with TAs/LAs only involved general support and guidance, not direct instruction. Third, a comparison of final lab reports with those of other students showed that the quality of this group's final graphical representations and experimental results was representative of the course population.
We focus on the group's Lab 1 investigation, which involved studying the biological kinematics of five confined zebrafish. The group was provided a video of five zebrafish swimming in a roughly 1 ft² tank; they qualitatively observed that the fish may be swimming faster when closer together. Their experiment focused on testing a hypothesis that confined zebrafish are antisocial; this hypothesis relied on observations of an inverse relationship between fish swimming velocity and fish-to-fish (f2f) distance (the closer two fish are to each other, the faster they will swim). Our analysis used screen-capture data collected from the group's Lab 1 investigation, which had been previously coded for instances when students engaged in various experimental actions, including creating and modifying representations [37]. Subsequent narrative analysis focused on dividing the group's investigation into natural excerpts where students discussed and enacted strategies to generate graphs.

III. RESULTS
Our analysis begins after the group finished collecting data. Using manual tracking software, the group collected x-y position-tracking data for all fish for the length of the video (∼10 s), distance traveled per frame, instantaneous velocities, and various irrelevant data. The group spent roughly 1 hour per week engaging in active experimentation.

Identifying (potential) covarying quantities
Choosing (potential) covarying quantities to represent was the group's first strategy in moving towards generating a graph that would effectively test their hypothesis. The following narrative comes from a group conversation that occurred 35 minutes into Week 1 experimentation (Week 1, 35 min).
After the group finalizes data collection, Neesha shifts the group's attention to determining what they are graphing, including what data they should compare in their graph (Neesha: "What are we graphing? Are we doing the same thing from [the warm-up], or distance versus time or ...?"). The group's discussion quickly revisits their hypothesis' implied quantities (f2f distance and fish swimming velocity): In lines 3 and 4, Pam and Chloe acknowledge that their collected data's distance values are not the f2f distances they need. Neesha and Pam then respond that they need to identify an equation that can convert their x-y position tracked data into f2f distances: Here, the group implicitly agrees they need to determine f2f distance for various fish pairings and corresponding fish swimming velocities. After further discussion, the group calculates f2f distances for their first fish pairing (fish A and B), chosen based on observations that fish A and B were the closest two fish at any point in time.
Overall, this excerpt highlights students' immediate efforts to identify (potential) covarying quantities they would need to test their hypothesis. Immediately after collecting data, students identified appropriate (potential) covarying quantities, even though their raw data did not include these quantities. Enactment of this strategy occurred without prompting from instructional staff, suggesting that students chose to identify these quantities of their own volition. Students were able to backward-plan from their needed covarying quantities to identify initial data to manipulate (e.g., via equations) to obtain the desired quantities.
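The conversion the group was looking for, from tracked x-y positions to f2f distance and swimming speed, can be sketched in a few lines. This is a minimal illustration rather than the group's actual spreadsheet formulas: the position values, the frame interval `dt`, and the function names are hypothetical, and the sketch assumes that f2f distance is the straight-line (Euclidean) distance between two fish in the same video frame.

```python
import math

def f2f_distance(pos_a, pos_b):
    """Straight-line distance between two fish in each frame (assumed definition)."""
    return [math.hypot(xa - xb, ya - yb)
            for (xa, ya), (xb, yb) in zip(pos_a, pos_b)]

def speeds(pos, dt):
    """Speed between consecutive frames: displacement divided by frame interval."""
    return [math.hypot(x1 - x0, y1 - y0) / dt
            for (x0, y0), (x1, y1) in zip(pos, pos[1:])]

# Hypothetical tracks for two fish over three frames (arbitrary length units).
fish_a = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
fish_b = [(3.0, 4.0), (3.0, 4.0), (6.0, 0.0)]
dt = 0.5  # hypothetical time between frames, in seconds

distances = f2f_distance(fish_a, fish_b)  # one value per frame
speed_a = speeds(fish_a, dt)              # one value per frame-to-frame step
```

Note that the speed list is one entry shorter than the position list, so pairing each speed with an f2f distance requires a choice about which frame the speed belongs to.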

Choosing appropriate data samples for representation
After calculating f2f distances for the A-B fish pairing, the group's next strategy was to choose appropriate data samples from their large dataset to include in their graph.
Upon completing their calculations (Week 1, 51 min), the group recognizes that further calculations could result in thirteen unique fish pairings in their analysis. Likely hesitant to engage with what they perceive as a large amount of data (Pam: "I just want to ... start over!"), the group begins discussing which fish pairings would be best to include in their representations. The group consults the TA, who says they can choose a representative sample that shows variation in f2f distances and velocities. The group takes this as permission to reduce their dataset as long as they appropriately justify their decisions. Further discussion ensues, with students negotiating potential strategies for reducing their data to a representative subset to include in their graph. Andy suggests postponing selection of further pairings until they complete calculations and generate graphs for the first pairing (A-B) (Andy: "I think we can do the main ones and see what we get."). Chloe and Neesha propose using extreme cases, fish pairings that are closest and farthest at any point in time:
7 Neesha: If we were just concerned about them being close together and them being far ... does that make sense? Cause our claim [39] was kind of like, if they're closer, they're faster ...
8 Chloe: B and E are the farthest ...
9 Neesha: The farthest and slowest, does that make sense?
Andy counters by proposing they could use a single pairing of interest and a "control" pairing to directly compare against (Andy: "Okay, I think we should do B and C and then a control of either ..."). Pam advocates using the minimal amount of data necessary to test their hypothesis effectively (Pam: "... we could actually just take two fish, we could analyze just two fish, and how their velocities change when they're farther versus when they're closer ...").
Likely recognizing the numerous potential strategies being offered without clear direction, Neesha reintroduces the group's initial hypothesis to reorient the discussion, using this to again argue for her choice of the extreme case fish pairings (Neesha: "So let's go back, so our claim is that if they're closer, they'll move faster, if they're apart, they'll move slower. So, if we just analyze the fishes that are closer together and the fishes that are farthest, then we can compare whatever we find, right?"). The group comes to an agreement on this strategy. Their final strategy was to identify which fish were closest and farthest to the original fish pairing (A-B); this culminated in their inclusion of four fish pairings, representing the fish pairings they observed closest and farthest to fish A and B.
Notable in this excerpt is how the group self-identified and negotiated several different strategies for choosing a representative subset of their large (∼3,300 unique data points and thirteen potential covarying quantity comparisons) dataset that they could reasonably include in their graphical representation. These potential strategies included: 1) choosing two extreme case subsets of data that bookend all other data; 2) choosing the most representative data subset and a control subset with which to compare; and 3) choosing the minimal amount of experimental data necessary to create a graphical representation to test the hypothesis. Again, the group frequently revisited their initial experimental goal throughout discussion and used this to determine a productive strategy, eventually deciding to use a larger representative subset that included multiple extreme cases. Also notable is how all four students advocated for different potential strategies and made a consensus decision based on all potential strategies.

Reducing data and iteratively generating graphs
Beginning their Week 2 investigation time (Week 2, 3 min), the group's next enacted strategy was to further organize and reduce their large dataset to prepare to generate their final graph. To orient readers, the group chose to limit their analysis to only the velocities of each fish at specific points in time (when it was at its maximum and minimum distance from its partner fish), not each fish's velocity throughout the video.
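The data reduction described here, keeping only the velocity of a fish at the moments of maximum and minimum separation from its partner, can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary keys, and the per-frame values are hypothetical, and the group performed the equivalent steps by hand in a spreadsheet.

```python
def speed_at_extremes(distances, speeds):
    """Pick the speed of one fish at its maximum and minimum f2f distance.

    `distances` and `speeds` are per-frame lists of equal length for a
    single fish in a single pairing.
    """
    frames = range(len(distances))
    i_max = max(frames, key=lambda i: distances[i])  # frame of largest separation
    i_min = min(frames, key=lambda i: distances[i])  # frame of smallest separation
    return {
        "d_max": distances[i_max], "v_max": speeds[i_max],
        "d_min": distances[i_min], "v_min": speeds[i_min],
    }

# Hypothetical per-frame values for one fish in one pairing.
d = [4.0, 9.0, 2.5, 6.0]
v = [1.2, 0.4, 2.0, 0.9]

row = speed_at_extremes(d, v)
```

Reducing each pairing to two (distance, velocity) points discards most of the time series, which is the trade-off the group accepted in exchange for a dataset small enough to graph.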
The group begins by identifying maximum and minimum f2f distances and corresponding velocity values in their dataset and copying them to a new data table in Excel. During this, the group again refers back to how they should represent their organized data on a graph to test their hypothesis:
10 Andy: ... do you guys want to figure out how to graph that?
11 Chloe: Yeah, we can put that in one table, so ... so like, uh ... distance, so, first column [in the table] would be fish, and the distance ... between ... oh that's fine. Do we want to do farthest distance on one graph and closest distance on another graph?
12 Pam: I feel like we can do both the same since we're just looking at the relationship between distance and velocity ...
At this point, the group's organized data table includes a column of fish pairings (fish_pair, see [40]) and two columns of their maximum (d_max) and minimum (d_min) distance separations, respectively. Without explicit group agreement on the graphing method, Chloe highlights this data and clicks "Line Chart," creating the graph shown in Figure 1. Chloe recognizes that the graph is not appropriately representing the covarying quantities they identified, since the x-axis is categorically organized by fish_pair, not numerically by distance (Chloe: "Uh ... it's not graphing how I want it. I want these [fish pairings on x-axis] to be here [in the legend]."). Pam reiterates that they are attempting to generate a graph of (f2f) distance and velocity (Pam: "... and then we'd have a chart of distance against velocity."). This prompts the group to recognize that they omitted fish velocities from their data table. The group locates the velocities that correspond to when each fish was closest or farthest from its paired partner and adds these values to their data table as two respective columns (v_1,max and v_1,min). They then create a second version of their line graph incorporating fish_pair, d_max, and v_1,max. Their resultant graph again has fish_pair as the categorical x-axis, with two lines plotting d_max and v_1,max with respect to fish_pair. Chloe again recognizes the error of fish_pair on the x-axis, and the group begins to iteratively generate graphs using trial-and-error (Chloe: "We're kinda getting there. Pressing every button we need!"), choosing different subsets of their data table and different types of representations (e.g., line, bar).
Still without success after several iterations of generating different graphs, they seek guidance from the LA. During discussion, the LA asks what type of graph and data would support their hypothesis (LA: "Now, picture, if we had a graph that supported that, what would it look like?"), then prompts the group to consider using a scatterplot. The students then guide the LA as he roughly sketches their data by hand, with each fish velocity (y-variable) and its associated f2f distance (x-variable) as a point on the scatterplot. The group agrees that a scatterplot would be an appropriate representation but is hesitant because it removes information about the fish pairing relationships. Additional discussion ensues and the group eventually decides the value of the resultant graph outweighs the loss of the fish labels (Pam: "That would, like, I know we wouldn't label the fish, but that might still get us ... somewhere."). After reorganizing their data to have all f2f distances in one column and all corresponding velocities in another column, the group creates a final scatterplot, shown in Figure 2.
This segment highlights how students utilized several strategies to organize their chosen data and create an appropriate graphical representation. Most apparent was their use of "trial-and-error" methods to iteratively organize and select different subsets of data for subsequent generation of graphical representations. Students' initial unsuccessful "trial-and-error" graph generation prompted the transition to a new productive strategy, introduced by an LA, where students discussed and helped sketch a simplified graph that would align with their hypothesis. By sketching what they expected their graph to look like if it supported their hypothesis, the group was able to clarify how to organize their data and utilize the computer software to generate an appropriate graphical representation. This process also prompted students to omit some features of their data (i.e., fish pairing labels) in favor of other features (i.e., scatterplot graph type) that better aided in answering the research question.
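The reorganization that made the scatterplot possible, moving from one row per fish pairing to one (distance, velocity) point per observation, can be sketched as below. The table contents are hypothetical values chosen for illustration, and the column names loosely follow the group's labels; this is not the group's actual data or procedure.

```python
# Hypothetical "wide" table: one row per fish pairing, as a line chart
# would read it (categorical x-axis, one series per column).
wide = [
    {"fish_pair": "A-B", "d_max": 9.0, "v_max": 0.4, "d_min": 2.5, "v_min": 2.0},
    {"fish_pair": "B-E", "d_max": 12.0, "v_max": 0.3, "d_min": 5.0, "v_min": 1.1},
]

def to_points(rows):
    """Flatten each pairing into two (f2f distance, velocity) points.

    The fish_pair label is dropped on purpose, mirroring the trade-off
    the group accepted when moving to a scatterplot.
    """
    points = []
    for r in rows:
        points.append((r["d_max"], r["v_max"]))
        points.append((r["d_min"], r["v_min"]))
    return points

points = to_points(wide)
xs = [d for d, _ in points]  # numeric x-axis: f2f distance
ys = [v for _, v in points]  # numeric y-axis: velocity
```

With the data in this long form, both axes are numeric, so any scatterplot tool (for example, a spreadsheet scatter chart or matplotlib's `pyplot.scatter(xs, ys)`) plots distance against velocity directly instead of against a categorical label.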

IV. DISCUSSION
This study identified three productive strategies students use when generating graphical representations with their collected data: 1) identifying appropriate (potential) covarying quantities; 2) choosing representative data subsets suitable for representation; and 3) iteratively reducing data and generating graphs. Overarching these enacted strategies, the group continually referred back to their hypothesis when determining what strategies would support their representational goals. Through numerous experimental steps to create an appropriate (but not necessarily ideal) graphical representation, the group's productive progression is evidence of students' RC [6,41]. We emphasize that these are not the only productive strategies enacted by students in these contexts, nor are they necessarily the most effective. This work brings up new research questions about whether there are larger connections between the strategies students enact to generate graphical representations and how the representation can foster sensemaking about the represented scientific phenomena.
This study has two further implications. First, it shows how students may utilize productive strategies to generate graphical representations of data from large, complex datasets collected in undergraduate physics lab settings. Productive engagement with large datasets is a new but growing learning goal in introductory physics lab courses; this analysis suggests that students possess some degree of competency in this crucial skill, but still face challenges navigating large datasets in computer software when generating representations. Second, as has been described in prior literature, informal representational drawing may be productive in moving students along in their generation of formal scientific graphical representations. When the group struggled to create an appropriate representation during their iterative graphing, the LA prompted them to draw their hypothesized graph's general trend, allowing them to determine a more appropriate type of representation. Pedagogically, it may be beneficial to prompt students to create informal drawings of their intended graphs, as this may provide a more natural generative space while potentially limiting technological hindrances from computer software.