Published quarterly by the University of Borås, Sweden

Vol. 27 No. 3, September 2022



Fitness for use of data: scientists' heuristics of discovery and reuse behaviour framed by the FAIR Data Principles


Bradley Wade Bishop and Hannah R. Collier


Introduction. Data reuse in the natural sciences is necessary due to several factors: data volume, the one-time and often real-time nature of collection, the associated collection costs, and data's long-term scientific value, with many potential reuses across domains beyond the original purpose.
Method. This study captures heuristics used by natural scientists to inform their discovery and reuse. Using a critical incident technique, forty-three scientists from biology, geology, and oceanography were asked to describe their most recent data discovery resulting in reuse.
Analysis. To frame participants’ behaviour along a sequence of actions from finding to reusing data, the survey used the FAIR Data Principles.
Results. Results show a common lack of understanding of controlled vocabularies and a common determination of fitness for use by noting familiar sources rather than searching through extensive metadata.
Conclusions. Scientists’ perceptions of their discovery and reuse behaviour provide considerations for research and impact practice to inform the design of tools and services to improve data reuse by both humans and machines.

DOI: https://doi.org/10.47989/irpaper942


Introduction

Science data can be transformed and combined with other datasets, leading to unanticipated new discoveries as new methods emerge. This long-term benefit of science datasets is an important reason to enable reuse in the natural sciences (Corti et al., 2020). Science data reuse also gives researchers the chance to scrutinise and duplicate analyses, which increases the transparency and accountability of science (National Academies of Sciences, Engineering, and Medicine, 2019). For example, biologists rely on both systematised species collection and data sharing beyond the original collectors across vast distances and time for the benefit of future generations of scientists (Griesemer and Gerson, 1993). Recently, reuse of public human genome datasets to project the molecular mechanisms that promote progression of coronavirus disease 2019 (COVID-19) underscores the criticality of data sharing in the health sciences (Gardinassi et al., 2020). Data in the natural sciences have long-term value and a multitude of reuse scenarios, including longitudinal studies and data aggregation, often at global scales (e.g., poleward migration). Natural science data preservation and sharing are inherent due to several factors, including that data might only be collected once, in real time, and at great expense (Bishop and Hank, 2018).

In most natural sciences, secondary analyses are so entrenched in the fields of research that the concepts of use and reuse are synonymous. Still, approaches to sharing data vary across domains and nations (e.g., private exchange, Websites, supplemental journal materials, and so forth) (Pasquetto et al., 2017). Reuse involves data that are discoverable, available, and technically compatible with software, code, and other data. The increasing volume, variety, and speed of data creation require scientists to utilize machines' processing power to conduct research. The natural science disciplines in this study advance knowledge through methods that use machine learning and artificial intelligence, which makes machine-actionable data another feature of consideration in the data sharing landscape. When considering machines not only as tools, but also as potential autonomous data reusers, more focus is put on metadata completeness to make machine-actionable data a reality. The FAIR Data Principles present one codification of how to enable machine-actionable reuse of data through a list of fifteen principles in four broad categories: findability (F), accessibility (A), interoperability (I), and reusability (R) (Wilkinson et al., 2016).

Simply viewing data and/or metadata may suffice to assess fitness for use for human reusers. Experienced scientists may address any reuse issues on the fly by cleaning, reformatting, validating, transforming, or other actions that fall under the broad term 'data wrangling' (McKinney, 2013). Natural scientists' perceptions of their discovery and reuse behaviour related to fitness for use provide considerations for the design of tools and services to improve data reuse by both humans and machines. While human data reusers' perceptions and behaviour have been investigated in the past, the purpose of this study is to examine natural scientists' perceptions of discovery and reuse to capture the heuristics used to assess fitness for use, framed by the FAIR Data Principles. These human heuristics, which provide mental shortcuts used by scientists in data discovery, could inform machine learning and data services. The sufficient tactics used by experienced scientists in quickly assessing the fitness for use of data provide search markers that may also be useful assessments to build into machines.
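For illustration, the following minimal sketch expresses the kind of 'data wrangling' actions described above (cleaning, validating, reformatting, and transforming) in Python with the pandas library; the file name and column names are hypothetical and chosen only as an example.

```python
# A minimal data wrangling sketch; file and column names are hypothetical.
import pandas as pd

# Load a hypothetical specimen-occurrence file downloaded from a repository.
df = pd.read_csv("occurrences.csv")

# Cleaning: drop records missing the fields needed for reuse.
df = df.dropna(subset=["species", "latitude", "longitude"])

# Validating: keep only coordinates that fall within plausible bounds.
df = df[df["latitude"].between(-90, 90) & df["longitude"].between(-180, 180)]

# Reformatting and transforming: normalise names and parse dates into one type.
df["species"] = df["species"].str.strip().str.capitalize()
df["eventDate"] = pd.to_datetime(df["eventDate"], errors="coerce")

df.to_csv("occurrences_clean.csv", index=False)
```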

These types of representativeness heuristics help a user judge options by mapping a choice back to the prototypes or stereotypes the user holds in mind (Savolainen, 2017). For example, one representativeness heuristic could be that if a new data repository looks like other data repositories a user has accessed before, then that scientist trusts the source. In the exact sciences, a pervasive representativeness heuristic of the authority of certain organizations and institutions provides inherent credibility to data generated by those entities (Sundar, 2008). Additionally, data services could be tailored to these common practices to inform the design of tools and other assistance that enable reuse. Using a critical incident technique, forty-three scientists completed a survey that began with a prompt asking participants to describe their most recent data discovery resulting in reuse. The development of the survey questions was informed by the findings from twenty-four personal, semi-structured interviews with similar natural scientists (Bishop et al., 2019). These interviews operationalized the FAIR Data Principles from the data consumer's perspective.

Literature review

Reuse begins with data sharing. The difficulties of creating sharable qualitative and quantitative data for reuse have been previously investigated (Carlson and Anderson, 2007). Research has brought attention to the culture changes needed to enable sharing of data across the research enterprise. Past work on scientists' attitudes toward data sharing indicates that the main difficulties fall into one of three categories: 1) willingness to share, 2) locating shared data, and 3) using shared data (Birnholtz and Bietz, 2003). A longitudinal survey study found researchers have more positive outlooks on data sharing and reuse over time (Tenopir et al., 2015). For scientists to embrace the culture of data openness, data sharing should be incentivised, and metadata standards should be created based on knowledge of specific communities' data use requirements (Birnholtz and Bietz, 2003). In addition, scientists should be made aware of the value in "efficacy and efficiency" of reusing trusted datasets, rather than creating their own, to advance science (Curty et al., 2017).

After deciding to reuse data, scientists must determine which datasets they trust to use in their research. A survey of data reusers of the Inter-University Consortium for Political and Social Research (ICPSR) found that the data qualities of completeness, accessibility, ease of operation, and credibility all had significant associations with reusers' satisfaction. A later study of similar data reusers concisely described the three broad categories of information that researchers consider before deciding to trust data for reuse as: 1) data production information, 2) repository information, and 3) data reuse information (Faniel et al., 2016; Faniel et al., 2019). Another data reuse study described how engineers were more likely to reuse data with high levels of documentation about how the data were collected, and indicated that engineers were likely to reuse data that explicitly described any errors or failures with the experiments, which would allow the reuser to distinguish between valid and invalid data (Faniel and Jacobson, 2010). Therefore, more stringent quality control methods and data flagging could be used to increase scientists' confidence in unknown datasets (Vandepitte et al., 2014). A more recent study found the search for and access to data to be reliant on scientists' professional networks (i.e., reaching out to contacts at conferences or requesting data from partners) (Suhr et al., 2020).

In support of the importance of digital repository usability and accessibility, one survey across many disciplines found that the availability of data repositories and the perceived effort to access data had significant positive relationships with scientists' reuse intentions (Kim and Yoon, 2017). With more scholarly data support and trusted repositories, reuse increases. Data reusers from social work and public health inferred 'trustworthiness of data from the characteristics of the responsible parties' (Yoon, 2017, p. 954). This included assessing common communities of practice of researchers, trusted funding organizations, and the reputations of both, prior to even reviewing the actual data. Scientists are also concerned with how datasets have been previously reused to determine whether the data will fit their research needs (Faniel et al., 2019; Rolland and Lee, 2013).

Studies have shown that natural science researchers in particular rely on data that are easily accessible and interoperable (Davis et al., 2014). Other authors found that scientists prefer various levels of processing of the provided data (Gregory et al., 2019). Scientists often base decisions on reuse on confidence gained from word of mouth, rather than relying on metadata (Dow et al., 2015). To make large natural science data more accessible, several have suggested providing tools that allow for easier user discoverability through automated tools and repositories with integrated visualizations (Faniel et al., 2019; Press, 2016). Previous work using interviews and surveys has some limitations, as each study investigated one niche area with reusers from the same disciplines. This research attempts to gather data on heuristics at a broader domain level than earlier work covering only one discipline.

FAIR Data Principles

The Future of Research Communication and e-Scholarship (FORCE11) created the FAIR Data Principles as guidelines describing aspirational attributes that any (meta)data should address to be machine-actionable. Computing advancements allow for the automation of production and analyses. Given that some researchers have reported that cleaning data accounts for 80% of their time, improved data quality ultimately also benefits human reusers (Dasu and Johnson, 2003). Researchers spend enough time on the tasks of manipulating, modelling, and visualising data that new terms like data munging and data wrangling have been created and used (Wickham, 2014). The original paper, since cited more than 7,500 times, provides fifteen principles to enable machine-actionable data reuse (Wilkinson et al., 2016). The combination of the terms 'metadata' and 'data' into '(meta)data' was done because in some natural and life sciences there are no clear distinctions between the two concepts (e.g., specimen lists are both data and metadata). One researcher's data could be another's metadata. Some of the original FAIR authors re-emphasised machine-actionability as the central reason for the principles as numerous misuses and confusions of these data principles spread (Mons et al., 2017). Further, another paper attempted to provide metrics and clarification to address the confusion as others began (mis)using them (Wilkinson et al., 2018).
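As an illustration of what (meta)data oriented toward these principles can look like to a machine, the following minimal sketch describes a dataset using schema.org Dataset terms in Python; the identifier, license, and URLs are hypothetical placeholders rather than a real record.

```python
# A minimal sketch of machine-readable dataset metadata touching each FAIR area.
# Field names follow the schema.org Dataset vocabulary; all values are hypothetical.
import json

record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example sea-surface temperature observations",
    "identifier": "https://doi.org/10.xxxx/example",  # F1: persistent identifier (placeholder DOI)
    "description": "Rich description supporting discovery (F2).",
    "keywords": ["sea surface temperature", "oceanography"],  # I2: terms from a shared vocabulary
    "license": "https://creativecommons.org/licenses/by/4.0/",  # R1.1: explicit usage license
    "distribution": {
        "@type": "DataDownload",
        "encodingFormat": "text/csv",                  # I1: open, widely implemented format
        "contentUrl": "https://example.org/data.csv",  # A1: retrievable by a standard protocol
    },
}

print(json.dumps(record, indent=2))
```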

This brief, discipline-agnostic set of principles is easily read and understood by scientists and data professionals, while retaining the everyday meaning of "fair" for other involved parties and decision makers. This has led to wide adoption across many organizations. The GO FAIR Project helped implement open science policy in the European Union through Horizon 2020 (GO FAIR). In the US, the American Geophysical Union's Enabling FAIR Data Commitment Statement in the Earth, Space, and Environmental Sciences guides that organization's 62,000 members toward FAIR alignment, as does work from the U.S. Geological Survey, that nation's mapping agency, to make its data FAIR (COPDESS, n.d.; U.S. Geological Survey, 2022). Finally, the Australian Research Data Commons developed a FAIR data programme to incorporate training and policies to move their researchers, infrastructure, and processes to be more aligned with FAIR (Australian Research Data Commons, 2020). Despite this success and widespread adoption, not all data reuse considerations fit under the FAIR Data Principles, such as those of the CARE Principles (Collective Benefit, Authority to Control, Responsibility, Ethics). The CARE Principles would ensure scientists engage with Indigenous worldviews and intellectual property when considering appropriate use, as even data collection often reinforces historic power imbalances (Kukutai et al., 2020). Still, the FAIR Data Principles provide this study with a framework to capture natural scientists' heuristics that help assess fitness for use during discovery and reuse. This study's findings could inform machine-actionable design considerations for tools and data services by including fitness for use considerations in meta(data).

Fitness for use

Fitness for use is a concept of how consumers assess whether a product's specifications meet their needs (Juran and Godfrey, 1999). To assess data quality, like the assessment of any product, context matters. One of Juran's original inputs for considering a product's fitness for use is the consumer's determinants (Neely, 2005). The FAIR Data Principles provide a valid list of several qualities necessary for data reuse, but the data principles are written to assess datasets from a more exhaustive perspective, that of data providers, than the shortcuts used by most reusers when assessing fitness for use. Still, FAIR may be used to frame the data quality expectations and decision factors reusers employ to determine fitness for use (Stvilia et al., 2007). Discrepancies between reusers' heuristics and the data principles could inform other considerations for data providers to make data more reusable by both humans and machines.

Fitness for use considerations include scale, cost, and syntactic and semantic heterogeneity (Chrisman, 1984; Veregin, 1999). Some of these specifications relate to FAIR Data Principles. For example, I1 and I2 expect that (meta)data use broadly applicable language or vocabularies. Semantics vary across domains, but using known knowledge representation and ontologies could reduce human and machine mistakes through misinterpretations of terms across fields (Bishop and Hank, 2016). User limitations inform all fitness for use assessments. For a data example, metadata are not always stored with datasets, viewable when reusing data, or used when present. This absence of metadata inhibits assessment of fitness for use for human users. A consumer of any product does not evaluate all specifications to determine fitness for use, especially when there are time constraints or a lack of viable options.

Therefore, fitness for use assessment for data might rely on heuristics related to appearance and brand without considering all potential data factors. Trust in data from communities of practice, trusted funding organizations, and the reputations of either has been shown to be a factor in reuse for social scientists (Yoon, 2017). For a non-data example, when purchasing avocados, a consumer likely selects darker ones to test with a gentle squeeze. For natural science data, no equivalent exists to the grocery-shopping approaches of the avocado squeeze or bread smell test. Since it is not feasible for scientists to individually catch grasshoppers around the globe, gather phytoplankton across the ocean, or romp through the cryosphere, they share data and trust each other to an extent for various reasons. Scientists make do with whatever data are collected because they know no data are perfect, and must assume a trusted resource (e.g., the National Aeronautics and Space Administration) will have the best and likely only data for their purposes.

The literature review demonstrated increasing data sharing and reuse trends, and previous studies show reuse depends on trust factors such as completeness, accessibility, ease of operation, and credibility. The reputations of researchers from common communities of practice, known funding organizations, and trusted repositories also inform reuse. Earlier work shows natural science researchers rely on easily accessible and interoperable data and on awareness of repositories in their reuse. The fitness for use literature is often theoretical, with expert users outlining factors; for those studying actual reusers, the research covers smaller and more specific subdisciplines. Current literature therefore lacks knowledge of the full range of data reuse specifications used to determine fitness for use, which may inform the design of tools and services to improve data findability, accessibility, interoperability, and reusability by both humans and machines.

Theoretical framework

FAIR Data Principles

This study uses the FAIR Data Principles to frame scientists' perceptions of their discovery and reuse behaviour along a sequence of actions from finding to reusing data. Each principle addresses different questions any reuser of data should consider when searching for and deciding to reuse data. Some of the principles lend themselves to dichotomous questions (e.g., either the data have an element or not) and others are more complex. These complications require qualifiers and context to make the data principles more actionable from a reuser's perspective rather than for evaluating data itself. Transposing each of the FAIR Data Principles into a meaningful element required lengthy considerations outlined in a prior methods paper (Bishop and Hank, 2018). This study's survey was informed by in-depth interviews with twenty-two scientists from oceanography, hydrology, and seismology as well as two additional interviews with biologists (Bishop et al., 2019). Table 1 presents the FAIR Data Principles along with questions derived to address each principle from a data consumer's perspective. Each section of questions began with a broad question on how scientists found, accessed, and made data interoperable. The survey questions included qualifiers to clarify data principles that needed clarification during the early interview study. The potential responses are not listed in Table 1, but are discussed at length in the results section. Reuse questions focus on capturing the heuristics of any potential fitness for use considerations because participants were asked to describe a recent search and use of data, with reusability inherent. The reuse issue answer options were derived from the prior interview study's findings.


Table 1: FAIR data principles and questions from data consumer’s perspective.
To be findable: F1. (meta)data are assigned a globally unique and eternally persistent identifier.
F2. data are described with rich metadata.
F3. (meta)data are registered or indexed in a searchable resource.
F4. metadata specify the data identifier.
How did you find the data?
Did the data have a persistent identifier (e.g., DOI)?
Did the data have accompanying metadata? (Metadata are information that describes data, which are found in associated documentation, like a file or within the data file)
Did the metadata help you find the data?
To be accessible: A1. (meta)data are retrievable by their identifier using a standardized communications protocol.
A1.1 the protocol is open, free, and universally implementable.
A1.2 the protocol allows for an authentication and authorization procedure, where necessary.
A2 metadata are accessible, even when the data are no longer available.
How did you access the data?
Were the data in an open format? (An open format is one that is platform independent, machine readable, and made available to the public without restrictions that would impede the re-use of that information).
Were the data free?
Did the data have any use constraints (e.g., limitations of use)?
To be interoperable: I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
I2. (meta)data use vocabularies that follow FAIR principles.
I3. (meta)data include qualified references to other (meta)data.
Were the data in a usable format?
Were the data in a common format to other data used in your research (i.e., common format for your discipline)?
Were the data using any of the following data standards?
Were the data machine-actionable (i.e., could it be processed without humans)?
Were the data available in multiple formats?
To be reusable: R1. meta(data) have a plurality of accurate and relevant attributes.
R1.1.(meta)data are released with a clear and accessible data usage license.
R1.2.(meta)data are associated with their provenance.
R1.3.(meta)data meet domain-relevant community standards.
Were there any issues?
What were some other data quality issues?
What were some of the other reuse issues? (if any)
Did the metadata provide sufficient information for reuse?

Using the FAIR Data Principles to frame scientists’ understanding of their discovery and reuse along a sequence of actions from finding to reusing data, the data quality expectations and decision factors reusers employ to determine fitness for use are captured in this study using a survey. The theoretical concept of fitness for use provides the study a catchall term for the human heuristics used by scientists when making their data reuse assessments.

Methods

The purpose of this study is to capture natural scientists' discovery and reuse heuristics used to assess fitness for use. To address the research question, "What fitness for use heuristics are used by natural scientists to inform their discovery and reuse?" the linear framework of finding, accessing, making interoperable, and reusing of the FAIR Data Principles was used. The critical incident technique is an alternative method to direct observation for understanding human behaviour by giving participants defined situations. The technique is often used in communication, counselling, education, marketing, medicine, nursing, psychology, and social work (Flanagan, 1954; Butterfield et al., 2005). The use of the technique in information seeking behaviour research typically involves giving participants a chance to reflect on and describe recent behaviour (Marcella et al., 2013). Patterns emerge when several participants describe their choices, due to the randomness of each participant's selection of their most recent search. The technique has been used to study faculty's information-seeking behaviour in scholarly communication (Tenopir et al., 2015). More recently, it was used to identify failures within information-seeking behaviour, and to assess the data sharing and data reuse satisfaction of scientists (Wang and Shah, 2016; Shen, 2018).

The critical incident technique approach used in this study uses a questionnaire derived from a prior semi-structured interview study of twenty-two scientists in oceanography, hydrology, and seismology (Bishop et al., 2019). In addition, two more interviews with biologists were conducted to pre-test the questions in that domain and clarify questions based on their perceptions. The interview study used the same theoretical framework of the FAIR Data Principles and fitness for use and a similar natural scientist population. The critical incident started the survey with the prompt, “Think of a recent search for data (or more). The following questions will determine how you discovered and evaluated that data for fitness for use”. The survey began with two open-ended questions asking participants to “briefly describe the data from your recent search: (i.e., species list, .csv)” and “If you can recall, please name the data repository, database, or other resource used to access the data". The survey then had two questions to assess the scientists’ type of institution and research domain, which were followed by the FAIR Data Principles questions described earlier in Table 1. Any closed question response options for this survey were quantified and based on common responses from the prior twenty-four in-depth interviews (Bishop et al., 2019).

The survey software Qualtrics was used to send the questionnaire to organizations across the domains of biology, geology, and oceanography. Upon Institutional Review Board (IRB) approval, participants were recruited in March 2020 through the following: the American Geophysical Union (AGU) Connect Community that reaches 517,000 of its members; ECOLOG-L, a listserv for the Ecological Society of America (ESA) with 24,037 members; NHCOLL-L, which is a collective listserv for both the 600 members of the Society for the Preservation of Natural History Collections (SPNHC) and the fifty-seven member institutions of the Natural Science Collections (NSC) Alliance; and the Society for the Study of Evolution Community Facebook group, which has 5,822 members. These organizations were selected for purposive and convenience sampling due to the domains targeted in this study that are known data reuser communities. One month after the initial recruitment, in April 2020, a reminder was sent to the same organizations. Ultimately, forty-three reusers fully completed the survey. The data are open and available through the Tennessee Research and Creative Exchange (TRACE).

Data analysis

Quantitative analysis was used for most of the survey questions, and counts and percentages are provided where applicable. These quantitative questions related to the FAIR Data Principles and the participants' behaviour while finding and reusing data follow the sequence of findability, accessibility, interoperability, and reusability, with response options derived from the prior interview study. Responses to open-ended questions, such as data type, were transposed and aggregated as counts when similar. For example, the data searches "CSV file of specimen records" and "Occurrences of invasive species in urban habitat" were both counted as specimen record searches. As another example, the atmospheric data searches "satellite-derived currents" and "precipitation grids" and similar searches were combined. Responses to the data repository question seldom named specific repositories, but types could be grouped based on the agency that created the data.

Results

Data, resource, domain, and institutions

The first four questions about the scientist’s data, the resources used to locate the data, their institution type and domain provide an overview of the search and the searcher.

Data

The data described from the most recent search resulted in forty-three responses that specified the data: most responses (17) related to oceanographic and atmospheric data, with one user mentioning data of both types in the same search. Atmospheric data searches related to conductivity, temperature, pressure, climate record data, air quality data, precipitation grids, wind measurements, snowpack, snowfall, and others related to weather or climate. Oceanographic data related to ocean chemistry (3), currents, ecosystem restoration, and depths. Twelve searches related to biological and paleontological data, including specimen records (8), fossils (2), taxonomic information and predation data. Seven recent searches related to geological data, including hydrologic data (2), soils (2), permafrost, geochemistry, and one for drilling records from the only geoscientist to participate.

Two participants referred to generic formats to describe their data, the Network Common Data Form (NetCDF) and Light Detection and Ranging (LiDAR); otherwise, data format was not mentioned. Time series without other details were also mentioned by two participants, which could come from any field and describe a great number of things. Two mentioned COVID-19 cases and deaths as their most recent search, which likely relates to the time of survey distribution. A heliophysicist looked most recently for magnetospheric multiscale particle data.

Repositories

The thirty-eight responses to the resource question reflect some common repositories for these domains, but these reusers often responded without knowing the actual name of their repositories (e.g., 'air quality something') or listed only the funder (e.g., NASA). The numerous acronyms of science, programme name changes, and complex partnerships complicate answering this question and analysing the responses. The US distribution of the survey may have limited the international resources named, with only two UN resources and only four resources from countries other than the US (e.g., the EU Space Agency and the British Geological Survey's PalaeoSaurus).

Several responses simply list multiple agencies without any specificity about resources or repositories, such as this response: 'the US Department of Agriculture's National Agricultural Statistics Service (NASS) and Natural Resources Conservation Service (NRCS), National Oceanic and Atmospheric Administration (NOAA), and NASA'. NASA alone has fifty-nine data centers, but only two of the five responses mentioning NASA named specific data centers: the Solar Data Analysis Center and the Space Physics Data Facility. To illustrate further how the funding organization alone does not indicate potential resources, the National Snow and Ice Data Center is funded by NASA, the National Science Foundation (NSF), and NOAA. Name changes also complicate a user's understanding of what they are accessing. For example, two NOAA resources, the National Climatic Data Center (NCDC) and the World Ocean Database, are both now aggregated into a broader data portal.

The identifiable biology repository names include the Global Biodiversity Information Facility (GBIF) (4), Integrated Digitized Biocollections (iDigBio) (2), and one mention each of the American Museum of Natural History (AMNH), California Phenology Thematic Collections Network, Consortium of California Herbaria, Denver Museum of Natural History (DMNH), Florida Museum of Natural History (FMNH), Marine and Atmospheric Research Laboratories Information Network (MarLIN), Ocean Biodiversity Information System (OBIS), University of California Museum of Paleontology (UCMP), VertNet, and World Register of Marine Species (WoRMS). This long tail of resources is a result of reusers of biodata consulting multiple resources to aggregate data manually, especially to gather comprehensive specimen lists. For example, one user's response included AMNH, DMNH, FMNH, and UCMP. Making data machine-actionable across these many resources would benefit from a better understanding of these biologists' behaviour during their multi-faceted discovery and reuse workflows to consult and gather across repositories.

Some other geology resources mentioned were the NRCS's Web Soil Survey, NSF's Incorporated Research Institutions for Seismology (IRIS), and the Consortium of Universities for the Advancement of Hydrologic Science Inc. (CUAHSI) HydroShare. Of note, two of the resources mentioned provide modeling data: the Coupled Model Intercomparison Project (CMIP) and the National Center for Atmospheric Research's Community Earth System Model. Only three responses referred to data resources from state and local governments. Finally, two responses described going directly to published articles to extract data from figures, two participants obtained the data by contacting the scientist directly, and one mentioned going to a library and searching library databases. With a great deal of legacy data entombed in static journal articles, machine learning might benefit from knowing how these scientists manually extract data in order to automate those processes.

Institution type and research domain

Of the forty-three participants, most identified their type of institution as academic (28; 65%), with one participant selecting both academic and government. The remaining participants identified their institution type as government (7; 16%) or non-profit (4; 9%); two others responded as self-employed and a government contractor (2; ~4%); one selected private and one left the question blank.

Participants from a range of scientific domains completed the survey. The domains reflect the recruitment approach, with American Geophysical Union sections listed as well as biology. Over half the participants came from the two largest domains of biology (12; 28%) and ocean sciences (11; 26%). The category of 'Other' also comprised a large group, with fourteen participants. Of the 'Other' domains mentioned, five were palaeontology and the remainder were from various sciences that might have related to other broader categories listed in the question (e.g., astrophysics, environmental science, conservation studies, noosphere, and so forth). The remaining participants selected hydrology (3) and one each selected atmospheric sciences, planetary sciences, and seismology.

(F)AIR: Findability

The four principles on findability focus on discovery and having comprehensive metadata and uniquely identifiable data. This section of the survey asked how the participants found data and if and how they used the corresponding metadata.

For the question, "How did you find the data?" participants were asked to check all that apply, so some selected more than one answer. There were sixty-two total responses from forty-three participants. Table 2 presents the counts and percentages. The most selected approach to finding data was going to a trusted source, as indicated by those who named a data repository in the open-ended question. Perhaps those who know the name of a trusted source do navigate there through a search engine, but only two checked both. 'Other' responses included through a literature review of related articles and Websites (4), being told to use particular data they were given (3), and collecting the data themselves (2).


Table 2. How participants found data.
Answer Count Percentage
By going straight to a trusted source 28 45%
By using a search engine 12 19%
By going straight to a government source 12 19%
Other 10 16%

Persistent identifier

Fifteen (35%) of the forty-three participants who responded to the question about persistent and unique identifiers indicated that the data did have unique identifiers. Twelve (28%) of participants said that they did not, while sixteen (37%) indicated that they were not sure. This particular principle is vital because, even though human users can reuse data without unique identifiers or without knowing whether data have them, machine reuse will require them. These responses at least show natural scientists are not certain of this data principle in the data they are reusing.

Accompanying metadata and if metadata helped

Of the forty-three responders, the majority, thirty-five (81%), said that the data they found did have accompanying metadata, while seven participants (16%) said the data did not have metadata, and one participant (2%) said they were not sure. Forty-one participants responded to the question of whether the metadata helped them find the data. The answers were divided: seventeen (40%) said yes while eighteen (42%) said no. Another seven participants (15%) were not sure and one response was blank. Metadata helps both machines and humans discover data and determine its fitness for use. Reuse may occur without acknowledgment or awareness of how metadata assists in these processes.

F(A)IR: Accessibility

The five questions asked to address the four principles related to accessibility. The first asked how each participant accessed data, with instructions to check all answers that applied. Table 3 presents the count and percentage of how many times each answer was selected.


Table 3: How participants accessed data.
Answer Count Percentage
By manually downloading 35 60%
By calling up a machine-actionable Web service for automatic download 11 19%
By contacting the data creators/authors 6 10%
Other 6 10%

Most of the participants responded that they accessed at least some data by manually downloading it, with only eleven indicating an automated process. The machine-actionable Web service response is the only FAIR-aligned option, but scientists access data in many non-machine-actionable ways. Many of the data resources might require upgrades to enable machine-actionable Web services. Of the six that listed 'Other', only one scientist's response did not map to the other options: a search for data in an analog format. Examples of 'Other' responses that did map to other options included downloading, server connection, and others that were coded as manually downloading.
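To contrast the two access routes, the following minimal sketch shows a scripted request to a data service that performs the same step a human carries out by clicking a download link; the endpoint and query parameters are hypothetical placeholders, not any real repository's API.

```python
# A minimal sketch of scripted data access; the URL and parameters are hypothetical.
import requests

url = "https://data.example.org/api/v1/observations"  # hypothetical service endpoint
params = {"variable": "sea_surface_temperature", "start": "2020-01-01", "end": "2020-01-31"}

response = requests.get(url, params=params, timeout=30)
response.raise_for_status()  # fail loudly rather than silently reusing a bad response

# Save the payload locally, the same result as a manual download.
with open("observations.csv", "wb") as f:
    f.write(response.content)
```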

Open format and free data

Forty-three participants responded to the question asking if the data was in an open format. Thirty-seven (86%) participants said the data were in an open format, while four (9%) said the data were not in an open format, and only two (5%) were not sure. Of the forty-three participants who responded to the question, forty-one (95%) said they reused the data for free, and only two (5%) participants said the data were not free. In these natural science disciplines, data are commonly open and free. Some of these formats, however, may or may not be machine-actionable without human intervention.

Use constraints

Forty-one participants responded to the question which asked if the data had any use constraints. Thirty-one (75.6%) participants said no to any use constraints, but seven (17%) participants responded that they did have some, and three (7.3%) participants were not sure. The scientists that indicated use constraints included four of the five palaeontologists, two from ocean sciences, and a hydrologist and a biologist. Palaeontologists' licensing awareness may be heightened compared to other sciences not due to a fear of cloning dinosaurs, but due to a concern over lack of attribution for data and the potential commercial uses of 3D models of fossils (Mounce, 2014). There is confusion over how reused data should be licensed. It is well known that data generators do not always receive their deserved credit in research conducted from reused data, and there has been discussion on how this accreditation process should occur (Hrynaszkiewicz and Cockerill, 2012; Pierce et al., 2019). Machines may be programmed to assume no use constraints, but ideally will expect licenses. To be machine-actionable, a community default of open will require a license, such as those of Creative Commons (https://creativecommons.org/), stating the fact explicitly, as computers compute and humans assume.
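The following minimal sketch illustrates that point: a machine reuser, unlike a human, cannot safely assume that an unstated license means the data are open, so the explicit statement matters. The metadata records and the set of accepted licenses are hypothetical.

```python
# A minimal sketch of a machine reuser's license check; records and accepted licenses are hypothetical.
ACCEPTED_LICENSES = {
    "https://creativecommons.org/licenses/by/4.0/",
    "https://creativecommons.org/publicdomain/zero/1.0/",
}

def may_reuse(metadata: dict) -> bool:
    """Return True only when the metadata states a recognised open license."""
    return metadata.get("license") in ACCEPTED_LICENSES

# No license stated: a human might assume open, a machine should not.
print(may_reuse({"name": "Example dataset"}))  # False
# License stated explicitly: reuse can proceed automatically.
print(may_reuse({"name": "Example dataset",
                 "license": "https://creativecommons.org/licenses/by/4.0/"}))  # True
```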

Metadata accessibility

Forty-three participants addressed the question on metadata accessibility. Thirty-one (72%) participants said that the metadata were accessible, six (14%) said they were not, and six (14%) said they were not sure. Metadata might have been accessible in most instances. Humans have the training and experience to view and understand data even when metadata are not present, especially in domains where the concepts of data and metadata are nearly indistinguishable. Accessible metadata are important for machine-actionable data because machines will not make sense of data without these details. The skills scientists have to assess fitness for use by simply viewing data, its structure, and some common annotation may be exactly the human heuristics, captured through additional qualitative work in each discipline, that would be essential for machine learning in these domains.

FA(I)R: Interoperability

Common format

Thirty-six (84%) participants said the data were in a format that was common to their research. Science data have standardized practices for format and encoding in the same way science approaches measurement and analyses. Five participants (12%) said the format was not common, and two (5%) said they were not sure. When data aggregation is significant for analyses, using common formats streamlines the work, but interdisciplinary research increases the number of data types a scientist must work with and aggregate.

Data standards

Participants were asked to select all the applicable data standards (if any) used by the data. The response options reflect common responses during the prior interview study, with an attempt to remain as domain-agnostic as possible in the naming of common data standards. Seventeen participants left this question blank, but those who did respond often checked several options. Table 4 shows the response counts and percentages. Seven selected all options except ontologies. Another seven selected both discipline-specific metadata standard and controlled vocabularies.


Table 4: Data standards.
Answer Count Percentage
Discipline-specific metadata standard 19 41%
Controlled vocabularies 17 37%
Data dictionaries 6 13%
Ontologies 4 9%

This question is difficult to ask across a wide variety of natural scientists. The open-ended option allows for more details to be provided by participants, but the smattering of responses might indicate most scientists do not know their data standards. In fact, only nine participants opted to name a data standard, and those included: Darwin Core (3); NetCDF (3); one mention each of XML, the Flexible Image Transport System (FITS), and the Standard for the Exchange of Earthquake Data (SEED); and two items that are not data standards (i.e., NASA's Common Metadata Repository and the Coupled Model Intercomparison Project (CMIP)). A disproportionate number of these data standards came from three participants who identified as a seismologist, a space scientist, and a heliophysicist.
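As an illustration of applying a discipline-specific standard such as Darwin Core, the following minimal sketch renames hypothetical local column headings to Darwin Core terms so that other tools, and machines, can interpret the fields consistently; the local names and record values are invented for the example.

```python
# A minimal sketch of mapping local column names to Darwin Core terms; local names are hypothetical.
import pandas as pd

local_to_dwc = {
    "species": "scientificName",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "date_collected": "eventDate",
}

df = pd.DataFrame(
    [{"species": "Melanoplus differentialis", "lat": 35.96, "lon": -83.92, "date_collected": "2019-07-14"}]
)
df = df.rename(columns=local_to_dwc)

print(df.columns.tolist())  # ['scientificName', 'decimalLatitude', 'decimalLongitude', 'eventDate']
```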

Machine-actionable data

Participants were fairly evenly dispersed in their answers to whether the data were machine-actionable. Machine-actionable was a problematic question in the prior interview study, with several scientists not understanding the concept. Machine-readable and machine-actionable are not synonymous concepts, but several scientists conflate the two. It may be early in some natural science disciplines to imagine machines as both reusers and autonomous knowledge creators. Of the forty-one that responded to this question, sixteen (39%) participants said yes, ten (24%) said no, and fifteen (37%) said maybe.

Multiple formats

Responses were also split in answering the question of whether the data were available in multiple formats. Thirteen (31%) said yes, fifteen (36%) said no, and fourteen (33%) said they were not sure. Human reusers may overcome non-machine-actionable data by manually converting and transforming data on their own. Only one format is needed, so the availability of data in multiple formats may not be a concern. Some of these interoperability issues will only impact machine reusers, and future work should not rely on scientists' heuristics for these last two data principles, as they are not reliably answered.

FAI(R): Reusability

While writing the FAIR Data Principles, one version had findability, accessibility, and interoperability equating to reusability (i.e., F+A+I=R), but the authors went with the cleaner formulation as data might have reuse issues beyond the first three areas (L. O. Bonino da Silva Santos, personal communication, November 6, 2018).

Of the forty-one participants who responded to the reusability question, twenty-six (63%) indicated that there were no issues that impacted the reuse of the data. Fifteen (37%) participants responded that there were some reuse issues. To indicate what these issues were, the participants were asked to select all the reuse issues from a list of answers. Table 5 depicts the number of times the participants selected a certain answer as well as percentages. The answers indicate a range of common reuse issues identified in the prior interview study, including an 'other' option for any issues not included in this selection. Six participants wrote explanations regarding the other issues, including legal constraints, typographic errors, poor metadata, and continuing update requirements.


Table 5: Reuse issues.
Answer Count Percentages
Version control 7 21%
Spatial accuracy 7 21%
Spatial precision 6 18%
Geographic scale 4 12%
Other 9 27%

Other data quality and reuse issues

Given another opportunity to expand on data quality issues, thirteen participants wrote some of the following: missing data, dead links, misspellings, ambiguous variables, transcription errors, a mix of formats, a missing minus sign on latitude coordinates, and data outside of the stated spatial extent. Finally, scientists were asked to list any other reuse issues and six provided the following additional complaints: problems with spatio-temporal interpolation, understanding how geological data were assigned, converting to higher taxonomy updates over time, legal constraints on the locality data, and, again, missing variables and typographic errors, which prevented automation. A human may work through these data quality and other reuse issues, but machines cannot without learning how to recognise and address missing elements and inaccurate labelling. These scientists' heuristics require further study as valuable components to enable machine-actionable data.
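The following minimal sketch illustrates how a few of the reported quality issues, such as missing values, coordinates outside a stated extent, and a dropped minus sign on latitude, could be flagged automatically rather than caught by an experienced eye; the records, column names, and extent values are hypothetical.

```python
# A minimal sketch of automated data quality checks; all records and bounds are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "scientificName": ["Melanoplus differentialis", None, "Melanoplus differentialis"],
    "decimalLatitude": [-35.3, -36.1, 35.3],   # third row: minus sign likely dropped
    "decimalLongitude": [149.1, 150.2, 149.1],
})

# Spatial extent stated in the (hypothetical) dataset documentation.
extent = {"lat_min": -45.0, "lat_max": -10.0, "lon_min": 110.0, "lon_max": 155.0}

problems = pd.DataFrame({
    "missing_value": df.isna().any(axis=1),
    "outside_extent": ~df["decimalLatitude"].between(extent["lat_min"], extent["lat_max"])
                      | ~df["decimalLongitude"].between(extent["lon_min"], extent["lon_max"]),
    # Positive latitude whose negation falls inside the stated extent: a likely dropped minus sign.
    "suspect_sign": (df["decimalLatitude"] > 0)
                    & (-df["decimalLatitude"]).between(extent["lat_min"], extent["lat_max"]),
})

# Rows a human would catch by eye; a machine must flag them explicitly.
print(df[problems.any(axis=1)])
```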

Sufficient information for reuse

Only thirty-nine participants responded to the question asking if the metadata provided sufficient information for reuse. Of these, thirty-one (80%) participants said they did, and eight (20%) participants said they did not. Other feedback on the study highlights some discussion points already mentioned. For example, a participant stated “reliable but sometimes too much metadata” concerning their search. A machine might not have this same concern with the volume of detail. Another participant suggested that “metadata could advise about intended objective measures of data quality” and this type of metadata consideration would impact reuse by setting parameters for fitness of use of that data. Some issues are known such as error propagation through aggregation from multiple data providers. One participant indicated that there are no alternatives available for the type of data they were reusing and another expressed concern over database currency. Expert training and experience of scientists allows for proper data reuse. Studying these heuristics humans use, and considering them when building machine learning, may avoid improper data reuse.

Discussion

This study expanded upon the past use of the FAIR Data Principles as a theoretical framework to describe natural scientists’ perceptions of data discovery and reuse from oceanography, hydrology, and seismology to include biology and atmospheric sciences. The responses to the critical incident technique survey provide further insight into scientists’ perceptions on their discovery and reuse behaviour. The questions derived from the FAIR Data Principles inherited the broad applicability and field agnostic traits of the data principles. The discussion will include both the practical and theoretical implications of the results. The FAIR Data Principles were initially proposed to evaluate the machine-actionability of datasets, but this framework lends itself to other uses. Several data principles do not map to human reuser heuristics, but this study provides lessons learned and suggests how to mitigate them in future work. The following discusses some practical implications for research communities.

Practical implications for research communities

The skills scientists have to assess fitness for use by simply viewing data, its structure, and some common annotation may be exactly the human heuristics captured through additional qualitative work in each discipline that would be essential for machine learning in these domains. The practical implications for research communities from this study relate most to what was learned about scientists' search and reuse behaviour. Human reusers determine fitness for use without knowing the resource's specific name and may not consciously use any type of data principles in their assessments. Relying on trusted brands and past experience to ensure data quality verifies past research on data reusers' focus on trust (Faniel et al., 2016; Yoon, 2017). This also confirms previous findings highlighting a lack of data information literacy training among natural scientists, particularly those outside of academia, and an inattention to metadata within the natural sciences (Davis et al., 2014; Gregory et al., 2019). Participants in this study may have assumed aspects of data quality without alternative data options for types of data that are rare and expensive to collect. In practice, data providers may need to create new value-added aspects of available data to make data more machine-actionable. Most participants worked in academic or government settings, and these likely reusers require further study to improve data services.

Future work should test reusers' assumptions that data have unique identifiers, a license, and controlled vocabularies, and are open and free, by validating this information. This work could lead to recommendations for data repositories to make their data FAIR-aligned. In the likely event that some scientists do not consult metadata before reuse, based on past studies, new fitness for use parameters that inform machine reuse would also help humans assess what data may or may not be useful for their purposes (Bishop et al., 2019). These types of features and functionality would save time assessing data that would not be fit for particular purposes. One other practical implication from these results is that most participants responded that they accessed at least some data by manually downloading it, with only eleven indicating an automated process. The machine-actionable Web service response is the only FAIR-aligned option, but scientists access data in many non-machine-actionable ways. The manual work of scientists to transform and convert data could be automated if the perceived behaviour is collected across domains and subdisciplines. Several natural science repositories still require upgrades to enable machine-actionable Web services.

The results indicate confusion over how reused data should be licensed. It is well known that data generators do not always receive their deserved credit in research conducted from reused data, and there has been discussion on how this accreditation process should occur (Hrynaszkiewicz and Cockerill, 2012; Pierce et al., 2019). Machines may be programmed to assume no use constraints, but data should ideally be licensed so that researchers receive credit for their intellectual contributions.

When addressing the FAIR Data Principles, data providers might show too much albeit relevant metadata. Although this benefits all reusers in data discovery, providers might think about a move toward displaying less metadata if most human reusers get confused by “too much metadata”. Some data might be fair for humans without complaint as they are equipped with knowledge and experience to see what may not be explicit in the meta(data). A machine might not have this same concern with the volume of detail. Another participant suggested that “metadata could advise about intended objective measures of data quality” and this type of metadata consideration would impact reuse by setting parameters for fitness of use of that data. Some issues are known such as error propagation through aggregation from multiple data providers. One participant indicated that there are no alternatives available for the type of data they were reusing and another expressed concern over database currency. Expert training and experience of scientists allows for proper data reuse. Studying these heuristics humans use and considering them when building machine learning may avoid improper data reuse. Social norms of natural sciences may change to increase the importance of data quality and its attributes as machine-actionable data become necessary for analyses via artificial intelligence. Training and education related to research data management across all sciences could address this gap.

Future work should attempt to clarify scientists' understanding across disciplines in order to know how to ask them about data standards, how they differentiate between data formats and standards, and their awareness of both concepts. It might be that data standards, like secondary analysis itself, are so ingrained in most natural sciences that data reuse occurs without consciousness of the data standards. This is a fitness for use aspect that is more vital to machine reusers because, unlike humans, machines will need to know how data are encoded to analyse them. A human may work through these data quality and other reuse issues, but machines cannot without learning how to recognise and address missing elements and inaccurate labelling. These scientists' heuristics require further study as valuable components to enable machine-actionable data.

Theoretical implications for the FAIR Data Principles

This research has some important theoretical implications for the FAIR Data Principles. Each subsequent use of the FAIR Data Principles alters the original purpose, given situational factors and regional interpretations. More actionable findings for both dataset evaluations and studies of human reusers would benefit from some refinement of the principles with discipline-specific data quality elements that make data findable, accessible, interoperable, and reusable in each potential reuse. For example, many organizations and domains have (meta)data standards of all types (e.g., Dublin Core) with wide adoption. These standards might need to be revisited in the era of FAIR since data availability for humans is different from the machine-actionable research world of today. This study demonstrates some areas that lack specificity in the data principles. For example, Interoperability principle I2, (meta)data use vocabularies that follow FAIR principles, needs disciplines to approve data standards in order to assess how FAIR-aligned data may be. This study's results had only nine participants name a data standard, which is concerning given that all science data need to be operationalised (i.e., consistently categorised into measurement and format) to be valid, reliable, and reusable. A more useful operationalised FAIR framework for each discipline would list standards to ensure movement toward more FAIR data.

This study's results also show natural scientists are not certain of the data principles. Additional education and training related to metadata was mentioned as a practical implication, but the developers and implementors of the FAIR Data Principles have key takeaways from this study as well. The metadata elements recommended in FAIR help make data machine-actionable, but when the framework is used to assess human users' fitness for use behaviour, FAIR demonstrates that reuse may occur without acknowledgment or awareness by any reuser of how metadata assist in these processes. This underscores a gap between the data providers' ideals to ensure findable, accessible, interoperable, and reusable data and the minimal data considerations needed to actually reuse, and potentially misuse, data. Future work should explore further the metadata awareness of scientists and the potential impacts on reuse.

The method in this study might work for most data-intensive sciences to further explore fitness for use considerations. Future research in this area should target specific disciplines where common data repositories are known, to best address machine-actionable limitations arising from human reusers' concerns while seeking and reusing data. With additional question refinement and targeted recruitment, the survey could yield generalizable results and support statistical analysis that was not appropriate for this study. Learning how fitness for use is determined by scientists could inform machine learning tools and data services. Design of data repositories and services that focus on the most vital data quality heuristics used by scientists in discovery and reuse could improve the usability of, and capable reusers' satisfaction with, these services and tools.

Study limitations

As mentioned, the FAIR Data Principles relate to the most critical elements within (meta)data itself and not to how reusers discover or evaluate data. This repurposing of the principles to describe the fitness for use behaviour of reusers works well for several sequential steps necessary for human reuse. The data principles were created foremost to enable machine-actionable data (i.e., data use without humans). Therefore, this dissonance in purpose is inherent to methods using the principles to describe human behaviour. This is justified, as the purpose of the study was to gather the human representativeness heuristics that will help set machine learning parameters for automatically assessing data fitness for use. One other limitation of this study is the number of participants. The natural science organizations that supported the distribution of the survey reached very widely across many disciplines and to many thousands of scientists as potential participants. The survey distribution timing of March 2020 may have contributed to the very low participation as well, but even so, the mix of students, citizen scientists, and others reached by this method was not targeted enough for generalizations on the behaviour of scientists to be made. Future work should use a sampling frame that reaches scientists in certain roles within certain domains.

Although this survey did not lead to generalizable results, which limits its impact, the responses given do further validate the interview study and provide more data search stories. In addition, there is a potential limitation in the study design: only successful data search and reuse scenarios would be recalled, so failed reuses, and the heuristics behind them, were not captured. The representative heuristics that lead to non-use could be very informative for machine-actionable data. The only incentive for participation in the fifteen-minute survey was that scientists' responses would help identify the most vital aspects of data in determining fitness for use. Future work could be timelier in its distribution and targeted to specific subdisciplines of scientists. Users of certain data repositories may have unique mental shortcuts that could lead to more actionable recommendations, but these would again reduce the generalizability of any search actions described. Perhaps generalizable data discovery behaviour across all earth sciences is not possible. The FAIR Data Principles themselves may need revisions to specify how to operationalize them across disciplines.

Conclusion

This study captures heuristics used by natural scientists to inform their discovery and reuse, which in turn may inform the design of data tools and services. It may be too soon to conduct research on machine data-seeking behaviour, but it is not too soon to anticipate it. A better understanding of human approaches to discovery and reuse will assist in programming machines to benefit from human search strengths and to capture the heuristics of data munging, while avoiding some of the drawbacks of human limitations, foremost their assumptions. Perhaps some assumptions should be part of artificial intelligence, such as trusting certain brands and data repositories more as a measure of quality, to ensure analyses are conducted using authoritative sources (e.g., NASA). Scientists' perceptions of their discovery and reuse heuristics, framed with the FAIR Data Principles, provide a start for informing the design of tools and services for human reuse of machine-actionable data. They also create a foundation for considering the usability of these data repositories by machines as reusers.

Acknowledgements

The authors would like to acknowledge the study participants and the organizations' leadership and communications personnel who helped distribute this survey during a difficult time. Material in this paper is the result of data collection done for the Spring 2020 Faculty Development Leave of the first author. Anonymized and de-identified transcripts are available in the Tennessee Research and Creative Exchange (TRACE), which serves as the University of Tennessee's institutional repository.

Data

Bishop, B. W. (2020). Data fitness for use survey. Knoxville, TN: TRACE. https://doi.org/10.7290/2OvMbcfHBA

About the authors

Bradley Wade Bishop is an Associate Professor in the School of Information Sciences at the University of Tennessee. Bishop has an MLIS from the University of South Florida and a PhD from Florida State University. His research focus is on research data management, data discovery, geographic information science, as well as the study of data occupations, education, and training. He can be contacted at: wade.bishop@utk.edu
Hannah R. Collier was a Graduate Assistant at the School of Information Sciences at the University of Tennessee, Knoxville, TN. She completed an MSIS as part of the Institute of Museum and Library Services' (IMLS) Collaborative Analysis Liaison Librarians (CALL) project and works as an Earth Science Informatics Technical Professional at Oak Ridge National Laboratory. She can be contacted at: collierhr@ornl.gov


Appendix

Questionnaire

  1. Briefly describe the data from your recent search: (e.g., species list, .csv)
  2. If you can recall, please name the data repository, database, or other resources used to access the data.
  3. Which of the following *best* matches the type of institution you work for?
    • Academic
    • Government
    • Private
    • Other
  4. Which of the following *best* matches your research domain?
    • Atmospheric Sciences
    • Biology
    • Hydrology
    • Planetary Sciences
    • Ocean Sciences
    • Seismology
    • Other
  5. Findability

  6. How did you find the data? (check all that apply)
    • Search engine (e.g., Googled it)
    • Trusted source
    • Government source
    • Explain your discovery process if different than options above
  7. Did the data have a persistent identifier (e.g., DOI)?
    • Yes
    • No
    • Not sure
  8. Did the data have metadata?
    • Yes
    • No
    • Not sure
  9. Did the metadata help you find the data?
    • Yes
    • No
    • Not sure
  10. Accessibility

  11. How did you access the data? (Check all that apply)
    • By calling up a machine-actionable Web service for automatic download.
    • Manually downloading (Website)
    • Contacting data creators/authors
    • Other
  12. Were the data in an open format?
    • Yes
    • No
    • Not sure
  13. Were the data free?
    • Yes
    • No
    • Not sure
  14. Did the data have any use constraints?
    • Yes
    • No
    • Not sure
  15. Were the metadata accessible?
    • Yes
    • No
    • Not sure
  16. Interoperability

  17. Were the data in a usable format?
    • Yes, usable as is
    • Yes, but required transformation or conversion
    • No
  18. Were the data in a common format to other data used in your research?
    • Yes
    • No
    • Not sure
  19. Were the data using any of the following data standards? (check all that apply)
    • Discipline-specific metadata standard
    • name:__________
    • Controlled vocabularies
    • name:__________
    • Data dictionaries
    • name:__________
    • Ontologies
    • name:__________
  20. Were the data machine-actionable?
    • Yes
    • No
    • Maybe
  21. Were the data available in multiple formats?
    • Yes
    • No
    • Not sure
  22. Reusability

  23. Were there any issues with the data that impacted reuse of the data?
    • Yes
    • No
  24. If yes, what were the reuse issues? (check all that apply)
    • Version control
    • Spatial accuracy
    • Spatial precision
    • Geographic scale
    • Other
  25. What were some other data quality issues?
  26. What were some of the other reuse issues? (if any)
  27. Did the metadata provide sufficient information for reuse?
    • Yes
    • No

How to cite this paper

Bishop, B.W. & Collier, H.R. (2022). Fitness for use of data: scientists' heuristics of discovery and reuse behaviour framed by the FAIR Data Principles. Information Research, 27(3), paper 942. Retrieved from http://InformationR.net/ir/27-3/paper942.html (Archived by the Internet Archive at https://bit.ly/3eKzb2X) https://doi.org/10.47989/irpaper942
