Academic journal article Journal of the Medical Library Association

The Need for a Multidisciplinary Team Approach to Life Science Workflows

Article excerpt

INTRODUCTION

Information retrieval for life science research (a broad rubric encompassing many traditional disciplines such as biochemistry, botany, cell biology, and molecular biology [1]) often involves combining multiple information resources. Such combinations have been called "workflows" [2, 3] and may include factual databases such as GenBank [4], literature databases such as Entrez-PubMed [5], and analysis tools such as the Basic Local Alignment Search Tool (BLAST) [6]. Information resources can be combined in different ways toward the same goal, and different combinations may produce different results for the same research question. Such combinations may appear equivalent even to a scientifically sophisticated user who lacks knowledge of the resource metadata that would signal the possibility of varying results. In addition, a user who pursues only a single combination of resources may not even realize that another combination might produce different results.

This study's objective was to compare the results of three intuitively plausible and seemingly similar workflows for retrieving gene function information, with the goal of illustrating the importance of library science in bioinformatics and the need for a multidisciplinary team approach to authoring, vetting, and using life science workflows.

METHODS

Microarray analysis is a high-throughput experimental technique that engenders significant information retrieval requirements [7]. One use of microarrays is analyzing gene expression: raw data from the microarray are statistically analyzed to determine which genes show significant changes in expression, with one or more lists of genes as the final result. Interpreting the biological meaning of this result often necessitates retrieving information from other sources about the function of the listed genes. Microarray analysis, therefore, is one example of a domain in which information from the biological literature must be integrated with information contained in sequence and other databases.

For some microarray analyses, each gene has a related representative DNA sequence. The identifier of that DNA sequence (its nucleotide sequence accession number, hereafter, "accession number") may be used to search for information about the function of the associated gene. This study compared three workflows that used accession numbers as starting points and utilized linkages among PubMed and other Entrez databases [8]. Although using accession numbers to search for gene function information has problems [9], the workflows compared here have been selected as simple, intuitively plausible strategies similar to some of those the authors have seen used in practice. Other workflows, using other starting points or information resources, are also possible and potentially useful.

This study used a list of 251 accession numbers representing genes determined to be of interest in a microarray experiment related to muscle recovery after immobilization (NIH grant AG18881) [10-12]. The genes on the list represented an example of real-world microarray results for which researchers might need to retrieve gene function information. The list of accession numbers served as the test set: each workflow was executed against it, and the results were compared.
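As an illustration only, the comparison of workflow results over a shared test set can be pictured as set arithmetic on the retrieved PMIDs. The sketch below is not the authors' procedure, which this excerpt does not describe. The function names and the workflow labels are hypothetical, and the integers in the usage example are made-up placeholders, not real PMIDs or study results.

```python
from typing import Dict, Set


def unique_to_each(results: Dict[str, Set[int]]) -> Dict[str, Set[int]]:
    """For each workflow, return the PMIDs that no other workflow retrieved."""
    unique: Dict[str, Set[int]] = {}
    for name, pmids in results.items():
        # Union of everything retrieved by the other workflows.
        others: Set[int] = set().union(
            *(p for n, p in results.items() if n != name)
        )
        unique[name] = pmids - others
    return unique


def shared_by_all(results: Dict[str, Set[int]]) -> Set[int]:
    """Return the PMIDs retrieved by every workflow."""
    if not results:
        return set()
    return set.intersection(*results.values())


# Placeholder data (NOT real PMIDs): three hypothetical workflows
# run against the same accession number.
toy_results = {
    "workflow_1": {1, 2},
    "workflow_2": {2, 3},
    "workflow_3": {2},
}
toy_unique = unique_to_each(toy_results)
toy_shared = shared_by_all(toy_results)
```

A non-empty entry in `toy_unique` is the situation the introduction warns about: a user who ran only one of the other workflows would never see those records.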

Description of the three workflows

The three workflows are depicted in Figure 1 (available online). Each starts with an accession number (e.g., M29293), denoted as "xxxxxxx."

Workflow 1: PubMed only. The Entrez-PubMed "Secondary Source" or SI field (which identifies secondary data sources and associated accession numbers discussed in MEDLINE articles) [13] was searched using a query of the form genbank/xxxxxxx[si]. The result was a set of PubMed records, represented here as a set of PubMed IDs (PMIDs). For example, the query "genbank/M29293[si]" retrieved PMID 2532363.
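The query form used in Workflow 1 can be sketched in a few lines of Python. The excerpt does not say how the searches were issued, so sending the query through the NCBI E-utilities ESearch endpoint is an assumption made here for illustration. The function names are hypothetical, and the code only builds the query string and URL without contacting the server.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (assumed here; the article excerpt
# does not state which interface the authors used).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_si_query(accession: str) -> str:
    """Build a PubMed SI-field query of the form genbank/xxxxxxx[si]."""
    return f"genbank/{accession.strip()}[si]"


def build_esearch_url(accession: str) -> str:
    """Build an ESearch URL that would run the SI query against PubMed."""
    params = {"db": "pubmed", "term": build_si_query(accession)}
    return f"{ESEARCH_URL}?{urlencode(params)}"


query = build_si_query("M29293")
url = build_esearch_url("M29293")
```

Here `query` reproduces the example from the text, genbank/M29293[si]. Running the same function over each accession number in a test set would yield one SI-field query per gene.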

Workflow 2: Nucleotide-PubMed. …
