Secondary Data Use

Over the years, many researchers have commented both positively and negatively on the NIH and NIAID's data sharing policy. They are set to change it again, such that everyone who gets NIH funding will need to comply with the policy. Comments can be made on the new policy for at least another week, so now is the time to speak up! That said, do it today, as the comment period has almost expired. The new policy can be found at: http://www.gpo.gov/fdsys/pkg/FR-2013-09-20/pdf/2013-22941.pdf

In summary, projects are not expected to release raw data such as instrument image data. However, within 6 months of data generation they are expected to release the initial sequence reads, data after alignment and QC (e.g. bam files), and analyses like expression profiling and variant calling. If a manuscript is published within those 6 months, the data needs to be released upon acceptance of the publication. In addition, all analyses relating genomic data to phenotype or other biological states must be released upon publication. It reads as if there are no embargo dates.

Having spent a significant amount of time on both sides of secondary data usage, as both a data generator and a heavy user, I provided input on three issues in my comments on the policy change:

1.       I think data generators need a protected period between when data is released and when they can publish, something akin to an embargo date. Such an embargo date should be made clear to secondary users at the time data is acquired and should not change. An embargo date is important because I don’t think it will be long before groups automatically download new data, write a low-quality paper, and publish it in a lower-quality journal, leaving the data generators no time to do the validation and follow-up studies needed to send their work to a high-quality journal. Allowing data generators time to publish their research and ideas in a high-quality journal is essential to the future of this type of science. I know others disagree with this and believe that everything should be open access. But today sequencing is cheaper than preparing the samples, so it is no longer a precious commodity available only to a limited few. One solution would be for the data generators to outline their particular focus. While that focus area would be off limits, other research would be fair game to conduct and publish without embargo. Of course, reviewers and editors would need to do their diligence to enforce such a system, and the scope allowed would need to be limited.

2.       Currently, raw human data is not required to be deposited; it is exempted, and only alignments following cleaning are required. However, cleaning is not defined. Since cleaning could remove microbial reads, this is a problem for my research. I think that if raw data is not provided, it needs to be stipulated that alignments must include ALL reads. In addition, I think that if microbial users provide alignments with ALL reads, they should not have to deposit FASTQ files either.

3.       There needs to be a system for retracting data and notifying users; currently there is not. For instance, one data generator I rely on retracted multiple pieces of data because the metadata said the sequence data was from a man when it was clearly from a woman. They also retracted data because three samples that should have been different turned out to be genetically identical. Retracting such data is good and important, since it ensures that high-quality secondary data is available. Yet they did not notify users who had already downloaded the data that it was retracted. This is a major problem for secondary users. The short time frames required for deposition can make it difficult to identify all the problems before release, making the data less useful to secondary users.
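
To make the second point concrete, here is a minimal sketch of how a secondary user might check whether a deposited alignment still contains unaligned reads. It assumes the alignment is available as SAM-format text (a bam file would first need converting, for example with `samtools view -h`) and relies on the SAM FLAG bit 0x4, which marks an unmapped segment. The function name `count_records` and the example records are hypothetical and only for illustration.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def count_records(sam_lines):
    """Return (total, unmapped) alignment record counts from SAM text lines."""
    total = 0
    unmapped = 0
    for line in sam_lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2 of a SAM record is the FLAG
        total += 1
        if flag & UNMAPPED_FLAG:
            unmapped += 1
    return total, unmapped


# Hypothetical records: one aligned read and one unaligned read.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
]

total, unmapped = count_records(example)
print(f"{unmapped} of {total} records are unaligned")
```

A file that reports zero unaligned records may have had its unaligned reads, potentially including microbial ones, removed during cleaning, although a sample in which every read aligned would give the same result. Note that this counts alignment records, so reads with secondary or supplementary alignments are counted more than once.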

Just some thoughts, along with the hope that more of you will comment. It is nice to be given an opportunity to improve the system. The best system will arise from the consideration of a variety of thoughts and opinions put forth in the comments.

--jdh