Table 5

Roadmap for refactoring corpora. The list of corpora is drawn from [32] and [33], which provide links to the corpora. Column headings indicate the steps that a corpus may need to undergo to be refactored; a dot marks each corpus that would require that step. "Get original" means the original text needs to be retrieved. "Detect spans" means the corpus is a metadata corpus, so the spans of entities need to be detected. "Alt. search" means techniques other than exact-match searching must be used.


get original
detect spans
alt. search

Arabidopsis Thaliana Circadian Rhythms [34]



Bio1 [35]



BioCreative 2004 Task 1A [28]



BioCreative 2004 Task 1B [36]



BioCreative 2004 Task 2 [37]



BioCreative 2006 Task GM [38]



BioCreative 2006 Task GN [39]



BioCreative 2006 Task IPS/IMS [40]



BioCreative 2006 Task ISS [40]



BioInfer [41]



BioText: Recognizing Abbreviation Definitions [42]



BioText: Protein-Protein Interaction Data [43]



BioText: Relations between Disease/Treatment Entities [44]



Brown-Genia Treebank [45]



DepGenia [46]



DIPPPI [47]



EDGAR [48]



GENIA [49, 50]



FetchProt [51]



Human Gene ID-Serve



IEPA [52]



ImmunoTome



iProLink [53]



Medstract [54, 55]



MedTag [7]



OHSUMED [56, 57]



PASBio [58]



PASTA [59]



PathBinder [60]



PennBioIE [12]



PICorpus



ProSpecTome [61]



PDG [9]



Texas [62]



TREC Genomics 2004 Categorization Task [63]



TREC Genomics 2005 Categorization Task [64]



TREC Genomics 2006 IR Task [65]



TREC Genomics 2007 IR Task [65]



Wisconsin [66]



WSD [67]



Yapex [68, 69]




Johnson et al. Journal of Biomedical Discovery and Collaboration 2007 2:4   doi:10.1186/1747-5333-2-4