The EDP French/English Parallel Medical Corpus

Introduction

We identified four open access CC-BY journals, referenced by EDP Sciences as having content in both French and English: the articles were originally written in French, but the journals also publish the titles and abstracts in English, using a translation provided by the authors. Two journals are listed by the publisher under Health: "Actualités Odonto-Stomatologiques" and "Médecine Buccale Chirurgie Buccale", both of which address dentistry. Two journals are listed under Life & Environmental Sciences: "Cahiers Agriculture" and "Oilseeds and fats, Crops and Lipids".
A list of the journal URLs was obtained and crawled on March 15, 2017. The HTML pages were parsed to extract the titles and abstracts in French and English as well as the author names. Articles lacking any of this information were discarded.
The dataset was segmented into sentences using the Stanford CoreNLP toolkit for use in the WMT17 and WMT18 biomedical tasks [1] [2]. The segmented corpus was also used in a study of sentence segmentation methods for French medical corpora [3].
A manual reference segmentation was then created independently by revising the baseline segmentation after the following punctuation marks: full stop, question mark, exclamation mark and colon.
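The review points considered during manual revision can be illustrated with a minimal sketch. This is not the procedure or tooling used to build the corpus: the function name `candidate_sentences` and the rule "split after one of the four punctuation marks when followed by whitespace" are assumptions made for illustration, and the output is only a list of candidate segments that a human annotator would still need to validate.

```python
import re

# Assumed rule (illustrative only): a candidate boundary is whitespace that
# directly follows a full stop, question mark, exclamation mark or colon.
BOUNDARY = re.compile(r"(?<=[.?!:])\s+")


def candidate_sentences(text):
    """Split text into candidate segments for manual review."""
    return [part for part in BOUNDARY.split(text.strip()) if part]


if __name__ == "__main__":
    sample = ("Two main types exist: Crohn's disease and ulcerative colitis. "
              "Both differ!")
    for segment in candidate_sentences(sample):
        print(segment)
```

A rule like this over-segments on purpose (for example after every colon), which is why a manual pass is needed to decide which candidate boundaries are real sentence boundaries.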
Based on the manually validated sentence segmentation, the dataset was aligned automatically at the sentence level using YASA (Lamraoui and Langlais, 2013).
Manual evaluation conducted on a sample set suggests that 94% of the sentences are correctly aligned, with about 20% of the sentence pairs exhibiting additional content in one of the languages.

License

The EDP French/English Medical corpus is released under the Creative Commons CC BY license.

The scientific article abstracts and titles used in this corpus were obtained on March 15, 2017. Subsequently, the corpus was segmented into sentences. The corpus has not been updated since 2017, so the articles it contains may differ from the versions currently available from EDP Sciences.

Any research using this corpus for running experiments should include the following citation:

Jimeno Yepes A, Névéol A, Neves ML, Verspoor K, Bojar O, Boyer A, Grozea C, Haddow B, Kittner M, Lichtblau Y, Pecina P, Roller R, Rosa R, Siu A, Thomas P, Trescher S. Findings of the WMT 2017 Biomedical Translation Shared Task. Second Conference on Machine Translation. 2017(Vol 2):234-247.

Here is the BibTeX entry:

@inproceedings{jimenoyepes2017findings,
	Title = {Findings of the WMT 2017 Biomedical Translation Shared Task},
	Author = {Antonio Jimeno Yepes and Aurelie Neveol and Mariana Neves 
	  and Karin Verspoor and Ondrej Bojar and Arthur Boyer and Cristian Grozea 
	  and Barry Haddow and Madeleine Kittner and Yvonne Lichtblau 
	  and Pavel Pecina and Roland Roller and Rudolf Rosa and Amy Siu 
	  and Philippe Thomas and Saskia Trescher},
	BookTitle = {Proceedings of the Second Conference on Machine Translation},
	Month = {9},
	Year = {2017},
	Publisher = {Association for Computational Linguistics},
	Volume = {2: Shared Task Papers},
	Pages = {234--247}
}
  		

File Format

The corpus is available in the MEDLINE (without sentence segmentation) and BioC (with sentence segmentation) formats.

A corpus excerpt in MEDLINE format is shown below.

Sample document in MEDLINE format
PMID- aos2009246p113
TIEN- Oral symptoms of systemic pathologies:Crohn's disease and ulcerative colitis
TIFR- Manifestations buccalesdes maladies systémiques :La maladie de Crohnet la rectocolite hémorragique
AU - Samira Cherbi
AU - Claude-Bernard Wierzba
ABEN- Inflammatory bowel disease (IBD) are systemic pathologies with chronic disorders, and originate from unidentified causes. Two main types exist: Crohn's disease and ulcerative colitis, both of which have very different clinical, topographic and morphological characteristics. (...)
ABFR- Les entérocolites inflammatoires idiopathiques sont des pathologies systémiques d'étiologie inconnue et d'évolution chronique. Elles regroupent deux principales affections : la maladie de Crohn et la Rectocolite Hémorragique (RCH) dont les caractéristiques cliniques, topographiques et morphologiques sont nettement différentes. (...)
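A record in this layout can be read with a few lines of Python. The sketch below is an illustration based only on the excerpt above: it assumes each field sits on a single line of the form `TAG- value` (with optional spaces before the hyphen, as in `AU - `), and the function name `parse_medline_record` is hypothetical. Multi-line field values, if they occur in the corpus files, are not handled.

```python
import re
from collections import defaultdict

# Assumed field layout, inferred from the excerpt: an upper-case tag,
# optional spaces, a hyphen, then the value.
FIELD = re.compile(r"^([A-Z]{2,4})\s*-\s*(.*)$")


def parse_medline_record(lines):
    """Return a dict mapping each tag to the list of its values."""
    record = defaultdict(list)
    for line in lines:
        match = FIELD.match(line)
        if match is None:
            continue  # skip blank lines and anything that is not a field
        tag, value = match.groups()
        record[tag].append(value.strip())
    return dict(record)


if __name__ == "__main__":
    sample = [
        "PMID- aos2009246p113",
        "TIEN- Oral symptoms of systemic pathologies:Crohn's disease and ulcerative colitis",
        "AU - Samira Cherbi",
        "AU - Claude-Bernard Wierzba",
    ]
    record = parse_medline_record(sample)
    print(record["PMID"][0], record["AU"])
```

Values are kept in lists because a tag such as `AU` can repeat within one record; the English and French titles and abstracts are then available under `TIEN`, `TIFR`, `ABEN` and `ABFR`.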

The same corpus excerpt in the BioC format is shown below.

Sample BioC EDP document
<id>aos2009246p113</id>
<authors>
  <infon key="author">Samira Cherbi</infon>
  <infon key="author">Claude-Bernard Wierzba</infon>
</authors>
<passage>
  <infon key="language">EN</infon>
  <infon key="section">title</infon>
  <sentence>
   <infon key="sentnum">0</infon>
   <offset>0</offset>
   <text><![CDATA[Oral symptoms of systemic pathologies:Crohn's disease and ulcerative colitis]]></text>
  </sentence>
</passage>
<passage>
  <infon key="language">EN</infon>
  <infon key="section">abstract</infon>
  <sentence>
   <infon key="sentnum">0</infon>
   <offset>0</offset>
   <text><![CDATA[Inflammatory bowel disease (IBD) are systemic pathologies with chronic disorders, and originate from unidentified causes.]]></text>
  </sentence>
  <sentence>
   <infon key="sentnum">1</infon>
   <offset>122</offset>
  <text><![CDATA[Two main types exist: Crohn's disease and ulcerative colitis, both of which have very different clinical, topographic and morphological characteristics.]]></text>
  </sentence>
(...)
</passage>
<passage>
  <infon key="language">FR</infon>
  <infon key="section">title</infon>
  <sentence>
   <infon key="sentnum">0</infon>
   <offset>0</offset>
   <text><![CDATA[Manifestations buccalesdes maladies systémiques :La maladie de Crohnet la rectocolite hémorragique]]></text>
  </sentence>
</passage>
<passage>
  <infon key="language">FR</infon>
  <infon key="section">abstract</infon>
  <sentence>
   <infon key="sentnum">0</infon>
   <offset>0</offset>
   <text><![CDATA[Les entérocolites inflammatoires idiopathiques sont des pathologies systémiques d'étiologie inconnue et d'évolution chronique.]]></text>
  </sentence>
  <sentence>
   <infon key="sentnum">1</infon>
   <offset>127</offset>
  <text><![CDATA[Elles regroupent deux principales affections : la maladie de Crohn et la Rectocolite Hémorragique (RCH) dont les caractéristiques cliniques, topographiques et morphologiques sont nettement différentes.]]></text>
  </sentence>
(...)
</passage>
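The passage and sentence structure shown above can be traversed with Python's standard `xml.etree.ElementTree` module. The sketch below is illustrative only: the excerpt is a fragment, so the example wraps a shortened copy in a `<document>` root element, which is an assumption about the enclosing markup rather than something shown on this page, and the function name `sentences_by_language` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the excerpt; the <document> wrapper is an assumption.
SAMPLE = """<document>
<id>aos2009246p113</id>
<passage>
  <infon key="language">EN</infon>
  <infon key="section">title</infon>
  <sentence>
   <infon key="sentnum">0</infon>
   <offset>0</offset>
   <text><![CDATA[Oral symptoms of systemic pathologies:Crohn's disease and ulcerative colitis]]></text>
  </sentence>
</passage>
<passage>
  <infon key="language">FR</infon>
  <infon key="section">title</infon>
  <sentence>
   <infon key="sentnum">0</infon>
   <offset>0</offset>
   <text><![CDATA[Manifestations buccalesdes maladies systémiques :La maladie de Crohnet la rectocolite hémorragique]]></text>
  </sentence>
</passage>
</document>"""


def sentences_by_language(document):
    """Map (language, section) to the ordered list of sentence texts."""
    result = {}
    for passage in document.iter("passage"):
        # findall() only sees direct children, so sentence-level infons
        # (such as sentnum) are not mixed into the passage metadata.
        infons = {i.get("key"): i.text for i in passage.findall("infon")}
        key = (infons.get("language"), infons.get("section"))
        result[key] = [s.findtext("text") for s in passage.findall("sentence")]
    return result


if __name__ == "__main__":
    passages = sentences_by_language(ET.fromstring(SAMPLE))
    for (language, section), sentences in passages.items():
        print(language, section, sentences)
```

Grouping by the `language` and `section` infons makes it straightforward to put the English and French sentences of the same section side by side.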

Download

EDP 2017 version, archives in the various formats:
EDP parallel corpus, MEDLINE format (no sentence segmentation)
EDP parallel corpus, BioC format; automatic (CoreNLP) and manual segmentation; WMT split.
EDP French corpus, text format: one sentence per line; automatic (CoreNLP) and manual segmentation.
EDP English corpus, text format: one sentence per line; automatic (CoreNLP) and manual segmentation.

People Involved

  • Arthur Boyer
  • Aurélie Névéol
  • Mariana Neves
  • Antonio Jimeno Yepes

Publications

  • [1] Jimeno Yepes A, Névéol A, Neves ML, Verspoor K, Bojar O, Boyer A, Grozea C, Haddow B, Kittner M, Lichtblau Y, Pecina P, Roller R, Rosa R, Siu A, Thomas P, Trescher S. Findings of the WMT 2017 Biomedical Translation Shared Task. Second Conference on Machine Translation. 2017(Vol 2):234-247.
  • [2] Neves ML, Jimeno Yepes A, Névéol A, Grozea C, Siu A, Kittner M, Verspoor K. Findings of the WMT 2018 Biomedical Translation Shared Task: Evaluation on Medline test sets. Third Conference on Machine Translation. 2018:328-343.
  • [3] Boyer A, Névéol A. Détection automatique de phrases en domaine de spécialité en français. Traitement Automatique de la Langue Naturelle - TALN. 2018.
  • [4] Névéol A, Jimeno Yepes A, Neves ML, Verspoor K. Parallel Corpora for the Biomedical Domain. Language and Resource Evaluation Conference, LREC 2018. 2018:286-291.

Acknowledgements

This work was supported by the French National Agency for Research under grant CABeRneT ANR-13-JS02-0009-01.