nlpIrish

Irish NLP Dataset Descriptions

This is a collection of descriptions, sources and extraction instructions for Irish-language text datasets for natural language processing (NLP) research.

Would you like to add to or collaborate on this collection? Great! Head up to the About section to see how to contribute 👌

This site is hosted on GitHub and built using the fabulous fastpages.

Parallel Corpora*

In order of dataset size (but remember that the number of lines of text doesn’t equal quality!):

ParaCrawl, v6
- Lines of text: 1,366,628
- GA Word count: 32,824,533
DGT-TM, DGT-Translation Memory
- Lines of text: 190,500
- GA Word count: 4,852,515
DCEP, Digital Corpus of the European Parliament
- Lines of text: 46,146
- GA Word count: 1,029,348
ELRC, European Language Resource Coordination
- No. source documents: 33
- Lines of text: 23,946
- GA Word count: 485,570
Tatoeba
- Lines of text: 1,973
- GA Word count: 10,352

*Sizes as of June 2020; word count is defined as the number of space-separated tokens.
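
As a rough illustration of how the line and word counts above could be reproduced, here is a minimal sketch. It assumes the Irish (GA) side of a corpus is available as an iterable of text lines, and it uses Python's `str.split()`, which splits on any run of whitespace rather than on single spaces only. The function name `corpus_stats` and the sample sentences are hypothetical and are not part of any dataset's tooling; the original counts may have been produced differently.

```python
def corpus_stats(lines):
    """Return (line_count, word_count) for an iterable of text lines.

    Words are counted as whitespace-separated tokens, so punctuation
    attached to a word is counted as part of that word.
    """
    line_count = 0
    word_count = 0
    for line in lines:
        line_count += 1
        # split() with no argument splits on any run of whitespace
        # and ignores leading/trailing whitespace (including newlines)
        word_count += len(line.split())
    return line_count, word_count


# Hypothetical sample sentences, for illustration only
sample = ["Dia duit, a chara.", "Conas atá tú?"]
print(corpus_stats(sample))  # (2, 7)
```

For a file on disk, the same function can be passed an open file handle, for example `corpus_stats(open(path, encoding="utf-8"))`, where `path` points at the GA side of the corpus.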

Monolingual Irish Corpora

tbd

Task-specific Corpora

tbd

ALL DATASET DESCRIPTIONS 👇

Tatoeba
A collection of sentences and translations, crowdsourced, collaborative, open and free
Jun 12, 2020
ParaCrawl
Open source tools to crawl, align and clean bilingual data
Jun 11, 2020
ELRC, European Language Resource Coordination
Documents published on the European Parliament's official website
Jun 11, 2020
DGT-TM, DGT-Translation Memory
European Commission's Directorate-General for Translation multilingual Translation Memory
Jun 11, 2020
DCEP, Digital Corpus of the European Parliament
Documents published on the European Parliament's official website
Jun 11, 2020
