Irish NLP Dataset Descriptions
This is a collection of descriptions, sources, and extraction instructions for Irish-language text datasets for natural language processing (NLP) research.
Would you like to add to or collaborate on this collection? Great! Head up to the About section to see how to contribute.
This site is hosted on GitHub and built using the fabulous fastpages.
Parallel Corpora*
In order of dataset size (but remember, more lines of text doesn't mean higher quality!):
- ParaCrawl, v6
  - Lines of text: 1,366,628
  - GA Word count: 32,824,533
- DGT-TM, DGT-Translation Memory
  - Lines of text: 190,500
  - GA Word count: 4,852,515
- DCEP, Digital Corpus of the European Parliament
  - Lines of text: 46,146
  - GA Word count: 1,029,348
- ELRC, European Language Resource Coordination
  - No. source documents: 33
  - Lines of text: 23,946
  - GA Word count: 485,570
- Tatoeba
  - Lines of text: 1,973
  - GA Word count: 10,352
*Sizes as of June 2020; word count is defined as the number of space-separated tokens.
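As a rough illustration of how the "Lines of text" and "GA Word count" figures could be reproduced, here is a minimal Python sketch. It is not the script used to produce the numbers above: the function name `corpus_size` is hypothetical, and it assumes that "space-separated tokens" means splitting on any run of whitespace, so that repeated spaces do not produce empty tokens.

```python
def corpus_size(lines):
    """Return (line_count, word_count) for an iterable of text lines.

    Assumption: a "word" is any whitespace-separated token, so punctuation
    attached to a word (e.g. "tú?") counts as part of that token.
    """
    line_count = 0
    word_count = 0
    for line in lines:
        line_count += 1
        # str.split() with no arguments splits on runs of whitespace
        # and ignores leading/trailing whitespace, including the newline.
        word_count += len(line.split())
    return line_count, word_count


# Small in-memory example with three Irish sentences.
sample = ["Dia duit", "Conas atá tú?", "Go raibh maith agat"]
print(corpus_size(sample))  # (3, 9)
```

Because the function accepts any iterable of strings, an open file handle for the Irish side of a corpus can be passed directly (opened with `encoding="utf-8"`), which avoids loading a large file such as ParaCrawl into memory at once.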
Monolingual Irish Corpora
- tbd
Task-specific Corpora
- tbd