Available for Download ✅

⚠️ Always check the license of the data source before using the data ⚠️

Main page: https://ec.europa.eu/jrc/en/language-technologies/dcep
Download Link: https://wt-public.emm4u.eu/Resources/DCEP-2013/DCEP-Download-Page.html
Extraction Instructions: https://wt-public.emm4u.eu/Resources/DCEP-2013/DCEP-extract-README.html
Format: Sentence-aligned data is in plain text

Brief Description

The corpus contains the majority of the documents published on the European Parliament's official website. It comprises a variety of document types, from press releases to session and legislative documents related to the European Parliament's activities and bodies. The current version of the corpus contains documents produced between 2001 and 2012.

Other Notes

Lines of text: 46,146
GA Word count: 1,029,348

Word Count Distribution

[Figure: word count distribution plot]

Code to Extract Files to Pandas DataFrame

GA-EN specific instructions are below. For more information, see the official extraction instructions page.

Download and extract language files

!wget -q http://optima.jrc.it/Resources/DCEP-2013/sentences/DCEP-sentence-GA-pub.tar.bz2

!wget -q http://optima.jrc.it/Resources/DCEP-2013/sentences/DCEP-sentence-EN-pub.tar.bz2

!tar jxf DCEP-sentence-GA-pub.tar.bz2

!tar jxf DCEP-sentence-EN-pub.tar.bz2

Download and extract language pair info

!wget -q http://optima.jrc.it/Resources/DCEP-2013/langpairs/DCEP-EN-GA.tar.bz2

!tar jxf DCEP-EN-GA.tar.bz2

Download and extract alignment scripts

!wget -q http://optima.jrc.it/Resources/DCEP-2013/DCEP-extract-scripts.tar.bz2

!tar jxvf DCEP-extract-scripts.tar.bz2

Create aligned file

The --numbering-filter option is a crude but useful heuristic that attempts to drop numberings and short titles from the output. It works simply by matching sentences on both sides against a Unicode regex that looks for two alphabetic characters with a space between them.

The --length-filter-level=LENGTH_FILTER_LEVEL argument is used to throw away as suspicious all bisentences where the ratio of the shorter and the longer sentence (in character length) is less than LENGTH_FILTER_LEVEL percent.
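The length filter described above can be illustrated with a minimal sketch. This is not the code from languagepair.py; the function names `length_ratio_percent` and `passes_length_filter` are hypothetical, and the handling of empty pairs is an assumption made only so the example is well defined.

```python
def length_ratio_percent(source: str, target: str) -> float:
    """Return the shorter/longer character-length ratio as a percentage."""
    shorter, longer = sorted((len(source), len(target)))
    if longer == 0:
        # Assumption: a pair of empty strings is treated as suspicious.
        return 0.0
    return 100.0 * shorter / longer


def passes_length_filter(source: str, target: str, level: float = 40) -> bool:
    """Keep the pair unless its ratio is less than `level` percent."""
    return length_ratio_percent(source, target) >= level


# An 18-character English title against a 21-character Irish one is kept.
print(passes_length_filter("RULES OF PROCEDURE", "RIALACHA NÓS IMEACHTA"))
# A 10-character sentence against a 100-character one is dropped at level 40.
print(passes_length_filter("a" * 10, "b" * 100))
```

A higher level keeps only pairs of very similar length, while a lower level is more permissive.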

!cd dcep && ./src/languagepair.py --numbering-filter --length-filter-level=40 EN-GA > EN-GA-bisentences.txt

Open as a Dataframe

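import pandas as pd

# The aligned file has one bisentence per line: English, a tab, then Irish.
df = pd.read_csv('dcep/EN-GA-bisentences.txt', header=None, sep='\t')
df.columns = ['en', 'ga']

# Save a CSV copy (the DataFrame index is written as the first column).
df.to_csv('dcep_en-ga_bisentences.csv')

print(len(df))  # number of aligned sentence pairs
df.head()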

46147

	en	ga
0	RULES OF PROCEDURE	RIALACHA NÓS IMEACHTA
1	7th parliamentary term	7ú téarma parlaiminteach
2	July 2009	Iúil 2009
3	Interpretations of the Rules (pursuant to Rule...	Tá léirmhínithe ar na Rialacha (de bhun Riail ...
4	MEMBERS, PARLIAMENT BODIES AND POLITICAL GROUPS	FEISIRÍ, COMHLACHTAÍ PARLAIMINTE AGUS GRÚPAÍ P...
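
Line and word counts like those under Other Notes could be computed along the following lines. This is a sketch only: it assumes a "word" is a whitespace-separated token, which may not be the definition used for the figures above, and the helper names `word_count` and `summarize` are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def word_count(sentence: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(sentence.split())


def summarize(sentences: Iterable[str]) -> Tuple[int, int, Counter]:
    """Return (number of lines, total words, words-per-line distribution)."""
    counts = [word_count(s) for s in sentences]
    return len(counts), sum(counts), Counter(counts)


# A few Irish sentences taken from the sample rows shown above.
ga_sample = ["RIALACHA NÓS IMEACHTA", "7ú téarma parlaiminteach", "Iúil 2009"]
n_lines, n_words, distribution = summarize(ga_sample)
print(n_lines, n_words)
print(sorted(distribution.items()))
```

On the full data, `summarize(df['ga'].astype(str))` would give the corpus-level figures under the same whitespace-tokenization assumption.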