DCEP, Digital Corpus of the European Parliament
Documents published on the European Parliament's official website
Available for Download ✅
⚠️ Always check the license of the data source before using the data ⚠️
- Main page: https://ec.europa.eu/jrc/en/language-technologies/dcep
- Download Link: https://wt-public.emm4u.eu/Resources/DCEP-2013/DCEP-Download-Page.html
- Extraction Instructions: https://wt-public.emm4u.eu/Resources/DCEP-2013/DCEP-extract-README.html
- Format: Sentence-aligned data is in plain text
Brief Description
Contains the majority of the documents published on the European Parliament's official website. It comprises a variety of document types, from press releases to session and legislative documents related to the European Parliament's activities and bodies. The current version of the corpus contains documents that were produced between 2001 and 2012.
Code to Extract Files to Pandas DataFrame
GA-EN specific instructions are below; for more information, see the official extraction instructions page.
- Download and extract language files
!wget -q http://optima.jrc.it/Resources/DCEP-2013/sentences/DCEP-sentence-GA-pub.tar.bz2
!wget -q http://optima.jrc.it/Resources/DCEP-2013/sentences/DCEP-sentence-EN-pub.tar.bz2
!tar jxf DCEP-sentence-GA-pub.tar.bz2
!tar jxf DCEP-sentence-EN-pub.tar.bz2
- Download and extract language pair info
!wget -q http://optima.jrc.it/Resources/DCEP-2013/langpairs/DCEP-EN-GA.tar.bz2
!tar jxf DCEP-EN-GA.tar.bz2
- Download and extract alignment scripts
!wget -q http://optima.jrc.it/Resources/DCEP-2013/DCEP-extract-scripts.tar.bz2
!tar jxvf DCEP-extract-scripts.tar.bz2
- Create aligned file
The --numbering-filter option is a crude but useful heuristic that attempts to drop numberings and short titles from the output. It works simply by matching sentences on both sides against a Unicode regex that looks for two alphabetic characters with a space between them.
The --length-filter-level=LENGTH_FILTER_LEVEL argument is used to throw away as suspicious all bisentences where the ratio of the shorter and the longer sentence (in character length) is less than LENGTH_FILTER_LEVEL percent.
!cd dcep && ./src/languagepair.py --numbering-filter --length-filter-level=40 EN-GA > EN-GA-bisentences.txt
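To make the two filters more concrete, here is a minimal Python sketch of the behaviour described above. It is an illustration only, not the code in languagepair.py: the exact regex, the rule that a pair is kept only when both sides match, and the names HAS_TEXT, passes_numbering_filter and passes_length_filter are all assumptions.

```python
import re

# Assumed pattern: two alphabetic characters separated by whitespace.
# [^\W\d_] matches any Unicode letter in Python's re module.
HAS_TEXT = re.compile(r"[^\W\d_]\s[^\W\d_]")


def passes_numbering_filter(src: str, tgt: str) -> bool:
    """Keep a pair only if both sides contain the letter-space-letter pattern."""
    return bool(HAS_TEXT.search(src)) and bool(HAS_TEXT.search(tgt))


def passes_length_filter(src: str, tgt: str, level: int = 40) -> bool:
    """Keep a pair only if shorter/longer length (in characters) is at least `level` percent."""
    shorter, longer = sorted((len(src), len(tgt)))
    if longer == 0:
        # Two empty sentences carry no information, so reject them.
        return False
    return 100 * shorter / longer >= level
```

Under these assumptions, a pair such as "Article 1" / "Airteagal 1" is dropped by the numbering filter, because the only space on each side sits between a letter and a digit. With a level of 40, a 4-character sentence paired with a 10-character one is just kept, while a 3-character one is discarded.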
- Open as a Dataframe
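import csv
import pandas as pd
# Each line holds one sentence pair, with the two sides separated by a tab.
# QUOTE_NONE keeps quotation marks inside sentences from being parsed as CSV quoting.
df = pd.read_csv('dcep/EN-GA-bisentences.txt', header=None, sep='\t', quoting=csv.QUOTE_NONE)
df.columns = ['en', 'ga']
# Save a CSV copy, then report the number of pairs and preview the first rows.
df.to_csv('dcep_en-ga_bisentences.csv')
print(len(df))
df.head()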
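Once the pairs are in a DataFrame, a light clean-up pass can be useful before further processing. The sketch below is an optional extra and not part of the official extraction instructions. The helper name tidy_bisentences is hypothetical, and it assumes the two columns are named en and ga as above.

```python
import pandas as pd


def tidy_bisentences(df: pd.DataFrame) -> pd.DataFrame:
    """Strip whitespace, then drop incomplete and duplicate sentence pairs."""
    # Treat missing values as empty strings so both cases are filtered the same way.
    en = df["en"].fillna("").astype(str).str.strip()
    ga = df["ga"].fillna("").astype(str).str.strip()
    # Keep only rows where both sides still contain text.
    keep = (en != "") & (ga != "")
    out = pd.DataFrame({"en": en[keep], "ga": ga[keep]})
    # Remove exact duplicate pairs and renumber the rows.
    return out.drop_duplicates().reset_index(drop=True)
```

Calling tidy_bisentences(df) returns a new DataFrame and leaves the original untouched, so the row counts before and after can be compared with len().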