DGT-TM, DGT-Translation Memory
The multilingual translation memory of the European Commission's Directorate-General for Translation
Available for Download ✅
⚠️ Always check the license of the data source before using the data ⚠️
- Link: https://ec.europa.eu/jrc/en/language-technologies/dgt-translation-memory
- Format: .tmx
- NOTE:
  - There are no Irish translations in:
    - DGT-TM Version 1 (released in 2007)
    - DGT-TM-release 2011
  - "DGT-TM-release 2012" is the first release with Irish translations
Brief Description
A parallel multilingual corpus of the European Union’s legislative documents (Acquis Communautaire) in 24 EU languages. The aligned translation units have been provided by the Directorate-General for Translation of the European Commission by extraction from one of its large shared translation memories in EURAMIS (European advanced multilingual information system). This memory contains most, although not all, of the documents which make up the Acquis Communautaire, as well as some other documents which are not part of the Acquis.
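Each .tmx file is plain XML, so you can peek at the structure of a translation unit with nothing but the standard library before installing any helper package. A minimal sketch, assuming the same example file used below and TMX's usual unqualified element names:
# Sketch: inspect the first translation unit of a TMX file with the standard library.
# Assumes the usual unqualified TMX element names (<tu>, <tuv>, <seg>).
import xml.etree.ElementTree as ET

XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'

root = ET.parse('Volume_1/22003D0033.tmx').getroot()
for tu in root.iter('tu'):
    # Each <tu> is one aligned translation unit; each <tuv> holds its text in one language
    for tuv in tu.findall('tuv'):
        lang_code = tuv.get(XML_LANG) or tuv.get('lang')  # attribute name varies by TMX version
        seg = tuv.find('seg')
        print(lang_code, (seg.text or '')[:60])
    break  # only show the first unit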
Install the tmx2dataframe package from PyPI:
pip install tmx2dataframe
from pathlib import Path
import pandas as pd
from tmx2dataframe import tmx2dataframe

# Read a single TMX file: returns file-level metadata and a dataframe
# with one row per aligned translation unit
metadata, df = tmx2dataframe.read('Volume_1/22003D0033.tmx')
df.head()

# Inspect the first aligned pair and some basic counts
df.source_sentence[0], df.target_sentence[0]
len(df), len(df.target_sentence[1].split())
The metadata is also included:
metadata
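Because Irish only appears from the 2012 release onward, it is worth checking which target languages a file actually contains before harvesting a whole release. A quick check using the columns returned by tmx2dataframe above:
# Which target languages does this file contain?
df.target_language.unique()

# Keep only segments whose target language is Irish ('GA');
# an empty result means the file has no Irish translations
len(df[df.target_language.str.contains('GA')])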
import gc

# master_bar / progress_bar come from the fastprogress library
from fastprogress.fastprogress import master_bar, progress_bar

lang = 'GA'

# DGT-TM release years to scan (2012-2019)
yr_list = []
for y in range(2, 10):
    yr_list.append(f'201{y}')

# For each release year
for yr in yr_list:
    dir_path = Path(f'{yr}_release')
    dir_list = []
    for dd in dir_path.iterdir():
        if dd.is_dir(): dir_list.append(dd)
    mb = master_bar(dir_list)
    # For each directory in a specific release year
    for d in mb:
        if d.is_dir() and (d.suffix != '.zip'):
            # For each file in a specific directory
            for f in progress_bar(list(d.iterdir()), parent=mb):
                if f.suffix == '.tmx':
                    try:
                        _, df = tmx2dataframe.read(str(f))
                        # If target_language in the dataframe contains the language string (like 'GA')
                        if len(df[df.target_language.str.contains(lang)]) > 0:
                            tmp = df[df.target_language.str.contains(lang)].copy()
                            tmp['filepath'] = str(f)
                            var_exists = 'ga_df' in locals() or 'ga_df' in globals()
                            if var_exists: ga_df = pd.concat([ga_df, tmp])
                            else: ga_df = tmp
                    except Exception:
                        print(f"Couldn't open {f} in {d}")
    print(f'{yr} DONE!')
    var_exists = 'ga_df' in locals() or 'ga_df' in globals()
    if var_exists:
        print(f'{len(ga_df)} samples found in {yr} release')
        ga_df.reset_index(inplace=True, drop=True)
        ga_df.to_csv(f'dgt_tm_{yr}_release_en-ga.csv')
        del ga_df
        gc.collect()
    else:
        print(f'No {lang} text found in {yr} release')
    print()
# Combine the per-release CSVs into a single English-Irish dataset
ga_df = None
for y in range(2, 10):
    try:
        tmp = pd.read_csv(f'dgt_tm_201{y}_release_en-ga.csv', index_col=0)
        ga_df = tmp if ga_df is None else pd.concat([ga_df, tmp])
    except Exception:
        print(f'Error with opening dgt_tm_201{y}_release_en-ga.csv')
ga_df.reset_index(inplace=True, drop=True)
print(len(ga_df))
ga_df.to_csv('dgt_tm_2012-2019_releases_en-ga.csv', index=False)
ga_df.head()
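Successive releases can contain overlapping documents, so the combined dataframe may hold repeated sentence pairs. If that matters for your use case, a minimal deduplication pass over the sentence-pair columns shown earlier could look like this:
# Optional: drop exact duplicate English-Irish sentence pairs across releases
dedup_df = ga_df.drop_duplicates(subset=['source_sentence', 'target_sentence'])
print(f'{len(ga_df) - len(dedup_df)} duplicate pairs removed, {len(dedup_df)} remaining')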
# Volume_1/file_list.txt lists the files shipped in the volume
df_ls = pd.read_csv('Volume_1/file_list.txt', header=None)
df_ls.columns = ['fst']
df_ls.head()
df_ls.fst[0]
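You could also drive the extraction from the file list instead of walking the directories. A sketch, assuming each entry in file_list.txt is a bare .tmx file name inside Volume_1:
# Sketch: read each file named in file_list.txt and count its Irish segments,
# assuming the entries are bare .tmx file names relative to Volume_1
for name in df_ls.fst:
    path = Path('Volume_1') / str(name).strip()
    if path.suffix == '.tmx' and path.exists():
        _, vol_df = tmx2dataframe.read(str(path))
        n_ga = len(vol_df[vol_df.target_language.str.contains('GA')])
        if n_ga:
            print(f'{path.name}: {n_ga} Irish segments')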