Available for Download ✅

⚠️ Always check the license of the data source before using the data ⚠️

Main page: https://elrc-share.eu/
Data Browse Link: https://elrc-share.eu/repository/search/
Format: .tmx

Brief Description

The ELRC-SHARE repository is used for documenting, storing, browsing and accessing Language Resources that are collected through the European Language Resource Coordination and considered useful for feeding the CEF Automated Translation (CEF.AT) platform.

Other Notes

Each file is hosted separately on ELRC-SHARE, so the files have to be downloaded one at a time, which takes a little patience. Let us know if there is a more efficient way to download them!

No. source documents: 33
Lines of text: 23,946
GA Word count: 485,570

Word Count Distribution
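
A word count like the GA figure above, together with its per-sentence distribution, could be computed along the following lines. This is only a sketch: it assumes simple whitespace tokenisation and the `target_sentence` column shown in the tables, the counting method behind the reported figure is not stated, and `word_counts` is a hypothetical helper name.

```python
import pandas as pd


def word_counts(df: pd.DataFrame, col: str = "target_sentence") -> pd.Series:
    """Return the number of whitespace-delimited words in each row of `col`."""
    # Missing sentences are treated as empty strings, i.e. zero words.
    return df[col].fillna("").astype(str).str.split().str.len()


# Tiny illustrative frame; the real input would be the compiled ELRC DataFrame.
sample = pd.DataFrame({
    "target_sentence": [
        "maidir le faisnéis do shaoránaigh",
        "Bearta tacaíochta na hÉireann",
        None,
    ]
})

counts = word_counts(sample)
total_words = int(counts.sum())

print(total_words)        # total word count across all rows
print(counts.describe())  # summary statistics of the per-sentence distribution
```

On the full dataset, `counts.hist(bins=50)` would plot the distribution, assuming matplotlib is installed.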

Code to Extract to a Pandas DataFrame

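from tmx2dataframe import tmx2dataframe

# Read a single TMX file into a DataFrame of aligned sentence pairs
metadata, df = tmx2dataframe.read('elrc/citizens_information_en-ga.tmx')
print(len(df))
df.head()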

10297
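
For orientation, the sketch below shows roughly what a TMX file contains and how sentence pairs can be pulled out of it with only the standard library. It is not how `tmx2dataframe` is implemented; the inline sample document and the helper name `tmx_pairs` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# A hand-written, minimal TMX document (illustrative only).
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" o-tmf="none" creationtool="example"
          creationtoolversion="1"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>about Citizens Information</seg></tuv>
      <tuv xml:lang="ga"><seg>maidir le faisnéis do shaoránaigh</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def tmx_pairs(xml_text, src="en", tgt="ga"):
    """Return (source, target) sentence pairs from TMX text."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    pairs = []
    for tu in root.iter("tu"):  # one translation unit per aligned pair
        segs = {}
        for tuv in tu.findall("tuv"):  # one variant per language
            code = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            primary = code.split("-")[0]  # 'ga-IE' -> 'ga'
            seg = tuv.find("seg")
            if seg is not None:
                segs[primary] = "".join(seg.itertext()).strip()
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


pairs = tmx_pairs(SAMPLE_TMX)
print(pairs)
```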

Code to Iterate over and Extract all downloaded `.tmx` files

import gc
from pathlib import Path

# progress_bar is assumed to come from the fastprogress package
from fastprogress.fastprogress import progress_bar
from tmx2dataframe import tmx2dataframe

lang = 'ga'
dir_path = Path('elrc')
samp_count = 0
for f in progress_bar(list(dir_path.iterdir())):
    if f.suffix != '.tmx':
        continue
    try:
        _, df = tmx2dataframe.read(str(f))
    except Exception:
        # Skip files that cannot be parsed
        # print(f"Couldn't open {f}")
        continue

    # Keep rows whose target_language contains the language string (like 'ga')
    df.target_language = df.target_language.str.lower()
    ga_df = df[df.target_language.str.contains(lang, na=False)].copy()
    if len(ga_df) == 0:
        # print(f'No {lang} text found in {f} ?')
        continue

    ga_df['filepath'] = str(f)
    ga_df.reset_index(inplace=True, drop=True)
    samp_count += len(ga_df)
    # Write one CSV per source file, named after the lower-cased .tmx path
    ga_df.to_csv(f'{str(f).lower()}.csv')
    del df, ga_df
    gc.collect()

print(f'{samp_count} total text samples extracted')

34235 total text samples extracted

Compile Saved CSVs

from pathlib import Path

import pandas as pd
# progress_bar is assumed to come from the fastprogress package
from fastprogress.fastprogress import progress_bar

dir_path = Path('elrc')

# Collect the per-file CSVs written in the previous step
f_list = [f for f in dir_path.iterdir() if f.suffix == '.csv']

# Read each CSV exactly once so that no file's rows are counted twice
frames = []
for f in progress_bar(f_list):
    try:
        frames.append(pd.read_csv(f, index_col=0))
    except Exception:
        print(f'Error with opening {f}')

ga_df = pd.concat(frames, ignore_index=True)
print(len(ga_df))
ga_df.to_csv('elrc_en-ga_compiled_2020-06-11.csv', index=False)
ga_df.head()

34243

Number of source documents:

33

Output of `df.head()` for the single-file example:

	source_language	source_sentence	target_language	target_sentence
0	en	about Citizens Information	ga	maidir le faisnéis do shaoránaigh
1	en	the Citizens Information Board is the statutor...	ga	is é an Bord um fhaisnéis do shaoránaigh ( BFS...
2	en	it provides the Citizens Information website ,...	ga	cuireann sé an láithreán gréasáin um fhaisnéis...
3	en	it also funds and supports the Money Advice an...	ga	cuireann sé maoiniú agus tacaíocht ar fáil fre...
4	en	Citizensinformation.ie provides comprehensive ...	ga	cuireann citizensinformation.ie faisnéis chuim...

Output of `ga_df.head()` for the compiled data:

	source_language	source_sentence	target_language	target_sentence	filepath
0	en	Press release31 March 2020Brussels	ga	Preaseisiúint March 31, 2020An Bhruiséil	elrc/covid19_eu_presscorner_en-ga.tmx
1	en	State aid: Coronavirus: Irish Repayable Advanc...	ga	Státchabhair: An coróinvíreas: Scéim Réamhíoca...	elrc/covid19_eu_presscorner_en-ga.tmx
2	en	(i) Direct grants, selective tax advantages an...	ga	(i) an deontas díreach, buntáistí cánach roghn...	elrc/covid19_eu_presscorner_en-ga.tmx
3	en	(i) Direct grants, equity injections, selectiv...	ga	(i) Deontais dhíreacha, instealltaí cothromais...	elrc/covid19_eu_presscorner_en-ga.tmx
4	en	State aid_coronavirus_IrelandThe European Comm...	ga	Bearta tacaíochta na hÉireann	elrc/covid19_eu_presscorner_en-ga.tmx

Number of lines per source document:

filepath	lines
elrc/citizens_information_en-ga.tmx	10297
elrc/Tuarascalaca_Bliantula_na_Roinne_Leanai_agus_Gnothai_Oige_en_ga_clean.tmx	2954
elrc/Tuarascail_Bhliantuil_Chomhairle_Chontae_Longfoirt_2017_en_ga_clean.tmx	2646
elrc/medical_domain_en-ga.tmx	1289
elrc/website_parallel_corpus_2259.en-ga.tmx	1134
elrc/Programme_for_Government_Annual_Report_2013_en_ga_clean.tmx	1020
elrc/Preasraitis_Gaois_Fiontar_Scoil_na_Gaeilge_DCU_1_en_ga_clean.tmx	975
elrc/Raitis_Airgeadais_Ollscoil_Mha_Nuad_2017-2018_en_ga_clean.tmx	677
elrc/Raitis_Airgeadais_Oifig_an_Choimisineara_Teanga_en_ga_clean.tmx	487
elrc/eu_vacination_portal_en-ga.tmx	359
elrc/Press_Releases_from_Department_of_Children_January-May_2019_en_ga_clean.tmx	353
elrc/coimisineir_teanga_web_corpus.tmx	321
elrc/Oifigi_Ombudsman_in_Eirinn_en_ga_clean.tmx	249
elrc/Leabhran_dAonad_Altranais_Pobail_Teach Uí_Riada_en_ga_clean.tmx	220
elrc/Tearmaiocht_agus_aistriucha_in_a_bhaineann_le_fograi_poist_foluntais_abhair_chomortha_1916_agus_eolas_ginearalta_ar_Oifig_na_Gaeilge_en_ga_clean.tmx	188
elrc/Polsasi_ar_Fheiniulacht_agus_Leiriu_Inscne_Ollscoil_Mha_Nuad 2019 _en_ga_clean.tmx	177
elrc/Preasraitis_Gaois_Fiontar_Scoil_na_Gaeilge_DCU_2_en_ga_clean.tmx	162
elrc/Preasraitis_Oifig_an_Choimisinéara_Teanga_en_ga_clean.tmx	91
elrc/Preasraitis_Ollscoil_Mha_Nuad_Earrach_2019_en_ga_clean.tmx	71
elrc/Tuairisc_a_thug_Maire_Nic_Shiubhlaigh_en_ga_clean.tmx	50
elrc/Preasraiteas_Mi_Iuil_en_ga_clean.tmx	40
elrc/Preasraitis_Ollscoil_Mha_Nuad_Samhradh 2019_en_ga_clean.tmx	38
elrc/Preasraiteas_faoi_foirgneamh_nua_scoile_en_ga_clean.tmx	22
elrc/Litir_ó_Oifig_an_Choimisinéara_Teanga_en_ga_clean.tmx	22
elrc/Faisnéis faoi IDS_en_ga_clean.tmx	20
elrc/Toiliu_don_Scagthastail_Scoile_um_Amhairc_Eisteachta_en_ga_clean.tmx	19
elrc/covid19_eu_presscorner_en-ga.tmx	16
elrc/covid19_europarl_v1_en-ga.tmx	13
elrc/Pleananna_ITBAC_le_comóradh_a_dheanamh_ar_1916_en_ga_clean.tmx	10
elrc/Foirm FSS Iarratais Duine ar a Shonraí _en_ga_clean.tmx	10
elrc/Postaer_faoi_scoil_ag_claru_en_ga_clean.tmx	6
elrc/Preasraiteas_faoi_Uachtarán_nua_en_ga_clean.tmx	5
elrc/covid19_europarl_v2_en-ga.tmx	5
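
The code behind the document count and the per-document line counts is not shown. A minimal sketch of one way to obtain them, assuming the compiled DataFrame keeps the `filepath` column added during extraction; the small frame and its file names below are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical stand-in for the compiled DataFrame
ga_df = pd.DataFrame({
    "target_sentence": ["a", "b", "c", "d"],
    "filepath": [
        "elrc/doc_a.tmx",
        "elrc/doc_a.tmx",
        "elrc/doc_a.tmx",
        "elrc/doc_b.tmx",
    ],
})

# Number of distinct source documents
n_docs = ga_df["filepath"].nunique()

# Lines per source document, largest first
lines_per_doc = (
    ga_df.groupby("filepath")
    .size()
    .sort_values(ascending=False)
)

print(n_docs)
print(lines_per_doc)
```

Summing `lines_per_doc` gives the total number of lines, which is a quick check that no document was counted twice when the CSVs were compiled.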