Available for Download ✅

⚠️ Always check the license of the data source before using the data ⚠️

Main page: https://tatoeba.org/eng
Data Browse Link: https://tatoeba.org/eng/downloads
Kaggle notebook showing how to Download: https://www.kaggle.com/alvations/how-to-get-parallel-sentences-from-tatoeba
Github: https://github.com/Tatoeba/tatoeba2
Format: .tsv and .csv

Brief Description

Tatoeba is a large database of sentences and translations. Its content is ever-growing and results from the voluntary contributions of thousands of members.

Tatoeba provides a tool for you to see examples of how words are used in the context of a sentence. You specify words that interest you, and it returns sentences containing these words with their translations in the desired languages. The name Tatoeba ("for example" in Japanese) captures this concept.

Other Notes

Getting a parallel Irish-English corpus involves downloading and joining several files: the Irish sentences file, the English sentences file, a Links file that maps one to the other, and a Users file that provides the self-reported skill level of the person who added the translation.

Lines of text: 1,973
GA Word count: 10,352

Word Count Distribution

Let's take a quick look at the word count distribution for Irish. Most of the sentences turn out to be very short.

Code to Extract to a Pandas DataFrame

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('tatoeba/gle_sentences_detailed.tsv', sep='\t', header=None)
df.columns = ['id', 'lang', 'ga', 'username', 'date_added', 'date_modified']
df['ga_len'] = df.ga.str.split().str.len()
df.head()
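
The line count, word count, and word count distribution quoted above can be derived from the same whitespace split used for `ga_len`. A minimal sketch, using five sentences copied from the sample output on this page as a stand-in for the full file; `summarise_lengths` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd

def summarise_lengths(sentences):
    """Return (line count, total word count, number of sentences per word count)."""
    s = pd.Series(sentences, dtype="object")
    lengths = s.str.split().str.len()                    # whitespace-delimited words per sentence
    distribution = lengths.value_counts().sort_index()   # index: word count, value: number of sentences
    return len(s), int(lengths.sum()), distribution

# Toy stand-in for the `ga` column
sample = ["Táim i ngrá leat.", "Ní chanaim.", "Tá tart orm.", "Déanta.", "Tá sé an-dorcha."]
n_lines, n_words, dist = summarise_lengths(sample)
print(n_lines, n_words)
print(dist.to_dict())
```

On the real data, the same function could be called as `summarise_lengths(df.ga)`.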

Load the remaining files. All of them can be downloaded from the Tatoeba downloads page: https://tatoeba.org/eng/downloads

# english sentences
en_df = pd.read_csv('tatoeba/eng_sentences_detailed.tsv', sep='\t', header=None)
en_df.columns = ['en_id', 'lang', 'en', 'username', 'date_added', 'date_modified']

# translation links files
l_df = pd.read_csv('tatoeba/links.csv', sep='\t', header=None)
l_df.columns = ['id1', 'id2']

# tags file - Not super helpful for Irish as not many tags
# t_df = pd.read_csv('tatoeba/tags.csv', sep='\t', header=None)
# t_df.columns = ['id', 'tag']

# User languages and self-reported skill level
u_df = pd.read_csv('tatoeba/user_languages.csv', sep='\t', header=None)
u_df.columns = ['user_lang', 'skill', 'user', 'details']
u_df = u_df.query('user_lang == "gle"').copy()   # keep Irish (gle) rows only; copy before modifying
u_df.loc[u_df.skill=='\\N', 'skill'] = -1   # '\N' marks an unreported skill level; encode it as -1
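
As an alternative to replacing `'\\N'` after loading, the null marker can be declared when the file is read, so the skill column is parsed as numeric from the start. A minimal sketch, assuming the file uses `\N` for missing values as the code above suggests; the usernames and rows are invented for illustration, and `io.StringIO` stands in for `tatoeba/user_languages.csv`.

```python
import io
import pandas as pd

# Toy stand-in for user_languages.csv (tab-separated, no header row).
# Usernames and details are made up for this example.
raw = (
    "gle\t4\talice\tIrish teacher\n"
    "gle\t\\N\tbob\t\\N\n"
    "fra\t5\tcarol\t\\N\n"
)

cols = ['user_lang', 'skill', 'user', 'details']
u_df = pd.read_csv(io.StringIO(raw), sep='\t', header=None, names=cols,
                   na_values=['\\N'])                  # treat the literal \N as missing
u_df = u_df[u_df['user_lang'] == 'gle'].copy()         # keep Irish (gle) rows only
u_df['skill'] = u_df['skill'].fillna(-1).astype(int)   # -1 marks an unreported skill level
print(u_df)
```

With this approach a missing `details` value also becomes `NaN` rather than the string `\N`.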

Merge

Merge the other files into the Irish sentences DataFrame.

# ga to translation links
df = df.merge(l_df, left_on='id', right_on='id1')
# merge english
df = df.merge(en_df[['en_id','en']], left_on='id2', right_on='en_id')
# merge tags
#df = df.merge(t_df, left_on='id', right_on='id', how='left')
# merge users and skill level
df = df.merge(u_df, left_on='username', right_on='user', how='left')
# users without a gle entry get -1; cast to int so the column is numeric
df['skill'] = pd.to_numeric(df['skill']).fillna(-1).astype(int)
df = df[['id', 'en_id', 'lang', 'ga', 'en', 'ga_len','skill', 'details']]
df.head()
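
To make the join logic easier to follow, here is a toy version of the two merges with in-memory frames. The ids and sentences are copied from the sample output on this page. The sketch assumes the links file stores every link in both directions (id1 to id2 and id2 to id1), which is why joining on `id1` alone is enough.

```python
import pandas as pd

# Toy stand-ins for the Irish sentences, English sentences, and links files
ga = pd.DataFrame({'id': [557579, 934942, 557291],
                   'ga': ['Táim i ngrá leat.', 'Tá grá agam duit.',
                          'Cá bhfuil críochfort na mbus?']})
en = pd.DataFrame({'en_id': [1434, 35406],
                   'en': ['I love you.', 'Where is the bus terminal?']})
# Assumption: each link appears twice, once per direction
links = pd.DataFrame({'id1': [557579, 1434, 934942, 1434, 557291, 35406],
                      'id2': [1434, 557579, 1434, 934942, 35406, 557291]})

pairs = (ga.merge(links, left_on='id', right_on='id1')   # keep links that start at an Irish sentence
           .merge(en, left_on='id2', right_on='en_id')   # keep links that end at an English sentence
           [['id', 'en_id', 'ga', 'en']])
print(pairs)
```

Both merges are inner joins, so Irish sentences without an English translation drop out, and one English sentence (here 1434) can appear in several rows when it has more than one Irish translation.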

The self-reported skill distribution shows that most contributors have not reported their Irish skill level (encoded here as -1).

sns.histplot(df.skill.astype(int), discrete=True)  # histplot replaces the deprecated distplot
plt.title('Self-reported skill distribution');
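
The same distribution can also be checked numerically, which is handy when no plot is available. A minimal sketch on a made-up `skill` column; the values are illustrative only and do not reflect the real proportions in the corpus.

```python
import pandas as pd

# Toy stand-in for the merged `skill` column (-1 means not reported)
skill = pd.Series([-1, -1, 4, 3, -1, 0, -1, 5], name='skill')

counts = skill.value_counts().sort_index()   # number of sentences per skill level
share_unreported = (skill == -1).mean()      # fraction of sentences with no reported level
print(counts.to_dict())
print(f"{share_unreported:.0%} unreported")
```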

Save the file and we're done!

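# the processed_data directory must already exist; index=False leaves out the row index
df.to_csv('processed_data/tatoeba_en-ga_20200612.csv', index=False)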

A few more samples

df.sample(50)
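
`df.sample(50)` returns different rows on every run. If a repeatable sample is wanted, a seed can be passed through `random_state`. A minimal sketch on a four-row toy frame built from sentences on this page; the seed value 42 is an arbitrary choice.

```python
import pandas as pd

df_demo = pd.DataFrame({'ga': ['Déanta.', 'Ní chanaim.', 'Tá tart orm.', 'Ádh mór!'],
                        'en': ['Done.', 'I do not sing.', "I'm thirsty.", 'Good luck!']})

n = min(50, len(df_demo))                     # sample() raises if n exceeds the row count without replace=True
first = df_demo.sample(n, random_state=42)    # same seed ...
second = df_demo.sample(n, random_state=42)   # ... same rows in the same order
print(first.equals(second))
```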

Output of the first df.head() (Irish sentences only):
	id	lang	ga	username	date_added	date_modified	ga_len
0	557291	gle	Cá bhfuil críochfort na mbus?	niq	2010-10-10 13:17:41	2010-10-10 13:17:41	5
1	557299	gle	Nuair a dhúisigh mé, bhí brón orm.	niq	2010-10-10 13:20:49	2010-10-10 13:20:49	7
2	557533	gle	Tosaíonn an t-oideachas sa bhaile.	niq	2010-10-10 14:39:19	2010-10-10 14:39:19	5
3	557579	gle	Táim i ngrá leat.	niq	2010-10-10 14:48:53	2010-10-25 12:30:14	4
4	557591	gle	Glanaimid ár rang tar éis scoile.	niq	2010-10-10 14:52:42	2010-10-10 15:10:57	6

Output of df.head() after the merge:
	id	en_id	lang	ga	en	ga_len	skill	details
0	557291	35406	gle	Cá bhfuil críochfort na mbus?	Where is the bus terminal?	5	-1	NaN
1	557299	1361	gle	Nuair a dhúisigh mé, bhí brón orm.	When I woke up, I was sad.	7	-1	NaN
2	557533	19122	gle	Tosaíonn an t-oideachas sa bhaile.	Education starts at home.	5	-1	NaN
3	557579	1434	gle	Táim i ngrá leat.	I love you.	4	-1	NaN
4	934942	1434	gle	Tá grá agam duit.	I love you.	4	-1	NaN

Output of df.sample(50):
	id	en_id	lang	ga	en	ga_len	skill	details
1016	3602540	703243	gle	Táim ag labhairt le mo mhac léinn.	I'm speaking with my student.	7	-1	NaN
1590	5599832	1357603	gle	Déanta.	Done.	1	-1	NaN
1702	6319314	2014783	gle	Cén fáth a mbeimis ag iarraidh pionós a chur ort?	Why would we want to punish you?	10	4	NaN
1055	3603017	3603008	gle	Tá gabhlóg anseo.	There is a fork here.	3	-1	NaN
141	871635	5152872	gle	Chonaic sé seanchara an tseachtain seo caite n...	Last week he saw an old friend whom he hadn't ...	12	3	NaN
397	2610940	2604279	gle	Céard is ábhar taighde don tSoivéideolaí?	What does a Sovietologist study?	6	-1	NaN
649	3128067	1079842	gle	Tá mé ag léamh an nuachtán.	I'm reading the newspaper.	6	-1	NaN
409	2712366	2705597	gle	Ní chanaim.	I do not sing.	2	-1	NaN
363	2150800	1476581	gle	Tá sé an-dorcha.	It's very dark.	3	-1	NaN
1671	6319282	2014752	gle	Dúirt Tom go raibh comhluadar uaidh.	Tom said he wanted some company.	6	4	NaN
911	3601095	1784975	gle	Níl a fhios agam cén fáth.	I don't know why.	6	-1	NaN
243	873503	873502	gle	Sin í an bhean a bhfanann siad léi.	That is the woman they stay with.	8	3	NaN
1265	3944944	3868719	gle	Cén teanga atá á labhairt aige?	What language is he speaking?	6	-1	NaN
1313	3961159	463294	gle	Is duine é seo.	This is a person.	4	-1	NaN
1934	8239950	273600	gle	Go raibh maith agat roimh ré.	Thanks in advance.	6	3	NaN
1753	7075957	989164	gle	Léim an leabhar.	I read the book.	3	0	NaN
1687	6319298	2014768	gle	Nílimid ag iarraidh ach rudaí a dhíol leat.	We just want to sell you things.	8	4	NaN
407	2712330	2684430	gle	An bhfuil mé do chara?	Am I your friend?	5	-1	NaN
362	2150798	1615217	gle	Tá sé an-tirim.	It's very dry.	3	-1	NaN
299	874891	874890	gle	Nach rabhthas sásta?	Weren't they satisfied?	3	-1	NaN
543	5599829	1053192	gle	Bígí cúramach!	Careful!	2	-1	NaN
987	3602422	2361385	gle	Níl teileafón agam.	I don't have a telephone.	3	-1	NaN
1929	8239940	772806	gle	Níl a fhios ag aon duine cá bhfuil sé.	Nobody knows where it is.	9	3	NaN
948	3601175	2549673	gle	Tháinig sibh ar ais.	You came back.	4	-1	NaN
1539	5516952	5516950	gle	Sin bealach amháin le breathnú air is docha.	That's one way of looking at it, I suppose.	8	1	NaN
1215	3896264	2700686	gle	Níl aon fhadhb ann.	There is no problem.	4	-1	NaN
410	2712386	4969010	gle	Tá an leabhar ar an sheilf.	The book is on the shelf.	6	-1	NaN
681	3233354	2002544	gle	Tá an seomra dorcha.	The room is dark.	4	-1	NaN
198	871759	871758	gle	Conas mar a rinne tú é?	How did you do it?	6	-1	NaN
523	2715102	2659060	gle	Scríobh sí litir.	She wrote a letter.	3	-1	NaN
1363	4445009	2474700	gle	Bhí mé díomách sin.	I was so disappointed.	4	3	Níl Gaeilge líofa agam, ach tá a fhios agam a ...
1722	6319355	1126729	gle	Nuair a thugaim cuairt ar mo gharmhac, tugaim ...	When I go to see my grandson, I always give hi...	14	4	NaN
662	3128092	2297248	gle	Tá sé ag scríobh leabhair.	He's writing a book.	5	-1	NaN
1895	7804899	7926273	gle	Amach leat!	Out you go!	2	3	Caint as Cúige Uladh
855	3599611	3599609	gle	Tá Spáinnis aici.	She knows Spanish.	3	-1	NaN
1090	3603161	2363944	gle	Is cailín mé.	I am a girl.	3	-1	NaN
1456	4773304	5127613	gle	Caithfidh mé péire bróg nua a cheannach.	I must buy a new pair of shoes.	7	4	Irish teacher for 20+ years.
1827	7801402	429220	gle	Ádh mór!	Good luck!	2	3	Caint as Cúige Uladh
1688	6319299	2014769	gle	Ba mhaith linn rud a phlé le Tom.	We want to have a word with Tom.	8	4	NaN
1760	7290675	7290677	gle	Gheofá bainne a bhaint as na ba.	You could get milk from the cows.	7	5	NaN
1587	5599801	393357	gle	Mas é bhur dtoil é.	Please.	5	-1	NaN
700	3335788	16255	gle	Cad tá uait?	What are you looking for?	3	-1	NaN
1824	7801398	348091	gle	Tar isteach.	Come in.	2	3	Caint as Cúige Uladh
1953	8290368	1192601	gle	Tá ríomhaire uaim.	I want a computer.	3	3	Caint as Cúige Uladh
1126	3604289	60147	gle	Is liomsa an teach seo.	This house is mine.	5	-1	NaN
949	3601177	3419582	gle	Tá mo dheartháir níos láidre ná mé.	My brother is stronger than me.	7	-1	NaN
1580	5599784	39996	gle	Tá brón orm...	Sorry...	3	-1	NaN
451	2714581	1814	gle	Tá tart orm.	I'm thirsty.	3	-1	NaN
845	3599522	3591019	gle	Ní ach socrú sealadach é.	It's only a temporary fix.	5	-1	NaN
138	871633	5152871	gle	Tá fear ag an doras atá ag iarraidh caint leat.	There's a man at the door who's asking to spea...	10	3	NaN