Tatoeba
A collection of sentences and translations, crowdsourced, collaborative, open and free
Available for Download ✅
⚠️ Always check the license of the data source before using the data ⚠️
- Main page: https://tatoeba.org/eng
- Data Browse Link: https://tatoeba.org/eng/downloads
- Kaggle notebook showing how to download the data: https://www.kaggle.com/alvations/how-to-get-parallel-sentences-from-tatoeba
- GitHub: https://github.com/Tatoeba/tatoeba2
- Format: .tsv and .csv
Brief Description
Tatoeba is a large database of sentences and translations. Its content is ever-growing and results from the voluntary contributions of thousands of members.
Tatoeba lets you see how words are used in the context of a sentence: you specify the words that interest you, and it returns sentences containing them, along with their translations in the languages you choose. The name Tatoeba, Japanese for "for example", captures this concept.
Other Notes
Building a parallel Irish-English corpus involves downloading and joining several files: the Irish sentences file, the English sentences file, a Links file that maps sentences in one to their translations in the other, and a Users file that gives the self-reported skill level of each contributor.
- Lines of text: 1,973
- GA Word count: 10,352
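Counts like these can be reproduced along the following lines. This is a minimal sketch on two made-up sentences, assuming a "word" is a whitespace-delimited token (the same definition the ga_len column uses); the toy frame is a stand-in, not the real data.

```python
import pandas as pd

# Toy Irish sentences standing in for the real gle_sentences_detailed.tsv rows
toy = pd.DataFrame({'ga': ['Dia duit.', 'Go raibh maith agat.']})

n_lines = len(toy)                                  # one row per sentence
n_words = int(toy.ga.str.split().str.len().sum())   # whitespace-delimited tokens

print(f'Lines of text: {n_lines:,}')
print(f'GA word count: {n_words:,}')
```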
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('tatoeba/gle_sentences_detailed.tsv', sep='\t', header=None)
df.columns = ['id', 'lang', 'ga', 'username', 'date_added', 'date_modified']
df['ga_len'] = df.ga.str.split().str.len()
df.head()
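One thing worth checking when loading these files: sentences can contain double-quote characters, which the default CSV parser treats as field delimiters. The sketch below assumes the Tatoeba exports are written without any quoting or escaping (worth confirming on the downloads page) and shows how csv.QUOTE_NONE keeps such sentences intact. The rows and usernames are invented for illustration.

```python
import csv
import io

import pandas as pd

# Toy rows in the sentences_detailed layout; the second sentence opens with a quote character
raw = (
    '1\tgle\tDia duit.\tuser_a\t2020-01-01\t2020-01-01\n'
    '2\tgle\t"Slán," a dúirt sé.\tuser_b\t2020-01-02\t2020-01-02\n'
)
cols = ['id', 'lang', 'ga', 'username', 'date_added', 'date_modified']

# QUOTE_NONE tells the parser to treat '"' as an ordinary character
toy_df = pd.read_csv(io.StringIO(raw), sep='\t', header=None, names=cols,
                     quoting=csv.QUOTE_NONE)
print(toy_df[['id', 'ga']])
```

If the row count of the loaded file differs from the number of lines in the file on disk, quoting is a likely culprit.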
Load the remaining files. All of them can be downloaded from the Tatoeba downloads page: https://tatoeba.org/eng/downloads
# english sentences
en_df = pd.read_csv('tatoeba/eng_sentences_detailed.tsv', sep='\t', header=None)
en_df.columns = ['en_id', 'lang', 'en', 'username', 'date_added', 'date_modified']
# translation links files
l_df = pd.read_csv('tatoeba/links.csv', sep='\t', header=None)
l_df.columns = ['id1', 'id2']
# tags file - Not super helpful for Irish as not many tags
# t_df = pd.read_csv('tatoeba/tags.csv', sep='\t', header=None)
# t_df.columns = ['id', 'tag']
# User languages and self-reported skill level
u_df = pd.read_csv('tatoeba/user_languages.csv', sep='\t', header=None)
u_df.columns = ['user_lang', 'skill', 'user', 'details']
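u_df = u_df.query('user_lang == "gle"').copy() # keep Irish ("gle") rows only; copy so the assignment below does not act on a slice
u_df.loc[u_df.skill=='\\N', 'skill'] = -1 # '\N' marks an unreported skill level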
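As an alternative sketch, the '\N' placeholder can be turned into a missing value at load time with the na_values parameter of read_csv, which also leaves the skill column numeric instead of a mix of strings and integers. The toy rows and usernames below are invented; only the column layout follows the file loaded above.

```python
import io

import pandas as pd

# Toy stand-in for user_languages.csv: lang, skill, username, details
raw = (
    'gle\t5\tuser_a\t\\N\n'
    'gle\t\\N\tuser_b\t\\N\n'
    'fra\t3\tuser_c\t\\N\n'
)
toy_u = pd.read_csv(io.StringIO(raw), sep='\t', header=None,
                    names=['user_lang', 'skill', 'user', 'details'],
                    na_values=['\\N'])            # parse '\N' as NaN

toy_u = toy_u[toy_u.user_lang == 'gle'].copy()     # Irish rows only
toy_u['skill'] = toy_u['skill'].fillna(-1).astype(int)  # -1 = not reported
print(toy_u)
```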
# ga to translation links
df = df.merge(l_df, left_on='id', right_on='id1')
# merge english
df = df.merge(en_df[['en_id','en']], left_on='id2', right_on='en_id')
# merge tags
#df = df.merge(t_df, left_on='id', right_on='id', how='left')
# merge users and skill level
df = df.merge(u_df, left_on='username', right_on='user', how='left')
df.loc[df.skill.isna(), 'skill'] = -1
df = df[['id', 'en_id', 'lang', 'ga', 'en', 'ga_len','skill', 'details']]
df.head()
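Because one Irish sentence can link to several English sentences, to sentences in other languages, or to none at all, the merged frame need not contain exactly one row per Irish sentence. A quick sanity check might look like the sketch below, which uses hypothetical ids and toy tables in place of the real files.

```python
import pandas as pd

# Toy tables; ids are made up. Sentence 1 has two English translations and one
# link to a non-English sentence (20); sentence 3 has no links at all.
ga = pd.DataFrame({'id': [1, 2, 3], 'ga': ['Dia duit.', 'Slán.', 'Fáilte.']})
en = pd.DataFrame({'en_id': [10, 11, 12], 'en': ['Hello.', 'Goodbye.', 'Hi.']})
links = pd.DataFrame({'id1': [1, 1, 1, 2], 'id2': [10, 12, 20, 11]})

# Same inner joins as the main pipeline
pairs = (ga.merge(links, left_on='id', right_on='id1')
           .merge(en, left_on='id2', right_on='en_id'))

# Irish sentences that dropped out of the corpus entirely
missing_ids = ga.loc[~ga.id.isin(pairs.id), 'id'].tolist()

# Irish sentences that appear more than once (several English translations)
per_sentence = pairs.groupby('id').size()
multi_ids = per_sentence[per_sentence > 1].index.tolist()

print(f'{len(pairs)} pairs, missing: {missing_ids}, multiple: {multi_ids}')
```

Whether to keep every translation or only one per Irish sentence depends on how the corpus will be used, so the sketch only reports the counts.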
Plotting the self-reported skill distribution shows that most contributors have not reported their Irish skill level (recorded here as -1):
sns.histplot(pd.to_numeric(df.skill), discrete=True) # histplot replaces distplot, which is deprecated since seaborn 0.11
plt.title('Self-reported skill distribution');
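The same picture can be read off as numbers. The sketch below tabulates a toy skill column (the values are invented, and deliberately mix integers and strings as can happen after the merge) and computes the share of rows with no reported skill.

```python
import pandas as pd

# Toy skill values: -1 means "not reported"; '4' mimics a value read as text
toy_skill = pd.Series([-1, -1, 5, '4', -1], name='skill')

numeric_skill = pd.to_numeric(toy_skill)               # coerce everything to numbers
skill_counts = numeric_skill.value_counts().sort_index()
share_unreported = (numeric_skill == -1).mean()

print(skill_counts)
print(f'Unreported: {share_unreported:.0%}')
```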
Save the file and we're done!
df.to_csv('processed_data/tatoeba_en-ga_20200612.csv')
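Two details can trip up the save step: to_csv does not create a missing output directory, and by default it writes the row index as an extra column that reappears as "Unnamed: 0" on reload. A minimal sketch of a round trip that handles both, using a temporary directory and a toy frame (the file name here is hypothetical):

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'id': [1], 'en_id': [10], 'ga': ['Dia duit.'], 'en': ['Hello.']})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, 'processed_data')
    os.makedirs(out_dir, exist_ok=True)        # to_csv will not create this for us
    out_path = os.path.join(out_dir, 'toy_en-ga.csv')

    toy.to_csv(out_path, index=False)          # skip the row index column
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```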
A few more samples
df.sample(50)
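In the spirit of Tatoeba's own word search, the processed pairs can also be queried for example sentences containing a given word. The helper below, find_examples, is a hypothetical name introduced for illustration, and the toy frame stands in for the real corpus.

```python
import re

import pandas as pd


def find_examples(pairs, word, column='ga'):
    """Return rows whose `column` contains `word` as a whole word, ignoring case."""
    pattern = rf'\b{re.escape(word)}\b'        # escape so the word is matched literally
    mask = pairs[column].str.contains(pattern, case=False, regex=True)
    return pairs[mask]


# Toy sentence pairs standing in for the processed corpus
toy_pairs = pd.DataFrame({
    'ga': ['Dia duit.', 'Go raibh maith agat.', 'Tá sé go maith.'],
    'en': ['Hello.', 'Thank you.', 'It is good.'],
})

hits = find_examples(toy_pairs, 'maith')
print(hits)
```

Passing column='en' searches the English side instead.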