Latin Parallel Translation Dataset

101k aligned Latin-English sentence pairs, a first of its kind

Use the dataset here!


The cornerstone of any machine learning model is its dataset. Unfortunately, prior to this project, no suitable dataset existed for this type of research. Hundreds of English translations of ancient Roman works are available, but we cannot train a model by saying “This is the Aeneid, figure it out.” The texts need to be aligned at the sentence level so that a model can learn the specific mappings between source and translation and be trained into a functional translation system.

In this project, I collected the first aligned parallel translation dataset between Latin and English. It consists of 101k translation pairs: 34k are sourced from the Latin Vulgate and aligned by book and verse, and the remainder are sourced from various Loeb Classical Library volumes and aligned by hand.
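Verse-keyed alignment of the kind used for the Vulgate portion can be sketched roughly as follows. This is a minimal illustration rather than the project's actual pipeline: the function name `align_by_verse`, the `(book, chapter, verse)` key layout, and the abbreviated sample texts are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical key layout: (book, chapter, verse). The write-up only says the
# Vulgate pairs were aligned by book and verse.
VerseKey = Tuple[str, int, int]


def align_by_verse(
    latin: Dict[VerseKey, str],
    english: Dict[VerseKey, str],
) -> List[Tuple[VerseKey, str, str]]:
    """Pair Latin and English text that share the same verse reference."""
    # Only references present on both sides can form a translation pair.
    shared = sorted(latin.keys() & english.keys())
    return [(key, latin[key], english[key]) for key in shared]


# Abbreviated, illustrative sample texts (not taken from the dataset itself).
latin_verses = {
    ("Genesis", 1, 1): "In principio creavit Deus caelum et terram.",
    ("Genesis", 1, 3): "Dixitque Deus: Fiat lux. Et facta est lux.",
}
english_verses = {
    ("Genesis", 1, 1): "In the beginning God created heaven, and earth.",
    ("Genesis", 1, 2): "And the earth was void and empty.",
}

pairs = align_by_verse(latin_verses, english_verses)
for key, la, en in pairs:
    print(key, "|", la, "|", en)
```

Verses that appear on only one side are dropped instead of being force-matched. For texts without a shared reference scheme, such as the Loeb volumes, this kind of key lookup is not available, which is why those pairs were aligned by hand.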

The dataset has been published to Hugging Face for easy replication and use in further research. This dataset is for research purposes only. Each sample is annotated with its index and source file (and therefore its author and work). If you find errors, please feel free to submit a PR to fix them.
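Because each sample carries its source file, per-work or per-author counts can be derived by grouping on that annotation. The sketch below runs on a few in-memory records; the field names (`id`, `file`, `la`, `en`), the file names, and the sample sentences are hypothetical stand-ins, so check them against the published dataset's actual schema before relying on them.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical records: field names, file names, and sentences are
# assumptions for illustration, not the published schema or data.
samples = [
    {"id": 0, "file": "vulgate/genesis.txt",
     "la": "Fiat lux.", "en": "Let there be light."},
    {"id": 1, "file": "vulgate/genesis.txt",
     "la": "Et facta est lux.", "en": "And there was light."},
    {"id": 2, "file": "caesar/gallic_war.txt",
     "la": "Gallia est omnis divisa in partes tres.",
     "en": "All Gaul is divided into three parts."},
]


def count_by_file(rows: Iterable[Dict[str, object]]) -> Counter:
    """Count how many translation pairs come from each source file."""
    return Counter(str(row["file"]) for row in rows)


counts = count_by_file(samples)
for source_file, n in counts.most_common():
    print(f"{source_file}: {n}")
```

Mapping file names to authors would need a lookup table specific to the dataset's real file naming, which is not assumed here.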

Distribution of Dataset by Author.

References

2023

  1. Latin English Parallel Translations
    Gil Rosenthal
    2023
  2. Machina Cognoscens: Neural Machine Translation for Latin, a Case-Marked Free-Order Language
    Gil Rosenthal
    2023