Latin Parallel Translation Dataset

Use the dataset here!

The cornerstone of any machine learning model is a dataset. Unfortunately, prior to this project, there was no suitable dataset for this type of research. There are hundreds of English translations available for ancient Roman works, but we cannot train a model by saying “This is the Aeneid, figure it out.” The texts need to be aligned at the sentence level, so that the model can learn the specific mappings between source and translation and become a functional translation model.

In this project, I collected the first aligned parallel translation dataset between Latin and English. It consists of 101k translation pairs: 34k are sourced from the Latin Vulgate and aligned by book and verse, and the remainder are sourced from various Loeb Classical Library volumes and aligned by hand.
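To make the book-and-verse alignment concrete, here is a minimal sketch of how verse-keyed texts can be paired. It is not the project's actual pipeline: the function name `align_by_verse`, the `(book, chapter, verse)` key layout, and the output field names are all assumptions made for illustration, and the two verses are toy data.

```python
from typing import Dict, List, Tuple

# Hypothetical key layout: (book, chapter, verse). The real dataset may key verses differently.
VerseKey = Tuple[str, int, int]


def align_by_verse(
    latin: Dict[VerseKey, str], english: Dict[VerseKey, str]
) -> List[dict]:
    """Pair Latin and English verses that share the same key."""
    pairs = []
    # Only keys present on both sides can form a translation pair.
    for key in sorted(latin.keys() & english.keys()):
        la, en = latin[key].strip(), english[key].strip()
        if la and en:  # skip verses that are empty on either side
            book, chapter, verse = key
            pairs.append(
                {"book": book, "chapter": chapter, "verse": verse, "la": la, "en": en}
            )
    return pairs


# Toy input: one verse exists on both sides, one only in Latin.
latin_verses = {
    ("Genesis", 1, 1): "In principio creavit Deus caelum et terram.",
    ("Genesis", 1, 2): "Terra autem erat inanis et vacua.",
}
english_verses = {
    ("Genesis", 1, 1): "In the beginning God created heaven, and earth.",
}

verse_pairs = align_by_verse(latin_verses, english_verses)
```

Verses missing from either side are dropped rather than guessed at, which is the conservative choice for training data. The Loeb volumes have no such shared key, which is why those pairs had to be aligned by hand.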

The dataset has been published to Hugging Face for easy replication and use in further research. It is for research purposes only. Each sample is annotated with its index and source file, which also identifies the author and work it comes from. If you find errors, please feel free to submit a PR to fix them.
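Because each sample carries its index and source file, the samples can be grouped by provenance. The following is a minimal sketch under stated assumptions: the field names (`index`, `file`, `la`, `en`), the file names, and the helper `count_by_file` are hypothetical, and the records are placeholders rather than real dataset rows.

```python
from collections import Counter
from typing import Iterable, Mapping

# Placeholder records; the field names and file names are assumptions, not the published schema.
samples = [
    {"index": 0, "file": "vulgate_genesis.txt", "la": "...", "en": "..."},
    {"index": 1, "file": "vulgate_genesis.txt", "la": "...", "en": "..."},
    {"index": 0, "file": "loeb_example.txt", "la": "...", "en": "..."},
]


def count_by_file(records: Iterable[Mapping]) -> Counter:
    """Count how many translation pairs come from each source file."""
    return Counter(record["file"] for record in records)


file_counts = count_by_file(samples)

# List the sources from largest to smallest.
for file_name, count in file_counts.most_common():
    print(f"{file_name}: {count}")
```

A per-file count like this is one way to produce a breakdown by author, provided each file maps to a single author or work.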

Distribution of Dataset by Author.
