AIneid

Neural methods have revolutionized automated machine translation: most widely spoken languages now have robust training datasets and near-human performance. However, these methods have not had the same effect on case-marked, free-order languages.

A free-order language is one with no fixed word order: the subject, verb, and object can appear anywhere in the sentence without violating the rules of the grammar. Case-marked means that additional information about a word, such as its number and function, is encoded in the word's morphological features, such as case or conjugation. Latin is one of these!
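To make the idea concrete, here is a minimal sketch in Python. The three-word lexicon and the `parse_roles` helper are made up for illustration and are not part of the thesis pipeline. It shows that every ordering of the Latin sentence "puella canem videt" ("the girl sees the dog") yields the same subject, verb, and object, because the roles are read from the word forms rather than from their positions.

```python
from itertools import permutations

# Toy lexicon (an assumption for illustration only):
# surface form -> (English lemma, grammatical role signalled by the ending)
LEXICON = {
    "puella": ("girl", "subject"),  # nominative singular
    "canem": ("dog", "object"),     # accusative singular
    "videt": ("see", "verb"),       # third person singular, present tense
}


def parse_roles(sentence):
    """Assign roles from the word forms alone, ignoring word position."""
    roles = {}
    for token in sentence.lower().split():
        lemma, role = LEXICON[token]
        roles[role] = lemma
    return roles


# All six orderings of the same three words.
orderings = [" ".join(p) for p in permutations(["puella", "canem", "videt"])]
parses = [parse_roles(s) for s in orderings]

for sentence, parse in zip(orderings, parses):
    print(f"{sentence!r}: {parse}")
```

An English-style model that leans on position to decide who did what gets no such signal here, which is part of why these languages are harder for standard NMT setups.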

For my Master’s thesis, I created the first Parallel Translation Dataset, consisting of roughly 100k pairs, and evaluated its performance in Neural Machine Translation (NMT), with novel preprocessing methods to encode morphology and new investigations into transfer learning. I achieved a best BLEU score of 22.4 on the test dataset, which beats the current state-of-the-art Google Translate model by over 4.2 BLEU, and published my preprocessing pipelines for further research use.
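For readers unfamiliar with BLEU, the sketch below shows roughly how the score is computed: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and scaled by a brevity penalty. It is a simplified, unsmoothed, single-reference version with whitespace tokenization, written for illustration. The function names are my own for this sketch, and it is not the scoring tool used for the reported numbers, so it should not be expected to reproduce them exactly.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale (one reference each)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())

    # Without smoothing, any n-gram order with zero matches gives BLEU = 0.
    if min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_precision)


perfect = corpus_bleu(["the girl sees the dog"], ["the girl sees the dog"])
partial = corpus_bleu(["the girl sees the big dog"], ["the girl sees the dog"])
print(perfect, partial)
```

On this 0-100 scale, a gap of 4.2 points between two systems on the same test set is a substantial difference.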

Further, I have created a freely available tool, hosted on the Hugging Face platform, for interacting with these models and understanding their text-processing pipelines.

On the left, the AIneid homepage. On the right, the "About" page that walks a user through how the NMT process is performed.

Read the report here!

Use the dataset here!

Use the tool here!
