Why is Google Translate so bad for Latin? A longish answer. : r/latin


Why is Google Translate so bad for Latin? A longish answer.

Why Latin Google Translate is so bad.

Google Translate for Latin is completely statistical. It has no model of grammar, syntax, or meaning. All it does is correlate sequences of up to five consecutive words in texts that have been manually translated into two or more languages.

More precisely, it has built up a hidden Markov model from all the manual translations that have been fed into it. Google calls this the Phrase-Based Machine Translation Model, or PBMTM. In November 2016, Google updated Translate to use the improved Neural Machine Translation Model for some languages, which does not work like this - but Latin is not among them.

Here's how the PBMTM works, roughly speaking. It assumes that people speak by randomly choosing one word after another, with the probabilities determined by the previous word spoken. For example, if you said "two", there is a certain probability that the next word will be "or". If you just said "or", there is a certain probability that the next word will be "butane". You can calculate an estimate of these probabilities by looking at all those texts which you fed in earlier. You can then generate random but just slightly coherent gibberish according to these probabilities:

Two

Two or

Two or butane

Two or butane gas

...

Two or butane gas can be attacked
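The word-by-word process above can be sketched in a few lines of Python. This is only a minimal illustration of a first-order Markov chain over words, not Google's code: the toy corpus and the function names `build_bigram_model` and `generate` are invented for the example.

```python
import random
from collections import defaultdict

def build_bigram_model(text):
    """Map each word to the list of words that followed it in the text.

    A follower that occurs more often appears more often in the list,
    so random.choice picks it with proportionally higher probability.
    """
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers

def generate(followers, start, max_words=8, seed=None):
    """Walk the chain: repeatedly pick a random follower of the last word."""
    rng = random.Random(seed)
    output = [start]
    while len(output) < max_words and followers.get(output[-1]):
        output.append(rng.choice(followers[output[-1]]))
    return " ".join(output)

# Toy corpus, invented for illustration only.
corpus = (
    "two or three people use butane gas . "
    "propane or butane gas can be stored . "
    "the fort can be attacked ."
)
model = build_bigram_model(corpus)
print(generate(model, "two", seed=1))
```

Every adjacent pair of words in the output occurs somewhere in the corpus, but nothing guarantees that the sentence as a whole makes sense.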

If you use a contextual 'window' of more words - say, the previous three, four, or five - to find the next one, the resulting gibberish will look more likely to have been written by a schizophrenic than by an aphasic. Here's an example, with the contextual window highlighted.

I heard about

I heard about six

I heard about six today

I heard about six today, we

I heard about six today, we seek

...

Notice how each bolded phrase could have been in a real sentence: "Yesterday

That's a Markov model. The "hidden" part adds some complexity, which I'll defer to the end of this post. The basic idea is: Google Translate's PBMTM tries to choose the most probable next word, based on estimates of probability derived from five-word sequences from the corpus it possesses of texts in the input and target language (not only actual texts, but also many crowd-sourced translations).
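As a rough sketch of "choose the most probable next word from a window of up to five words", the following Python counts followers for every context of one to five words and, when asked for a prediction, tries the longest matching context first. The back-off to shorter contexts is my assumption about how a system like this might cope with unseen sequences; the toy text and the names `build_ngram_counts` and `most_probable_next` are hypothetical.

```python
from collections import Counter, defaultdict

def build_ngram_counts(words, max_window=5):
    """Count which word follows each context of 1..max_window words."""
    counts = defaultdict(Counter)
    for i in range(1, len(words)):
        for k in range(1, max_window + 1):
            if i - k < 0:
                break
            context = tuple(words[i - k:i])
            counts[context][words[i]] += 1
    return counts

def most_probable_next(counts, history, max_window=5):
    """Return (word, window size used, estimated probability).

    The longest context is tried first; if it was never seen,
    fall back to a shorter one (an assumed strategy).
    """
    for k in range(min(max_window, len(history)), 0, -1):
        context = tuple(history[-k:])
        if context in counts:
            word, n = counts[context].most_common(1)[0]
            return word, k, n / sum(counts[context].values())
    return None, 0, 0.0

# Toy text, invented for illustration only.
words = "i heard about six cases . we heard about six today , we seek help .".split()
counts = build_ngram_counts(words)

print(most_probable_next(counts, "i heard about".split()))
print(most_probable_next(counts, "nobody heard about".split()))
```

In the second call the three-word context was never seen, so the prediction rests on just the last two words. That is the situation the model is in whenever the input does not line up with its corpus.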

What is Google Translate good for, then?

What Google Translate is most reliable for is translating documents produced by the United Nations between the languages in use there. This is because UN documents have provided a disproportionately large share of the manually translated texts from which Google Translate draws its five-word sequences. UN documents are readily available in many different languages: for example, all official United Nations documents, meeting records and correspondence at the UN Headquarters are translated into at least Arabic, Chinese, English, French, Russian, and Spanish.

Witness what happens when I type this in:

À l'exception de ce qui peut être convenu dans les accords particuliers de tutelle conclus conformément aux Articles 77, 79 et 81 et plaçant chaque territoire sous le régime de tutelle, et jusqu'à ce que ces accords aient été conclus, aucune disposition du présent Chapitre ne sera interprétée comme modifiant directement ou indirectement en aucune manière les droits quelconques d'aucun État ou d'aucun peuple ou les dispositions d'actes internationaux en vigueur auxquels des Membres de l'Organisation peuvent être parties.

It gives me:

Except as may be agreed upon in the special guardianship agreements concluded in accordance with Articles 77, 79 and 81 and placing each territory under the trusteeship system, and until such agreements have been concluded, This Chapter shall not be construed as directly or indirectly modifying in any way the rights of any State or any people or the provisions of international instruments in force to which Members of the Organization may be parties.

Perfect! (Almost).

This is one reason why its Latin translations tend to be so poor: it has a very thin corpus of human-made translations of Latin to base its hidden Markov models on—oh, and it's using hidden Markov models.

So, until the United Nations starts doing its business in Latin, Google Translate's statistical model is not going to do a very good job. And even then, don't expect much unless you're translating text pasted directly from UN documents.

Further details on translation for the curious.

A hidden Markov model adds "states". The speaker is assumed to randomly transition from one "state" to another, and each state has its own set of probabilities for what word it will "emit". Thus a hidden Markov model is a statistical guess about what are the most likely states, transition probabilities, and emission probabilities that would produce a given set of sequences—assuming that they were produced in this random fashion.

Google Translate therefore calculates: "Given that the author in language A just said (up to) these five words, what is the most likely state that the author is in? OK, now, from the corresponding state in language B, what is the most likely word to output next?"
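To make "what is the most likely state that the author is in?" concrete, here is a minimal sketch of the Viterbi algorithm, a standard way to find the most likely sequence of hidden states in an HMM. Whether Google Translate used exactly this procedure is not stated in the post; the two states, the vocabulary and all probabilities below are invented for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))

# Toy model with made-up numbers: each state "emits" words with some probability.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}
trans_p = {
    "NOUN": {"NOUN": 0.3, "VERB": 0.7},
    "VERB": {"NOUN": 0.6, "VERB": 0.4},
}
emit_p = {
    "NOUN": {"pants": 0.5, "dogs": 0.5},
    "VERB": {"pants": 0.3, "were": 0.35, "worn": 0.35},
}

print(viterbi(["pants", "were", "worn"], states, start_p, trans_p, emit_p))
print(viterbi(["dogs", "pants"], states, start_p, trans_p, emit_p))
```

The same word, "pants", is assigned to a different state depending on its neighbours. The model has no notion of what the word means, only of which state sequence is most probable.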

Here's an illustration of the five-word contextual window. If we enter the following:

Pants, as you expected, were worn.

Pants were worn.

Pants, as you expected, are worn.

The Latin translations (with manual translations back to English) are:

Anhelat quemadmodum speravimus confecta. (He is panting just as we hoped accomplished.)

Braccas sunt attriti. (The trousers have been worn away [like "attrition"]).

Anhelat, ut spe teris. (He is panting, just as, by hope, you are wearing [something] out.)

Notice that the first and third sentences border on ungrammatical nonsense. There aren't any five-word sequences in Google Translate's English database that line up well with "pants as you expected were/are," so it's flailing. Notice that in the third sentence, by the time it got to "worn", it had forgotten which sense of "pants" it chose at the start of the sentence. Or rather, it didn't forget, because it never tracked it. It only tracked five-word sequences. It gives the second sentence some sort of meaning, but even then it is still very wrong. Not only does it give 'worn' the wrong meaning (since, as I said before, it makes no semantic link between 'pants' and 'worn' that would imply the other sense of 'to wear'), but it also fails to make the noun and participle agree in gender, and fails to put the subject in the correct case.

So, whether the sentence makes sense sort of affects whether the translation means anything, but it's worse than that. What matters the most is exact, word-for-word matching with texts in the database.

Entering Latin into Google Translate (with words changed from the first sentence shown in bold):

Abraham vero aliam duxit uxorem nomine Cetthuram.

Quintilianus vero aliam duxit uxorem nomine Cetthuram.

Abraham vero aliam duxit uxorem nomine Iuliam.

Abraham vero canem duxit uxorem nomine Fido.

English output:

And Abraham took another wife, and her name was Keturah.

Quintilian, now the wife of another wife, and her name was Keturah.

And Abraham took another wife, and the name of his wife, a daughter named Julia.

And Abraham took a wife, and brought him to a dog by the name of Fido.

The Vulgate and the ASV translation (or similar) are among Google Translate's source texts, so it is very good at directly translating them - notice, though, what happens when the input is off by as little as one word. No longer able to spot the similarity, the software begins to translate smaller sentence fragments instead of the whole sentence: for example the fragment "uxorem nomine Cetthuram" is translated in both of the above sentences where it appears as "another wife, and her name was Keturah" despite the change of context.
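The fallback from whole-sentence matches to smaller fragments can be sketched with a greedy longest-match lookup in a phrase table. This is a deliberately simplified stand-in for phrase-based translation, not Google's algorithm: the table entries are invented, there are no probabilities and no reordering, and the function names `translate_by_phrases` and `render` are hypothetical.

```python
def translate_by_phrases(sentence, phrase_table, max_len=5):
    """Cover the input left to right with the longest known phrases.

    Returns a list of (source chunk, translation) pairs so that the
    fragmentation is visible.
    """
    words = sentence.split()
    segments, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            chunk = " ".join(words[i:i + n])
            if chunk in phrase_table:
                segments.append((chunk, phrase_table[chunk]))
                i += n
                break
        else:
            # Unknown word: pass it through untranslated.
            segments.append((words[i], words[i]))
            i += 1
    return segments

def render(segments):
    """Join the translated chunks into one output string."""
    return " ".join(target for _, target in segments)

# Invented phrase table, for illustration only.
PHRASE_TABLE = {
    "Abraham vero aliam duxit uxorem": "And Abraham took another wife,",
    "aliam duxit uxorem": "took another wife",
    "nomine Cetthuram": "and her name was Keturah",
    "vero": "now",
}

exact = translate_by_phrases(
    "Abraham vero aliam duxit uxorem nomine Cetthuram", PHRASE_TABLE)
changed = translate_by_phrases(
    "Quintilianus vero aliam duxit uxorem nomine Cetthuram", PHRASE_TABLE)

print(len(exact), render(exact))
print(len(changed), render(changed))
```

With the original wording, two long phrases cover the whole sentence. Changing a single word breaks the longest match, and the sentence is stitched together from four smaller pieces that were each learned in some other context.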

This method of translation is why Google Translate works relatively well for more analytic languages, where strict word order matters most for meaning, but dreadfully for more synthetic languages such as Latin, where inflections define meaning. In analytic languages, consecutive words are more likely to have a semantic link that allows the meaning to be retained when they are reproduced in the target language.

The Neural Machine Translation Model

The Neural Machine Translation Model has moved beyond simple statistical models for translation, using instead machine learning and neural networks.

According to a Google blog post on the subject:

At a high level, the Neural system translates whole sentences at a time, rather than just piece by piece. It uses this broader context to help it figure out the most relevant translation, which it then rearranges and adjusts to be more like a human speaking with proper grammar.

Even more advanced is Google's Zero-Shot Translation with Google’s Multilingual Machine Translation System, which can be thought of as translating the input phrases into its own semantic computer 'interlingua', and then into the output language. This is what permits 'zero-shot translation' between language pairs it has never analysed before. The example cited in the Google report demonstrated reasonable Korean-Japanese translation from a system that had only ever been trained on Japanese-English and Korean-English sentence pairs. Unfortunately, neither of these looks to be coming to Latin any time soon.


If you want to see Markov chains working in English on reddit, r/SubredditSimulator is a subreddit entirely filled with Markov chain bots (plain Markov chains, not hidden Markov models) seeded by the contents of subreddits. The post titles there are created with a Markov chain length (contextual 'window' of words) of two, meaning that every sequence of three words existed at some point in the subreddit from which the bot took its source. The comments are created by the same method, except for the longer ones, which have a Markov chain length of three.
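The claim that a chain length of two makes every three-word sequence traceable to the source can be checked with a short sketch. This is not the bots' actual code: the source text is invented, and `build_chain`, `generate` and `ngrams` are hypothetical helper names.

```python
import random
from collections import defaultdict

def build_chain(words, order=2):
    """Map each run of `order` words to the words seen right after it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, seed_state, length=12, rng=None):
    """Extend seed_state one word at a time, looking only at the last words."""
    rng = rng or random.Random()
    order = len(seed_state)
    out = list(seed_state)
    while len(out) < length:
        followers = chain.get(tuple(out[-order:]))
        if not followers:
            break
        out.append(rng.choice(followers))
    return out

def ngrams(words, n):
    """Set of all runs of n consecutive words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

# Invented source text, for illustration only.
source = ("the cat sat on the mat and the cat ate the fish "
          "on the mat while the dog sat on the rug").split()
chain = build_chain(source, order=2)
title = generate(chain, ("the", "cat"), rng=random.Random(0))

print(" ".join(title))
print(ngrams(title, 3) <= ngrams(source, 3))
```

The final check holds for any random seed, because each new word is only ever chosen from words that followed the same two-word state in the source.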


TL;DR. Just read it. In summary, though, Google's translation system for Latin does not have grammatical or semantic analysis at any level, but just a statistical model of the most likely word to appear next based on analysis of the corpus of works it has in both the input language and Latin. This works relatively well for languages like English, where word order is the most important thing, but is dreadful for languages like Latin.

So, NEVER use Google Translate for Latin if you want any sort of actual translation.

As the sidebar says:

Google Translate is always wrong, always. Don't even bother turning to Google Translate before asking us for help with a translation.


Entirely plagiarised from Ben Kovitz's excellent post on the Latin StackExchange, and then slightly extended.



This should be stickied.

u/funnyflywheel

*Hoc viscandum est.

u/Amenemhab (edited)

I'm not sure if being a synthetic language really is a problem for Google Translate. A priori the most parsimonious explanation for GT sucking at Latin is the lack of data. We would have to compare to modern synthetic languages like Russian to be sure, but in theory with sufficient data nothing should stop GT from recovering the correct inflexions most of the time, in simple sentences, by matching your sentence to a similar construct in the data (I say most because it's not like GT doesn't suck at analytic languages).

(Edit: sidenote, it's generally thought to be mathematically impossible to get a correct model of the syntax with an HMM, because natural-language syntax is more complicated -- higher in the Chomsky hierarchy -- than what an HMM can model. That said, this also applies to analytic languages, and shouldn't stop your model from handling simple sentences.)

The Neural Machine Translation Model has moved beyond simple statistical models for translation, using instead machine learning and neural networks.

Not really. The basic principle of a recurrent neural network is essentially the same as an HMM, except:

  1. The window is unbounded.

  2. The model lives in a more complicated space with lots of parameters.

Also, HMMs count as machine learning.

RNNs are subject to the same sort of problems as HMMs that you point out, i.e. they'll often fail to capture long-distance dependencies unless they're trained on tremendous amounts of data, they'll overfit the training data, etc.
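A toy illustration of the "unbounded window" point, assuming a one-dimensional recurrence with hand-picked weights (nothing like a real translation network, and nothing here is trained): the hidden state carries a trace of every earlier input, whereas a fixed five-item window sees only the most recent ones.

```python
import math

def rnn_final_state(inputs, w_h=0.5, w_x=1.0):
    """Scalar recurrence h_t = tanh(w_h * h_{t-1} + w_x * x_t), starting at 0."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
    return h

def window_view(inputs, window=5):
    """What a fixed-window model gets to see: only the last `window` items."""
    return tuple(inputs[-window:])

# Two sequences that differ only in the first of seven positions.
a = [1, 0, 0, 0, 0, 0, 0]
b = [0, 0, 0, 0, 0, 0, 0]

print(window_view(a) == window_view(b))       # identical to a 5-item window
print(rnn_final_state(a), rnn_final_state(b))  # the recurrent states differ
```

The recurrent state still distinguishes the two sequences, but the trace of the first input has shrunk to roughly 0.01 after seven steps, which fits the remark that long-distance dependencies are hard to capture in practice.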

All this for karma? Damn son take it!

u/Quidfacis_

All it does is correlate sequences of up to five consecutive words in texts that have been manually translated into two or more languages.

That sort of system ought to be perfect for a dead language, though. Dump all the Cicero, Livy, Lucretius, Vergil, and Oxford Latin Course into a database and we're good.

We're not exactly inundated with brand new Latin to translate.

u/Cherubin0

I don't think this would be enough. The UN produces a lot of translations each year, and these translations are of very high quality and very accurate. I am not certain, but I would guess that the UN's yearly volume is bigger than the entire classical corpus. The big problem with Google Translate is that it is a one-size-fits-all solution, so it needs an extreme volume of data to work well, because the more powerful a model is, the more data it needs. It would be much better for Latin if someone made a very constrained model that works only for Latin-like languages. Then the amount of data needed would be very small.

u/AllanBz

Dump all the Cicero, Livy, Lucretius, Vergil, and Oxford Latin Course into a database and we're good.

What makes you think that the Google folks haven't done so and used that to create the language models they use?

That sort of system ought to be perfect for a dead language, though.

Perhaps. But it will be bad at translating novel English sentences to Latin.

u/Quidfacis_

Perhaps. But it will be bad at translating novel English sentences to Latin.

Definitely. But there are only three groups of people who ever need to translate English into Latin.

  • Latin Professors

  • Latin Students

  • Popes

We could make a category for "idiots who want shitty Latin tattoos", but why dignify them with a category?

u/AllanBz

We could make a category for "idiots who want shitty Latin tattoos", but why dignify them with a category?

Entertainment?

u/Mysterious_Advice154

Some authors name things in Greek and Latin, especially if they’re creating something like a nation whose cultural language is modelled on Latin or Greek, say, a language that is effectively Latin or Greek under a different name. Would Google Translate work then? For one- or two-word names?

u/Amenemhab

I don't think you realise how little data literature is, compared to the gazillions of legal proceedings and technical notices google has for modern languages.

[deleted]

So if it were fed the Loeb corpus would Google Translate improve?

Update: Today, or at least very recently, Google Translate has launched neural machine translation for Latin!

Yeah, and it's pretty bloody excellent now too. You can see it change as you add more of the sentence. For individual words it gives you a very out-of-context translation into English, but put the whole sentence and then the paragraph in and it changes its translation. Still not as good as my Latin tutor, but MUCH MUCH better than it used to be.

u/ECBicalho

a little off topic :p

It seems to me that the addresses for the UN website are broken.

There are two occurrences of the same page, one in English and one in French, and the link now returns a 404 error.

https://www.un.org/fr/sections/un-charter/chapter-xii/index.html

Maybe it should be replaced by

https://www.un.org/fr/about-us/un-charter/chapter-12

Or, if you want to point to the exact part of the text...

https://www.un.org/fr/about-us/un-charter/chapter-12#:~:text=%C3%80%20l%27exception%20de,peuvent%20%C3%AAtre%20parties.

Alternatively, out of curiosity, a copy of the originally intended page directly from the web archive may eliminate this problem for good.

https://web.archive.org/web/20180127031204/https://www.un.org/fr/sections/un-charter/chapter-xii/index.html

u/NehNeh2137

So how could I get a translation? Probably no one's gonna answer me, but I'm curious. I wanted to translate just two sentences, and now that I've learned from my own experience and from this comment that Google is bad at translating, I don't know what to do.

It's been 6 years since this thread was first posted. A response above, from about a year ago, said that GT has been updated and is working far better now than it was 6 years ago. I can also confirm that GT nowadays does a far better job than it did back then. You're sure to encounter some situations where you won't get the exact translation you were looking for, but from my understanding that's because Latin syntax and modern English syntax aren't exactly 1:1, and because a number of words or terms used in the modern era have no direct translation, since those words or phrases didn't exist with the same context or meaning we give them today.

Overall, though, YES, you can use GT for really accurate translations into Latin and vice versa. Hope I was able to clear this up for you.

u/NehNeh2137

Yeah, I once had a problem with the translation of some sentences, but overall thank you for your response.


Is it still bad? This was 6 years ago.

It is far better nowadays. Sometimes you just need to account for words or phrases that don't have direct translations from modern English, but this can easily be handled with minor rewording using older English phrasing or spelling.

A general example, not tied to any actual translation: instead of saying "my butt hurts", you could try "my posterior hurts", etc. Otherwise you can sometimes get a perfect sentence with one or two nonsensical words in place of what you actually wanted to say. Hope this more than answers your question.

u/Mysterious_Advice154

Okay, but would it work for singular nouns or present tense verbs?