The Dangers of Google Translate

Or

ONCE AGAIN INTO THE BREACH?

Frank Mayers, Advocate & Notary

Even in a world of WikiLeaks and computer hacking, confidentiality still plays an important (maybe even a heightened) role in the work of many professionals, whether it is imposed by law, ethics or custom. Examples include attorney-client privilege, doctor-patient confidentiality, a journalist protecting a source, a Catholic priest in the confession booth, banking secrecy laws and many more.

Could the use of ‘Google Translate’ by a person or body that owes a duty of secrecy or confidentiality amount to a breach of that duty?

To answer this question one needs to understand: a) how Google Translate works; and b) what the terms of use of Google Translate are.

How does ‘Google Translate’ Work?

In his book on translation, Is That a Fish in Your Ear? – Translation and the Meaning of Everything[1], David Bellos explains: “It doesn’t deal with meaning at all. Instead of taking a linguistic expression as something that requires decoding, Google Translate (GT) takes it as something that has probably been said before [i.e., human translation]. It uses vast computing power to scour the Internet in the blink of an eye looking for the expression in some text that exists alongside its paired translation.”

What this effectively means is that the text, sentence or phrase you insert into Google Translate is divided up into ‘translation memory’ units, or segments, as that term is used in the translation software standard format known as ‘XLIFF’. Google Translate then ‘translates’ each unit, not by trying to work out some complex rule of grammar, but by trying to pair it as closely as possible with a similar unit appearing in its vast memory base. More often than not the match will not be exact but only as close an approximation as possible; this is called a ‘fuzzy’ match.

The memory base is extended even further by the use of pivot languages. Thus a sentence in, say, Hebrew may find its way into Spanish if Google’s database has a) a translation of the sentence (or a similar one) from Hebrew into English and b) a translation of the sentence (or a similar one) from English into Spanish. This obviously increases the ‘fuzziness’ of the translation. The ‘fuzziness’ is also indicative of the true infinity of language and of the ‘grasping at straws’ nature of machine translation: despite Google’s almost infinite database of human-translated sentences, even software as sophisticated as Google’s is still very much ‘hit and miss’.
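The matching and pivoting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's actual method: the two toy memories, the transliterated Hebrew sentence, the function names and the 0.75 threshold are all invented for the example, and the similarity score comes from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher

# Hypothetical toy memories: stored source unit -> stored human translation.
# The Hebrew unit is transliterated purely for readability.
HEBREW_TO_ENGLISH = {
    "ha-chatul yashav al ha-shatiach": "The cat sat on the mat",
}
ENGLISH_TO_SPANISH = {
    "The cat sat on the rug": "El gato se sentó en la alfombra",
}


def best_fuzzy_match(segment, memory, threshold=0.75):
    """Return (stored_source, translation, score) for the closest stored
    unit whose similarity reaches the threshold, or None if there is none."""
    best = None
    for stored_source, translation in memory.items():
        score = SequenceMatcher(None, segment.lower(), stored_source.lower()).ratio()
        if score >= threshold and (best is None or score > best[2]):
            best = (stored_source, translation, score)
    return best


def translate_via_pivot(segment, source_to_pivot, pivot_to_target, threshold=0.75):
    """Chain two fuzzy lookups through a pivot language.

    Returns (target_translation, combined_score) or None. The combined
    score is simply the product of the two scores, an assumed and very
    rough way of showing that each hop can add 'fuzziness'.
    """
    first = best_fuzzy_match(segment, source_to_pivot, threshold)
    if first is None:
        return None
    second = best_fuzzy_match(first[1], pivot_to_target, threshold)
    if second is None:
        return None
    return second[1], first[2] * second[2]


if __name__ == "__main__":
    print(translate_via_pivot(
        "ha-chatul yashav al ha-shatiach",
        HEBREW_TO_ENGLISH,
        ENGLISH_TO_SPANISH,
    ))
```

In this toy run the Hebrew unit matches exactly, but the English pivot sentence ("mat") only approximately matches the stored English unit ("rug"), so the Spanish result comes back with a combined score below 1. That is the 'fuzziness' compounding through the pivot.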

Where does Google Translate get all these translation units from?

Well – essentially from us. We ‘pay’ for the use of the ‘free’ service given to us by Google by providing information. This may be done willingly and intentionally, as when the European Union publishes all of its legislative documents on the Internet in its myriad languages (i.e., a plethora of translated documents), or it may be done inadvertently (as we will see below).

This is what Google Translate’s Terms of Use state: “By submitting or creating your content through the Service, you grant Google permission to use your content…” However, a proviso is added: “… provided that Google will not disclose the subject matter of your content or make your content available on a standalone basis to any third parties without your consent.”

What does Google mean by its use of the term “a standalone basis…”? The answer is provided shortly afterwards in the terms: “If Google displays your content to an end user, it will do so only according to the sharing rules below, and only on a translation unit basis.” The term ‘translation unit’, which I have already explained above, is defined by Google Translate’s terms as having the meaning assigned to it in “the [above mentioned] XLIFF standard”.

Thus, for example, “The cat sat on the mat.” would be one translation unit. “The cat sat on the mat. The cat watched TV.” would be two translation units, separated in this case by a full stop. “The cat sat on the mat, whilst watching TV” would also be one ‘translation unit’, but “Whilst sitting on the mat the cat watched: TV” would most probably be two translation units, separated in this case by the colon.
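The cat examples can be reproduced with a very small segmenter. The rule used here is an assumption made for illustration (a unit ends at a full stop, exclamation mark, question mark or colon that is followed by whitespace); the function name `split_into_units` is invented, and real translation tools apply far more elaborate segmentation rules.

```python
import re

# Assumed rule, for illustration only: a unit ends at '.', '!', '?' or ':'
# when that character is followed by whitespace. Commas do not end a unit.
_UNIT_BOUNDARY = re.compile(r"(?<=[.!?:])\s+")


def split_into_units(text):
    """Split text into 'translation units' using the naive rule above."""
    return [unit.strip() for unit in _UNIT_BOUNDARY.split(text) if unit.strip()]


if __name__ == "__main__":
    examples = [
        "The cat sat on the mat.",
        "The cat sat on the mat. The cat watched TV.",
        "The cat sat on the mat, whilst watching TV",
        "Whilst sitting on the mat the cat watched: TV",
    ]
    for example in examples:
        print(len(split_into_units(example)), split_into_units(example))
```

A rule this naive would also split after an abbreviation such as "Mrs.", which is one reason real segmenters carry lists of exceptions.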

My understanding of this is what I call the ‘paper shredder effect’. Let me explain:

Even though you may present an entire document (e.g., an agreement) to Google Translate for translation, Google Translate will not keep the data as an entire document, but rather will ‘divide’ (or ‘shred’) the document into its many ‘translation units’ and keep the translation units in its memory base, in a ‘shredded and scrambled’ form.
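A minimal sketch of the ‘paper shredder effect’, under my own assumptions, might look as follows. The class `UnitStore` and its methods are hypothetical and do not describe how Google actually stores anything; the point is only that a store can discard the document and its order yet still hold every individual unit.

```python
import random
import re


def shred(document):
    """Split a document into units (naive rule: break after . ! ? or :)."""
    return [u.strip() for u in re.split(r"(?<=[.!?:])\s+", document) if u.strip()]


class UnitStore:
    """Hypothetical memory base: keeps units, but not the document,
    its author or the order in which the units appeared."""

    def __init__(self):
        self._units = set()

    def add_document(self, document):
        for unit in shred(document):
            self._units.add(unit)

    def contains(self, unit):
        return unit in self._units

    def dump(self, seed=0):
        """Return all stored units in a scrambled order."""
        units = sorted(self._units)
        random.Random(seed).shuffle(units)
        return units


if __name__ == "__main__":
    store = UnitStore()
    store.add_document("The parties agree as follows. The price is confidential.")
    store.add_document("The cat sat on the mat. The cat watched TV.")
    print(store.dump())
    print(store.contains("The price is confidential."))
```

The whole agreement can no longer be looked up as a document, but each of its sentences is still sitting in the store and can be found by anyone who submits the same (or, with fuzzy matching, a similar) sentence.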

In translation software, this might look something like this:

[Figure: a document divided into translation units, as displayed in translation software]

(By the way, the text shown in the above diagram is taken from the latest Google privacy terms for its APIs (application programming interfaces). Those terms may indeed be even more open to breach of confidentiality than the specific terms referred to above for the use of Google Translate, which is itself an API.)

Thus, if I were to put this entire document into Google Translate, and on the assumption that no (human) translation of this document exists in Google’s database, many sentences would come out as ‘fuzzy’ translations or would not be translated at all. By contrast, a common sentence such as “the cat sat on the mat” would almost certainly come out with a perfect match translation.

Conversely, if I ‘submit’ this article to Google in both Hebrew and English, a request to translate one of its sentences from Hebrew into English will come out with a ‘full match’. For example, the terms of use of Google Translate themselves will come out in perfect matches, as the information obviously exists in Google’s database after such terms were originally translated by a human translator.

Thus, although Google might not present a submitted document in its full form, it is evidently able to ‘un-shred’ the document and ‘stick’ the ‘shredded’ parts (Oliver North-like) back together.

So where is the danger for a professional putting his or her work through Google Translate?

The risk is that someone ‘accidentally’ stumbles on a segment or translation unit in Google’s database which is both confidential and recognizable. In other words, the user recognizes the confidential material. This may sound far-fetched. However, how ‘far-fetched’ is this possibility really?

The wide use of boilerplate (i.e., standard texts) and of ‘cut and paste’ features, where only minimal changes are made to the text, such as the changing of names, means that the chances of stumbling on a recognizable translation unit or confidential segment might not be that far-fetched after all.

This is especially so when translating from a language such as Hebrew, which has a relatively small number of users.

Let’s look at the following examples: Lawyer One might translate (using Google Translate), inter alia, the following sentence from a confidential will:

“I, Mrs. Cohen leave all my worldly goods to my children – John, George, Paul and Peter.”

On the other hand, Mr. Cohen’s lawyer, unaware of Mrs. Cohen’s ‘confidential’ will, is also translating Mr. Cohen’s will using Google Translate. However, unlike Mrs. Cohen, Mr. Cohen has not forgotten his spouse in his will and has written as follows:

“I, Mr. Cohen leave all my worldly goods to my children and my wife – Mrs. Cohen, John, George, Paul and Peter.”

In this case, because of the great similarity of the two sentences in the original language (let’s say Hebrew) there is a chance that the ‘fuzzy’ translation that would ‘pop up’ as a suggestion for Mr. Cohen’s translation would be the translation from Mrs. Cohen’s will – the will in which she forgot her husband!
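How close the two sentences are can be checked with a short sketch. It rests on several assumptions: the English sentences stand in for the Hebrew originals, the shared memory is a plain list, the names `similarity` and `suggest` and the 0.7 threshold are invented, and the score is the one given by Python's standard `difflib` module rather than anything Google is known to use.

```python
from difflib import SequenceMatcher

# English stand-ins for the two source-language sentences.
MRS_COHEN_UNIT = (
    "I, Mrs. Cohen leave all my worldly goods to my children – "
    "John, George, Paul and Peter."
)
MR_COHEN_UNIT = (
    "I, Mr. Cohen leave all my worldly goods to my children and my wife – "
    "Mrs. Cohen, John, George, Paul and Peter."
)


def similarity(a, b):
    """Similarity between two units, from 0.0 (nothing shared) to 1.0 (identical)."""
    return SequenceMatcher(None, a, b).ratio()


def suggest(segment, shared_memory, threshold=0.7):
    """Return the stored unit most similar to `segment`, or None if no
    stored unit reaches the (assumed) fuzzy-match threshold."""
    scored = [(similarity(segment, stored), stored) for stored in shared_memory]
    scored = [pair for pair in scored if pair[0] >= threshold]
    return max(scored)[1] if scored else None


if __name__ == "__main__":
    # The memory already holds a unit from Mrs. Cohen's confidential will.
    shared_memory = [MRS_COHEN_UNIT, "The cat sat on the mat."]
    print(round(similarity(MR_COHEN_UNIT, MRS_COHEN_UNIT), 2))
    print(suggest(MR_COHEN_UNIT, shared_memory))
```

Under these assumptions the unit from Mrs. Cohen's will clears the threshold comfortably, so it is the one offered to Mr. Cohen's lawyer as the ‘fuzzy’ suggestion.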

Moreover, it should be noted that the mere transmission of the information, whether in full or in part, to an unauthorized third party (in this case Google) without the consent of the client, patient or other provider of the information may very well, in itself, be a breach of the said confidentiality. This is regardless of whether or not any part of the said document may ‘pop up’ in the future in some search. To put it in layman’s terms – ‘once it’s out there it’s gone forever’. In other words, once the information is put on the web, in whatever form, the provider has no control over it.

This threat is intensified even further by the fact that many translators use translation memory software with public memory bases, oftentimes even making use of Google Translate itself as a backup. The effect of this is that even when a professional innocently sends work to a translator, who as an authorized third party may receive it without any breach of confidentiality, that translator may be irresponsibly making use of a public database, thus disseminating the information publicly and placing even the unwitting professional (vicariously) at risk of being in breach of confidentiality or privilege.

[1] David Bellos, Is That a Fish in Your Ear? – Translation and the Meaning of Everything (Faber & Faber, New York, Kindle Edition, 2012).
