Wednesday, September 13, 2017

Data Security Risks with Generic and Free Machine Translation

Together with all the news of catastrophic hurricane activity in the US that we have been bombarded with recently, we are also seeing stories of serious data security breaches of privileged information, even in the world of business translation. As always, the security and privacy of data can only be as good as the data security practices and the sophistication of the technological implementation behind it, so the knee-jerk response of blaming “bad” MT technology per se, or MT use in general, is not quite fair. It is entirely possible to make MT services available for public and/or corporate use in a safe and secure way if you know what you are doing and are careful. It may seem obvious to some, but effective use of any technology requires both competence and skill, and we see too many inept implementations that result in sub-optimal outcomes.
Generally, even professional translation work done entirely by humans requires source content to be distributed to translators and editors across the web so that they can perform their specific tasks. Sensitive or confidential information requiring translation can leak in two ways. First, information can be stolen "in transit" when it is transferred or accessed over unsecured public Wi-Fi hotspots, or when it is stored on unsecured cloud servers. Such risks have already been widely publicized, and it is clear that weak processes and lax oversight are responsible for most of these data leakage cases.

Less considered, however, is what online machine translation providers do with the data users input. This risk was publicized by Slator last week, when employees of Norwegian state-run oil giant Statoil “discovered text that had been typed in on [] could be found by anyone conducting a [Google] search.”

Slator reported that: “Anyone doing the same simple two-step Google search will concur. A few searches by Slator uncovered an astonishing variety of sensitive information that is freely accessible, ranging from a physician’s email exchange with a global pharmaceutical company on tax matters, late payment notices, a staff performance report of a global investment bank, and termination letters. In all instances, full names, emails, phone numbers, and other highly sensitive data were revealed.”

In this case, the injured parties apparently have little or no recourse, as the “Terms of Use” policies of the MT supplier clearly stated that privacy is not guaranteed: “cannot and do not guarantee that any information provided to us by you will not become public under any circumstances. You should appreciate that all information submitted on the website might potentially be publicly accessible.”
Several others in the translation industry have pointed out further examples of these risks, and have named other risky MT and shared-data players that handle translation data.

Translation technology blogger Joseph Wojowski wrote in some detail about the Google and Microsoft terms of use agreements in a post a few years ago. The information he presents is still quite current. From my vantage point, these two MT services are the most secure and reliable “free” translation services available on the web today, and a significant step above many other offerings. However, if you are really concerned about privacy, even these carry some risk, as the following analysis points out.

His opening statement is provocative and true at least to some extent:
“An issue that seems to have been brought up once in the industry and never addressed again are the data collection methods used by Microsoft, Google, Yahoo!, Skype, and Apple as well as the revelations of PRISM data collection from those same companies, thanks to Edward Snowden. More and more, it appears that the [translation] industry is moving closer and closer to full Machine Translation Integration and Usage, and with interesting, if alarming, findings being reported on Machine Translation’s usage when integrated into Translation Environments, the fact remains that Google Translate, Microsoft Bing Translator, and other publicly-available machine translation interfaces and APIs store every single word, phrase, segment, and sentence that is sent to them.”

The Google Terms of Service

Both Google and Microsoft state very clearly that any (or at least some) data used on their translation servers is available for further processing and re-use, generally by machine learning technologies. (I would be surprised if any single individual actually sits and watches this MT user data stream, even though it may be technically possible to do so.) Their terms of use are considerably better than those of the supplier mentioned above, which might as well be reduced to: “User Beware: Use at your own risk; we are not liable for anything that can go wrong in any way whatsoever.” Many people around the world use Google Translate daily, but very few of them are aware of the Google Terms of Service. Here is the specific legalese from the Google Translate Terms of Use Agreement, which I include because it is good to see it as specifically as possible to properly understand the potential risk.
When you upload, submit, store, send or receive content to or through our Services, you give Google (and those we work with) a worldwide license to use, host, store, reproduce, modify, create derivative works (such as those resulting from translations, adaptations or other changes we make so that your content works better with our Services), communicate, publish, publicly perform, publicly display and distribute such content. The rights you grant in this license are for the limited purpose of operating, promoting, and improving our Services, and to develop new ones. This license continues even if you stop using our Services. (Google Terms of Service – April 14, 2014, accessed on September 11, 2017.)
Some other highlights from the Google TOS, which basically mean, IMO: if something goes wrong, tough shit; and if you can somehow prove it is our fault, we only owe you what you paid us, unless you can somehow prove the damage was reasonably foreseeable. The terms are even less favorable if you use the MT service for “Business Use”:



The Microsoft Terms of Service

Microsoft is a little better, and they are more forthcoming about their use of your data in general, but you can judge for yourself. Heavy users even have a way to ensure that their data is not used or analyzed at all, via a paid volume subscription. Heavy use is defined as 250 million characters per month or more, which by my calculations is anywhere from 30 million to 50 million words per month. Here are some key selections from the Microsoft Translator Terms of Use statement.
"Microsoft Translator does not use the text or speech audio you submit for translation for any purpose other than to provide and improve the quality of Microsoft’s translation and speech recognition services. For instance, we do not use the text or speech audio you submit for translation to identify specific individuals or for advertising. The text we use to improve Translator is limited to a sample of not more than 10% of randomly selected, non-consecutive sentences from the text you submit, and we mask or delete numeric strings of characters and email addresses that may be present in the samples of text. The portions of text that we do not use to improve Translator are deleted within 48 hours after they are no longer required to provide your translation. If Translator is embedded within another service or product, we may group together all text samples that come from that service or product, but we do not store them with any identifiers associated with specific users. We may keep all speech audio indefinitely for product improvement purposes. We do not share the text or speech audio samples with third parties without your consent or as otherwise described, below.

We may share or disclose personal information with other Microsoft controlled subsidiaries and affiliates, and with suppliers or agents working on our behalf to assist with management and improvement to the Translator service.

In addition, we may access, disclose and preserve information when we have a good faith belief that doing so is necessary to:
  1. comply with applicable law or respond to valid legal process from competent authorities, including from law enforcement or other government agencies; (Like PRISM for the NSA)
  2. protect our customers, for example to prevent spam or attempts to defraud users of the services, or to help prevent the loss of life or serious injury of anyone;"
And for those LSPs and enterprises who customize (train) the MSFT Translator baseline engines with their own TM data, the following terms additionally apply:
"The Microsoft Translator Hub (the “Hub”) is an optional feature that allows you to create a personalized translation system with your preferred terminology and style by submitting your own documents to train on, or using community translations. The Hub retains and uses submitted documents in full in order to provide your personalized translation system and to improve the Translator service. After you remove a document from your Hub account we may continue to use it for improving the Translator service."
Again, if you are a 50-million-words-per-month kind of user, you can opt out so that your data is not used for anything else.
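For a rough sense of where that opt-out threshold sits, the character-to-word conversion is simple arithmetic. The characters-per-word ratios below are my own assumptions for typical European languages (counting spaces), not Microsoft's figures:

```python
def chars_to_words(char_count: int, chars_per_word: float) -> float:
    """Estimate a word count from a character count (characters include spaces)."""
    return char_count / chars_per_word

THRESHOLD = 250_000_000  # Microsoft's stated opt-out volume, characters/month

low = chars_to_words(THRESHOLD, 8)   # ~31 million words at 8 chars/word
high = chars_to_words(THRESHOLD, 5)  # 50 million words at 5 chars/word
print(f"{low / 1e6:.0f}-{high / 1e6:.0f} million words per month")  # 31-50
```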

Blogger Joseph Wojowski, after his review of these agreements, concludes that translators need to be wary, even though he notes that MT offers real and meaningful productivity benefits for translators in some cases.

“In the end, I still come to the same conclusion, we need to be more cognizant of what we send through free, public, and semi-public Machine Translation engines and educate ourselves on the risks associated with their use and the safer, more secure solutions available when working with confidential or restricted-access information.”

Invisible Access via Integration

If your source data already exists on the web anyway, some may ask, what is the big deal? Some MT use cases I am aware of, such as translating technical support knowledge bases or eCommerce product listings, may not be affected by these re-use terms. But the larger risk is that once translation infrastructure is connected to MT via an API, users may inadvertently start sending less suitable documents out for MT without understanding the risks and potential data exposure. For an unsophisticated user of a translation management system (TMS), it is quite possible to inadvertently send an upcoming earnings announcement, internal memos about emerging product designs, or other restricted data to an MT server that is governed by these terms of use. In global enterprises, there is an ongoing need to translate many types of truly confidential information.

Memsource recently presented research on MT usage within their TMS environment across their whole user base, showing that about 40 million segments per month are being translated via the Microsoft Translator and Google APIs. Given that this volume barely meets the opt-out limits, we have to presume that all of this data is re-used and analyzed. A previous article in the ATA Chronicle by Jost Zetzsche (page 26 of the December 2014 issue) showed that almost 14,000 translators were using the same “free” MT services in Memsource. If you add Trados and the other TM and TMS systems that have integrated API access to these public MT systems, I am sure the volume of MT use is significant. Thus, if you care about privacy and security, the first thing you might need to do is address the MT API integrations that are cloaked within widely used TM and TMS products. There are many cases where it might not matter, but users should understand the risks when it does.
Human error, often inadvertent, is a leading cause of data leakage
Common Sense Advisory's Don DePalma writes that "employees and your suppliers are unconsciously conspiring to broadcast your confidential information, trade secrets, and intellectual property (IP) to the world.” CSA also reports that in a recent survey of enterprise localization managers, 64% said their fellow employees use free MT frequently or very frequently, and 62% told Common Sense Advisory that they are concerned or very concerned about “sensitive content” (e-mails, text messages, project proposals, legal contracts, merger and acquisition documents) being translated. CSA points out two risks:
  1. Information being seen by hackers in transit across non-secure web connections
  2. Data retention and re-use by the MT provider: see the Google TOS section described above for what Google can do even after you stop using the services
The problem is compounded because, while it may be possible to enforce usage policies within the firewall, suppliers and partners may lack the sophistication to do the same, especially in an ever-expanding global market. Many LSPs and their translators now use MT through the API integrations mentioned above. CSA lists the issues as follows:
  • Service providers may not tell clients that they use MT.
  • Most buyers haven’t caught up yet with data leakage.
  • Subcontractors might not follow the agreed-upon rules.
  • No matter what anyone says, linguists can and will use MT when it is convenient regardless of stated policies.
Within localization groups, there may be some ways to control this. As CSA again points out, options include:
  • Locking down content workflows (e.g. turn off MT access within TMS systems)
  • Finding MT providers that will comply with your data security provisions
However, the real risk is in the larger enterprise, outside the localization department, where the acronym TMS is unknown. It may be possible, to some extent, to anonymize all translation requests through specialized software, to block all free translation requests, or to force them through special gateways that scrub the data before it goes out beyond the firewall. While these anonymization tools might be useful, they are still primitive, and much of the risk can be mitigated by establishing a corporate-controlled MT capability that provides universal access to all employees and remains behind the firewall.
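A gateway that scrubs data before it leaves the firewall is simple in concept, even though production-grade anonymization is not. Here is a minimal sketch of the same kind of masking Microsoft describes for its own samples (numeric strings and email addresses); the patterns and placeholder tokens are my own illustrations, and a real solution would also need to handle names, addresses, and many other identifiers:

```python
import re

# Mask email addresses and long numeric strings before a segment leaves
# the firewall for an external MT API. Illustrative patterns only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
NUMBER_RE = re.compile(r"\b\d{4,}\b")  # account, invoice, phone numbers, etc.

def scrub(segment: str) -> str:
    segment = EMAIL_RE.sub("[EMAIL]", segment)
    segment = NUMBER_RE.sub("[NUMBER]", segment)
    return segment

print(scrub("Contact jane.doe@example.com, ref. invoice 90817261."))
# -> Contact [EMAIL], ref. invoice [NUMBER].
```

The key design point is that the masking happens before the text reaches any external service, so even under the broadest terms of use the provider only ever sees the redacted segment.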

In addition to the secure corporate MT service described above, I think we will also see much more use of MT in e-discovery applications, both in litigation-related applications and in broader corporate governance and compliance applications. Here is another opinion on the risks of using generic MT services in the corporate litigation scenario.

Considering Secure MT Deployment Options

Many global organizations are now beginning to realize the information leakage risk presented by unrestricted use of, and access to, free MT. While one way to address this risk is to build your own MT systems, it has also become clear to many that most DIY (Do It Yourself) systems tend to be inferior in output quality to the free generic systems. When users are aware of the quality advantage of free MT, they will often double-check their results on these "better" systems, thus defeating the purpose of private and controlled access to DIY systems. Controlled, secure MT solutions, optimized for the corporate subject domain and supplied by vendors with certified competence in this area, seem to me the most logical and cost-effective way to solve this data leakage problem. “On-premise” systems make sense for those who have IT staff available, and able, to do the ongoing management and to protect and manage MT servers at both the customer and the vendor end of the equation. Many large enterprises have this kind of internal IT competence, but very few LSPs do.

In my opinion, the vendors that can set up both on-premise systems and scalable private clouds are among the best options available in the market today. Some say that a private cloud option provides both professional IT management and verifiable data security, and is better for those with less qualified IT staff. Most MT vendors provide cloud-based solutions today, and for adaptive MT this may be the only option. Few MT vendors can do both cloud-based and on-premise deployments, and even fewer can do both competently; vendors who provide non-cloud solutions only infrequently are less likely to offer reliable and stable products. Setting up a corporate MT server that may have hundreds or thousands of users is a non-trivial affair. Like most things in life, it takes repeated practice and broad experience across multiple different user scenarios to do both on-premise and cloud solutions well. Thus, vendors with a large and broad installed base of on-premise and private cloud installations (e.g. more than 10 customers of varying types) are preferable to those who do it as an exception and have an installed base of fewer than 10 customer sites. There are two companies whose names start with S that I think best meet these requirements in terms of broad experience and widely demonstrated technical competence. As we head into a world where neural MT is more pervasive, I think private clouds will likely assume more importance and become a preferred option over having your own IT staff manage GPU, TPU, or FPGA arrays and servers on site. However, it is still wise to ask your MT vendor for complete details on the data security provisions in their cloud offering.

What seems more and more certain is that MT provides great value in keeping global enterprises actively sharing and communicating, and the need for better, more secure MT solutions has a bright future. Incidents like this latest fiasco show that broadly available MT services are valuable enough that any globally focused enterprise should explore them seriously and carefully, rather than leave naïve users to find their own way to risky “free” solutions that undermine corporate privacy and expose high-value confidential data to anyone who knows how to use a search engine or has basic hacking skills.

Tuesday, September 12, 2017

LSP Perspective: Applying the Human Touch to MT, Qualitative Feedback in MT evaluation

In all the discussion we hear about MT, we do not often hear much about the post-editors and what could be done to enhance and improve the often negatively viewed PEMT task. Lucía Guerrero provides useful insights from her direct experience in improving the work experience for post-editors. Interestingly, over the years I have noted that strategies to improve the post-editor experience can often make mediocre MT engines viable, while failure to do so can make good engines fail to fulfill the business promise. I cannot really say much beyond what Lucía says here, other than to restate it in slightly different words. The keys to success seem to be:
  1. Build trust by establishing transparent and fair compensation, and forthright work-related communication
  2. Develop ways to involve post-editors in the MT engine refinement and improvement process
  3. Demonstrate that the feedback cycle does in fact improve the work experience on an ongoing basis

Post-editing has become the most common practice when using MT. According to Common Sense Advisory (2016), more than 80% of LSPs offer Machine Translation Post-Editing (MTPE) services, and one of the main conclusions from a study presented by Memsource at the 2017 Conference of the European Association for Machine Translation (EAMT) states that less than 10% of the MT done in Memsource Cloud was left unedited. While it is true that a lot of user-generated content is machine-translated without post-editing (we see it every day at eBay, Amazon, Airbnb, to mention just a few), whether it is RBMT, SMT, or NMT, post-editors are still needed to improve the raw MT output.

Quantitative Evaluation Methods: Only Half the Picture

While these data show that linguists are key, they are often excluded from the MT process and only asked to participate in the post-editing task, with no interaction “in process.” Human evaluation is still seen as “expensive, time consuming and prone to subjectivity.” Error annotation takes a lot of time, compared to automated metrics such as BLEU or WER, which are certainly cheaper and faster. These tools provide quantitative data, usually obtained by automatically comparing the raw MT to a reference translation, but the post-editor’s evaluation is hardly ever taken into account. Shouldn’t that be important if the post-editor’s role is here to stay?
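To make concrete why such metrics are cheap but limited, here is a toy BLEU-style score of my own for illustration (clipped n-gram precision with a brevity penalty). Real BLEU implementations use 4-grams, smoothing, and careful tokenization, so treat this purely as a sketch of the idea:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def toy_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Toy BLEU-style score: clipped n-gram precision up to max_n,
    times a brevity penalty. Real BLEU uses 4-grams plus smoothing."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # floor avoids log(0)
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "the invoice is overdue"
print(round(toy_bleu("the invoice is overdue", reference), 2))  # identical: 1.0
# A fluent, accurate paraphrase scores poorly against a single reference:
print(round(toy_bleu("payment of the invoice is late", reference), 2))
```

The paraphrase in the last line is perfectly usable output, yet the metric punishes it because the surface words differ, which is precisely the kind of judgment a human evaluator gets right and the metric gets wrong.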

While machines are better than we are at spotting differences, humans are better at assessing linguistic phenomena, categorizing them and giving detailed analysis.

Our approach at CPSL is to involve post-editors in three stages of the MT process:
  • For testing an MT engine in a new domain or language combination
  • For regular evaluation of an existing MT engine
  • For creating/updating post-editing guidelines
Some companies use the Likert scale for collecting human evaluation. This method involves asking people – normally the end-users, rather than linguists – to assess raw MT segments one by one, based on criteria such as adequacy (how effectively has the source text message been transferred to the translation?) or fluency (does the segment sound natural to a native speaker of the target language?).

For our evaluation purposes, we find it more useful to ask the post-editor to fill in a form with their feedback, correlating information such as source segment, raw MT and post-edited segment, type and severity of errors encountered, and personal comments.
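Such a form maps naturally onto a simple per-segment record. The field names and example categories below are illustrative assumptions, not CPSL's actual template:

```python
from dataclasses import dataclass

@dataclass
class PostEditFeedback:
    """One row of a post-editor feedback form (illustrative fields)."""
    source: str
    raw_mt: str
    post_edited: str
    error_type: str   # e.g. "capitalization", "word order", "punctuation"
    severity: str     # e.g. "low", "medium", "high"
    comment: str = ""

row = PostEditFeedback(
    source="¿Cuándo llega el pedido?",
    raw_mt="When arrives the order?",
    post_edited="When does the order arrive?",
    error_type="word order",
    severity="high",
    comment="Question syntax is consistently wrong and tedious to fix.",
)
print(row.error_type, row.severity)
```

Collecting feedback in a structured shape like this also makes later aggregation trivial, e.g. counting which error types recur across a whole batch of segments.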

Turning Bad Experiences Into Rewarding Jobs

One of the main issues I often have to face when I manage an MT-based project is the reluctance of some translators to work with machine-translated files due to bad previous post-editing experiences. I have heard many stories about post-editors being paid based on an edit distance that was calculated from a test that was not even close to reality, or post-editors never being asked for their evaluation of the raw MT output. They were only asked for the post-edited files and, sometimes, the time spent, but just for billing purposes. One of our usual translators even told me that he received machine-translated files that were worse than Google Translate's results (NMT had not yet been implemented). All these stories have in common the fact that post-editors are seldom involved in the system improvement and evaluation process. This can turn post-editing into an alienating job that nobody wants to do a second time.

To avoid such situations, we decided to create our own feedback form for assessing and categorizing error severity and prioritizing the errors. For example, errors such as incorrect capitalization of months and days in Spanish, word order problems in questions in English, punctuation issues in French, and other similar errors were given the highest priority by our post-editors and our MT provider was asked to fix them immediately. The complexity of the evaluation document can vary according to need. It can be as detailed as the Dynamic Quality Framework (DQF) template or be a simple list of the main errors with an example.

Post Editor Feedback Form

However, more than asking for severity and repetitiveness, what I really want to know is what I call ‘annoyance level,’ i.e. what made the post-editing job too boring, tedious or time-consuming – in short, a task that could lead the post-editor to decline a similar job in the future. These are variables that quantitative metrics cannot provide. Automated metrics cannot provide any insight on how to prioritize error fixing, either by error severity level or by ‘annoyance level.’ Important errors can go unnoticed in a long list of issues, and thus never be fixed.

I have managed several MT-based projects where the edit distance was acceptable (< 30%) and yet, to my surprise, the post-editors' overall experience was still unpleasant. In such cases, the post-editors came back to me saying that certain types of errors were so unacceptable to them that they didn't want to post-edit again. Sometimes this opinion was related to severity, and other times to perception, i.e. errors a human would never make. In these cases, the feedback form helped detect the errors and turned a previously bad experience into an acceptable job.
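For readers unfamiliar with the figure quoted above, an edit distance percentage is typically a Levenshtein distance between the raw MT and the post-edited segment, normalized by segment length. A minimal sketch; the choice to normalize by the post-edited length is my assumption, and tools differ on this:

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein (edit) distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def edit_distance_pct(raw_mt: str, post_edited: str) -> float:
    """Edit distance as a percentage of the post-edited segment length."""
    return 100 * levenshtein(raw_mt, post_edited) / max(len(post_edited), 1)

pct = edit_distance_pct("When arrives the order?", "When does the order arrive?")
print(f"{pct:.0f}%")  # compare against the ~30% acceptability threshold
```

Two projects can land at the same percentage while one is far more annoying to fix than the other, which is exactly the distinction this number cannot capture.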

It is worth noting that one cannot rely on a single post-editor's feedback. The acceptance threshold can vary quite a lot from one person to another, and post-editing skills also differ. Thus, the most reasonable approach is to collect feedback from several post-editors, compare their comments, and use them as a complement to the automatic metrics.

We must definitely make an effort to include the post-editors’ comments as a variable when evaluating MT quality, to prioritize certain errors when optimizing the engines. If we have a team of translators whom we trust, then we should also trust them when they comment on the raw MT results. Personally, I always try my best to send machine-translated files that are in good shape so that the post-editing experience is acceptable. In this way, I can keep my preferred translators (recycled as post-editors) happy and on board, willing to accept more jobs in the future. This can make a significant difference not only in their experience but also in the quality of the final project.

5 Tips for Successfully Integrating Qualitative Feedback into your MT Evaluation Workflow

  1. Devise a tool and a workflow for collecting feedback from the post-editors.
It doesn’t have to be a sophisticated tool, and the post-editors shouldn’t have to fill in huge Excel files with every change and comment. It’s enough to collect the most awkward errors: the ones they wouldn’t want to fix over and over again. And if you don’t have the time to read and process all this information, a short informal phone conversation from time to time can also help and give you valuable feedback about how the system is working.

  2. Agree to fair compensation
Much has been said about this. My advice would be to rely on the automatic metrics, but to include the post-editor's feedback in your decision. I usually offer hourly rates when language combinations are new and the effort is harder, and per-word rates when the MT systems are established and have stable edit distances. When using hourly rates, you can ask your team to use time-tracking apps in their CAT tools, or ask them to report the real hours spent. To avoid last-minute surprises, for full post-editing it is advisable to indicate a maximum number of hours based on the expected PE speed and ask them to report any deviation, whereas for light post-editing you may want to indicate a minimum number of hours to make sure the linguists are not leaving anything unchecked.
  3. Never promise the moon
If you are running a test, tell your team. Be honest about the expected quality and always explain the reason why you are using MT (cost, deadline…).
  4. Don’t force anyone to become a post-editor
I have seen very good translators becoming terrible post-editors; either they change too many things or too few, or simply cannot accept that they are reviewing a translation done by a machine. I have also seen bad translators become very good post-editors. Sometimes a quick chat on the phone can be enough to check if they are reluctant to use MT per se, or if the system really needs further improvement before the next round.
  5. Listen, listen, listen
We PMs tend to provide the translators with a lot of instructions and reference material and make heavy use of email. Sometimes, however, it’s worth it to arrange short calls and listen to the post-editors’ opinion of the raw MT. For long-term projects or stable MT-based language combinations, it is also advisable to arrange regular group calls with the post-editors, either by language or by domain.
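The hour caps and floors suggested under the fair-compensation tip reduce to simple arithmetic. All speeds and volumes below are made-up illustrations, not recommended figures:

```python
def max_hours_full_pe(word_count: int, expected_speed_wph: float) -> float:
    """Cap on billable hours for full post-editing, from expected PE speed."""
    return word_count / expected_speed_wph

def min_hours_light_pe(word_count: int, max_speed_wph: float) -> float:
    """Floor on billable hours for light PE, to flag work skipped unread."""
    return word_count / max_speed_wph

words = 10_000
print(max_hours_full_pe(words, 800))    # 12.5-hour cap at 800 words/hour
print(min_hours_light_pe(words, 2000))  # 5.0-hour floor at 2000 words/hour
```

Asking post-editors to report deviations from these bounds then gives an early signal that either the engine or the estimate is off.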

And… What About NMT Evaluation?

According to several studies on NMT, the errors produced by these systems are harder to detect than those produced by RBMT and SMT, because they occur at the semantic level (i.e. meaning). NMT takes context into account and the resulting text flows naturally; we no longer see the syntactically awkward sentences we were used to with SMT. But the usual errors are mistranslations, and mistranslations can only be detected by post-editors, i.e. by people. In most NMT tests done so far, BLEU scores were low while human evaluators considered the raw MT output acceptable, which means that with NMT we cannot trust BLEU alone. Both source and target text have to be read and assessed in order to decide whether the raw MT is acceptable; human evaluators have to be involved. With NMT, human assessment is clearly even more important, so until the translation industry settles on a valid approach for evaluating NMT, qualitative information will be required to properly assess the results of such systems.


Lucía Guerrero is a senior Translation and Localization Project Manager at CPSL and has worked in the translation industry since 1998. In the past, she managed localization projects for Apple Computer and translated children’s and art books. At CPSL she specializes in international and national institutions, machine translation, and post-editing.

About CPSL

CPSL (Celer Pawlowsky S.L.) is one of the longest-established language services providers in the translation and localization industry, having served clients for over 50 years in a range of industries including: life sciences, energy, machinery and tools, automotive and transport, software, telecommunications, financial, legal, electronics, education and government. The company offers a full suite of language services – translation and localization, interpreting and multimedia related services such as voice-over, transcription and subtitling.

CPSL is among a select number of language service suppliers that are triple quality certified, including ISO 9001, ISO 17100 and ISO 13485 for medical devices. Based in Barcelona (Spain), with production centers in Madrid, Ludwigsburg (Germany), and Boston (USA), and a sales office in the United Kingdom, the company offers integrated language solutions on both sides of the Atlantic Ocean, 24/7, 365 days a year.

CPSL has been the driving force behind the new ISO 18587 (2017) standard, which sets out requirements for the process of post-editing machine translation (MT) output. Livia Florensa, CEO at CPSL, is the architect of the standard, which has just been published. As Project Leader for this mission, she played a key role, being responsible for the proposal and for drafting and coordinating the development of the standard. ISO 18587 (2017) regulates the post-editing of content processed by machine translation systems, and also establishes the competences and qualifications that post-editors must have. The standard is intended for use by post-editors, translation service providers and their clients.


The following is Storify Twitter coverage by Lucia from the EAMT conference earlier this year.

Tuesday, September 5, 2017

Neural MT Technology Looking for a Business Model

This is a guest post by Gábor Ugray, who drills into the possible rationale and potential logic of the recent DeepL NMT announcement. DeepL have provided very little detail on why, or even what, they have done, beyond some BLEU scores, so at this point all of us can only speculate about DeepL's rationale and business outlook with this limited information.

Yes, they do seem to have slightly better-scoring MT engines, but many of us know that these scores can vary from month to month and can often be completely disingenuous and misleading. Even the great "Do No Evil" Google is not immune to piling up the bullshit with MT scores. This is the way of the MT warrior, it seems, especially if you have Google lineage.

However, from my perspective, the quality of the Linguee offering and its track record set this announcement apart. Here is somebody who apparently understands how to organize translation data with metadata at large scale. I imagine they already have some understanding of what people do with TM and how subsets of big data can be re-purposed and extracted to solve unique translation problems. They know that their window into translation memory on the web has 10 billion page views, which by language industry standards is huge, but on a Google scale is just your average day. This traffic and steadily growing audience suggest to me that they have the potential to be a disruptor: the kind of disruptor who can provide a new platform that makes any translation task easier, more efficient and more effective, by providing highly integrated translation memory and MT on a new kind of workbench aimed at the millions who are just now entering the "translation market". Probably not so much at those who use words like localization, SimShip, TQA, MQM, and have sunk costs in TMS and project management systems. Like Gábor's comments, this is only my speculation. Who knows? Maybe DeepL just wants to be a new and better Google Translate.

The initiatives that are making "real money" from automated translation are generally involved in translating large volumes of content that either directly generates revenue in new markets (Alibaba, eBay, Amazon), or gives insight into what somebody is interested in across the world, across languages, so that the right products (ads) can be presented to them (Facebook, Baidu, Yandex, Google, Microsoft). Today these MT initiatives are ALREADY translating 99%+ of ALL words translated on the planet. MT in these hands is ALREADY generating huge economic value outside of the translation industry. Just because it does not show up as a line item in the industry research that localization people look at does not mean it is not happening.

It is also interesting to note that Rory Cowan of Lionbridge recently said that "Machine translation has been the classic dark horse, of course, waiting for its hour of glory." I presume Rory thinks this is true for the "translation industry", or possibly Rory was asleep at the helm while all the fast boats (Facebook, Baidu, Yandex, Google, Microsoft) sailed by and snatched the overall translation market away from him; or maybe this happened while Rory drew S-curves and drank sherry. People who are asleep often don't realize that others may stay awake and do stuff that could change their situation when they finally awaken. Rory seems not even to be aware that, unlike him, some in the translation industry have made real progress with MT. Today, SDL is already processing 20+ Billion words/month with MT, in addition to the 100 Million words/month handled via traditional human-only approaches, which are also probably assisted by MT. Is it any surprise that Rory is being replaced, and that Lionbridge never reached a market value equal to its revenue?

While Google may not care about the "translation industry" as we know it, I think we are approaching the day when we will see more evidence of disruption, as companies like DeepL, Amazon, Microsoft, Alibaba, and others in China start affecting the traditional translation business. Disruption does not generally happen in one fell swoop. It creeps up slowly, and then, when a critical mass is reached, it acquires a pretty forceful momentum as the incumbents are routed. Look at the following industries: newspaper advertising, video rental, cell phones and the "old" retail industry, to see how they changed and how disruption always creeps up on you. Here are some quotes by other S-curve enthusiasts (just for fun, let's call them Rorys), collected by CB Insights for Rory and his best friends to ponder:

“The development of mobile phones will follow a similar path to that followed by PCs,” said Nokia’s Chief Strategy Officer Anssi Vanjoki, in a German interview (translated through Google Translate). “Even with the Mac, Apple attracted a lot of attention at first, but they have remained a niche manufacturer. That will be their role in mobile phones as well.”

Microsoft CEO Steve Ballmer had this to say about the iPhone’s lack of a physical keyboard: “500 dollars? Fully subsidized? With a plan? I said that is the most expensive phone in the world. And it doesn’t appeal to business customers because it doesn’t have a keyboard. Which makes it not a very good email machine.”

RIM’s co-CEO, Jim Balsillie, wrote off the iPhone almost completely: “It’s kind of one more entrant into an already very busy space with lots of choice for consumers … But in terms of a sort of a sea-change for BlackBerry, I would think that’s overstating it.”

Here’s IBM’s chairman Louis V. Gerstner minimizing how Amazon might transform retail and internet sales all the way back in 1999. “[Amazon] is a very interesting retail concept, but wait till you see what Wal-Mart is gearing up to do,” he said. Gerstner noted that the previous year IBM’s Internet sales were five times greater than Amazon’s, and boasted that IBM “is already generating more revenue, and certainly more profit, than all of the top Internet companies combined.” The reality: today AMZN is 4X the market value of IBM, and its stock has appreciated 982% vs. 23% for IBM over the last 10 years.

 “Neither RedBox nor Netflix are even on the radar screen in terms of competition,” said Blockbuster CEO Jim "Rory" Keyes, speaking to the Motley Fool in 2008. “It’s more Wal-Mart and Apple.”

“There is no reason anyone would want a computer in their home,” said Ken "Rory" Olsen, founder of Digital Equipment Corporation, 1977.

This guy should change his name to Rory: “Our guests don’t want the Airbnb feel and scent,” said Christopher Norton, EVP of global product and operations at the Four Seasons, speaking to Fast Company a couple of years ago. He went on to explain that his customers expect a “level of service that is different, more sophisticated, detailed, and skillful.”

And again Steve "Rory" Ballmer: “Google’s not a real company. It’s a house of cards.”

The emphasis below is all mine. 


Hello! I am Neural M. Translation. I’m new around here. I am intelligent, witty, sometimes wildly funny, and I’m mostly reliable. My experiences so far have been mixed, but I haven’t given up on finding a business model for lifelong love. 

That’s the sort of thing that came to my mind when I read the reports about DeepL Translator[1], the latest contender in the bustling neural MT space. The quality is impressive and the packaging is appealing, but is all the glittering stuff really gold? And how will the company make money with it all? Where’s the business in MT, anyway? 

I don’t have an answer to any of these questions. I invite you to think together.

Update: I already finished writing this post when I found this talk (in German) by Gereon Frahling, the sparkling mind behind Linguee, from May 2017. It contains the best intro to neural networks I’ve ever heard, and sheds more light on all the thinking and speculation here. I decided not to change my text, but let you draw your own conclusions instead.

Neural MT scanning the horizon for opportunities

Scratching the DeepL surface

If you’ve never heard of DeepL before, that’s normal: the company changed its name from Linguee on August 24, just a few days before launching the online translator. The website[2] doesn’t exactly flood you with information beyond the live demo, but the little that is there tells an interesting story. First up are the charts with the system’s BLEU scores:
Source: DeepL

I’ll spare you the bickering about truncated Y axes and just show what the full diagram on the left looks like:

Source: @kvashee
This section seems to be aimed at folks who know a bit about the MT field. There are numbers (!) and references to research papers (!), but the effect is immediately undermined because DeepL’s creators have chosen not to publish their own work. All you get is the terse sentence: “Specific details of our network architecture will not be published at this time.” So while my subjective judgement confirms that the English/German translator is mighty good, I wonder why the manipulation is needed if your product really speaks for itself. Whence the insecurity?

Unfortunately, it gets worse when the text begins to boast about the data center in Iceland. What do these huge numbers even mean?

My petaFLOPS are bigger than your petaFLOPS

The neural in NMT refers to neural networks, an approach to solving computationally complex tasks that’s been around for decades. It involves lots and lots of floating-point calculations: adding and multiplying fractional numbers. Large-scale computations of this sort have become affordable only in the last few years, thanks to a completely different market: high-end graphics cards, aka GPUs. All those incredibly realistic-looking monsters you shoot down in video games are made up of millions of tiny triangles on the screen, and the GPU calculates the right form and color of each of them to account for shape, texture, light and shadow. It’s an amusing coincidence that these are just the sort of calculations needed to train and run neural networks, and that the massive games industry has been a driver of GPU innovation.

One TFLOPS or teraFLOPS simply means the hardware can do 1,000 billion floating-point calculations per second. 5.1 petaFLOPS equals 5,100 TFLOPS. NVIDIA’s current high-end GPU, the GTX 1080 Ti, can do 11 TFLOPS.

How would I build a 5.1 PFLOPS supercomputer? I would take lots of perfectly normal PCs and put 8 GPUs in each: that’s 88 TFLOPS per PC. To get to 5.1 PFLOPS, I’d need 58 PCs like that. One such GPU costs around €1,000, so I’d spend €464k there; plus the PCs, some other hardware to network them and the like, and the supercomputer costs between €600k and €700k.
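To sanity-check these numbers, here is the same back-of-the-envelope calculation in a few lines of Python (the FLOPS ratings, GPU count per PC and the €1,000 price are the assumptions stated above, not confirmed hardware details):

```python
import math

# Assumptions from the text above -- not DeepL's actual hardware.
target_tflops = 5.1 * 1000      # 5.1 PFLOPS expressed in TFLOPS
gpu_tflops = 11                 # one NVIDIA GTX 1080 Ti
gpus_per_pc = 8
gpu_price_eur = 1_000

pc_tflops = gpu_tflops * gpus_per_pc               # 88 TFLOPS per PC
pcs_needed = math.ceil(target_tflops / pc_tflops)  # 58 PCs
gpu_cost_eur = pcs_needed * gpus_per_pc * gpu_price_eur

print(pcs_needed, gpu_cost_eur)  # 58 PCs, 464,000 EUR in GPUs alone
```

Add the PCs themselves plus networking, and you land in the €600k–700k range quoted above.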

The stated reason for going to Iceland is the price of electricity, which makes some sense. In Germany, 1kWh easily costs up to €0.35 for households; in Iceland, it’s €0.055[3]. Germany finances its Energiewende largely by overcharging households, so as a business you’re likely to get a better price, but Iceland will still be 3 to 6 times cheaper.

A PC with 8 GPUs might draw 1,500W (1.5kW) when fully loaded. Assuming the machines run at 50% capacity on average, and with 8,760 hours in a year, the annual electricity bill for the whole data center will be about €21k in Iceland.
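The electricity figure checks out the same way (assuming a fully loaded 8-GPU PC draws about 1.5kW, and using the Icelandic price and fleet size from above):

```python
kw_per_pc = 1.5          # assumed draw of a fully loaded PC with 8 GPUs
avg_load = 0.5           # assumed 50% average utilization
hours_per_year = 8760
eur_per_kwh = 0.055      # Iceland
pcs = 58

annual_bill_eur = kw_per_pc * avg_load * hours_per_year * eur_per_kwh * pcs
print(round(annual_bill_eur))  # ~20,958 EUR, i.e. the ~21k figure above
```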

But is this really the world’s 23rd largest supercomputer, as the link to the Top 500 list [4] claims? If you look at the TFLOPS values, it sounds about right. But there’s a catch. TFLOPS has been used to compare CPUs, which can do lots of interesting things, but floating-point calculations are not exactly their forte. A GPU, in turn, is an army of clones: it has thousands of units that can do the exact same calculation in parallel, but only a single type of calculation at a time. Like calculating one million tiny triangles in one go; or one million neurons in one go. Comparing CPU-heavy supercomputers with a farm of GPUs is like comparing blackberries to coconuts. They’re both fruits and round, but the similarity ends there.

Business Model: Free (or cheap) Commodity MT

There are several signs suggesting that DeepL intends to sell MT as a commodity to millions of end users. The exaggerated, semi-scientific and semi-technical messages support this. The highly professional and successful PR campaign for the launch is another sign. Also, the testimonials on the website come from la Repubblica (Italy), RTL Z (Holland), TechCrunch (USA), Le Monde (France) and the like – all high-profile outlets for general audiences.

Right now there is only one way to make money from mass content on the Internet: ads. And that in fact seems to have been the company’s model so far. Here’s Linguee’s traffic history from Alexa:
With 11.5 million monthly sessions, almost 3 page views per session, and a global rank of around 2,500, this is a pretty valuable site. Two online calculators place the domain’s value at $115k and $510k.[5][6]

If you assume an average ad revenue[7] of $1 per thousand impressions (PTM), that would yield a monthly ad revenue of $34,500. This kind of mileage, however, varies greatly, and it may be several times higher, especially if the ads are well targeted. Which they may (or may not) be on a site where people go to search in a sort of dictionary.

Let’s take a generous $3 PTM. That yields an annual revenue of $1.2mln, which I’ll just pretend is the same in euros. Assuming a 3-year depreciation for the initial data center investment and adding the energy cost, you’d need an extra traffic of about 2 million sessions to break even, ignoring wages and development costs. That could be achievable, I guess.
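For what it’s worth, the ad arithmetic is easy to reproduce (sessions and page views are the Alexa-based estimates above; the PTM rates are assumptions):

```python
sessions_per_month = 11_500_000
pageviews_per_session = 3
impressions = sessions_per_month * pageviews_per_session  # 34.5M per month

monthly_at_1_ptm = impressions / 1000 * 1.0      # $34,500 per month
annual_at_3_ptm = impressions / 1000 * 3.0 * 12  # ~$1.24M per year

print(monthly_at_1_ptm, annual_at_3_ptm)
```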

Why commodity MT is a bad idea nonetheless

What speaks against this approach? Let’s play devil’s advocate.

Barrier to switch. It’s hard to make people switch from a brand (Google) that is so strong it’s become a generic term like Kleenex. Is the current quality edge enough? Can you spin a strong enough David-versus-Goliath narrative? Raise enough FUD about privacy? Get away with a limited set of languages?

Too big a gamble. Public records show that Linguee/DeepL had a €2.4mln balance sheet in 2015, up from €1.2mln the year before. Revenue numbers are not available, but the figures are roughly in line with the ballpark ad income that we just calculated. From this base, is it a sensible move to blow €700k on a top-notch data center before your product even gains traction? Either the real business model is different, or the story about the data center is not completely accurate. Otherwise, this may be a rash step to take.

Unsustainable edge. Even if you have an edge now, and you have a dozen really smart people, and you convince end users to move to your service and click on your ads, and your investment starts to pay off, can you compete with Google’s generic MT in the long run?

I think you definitely cannot. Instead of explaining this right away, let’s take a closer look at the elephant in the room.

Google in the MT Space

First, the humbling facts. From a 2016 paper[8] we can indirectly get some interesting figures about Google’s neural MT technology. One thing that stands out is the team size: there are 5 main authors, and a further 26 contributors.

Then there’s the training corpus: for a major language pair such as English/French, it’s between 3.5 billion and 35 billion sentence pairs. That’s 100-1000 times larger than the datasets used to train the state-of-the-art systems at the annual WMT challenge.[9] Systran’s 2016 paper[10] reports datasets ranging from 1 million to over 10 million segments, but billions are nowhere to be found in the text.

In this one respect, DeepL may be the single company that can compete with Google. After all, they spent the past 8 years crawling, classifying and cleansing bilingual content from the web.

Next, there is pure engineering: the ability to optimize these systems and make them work at scale, both to cope with billion-segment datasets and to serve thousands of requests per second. A small company can make a heroic effort to create a comparable system right now, but it cannot stay competitive with Google over several years. And every company in the language industry is homeopathically small in this comparison.

Third, there is hardware. Google recently announced its custom-developed TPU,[11] which is basically a GPU on steroids. Compared to the 11 TFLOPS you get out of the top-notch GTX 1080 Ti, a single TPU delivers 180 TFLOPS.[12] It’s not for sale, but Google is giving away 1,000 of these in the cloud for researchers to use for free. Your data center in Iceland might be competitive today, but it will not be in 1, 2 or 3 years.

Fourth, there is research. Surprisingly, Google’s MT team is very transparent about their work. They publish their architecture and results in papers, and they helpfully include a link to an open-source repository[13] containing the code and a tutorial. You can grab it and do it yourself, using whatever hardware and data you can get your hands on. So, you can be very smart and build a better system now, but you cannot out-research Google over 1, 2 or 3 years. This is why DeepL’s hush-hush approach to technology is particularly odd.

But what’s in it for Google?

It’s an almost universal misunderstanding that Google is in the translation business with its MT offering. It is not. Google trades in personal information about its real products: you and me. It monetizes it by selling to its customers, the advertisers, through a superior ad targeting and brokering solution. This works to the extent that people and markets move online, so Google invests in growing internet penetration and end user engagement. Browsers suck? No problem, we build one. There are no telephone cables? No problem, we send balloons. People don’t spend enough time online because they don’t find enough content in their language? No problem, we give them Google Translate.
If Google wants to squash something, it gets squashed.[14] If they don’t, and even do the opposite by sharing knowledge and access freely, that means they don’t see themselves competing in that area, but consider it instead as promoting good old internet penetration. 

Sometimes Google’s generosity washes over what used to be an entire industry full of businesses. That’s what happened to maps. As a commodity, maps are gone. Cartography is now a niche market for governments, military, GPS system developers and nature-loving hikers. The rest of us just flip open our phones; mass-market map publishers are out of the game.

That is precisely your future if you are striving to offer generic MT.

Business Model: Professional Translation

If it’s not mass-market generic translation, then let’s look closer to home: right here at the language industry. Granted, there are still many cute little businesses around that email attachments back and forth and keep their terminology in Excel sheets. But generally, MT post-editing has become a well-known and accepted tool in the toolset of LSPs, next to workflow automation, QA, online collaboration and the rest. Even a number of adventurous translators embrace MT as one way to be more productive.

The business landscape is still pretty much in flux, however. We see some use of low-priced generic MT via CAT tool plugins. We see MT companies offering customized MT that is more expensive, but still priced on a per-word, throughput basis. We see MT companies charging per customized engine, or selling solutions that involve consultancy. We see MT companies selling high-ticket solutions to corporations at the $50k+ level where the bulk of the cost is customer acquisition (golf club memberships, attending the right parties, completing 100-page RFPs etc.). We see LSPs developing more or less half-baked solutions on their own. We see LSPs acquiring professional MT outfits to bring competence in-house. We even see one-man-shows run by a single very competent man or woman, selling to adventurous translators. (Hello, Terence! Hello, Tom!)

There are some indications that DeepL might be considering this route. The copy on the teaser site talks about translators, and the live demo itself is a refreshingly original approach that points towards interactive MT. You can override the initial translation at every word, and if you choose a different alternative, the rest of the sentence miraculously updates to complete the translation down a new path.

The difficulty with this model is that translation/localization is a complex space in terms of processes, tools, workflows, and interacting systems. Mike Dillinger showed this slide in his memoQfest 2017 keynote[15] to illustrate how researchers tend to think about MT:
He immediately followed it up with this other slide that shows how MT fits into a real-life translation workflow:
Where does this leave you, the enthusiastic MT startup that wants to sell to the language industry? It’s doable, but it’s a lot of work that has nothing to do with the thing you truly enjoy doing (machine translation). You need to learn about workflows and formats and systems; you need to educate and evangelize; you need to build countless integrations that are boring and break all the time. And you already have a shark tank full of competitors. Those of us who are in it think the party is good, but it’s not the kind of party where startup types tend to turn up.

What does that mean in numbers? I looked at the latest available figures of Systran.[16] They are from 2013, before the company was de-listed from the stock exchange. Their revenue then was $14.76mln, with a very flatly rising trend over the preceding two years. If your current revenue is around €2mln, that’s a desirable target, but remember that Systran is the company selling high-ticket solutions through a sales force that stays in expensive hotels and hangs out with politicians and CEOs. That sort of business is not an obvious match for a small, young and hungry internet startup.

By far the most interesting scenario, for me at least, is what companies like eBay or Booking.com are doing. They are not in the translation business, and they both have internal MT departments.[17][18] The funny thing is, they are not MTing their own content. Both companies are transaction brokers, and they translate the content of their customers in order to expand their reach to potential buyers. With eBay it’s the product descriptions on their e-commerce site; with Booking.com, it’s the property descriptions.

For these companies, MT is not about “translation” in the traditional sense. MT is a commercial lubricant.

The obvious e-commerce suspect, Amazon, has also just decided to hop on the MT train.[19] The only surprising thing about that move is that it’s come so late. The fact that they also intend to offer it as an API in AWS is true Amazon form; that’s how AWS itself got started, as a way to monetize idle capacity when it’s not Black Friday o’clock.

So we’ve covered eBay and Amazon. Let’s skip over the few dozen other companies with their own MT that I’m too ignorant to know about. What about the thousands of smaller companies for whom MT would also make sense, but that are not big enough to profitably cook their own stew?

That’s precisely where an interesting market may be waiting for smart MT companies like DeepL.

Speed Dating Epilogue

Will Neural M. Translation, the latest user to join TranslateCupid, find their ideal match in an attractive business model? The jury is still out.

If I’m DeepL, I can go for commoditized generic MT, but at this party, I feel like a German Shepherd that accidentally jumped into the wrong pool, one patrolled by a loan shark. It feels so wrong that I even get my metaphors all mixed up when I talk about it.

I can knock on the door of translation companies, but they mostly don’t understand what I’m talking about in my pitch, and many of them already have a vacuum cleaner anyway, thank you very much.
Maybe I’ll check out a few interesting businesses downtown that are big enough so they can afford me, but not so big that the goons at the reception won’t even let me say Hi.

Will it be something completely different? I am very excited to see which way DeepL goes, once they actually reveal their plans. Nobody has figured out a truly appealing business model for MT yet, but I know few other companies that have proven to be as smart and innovative in language technology as Linguee/DeepL. I hope to be surprised.

Should we brace for the language industry’s imminent “disruption?” If I weren’t so annoyed by this brainlessly repeated question, I would simply find it ridiculous. Translation is a $40bln industry globally. Let’s not talk about disruption at the $1mln or even $100mln level. Get back to me when you see the first business model that passes $1bln in revenues with a robust hockey-stick curve.
Or, maybe not. If machine learning enables a new breed of lifestyle companies with a million bucks of profit a year for a handful of owners, I’m absolutely cool with that.




  Gábor Ugray is co-founder of Kilgray, creators of the memoQ collaborative translation environment and TMS. He is now Kilgray’s Head of Innovation, and when he’s not busy building MVPs, he blogs at and tweets as @twilliability.

Monday, August 28, 2017

The Evolution in Corpus Analysis Tools

This is a guest post by Ondřej Matuška, the Sales & Marketing Manager of Lexical Computing, a company that develops a corpus and language data analysis product called Sketch Engine

I was first made aware of Sketch Engine by Jost Zetzsche's newsletter (276th Edition of the Tool Box) a few weeks ago. As relatively clean text corpora proliferate and grow in data volume, it becomes necessary to use new kinds of tools to understand this huge volume of text data, which may or may not be under consideration for translation. These new tools help us accurately profile the most prominent linguistic patterns in large collections of textual language data and extract useful knowledge from these new corpora to help in many translation-related tasks. For those of us in the MT world, there have always been student-made tools (mostly built by graduate students in NLP and computational linguistics programs) that were used to understand the corpus, to devise better MT development strategies, and to get text data ready for machine learning training processes. Most of these tools would be characterized as not being "user-friendly", or to put it more bluntly, as being too geeky. As we head into the world of deep learning, the need for well-understood data, whether used for training or to leverage any translation task, can only grow in importance.

Despite the hype, we should understand that deep learning algorithms are increasingly going to be viewed as commodities. It's the data where the real value is. 

I am often asked what kinds of tools translators should learn to use in the future, and I generally feel that they should stay away from Moses and other MT development toolkits like Tensorflow, Nematus and OpenNMT, and focus on the data analysis and preparation aspects, since this ability would add value to any data-driven machine learning approach. Something worth remembering is that, despite the hype, deep learning algorithms are commodities. It's the data that's the real value. These MT deep learning development tools (algorithms) are likely to evolve rapidly in the near term, and we can expect that only the most capable and well-funded groups will be able to keep up with the latest developments. (How many LSPs do you think have tried all four open source NMT platforms? Or know what CNN is? My bet is that only SDL has.) Even academics complain about the rate of change and new developments in Neural MT algorithmic research, and thus LSPs and translators are likely to be at a clear disadvantage in pursuing Neural MT model development. Preparing data for machine learning processes will become an increasingly important and strategic skill for those involved with business translation work. This would mean that the following skills would be valuable, IMO; they are all somewhat closely linked in my mind:
  • Corpus Analysis & Profiling Tools like Sketch Engine
  • Corpus Modification Tools i.e. Advanced Text Editors, TextPipe and other editors that enable pattern level editing on very large (tens of millions of sentences)  text data sets
  • Rapid Error Detection & Correction Tools to go beyond traditional conceptions of PEMT
  • MT Output Quality Assessment Methodology & Tools
  • Training Data Manufacturing capabilities that evolve from a deeper understanding of the source and TM corpus enabled by tools like Sketch Engine.
These are all essential tools for undertaking the 5 million and 100+ million word translation projects that are likely to become much more commonplace in future. Clearly, many translators will want nothing to do with this kind of work, but as MT use expands, these kinds of tools and skills become much more valuable, and many would argue that understanding patterns in linguistic big data also has great value for any kind of translation task.

Jost Zetsche has provided a nice overview of what Sketch Engine does below:
  • Word sketches: This is where the program got its name, and it's what Kilgarriff (co-founder) brought to the table. A word sketch is a summary of a word's grammatical and collocational behavior (collocational refers to the analysis of how often a word co-occurs with other words or phrases). Since the data in the corpora is lemmatized (i.e., words are analyzed so they can be brought back to their base or dictionary form), the results are a lot more meaningful than what most of our translation environment tools provide when they're unable to relate different forms of one word to each other. Another word sketch option that Sketch Engine offers is the comparison of word sketches of similar words.
  • Thesaurus: The ability to retrieve a detailed list or a graphical word cloud with similar words, including links to create reports on word sketch differences for those terms to understand the exact differences in actual usage.
  • Concordance: Searches for single words, terms, or even longer phrases. Since the data in the supported languages is tagged, it's also possible to search for specific classes of words or specific classes of words that surround the word in question.
  • Parallel corpus: Retrieval of bilingual sets of words or phrases within the contexts. Presently this is available only for on-screen data viewing, but it will soon be offered as downloadable data. This is especially helpful when uploading your own translation memories (see below).
  • Word lists: The possibility of creating lists of words and the number of occurrences, either as lemmas (the base form of each word) or in each word form.
  • Creating your own corpus: For translators, this likely is the most exciting feature. You can either upload your own translation memories or you can use the tool's own search engine mechanism (which relies on Microsoft Bing) to create a list of bilingual websites that contain the terms that are relevant to your field. You can download many websites containing certain terms to build a corpus. However, you cannot have them automatically align with a translated version of that website through Sketch Engine. You can perform any of the functions mentioned earlier but it is also possible to run a keyword search on the user-created corpus, identify the terms that are relevant, and download that into an Excel or TBX file. This feature presently is available for Czech, Dutch, English, French, German, Chinese, Italian, Japanese, Korean, Polish, Portuguese, Russian, and Spanish. The bilingual version of this is just around the corner.
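To make the word-sketch idea above a little more concrete: at its core it rests on counting which words co-occur with a node word. A toy version of that co-occurrence count might look like the sketch below (ignoring lemmatization and grammatical relations, which are exactly what make a real word sketch so much more informative than raw windows):

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count words appearing within +/- `window` tokens of `node` --
    a crude stand-in for the co-occurrence counts behind a word sketch."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

tokens = "strong tea and strong coffee go well with strong opinions".split()
print(collocates(tokens, "strong", window=1).most_common(3))
```

On real corpora the counts are lemmatized and weighted by association scores rather than raw frequency, which is what turns a list like this into a meaningful collocation profile.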
Many years ago I thought that the evolution from TM to other "more intelligent" language data analysis and manipulation tools would happen much faster, but things change slowly in a highly fragmented industry like translation. I think tools like Sketch Engine, together with much more compelling MT capabilities, finally signal that a transition is now beginning, and it could potentially build momentum. 

P.S. Interestingly, the day after I published this the ATA also published a post on Corpus Analysis that focuses on open source tools.

As almost always the emphasis below is mine.


Deploying NLP and Text Corpora in Translation

Natural Language Processing (NLP) is a discipline which has lots to offer to translators and translation, yet translation rarely makes use of the possibilities. This might be partly due to the fact that NLP tools are difficult to use without a certain level of IT skills. This is what the Sketch Engine team realized 13 years ago when they built Sketch Engine, a tool which makes NLP technology accessible to anyone. Sketch Engine started as a corpus query and corpus management tool which has over time developed a variety of features that address the needs of new users from outside the linguistic camp, such as translators.

Term Extraction

Term extraction is the first area where NLP can become extremely useful. The traditional approach tends to be n-gram based, an n-gram being a sequence of any n words. In a nutshell, a term extraction tool finds the most frequent n-grams in the text and presents them to the user as term candidates. The user then proceeds to the next step: manual cleaning. It is not uncommon to receive a list which contains more non-terms than terms, which is why manual cleaning became a natural part of the workflow. Some term extraction tools introduced lists of stop words, and the user can even indicate whether a word is a hard stop word or is allowed only in certain positions within the term. While this led to improvement, the output still contains a lot of noise, and manual cleaning remains a vital step in the process.
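The traditional approach described above can be sketched in a few lines of Python. This is an illustrative toy, not Sketch Engine's algorithm; the stop-word list and sample text are invented:

```python
# Minimal sketch of traditional n-gram term extraction: count frequent
# n-grams and filter out candidates that start or end with a stop word.
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "is", "in"}

def ngram_candidates(tokens, n=2, min_freq=2):
    """Return frequent n-grams that do not start or end with a stop word."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [(" ".join(g), c) for g, c in counts.most_common()
            if c >= min_freq and g[0] not in STOP_WORDS and g[-1] not in STOP_WORDS]

text = ("the shutter speed and the aperture control exposure "
        "a fast shutter speed freezes motion the aperture controls depth of field").split()
print(ngram_candidates(text, n=2))  # → [('shutter speed', 2)]
```

Even on this tiny sample, the weakness is visible: frequency and stop words alone cannot tell a genuine term from a merely frequent word pair, which is why manual cleaning is unavoidable with this approach.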

At Sketch Engine, we decided to direct efforts towards term extraction with a view to achieving much cleaner results by exploiting our NLP tools and our multibillion-word general text corpora.

The main difference between Sketch Engine and traditional term extraction tools is that each text uploaded to Sketch Engine is tagged and lemmatized. The system thus knows whether a word is a verb, noun, adjective, etc., and also knows which words are declined or conjugated forms of the same base form, called the lemma. Sketch Engine can look separately for 'work' as a noun and 'work' as a verb, and can also treat different forms of nouns (cases, plural/singular) or verbs (tenses, participles) as the same word if required. This is exactly what the term extraction exploits.
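A toy illustration of what tagging and lemmatization buy you: different surface forms are counted as one lemma, and the same form can be a noun or a verb depending on context. The mini-lexicon below is invented for the example; a real system derives this from a trained tagger:

```python
# Map (surface form, part of speech) to a lemma, then count occurrences of a
# lemma restricted to one part of speech -- e.g. 'work' as a verb only.
LEXICON = {
    ("work", "NOUN"): "work", ("works", "NOUN"): "work",
    ("works", "VERB"): "work", ("worked", "VERB"): "work",
    ("working", "VERB"): "work",
}

def count_lemma(tagged_tokens, lemma, pos):
    """Count tokens whose lemma and part of speech both match."""
    return sum(1 for tok, tag in tagged_tokens
               if LEXICON.get((tok, tag)) == lemma and tag == pos)

tagged = [("works", "VERB"), ("work", "NOUN"), ("worked", "VERB"), ("works", "NOUN")]
print(count_lemma(tagged, "work", "VERB"))  # → 2
print(count_lemma(tagged, "work", "NOUN"))  # → 2
```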

For each language with term extraction support (16 languages as of August 2017), we developed definitions telling Sketch Engine what a term in that language can look like. For example, Sketch Engine knows that a term in English will most likely take the form of (noun+)noun+noun or adjective+noun while in Spanish, most likely, noun+adjective(+adjective) or noun+de+noun. The full rules are more complex than listed here. This will immediately disqualify any phrases that contain a verb or do not contain a noun at all.
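The format check can be sketched as a whitelist of part-of-speech patterns, in the spirit of the English rules mentioned above. The patterns and examples here are a simplified stand-in; as the article notes, the real rules are more complex:

```python
# Filter term candidates by part-of-speech pattern: only phrases matching an
# allowed tag sequence (e.g. noun+noun, adjective+noun) survive.
ENGLISH_TERM_PATTERNS = [
    ("NOUN", "NOUN"),
    ("NOUN", "NOUN", "NOUN"),
    ("ADJ", "NOUN"),
]

def matches_term_pattern(tagged_phrase):
    """True if the phrase's tag sequence is an allowed term shape."""
    tags = tuple(tag for _, tag in tagged_phrase)
    return tags in ENGLISH_TERM_PATTERNS

print(matches_term_pattern([("shutter", "NOUN"), ("speed", "NOUN")]))   # → True
print(matches_term_pattern([("freezes", "VERB"), ("motion", "NOUN")]))  # → False
```

A phrase containing a verb, or no noun at all, fails this check immediately, which is how whole classes of n-gram noise disappear before any frequency counting happens.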

In addition to the format of the phrase, Sketch Engine also makes use of its enormous general text corpora, which it uses to check whether a phrase that passed the format check is more frequent in the text in question than in general language. During this check, each phrase is treated as one unit, and occurrences of the same phrase are found and counted in the general text and compared. Lemmatization plays an important role here, so that plurals and singulars or different cases can be counted as the same phrase. The combination of the format check and frequency comparison leads to exceptionally clean results. Here are the term candidates extracted from texts about photography: no manual cleaning was applied; the list is presented exactly as it comes out of Sketch Engine.
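The frequency-comparison step can be sketched as a simple relative-frequency ratio: a phrase is a stronger term candidate the more frequent it is, per million words, in the specialized text than in the general reference corpus. The smoothing constant and the frequencies below are illustrative, not Sketch Engine's actual numbers:

```python
# Compare a phrase's normalized frequency in a small specialized text against
# a huge general reference corpus; a high ratio suggests a domain term.
def keyness(freq_focus, size_focus, freq_ref, size_ref, smooth=1.0):
    """Ratio of frequencies per million words, smoothed to avoid div-by-zero."""
    fpm_focus = freq_focus / size_focus * 1_000_000
    fpm_ref = freq_ref / size_ref * 1_000_000
    return (fpm_focus + smooth) / (fpm_ref + smooth)

# 'shutter speed': common in a photography text, rare in general language
print(keyness(50, 100_000, 1_200, 30_000_000_000))
# 'last week': common everywhere, so the ratio stays near 1
print(keyness(12, 100_000, 4_000_000, 30_000_000_000))
```

The domain phrase scores in the hundreds while the everyday phrase scores near 1, which is the intuition behind a reference-corpus comparison.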

The quality of extraction can be checked immediately by using the new dedicated term extraction interface to Sketch Engine, called OneClick Terms.

Overall Language Quality

While a great deal of the translation business revolves around terminology, it is not the terms themselves that make up the majority of a text. There is a lot of language in between, which may not always be completely straightforward to translate. Translators are used to working with concordances in their CAT tools, where it is the translation memory (TM) that serves as the source of data. The TM is sufficient for terminology work but might not be as useful for the language in between: TMs are usually rather small, so the concordance does not find enough occurrences to judge which usage is typical. This is where general text corpora come in handy. The word ‘general’ refers to the fact that these corpora were designed to contain the largest possible variety of text types and topics. A general text corpus will, therefore, contain even very specialized texts heavy in terminology as well as common neutral text from various sources. Sketch Engine contains multibillion-word corpora in many languages. The largest corpus is English, with a size of 30 billion words, that is 30,000,000,000!

Languages with a corpus of 500+ million words (corpus size in millions of words)


English                33,100
German                 19,900
Russian                18,300
French                 12,400
Spanish                11,000
Japanese               10,300
Polish                  9,700
Arabic                  8,300
Italian                 5,900
Czech                   5,100
Catalan                 4,800
Portuguese              4,600
Turkish                 4,100
Swedish                 3,900
Hungarian               3,200
Romanian                3,100
Dutch                   3,000
Ukrainian               2,700
Danish                  2,400
Chinese (simplified)    2,100
Chinese (traditional)   2,100
Greek                   2,000
Norwegian               2,000
Finnish                 1,700
Croatian                1,400
Slovak                  1,200
Hebrew                  1,100
Slovenian               1,000
Lithuanian              1,000
Hindi                     900
Bulgarian                 800
Latvian                   700
Estonian                  600
Serbian                   600
Korean                    600
Persian                   500
Maltese                   500


A corpus of this size will return thousands of hits for most words or phrases, and millions in the case of frequent ones. Such a concordance is impossible for a human to process. This is why we developed an advanced feature, called the word sketch, that copes with this amount of information and presents the results in a compact and easy-to-understand format. The word sketch is a one-page summary of the word combinations (collocations) that a word enters into. It gives the user an instant idea of how the word should be used in context. The collocations are presented in groups reflecting syntactic relations. An example of a word sketch might look like this:

Two million occurrences of ‘contract’ were found in the corpus and processed into the summary of collocations above, which the user can take in within seconds. It gives a clear picture of which adjectives or verbs are the word's typical collocates, allowing the user to use the word as naturally as a native speaker would. This information is computed automatically, without any manual intervention, meaning that the user can generate it for any word in the language, including rare words. Large corpora are highly recommended for getting information this rich; a reasonable minimum size is around 1 billion words. A smaller corpus will also produce a word sketch, but not with as much information, and a corpus below 50 million words is not likely to produce anything useful, especially for less frequent words. The largest preloaded corpora in Sketch Engine are recommended for use with the word sketch.
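The ranking behind a word sketch rests on a collocation association score; Sketch Engine's published measure is logDice, defined as 14 + log2(2·f(x,y) / (f(x) + f(y))). A minimal version with invented frequencies for 'sign' + 'contract' (the real computation runs over millions of syntactically parsed relations, not a single pair):

```python
# logDice association score for a collocation: f_xy is the frequency of the
# pair, f_x and f_y the frequencies of the two words on their own.
import math

def log_dice(f_xy, f_x, f_y):
    """Collocation strength; the theoretical maximum is 14."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))

# Invented counts: f('sign contract')=8,000, f('sign')=900,000, f('contract')=2,000,000
print(log_dice(8_000, 900_000, 2_000_000))
```

A useful property of logDice, and one reason it suits corpora of very different sizes, is that the score does not depend on the total corpus size, only on the relative frequencies of the pair and its parts.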

Word choice - Thesaurus

I am sure everyone has been in a situation where they want to say something but the right word will not spring to mind. One can usually think of a similar word, just not the right one. This is when a thesaurus is useful. Traditional printed, hand-made thesaurus content is limited by space or money, and often both. The combination of NLP and distributional semantics has led to algorithms that can generate thesaurus entries automatically. The idea of a computer identifying similar words by computation often meets with skepticism, but the results are surprisingly usable. How does an algorithm discover words similar in meaning? Distributional semantics claims that words which appear in similar contexts are also similar in meaning. Therefore, to find a synonym for a noun, Sketch Engine compares the word sketches of all nouns found in the corpus; the ones with the most similar word sketches are identified as synonyms or similar words. Here is an example of what Sketch Engine will offer if you need a word similar to authorization:

The synonyms are sorted by a similarity score calculated from the similarity of each word's word sketch. The top of the list (the first column) is the most valuable. The list contains some words which are not very good synonyms; they appear because the collocations they form are similar to those of authorization. The list nevertheless remains very useful, because the thesaurus functionality will be used by somebody with a decent knowledge of the language, and these words serve as suggestions from which the user picks the most suitable one.

For words which cannot have synonyms, the thesaurus will produce a list of words belonging to the same category or the same topic. This is the thesaurus for stapler:


This type of thesaurus entry might help recall a word from the same category.

Examples in Context - Concordance

Sketch Engine also features a concordance with both simple and complex search options, where the user can search their own texts as well as the preloaded corpora. The options allow searching by the exact text typed, by lemma (the base form of a word, which will also find all derived forms), or with restrictions by part of speech or grammatical categories such as verb tense. It even allows searching for lexical or grammatical patterns without specifying concrete words. One interesting concordance shows examples of sequences of nouns joined by the preposition of. This is something I actually had to look up recently, to check how many of’s I can use in a row. While the concordance itself did not answer the question directly, I could see that it is normal to use three of’s as long as the expression consists of numbers and units of measurement, which is how I had originally used it in my sentence, so the concordance helped me check that I was right.
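A toy keyword-in-context (KWIC) concordancer in this spirit can be sketched in a few lines of Python. The corpus sentence and regex are invented for the of-sequence example, and Sketch Engine itself uses its own corpus query language (CQL) rather than Python regexes:

```python
# Minimal KWIC concordance: find every match of a pattern and show it with a
# window of surrounding context on each side.
import re

def kwic(text, pattern, width=15):
    """Return matches of `pattern` with `width` characters of context."""
    lines = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"...{left}[{m.group()}]{right}...")
    return lines

corpus = "divers worked at a depth of tens of metres of water near the wreck"
# a rough word-level stand-in for "three nouns joined by 'of'"
for line in kwic(corpus, r"\w+ of \w+ of \w+ of \w+"):
    print(line)
```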

Translation Lookup - Parallel Corpora

Sketch Engine also contains parallel multilingual corpora which can be used for translation lookup. Again, both simple and complex search criteria can be applied to both the first and the second language. This makes it possible for the user to learn about situations where a word is not translated by its most obvious equivalent. For example, this search looks for the word vehicle in English and matching Spanish segments not containing vehículo, to discover cases where it might need to be translated differently.
This is especially valuable to users who do not have a TM, or whose TM is not large enough to provide the required coverage. Users with a TM can upload it to Sketch Engine to gain access to the advanced search tools.
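The vehicle/vehículo query described above can be sketched over a list of aligned segment pairs. The pairs below are invented examples, not real corpus data:

```python
# Find aligned segment pairs where the English side contains 'vehicle' but the
# Spanish side does not contain 'vehículo' -- i.e. non-obvious translations.
pairs = [
    ("the vehicle was parked outside", "el vehículo estaba aparcado fuera"),
    ("an armoured vehicle blocked the road", "un blindado bloqueó la carretera"),
    ("emergency vehicle access only", "acceso exclusivo para vehículos de emergencia"),
]

hits = [(en, es) for en, es in pairs
        if "vehicle" in en and "vehículo" not in es]
for en, es in hits:
    print(f"{en}  |  {es}")
```

Only the second pair survives the filter: the first and third both use vehículo (or its plural) on the Spanish side, while 'un blindado' is exactly the kind of non-obvious equivalent this query is designed to surface.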

Building Specialized Domain Corpora

Sketch Engine has a built-in tool for automated corpus building; the user does not need any technical knowledge to build a corpus. It is enough to upload your own data (texts, documents), and if you do not have any suitable data, Sketch Engine will automatically find texts on the internet, download them, and convert them into a corpus. It only takes minutes to build a 100,000-word specialized corpus.

The first option is obvious: the user uploads their texts and documents, Sketch Engine lemmatizes and tags them, and the corpus is ready.

If the user has no suitable texts, or their length is insufficient, the user can provide a few keywords that define the topic. For example, the keywords that define tooth care could be: tooth, gums, cavity, care. Sketch Engine uses these keywords to create web search queries and submits them to Bing. Bing finds pages which correspond to the searches and returns the URLs to Sketch Engine, where the content of the URLs is downloaded, cleaned, tagged, lemmatized, and converted into a corpus. The whole procedure only takes a few minutes. This is a great tool for anyone who needs a reliable sample of specialized language to explore how terms and phrases are used correctly and naturally.
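The query-generation step of this keyword-to-corpus approach can be sketched by combining the seed keywords into tuples, in the style of the BootCaT method on which web corpus building tools of this kind are based. This sketch only generates the queries; the actual search, download, and cleaning stages are left out:

```python
# Turn a set of seed keywords into web search queries by taking every
# combination of a fixed tuple size -- each combination becomes one query.
from itertools import combinations

def make_queries(keywords, tuple_size=3):
    """Generate one search query per keyword combination."""
    return [" ".join(combo) for combo in combinations(sorted(keywords), tuple_size)]

seeds = ["tooth", "gums", "cavity", "care"]
for q in make_queries(seeds):
    print(q)
```

Four seeds taken three at a time yield four distinct queries, each biased toward pages that genuinely cover the topic rather than pages that merely mention one keyword.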

Free Sketch Engine trial

A free 30-day Sketch Engine trial, giving access to the complete functionality and preloaded corpora in many languages, is available from the Sketch Engine website:


Ondřej Matuška - Sales and Marketing Manager

Ondřej oversees sales and marketing activities and external communication. He is the main point of contact for anyone seeking information about Sketch Engine and is also keen to support existing users so that they can make the most of Sketch Engine.