Pages

Thursday, May 18, 2017

Terminology Usage in Machine Translation

This is a guest post by Christian Eisold from berns language consulting. For many who have been using MT seriously already, it is clear that efforts made to implement the use of the correct terminology are a very high-value effort, as this is an area that is also problematic with a purely human translation effort. In fact, I believe that there is evidence that suggests MT (when properly done) outputs much more consistent translations in terms of consistent terminology. In fact, in most expert discussions on the tuning of MT systems for superior performance, there is a very clear understanding that better output starts with work focused on ensuring terminological consistency and accuracy.  The way to enforce terminological consistency and accuracy has been well known to RBMT practitioners and is also well understood by expert SMT practitioners. Christian provides an overview of how this is done across different use scenarios and vendors below. 

He also points to some of the challenges of implementing correct terminology in NMT models, where the system controls are just beginning to be understood.  Patents are a domain where there is a great need for terminology work and is also a domain that has a huge vocabulary, which is supposedly a weakness of NMT. However, given that the WIPO is using NMT for their Japanese and Chinese patent translations, and seeing the many options available in the SYSTRAN PNMT framework, I think we may have already reached a point where this is less of an issue when you work with experts. NMT has a significant amount of research that is driving ongoing improvements, thus we should expect to see it continually improve over the next few years.

As is the case with all the new data-driven approaches, we should understand that investments in raising data quality i.e. building terminological consistency and rich, full-featured termbases, will have a long-term productivity yield and allow long-term leverage. Tools and processes that allow this to happen, are often more important in terms of the impact they might have than whether you use RBMT, SMT or NMT.





ter·mi·nol·o·gy
ˌtərməˈnäləjē/
noun
noun: terminology; plural noun: terminologies
  1. the body of terms used with a particular technical application in a subject of study, theory, profession, etc.

    "the terminology of semiotics"

    synonyms:phraseology, terms, expressions, words, language, lexicon, parlance, vocabulary, wording, nomenclature; More
    informallingo, -speak, -ese

    "medical terminology"



----------------------------------------------------------



Currently, the world of machine translation is struck by the impact of deep learning techniques which deal with the matter of optimizing networks between the sentences in a training set for a given language pair. Neural MT (NMT) has made its way from a few publications in 2014 up to the current state, where several practical applications of professional MT services use NMT in a more and more sophisticated way. The method, which is thought to be a revelation in the field of MT for developers and users alike, does hold many advantages compared to classical statistical MT. Besides the improvements like increased fluency, which is obviously a huge step in the direction of human-like translation output, it is well known that NMT has its disadvantages when it comes to the usage of terminology. The efforts made in order to overcome these obstacles show the crucial role that terminology plays in MT customization. At berns language consulting (blc) we are concerned with and focus on terminology management as a basis for language quality optimization processes on a daily basis. We sense that there is a growing interest for MT integration in various fields of businesses, so we took a look into current MT systems and the state of terminology integration in these systems.

Domain adaptation and terminology


Terminology usage is closely tied to the field of domain adaptation, which is a central concept of engine customization in MT. Domain being a special subject area that is identified by a distinctive set of concepts expressed in terms and namings respectively, using terminology is the key to adaptation techniques in the existing MT paradigms. The effort that has to be undertaken to use terminology successfully differs greatly from paradigm to paradigm. Terminology integration can take place at different steps of the MT process and is applied on different stages of MT training.

Terminology in Rule-Based MT


In rule-based MT (RBMT) specific terminology is handled within separate dictionaries. In applications for private end users, the user can decide to use one or several dictionaries at once. If just one dictionary is used, the term translation depends on the number of target candidates in the respective entry. For example, the target candidates for the German word ‘Fehler’ could be ‘mistake’ or ‘error’ when a general dictionary is used. If the project switches to the use of a specialized software dictionary, the translation for ‘Fehler’ would most likely be ‘bug’ (or ‘error’ if the user wants it to be an additional translation). If more than one dictionary is used, they can be ranked to get translations from the higher ranked dictionaries first if the same source entries are present in more than one of them. Domains can be covered by a specific set of terms gathered from other dictionaries available and by adding own entries to existing dictionaries. For them to function properly with the linguistic rules, it is necessary to choose from a variety of morpho-grammatical features that define the word. While most of the features regarding nouns can be chosen with a little knowledge of case morphology and semantic distinctions, verb entries can be hard to grasp with little knowledge in the verbal semantics.

While RBMT can be sufficient to translate in general domains like medicine, its disadvantages lie in the incapability to adapt to new or fast-changing domains appropriately. If it comes to highly sophisticated systems, tuning the system to a new domain requires trained users and an enormous amount of time. Methods that facilitate fast domain adaption in RBMT are terminology extraction and the import of existing termbases. While these procedures will help the fast integration of new terminology, for the system to get sufficient output quality it is necessary to check new entries for correctness.



Terminology in SMT


Basically, there are 2 ways of integrating terminology in SMT: 1) prior to translation and 2) at runtime. Integration before translation is the preferred way and can be done, again, in two ways: 1) implicitly by using terminology in the writing process of the texts to be used for training and 2) explicitly by adding terms to the training texts after the editing process.

Implicit integration of terminology is the standard case which is simply a more or less controlled byproduct of the editing process before texts are qualified for MT usage. Because SMT systems rely on distributions of source words to target words, it is necessary for the source words and the target words to be used in a consistent manner. If you think of translation projects with multiple authors, the degree of term consistency strongly depends on the usage of tools that are able to check terminology, namely authoring tools that are linked to a termbase. Because the editing process normally follows the train of thought in non-technical text types, there is no control of the word forms used in a text. To take advantage of the terminology at hand, the editing process would need a constant analysis of word form distribution as a basis for completing inflectional paradigms of every term in the text. Clearly, such a process would break the natural editing workflow. In order to achieve compliance with the so-called controlled language in the source texts, authors in highly technical domains can rely on the help of authoring tools together with termbases and an authoring memory. Furthermore, contexts would have to be constructed in order to embed missing word forms of the terms in use. In order to do a fill-up of missing forms, it´s necessary to do an analysis of all the word forms used in the text, which can be done by applying lemmatizers on the text. In the translation step, target terms can be integrated with the help of CAT tools, which is the standard way of doing translations nowadays.

Explicit integration, on the other hand, is done by a transfer of termbase entries into the aligned training texts. This method simply consists of the appendix of terms at the end of the respective file for each language involved in the engine training. Duplication of entries lead to higher probabilities for a given pair of terms, but in order to be sure about the effects of adding terms to the corpus, one would have to do an analysis of term distributions in the corpus, as they will interfere with the newly added terms. The selection of term translation also depends on the probability distributions in the language model. If the term in question was not seen in the language model, it’ll get a very low probability. Another problem with that approach concerns inflection. As termbase entries are nominative singular forms, just transferring them to the corpus may only be sufficient for translation into target languages which are not highly inflected. In order to cover more than the nominative case, word forms have to be added to the corpus until all cells of their inflectional paradigms are filled. Because translation hypotheses (candidate sentences) are based on counts of words in context, for the system to know which word form to choose for a given context, it is necessary to add terms in as many contexts as possible. This is actually done by some MT providers to customize engines for a closed domain.

Whereas terminology integration in the described way is part of the statistical model of the resulting engine, runtime integration of terminology is done by using existing engines together with different termbases. This approach allows for a quick change of target domains to translate in without the need for a complete retraining of the engine. In the Moses system, which in most cases will be the system in mind when SMT is discussed, the runtime approach can be based on the option to read xml-annotated source texts. In order to force term translations based on a termbase, one has to identify inflected word forms corresponding to termbase entries in the source texts first. Preferably this step is backed up by lemmatizer, which derives base forms from inflected forms. Again, it has to be kept in mind, that translating into highly inflected languages, additional morphological processing has to be done.

There are more aspects to domain adaptation in SMT than I discussed here, for example, the combination of training texts with different language models in the training step or the merging of phrase tables of different engines etc. Methods differ from vendor to vendor and depend strongly on the architecture that has proven to be the best for the respective MT service.

Terminology Resources


Terminology doesn’t come from nothing – it grows, lives and changes with time and depending on the domain it applies to. So, if we speak of domain adaption there is always the adaptation of terminology. There is no domain without a specific terminology, because domains are, in fact, terminological collections (concepts linked to specific namings within a domain) embedded in a distinctive writing style. To meet the requirements of constantly growing and changing namings in everyday domains (think of medicine or politics), MT engines have to be able to use resources which contain up-to-date terminology. While some MT providers view the task of resource gathering as the customer’s contribution to the process of engine customization, others provide a pool of terminology the customer can choose from. Others do extensive work to ensure the optimal integration of terminology in the training data by checking for term consistency and completeness in the training data. One step in the direction of real-time terminology integration was made by the TaaS project, which came into being as a collaboration of Tilde, Kilgray, the Cologne University of Applied Sciences, the University of Sheffield and TAUS. The cloud-based service offers a few handy tools, e.g. for term extraction and bilingual term alignment, which can be used by MT services to dynamically integrate inflected word forms for terms that are unknown to the engine in use. In order to do so, TaaS is linked to online terminology repositories like IATE the EuroTermBank and TAUS. At berns language consulting, we spoke with the leading providers of MT services and most of them agree that up-to-date terminology integration by online services will become more and more important in flexible MT translation workflows.

Quality improvements – what can MT users do?


Summing up term integration methods, what can we do to improve terminology usage and translation quality with MT systems? Apart from the technical aspects that are best left to the respective tool vendors, the MT user can do a lot to optimize MT quality. Terminology management is the first building block in this process. Term extraction methods can help to build mass from scratch when there are no databases yet. As soon as a termbase is created, it should be used together with authoring tools and CAT tools to produce high-quality texts with consistent terminology. Existing texts that do not meet the requirements of consistent terminology should be checked for term variances as a basis for normalization, which aims at the usage of one and the same term for one concept in the text. Another way to improve term translations is by constantly feeding back post-edits of MT translations into the MT system, which could happen in a collaborative way by a number of reviewers or by using a CAT tool together with MT Plug-ins. The process of terminology optimization in translation workflows strongly depends on the systems and interfaces in use, so solutions may change from customer to customer. As systems constantly improve – especially in the field of NMT – we’re looking forward to all the changes and opportunities this brings to translation environments and translation itself.


======================



Christian Eisold has a background in computational linguistics and is a consultant at berns language consulting since 2016.


At blc he supports customers with translation processes, which involves terminology management, language quality assurance and MT evaluation.

Monday, May 8, 2017

Artificial Intelligence in the Language Industry: We’re Asking the Wrong Questions

This is an interesting guest post by Gábor Ugray on the potential of AI in the translation business.  We hear something about artificial intelligence almost every day now and are continually told that it will change our lives. AI is indeed helping to solve complex problems that even a year ago were virtually unthinkable. Mostly, these are problems where big data and massive computing can come together to produce new kinds of efficiencies and even production solutions. However, there are dangers and risks too, and it is wise to be aware of some of the basic driving forces that underly these problems. As we have seen with self-driving cars, sometimes things don't quite work as you would expect. These mishaps and unintended results can happen when we barely understand what and how the computer "understands". Machine learning is not perfect learning and much of what is learned through deep neural nets, in particular, is kind of mysterious, to put it nicely. 

We have seen that many in the translation industry have more often misused, or abusively used MT to bully translators to accept lower rates, and accept demeaning work, then used where it actually makes sense. We are just beginning to emerge into a stage where we see the more informed and appropriate use of MT, in the very recent past, however, many translators have already been bloodied.  Is AI the new monster we will use to terrorize the translator, or is it a potential work assistant that actually enhances and improves the translation work process? This will depend on us and what we do, and it is good to see Gábor's perspective on this as he is one of the architects of how this might unfold.

Gábor warns us about some key issues related to AI and points us towards asking the right questions to guide enduring and positive change and deployment. We should understand the following:
  • AI is almost completely dependent on training data and we know data is often suspect.
  • Improperly used, there is a risk of inadvertent or deliberate dehumanization of work as in early PEMT use.
 Neural networks are closed systems. The computer is learning something out of a data set in an intelligent but incomprehensible and obscure way to a human eye and human mind. But Google claims they are able to visualize the produced data as described in the zero-shot translation post  where they say:
Within a single group, we see a sentence with the same meaning but from three different languages. This means the network must be encoding something about the semantics of the sentence rather than simply memorizing phrase-to-phrase translations. We interpret this as a sign of existence of an interlingua in the network. 
Is this artificial intelligence or is this just another over-the-top-claim of "magical" scientific success? If we cannot yet define intelligence for humans, how can we even begin to do so for machines? AI is more than often not much more than optimized data-driven task systems, which can be very impressive, but can we really say this is intelligence? A few are quite wary about this whole AI trend. Here is some discussion on shit really going down, driven by AI  which has gone awry.

So hopefully here is a question that makes sense to Gabor: "What needs to happen to make AI-based technology trustworthy and useful in the "language industry"? 

I do basically believe that technology wisely used can indeed improve the human condition but we are surrounded by examples of how things can go wrong without some forethought and these questions that Gábor points to indeed are worth asking. For those who want to dig deep into the big picture on AI, I recommend this article, though I have some reservations about the second part.

As the BBC said recently: Machines still have a long way to go before they learn like humans do – and that’s a potential danger to privacy, safety, and more.

============

I was honored when Kirti asked me if I would contribute to eMpTy Pages about TMS and intelligent data technologies. I’ve been thinking about this for nearly two months now until I finally realized what’s been holding me back. I find it difficult to attach to most of the ongoing discourse about AI, and that’s because I believe the wrong questions are being asked.

Those questions usually revolve around: What part of life can I disrupt through AI? How can my business benefit from AI? Or, if you prefer the fear angle: Will my company be disrupted out of existence if I don’t jump on the AI train in time? Will my job be made obsolete by thinking machines?

My concern is different. But I won’t tell you until the end of this post.



It’s only as good as your data


I found Kirti’s remark in his recent intro very insightful: “Machine learning” is a fancy way of saying “finding patterns in data.”

That resonates with the way I think about MT, whether it’s the statistical or neural flavor. In simple terms, MT is a tool to extrapolate from an existing corpus to get leverage for new content. If you think about it, that’s what translation memory does, too, but it stops at fuzzy matches, concordance searches, and some amount of sub-segment leverage.

Statistical MT goes far beyond that, but at a higher cost: it needs more data and more computation. Neural MT ups the ante yet again: it needs another order of magnitude more computational power and data. The concept has been around for decades; the “deep learning” aka “neural network” explosion of the past few years has one simple reason. It took until now for both the data and the computational capacity to become available and affordable.


The key point is, AI is machine-supported pattern extraction from large bodies of data, and that data has to come from somewhere. Language data comes in the form of human-authored and human-translated content. No MT system learns a language. They process text to extract patterns that were put in there by humans.

And data, when you meet it out in the wild, is always dirty. It’s inconsistent, in the wrong format, polluted with stuff you don’t want in there. Just think of text from a pair of aligned PDFs, with the page numbers interrupting in all the wrong places, OCR errors, extra line breaks, bona fide typos and the rest.


So, even on this elementary level, your system is only as good as your data, not to mention the quality of the translation itself. And this is not specific to the translation industry: the job reality of every data scientist is 95% gathering, cleaning, pruning, formatting and otherwise massaging data before the work can even begin.

Do we have the scale?

AI, MT and machine learning are often used synonymously with automation, but in reality, they are far from that. As Kirti explained in another intro, in order to get results with MT you need technical expertise, structure, and processes beyond technology per se. All of these involve human effort and skills, and pretty expensive skills too.

So the question is: at what point does an LSP or an enterprise get a positive return on such an investment? How much content must first be produced by humans; what is the cost of training the MT system; what is the benefit per million words (financial, time or otherwise)? How many million words must be processed before you’re in the black?

No matter how I look at it, this is an expensive, high-value service. It doesn’t scale in a human-less way as the software does.

Does the translation industry have the same economy of scale that a global retailer or a global ad broker disguised as a search engine does? Clearly, a number of technology providers, Kilgray among them, are thriving in this market. But I also think it’s delusional to expect the kind of hockey-stick disruption that is the stuff Silicon Valley startup dreams are made of.

Let’s talk about the weather

I have been focusing mostly on MT, but that’s misleading. I do think there are many other ways machine learning will contribute to how we work in the translation industry. Most of these ways are as-yet uncharted, which I think is a consequence of the industry’s market constraints.

I’ll zoom out from our industry now. I checked how many results Google finds if I search for a few similar phrases.

-- AI in weather forecasting: 1.29M
-- AI in language processing: 14.2M
-- AI in police: 69.7M

Of the three, I’d say without hesitation that weather forecasting yields itself best to advanced AI. Huge amounts of data: check. Clear feedback on success: check. Much room for improvement: check. And yet, going by what’s written on the Internets, that’s not what society thinks.


There is a near-universal view that technology is somehow neutral and objective, which I think is blatantly false. Technology is the product of a social and economic context, and it is hugely influenced by society’s shared beliefs, mythologies, fears, and desires.

Choose your mythology wisely

I am on odd one: in addition to AdBlock and Privacy Badger, my browser deletes all history when I close it, which is multiple times a day. At first, I just noticed the cookie messages that kept returning. Then I started getting warning emails every time I logged in to Twitter or Google. Finally, my password manager screwed me completely, requesting renewed email verification every time I launched the browser.

These are all well-meaning security measures, with sophisticated fraud detection algorithms in the back. But they work on the assumption that you leave a rich data trail. It is by cutting that trail that you realize how pervasive the big data net already is around you in your digital life. For a different angle on the same issue, read Quinn Norton’s poignant Love in the Time of Cryptography.
Others have written about the way machine learning perpetuates biases that are encoded, often in subtle ways, in their training datasets. In a world where AI in police outscores the weather and language, that’s a scary prospect.

With all of this I mean to say one thing. Machine learning, data mining, AI – whatever you want to call it, in conjunction with today’s abundance of raw digital data, this technology has the potential to be dehumanizing in an unprecedented way. I’m not talking conveyor-belt slavery or machines-will-take-my-job anxiety. This is more subtle, but also more far-reaching and insidious.

And we, as engineers and innovators, have an outsized influence on how today’s nascent data-driven technologies will impact the world. The choice of mythology is ours.


UX is the lowest-hanging fruit

After this talk about machine learning and big data on a massive scale, let’s head back to planet Earth. To my view of a translator’s workstation, to be quite precise.

Compared to even a few years ago, there is a marvelous wealth of specialized information available online. There are pages of search results just for terminology sites. There are massive online translation memories to search. There are online dictionaries and very active discussion boards.
Without the need to name names, the user experience I get from 99% of these tools is between cringe-worthy and offending. (Linguee being one notable exception to this.)



Here is one reason why I have a hard time enthusing about cutting-edge AI solutions for the language industry. Almost everywhere you look, there are low-hanging fruits in terms of user experience, and you don’t need clusters of 10-kilowatt GPUs to pluck them. I think it’s misguided to go messianic until we get the simple things right.

Two corollaries here. One, I myself am guilty as charged. Kilgray software is no exception. We pride ourselves that our products are way better than the industry average, but they; too, have a ways to go still. Rest assured, we are working on it.

Two, user experience also happens in the context of market constraints. All of the dismal sites I just checked operate on one of two models: ad revenues, or no revenues. I have bad news for you. These models make you the product, not the customer. This is not specific to the translation industry. The world at large has yet to figure out a non-pathological way to monetize online content.



Value in human relationships

I’ve been talking to a lot of folks recently whose job is to make complex translation projects happen on time and in good quality. Now it may be that my sample is skewed, but I saw one clear pattern emerging in these conversations.

I wasn’t told about standardized workflows. I didn’t hear about machine learning to pick the best vendor from a global pool of X hundred thousand translators. I didn’t perceive the wish to shave another few percent off the price by enhanced TM leverage.

The focus, invariably, was human relationships. How do I build a long-term working relationship based on trust with my vendors? How do I do the same with my own clients? How do I formulate the value that I add as a professional, which is not churning through 10% more words per day, but enabling millions of additional revenue from a market that was hitherto not accessible?

Those are not yet the questions I’m asking about AI, but they are closing in on my point. In a narrow sense, I see technology as an enabler: a way to reduce the drudge so humans have more time left for the meaningful stuff that only humans can do.

Fewer clicks to get a project on track? Great. More relevant information at the fingertips of translators and project managers? Awesome. Less time wasted labeling and organizing data, finding the right resources, finding the right person to answer your questions? Absolutely. Finding the right problems to work on, where your effort has the greatest impact? Prima.

AI has its place in that toolset. But let’s not forget to get the basics right, like building software with empathy for the end user.

The right question

Whether or not AI will be part of our lives is not a question. Humans have a very elastic brain, and whatever invention you give us, we will figure out a use for it and even improve on it.

I argued that technology is not a Platonic thing of its own, but the product of a specific social and economic context. I also argued that if you instrumentalize big data and machine learning within the wrong mythology, it has a disturbing potential to dehumanize.

But these are not inescapable forces of nature. The mythology we write for AI is a matter of choice, and the responsibility lies with us, engineers and innovators.

The right question is: 
How do I use AI responsibly? 
Is empathy at the center of my own engineering work?

No touchy-feely idealism here; let’s talk enlightened self-interest.

As a technology provider, I can create products with the potential to dehumanize work and encroach on privacy. That may give me a short-term advantage in a race to the bottom, but it will not lead to a sustainable market for my company. Or I can create products that help my customers differentiate themselves through stronger relationships, less drudge, and added value to their clients. Because I’m convinced that these customers are the ones who will be successful in the long run, I am betting on building technology for them.


That means engaging with customers (then engaging some more) to learn what problems they face every day, instead of worrying about the AI train. If the solution involves AI, great. But more likely it’ll be something almost embarrassingly simple.




 --------------------

Gábor Ugray is co-founder of Kilgray, creators of the memoQ collaborative translation environment and TMS. He is now Kilgray’s Head of Innovation, and when he’s not busy building MVPs, he blogs at jealousmarkup.xyz and tweets as @twilliability.

Thursday, May 4, 2017

"Specializing" Neural Machine Translation in SYSTRAN

We see that Neural MT continues to build momentum and that already most people agree that generic NMT engines outperform generic phrase-based SMT engines. In fact, in recent DFKI research analysis, generic NMT even outperforms many domain-tuned Phrase-based SMT systems. Both Google and Microsoft now have an NMT foundation for many of their most actively used MT languages. However, in the professional use of MT where the MT engines are very carefully tuned and modified for a specific business purpose, PB-SMT is still the preferred model for now. It has taken many years, but today many practitioners understand the SMT technology better, and some even know how to use the various control levers available to tune an MT engine for their needs. Today, most customized PB-SMT systems involve building a series of models in addition to the basic translation memory derived translation model, to address various aspects of the automated translation process. Thus, some may also add a language model to improve target language fluency, and a re-ordering model to handle word reordering issues. Thus a source sentence could pass through several of these sub-models that address different linguistic challenges before delivering a final target language translation. In May 2017, it may be fair to say most production MT systems in the enterprise context are either PB-SMT or RBMT systems.


One of the primary criticisms of NMT today (in the professional translation world) is that NMT is very difficult to tune and adapt for the specific needs of business customers who want greater precision and accuracy in their MT system output. The difficulty, it is generally believed, is based on three things:
  1. The amount of time taken to tune and customize an NMT system,
  2. The mysteriousness of the hidden layers that make it difficult to determine what might fix specific kinds of error patterns,
  3. The sheer amount/cost of computing resources needed in addition to the actual wall time.

However, as I have described before, SYSTRAN is perhaps uniquely qualified to address this issue as they are the only MT technology developer that has deep expertise in all of the following approaches: RBMT, PB-SMT, Hybrid RBMT+SMT and now Neural MT. Deep expertise means they have developed production-ready systems using all these different methodologies. Thus, the team there may see clues that regular SMT and RBMT guys either don't see or don't understand. I recently sat down with their product management team to dig deeper into how they will enable customers to tune and adapt their Pure Neural MT systems, and they provided a clear outline of several ways by which they already address this problem today, and described several more sophisticated approaches that are in late stage development and expected to be released later this year.

The critical requirement from the business use context for NMT customization is the ability to tune a system quickly and cost-effectively, even when large volumes of in-domain data are not available.



What is meant by Customization and Specialization?


It is useful to use the correct terminology when we discuss this, as confusion rarely leads to understanding and effective response or action. I use the term customization (in many places across various posts on this blog) as a broad term to point to the tuning process that is used to adjust an MT system for a specific and generally limited or focused business purpose, to what I think they call a "local optimum" in math. Remember in any machine learning process it is all about the data. I use the term customization to mean the general MT system optimization process which could be any of the following scenarios: Your Data + My Data, All My Data, Part of Your Data + My Data, and really, there is a very heavy SMT bias in my own use of the term customization.

SYSTRAN product management prefers to use the term "specialization" to mean any process which may perform different functions but which all consist of adapting an NMT model to a specific translation context (domain, jargon, style). From their perspective, the term, specialization better reflects the way NMT models are trained than customization because NMT models acquire new knowledge rather than being dedicated to production use in only one field. This may be a fine distinction to some, but it is significant to somebody who has been experimenting with thousands of NMT models over an extended period. The training process and its impact for NMT are quite different, and we should understand this as we go forward.

The Means of NMT Specialization - User Dictionaries

A user dictionary (UD) is a set of source language expressions associated with specific target language expressions, optionally with Part Of Speech (POS) information. This is sometimes called a glossary in other systems. User dictionaries are used to ensure a specific translation is used on some specific parts of a source sentence within a given MT system. This feature allows end-users to integrate their own preferred terminology for critical terms. In SYSTRAN products, UDs have been historically available and used extensively for their Rule-Based Systems. User Dictionary capabilities are also implemented in SYSTRAN SMT and Hybrid (SPE - Statistical Post-Editing) systems.

SYSTRAN provides four levels of User Dictionary capabilities in their products as described below:
  • Level 1: Basic Pattern Matching: The translation source input is simply matched against the exact source dictionary entries and the preferred translation target string is output.
  • Level 2: Level 1 + Homograph Disambiguation: Sometimes, a source word may have different part-of-speech (POS) associations. A Level 2 UD entry is selected according to its POS in the translation input, and then the right translation is provided. An example:
    • “lead” as a noun: Gold is heavier than lead > L'or est plus lourd que le plomb
    • “lead” as a verb: He will lead the meeting > Il dirigera la reunion
  • Level 3: Morphology: A single base form from an UD entry may match different inflected forms of the same term, thus the translation is produced by the proper inflection of the target form of the UD entry according to matched POS and grammatical categories.
  • Level 4: Dependence: This allows a user to define rules to translate a matched entry depending on other conditionally matched entries.
As of April 2017, only the Level 1 User Dictionary capabilities are available for the PNMT systems, but the other options are expected to be released during this year.  We can see below, that even the Level 1 basic UD tuning functionality, has an impact on the output quality of an NMT system.




The Means of NMT Specialization - Domain Specialization & Hyper-Specialization


The ability to do “domain specialization” of an MT system is always subject to the availability of in-domain training corpora. But often, sufficient volumes of the right in-domain corpora are not available. In this situation, apparently, NMT has a big advantage over SMT. Indeed, it is possible with NMT systems to take advantage of the mix of both generic data and in-domain data. This is done by taking an already trained generic NMT engine and “overtrain” it on the sparse in-domain data that is available. (To some extent this is possible in SMT as well, but generally, much more data is required to overpower the existent statistical pattern dominance.)

According to the team: "In this way, it is possible to obtain a system that has already learned the general (and generic) structure of the language, and which is also tuned to produce content corresponding to the specific domain. This operation is possible because we can iterate infinitely on the training process of our neural networks. That means that specializing a generic neural system is just continuing a usual training process but on a different kind of training data that are representative of the domain. This training iteration is far shorter in time than a whole training process because it is made on a smaller amount of data. It typically takes only a few hours to specialize a generic model, where it may have taken several weeks to train the generic model itself at the first time."


They also pointed out how this "over-training" process is especially good for narrow domains and what they call "hyper-specialization".  "The quality produced by the specialization really depends on the relevance of the in-domain data. When we have to translate content from a very restrictive domain, such as printer manuals for example, we speak of “hyper-specialization” because a very small amount of data is needed to well represent the domain and the specialized system is rather only good for this restrictive kind of content – but really good for it."

For those who want to understand how data volumes and training iterations might impact the quality produced by the specialization, SYSTRAN provided this research paper that describes some specific experiments in some detail and shows that using more in-domain training data produces better results.



The Means of NMT Specialization - Domain Control

This was an interesting feature and use-case that they presented to me to handle special kinds of linguistic problems. Their existing tuning capabilities are positioned to be able to be extended to handle automated domain identification and special style and politeness-level issues in some languages. Domain Control is an approach that allows a single training and model to be developed that can be used to handle multiple domains, with some examples given below.

There are languages where the translation of the same source may be different given the politeness context or perhaps a different style is required.  The politeness level is especially important for the Korean and Japanese (and Arabic too) languages.This feature is one of many control levers that SYSTRAN provides customers who want to tune their systems to very specific use-case requirements.

"To address this, we train our model adding metadata about the politeness context of each training example. Then, at run-time, the translation can be “driven” to produce either formal or informal translation based on the context. Currently, this context is manually set by the user, but in the near future, we could imagine including a politeness classifier to automatically set this information."

They pointed out that this is also interesting because the same concept can be applied to domain adaptation. "Indeed, we took corpora labeled with different domains (Legal, IT, News, Travel, …) and feed it to a neural network training. This way, the model not only benefits from a large volume of training examples coming from several different domains, but it also learns to adapt the translation it generates to the given context.

Again, as of today, we only use this kind of model with the user manually setting the domain he/she wants to translate, but we’re already working on integrating automatic domain detection to give this information to the model at run-time. Thus a single model could handle multiple domains."

As far as domain control is concerned, training data tagging is a step taken during the training corpus preparation process. These tags/metadata are basically annotations. They denote many kinds of information such as politeness, style, domain, category, anything that might be suitable to drive the translation in a particular direction or another. Thus, this capability could also be used to handle the multiple domains and content types that you may find on an eCommerce site like Amazon or eBay.

The annotations may be produced either by humans/annotators or via automated systems using NLP processes. However, as this is done during training preparation, this is not something that SYSTRAN clients will typically do. Then, at run-time, the model needs the same kind annotations for the source content to be translated, and again, this information may be provided either by a user action (e.g: the user selects a domain) or by an automated system (automatic domain detection).
  There is some further technical description on Domain Control in this paper.

The following graphic puts the available adaptation approaches in context and shows how these different approaches vary in terms of effort/time investment and what impact the different strategies may have on the output quality. Using multiple approaches at the same time also adds further power and control to the overall adaptation scenario. The methods of tuning shown here are not incremental or sequential. They can all be used together as needed and as is possible. For those who have been following the discussion with Lilt on their "MT evaluation" study, can now perhaps understand why an instant BLEU snapshot based evaluation is somewhat pointless and meaningless. Judging and ranking MT systems comparatively on BLEU scores like Lilt has done, without using tuning tools properly, is misleading and even quite rude. Clients who can adapt their MT systems to their specific needs and requirements will always do so, and will often use all the controls at their disposal.   SYSTRAN makes several controls available in addition to the full (all my data + all your data training) customization described earlier.  I hope that we will soon start hearing from clients and partners who have learned to operate and use these means of control and are willing to share their experience so that our MT initiatives continue to gain momentum.