Saturday, February 27, 2010

Rule-based MT vs. Statistical MT: Does it Matter?

One of the current debates in the MT community is RbMT vs. SMT. While I do have a clear bias that favors SMT, I have tried to be fair and have written many times on this subject. I agree that it cannot be said that one approach is definitely ALWAYS better than the other. There are many successful uses of both. In fact, at this point in time there may be more examples of RbMT successes since it has been around longer.

However, there is clear evidence that SMT continues to gain momentum and is increasingly the preferred approach. RbMT has been around for 50 years, and the engines we see today are in many cases the result of decades of investment and research. SMT, by contrast, has been commercially available for barely five years: Kevin Knight only began his research at USC around 2000, and commercial systems are just now reaching the market.
The people best suited to answer the question of which approach is better are those who have explored both the RbMT and SMT paradigms deeply, to solve the same problem. Unfortunately, there are very few such people around. The only teams I know for certain have this knowledge are Google Translate and Microsoft Live Translate, and both have voted in favor of SMT.
Today, RbMT still makes sense when you have very little data, or when you already have a good foundation rules engine in place that has been tested and is a solid starting point for customization. Some say RbMT systems also perform better on language pairs with very large structural and morphological differences: combinations like English <> Japanese, Russian, Hungarian and Korean still often seem to do better with RbMT. It is also claimed by some that RbMT systems are more stable and reliable than SMT systems. I think this is probably true of systems built from web-scraped or dirty data, but the story with clean data is quite different. SMT systems built with clean data are stable, reliable and much more responsive to small amounts of corrective feedback.

What most people still overlook is that the free online engines are not a good representation of the best output possible with MT today. The best systems come after focused customization efforts, and the best examples for both RbMT and SMT are carefully customized in domain systems that are built for very specific enterprise needs rather than for general web user translation.

It has also become very fashionable to use the word “hybrid” of late. For many this means using both RbMT and SMT at the same time. However, this is more easily said than done. From my viewpoint, characterizing the new Systran system as a hybrid engine is misleading. It is an RbMT engine that applies a statistical post-process to the RbMT output to improve fluency. Fluency has always been a problem for RbMT, and this post-process is an attempt to improve the quality of the raw RbMT output. Thus it is not a true hybrid from my point of view. In the same way, linguistic knowledge is being added to SMT engines in different ways to handle issues like word order and dramatically different morphology, which have been a problem for purely data-based SMT approaches. I think most of us agree that statistics, data and linguistics (rules and concepts) are all necessary to get better results, but there are no true hybrids out there today.
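To make the idea of a statistical fluency post-process concrete, here is a minimal toy sketch (my own illustration, not Systran's actual method): a tiny bigram language model scores alternative renderings of an MT output and keeps the most fluent one. The corpus and candidates are invented for the example.

```python
import math
from collections import Counter

# Toy bigram language model trained on a tiny invented English corpus.
# Real systems use models built from billions of words; the principle
# of "prefer the statistically more fluent string" is the same.
corpus = "the house is red . the car is blue . the house is big .".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def fluency(sentence):
    """Log-probability of a token sequence under the bigram model (add-one smoothed)."""
    tokens = sentence.split()
    vocab = len(unigrams)
    score = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        score += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
    return score

# Pretend these are alternative renderings of one RbMT output;
# the statistical post-process keeps the most fluent candidate.
candidates = ["the house is red .", "house the red is ."]
best = max(candidates, key=fluency)
```

With this model, `best` comes out as the grammatical candidate, because its bigrams were actually observed in the training text.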
Table: RbMT vs. SMT comparison
I would also like to present my case for the emerging dominance of SMT with some data that, I think, we can mostly agree is factual and not just a matter of my opinion.
Fact 1: Google used the Systran RbMT system as its translation engine for many years before switching to SMT. The Google engines are general-purpose baseline systems (i.e. not domain focused). Most people will agree that Google compares favorably with Babelfish, which is an RbMT engine. I am told they switched because they saw a better-quality future and continuing evolution with SMT, which CONTINUES TO IMPROVE as more data becomes available and corrective feedback is provided. Most people agree that the Google engines have continued to improve since the switch to SMT.
Fact 2: Most of the widely used RbMT systems have been developed over many years (decades in some cases), while none of the SMT systems is much over 5 years old; they are still in their infancy in 2010.
Fact 3: Microsoft switched from a Systran RbMT engine to an SMT approach for all their public translation engines in the MSN Live portal as well. I presume for similar reasons as Google. They also use a largely SMT based approach to translate millions of words in their knowledge bases into 9 languages which is perhaps the most widely used corporate MT application in the world today. The Microsoft quality also continues to improve.
Fact 4: Worldlingo switched from an RbMT foundation to SMT to get broader language coverage and to attempt to reverse a loss of traffic (mostly to Google).
Fact 5: SMT providers have easily outstripped RbMT providers in terms of language coverage, and we are only at the beginning of this trend. Google supported 25 languages while they were RbMT based, but now cover over 45 languages, each of which can be translated into any of the others, for apparently over 1,000 language combinations with their SMT engines.
Fact 6: The Moses Open Source SMT training system has been downloaded over 4,000 times in the last year. TAUS considers it “the most accessed MT system in the world today.” Many new initiatives are coming forth from this exploration of SMT by the open source community and we have not yet really seen the impact of this in the marketplace.

Google and Microsoft have placed their bets. Even IBM, which still has a legacy RbMT offering, has linked its Arabic and Chinese speech systems to an SMT engine of its own development. So now we have three of the largest IT companies in the world focused on SMT-based approaches.

However, this is perhaps just relevant for the public online free engines. Many of us know that customized, in-domain focused systems are different and for enterprise use, the kind of system that matters most. How easy is it to customize an SMT vs RbMT engine?
Fact 7: Callison-Burch, Koehn et al. have published a paper (funded by Euromatrix) comparing systems for six European language pairs, both as baselines and after domain tuning (TM data for SMT, dictionaries for RbMT). They found that Czech, French, Spanish and German to English all produced better domain results with SMT. Only English>German produced better domain-focused results with RbMT. However, they did find that RbMT often had better baselines than their SMT systems, since academic researchers do not have the data resources of Google or Microsoft, whose baseline systems are much better.
Fact 8: Asia Online has been involved with patent domain focused systems in Chinese and Japanese. We have produced higher quality translations than RbMT systems which have been carefully developed with almost a decade of dictionary and rules tuning. The SMT systems were built over 3-6 months and will continue to improve. It should be noted that in both cases Asia Online is using linguistic rules in addition to raw data-based SMT engine development.
Fact 9: The intellectual investment from the computational linguistics and NLP community is heavily biased towards SMT, perhaps by as much as a factor of ten. This can be verified by looking at the focus of the major MT conferences in the recent past and in 2010. I suspect this will mean continued progress in the quality of SMT-based approaches.

Some of my personal bias and general opinion on this issue:
-- If you have a lot of bilingual matching phrase pairs (100K+) you should try SMT, and in most cases you will get better results than with RbMT, especially if you spend some time providing corrective feedback in an environment like Asia Online. I think man-machine collaboration is much more easily engineered in SMT frameworks. Corrective feedback can be immediately useful and can lift engine quality very quickly.
-- SMT systems will continue to improve as long as you have clean data foundations, continue to provide corrective feedback, and retrain the systems periodically after “teaching” them what they are getting wrong.
-- SMT will win the English to German quality game in the next 3 years or sooner.
-- SMT will become the preferred approach for most of the new high-value markets like Brazilian Portuguese, Chinese, Indic languages, Indonesian, Thai, Malay and the major African markets.
-- SMT will continue to improve significantly in future because: Open Source + Academic Research + Growing Data on Web + Crowdsourcing Feedback are all at play with this technology
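The corrective-feedback point above can be sketched with a toy example (hypothetical data; real SMT training, e.g. with Moses, extracts phrase pairs from word-aligned bitext, but the retraining effect is similar in spirit): adding confirmed phrase pairs to the training material immediately shifts the translation probabilities the engine uses.

```python
from collections import Counter

# Toy phrase table estimated by relative frequency over aligned phrase pairs.
# The counts are invented purely for illustration.
pairs = Counter({("maison", "house"): 3, ("maison", "home"): 1})

def p(src, tgt):
    """Relative-frequency translation probability p(tgt | src)."""
    total = sum(c for (s, _), c in pairs.items() if s == src)
    return pairs[(src, tgt)] / total

before = p("maison", "home")      # 1/4 before feedback
# Corrective feedback: post-editors confirm "home" is right in this domain,
# so the corrected pairs are added to the training data and counts rise.
pairs[("maison", "home")] += 4
after = p("maison", "home")       # 5/8 after retraining
```

The engine's preference shifts toward the corrected translation as soon as the counts change, which is the sense in which SMT is responsive to small amounts of clean corrective data.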

SMT systems will improve as more data becomes available, as bad data is removed, and as the pre- and post-processing technologies around these systems improve. I also suspect that future systems will be some variation of SMT + linguistics (which includes rules) rather than data-only approaches. I also see that humans will be essential to driving the technology forward, and that some in the professional translation industry will be at the helm, as they do in fact understand how to manage large-scale translation projects better than most.

I have also covered this in some detail in a white paper that can be found in the L10NCafe or on my LinkedIn profile, and there is much discussion of this subject in the Automated Language Translation group on LinkedIn, where you can also read the views of others with differing opinions. I recommend the entries from Jordi Carrera in particular, as he is an eloquent and articulate voice for RbMT technology. One of the best MT systems I know of is the RbMT system at PAHO, which has source analysis and cleanup plus integrated, largely automated post-editing. The overall process flow is what makes it great, not the fact that it is based on RbMT.
So does it matter what approach you use? If you have a satisfactory, working RbMT engine, then there is probably no reason to change. I would suggest that SMT makes more sense for most long-term initiatives where you want to see the system continually improve. Remember, in the end the real objective is to get high volumes of content translated faster and as accurately as possible, and both approaches can work with the right expertise, even though I do prefer SMT and believe it will dominate in future.

1 comment:

  1. This debate is apparently a hot current issue, as the latest Multilingual magazine covers it in some detail with an article from Lori Thicke (who is now on the board of the magazine).

    This is a quote from the article:
    Says Yanishevsky, “clearly, hybridization will be the development of the future for both SMT and RBMT engines. However, fundamentally, we believe that it is faster and more efficient to hybridize with rule-based underpinnings than with SMT underpinnings since it is easier to graft statistics onto rules rather than vice versa. The hybridization of our MT engine will linguistically smooth an already robust and quality output.”

    I think that the enterprise decision will be made on how much control each approach gives to skilled and unskilled users, and on the ability to show ongoing improvements. I believe the evidence already suggests that SMT will win even for Russian and German in time. I have already seen it done for Japanese.

    I will perhaps provide some counter opinions in a future blog entry but it is good to see this growing debate.

    Perhaps we are indeed heading into an MT "perfect storm".