{"id":363,"date":"2018-01-13T02:21:50","date_gmt":"2018-01-13T01:21:50","guid":{"rendered":"https:\/\/rosetta.vn\/translate\/evaluation-of-machine-translation-quality\/"},"modified":"2018-01-13T02:21:50","modified_gmt":"2018-01-13T01:21:50","slug":"evaluation-of-machine-translation-quality","status":"publish","type":"post","link":"https:\/\/rosetta.vn\/translate\/evaluation-of-machine-translation-quality\/","title":{"rendered":"Evaluation of machine translation quality"},"content":{"rendered":"<blockquote><p><b>Practical MT Evaluation For Translators<\/b><br \/>\nSubjective judgements reveal nothing about a translator\u2019s experience<br \/>\nTom Hoar, August 30, 2017<br \/>\nThank you, Isabella Massardo, for your article, Who Is A Translator\u2019s New Best Friend?, as part of our Blogging Translator Review Program. When she invited me to write a guest post, I decided to clarify Slate\u2019s evaluation scores and demonstrate how they help you compare your engine\u2019s performance to another engine, like Google\u2019s. This article is a re-post of my guest post on her blog: <a href=\"http:\/\/massardo.com\/blog\/mt-evaluation\/\">http:\/\/massardo.com\/blog\/mt-evaluation\/<\/a><\/p>\n<p>This re-post also includes two appendices that made the article too long for Isabella\u2019s blog. The first is a glossary of our score terms. The second shows twelve example source-target segments plus the output translations from Isabella\u2019s engine and Google.<\/p>\n<p>The Need For MT Evaluation<br \/>\nSubjective observations of machine translation (MT) linguistic quality are simple and easy for 35-40 words in a few example segments, but they reveal nothing about long-term translation quality or the translator\u2019s experience across several projects of 10,000 words each.<\/p>\n<p>A truly objective, accurate and automated evaluation of MT linguistic quality is beyond today\u2019s state of the art. 
In fact, this deficit is what leads to the poor quality of MT output in the first place. This doesn\u2019t mean MT is useless; translators use MT every day.<\/p>\n<p>What are MT evaluations good for if they can\u2019t accurately report a translation\u2019s quality?<\/p>\n<p>Slate\u2019s evaluation scores do not tell you about the quality of an engine\u2019s translations. Instead, Slate focuses on describing engine criteria that can be measured objectively. Here, I generically refer to these criteria as an engine\u2019s \u201clinguistic performance.\u201d The scores indicate how an engine might reduce or increase a translator\u2019s workload compared to another engine. With objective evaluation scores, you can better predict how an engine might affect your work efficiency in the long term.<\/p>\n<p>So, let\u2019s look at the best practices of MT evaluation. Then, I\u2019ll review Isabella\u2019s engine scores with a focus on how they relate to her client\u2019s work. Finally, I\u2019ll compare Google\u2019s output from the same evaluation segments with Isabella\u2019s engine results.<\/p>\n<p>Evaluation Best Practices<br \/>\nCurrent MT evaluation best practices require an evaluation set with 2,000-3,000 source-target segment pairs. The source segments represent the variety of work that the translator is likely to encounter. The target segments represent the desired reference translations.<\/p>\n<p>The evaluation process uses the MT engine you\u2019re evaluating to create \u201ctest\u201d segments from the evaluation set\u2019s source segments. It then measures each \u201ctest\u201d segment against its respective \u201creference\u201d and assigns a \u201ccloseness\u201d score. These are like fuzzy match scores, but for target-to-test segments, not source-to-TM segments. The process aggregates the individual scores, for example as an average, to describe how the engine performed with that evaluation set. 
A performance description for one engine has some value, but it\u2019s much more valuable to compare descriptions from different engines on the same evaluation set, which tells us which engine performs better.<\/p>\n<p>Measuring Isabella\u2019s Engine<br \/>\nIsabella reported she started with three .tmx files and 250,768 segment pairs from the same client, collected since 2003. Her Engine Summary (image below) shows Slate built Isabella\u2019s engine from 119,053 segments after it removed 131,715 segment pairs (53%) for technical reasons. You can learn more about translation memory preparation on our support site.<\/p>\n<p>[Engine Summary image]<\/p>\n<p>Slate randomly removed and set aside 2,353 segment pairs that represent Isabella\u2019s 14 years of work as the evaluation set, leaving 116,700 pairs to create the engine\u2019s statistical models. During the evaluation process, the source segments are like a new project from the engine\u2019s viewpoint. That is, the engine is not recalling segments that were used to build it. This evaluation strategy gives 95% confidence that the engine will perform similarly when Isabella gets a new project from this client.<\/p>\n<p>Isabella\u2019s Engine vs Google<br \/>\nBefore I could compare the performance of Isabella\u2019s engine to Google\u2019s, Isabella graciously granted me permission to translate her evaluation set\u2019s 2,353 source segments using Google Translate. 
Here are Google\u2019s evaluation scores side-by-side with Isabella\u2019s.<\/p>\n<p>Evaluation Set<br \/>\nSegment count: 2,353<br \/>\nAverage segment length (words per segment): 16.5<\/p>\n<table>\n<tr><th>Evaluation Scores<\/th><th>Google Translate (en-it)<\/th><th>Isabella\u2019s (en-it-ns_test)<\/th><\/tr>\n<tr><td>Date<\/td><td>2017-08-11<\/td><td>2017-07-29<\/td><\/tr>\n<tr><td>Evaluation BLEU score (all)<\/td><td>33.07<\/td><td>69.33<\/td><\/tr>\n<tr><td>Evaluation BLEU score (1.0 filtered)<\/td><td>32.47<\/td><td>61.82<\/td><\/tr>\n<tr><td>Quality quotient<\/td><td>4.33%<\/td><td>29.75%<\/td><\/tr>\n<tr><td>Edit Distance per line (non-zero)<\/td><td>42<\/td><td>32<\/td><\/tr>\n<tr><td>Exact matches count<\/td><td>102<\/td><td>700<\/td><\/tr>\n<tr><td>Edit Distance entire project<\/td><td>93,605<\/td><td>52,856<\/td><\/tr>\n<tr><td>Average segment length (exact matches)<\/td><td>4.7<\/td><td>11.4<\/td><\/tr>\n<\/table>\n<p>This evaluation table includes a variety of scores, but these are the three that I rely on the most: the Average segment length, the Quality quotient, and the Evaluation BLEU score (1.0 filtered).<\/p>\n<p>The average segment length of source segments in the evaluation set tells us if Isabella\u2019s translation memories are heavily weighted with terms, such as from a termbase. Isabella\u2019s 16.5 average above is normal, and the translation memories likely include a good balance of short and long segments. If the average were very small (for example, 4 or 5 words), the engine would work poorly with long sentences.<\/p>\n<p>The quality quotient (QQ) score means it\u2019s likely that Isabella will simply review up to 30% of segments as exact matches when she works with her engine on her client\u2019s future projects. Exact matches with this engine are 7 times more likely than if she did the same work with Google.<\/p>\n<p>The evaluation BLEU score (filtered) represents the amount of typing and\/or dictation work Isabella will need to do when her engine fails to suggest an exact match. Her engine\u2019s score of 61.8 indicates her engine\u2019s segments are likely to require less work than segments from Google with a score of 32.5. 
It\u2019s important to note that this evaluation set\u2019s Google BLEU score is in line with Google\u2019s published scores on other evaluation sets.<\/p>\n<p>Putting It All Together<br \/>\nIsabella described her translation memories as client-specific, containing mostly her own translations, some from a trusted colleague, and some from unknown colleagues. She said, \u201cAll in all, a great mess\u201d because they contain some terminological discrepancies, long convoluted segments, and one-word segments. She created her engine on her 4-year-old laptop computer in less than a day without any specialized training.<\/p>\n<p>Isabella\u2019s evaluation set is a representative subset of the translation memories that Slate used to build the engine. The evaluation set\u2019s scores show that her engine significantly outperforms Google Translate in every measured category. Furthermore, because Slate drew the evaluation set from translation memories that are primarily her own client-specific work, she has a 95% likelihood of experiencing similar performance with future work from that client.<\/p>\n<p>When Isabella works on projects with Slate, her engine is likely to give her 7 of every 10 segments that require changes (the converse of the QQ). Like many users, she might find these suggestions overwhelming because she\u2019s accustomed to her CAT tool hiding the suggestions from poor fuzzy matches. Still, 70% represents much less work than the 96% she would likely receive from Google. With a little practice, it\u2019s easy and fast to trash segments that require radical changes and start from scratch.<\/p>\n<p>There\u2019s no way to predict how her engine will perform with work from other clients or other subject matter. The nature of statistical machine translation technology tells us that performance will degrade as a project\u2019s linguistic contents diverge from her engine corpus\u2019 contents. 
The performance of Isabella\u2019s engine could drop significantly for projects with disparate linguistic content. Fortunately, Isabella controls her engine, and Slate gives her some tools to clean up the \u201cgreat mess,\u201d for example, forced terminology files to resolve the terminological discrepancies.<\/p>\n<p>This was her first engine and she can experiment to her heart\u2019s content. She can create as many engines as she likes. She can mix various translation memories and compare their performance, much like I compared her engine to Google in this article. Furthermore, she can experiment without any additional cost. If she has translation memories for five clients, she can create one engine for each of them or one that combines all. I look forward to hearing about her experiments.<\/p>\n<p>When using Google Translate, Isabella needs to wait for Google to update and improve its engine. For example, her Google results reflect Google\u2019s recent update of its en-it engine to neural machine translation (NMT). To Google\u2019s credit, it handles variations across different subjects better than Isabella\u2019s engine likely will. As Isabella pointed out, Google \u201chas been constantly improving since inception.\u201d So, across many different subjects, Google will continue to deliver 4% to 5% exact matches.<\/p><\/blockquote>\n<p>Source:&nbsp;<a href=\"https:\/\/slate.rocks\/practical-mt-evaluation-for-translators\/\" style=\"font-size: 16px;\">https:\/\/slate.rocks\/practical-mt-evaluation-for-translators\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Practical MT Evaluation For Translators Subjective judgements reveal nothing about a translator\u2019s experience Tom Hoar, August 30, 2017. Thank you, Isabella Massardo, for your article, Who Is A Translator\u2019s New Best Friend?, as part of our Blogging Translator Review Program. 
When she invited me to write a guest post, I decided to clarify&hellip;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"jetpack_post_was_ever_published":false,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[1],"tags":[4,23,24],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p8jAij-5R","_links":{"self":[{"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/posts\/363"}],"collection":[{"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/comments?post=363"}],"version-history":[{"count":0,"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/posts\/363\/revisions"}],"wp:attachment":[{"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/media?parent=363"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/categories?post=363"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/tags?post=363"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}