{"id":587,"date":"2019-05-09T19:14:30","date_gmt":"2019-05-09T18:14:30","guid":{"rendered":"https:\/\/rosetta.vn\/translate\/?p=587"},"modified":"2019-05-09T19:14:30","modified_gmt":"2019-05-09T18:14:30","slug":"chuyen-du-lieu-dich-cu-thanh-bo-nho-dich","status":"publish","type":"post","link":"https:\/\/rosetta.vn\/translate\/chuyen-du-lieu-dich-cu-thanh-bo-nho-dich\/","title":{"rendered":"Converting old translation data into a translation memory"},"content":{"rendered":"<p>Translators, teams, and translation companies that have translated a lot of material without <a href=\"https:\/\/rosetta.vn\/translate\/tools-for-translation\/cat\/\">CAT<\/a> software and later switch to CAT need to convert their old translations into <a href=\"https:\/\/rosetta.vn\/translate\/tools-for-translation\/cat\/translation-memory\/\">translation memory (TM)<\/a> data, so that they can be reused in new translation projects.<\/p>\n<p><strong>Method: use software that can <a href=\"https:\/\/rosetta.vn\/translate\/tools-for-translation\/cat\/align\/\">align<\/a> sentences automatically.<\/strong><\/p>\n<p>Some lessons from my experience using the free tool LF Aligner to convert old translations into a translation memory:<\/p>\n<p>+ Split the data to be aligned into chunks that are as small as possible. This reduces the error that accumulates during alignment (the further into a file, the more the alignment can drift). 
For example, if you have the translation of a whole book, split it into chapters and run the alignment software on each chapter. We aligned books from the Nh\u1ea5t Ngh\u1ec7 Tinh book series, whose translations match the originals page by page, so in that case we cut the PDF files into individual pages. To save effort, I also had LF Aligner run automatically over a whole folder (this takes a little code) instead of running it file by file.<\/p>\n<p>+ LF Aligner can take PDF files as input for alignment. However, it is better to add a step that converts PDF to DOCX rather than using the PDF directly, to avoid errors caused by two-column layouts or by text inside images and tables. This is because LF Aligner uses the pdftotext tool to extract text from PDF files, and that extraction does not distinguish multi-column text or detect where images are. 
As a result, wherever there is an image, the text inside it gets mixed into the surrounding paragraph text, and alignment quality drops.<\/p>\n<p>+ Once you have DOCX, LF Aligner can run on the DOCX files directly. However, for more control, and if you can program, extract the sentences from the DOCX yourself and put them in a TXT file. I use Python with the python-docx library (imported as docx) to extract text from DOCX files. During extraction I drop text fragments that are too short (for example, anything shorter than 20 characters, since it is probably not a sentence) and discard the text in images and tables. In fact, python-docx leaves that text out on its own when you simply read paragraph.text; if you do want the text in images and tables, you need to call paragraph._element.xpath() to walk the XML elements named &#8216;w:t&#8217; in the DOCX file.<\/p>\n<p>+ For PDF files whose text cannot be copied out (for example, PDFs made from scanned or photographed documents): you can use OCR software to recognize the text and convert it to DOCX or XML (Google Cloud Vision and ABBYY Cloud OCR can return XML files), and then 
extract the text from there. I wrote a Python program that processes the OCR output to extract the text and align it with LF Aligner; the source code is at <a href=\"https:\/\/gitlab.com\/dichthuat\/tools\/ocr\">https:\/\/gitlab.com\/dichthuat\/tools\/ocr<\/a>.<\/p>\n<p>+ After the software has done the alignment, someone should review the result to remove the &#8220;junk&#8221; from it. A book of about 500 pages, at roughly 100 sentences per page, can yield about 50 thousand sentences for a translation memory (TM). However, the chance of running into one of those sentences again in a later translation is very small, perhaps one in a million (unless you are translating a new edition of the same book, in which case perhaps 80% of it repeats). So it is not worth putting much effort into editing the wording in the TM. When doing this &#8220;post-check&#8221; of the TM, just skim each sentence pair and freely delete any pair that does not correspond.<\/p>\n<p>+ LF Aligner exports the alignment data in XLS and TMX formats. 
For manual editing, use the XLS file and open it in software such as LibreOffice \/ OpenOffice or MS Excel; these programs are widespread and many people are fluent with them. Then use <a href=\"https:\/\/github.com\/heartsome\/tmxeditor8\">Heartsome TMX Editor 8<\/a> (free; installers are available for download from <a href=\"https:\/\/onedrive.live.com\/redir?resid=A132898840765207%21108&amp;authkey=%21ADuuiyhuvMU30nU&amp;ithint=folder%2c.zip\">Microsoft OneDrive<\/a> and <a href=\"https:\/\/www.dropbox.com\/sh\/15tz6sdr1ibp6s7\/AADNAUCxKleoM1IZVqGbrUOga\">Dropbox<\/a>) to convert the XLS file to TMX, a common format for translation memories.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Translators, teams, and translation companies that have translated a lot of material without CAT software and later switch to CAT need to convert their old translations into translation memory (TM) data, so that they can be 
reused&hellip;<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_mi_skip_tracking":false,"jetpack_post_was_ever_published":false,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true},"categories":[13],"tags":[52,60,59,61],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p8jAij-9t","_links":{"self":[{"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/posts\/587"}],"collection":[{"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/comments?post=587"}],"version-history":[{"count":1,"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/posts\/587\/revisions"}],"predecessor-version":[{"id":588,"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/posts\/587\/revisions\/588"}],"wp:attachment":[{"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/media?parent=587"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/categories?post=587"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rosetta.vn\/translate\/wp-json\/wp\/v2\/tags?post=587"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}