Thursday, December 13, 2018

Vector sentences representation for data selection in statistical machine translation

Publication date: Available online 13 December 2018

Source: Computer Speech & Language

Author(s): Mara Chinea-Rios, Germán Sanchis-Trilles, Francisco Casacuberta

Abstract

One of the most popular approaches to machine translation formulates the problem as one of pattern recognition. Under this perspective, bilingual corpora are precious resources, as they allow for a proper estimation of the underlying models. In this framework, selecting the best possible corpus is critical, and data selection aims to find the subset of bilingual sentences from an available pool that most improves the final translation quality. In this paper, we present a new data selection technique that leverages a continuous vector-space representation of sentences. Experimental results show improvements over both a system trained only on in-domain data and a system trained on all the available data. Finally, we compared our proposal with other state-of-the-art data selection techniques (cross-entropy selection and infrequent n-grams recovery) in two different scenarios, obtaining very promising results: our data selection strategy yields results that are at least as good as the best-performing strategy for each scenario. The empirical results are consistent across different language pairs.
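The abstract does not spell out how the sentence vectors are built or compared, so the following is only a minimal sketch of the general idea of vector-space data selection, not the authors' method. It assumes a normalised bag-of-words vector as a stand-in for a learned continuous representation, cosine similarity to the in-domain centroid as the selection criterion, and an arbitrary threshold; the function names (`sentence_vector`, `select_sentences`) and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def sentence_vector(sentence, vocab):
    """L2-normalised bag-of-words vector (assumed stand-in for a learned embedding)."""
    counts = Counter(sentence.lower().split())
    vec = [float(counts[word]) for word in vocab]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_sentences(in_domain, pool, threshold):
    """Keep the (source, target) pairs whose source side is close to the in-domain centroid."""
    sources = [src for src, _ in pool]
    vocab = sorted({w for s in in_domain + sources for w in s.lower().split()})
    center = centroid([sentence_vector(s, vocab) for s in in_domain])
    return [
        (src, tgt)
        for src, tgt in pool
        if cosine(sentence_vector(src, vocab), center) >= threshold
    ]


# Toy data: two in-domain source sentences and a small out-of-domain bilingual pool.
in_domain = [
    "the patient received a dose of the drug",
    "the drug reduced the fever",
]
pool = [
    ("the patient receives the drug", "el paciente recibe el medicamento"),
    ("stock markets fell sharply today", "las bolsas cayeron con fuerza hoy"),
]

# The threshold of 0.3 is an arbitrary illustrative value, not taken from the paper.
selected = select_sentences(in_domain, pool, threshold=0.3)
print(selected)
```

In this toy run only the medical sentence pair passes the threshold. In practice the selected pairs would be added to the in-domain training data before training the translation system.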



from Speech via a.sfakia on Inoreader https://ift.tt/2rALHpW
