Name

Yves Lepage (ルパージュ イヴ)

Position

Professor (https://researchmap.jp/read0151042/)

Affiliation

(Graduate School of Information, Production and Systems)

Contact

Email address

yves.lepage@waseda.jp

URLs

Web page URL

http://lepage-lab.ips.waseda.ac.jp/

Researcher number
70573608

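Other affiliations within the university

University research institutes

Institute of Language and Speech Science (ことばの科学研究所)

Institute member, 2013-2018

Institute of Language and Speech Science (ことばの科学研究所)

Institute member, 2012-2013

Waseda Research Institute for Science and Engineering (理工学術院総合研究所)

Concurrent researcher, 2018-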

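Education and degrees

Education

-1983: Mines Saint-Etienne (France), grande école, Graduate School of Engineering, Computer Science
-1985: University of Grenoble (France), Graduate School of Computer Science, Computer Science, Natural Language Processing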

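Academic society memberships

Editorial board of the French natural language processing journal (board member)

The Association for Natural Language Processing (member)

ATALA (French association for natural language processing)

Information Processing Society of Japan (member)

Committee and board service (outside the university)

2008-2016: Traitement automatique des langues (TAL), editor-in-chief of the editorial board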

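Media guide

Category
Informatics
Areas of expertise
Natural language processing, machine translation
Keywords
Example-based methods, analogy

Research fields

Keywords

Machine translation, multilingual alignment, analogy, paraphrasing, language models, foreign-language software

KAKENHI (Grants-in-Aid for Scientific Research) classification

Informatics / Human informatics / Intelligent informatics

Humanities / Linguistics / Linguistics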

Papers

A method of generating translations of unseen n-grams by using proportional analogy

Luo, Juan; Lepage, Yves

IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING 11(3), p.325 - 330, 2016

ISSN:1931-4973

Extraction of Potentially Useful Phrase Pairs for Statistical Machine Translation

Luo, Juan; Lepage, Yves

Journal of Information Processing 23(3), p.344 - 352, 2015

ISSN:1882-6652

Abstract: Over the last decade, an increasing amount of work has been done to advance the phrase-based statistical machine translation model in which the method of extracting phrase pairs consists of word alignment and phrase extraction. In this paper, we show that, for Japanese-English and Chinese-English statistical machine translation systems, this method is indeed missing potentially useful phrase pairs which could lead to better translation scores. These potentially useful phrase pairs can be detected by looking at the segmentation traces after decoding. We choose to see the problem of extracting potentially useful phrase pairs as a two-class classification problem: among all the possible phrase pairs, distinguish the useful ones from the not-useful ones. As for any classification problem, the question is to discover the relevant features which contribute the most. Extracting potentially useful phrase pairs resulted in a statistically significant improvement of 7.65 BLEU points in English-Chinese and 7.61 BLEU points in Chinese-English experiments. A slight increase of 0.94 BLEU points and 0.4 BLEU points is also observed for the English-Japanese and Japanese-English systems, respectively.

Inflating a Small Parallel Corpus into a Large Quasi-parallel Corpus Using Monolingual Data for Chinese-Japanese Machine Translation

Yang, Wei; Shen, Hanfei; Lepage, Yves

Journal of Information Processing 25(0), p.88 - 99, 2017

Abstract: Increasing the size of parallel corpora for less-resourced language pairs is essential for machine translation (MT). To address the shortage of parallel corpora between Chinese and Japanese, we propose a method to construct a quasi-parallel corpus by inflating a small amount of Chinese-Japanese corpus, so as to improve statistical machine translation (SMT) quality. We generate new sentences using analogical associations based on large amounts of monolingual data and a small amount of parallel data. We filter over-generated sentences using two filtering methods: one based on BLEU and the second one based on N-sequences. We add the obtained aligned quasi-parallel corpus to a small parallel Chinese-Japanese corpus and perform SMT experiments. We obtain significant improvements over a baseline system.

The structure of unseen trigrams and its application to language models: A first investigation

Lepage, Yves; Gosme, Julien; Lardilleux, Adrien

2010 4th International Universal Communication Symposium, IUCS 2010 - Proceedings, p.273 - 280, December 2010

Abstract: In a series of preparatory experiments in 4 languages on subsets of the Europarl corpus, we show that a large number of unseen trigrams can be reconstructed by proportional analogy with trigrams having the lowest frequencies. We derive a very simple smoothing scheme from this empirical result and show that it outperforms Good-Turing and Kneser-Ney smoothing schemes on trigram models in all 11 languages on the common multilingual part of the Europarl corpus, except Finnish. © 2010 IEEE.

Estimating the proximity between languages by their commonality in vocabulary structures

Lepage, Yves; Gosme, Julien; Lardilleux, Adrien

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6562 LNAI, p.127 - 138, April 2011

ISSN:03029743

Abstract: This article proposes a possible way of measuring proximity between languages: it consists in measuring the commonality of structures between the vocabularies of two languages. Experiments conducted on a multilingual lexicon of nine European languages acquired from the Acquis communautaire confirmed usual knowledge on the closeness or remoteness of these languages. © 2011 Springer-Verlag.

The true score of statistical paraphrase generation

Chevelu, Jonathan; Putois, Ghislain; Lepage, Yves

Coling 2010 - 23rd International Conference on Computational Linguistics, Proceedings of the Conference 2, p.144 - 152, December 2010

Abstract: This article delves into the scoring function of the statistical paraphrase generation model. It presents an algorithm for exact computation and two applicative experiments. The first experiment analyses the behaviour of a statistical paraphrase generation decoder, and raises some issues with the ordering of n-best outputs. The second experiment shows that a major boost of performance can be obtained by embedding a true score computation inside a Monte-Carlo sampling based paraphrase generator.

Ambiguity spotting using WordNet semantic similarity in support to recommended practice for software requirements specifications

Matsuoka, Jin; Lepage, Yves

NLP-KE 2011 - Proceedings of the 7th International Conference on Natural Language Processing and Knowledge Engineering, p.479 - 484, December 2011

Abstract: Word Sense Disambiguation is a crucial problem in documents whose purpose is to serve as specifications for automatic systems. The combination of different techniques of Natural Language Processing can help in this task. In this paper, we show how to detect ambiguous terms in Software Requirements Specifications. We also propose a computer-aided method that alerts the reader to possibly ambiguous usage of terms. The method uses a compound term measure (C-value), WordNet semantic similarity (WordNet wup-similarity) and a proposed semantic similarity measure between sentences. © 2011 IEEE.

Fully-automatic marker-based chunking in 11 European languages and counts of the number of analogies between chunks

Takeya, Kota; Lepage, Yves

PACLIC 25 - Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation, p.567 - 576, December 2011

Abstract: Analogy has been proposed as a possible principle for example-based machine translation. For such a framework to work properly, the training data should contain a large number of analogies between sentences. Consequently, such a framework can only work properly with short and repetitive sentences. To handle longer and more varied sentences, cutting the sentences into chunks could be a solution if the number of analogies between chunks is confirmed to be large. This paper thus reports counts of the number of analogies using different numbers of chunk markers in 11 European languages. These experiments confirm that the number of analogies between chunks is very large: several tens of thousands of analogies between chunks extracted from sentences among which only very few analogies, if not none, were found. © 2011 by Kota Takeya and Yves Lepage.

Improving sampling-based alignment by investigating the distribution of N-grams in phrase translation tables

Luo, Juan; Lardilleux, Adrien; Lepage, Yves

PACLIC 25 - Proceedings of the 25th Pacific Asia Conference on Language, Information and Computation, p.150 - 159, December 2011

Abstract: This paper describes an approach to improve the performance of sampling-based multilingual alignment on translation tasks by investigating the distribution of n-grams in the translation tables. This approach consists in enforcing the alignment of n-grams. The quality of phrase translation tables output by this approach and that of MGIZA++ is compared in statistical machine translation tasks. Significant improvements for this approach are reported. In addition, merging translation tables is shown to outperform state-of-the-art techniques. © 2011 by Juan Luo, Adrien Lardilleux, and Yves Lepage.

Generalizing sampling-based multilingual alignment

Lardilleux, Adrien; Yvon, François; Lepage, Yves

Machine Translation 27(1), p.1 - 23, March 2013

ISSN:09226567

Abstract: Sub-sentential alignment is the process by which multi-word translation units are extracted from sentence-aligned multilingual parallel texts. This process is required, for instance, in the course of training statistical machine translation systems. Standard approaches typically rely on the estimation of several probabilistic models of increasing complexity and on the use of various heuristics that make it possible to align first isolated words, then, by extension, groups of words. In this paper, we explore an alternative approach which relies on a much simpler principle: the comparison of occurrence profiles in sub-corpora obtained by sampling. After analyzing the strengths and weaknesses of this approach, we show how to improve the detection of multi-word translation units and evaluate these improvements on machine translation tasks. © 2012 Springer Science+Business Media B.V.

Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese?

Sun, Jing; Lepage, Yves

Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012, p.351 - 360, December 2012

Abstract: Unlike most Western languages, there are no typographic boundaries between words in written Japanese and Chinese. Word segmentation is thus normally adopted as an initial step in most natural language processing tasks for these Asian languages. Although word segmentation techniques have improved greatly both theoretically and practically, there still remain some problems to be tackled. In this paper, we present an effective approach for extracting Chinese and Japanese phrases without conducting word segmentation beforehand, using a sampling-based multilingual alignment method. According to our experiments, it is also feasible to train a statistical machine translation system on a small Japanese-Chinese training corpus without performing word segmentation beforehand. © 2012 The PACLIC.

Analogy-based machine translation using secability

Kimura, Tatsuya; Matsuoka, Jin; Nishikawa, Yusuke; Lepage, Yves

Proceedings - 2014 International Conference on Computational Science and Computational Intelligence, CSCI 2014 2, p.297 - 298, January 2014

Abstract: The problem of reordering remains the main problem in machine translation. Computing structures of sentences and the alignment of substructures is a way that has been proposed to solve this problem. We use secability to compute structures and show its effectiveness in example-based machine translation. © 2014 IEEE.

Marker-based chunking in eleven European languages for analogy-based translation

Takeya, Kota; Lepage, Yves

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8387 LNAI, p.432 - 444, January 2014

ISSN:03029743

Abstract: An example-based machine translation (EBMT) system based on proportional analogies requires numerous proportional analogies between linguistic units to work properly. Consequently, long sentences cannot be handled directly in such a framework. Cutting sentences into chunks would be a solution. Using different markers, we count the number of proportional analogies between chunks in 11 European languages. As expected, the number of proportional analogies between chunks found is very high. These results, and preliminary experiments in translation, are promising for the EBMT system that we intend to build. © 2014 Springer International Publishing.

Improving the distribution of N-grams in phrase tables obtained by the sampling-based method

Luo, Juan; Lardilleux, Adrien; Lepage, Yves

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8387 LNAI, p.419 - 431, January 2014

ISSN:03029743

Abstract: We describe an approach to improve the performance of the sampling-based sub-sentential alignment method on translation tasks by investigating the distribution of n-grams in the phrase tables. This approach consists in enforcing the alignment of n-grams. We compare the quality of phrase translation tables output by this approach and that of the state-of-the-art estimation approach in statistical machine translation tasks. We report significant improvements for this approach and show that merging phrase tables outperforms the state-of-the-art techniques. © 2014 Springer International Publishing.

Improved Chinese-Japanese phrase-based MT quality using an extended quasi-parallel corpus

Wang, Hao; Yang, Wei; Lepage, Yves

PIC 2014 - Proceedings of 2014 IEEE International Conference on Progress in Informatics and Computing, p.6 - 10, January 2014

Abstract: State-of-the-art phrase-based machine translation (MT) systems usually demand large parallel corpora in the training step. The quality and the quantity of the training data exert a direct influence on the performance of such translation systems. The lack of open-source bilingual corpora for a particular language pair results in lower translation scores reported for such a language pair. This is the case of Chinese-Japanese. In this paper, we propose to build an extension of an initial parallel corpus in the form of quasi-parallel sentences, instead of adding new parallel sentences. The extension of the initial corpus is obtained by using monolingual analogical associations. Our experiments show that the use of such quasi-parallel corpora improves the performance of Chinese-Japanese translation systems. © 2014 IEEE.

Inflating a training corpus for SMT by using unrelated unaligned monolingual data

Yang, Wei; Lepage, Yves

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 8686, p.236 - 248, January 2014

ISSN:03029743

Abstract: To improve the translation quality of less resourced language pairs, the most natural answer is to build larger and larger aligned training data, that is, to make those language pairs well resourced. But aligned data is not always easy to collect. In contrast, monolingual data are usually easier to access. In this paper we show how to leverage unrelated unaligned monolingual data to construct additional training data that varies only a little from the original training data. We measure the contribution of such additional data to translation quality. We report an experiment between Chinese and Japanese where we use 70,000 sentences of unrelated unaligned monolingual additional data in each language to construct new sentence pairs that are not perfectly aligned. We add these sentence pairs to a training corpus of 110,000 sentence pairs, and report an increase of 6 BLEU points. © Springer International Publishing Switzerland 2014.

Exploiting parallel corpus for handling out-of-vocabulary words

Luo, Juan; Tinsley, John; Lepage, Yves

27th Pacific Asia Conference on Language, Information, and Computation, PACLIC 27, p.399 - 408, January 2013

Abstract: This paper presents a hybrid model for handling out-of-vocabulary words in Japanese-to-English statistical machine translation output by exploiting a parallel corpus. As the Japanese writing system makes use of four different script sets (kanji, hiragana, katakana, and romaji), we treat these scripts differently. A machine transliteration model is built to transliterate out-of-vocabulary Japanese katakana words into English words. A Japanese dependency structure analyzer is employed to tackle out-of-vocabulary kanji and hiragana words. The evaluation results demonstrate that it is an effective approach for addressing out-of-vocabulary word problems and decreasing the OOV rate in Japanese-to-English machine translation tasks. © 2013 by Juan Luo, John Tinsley, and Yves Lepage.

Analogies Between Binary Images: Application to Chinese Characters

Lepage, Yves

Studies in Computational Intelligence 548, p.25 - 57, January 2014

ISSN:1860949X

Abstract: The purpose of this chapter is to show how it is possible to efficiently extract the structure of a set of objects by use of the notion of proportional analogy. As a proportional analogy involves four objects, the very naïve approach to the problem basically has a complexity of O(n^4) for a given set of n objects. We show, under some conditions on proportional analogy, how to reduce this complexity to O(n^2) by considering an equivalent problem, that of enumerating analogical clusters that are informative and not redundant. We further show how some improvements make the task tractable. We illustrate our technique with a task related to natural language processing, that of clustering Chinese characters. In this way, we re-discover the graphical structure of these characters. © Springer-Verlag Berlin Heidelberg 2014.

Chinese word segmentation based on analogy and majority voting

Zheng, Zongrong; Wang, Yi; Lepage, Yves

29th Pacific Asia Conference on Language, Information and Computation, PACLIC 2015, p.151 - 156, January 2015

Abstract: This paper proposes a new method of Chinese word segmentation based on proportional analogy and majority voting. First, we introduce an analogy-based method for solving the word segmentation problem. Second, we show how to use majority voting to make the decision on where to segment. The preliminary results show that this approach compares well with other segmenters reported in previous studies. As an important and original feature, our method does not need any pretraining or lexical knowledge.

Translation of unseen bigrams by analogy using an SVM classifier

Wang, Hao; Lyu, Lu; Lepage, Yves

29th Pacific Asia Conference on Language, Information and Computation, PACLIC 2015, p.16 - 25, January 2015

Abstract: Detecting language divergences and predicting possible sub-translations is one of the most essential issues in machine translation. Because of translation divergences, it is impractical to translate straightforwardly from a source sentence into a target sentence while keeping a high degree of accuracy without additional information. In this paper, we investigate the problem from an emerging and special point of view: bigrams and the corresponding translations. We first profile corpora and explore the constituents of bigrams in the source language. Then we translate unseen bigrams based on proportional analogy and filter the outputs using a Support Vector Machine (SVM) classifier. The experiment results also show that even a small set of features from analogous can provide meaningful information in translating by analogy.

Hierarchical sub-sentential alignment with anymalign

Lardilleux, Adrien; Yvon, François; Lepage, Yves

Proceedings of the 16th Annual Conference of the European Association for Machine Translation, EAMT 2012, p.279 - 286, January 2012

Abstract: We present a sub-sentential alignment algorithm that relies on association scores between words or phrases. This algorithm is inspired by previous work on alignment by recursive binary segmentation and on document clustering. We evaluate the resulting alignments on machine translation tasks and show that we can obtain state-of-the-art results, with gains up to more than 4 BLEU points compared to previous work, with a method that is simple, independent of the size of the corpus to be aligned, and directly computes symmetric alignments. This work also provides new insights regarding the use of "heuristic" alignment scores in statistical machine translation. © 2012 European Association for Machine Translation.

Inflating a small parallel corpus into a large quasi-parallel corpus using monolingual data for Chinese–Japanese machine translation

Yang, Wei; Shen, Hanfei; Lepage, Yves

Journal of Information Processing 25, p.88 - 99, January 2017

ISSN:03875806

Abstract: Increasing the size of parallel corpora for less-resourced language pairs is essential for machine translation (MT). To address the shortage of parallel corpora between Chinese and Japanese, we propose a method to construct a quasi-parallel corpus by inflating a small amount of Chinese–Japanese corpus, so as to improve statistical machine translation (SMT) quality. We generate new sentences using analogical associations based on large amounts of monolingual data and a small amount of parallel data. We filter over-generated sentences using two filtering methods: one based on BLEU and the second one based on N-sequences. We add the obtained aligned quasi-parallel corpus to a small parallel Chinese–Japanese corpus and perform SMT experiments. We obtain significant improvements over a baseline system. © 2017 Information Processing Society of Japan.

Yet another symmetrical & real-time word alignment method: Hierarchical sub-sentential alignment using F-measure

Wang, Hao; Lepage, Yves

Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation, PACLIC 2016, p.143 - 152, January 2016

Abstract: Symmetrization of word alignments is a fundamental issue in statistical machine translation (SMT). In this paper, we describe a novel reformulation of the Hierarchical Sub-sentential Alignment (HSSA) method using F-measure. Starting with a soft alignment matrix, we use the F-measure to recursively split the matrix into two soft alignment submatrices. A direction is chosen at the same time on the basis of Inversion Transduction Grammar (ITG). In other words, our method simplifies the processing of word alignment as recursive segmentation in a bipartite graph, which is simple and easy to implement. It can be considered as an alternative to the grow-diag-final-and heuristic. We show its application on phrase-based SMT systems combined with state-of-the-art approaches. In addition, when fed with word-to-word associations, it can also serve as a real-time word aligner. Our experiments show that, given a reliable lexicon translation table, this simple method can yield results comparable with state-of-the-art approaches.

HSSA tree structures for BTG-based preordering in machine translation

Zhang, Yujia; Wang, Hao; Lepage, Yves

Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation, PACLIC 2016, p.123 - 132, January 2016

Abstract: The Hierarchical Sub-Sentential Alignment (HSSA) method is a method to obtain aligned binary tree structures for two aligned sentences in translation correspondence. We propose to use the binary aligned tree structures delivered by this method as training data for preordering prior to machine translation. For that, we learn a Bracketing Transduction Grammar (BTG) from these binary aligned tree structures. In two oracle experiments in English to Japanese and Japanese to English translation, we show that it is theoretically possible to outperform a baseline system with a default distortion limit of 6, by about 2.5 and 5 BLEU points and 7 and 10 RIBES points, respectively, when preordering the source sentences using the learnt preordering model and using a distortion limit of 0. An attempt at learning a preordering model and its results are also reported.

Solving analogical equations between strings of symbols using neural networks

Kaveeta, Vivatchai; Lepage, Yves

CEUR Workshop Proceedings 1815, p.67 - 76, January 2016

ISSN:16130073

Abstract: A neural network model to solve analogical equations between strings of symbols is proposed. The method transforms the input strings into two fixed-size alignment matrices. The matrices act as the input of the neural network, which predicts two output matrices. Finally, a string decoder transforms the predicted matrices into the final string output. By design, the neural network is constrained by several properties of analogy. The experimental results show a fast learning rate with a high prediction accuracy that can beat a baseline algorithm. Copyright © 2016 for this paper by its authors.

Morphological predictability of unseen words using computational analogy

Fam, Rashel; Lepage, Yves

CEUR Workshop Proceedings 1815, p.51 - 60, January 2016

ISSN:16130073

Abstract: We address the problem of predicting unseen words by relying on the organization of the vocabulary of a language as exhibited by paradigm tables. We present a pipeline to automatically produce paradigm tables from all the words contained in a text. We measure how many unseen words from an unseen test text can be predicted using the paradigm tables obtained from a training text. Experiments are carried out in several languages to compare the morphological richness of languages, and also the richness of the vocabulary of different authors. Copyright © 2016 for this paper by its authors.

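External research funding

Grants-in-Aid for Scientific Research (KAKENHI) awarded

Research category:

Language productivity: fast extraction of effective analogical clusters and their evaluation in statistical machine translation

2015-2018

Amount allocated: ¥4,550,000

Research category:

Improving alignment for statistical and example-based machine translation and releasing multilingual grammatical patterns

2011-2014

Amount allocated: ¥5,070,000

Research funding received

Type: Other

Multimedia platform for Arabic, 2009-2012

Intramural research programs

Grant for Special Research Projects

Example-based machine translation engine and experimental application platform

Academic year 2010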

Summary of research results: The final goal of this study is to produce an example-based machine translation engine that can be distributed to the research community on a site dedicated to example-based approaches to machine translation. The engine should use chunks to translate by analogy, and should be made fast by using C implementations of basic computations (resolution of analogical equations). The approach should be tested on various data, like the Europarl data.

1. Work on chunking has been done by implementing two methods: marker-based chunking (Gough and Way, 2004) (255 lines of Python code for chunking) and secability (Chenon, 2005) (170 lines of Python code). Tests on the Europarl corpus and an informal assessment of the relevance of the chunks produced by the two methods led us to prefer the marker-based chunking technique. In contrast to the standard method proposed by (Gough and Way, 2004), we automatically determine the markers as the most frequent, less informative words in a corpus (207 lines of Python code). The number of markers can be freely chosen by the user. Also in contrast to the standard method, we automatically determine whether to cut on the left or on the right of the markers, so as to have a truly language-independent method. There are still problems with this part of the computation, which is currently done by estimating the difference in entropies on the left and right of each marker. Improvements are under study.

1.1. We conducted experiments to compute the number of analogies between the chunks obtained (100,000 lines in 11 languages of the Europarl corpus, average sentence length in English: 30 words). This led to a paper at the Japanese Natural Language Processing Annual Conference (gengosyorigakkai) this year. My participation in gengosyorigakkai was charged to this budget.

1.2. The production of all chunks for each of the 11 languages of the Europarl corpus (300,000 lines in each language) has been done. The alignment of chunks by computation of lexical weights is currently being done. The corresponding programs have been written and tested (136 lines of code in Python). We determine the most reliable chunk segmentation between two languages by keeping the same average number of chunks for each sentence over the entire corpus. We are currently in the phase of producing the data.

1.3. Regarding language models, trigrams and analogy, related research on a new smoothing scheme for trigrams will be reported at the French Natural Language Processing Annual Conference. This technique has been shown to beat even Kneser-Ney smoothing on relatively small amounts of corpora: 300,000 lines from the Europarl corpus in all 11 languages except Finnish.

2. The translation engine

2.1. A new engine has been reimplemented in Python (511 lines of code). Its main feature is the use of threads to allow concurrent computation of different kinds. Each of the following tasks is performed in a different thread:
- generation of analogical equations,
- resolution of analogical equations,
- transfer from source language into target language, and
- linking between source text and translation.
This allows a clearer design. Work on the design is still in progress. In particular, the use of UML diagrams for class design made it possible to improve the code. The engine is now in its 3rd version. Two students are still working on the design of the engine through UML diagrams. Their part-time job salaries were charged to this budget.

2.2. The resolution of analogical equations as a C library has been integrated within the Python translation engine using C/Python SWIG. The same has been done for the efficient computation of distance or similarity between strings. The use of the C library leads to an acceleration of 5 to 10 times, measured on small examples in formal language theory (translation of the context-free language a^n.b^n into a regular language (ab)^n).

3. The validation part of the work is ongoing research. The production of the alignment of chunks in all pairs for the 11 languages of the Europarl corpus is currently being done. The next step will be a systematic assessment of translation by analogy of the chunks in each of these pairs using the standard scripts for assessment with various translation quality metrics: WER, BLEU, NIST and TER.

4. The disclosure of the translation engine on the example-based web site is unfortunately not yet possible. It is hoped that it will become possible in the next few months.
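The marker-based chunking described in item 1 of the summary above can be illustrated with a minimal sketch. This is not the laboratory's code: the function names `find_markers` and `chunk` and the toy corpus are hypothetical, markers are simply taken to be the k most frequent words, and the cut is always made to the left of a marker, whereas the summary says the side is chosen automatically from entropy differences.

```python
from collections import Counter


def find_markers(sentences, k):
    """Return the k most frequent words of the corpus as the marker set."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return {word for word, _ in counts.most_common(k)}


def chunk(sentence, markers):
    """Cut a sentence to the left of markers (a fixed choice in this sketch)."""
    chunks, current, has_content = [], [], False
    for word in sentence.split():
        # Start a new chunk at a marker, but only once the current chunk
        # already holds at least one non-marker word.
        if word in markers and has_content:
            chunks.append(" ".join(current))
            current, has_content = [], False
        current.append(word)
        if word not in markers:
            has_content = True
    if current:
        chunks.append(" ".join(current))
    return chunks


# Toy corpus (hypothetical data, for illustration only).
corpus = ["the cat sat on the mat", "the dog lay on the rug"]
markers = find_markers(corpus, k=2)
print(chunk(corpus[0], markers))
```

Because the number of markers `k` is a free parameter, as in the summary, varying it changes how finely sentences are cut.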

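Study of a method for simultaneous structural analysis of two languages for example-based machine translation

Academic year 2013

Summary of research results: Background and goal. This study aims to determine an appropriate translation table for the example-based translation engine developed in our laboratory. While statistical machine translation is currently an active research area, we are developing an example-based translation engine based on analogy. As the core technique, we are working on the formalization and implementation of computing a fourth item from three sentence fragments (e.g., 「風邪を」 : 「ひどい風邪が」 :: 「熱さを」 : x ⇒ x = 「ひどい熱さが」). As in statistical machine translation, a translation table is required as translation knowledge.

To generate the translation table, this study applies the clustering method of (Zha et al., 2001) to word-to-word alignment results, so that parallel sentences are structurally analyzed and aligned at the same time. The translation table is generated automatically from the structural analysis and the alignment. We also compared it with the translation table obtained with the previously proposed monolingual structural analysis method, secability, and measured translation quality. The main results of this study are as follows.

① We showed that long sentences can be translated with the analogy-based example-based translation engine. Using secability for monolingual structural analysis, the translation experiments showed that the proposed method can translate long sentences. Translation quality measured in BLEU is lower than that of a statistical translation system, but the influence of sentence length shows the same behavior on the graph.

② We ran experiments on several language pairs and released the translation tables obtained. Using the Europarl corpus, we ran preliminary translation experiments on representative language pairs: French-English, Spanish-Portuguese and Finnish-English. In addition, we generated translation tables with the secability method for all pairs among the 11 languages, and published these translation tables and the BLEU scores obtained with them on our laboratory's web site (http://133.9.48.109/index/analogy-based-ebmt/, see "Experiments with an in-house analogy-based EBMT system").

③ We improved the bilingual simultaneous structural analysis and alignment tool. Distinguishing the general case from special cases of the computation reduced the number of basic operations and gave a 50-fold speedup; multiprocessing gave a speedup of a little under half the number of cores; together, this gave a 100-fold speedup on 4 cores.

In the experiments conducted, the translation results obtained with bilingual simultaneous structural analysis and alignment are slightly lower than those obtained with secability. However, since the input sentences were analyzed with the secability method in both experiments, the comparison can be considered unfair in a sense. As future work, when simultaneous structural analysis and alignment is used, a translation method that does not perform structural analysis of the input sentences should be studied.

Use of research funds: ① Registration fees for domestic and international conferences: Lepage (LTC 2013, Poland), Tatsuya Kimura (AISE 2013, Thailand), Yusuke Nishikawa and 尾美圭亮 (20th Annual Meeting of the Association for Natural Language Processing, Sapporo). ② Travel expenses for domestic and international conferences: Tatsuya Kimura (AISE 2013, Thailand), Yusuke Nishikawa and 尾美圭亮 (20th Annual Meeting of the Association for Natural Language Processing, Sapporo). ③ The planned book purchases were made from a different research fund because of budget adjustments.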
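The analogical equation quoted in the summary above can be reproduced with a deliberately simplified sketch. It is an assumption-laden illustration, not the laboratory's algorithm: the hypothetical function `solve_analogy` only handles the case where the first two terms share one common substring and differ by a prefix and a suffix, and it returns `None` for anything else.

```python
from difflib import SequenceMatcher


def solve_analogy(a, b, c):
    """Solve a : b :: c : x by prefix/suffix substitution only.

    Simplifying assumption: a and b share one longest common substring,
    and c carries the same prefix and suffix around it as a does.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    if match.size == 0:
        return None  # a and b have nothing in common
    a_pre, a_suf = a[:match.a], a[match.a + match.size:]
    b_pre, b_suf = b[:match.b], b[match.b + match.size:]
    if len(c) < len(a_pre) + len(a_suf):
        return None
    if not (c.startswith(a_pre) and c.endswith(a_suf)):
        return None  # c does not follow the same pattern as a
    stem = c[len(a_pre):len(c) - len(a_suf)]
    return b_pre + stem + b_suf


# The example from the summary: 風邪を : ひどい風邪が :: 熱さを : x
print(solve_analogy("風邪を", "ひどい風邪が", "熱さを"))
```

A general solver must also handle infixes, repetitions and several interleaved changes, which this sketch does not attempt.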

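Study of language productivity for machine translation: analogy maps

Academic year 2014

Summary of research results: We addressed the general problem of structuring language data, and the problem of improving translation quality in machine translation based on the results of that structuring. Structuring here means structuring based on analogy. To apply the method to the languages of the European Union, in addition to the Japanese-Chinese data to which it had been applied so far, a speedup was necessary. We achieved a speedup of more than 5 times, and experiments on English-French data, with measurements for various values of time and number of features, are in progress. The program developed in this study was applied in the Japanese-Chinese translation experiments presented at the international conference PolTAL and at the domestic Annual Meeting of the Association for Natural Language Processing. The same program was also used in the experiments of the paper presented at the international workshop Cogalex2014.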

Reducing the development time of statistical machine translation systems: a study of sampling methods

Academic year 2015

Summary of research results:

Background: training a statistical machine translation (SMT) system is time-consuming. In 2013, for the probabilistic approach, a fast alignment method (Fast_align) was proposed. It is 10 times as fast as the standard method (GIZA++).

Goal: the present research project addressed the problem of reducing the training time of SMT systems for the associative approach 1/ in word-to-word associations (Anymalign) and 2/ in hierarchical sub-sentential alignment (Cutnalign), while increasing translation accuracy.

Method: 1/ for word-to-word association, we studied two improvements in sampling: a/ sampling given the knowledge of a test set to produce ad-hoc translation tables (two different techniques to estimate inverse translation probabilities have been studied); b/ relying on whether a word is a hapax or not to build and sample sub-corpora. 2/ For sub-sentential alignment, we accelerated decisions in segmentation and reduced the search space. Core components have been re-implemented in C and we introduced multi-processing.

Results: we report improvements in time and translation accuracy using three different language pairs: Spanish-Portuguese, French-English and Finnish-English. Compared to our previous methods, our improved methods increased translation accuracy by one confidence interval on average. Compared with Fast_align, same or lower training times yield similar translation accuracy in the two easiest language pairs.

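Courses currently taught

Course title / School / Academic year / Term
Background and Foundations of Distributional Semantics / Graduate School of Information, Production and Systems / 2019 / Fall semester
Research on Example-Based Machine Translation and Natural Language Processing (Master's), Spring / Graduate School of Information, Production and Systems / 2019 / Spring semester
Research on Example-Based Machine Translation and Natural Language Processing (Master's), Fall / Graduate School of Information, Production and Systems / 2019 / Fall semester
Advanced Topics in Example-Based Machine Translation and Natural Language Processing / Graduate School of Information, Production and Systems / 2019 / Fall semester
Natural Language Processing / Graduate School of Information, Production and Systems / 2019 / Spring semester
Machine Translation Technology / Graduate School of Information, Production and Systems / 2019 / Spring semester
Seminar on Example-Based Machine Translation and Natural Language Processing A / Graduate School of Information, Production and Systems / 2019 / Fall semester
Seminar on Example-Based Machine Translation and Natural Language Processing D / Graduate School of Information, Production and Systems / 2019 / Fall semester
Seminar on Example-Based Machine Translation and Natural Language Processing C / Graduate School of Information, Production and Systems / 2019 / Spring semester
Seminar on Example-Based Machine Translation and Natural Language Processing B / Graduate School of Information, Production and Systems / 2019 / Spring semester
Research on Example-Based Machine Translation and Natural Language Processing (Doctoral), Spring / Graduate School of Information, Production and Systems / 2019 / Spring semester
Research on Example-Based Machine Translation and Natural Language Processing (Doctoral), Fall / Graduate School of Information, Production and Systems / 2019 / Fall semester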