Hello and welcome to the second part of Module 6: "Challenges in Multilingual Text Analysis - cross-lingual alignment". In the first part of this module, we focussed on automatic language identification and code-switching, i.e. the automatic detection of the language of a given text segment. The second part of this module deals with parallel corpora, which contain translated texts, and with how the parts of such a parallel corpus can be connected and aligned with each other. That's what we call cross-lingual alignment. In this module, we will discuss possible alignments on the level of sentences and words.
Let's first take a look at parallel corpora in general. Those are texts that are available together with their translations in two or more languages. Some examples: The texts of the European Parliament, called "Europarl", are very well-known; I will go into detail in a minute. The texts of the Canadian Parliament, the "Canadian Hansard", are also used as parallel corpora. The collections of laws of the European Union or of Switzerland are available in three or four languages, and great amounts of film subtitles are available for several languages. The subtitles of the TED talks have been translated into several dozen languages. There are also large collections of patents in various languages. You see, there is a great variety of text genres available as parallel corpora.
Let's take a closer look at the Europarl corpus, where all the utterances from the European Parliament have been collected. These are very large amounts of text data. In the current slide, we see that for German there are 47 million words and 2.1 million sentences in this corpus. The other languages that have been represented in the EU for a long time have a comparable amount of text, while smaller amounts are available for the languages of countries that joined the EU more recently.
All those texts have been preprocessed and are ready for linguistic and computational-linguistic research and analyses. You can easily use them for your own investigations. An example: A very short utterance, close to the style of spoken language: DE: "Vielen Dank, Herr Segni, das will ich gerne tun." EN: "Thank you, Mister Segni, I shall do so gladly." But you can also find longer utterances: DE: "Wie Sie sicher aus der Presse und dem Fernsehen wissen, gab es in Sri Lanka mehrere Bombenexplosionen mit zahlreichen Toten." This is an utterance from January 2000, and translations of it can be found for all the languages that were already part of the European Union at that time.
If we want to use such parallel corpora, alignments can help to make them more useful to us. Alignments can take place on all levels of the linguistic structure: from paragraph to paragraph, from sentence to sentence and from word to word.
Let's start with the approaches for automatic sentence alignment. The question is the following: We have a pair of translated documents, d1 and d2, in two languages, and the task of automatic sentence alignment is to find out which sentence s1 in document d1 is a translation of which sentence s2 in document d2. The alignment of multiple sentences in d1 with one sentence in d2 is of course possible. It is indeed often the case that two short sentences in a German document are the translation of one long sentence in another language, English for example. When applying approaches for automatic sentence alignment, we assume that the sentences follow the same order in both texts.
These alignment approaches are typically based on the comparison of sentence length, i.e. the number of characters per sentence: letters, punctuation marks, blank spaces, etc. are counted and the totals are compared with each other. The algorithm then identifies the sentences that most likely correspond to each other. This algorithm also benefits from so-called "anchor words".
Anchor words are words that are identical in both languages, e.g. numbers and numerical expressions. For instance, when we see "2005" in the German sentence, it is very likely that this year also appears in the French text. You need to take into account that there are different conventions: some numbers are written in Arabic numerals in some languages, while other languages express them in Roman numerals. Other typical anchor words are proper nouns, especially personal names like "Angela Merkel" in this example, or geographical expressions. Please note that toponyms are often translated: the German word "Matterhorn" doesn't simply correspond to "Matterhorn" in French but to "Mont Cervin", the French name of the Matterhorn. It is important to use word lists in such cases. Additionally, anchor words can also be words that are written in a similar way, for example German "Lokomotive", "locomotive" in English and "locomotive" in French. These words can also be used to compute alignments beyond sentence boundaries.
Here is an example of what this could look like in the Text+Berg corpus. We start with the first sentence pair and check whether the lengths correspond more or less, which is the case in the current example. We move on to the second sentence and to the third sentence. And here we see that the German sentence is very long while the French sentence is really short. Then, we take the subsequent sentence in French and check whether anchor words indicate a correspondence. If yes, the sentence is added to the current alignment and we move on to the next sentence pairs. The long German sentence is actually a chain of two sentences whose boundary we weren't able to identify correctly, and we see that two French sentences correspond to this long German sentence.
A program to compute these alignments is "InterText", a freely available program that you can find online.
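The procedure walked through above (compare lengths first, then use anchor words as additional evidence) can be sketched as a small dynamic program. This is only a minimal illustration under stated assumptions and not the algorithm implemented in any particular tool: the cost weights (0.1, 0.2, 1.0), the restriction of anchor words to digit sequences, the function names and the toy sentences are all invented for this example.

```python
import re

# Allowed alignment types (number of source sentences, number of target sentences).
BEADS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]

def anchors(text):
    """Digit sequences serve as very simple anchor words (assumption)."""
    return set(re.findall(r"\d+", text))

def bead_cost(src_group, tgt_group):
    """Lower is better: small length difference, many shared anchors."""
    if not src_group or not tgt_group:
        return 1.0  # flat penalty for leaving a sentence unaligned (assumed weight)
    src_text, tgt_text = " ".join(src_group), " ".join(tgt_group)
    longest = max(len(src_text), len(tgt_text), 1)
    length_diff = abs(len(src_text) - len(tgt_text)) / longest
    merge_penalty = 0.1 * (len(src_group) + len(tgt_group) - 2)
    anchor_bonus = 0.2 * len(anchors(src_text) & anchors(tgt_text))
    return length_diff + merge_penalty - anchor_bonus

def align(src_sents, tgt_sents):
    """Return a list of (source indices, target indices), in document order."""
    n, m = len(src_sents), len(tgt_sents)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + bead_cost(src_sents[i:ni], tgt_sents[j:nj])
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (di, dj)
    # Follow the back-pointers from the end of both documents to the start.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return beads[::-1]

# Invented toy data: one long German sentence corresponds to two French ones.
german = [
    "Wir starteten um 6 Uhr.",
    "Der Gipfel wurde 1865 erstmals bestiegen und er ist heute ein sehr beliebtes Ziel.",
]
french = [
    "Nous sommes partis à 6 heures.",
    "Le sommet a été gravi pour la première fois en 1865.",
    "Il est aujourd'hui une destination très populaire.",
]
print(align(german, french))
```

Because the sketch assumes that both texts follow the same sentence order, it only has to decide how many sentences to group on each side at every step; that is what keeps the search small.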
You can upload a parallel text into InterText and the sentence alignments will be computed at the touch of a button. The underlying algorithm is implemented in the system "HunAlign". This "HunAlign" program is available as a single, independent program or integrated in InterText. I'd recommend that you use "InterText"; it has a very nice graphical user interface that you can see in the current slide. Marked in yellow are those sentences where the system was not able to find 1:1 translations but 2:2 alignments or indeed 1:0 alignments. We see in the next slide that this Italian sentence doesn't have a corresponding translation in English. These alignments can be manually corrected and adapted directly in the "InterText" interface until you reach the alignment quality that you expect. In general, automatic sentence alignment works fairly well for a large variety of texts that are reasonably close translations of each other and don't contain too many omissions.
If we are dealing with texts that are very different, a slightly different approach based on machine translation should be used for automatic sentence alignment. The idea is that we take a sentence s1 in the source document and translate it into the language of the target document, French for instance, by using machine translation. If the machine-translated sentence is similar to a sentence in the target document, these two sentences are aligned. We thus need a machine translation system for this language pair in order to apply such an approach. It is particularly suitable for texts with many omissions, insertions and OCR errors. We can say that these are texts that contain a lot of "noise". A system in which this approach has been implemented is called "Bleualign". It is freely available and you can use it for such texts.
That's all about sentence alignment; let's move on to word alignment.
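Before we do, the MT-based idea just described can be sketched in a few lines. This is a rough illustration under assumptions, not the method of any particular tool: it presupposes that the source sentences have already been machine-translated into the target language, it swaps in a simple token-overlap (Dice) score as the similarity measure, and the threshold of 0.5, the function names and the toy sentences are invented for this example.

```python
import re

def tokens(sentence):
    """Lower-cased word tokens; punctuation is ignored."""
    return set(re.findall(r"\w+", sentence.lower()))

def dice(a, b):
    """Token overlap between two sentences, between 0.0 and 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))

def align_by_translation(translated_src, tgt_sents, threshold=0.5):
    """Link each machine-translated source sentence to its most similar
    target sentence, but only if the similarity reaches the threshold."""
    links = []
    if not tgt_sents:
        return links
    for i, hypothesis in enumerate(translated_src):
        scores = [dice(hypothesis, t) for t in tgt_sents]
        j = max(range(len(tgt_sents)), key=scores.__getitem__)
        if scores[j] >= threshold:
            links.append((i, j))
    return links

# Invented toy data: machine-translated source sentences and a target
# document that contains one extra sentence (an insertion).
mt_output = [
    "thank you mr segni i will gladly do that",
    "there were several bomb explosions in sri lanka",
]
target = [
    "Thank you, Mr Segni, I shall do so gladly.",
    "(Applause)",
    "As you know, there were several bomb explosions in Sri Lanka.",
]
print(align_by_translation(mt_output, target))
```

Unlike the length-based procedure, this sketch does not rely on sentence order or length, which is why the inserted target sentence simply stays unaligned instead of shifting all following alignments.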
We have the following question now: In a pair of translated sentences s1 and s2, we want to know which word w1 in sentence s1 is a translation of which word w2 in sentence s2 in the target language. Again, we might have alignments between groups of words in one sentence and single words in the other sentence. And we cannot assume that these words are in the same order in both sentences. Word alignment is inherently a more complex problem than automatic sentence alignment.
In the following word alignment example, we have a German sentence on the left and a French sentence on the right. Where the arrows aren't straight, the word order is not identical in the two sentences. That's something that we need to take into account when dealing with automatic word alignment.
How does this system work? Automatic word alignment is based on the co-occurrence of words in the parallel, translated sentences. The intuition behind that is: Take all German sentences where, for example, the word "Haus" (house) occurs. Then, look at all their French translations. Which word in these parallel French sentences is the most frequent translation of "Haus" (house)? We will find out that it is the word "maison". We can thus conclude that "maison" is a translation of the German word "Haus" (house). Large parallel corpora are required in order to have sufficient statistical evidence. That's how this works intuitively. In practice, a more complex algorithm called the "Expectation Maximisation Algorithm" is used, which applies this method to all the words in a sentence at the same time. Then, the word alignment with the highest probability is computed. There is a freely available program called "GIZA++" that can be used to compute the word alignments automatically. Then, we obtain word alignment matrices.
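The co-occurrence intuition and the Expectation Maximisation step can be illustrated with a toy sketch of a very simple lexical translation model. Please read it as an illustration under assumptions only: a real system such as GIZA++ uses more elaborate models, and the four-sentence corpus, the function names and the number of iterations are invented for this example.

```python
from collections import defaultdict

def train_translation_table(pairs, iterations=10):
    """Estimate t[(f, e)], the probability that source word e is
    translated as target word f, with Expectation Maximisation."""
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each target word over the source words of
        # its sentence pair, in proportion to the current probabilities.
        for src, tgt in pairs:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    fraction = t[(f, e)] / norm
                    count[(f, e)] += fraction
                    total[e] += fraction
        # M-step: turn the expected counts into new probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

def align_words(src, tgt, t):
    """Link every target word to its most probable source word."""
    return [(max(src, key=lambda e: t[(f, e)]), f) for f in tgt]

# Invented toy corpus of tokenised German-French sentence pairs.
corpus = [
    (["das", "Haus"], ["la", "maison"]),
    (["ein", "Haus"], ["une", "maison"]),
    (["das", "Buch"], ["le", "livre"]),
    (["ein", "Buch"], ["un", "livre"]),
]
table = train_translation_table(corpus)
print(align_words(["das", "Haus"], ["la", "maison"], table))
```

In this tiny corpus, "maison" is the only French word that appears in every sentence containing "Haus", so with each iteration more of its probability mass moves to "Haus", and the articles are left to explain each other.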
A word alignment matrix might look like this, for example: Vertically, we have the German sentence: "Es war die Zeit, in der Portugal und Spanien die Welt unter sich aufgeteilt hatten." And horizontally, the English sentence: "It was an age in which Portugal and Spain had divided the world between themselves." The automatically computed alignment for the German word "Zeit" is "age". For the German word "die", no alignment was found. That's most likely due to the corpus evidence: the indefinite article "an" in the expression "an age" rarely co-occurs with the German article "die" in our corpus. That's why the alignment system didn't suggest any corresponding translation. You can see that words that aren't in the same order can also be aligned properly, like the German "aufgeteilt hatten", which needs to be aligned with the English expression "had divided". But there are also alignment errors: The German word "aufgeteilt" has also been aligned with "an age". That's wrong. This should show you that automatic word alignment doesn't always work perfectly. With statistical alignment approaches we can, however, achieve useful results.
All this can also work in a multi-parallel way. We computed the word alignments for the languages German, English, Spanish, French and Italian. I selected the English word "partnership" and, as you can see, the corresponding words "Partnerschaft" and "colaboración" are found, and the alignments for French and Italian are correct as well. But there is also a wrong alignment: an additional, redundant alignment in French. You can also see that the alignments are computed in a certain direction: The yellow alignment of "accord" in French is only an alignment for the direction French -> English and not vice versa. That's something we can use if we want to clean the word alignments.
Let's summarise: Alignments are a very important aspect of pre-processing parallel corpora.
Sentence alignments are usually computed by comparing sentence lengths or, in special cases, by using machine translation. Word alignments are based on the co-occurrence of words in parallel, aligned sentences. This alignment information is saved in XML and can be used for further applications and research. Parallel corpora are fascinating because the translations carry a lot of additional information about the source language and thus allow us to disambiguate. I would like to conclude this module and thank you for your attention. Thank you very much!