{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Machine Translation: Word-based SMT using IBM1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook we'll look at the IBM Model 1 word alignment method, and use it to demonstrate training with expectation maximisation (EM) on a toy dataset. Note that the dataset and presentation closely follow Jurafsky & Martin (2nd edition, Chapter 25; hereafter JM). The optimised version of the code is based on Koehn, *Statistical Machine Translation* (2009), Chapter 4."
]
},
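{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, Model 1 assigns the following probability to a Spanish sentence $F$ (length $J$) and an alignment $A$ given an English sentence $E$ (length $I$):\n",
"\n",
"$$P(F, A \\mid E) = \\frac{\\epsilon}{(I+1)^J} \\prod_{j=1}^{J} t(f_j \\mid e_{a_j})$$\n",
"\n",
"where $t(f \\mid e)$ is the word translation table we'll be learning. Since the code below ignores the null word, its alignment space is effectively normalised by $I^J$ rather than $(I+1)^J$."
]
},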
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"import itertools"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our dataset will consist of two very short sentence pairs."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"bitext = []\n",
"bitext.append((\"green house\".split(), \"casa verde\".split()))\n",
"bitext.append((\"the house\".split(), \"la casa\".split()))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Based on the vocabulary items in Spanish and English, we initialise our translation table, *t*, to a uniform distribution. That is, for each word type in English, we give every Spanish translation probability 1/3."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(dict,\n",
" {'green': {'casa': 0.3333333333333333,\n",
" 'la': 0.3333333333333333,\n",
" 'verde': 0.3333333333333333},\n",
" 'house': {'casa': 0.3333333333333333,\n",
" 'la': 0.3333333333333333,\n",
" 'verde': 0.3333333333333333},\n",
" 'the': {'casa': 0.3333333333333333,\n",
" 'la': 0.3333333333333333,\n",
" 'verde': 0.3333333333333333}})"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"t0 = defaultdict(dict)\n",
"for en_type in \"the green house\".split():\n",
" for es_type in \"la casa verde\".split():\n",
" t0[en_type][es_type] = 1.0 / 3\n",
"t0"
]
},
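{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a larger corpus we wouldn't want to hard-code the vocabularies as above. Here's a minimal sketch of deriving the same uniform initialisation from the bitext itself (*init_uniform* is our own helper, not from JM or Koehn):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def init_uniform(bitext):\n",
"    # collect the Spanish vocabulary from the bitext\n",
"    es_vocab = set()\n",
"    for E, F in bitext:\n",
"        es_vocab.update(F)\n",
"    # give every Spanish type equal probability under each English type\n",
"    t = defaultdict(dict)\n",
"    for E, F in bitext:\n",
"        for e in E:\n",
"            for f in es_vocab:\n",
"                t[e][f] = 1.0 / len(es_vocab)\n",
"    return t\n",
"\n",
"assert init_uniform(bitext) == t0"
]
},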
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now for the algorithm itself. Although in practice the expectation and maximisation steps are usually merged (to avoid storing large data structures for the expected counts), here we'll keep the two separate for clarity. Also, following JM:\n",
" - we won't apply the IBM1 optimisation that lets each target position *j* be treated independently; instead we enumerate the space of all alignments using a Cartesian product, see *itertools.product*\n",
" - we don't consider alignments to the null word"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def expectation_step(bitext, translation_probs):\n",
" expectations = []\n",
" for E, F in bitext:\n",
" I = len(E)\n",
" J = len(F)\n",
" # store the unnormalised alignment probabilities\n",
" align = []\n",
" # track the sum of unnormalised alignment probabilities\n",
" Z = 0\n",
"        # enumerate all I^J alignments: one English position for each Spanish word\n",
"        for A in itertools.product(range(I), repeat=J):\n",
" pr = 1.0\n",
" for j, aj in enumerate(A):\n",
" pr *= translation_probs[E[aj]][F[j]]\n",
" align.append([A, E, F, pr])\n",
" Z += pr\n",
" # normalise align to produce the alignment probabilities\n",
" for atuple in align:\n",
" atuple[-1] /= Z\n",
" # save the expectations for the M step\n",
" expectations.extend(align)\n",
" return expectations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's try running this and see what the expected alignments look like."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[(0, 0), ['green', 'house'], ['casa', 'verde'], 0.25],\n",
" [(0, 1), ['green', 'house'], ['casa', 'verde'], 0.25],\n",
" [(1, 0), ['green', 'house'], ['casa', 'verde'], 0.25],\n",
" [(1, 1), ['green', 'house'], ['casa', 'verde'], 0.25],\n",
" [(0, 0), ['the', 'house'], ['la', 'casa'], 0.25],\n",
" [(0, 1), ['the', 'house'], ['la', 'casa'], 0.25],\n",
" [(1, 0), ['the', 'house'], ['la', 'casa'], 0.25],\n",
" [(1, 1), ['the', 'house'], ['la', 'casa'], 0.25]]"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"e0 = expectation_step(bitext, t0)\n",
"e0"
]
},
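{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check of our own (not part of JM's presentation), the normalised alignment probabilities for each sentence pair should sum to one:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# accumulate the alignment probability mass for each sentence pair\n",
"probs = defaultdict(float)\n",
"for A, E, F, prob in e0:\n",
"    probs[tuple(F)] += prob\n",
"assert all(abs(z - 1.0) < 1e-9 for z in probs.values())"
]
},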
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also view this graphically. This requires Graphviz (Graph Visualization Software) to be installed, with the path to its *bin* folder (e.g. C:\\Program Files (x86)\\Graphviz2.38\\bin on Windows) added to your PATH."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Prob = 0.2500'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Prob = 0.2500'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Prob = 0.2500'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Prob = 0.2500'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Prob = 0.2500'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Prob = 0.2500'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Prob = 0.2500'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Prob = 0.2500'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from IPython.display import SVG, display\n",
"from nltk.translate import AlignedSent, Alignment\n",
"\n",
"def display_expect(expectations):\n",
" stuff = []\n",
" for A, E, F, prob in expectations:\n",
" if prob > 0.01:\n",
" stuff.append('Prob = %.4f' % prob)\n",
" asent = AlignedSent(F, E, Alignment(list(enumerate(A))))\n",
" stuff.append(SVG(asent._repr_svg_()))\n",
" return display(*stuff)\n",
"\n",
"display_expect(e0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note the uniform probabilities for each option (is this a surprise, given our initialisation?). Next we need to learn the model parameters *t* from these expectations. This is simply a matter of counting occurrences of translation pairs, each weighted by its alignment probability."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def maximization_step(expectations):\n",
" counts = defaultdict(dict)\n",
" for A, E, F, prob in expectations:\n",
" for j, aj in enumerate(A):\n",
" counts[E[aj]].setdefault(F[j], 0.0)\n",
" counts[E[aj]][F[j]] += prob\n",
" \n",
" translations = defaultdict(dict)\n",
" for e, fcounts in counts.items():\n",
" tdict = translations[e]\n",
" total = float(sum(fcounts.values()))\n",
" for f, count in fcounts.items():\n",
" tdict[f] = count / total\n",
" \n",
" return translations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can test this over our expectations. Do you expect this to be uniform like *t0*?"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(dict,\n",
" {'green': {'casa': 0.5, 'verde': 0.5},\n",
" 'house': {'casa': 0.5, 'la': 0.25, 'verde': 0.25},\n",
" 'the': {'casa': 0.5, 'la': 0.5}})"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"t1 = maximization_step(e0)\n",
"t1"
]
},
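{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another quick check of our own: after the M-step, each conditional distribution $t(\\cdot \\mid e)$ should sum to one."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for e, fdict in t1.items():\n",
"    assert abs(sum(fdict.values()) - 1.0) < 1e-9\n",
"print('all conditional distributions sum to one')"
]
},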
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With working E and M steps, we can now iterate!"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(dict,\n",
" {'green': {'casa': 0.09631137780887424,\n",
" 'verde': 0.9036886221911259},\n",
" 'house': {'casa': 0.982003652902086,\n",
" 'la': 0.008998173548957001,\n",
" 'verde': 0.008998173548957008},\n",
" 'the': {'casa': 0.09631137780887424, 'la': 0.9036886221911258}})"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"t = t0\n",
"for step in range(10):\n",
" e = expectation_step(bitext, t)\n",
" t = maximization_step(e)\n",
"t"
]
},
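{
"cell_type": "markdown",
"metadata": {},
"source": [
"EM is guaranteed never to decrease the likelihood of the training data. We can check this empirically with a small helper of our own (*log_likelihood* is not part of JM's presentation), which scores the corpus under Model 1 while ignoring the null word, to match our code:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def log_likelihood(bitext, t):\n",
"    ll = 0.0\n",
"    for E, F in bitext:\n",
"        for f in F:\n",
"            # marginalise over the alignment choice for each Spanish word\n",
"            ll += math.log(sum(t[e][f] for e in E) / len(E))\n",
"    return ll\n",
"\n",
"tt = t0\n",
"lls = []\n",
"for step in range(10):\n",
"    tt = maximization_step(expectation_step(bitext, tt))\n",
"    lls.append(log_likelihood(bitext, tt))\n",
"assert all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))\n",
"lls"
]
},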
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Prob = 0.1031'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Prob = 0.8805'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Prob = 0.0147'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Prob = 0.1031'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Prob = 0.8805'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"'Prob = 0.0147'"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"display_expect(e)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Great, we've learned sensible translations as we hoped. Try viewing the expectations using *display_expect*, and vary the number of iterations. What happens to the learned parameters? "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Speeding things up"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Recall that the E-step above uses a naive enumeration over all possible alignments, which will be woefully slow on anything other than toy data. (What's its computational complexity?) Thankfully, a little algebraic manipulation of the Model 1 formulation of $P(A|E,F)$ gives rise to a much simpler formulation. Let's give this a try."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def fast_em(bitext, translation_probs):\n",
" # E-step, computing counts as we go\n",
" counts = defaultdict(dict)\n",
" for E, F in bitext:\n",
" I = len(E)\n",
" J = len(F)\n",
" # each j can be considered independently of the others\n",
" for j in range(J):\n",
" # get the translation probabilities (unnormalised)\n",
" prob_ajs = []\n",
" for aj in range(I):\n",
" prob_ajs.append(translation_probs[E[aj]][F[j]])\n",
" # compute denominator for normalisation\n",
" z = sum(prob_ajs)\n",
" # maintain running counts (this is really part of the M-step)\n",
" for aj in range(I):\n",
" counts[E[aj]].setdefault(F[j], 0.0)\n",
" counts[E[aj]][F[j]] += prob_ajs[aj] / z\n",
" \n",
" # Rest of the M-step to normalise counts\n",
" translations = defaultdict(dict)\n",
" for e, fcounts in counts.items():\n",
" tdict = translations[e]\n",
" total = float(sum(fcounts.values()))\n",
" for f, count in fcounts.items():\n",
" tdict[f] = count / total\n",
" \n",
" return translations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can test that the parameters learned in each step match what we computed before. What's the time complexity of this algorithm? "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(dict,\n",
" {'green': {'casa': 0.5, 'verde': 0.5},\n",
" 'house': {'casa': 0.5, 'la': 0.25, 'verde': 0.25},\n",
" 'the': {'casa': 0.5, 'la': 0.5}})"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"t1p = fast_em(bitext, t0)\n",
"t1p"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(dict,\n",
" {'green': {'casa': 0.4285714285714286,\n",
" 'verde': 0.5714285714285715},\n",
" 'house': {'casa': 0.6000000000000001, 'la': 0.2, 'verde': 0.2},\n",
" 'the': {'casa': 0.4285714285714286, 'la': 0.5714285714285715}})"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"t2p = fast_em(bitext, t1)\n",
"t2p"
]
},
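{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the comparison precise, we can assert (a check of ours) that the optimised update agrees with the explicit two-step EM to floating-point tolerance:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"slow = maximization_step(expectation_step(bitext, t0))\n",
"fast = fast_em(bitext, t0)\n",
"for e in slow:\n",
"    for f in slow[e]:\n",
"        assert abs(slow[e][f] - fast[e][f]) < 1e-9\n",
"print('fast_em matches the two-step EM')"
]
},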
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Alignment models in NLTK"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NLTK provides a range of translation tools, including IBM Models 1–5. These are implemented in full, including the null alignment and the more complex optimisation algorithms needed for Models 3 and up. Note that Model 4 requires a clustering of the vocabulary; see the [documentation](http://www.nltk.org/api/nltk.translate.html) for details."
]
},
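{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since this notebook focuses on Model 1, we can also try NLTK's *IBMModel1* on our toy bitext (a quick sketch; the learned probabilities won't match ours exactly because NLTK includes the null word):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from nltk.translate import IBMModel1\n",
"\n",
"# train NLTK's Model 1 for 5 EM iterations\n",
"bt1 = [AlignedSent(E, F) for E, F in bitext]\n",
"m1 = IBMModel1(bt1, 5)\n",
"m1.translation_table['green']"
]
},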
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"defaultdict(<...>,\n",
"            {'green': defaultdict(<...>,\n",
"                         {None: 1.5000253604969975e-12,\n",
"                          'casa': 0.4631720851940016,\n",
"                          'verde': 0.9999999999970001}),\n",
"             'house': defaultdict(<...>,\n",
"                         {None: 0.9999999999969998,\n",
"                          'casa': 0.07365582961199792,\n",
"                          'la': 3.0000507210981886e-12,\n",
"                          'verde': 3.000050721098189e-12}),\n",
"             'the': defaultdict(<...>,\n",
"                         {None: 1.5000253604969975e-12,\n",
"                          'casa': 0.46317208519400005,\n",
"                          'la': 0.999999999997})})"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from nltk.translate import IBMModel3\n",
"\n",
"bt = [AlignedSent(E,F) for E,F in bitext]\n",
"m = IBMModel3(bt, 5)\n",
"\n",
"m.translation_table"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NLTK also includes a small section of the Europarl corpus (about 20K sentence pairs). You might want to apply the alignment models to this larger dataset, although be aware that you will first need to do sentence alignment to discard any sentences that aren't aligned 1:1, e.g., using [nltk.translate.gale_church](http://www.nltk.org/api/nltk.translate.html#module-nltk.translate.gale_church) to infer the best alignment. You might also want to lower-case the dataset, which keeps the vocabulary small enough for reasonable runtime and robust estimation."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[nltk_data] Downloading package europarl_raw to\n",
"[nltk_data] /Users/tcohn/nltk_data...\n",
"[nltk_data] Package europarl_raw is already up-to-date!\n",
"['Resumption', 'of', 'the', 'session', 'I', 'declare', 'resumed', 'the', 'session', 'of', 'the', 'European', 'Parliament', 'adjourned', 'on', 'Friday', '17', 'December', '1999', ',', 'and', 'I', 'would', 'like', 'once', 'again', 'to', 'wish', 'you', 'a', 'happy', 'new', 'year', 'in', 'the', 'hope', 'that', 'you', 'enjoyed', 'a', 'pleasant', 'festive', 'period', '.']\n",
"['Reanudación', 'del', 'período', 'de', 'sesiones', 'Declaro', 'reanudado', 'el', 'período', 'de', 'sesiones', 'del', 'Parlamento', 'Europeo', ',', 'interrumpido', 'el', 'viernes', '17', 'de', 'diciembre', 'pasado', ',', 'y', 'reitero', 'a', 'Sus', 'Señorías', 'mi', 'deseo', 'de', 'que', 'hayan', 'tenido', 'unas', 'buenas', 'vacaciones', '.']\n"
]
}
],
"source": [
"import nltk\n",
"nltk.download('europarl_raw')\n",
"\n",
"from nltk.corpus.europarl_raw import english, spanish\n",
"\n",
"print(english.sents()[0])\n",
"print(spanish.sents()[0])"
]
},
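{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before aligning a larger corpus, you'd typically normalise it first. Here's a minimal sketch (our own helper, not from NLTK) that lower-cases a bitext and drops over-long sentence pairs, demonstrated on the toy bitext:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def preprocess(bitext, max_len=20):\n",
"    # lower-case both sides and drop pairs with very long sentences,\n",
"    # which keeps the vocabulary and alignment space manageable\n",
"    out = []\n",
"    for E, F in bitext:\n",
"        if len(E) <= max_len and len(F) <= max_len:\n",
"            out.append(([e.lower() for e in E], [f.lower() for f in F]))\n",
"    return out\n",
"\n",
"preprocess(bitext)"
]
},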
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
}