{ "cells": [ { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "# Preprocessing with NLTK" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "First, if you haven't used iPython notebooks before, in order to run the code on this workbook, you can use the run commands in the Cell menu, or do shift-enter when an individual code cell is selected. Generally, you will have to run the cells in order for them to work properly. The output for a given cell (in any) will appear below the code after it has completed running. To make sure things are working, run the cell bellow:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hello world\n" ] } ], "source": [ "print(\"hello world\")" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Okay, now let's do some simple preprocessing on this snippet from the html from the class website:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [ "text = '''\n", " \n", " \n", " \n", " \n", " \n", "\n", "
\n", "
\n", "

COMP90042 Web Search and Text Analysis

\n", "
\n", "\n", "The aims for this subject is for students to develop an understanding of the main algorithms used in natural language processing and text retrieval, for use in a diverse range of\n", "applications including text classification, information retrieval, machine translation, and question answering. Topics to be\n", "covered include vector space models, part-of-speech tagging, n-gram language\n", "modelling, syntactic parsing and neural sequence models. The programming language used is Python, see the\n", "detailed configuration instructions for more information on its use in the workshops, assignments and\n", "installation at home.\n", "\n", "'''" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "First, let's remove the html markup using regular expressions" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "deletable": true, "editable": true, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "COMP90042 Web Search and Text Analysis\n", "\n", "\n", "The aims for this subject is for students to develop an understanding of the main algorithms used in natural language processing and text retrieval, for use in a diverse range of\n", "applications including text classification, information retrieval, machine translation, and question answering. Topics to be\n", "covered include vector space models, part-of-speech tagging, n-gram language\n", "modelling, syntactic parsing and neural sequence models. The programming language used is Python, see the\n", "detailed configuration instructions for more information on its use in the workshops, assignments and\n", "installation at home.\n" ] } ], "source": [ "import re\n", "\n", "text = re.sub(\"<[^>]+>\", \"\", text).strip()\n", "print(text)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "We can see more clearly now that there are three newline characters between the title and the main text, and also some newlines within the text. Our sentence tokenizer won't be able to handle the title properly, so let's remove it, and change the other newlines to spaces." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The aims for this subject is for students to develop an understanding of the main algorithms used in natural language processing and text retrieval, for use in a diverse range of applications including text classification, information retrieval, machine translation, and question answering. Topics to be covered include vector space models, part-of-speech tagging, n-gram language modelling, syntactic parsing and neural sequence models. The programming language used is Python, see the detailed configuration instructions for more information on its use in the workshops, assignments and installation at home.\n" ] } ], "source": [ "text = text.split(\"\\n\\n\\n\")[1].replace(\"\\n\", \" \")\n", "print(text)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Next let's segment the text into sentences. Though a simple method like splitting on periods would work well enough in this case, let's try a sentence segmenter from NLTK, which would be able to handle harder cases if they appeared in our text." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[nltk_data] Downloading package punkt to /Users/tcohn/nltk_data...\n", "[nltk_data] Package punkt is already up-to-date!\n", "['The aims for this subject is for students to develop an understanding of the main algorithms used in natural language processing and text retrieval, for use in a diverse range of applications including text classification, information retrieval, machine translation, and question answering.', 'Topics to be covered include vector space models, part-of-speech tagging, n-gram language modelling, syntactic parsing and neural sequence models.', 'The programming language used is Python, see the detailed configuration instructions for more information on its use in the workshops, assignments and installation at home.']\n" ] } ], "source": [ "import nltk\n", "nltk.download('punkt')\n", "sent_segmenter = nltk.data.load('tokenizers/punkt/english.pickle')\n", "\n", "sentences = sent_segmenter.tokenize(text)\n", "print(sentences)" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "NLTK also has a word tokenizer. For the first sentence, let's compare a naive split using spaces and the NTLK regex tokenizer" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['Topics', 'to', 'be', 'covered', 'include', 'vector', 'space', 'models', ',', 'part', '-', 'of', '-', 'speech', 'tagging', ',', 'n', '-', 'gram', 'language', 'modelling', ',', 'syntactic', 'parsing', 'and', 'neural', 'sequence', 'models', '.']\n", "['Topics', 'to', 'be', 'covered', 'include', 'vector', 'space', 'models,', 'part-of-speech', 'tagging,', 'n-gram', 'language', 'modelling,', 'syntactic', 'parsing', 'and', 'neural', 'sequence', 'models.']\n" ] } ], "source": [ "word_tokenizer = nltk.tokenize.regexp.WordPunctTokenizer()\n", "\n", "tokenized_sentence = word_tokenizer.tokenize(sentences[1])\n", "print(sentences[1].split(\" \"))" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "The NLTK tokenizer correctly splits off commas and periods from the ends of words. It also splits up the hyphenated word \"part-of-speech\", which might be the right behavior for some applications, but not for others.\n", "\n", "Let's try out lemmatization. NLTK has a lemmatizer, though using it requires that we know the part of speech of the word. In this case, we'll just try verb lemmatization, and if doesn't change the word, we'll try noun." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[nltk_data] Downloading package wordnet to /Users/tcohn/nltk_data...\n", "[nltk_data] Package wordnet is already up-to-date!\n", "['Topics', 'to', 'be', 'cover', 'include', 'vector', 'space', 'model', ',', 'part', '-', 'of', '-', 'speech', 'tag', ',', 'n', '-', 'gram', 'language', 'model', ',', 'syntactic', 'parse', 'and', 'neural', 'sequence', 'model', '.']\n" ] } ], "source": [ "nltk.download('wordnet')\n", "lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()\n", "\n", "def lemmatize(word):\n", " lemma = lemmatizer.lemmatize(word,'v')\n", " if lemma == word:\n", " lemma = lemmatizer.lemmatize(word,'n')\n", " return lemma\n", "\n", "print([lemmatize(token) for token in tokenized_sentence])" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Compare this to the result of stemming using the Porter Stemmer:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "deletable": true, "editable": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['topic', 'to', 'be', 'cover', 'includ', 'vector', 'space', 'model', ',', 'part', '-', 'of', '-', 'speech', 'tag', ',', 'n', '-', 'gram', 'languag', 'model', ',', 'syntact', 'pars', 'and', 'neural', 'sequenc', 'model', '.']\n" ] } ], "source": [ "stemmer = nltk.stem.porter.PorterStemmer()\n", "print([stemmer.stem(token) for token in tokenized_sentence])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true, "deletable": true, "editable": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }