Input: Everything to permit us.
Output: [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP')]
In this tutorial, you will learn:
- POS Tagging
- What is Chunking in NLP?
- COUNTING POS TAGS
- Frequency Distribution
- Collocations: Bigrams and Trigrams
- Tagging Sentences
- POS tagging with Hidden Markov Model
- How Does the Hidden Markov Model (HMM) Work?
Steps Involved in the POS tagging example:
1. Tokenize the text with word_tokenize.
2. Apply pos_tag to the tokens from the step above, i.e. nltk.pos_tag(tokenized_text).
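A minimal sketch of these two steps (the nltk.download calls fetch the tokenizer and tagger models on first use; the sample sentence is the one from the example above):

import nltk
nltk.download('punkt')                        # tokenizer models used by word_tokenize
nltk.download('averaged_perceptron_tagger')   # model used by the default pos_tag

from nltk import word_tokenize, pos_tag

text = "Everything to permit us."
tokens = word_tokenize(text)   # step 1: tokenize the text
print(pos_tag(tokens))         # step 2: tag each token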
Examples of NLTK POS tags are shown below:

[Table: the NLTK POS tag list, with tags such as CC, CD, EX, JJ, MD, NNP, PDT, PRP$, TO and an example for each.]

The NLTK POS tagger is used to assign grammatical information to each word of a sentence. With installing, importing, and downloading complete, all the packages needed for POS tagging with NLTK are in place.

What is Chunking in NLP?

Chunking in NLP is a process of taking small pieces of information and grouping them into large units. In shallow parsing, there is at most one level between the root and the leaves, while deep parsing comprises more than one level. Shallow parsing is also called light parsing or chunking.
Rules for Chunking:
There are no pre-defined rules, but you can combine tag patterns according to your needs and requirements.
For example, suppose you need to tag nouns, verbs (past tense), adjectives, and coordinating conjunctions from a sentence. You can use a rule such as the one below:

chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}
from nltk import pos_tag
from nltk import RegexpParser

text = "learn php from guru99 and make study easy".split()
print("After Split:", text)
tokens_tag = pos_tag(text)
print("After Token:", tokens_tag)
patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:", chunker)
output = chunker.parse(tokens_tag)
print("After Chunking", output)
Output:
After Split: ['learn', 'php', 'from', 'guru99', 'and', 'make', 'study', 'easy']
After Token: [('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN'), ('and', 'CC'), ('make', 'VB'), ('study', 'NN'), ('easy', 'JJ')]
After Regex: chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
       <ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>
After Chunking (S
  (mychunk learn/JJ)
  (mychunk php/NN)
  from/IN
  (mychunk guru99/NN and/CC)
  make/VB
  (mychunk study/NN easy/JJ))
The conclusion from the above part-of-speech tagging Python example: "make" is a verb (VB), which is not covered by the rule, so it is not tagged as mychunk.
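If you did want plain verbs such as "make" inside the chunk, one possible variant (our own illustration, not part of the original example) is to extend the rule with <VB.?>:

from nltk import pos_tag, RegexpParser

# variant rule: <VB.?>* appended so plain verbs like "make" are chunked too
tokens_tag = pos_tag("learn php from guru99 and make study easy".split())
chunker = RegexpParser("""mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?<VB.?>*}""")
print(chunker.parse(tokens_tag))   # "guru99 and make" now land in one chunk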
Use Case of Chunking
Chunking is used for entity detection. An entity is the part of the sentence from which the machine gets the value for an intent.
Example: "Temperature of New York." Here, temperature is the intent and New York is the entity.
In other words, chunking is used to select subsets of tokens. Follow the code below to understand how chunking selects tokens. In this example, you will see a graph corresponding to a chunk of a noun phrase. We will write the code and draw the graph for better understanding.
Code to Demonstrate Use Case
import nltk

text = "learn php from guru99"
tokens = nltk.word_tokenize(text)
print(tokens)
tag = nltk.pos_tag(tokens)
print(tag)
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(tag)
print(result)
result.draw()    # draws the chunk tree graphically
Output:
['learn', 'php', 'from', 'guru99'] – These are the tokens
[('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN')] – These are the POS tags
(S (NP learn/JJ php/NN) from/IN (NP guru99/NN)) – Noun Phrase chunking
Graph

[Graph: Noun Phrase chunking]

From the graph, we can conclude that "learn" and "guru99" are two different tokens but both are categorized as noun phrases, whereas the token "from" does not belong to a noun phrase. Chunking is used to categorize different tokens into the same chunk; the result depends on the grammar that has been selected. Chunking in NLTK is further used to tag patterns and to explore text corpora.
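Because chunking selects subsets of tokens, you can also pull just the NP chunks back out of the resulting tree; a short sketch building on the same example:

import nltk

tag = nltk.pos_tag(nltk.word_tokenize("learn php from guru99"))
result = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(tag)

# keep only the tokens that fall inside an NP chunk
for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):
    print(subtree.leaves())   # [('learn', 'JJ'), ('php', 'NN')], then [('guru99', 'NN')]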
COUNTING POS TAGS
We have discussed various POS tags in the previous section. In this section, you will study how to count these tags. Counting tags is crucial for text classification as well as for preparing features for natural-language-based operations. We will discuss the approach guru99 followed while preparing the code, along with a discussion of the output.

How to count tags: first we will write working code, and then we will walk through the steps that explain it.
from collections import Counter
import nltk

text = "Guru99 is one of the best sites to learn WEB, SAP, Ethical Hacking and much more online."
lower_case = text.lower()
tokens = nltk.word_tokenize(lower_case)
tags = nltk.pos_tag(tokens)
counts = Counter(tag for word, tag in tags)
print(counts)
Output:

Counter({'NN': 5, ',': 2, 'TO': 1, 'CC': 1, 'VBZ': 1, 'NNS': 1, 'CD': 1, '.': 1, 'DT': 1, 'JJS': 1, 'JJ': 1, 'JJR': 1, 'IN': 1, 'VB': 1, 'RB': 1})

Elaboration of the code:
To count the tags, you can use the Counter class from the collections module. Counter is a dictionary subclass that works on the principle of key-value pairs. It is an unordered collection where the elements are stored as dictionary keys while their counts are the values.
1. Import nltk, which contains modules to tokenize the text.
2. Write the text whose POS tags you want to count.
3. Some words are in upper case and some in lower case, so it is appropriate to transform all words to lower case before tokenization.
4. Pass the words through word_tokenize from nltk.
5. Calculate the POS tag of each token:
Output = [('guru99', 'NN'), ('is', 'VBZ'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('best', 'JJS'), ('site', 'NN'), ('to', 'TO'), ('learn', 'VB'), ('web', 'NN'), (',', ','), ('sap', 'NN'), (',', ','), ('ethical', 'JJ'), ('hacking', 'NN'), ('and', 'CC'), ('much', 'RB'), ('more', 'JJR'), ('online', 'JJ')]
6. Now comes the role of the Counter we imported on code line 1: the tags are the keys and their totals are the values. Counter counts the total occurrences of each tag present in the text.
Frequency Distribution
Frequency distribution refers to the number of times an outcome of an experiment occurs. It is used to find the frequency of each word occurring in a document. It uses the FreqDist class, defined in the nltk.probability module. A frequency distribution is usually created by counting the samples of repeatedly running the experiment, incrementing the count by one each time. E.g. (using the NLTK 3 API, where a FreqDist behaves like a Counter):

freq_dist = FreqDist()
for token in document:
    freq_dist[token] += 1

For any word, we can check how many times it occurred in a particular document. E.g.
Count method: freq_dist['and'] returns the number of times 'and' occurred. Frequency method: freq_dist.freq('and') returns the frequency of the given sample, i.e. its count divided by the total number of samples.
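A minimal sketch of both methods (the sample sentence is our own; the punkt models are assumed to be downloaded):

from nltk import FreqDist, word_tokenize

fd = FreqDist(word_tokenize("to be or not to be"))

print(fd['to'])        # count method: 'to' occurred 2 times
print(fd.freq('to'))   # frequency method: 2 / 6 tokens = 0.333...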
We will write a small program and explain its working in detail. We will write some text and calculate the frequency distribution of each word in it.
import nltk

a = "Guru99 is the site where you can find the best tutorials for Software Testing Tutorial, SAP Course for Beginners. Java Tutorial for Beginners and much more. Please visit the site guru99.com and much more."
words = nltk.tokenize.word_tokenize(a)
fd = nltk.FreqDist(words)
fd.plot()
Explanation of code:
1. Import the nltk module.
2. Write the text whose word distribution you need to find.
3. Tokenize each word in the text; the tokens serve as input to nltk's FreqDist class.
4. Apply the words to nltk.FreqDist in the form of a list.
5. Plot the words in a graph using plot().
Please visualize the graph for a better understanding of the text written:
[Graph: frequency distribution of each word in the text]
NOTE: You need to have matplotlib installed to see the above graph.

Observe the graph above. It corresponds to counting the occurrence of each word in the text. It helps in the study of text and further in implementing text-based sentiment analysis. In a nutshell, nltk has a module for counting the occurrence of each word in a text, which helps in preparing the statistics of natural language features. It plays a significant role in finding keywords in text. You can also extract the text from a PDF using libraries such as PyPDF2 and feed the text to nltk.FreqDist.

The key term is "tokenize." After tokenizing, FreqDist checks each word in a given paragraph or text document to determine the number of times it occurred. You do not need the NLTK toolkit for this; you can also do it with your own Python programming skills. The NLTK toolkit only provides ready-to-use code for the various operations.

Counting each word may not be very useful by itself. Instead, one should focus on collocations and bigrams, which deal with a lot of words in pairs. These pairs identify useful keywords for better natural language features, which can be fed to the machine. Please look below for their details.
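As noted above, you do not need NLTK just to count words; a plain-Python sketch (the sample text is our own):

# counting word occurrences with plain Python, no NLTK required
text = "Guru99 is the site where you can find the best tutorials. Please visit the site."
counts = {}
for word in text.lower().split():
    counts[word] = counts.get(word, 0) + 1
print(counts)   # note: plain split() keeps punctuation attached, which is why tokenization matters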
Collocations: Bigrams and Trigrams
What are Collocations?
Collocations are pairs of words that occur together many times in a document. They are measured by the ratio of the number of times the pair occurs together to the overall word count of the document.

Consider the electromagnetic spectrum, with terms like ultraviolet rays and infrared rays. The words ultraviolet and rays are not used individually and hence can be treated as a collocation. Another example is CT scan: we don't say CT and scan separately, so they are also treated as a collocation.

Finding collocations requires calculating the frequencies of words and their appearance in the context of other words. These specific collections of words require filtering to retain useful content terms. Each n-gram of words may then be scored according to some association measure, to determine the relative likelihood of each n-gram being a collocation (a short sketch of this follows the list below). Collocations can be categorized into two types:
- Bigrams: combinations of two words
- Trigrams: combinations of three words
Bigrams and trigrams provide more meaningful and useful features for the feature-extraction stage. They are especially useful in text-based sentiment analysis.
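NLTK also ships ready-made collocation finders that implement the frequency-plus-association-measure idea sketched above; a minimal example scoring bigrams with PMI (the sample text is our own):

import nltk
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

text = ("ultraviolet rays and infrared rays are part of the "
        "electromagnetic spectrum and ultraviolet rays can be harmful")
tokens = nltk.word_tokenize(text)

# collect candidate bigrams, filter out rare pairs, then rank by an association measure
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)                        # keep pairs seen at least twice
print(finder.nbest(BigramAssocMeasures.pmi, 3))    # [('ultraviolet', 'rays')]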
Bigrams Example Code
import nltk

text = "Guru99 is totally new kind of learning experience."
Tokens = nltk.word_tokenize(text)
output = list(nltk.bigrams(Tokens))
print(output)
Output:
[('Guru99', 'is'), ('is', 'totally'), ('totally', 'new'), ('new', 'kind'), ('kind', 'of'), ('of', 'learning'), ('learning', 'experience'), ('experience', '.')]
Trigrams Example Code
Sometimes it becomes important to look at groups of three words in a sentence for statistical analysis and frequency counts. This again plays a crucial role in forming NLP (natural language processing) features as well as text-based sentiment prediction. The same code is run for calculating the trigrams.
import nltk

text = "Guru99 is totally new kind of learning experience."
Tokens = nltk.word_tokenize(text)
output = list(nltk.trigrams(Tokens))
print(output)
Output:
[('Guru99', 'is', 'totally'), ('is', 'totally', 'new'), ('totally', 'new', 'kind'), ('new', 'kind', 'of'), ('kind', 'of', 'learning'), ('of', 'learning', 'experience'), ('learning', 'experience', '.')]
Tagging Sentences
Tagging sentences, in a broader sense, refers to adding labels such as verb and noun according to the context of the sentence. Identifying POS tags is a complicated process; generic tagging of POS is not possible manually, as some words have different (ambiguous) meanings depending on the structure of the sentence. Converting the text into a list is an important step before tagging, as each word in the list is looped over and counted for a particular tag. Please see the code below to understand it better:
import nltk

text = "Hello Guru99, You have build a very good site and I love visiting your site."
sentence = nltk.sent_tokenize(text)
for sent in sentence:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))
Output:

[('Hello', 'NNP'), ('Guru99', 'NNP'), (',', ','), ('You', 'PRP'), ('have', 'VBP'), ('build', 'VBN'), ('a', 'DT'), ('very', 'RB'), ('good', 'JJ'), ('site', 'NN'), ('and', 'CC'), ('I', 'PRP'), ('love', 'VBP'), ('visiting', 'VBG'), ('your', 'PRP$'), ('site', 'NN'), ('.', '.')]
Code Explanation:
1. Code to import nltk (the Natural Language Toolkit, which contains submodules such as sentence tokenize and word tokenize).
2. Text whose tags are to be printed.
3. Sentence tokenization.
4. A for loop where words are tokenized from each sentence and the tag of each word is printed as output.
In corpus linguistics, there are two types of POS taggers:

- Rule-based
- Stochastic
1. Rule-based POS tagger: For words having ambiguous meanings, a rule-based approach based on contextual information is applied. This is done by checking or analyzing the meaning of the preceding or following word. Information is analyzed from the surroundings of the word or from within the word itself. Words are therefore tagged by the grammatical rules of a particular language, such as capitalization and punctuation. An example is Brill's tagger.

2. Stochastic POS tagger: Different approaches, such as frequency or probability, are applied under this method. If a word is mostly tagged with a particular tag in the training set, then in a test sentence it is given that particular tag. The tag of a word can also depend on the previous tag, not only on the word itself; this method is not always accurate. Another way is to calculate the probability of occurrence of a specific tag in a sentence. The final tag is chosen by finding the tag with the highest probability for the word.
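The "most frequent tag in the training set" idea can be sketched with NLTK's UnigramTagger trained on the treebank sample corpus (a minimal sketch, not part of the original tutorial):

import nltk
nltk.download('treebank')   # small tagged corpus used for training

from nltk.corpus import treebank
from nltk.tag import UnigramTagger

train_sents = treebank.tagged_sents()[:3000]
tagger = UnigramTagger(train_sents)   # each word gets its most frequent training tag

print(tagger.tag("I love visiting your site".split()))   # words unseen in training get None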
POS tagging with Hidden Markov Model
Tagging problems can also be modeled using an HMM. It treats the input tokens as an observable sequence, while the tags are considered hidden states; the goal is to determine the hidden state sequence. For example, x = x1, x2, ..., xn is the sequence of tokens, while y = y1, y2, ..., yn is the hidden tag sequence.
How Does the Hidden Markov Model (HMM) Work?
The HMM uses the joint distribution P(x, y), where x is the input/token sequence and y is the tag sequence. The tag sequence for x is the one that maximizes the joint probability: argmax over y1, ..., yn of p(x1, x2, ..., xn, y1, y2, ..., yn).
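A minimal sketch of this idea using NLTK's supervised HMM trainer on the treebank sample (our own illustration; unseen words will degrade the output):

import nltk
nltk.download('treebank')   # tagged corpus used as supervised training data

from nltk.corpus import treebank
from nltk.tag import hmm

train = treebank.tagged_sents()[:3000]
tagger = hmm.HiddenMarkovModelTrainer().train_supervised(train)

# the tagger picks the hidden tag sequence y1..yn that maximizes the joint probability
print(tagger.tag("Everything to permit us .".split()))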
Summary
- POS tagging in NLTK is the process of marking up words in text for a particular part of speech, based on definition and context. Some NLTK POS tag examples are: CC, CD, EX, JJ, MD, NNP, PDT, PRP$, TO, etc.
- The POS tagger is used to assign grammatical information to each word of the sentence. Installing, importing, and downloading all the packages for part-of-speech tagging with NLTK is complete.
- Chunking in NLP is a process of taking small pieces of information and grouping them into large units.
- There are no pre-defined chunking rules; you combine tag patterns according to your needs and requirements.
- Chunking is used for entity detection. An entity is the part of the sentence from which the machine gets the value for an intent.
- Chunking is used to categorize different tokens into the same chunk.