Term Frequency-Inverse Document Frequency
In this tutorial we will look at what TF and IDF are and how they can be used to process text data in machine learning.
What is TF-IDF?
Unlike traditional supervised Machine Learning (ML) problems, the challenge with text data is figuring out how to represent it. Computers do not natively understand text: we represent characters as numbers using encodings such as ASCII, Latin-1, or UTF-8. Similarly, our ML models need text converted into a numerical representation they can work with. Preparing text this way in the ML pipeline is referred to as the preprocessing step.
Let's say we have a list of sentences:
```python
text_data = ['A big Whale.', 'The quick brown fox.', 'The Giraffe that cried.']
```
We can lowercase each word, remove any punctuation, and split on spaces. This tokenizes each sentence so we can count how many times each word appears in it.
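Here is a minimal pure-Python sketch of that idea (the `tokenize` helper is our own, just for illustration):

```python
from collections import Counter

def tokenize(sentence):
    # Lowercase, strip punctuation, and split on whitespace.
    cleaned = ''.join(ch for ch in sentence.lower() if ch.isalnum() or ch.isspace())
    return cleaned.split()

text_data = ['A big Whale.', 'The quick brown fox.', 'The Giraffe that cried.']

# Count how many times each token appears in each sentence.
for i, sentence in enumerate(text_data):
    print(i, Counter(tokenize(sentence)))
# 0 Counter({'a': 1, 'big': 1, 'whale': 1})
# 1 Counter({'the': 1, 'quick': 1, 'brown': 1, 'fox': 1})
# 2 Counter({'the': 1, 'giraffe': 1, 'that': 1, 'cried': 1})
```

scikit-learn's `CountVectorizer` does this same job for us (one wrinkle: its default tokenizer drops single-character tokens, which is why "a" will not show up as a column below). The helper below wraps it and returns the counts as a DataFrame: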
```python
import pandas as pd

def create_doc_term_df(corpus, vectorizer):
    # Fit the vectorizer on the corpus and build a document-term matrix.
    doc_term_matrix = vectorizer.fit_transform(corpus)
    df = pd.DataFrame(doc_term_matrix.toarray(),
                      columns=vectorizer.get_feature_names_out())
    df.index.name = 'Sentence'
    return df
```

Let's run the function and look at the output:

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
print(create_doc_term_df(text_data, vectorizer).to_string())
'''Output:
          big  brown  cried  fox  giraffe  quick  that  the  whale
Sentence
0           1      0      0    0        0      0     0    0      1
1           0      1      0    1        0      1     0    1      0
2           0      0      1    0        1      0     1    1      0
'''
```
Higher relevance is given to words that appear more often within a sentence. The formula below calculates this term frequency (TF):
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{i,k}}$$

where $i$ refers to the index of the sentence, $j$ refers to the index of the word at the $j$'th position, and $n_{i,j}$ is the number of times word $j$ appears in sentence $i$.
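To make the formula concrete, here is a small sketch that computes these term frequencies for the toy sentences (the `term_frequency` helper and the pre-tokenized lists are ours, for illustration):

```python
from collections import Counter

# Pre-tokenized versions of our three sentences.
sentences = [['a', 'big', 'whale'],
             ['the', 'quick', 'brown', 'fox'],
             ['the', 'giraffe', 'that', 'cried']]

def term_frequency(tokens):
    # tf_ij = n_ij / sum_k n_ik: each word's count over the sentence length.
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

for i, tokens in enumerate(sentences):
    print(i, term_frequency(tokens))
# 0 {'a': 0.333..., 'big': 0.333..., 'whale': 0.333...}
```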
IDF
Now that we have calculated the frequency of each word within a sentence, we must also calculate the number of sentences in which each word appears. This lets us compare the relevance of each word against the rest of the corpus. The reason this is an important step in our pipeline is that it normalizes our data across all sentences.
Think of a spam email classifier. A phrase like “urgent! click here!” would have a high term frequency within a spam email; however, these words appear in relatively few of all the emails you have received. The purpose of inverse document frequency is to drown out words that appear in nearly every document (like “the”) and boost words that are rare across the corpus.
The formula for calculating the IDF is

$$\mathrm{idf}_j = \log\left(\frac{N}{\mathrm{df}_j}\right)$$

where $\mathrm{df}_j$ is the number of sentences that contain the word $j$. E.g. $\mathrm{df}_{\text{the}} = 2$, since “the” appears in two of our sentences. Also, $N$ is the number of sentences in our list. For our toy example, $N = 3$.
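As a sketch under the same assumptions as before (the `df` and `idf` dictionaries are our own names), we can compute the document frequencies and IDF values for the toy corpus:

```python
import math

sentences = [['a', 'big', 'whale'],
             ['the', 'quick', 'brown', 'fox'],
             ['the', 'giraffe', 'that', 'cried']]
N = len(sentences)  # N = 3 sentences in the toy corpus

# df_j: the number of sentences that contain word j.
df = {}
for tokens in sentences:
    for word in set(tokens):
        df[word] = df.get(word, 0) + 1

# idf_j = log(N / df_j): rarer words get larger weights.
idf = {word: math.log(N / count) for word, count in df.items()}

print(df['the'], round(idf['the'], 4))      # 2 0.4055
print(df['whale'], round(idf['whale'], 4))  # 1 1.0986
```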
Putting it all Together: TF-IDF
The last step is to get the $\mathrm{tfidf}$ score. We simply multiply the $\mathrm{tf}$ and the $\mathrm{idf}$:

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_j$$
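Here is a minimal sketch that combines the two hand-rolled pieces above (the `tfidf` function is ours, for illustration only; scikit-learn's numbers below will differ slightly, as noted after its output):

```python
import math
from collections import Counter

sentences = [['a', 'big', 'whale'],
             ['the', 'quick', 'brown', 'fox'],
             ['the', 'giraffe', 'that', 'cried']]
N = len(sentences)

# Document frequency of every word in the vocabulary.
df = {}
for tokens in sentences:
    for word in set(tokens):
        df[word] = df.get(word, 0) + 1

def tfidf(tokens):
    # tfidf_ij = tf_ij * idf_j
    counts = Counter(tokens)
    return {word: (count / len(tokens)) * math.log(N / df[word])
            for word, count in counts.items()}

for i, tokens in enumerate(sentences):
    print(i, {w: round(v, 4) for w, v in tfidf(tokens).items()})
# 0 {'a': 0.3662, 'big': 0.3662, 'whale': 0.3662}
# 1 {'the': 0.1014, 'quick': 0.2747, 'brown': 0.2747, 'fox': 0.2747}
# 2 {'the': 0.1014, 'giraffe': 0.2747, 'that': 0.2747, 'cried': 0.2747}
```

scikit-learn's `TfidfVectorizer` performs both steps and the multiplication in one go: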
```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
print(create_doc_term_df(text_data, vectorizer).to_string())
'''Output:
               big     brown     cried       fox   giraffe     quick      that      the     whale
Sentence
0         0.707107  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.00000  0.707107
1         0.000000  0.528635  0.000000  0.528635  0.000000  0.528635  0.000000  0.40204  0.000000
2         0.000000  0.000000  0.528635  0.000000  0.528635  0.000000  0.528635  0.40204  0.000000
'''
```
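Note that these scores do not exactly equal the plain $\mathrm{tf} \times \mathrm{idf}$ product from our sketch: by default, scikit-learn's `TfidfVectorizer` uses raw counts for the tf term, a smoothed IDF of $\ln\left(\frac{1 + N}{1 + \mathrm{df}_j}\right) + 1$, and then L2-normalizes each row. As a quick check, we can reproduce the 0.707107 entries in row 0 by hand:

```python
import math

N, df = 3, 1  # 3 sentences; 'big' and 'whale' each appear in exactly 1
idf = math.log((1 + N) / (1 + df)) + 1     # scikit-learn's smoothed idf
row = [1 * idf, 1 * idf]                   # raw counts (1 each) times idf
norm = math.sqrt(sum(v * v for v in row))  # L2 norm of the row vector
print([round(v / norm, 6) for v in row])   # [0.707107, 0.707107]
```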