Let me show you a basic approach to classifying text with machine learning, one that won a competition at Indiana University. This post is for everybody who would like to get started with natural language processing (NLP) in Python.
A few weeks ago, a good friend of mine with a background in humanities and antisemitism studies asked me if I would like to take part in an event at Indiana University. The event was divided into a Datathon and a Hackathon: a virtual workshop and competition with the goal of recognizing antisemitism online and automating that task through programming. You can read about the event in detail HERE.
The use of social media in our modern society is huge and growing by the day. With the increasing number of users, there is also a growing portion of people who use these platforms as an outlet for homophobia, sexism, racism and antisemitism. This hate speech has evolved from petty internet trolling into a real plague on the digital society we live in. The event aimed to fight this particular threat, and I was more than happy to contribute!
The Datathon was about manually annotating random tweets containing keywords like "Jews" or "Israel". For this task, a well-founded background in antisemitism studies was needed. The goal of the Hackathon, on the other hand, was to classify tweets programmatically as either antisemitic (1) or non-antisemitic (0). For the Hackathon, a batch of labeled data was provided in a file. Because I was the only one on the team with programming skills, I was happy to work out a solution for the Hackathon, even though I had no experience with NLP at all.
4 weeks later
WE ACTUALLY WON!
The rest of my team properly annotated a huge number of tweets, which won us the Datathon. We also took first place in the Hackathon thanks to my machine learning model, which achieved an F1 score of 0.90 on the final validation data.
A basic NLP approach
In the following article, I want to show you a really basic approach to classifying text in Python. This approach is based on an article by Rebecca Vickery on Medium. Thank you so much for sharing!
As I said, the data was given in a JSON file, consisting of tweets labeled as either antisemitic (1) or non-antisemitic (0). I converted the given JSON into a pandas data frame to keep things in proper data science fashion 🙂
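The loading step itself isn't shown in this post, so here is a minimal sketch. The field names "Label" and "Text" are assumptions based on the cleaned output further below, and I read from an in-memory string instead of the actual competition file:

```python
import io
import pandas as pd

# Hypothetical sketch: the competition file and its exact field names are
# not shown in this post, so "Label" and "Text" are assumptions. A real run
# would pass the path of the given JSON file to pd.read_json instead.
raw_json = io.StringIO(
    '[{"Label": "0", "Text": "some example tweet"},'
    ' {"Label": "1", "Text": "another example tweet"}]'
)
# dtype keeps the labels as strings instead of letting pandas cast them to ints
clean_data = pd.read_json(raw_json, dtype={"Label": str})
print(clean_data.head())
```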
| # | Label | Text |
|---|-------|------|
| 0 | 0 | RT @purplechrain: so members of the GOP have, as predicted, officially started combining "liberal & leftist Jews aren't real Jews & thus it… |
| 1 | 0 | RT @Laurent_Weppe: @the_moviebob SAME White petit bourgeois start openly saying stuff like "beating to death Niggers and Kikes make my dick… |
| 2 | 1 | 🗼 #unpopularopinion #MAGA The crisis currently plaguing kikes conceivably will be feral cats 🙂 Mad . 🙅 🐔 ↔ 🚶 💜 🌜 🆙 👪 🚠 🚙 🐚 🈶 |
| 3 | 1 | @Michael__Baskin @mattduss Again, you are changing the subject. Are you a zionazi? hmm? Why does international law apply to others, but not Israel? This was the original point. This massive zionist influence and control over our politics needs to go, and dual loyalists should be given the death penalty. |
| 4 | 1 | True, but that won't stop the ZioNazi run #NATO & its corrupt, arrogant #Ottoman Nazi hireling from attacking Syria while useless, spineless #Putin just sits and watches..#CapeTown #SouthAFrica |
Cleaning the Data
In order to make any machine learning algorithm work well, I had to pre-process the given text data. As you can see, tweets contain characters like hashtags or emojis, which won't be meaningful to a machine learning algorithm.
I am using the re library in Python to remove these characters.
import re

def f_clean_data(df, field):
    # lowercase everything so that e.g. "Jews" and "jews" count as the same word
    df[field] = df[field].str.lower()
    # strip @mentions, non-alphanumeric characters (hashtags, emojis),
    # URLs and leading "rt" retweet markers
    df[field] = df[field].apply(lambda elem: re.sub(
        r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", elem))
    return df
This function also converts every word to lower case, which is useful because the following algorithm counts word frequencies. Without this step, a lowercase and an uppercase version of the same word would be counted separately, even though the meaning is the same.
Let’s have a look at the cleaned data.
| # | Label | Text |
|---|-------|------|
| 0 | 0 | so members of the gop have as predicted officially started combining liberal amp leftist jews arent real jews amp thus it |
| 1 | 0 | weppe moviebob same white petit bourgeois start openly saying stuff like beating to death niggers and kikes make my dick |
| 2 | 1 | unpopularopinion maga the crisis currently plaguing kikes conceivably will be feral cats mad |
| 3 | 1 | baskin again you are changing the subject are you a zionazi hmmwhy does international law apply to others but not israel this was the original pointthis massive zionist influence and control over our politics needs to go and dual loyalists should be given the death penalty |
| 4 | 1 | true but that wont stop the zionazi run nato amp its corrupt arrogant ottoman nazi hireling from attacking syria while useless spineless putin just sits and watchescapetown southafrica |
This is looking a lot cleaner. After some research, I realized that another important cleaning step was needed: removing stop words and lemmatization. Stop words are the most common, basic words in a language, like "to", "the" or "and". These words had to be removed because they give no hint about the intention of a tweet. Lemmatization is the linguistic process of reducing a word to its base form (lemma). For example, "runs", "ran" and "running" are forms of the same lemma: "run". As with lowercasing, this is really important for identifying equal words even when they appear in different forms.
I implemented this using spaCy, a really powerful library for NLP. I basically split each tweet into separate words (tokens). If a token is a stop word, I don't append it to my cleaned text; if it is not, I apply lemmatization and append the lemma.
import spacy

nlp = spacy.load('en_core_web_sm')

def remove_stop_apply_lemma_for_string(text):
    cleaned_text = ""
    tokens = nlp(text)
    for token in tokens:
        # skip stop words; keep the lemma of every other token
        if not token.is_stop:
            cleaned_text = cleaned_text + " " + token.lemma_
    return cleaned_text
At this point in pre-processing the tweets, it is getting harder to understand the text as a human being. On the other hand, this will improve the performance of our algorithm in the end.
| # | Label | Text |
|---|-------|------|
| 0 | 0 | member gop predict officially start combine liberal amp leftist jews not real jews amp |
| 1 | 0 | weppe moviebob white petit bourgeois start openly say stuff like beat death nigger kike dick |
| 2 | 1 | unpopularopinion maga crisis currently plaguing kike conceivably feral cat mad |
| 3 | 1 | baskin change subject zionazi hmmwhy international law apply israel original pointthis massive zionist influence control politic need dual loyalist give death penalty |
| 4 | 1 | true will not stop zionazi run nato amp corrupt arrogant ottoman nazi hirele attack syria useless spineless putin sit watchescapetown southafrica |
Balancing the Classes
After cleaning the data, I realized that the non-antisemitic tweets considerably outnumbered the antisemitic ones. If a machine learning model is mostly fed tweets labeled 0, it becomes more likely to predict new tweets as 0 as well.
Balancing the classes is therefore an important requirement before proceeding at this stage.
One approach is to either upsample the minority class or downsample the majority class. At this point, it was really a matter of trying out which method scores better. I went with upsampling the minority class, where samples from the minority are reused multiple times until the class reaches the same size as the majority. For the implementation, I used functions from the sklearn library.
from sklearn.utils import resample
import pandas as pd

# There are roughly 1200 more non-antisemitic tweets than antisemitic ones,
# which is why we upsample the minority class (label "1").
train_majority = clean_data[clean_data.Label == "0"]
train_minority = clean_data[clean_data.Label == "1"]
train_minority_upsampled = resample(train_minority, replace=True,
                                    n_samples=len(train_majority),
                                    random_state=123)
train_upsampled = pd.concat([train_minority_upsampled, train_majority])
After preparing the whole data set, let's have a look at the chosen algorithm. A common approach to detecting patterns in text is the Bag of Words (BoW) model. A BoW model counts the frequency of individual words and derives weights from these counts. I built this model with the help of functions from the sklearn library.
The first of these functions, CountVectorizer, simply does the job of splitting a text into tokens and counting their frequency. It accepts several parameters for tuning, but I left them at their defaults.
The second sklearn function, TfidfTransformer, applies frequency weighting to these counts. Without this weighting step, less frequent words would vanish in the training process later on. To sum up: the TfidfTransformer assigns less weight to very frequent words and more weight to rarer but perhaps more meaningful ones.
Training the Model
To train my machine learning model, I used an sklearn pipeline with an SGDClassifier. This pipeline object applies every step of the BoW model to a given tweet.
Before training the model, all the given data had to be split into a training and test set.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    train_upsampled['Text'], train_upsampled['Label'], test_size=0.2)

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
All that's left is to fit the pipeline to the training data to obtain a model. Then we can predict on the test data and compute the F1 score.
from sklearn.metrics import f1_score

model = pipeline.fit(x_train, y_train)
y_predict = model.predict(x_test)
# the labels are strings here, so the positive class has to be named explicitly
print(f1_score(y_test, y_predict, pos_label="1"))
This model achieved an F1 score of over 0.95 on the given test data and 0.90 on the final validation data. I was really surprised at how well this rather basic approach worked. I think it is a good foundation for getting started with natural language processing, and it still leaves plenty of room for improvement.
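For reference, the F1 score is the harmonic mean of precision and recall. A quick sketch with made-up labels (not the competition data) shows the relationship:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels for illustration, not the competition data
y_true = ["1", "1", "0", "1", "0", "0"]
y_pred = ["1", "0", "0", "1", "1", "0"]

p = precision_score(y_true, y_pred, pos_label="1")  # 2 of 3 predicted "1" are correct
r = recall_score(y_true, y_pred, pos_label="1")     # 2 of 3 actual "1" are found
f1 = f1_score(y_true, y_pred, pos_label="1")        # harmonic mean: 2*p*r/(p+r)
print(round(p, 3), round(r, 3), round(f1, 3))
```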
Check out my GitHub for the full code!