Written by
Ardit Xhaferi


Exploring Machine Learning

February 18, 2022 · 6 min read

Starting Point

I just decided one morning, fuck it, I'm learning Machine Learning... so yeah, here we are. I started with this tutorial, which was really helpful for the technical side of things; in theory, I already knew roughly how the logic behind it works.

I had little experience with Python and none with Machine Learning, but after this tutorial I gained the confidence to build something myself, and the idea came immediately.

The idea - “Write like Kadare”

The idea was to feed the ML algorithm books from the famous Albanian writer 'Ismail Kadare'.

The goal was to predict the next word in the context of the previous word just like Kadare would have written it.

Kinda like an autosuggestion tool that would suggest a word, but it's Kadare suggesting it.

Creating the dataset

First, we need to clean the data and save it in an appropriate format. We want the data split into input data and output data so the algorithm can pick up on patterns and use them to predict outputs it was never explicitly taught.

The input, in this case, is each word Kadare wrote, and the output is the word that follows it. I got one of Kadare's books online as a text file, removed any special characters with a simple regex, and then put every word into an array.

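import re

def get_dictionary_word_list(book):
    # Assumes the text file is UTF-8 so Albanian letters (ë, ç) are read correctly
    with open(book, encoding='utf-8') as f:
        text = f.read()
    # Keep only letters and whitespace; \s (rather than a plain space) preserves
    # line breaks, so words on adjacent lines are not glued together
    textF = re.sub(r'[^a-zA-Z\sËëÇç]', '', text).lower()
    return textF.split()

data = get_dictionary_word_list("books/prill.txt")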
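
To make the cleaning step concrete, here is a small sketch run on a made-up sentence instead of the book. The helper name `clean_words` and the sample text are my own assumptions; the character class keeps whitespace so that a line break still separates two words.

```python
import re

def clean_words(text):
    """Strip everything except letters and whitespace, lowercase, and split into words."""
    cleaned = re.sub(r'[^a-zA-Z\sËëÇç]', '', text).lower()
    return cleaned.split()

# Made-up sample with punctuation and a line break (not text from the book)
sample = "Do të shkoj,\nnë shkollë!"
words = clean_words(sample)
print(words)
```

Punctuation disappears, the capital letter is lowercased, and the line break still acts as a word boundary.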

Then I wanted to have a nested array that contains every word and its successor.

Please don't judge my Python: I have no formal training or professional experience with it, so the code is mostly StackOverflow and tinkering.

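def chunks(lst, n):
    # Slide a window of n words over the list; stop early enough that the
    # last window is still full (no trailing one-word chunk)
    for i in range(len(lst) - n + 1):
        yield lst[i:i + n]

data = list(chunks(data, 2))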
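
For the two-word case, the same pairing can also be sketched with `zip`, which makes the "word and its successor" idea explicit. The function name `word_pairs` and the sample words are assumptions for illustration only.

```python
def word_pairs(words):
    """Return [current_word, next_word] for every adjacent pair of words."""
    return [[current, following] for current, following in zip(words, words[1:])]

# Sample words taken from the example sentence used later in the post
words = ["do", "te", "shkoj", "ne", "shkoll"]
pairs = word_pairs(words)
print(pairs)
```

Because `zip` stops at the shorter sequence, the last word never appears as an input without a successor.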

And finally, we need to write the data to a CSV file, because that's how we will feed it to the ML algorithm, with a header row on top.

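import csv
header = ['First', 'Second']

with open('prilli-i-thyer.csv', 'w', encoding='UTF8', newline='') as f:
    writer = csv.writer(f)
    # Write the header row, then one [word, next word] pair per line
    writer.writerow(header)
    writer.writerows(data)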



The result should look something like this:
[Image: the dataset]
The scraped data.
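
As a rough sketch of the file's shape, here is the same CSV layout built in memory from a few made-up word pairs rather than the real book. The pairs and the use of `io.StringIO` are assumptions for illustration.

```python
import csv
import io

# Made-up pairs standing in for the real [word, next word] data
pairs = [["do", "te"], ["te", "shkoj"], ["shkoj", "ne"]]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["First", "Second"])  # header row
writer.writerows(pairs)               # one pair per line

csv_text = buffer.getvalue()
print(csv_text)
```

Each row holds one input word in the First column and the word that followed it in the Second column.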

The model

I decided to use the Decision Tree Classifier from scikit-learn because that's the algorithm the tutorial used. It's probably not the best model for my specific problem, maybe even the worst, but this is my first ML project, so I care less about mistakes and more about learning along the way.
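
For comparison, one very simple baseline for the same one-word-context task is a frequency table: for each word, remember which word followed it most often. This is not the model used in this post, just a minimal sketch with made-up tokens; `build_successor_table` and `most_likely_next` are hypothetical names.

```python
from collections import Counter, defaultdict

def build_successor_table(words):
    """Count, for every word, how often each other word follows it."""
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table

def most_likely_next(table, word):
    """Return the most frequent successor of word, or None if the word was never seen."""
    if word not in table:
        return None
    return table[word].most_common(1)[0][0]

# Made-up tokens, not text from the book
words = "a b a b a c".split()
table = build_successor_table(words)
print(most_likely_next(table, "a"))
```

A baseline like this gives a reference point for judging whether a trained model is actually learning anything beyond raw counts.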

Preparing the data

So first we need to read the CSV file we created earlier with pandas:

import pandas as pd

# creating initial dataframe
book_data = pd.read_csv("prilli-i-thyer.csv")

Pandas is a Python library for data manipulation and analysis.

Then we use the LabelEncoder to assign a numerical value to each word in our columns and store the results in new columns:

from sklearn.preprocessing import LabelEncoder

# creating instance of labelencoder
labelencoder = LabelEncoder()

# Assigning numerical values and storing in another column
book_data['First_df'] = labelencoder.fit_transform(book_data['First'])
book_data['Second_df'] = labelencoder.fit_transform(book_data['Second'])
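
Conceptually, label encoding is just a mapping from each distinct word to an integer. The plain-Python sketch below illustrates the idea; it is not scikit-learn's implementation, and `fit_label_mapping` is a hypothetical helper. Like LabelEncoder, it assigns ids in sorted order of the distinct values.

```python
def fit_label_mapping(values):
    """Map each distinct value to an integer id, assigned in sorted order."""
    classes = sorted(set(values))
    mapping = {value: index for index, value in enumerate(classes)}
    return mapping, classes

second_words = ["te", "shkoj", "ne", "shkoll"]
mapping, classes = fit_label_mapping(second_words)

encoded = [mapping[word] for word in second_words]  # words -> ids
decoded = [classes[index] for index in encoded]     # ids -> words again
print(encoded, decoded)
```

One thing worth knowing: calling `fit_transform` a second time on the same LabelEncoder refits it, so the encoder object only remembers the mapping of the last column it saw. Here that is the Second column, which is the one needed later to turn predictions back into words.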

We will drop every column except First_df, which leaves only the encoded input data, and assign it to X.

#The input data
X = book_data.drop(columns=['Second', 'First', 'Second_df'])

The drop() method doesn't change our current DataFrame; it returns a new one without these columns. It's most useful when we have to deal with multiple input or output columns.

And now we will do the same with the encoded output data.

#The output data
Y = book_data['Second_df']


We've prepared the data, so it's almost time to train the model. Before we do that, we need to set aside a portion of the dataset for testing, because we want to check the trained model's accuracy on data it has not seen.

from sklearn.model_selection import train_test_split

# Splitting the dataset into training and test sets (X and Y are defined above)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)


Now we create an instance of the DecisionTreeClassifier and pass the input data and our expected output data to its fit method.

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(x_train, y_train)
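
The held-out portion only pays off if we actually score the model on it. Here is a minimal sketch of that step, assuming toy encoded data in place of the real book; the toy `X` and `Y` and the fixed `random_state` are my own choices to keep the example reproducible.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy encoded data (hypothetical): each input id always maps to the same output id
X = [[i % 4] for i in range(40)]
Y = [(i % 4 + 1) % 4 for i in range(40)]

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(x_train, y_train)

# score() returns the fraction of test inputs whose prediction matches y_test
accuracy = model.score(x_test, y_test)
print(f"Accuracy on held-out pairs: {accuracy:.2f}")
```

The toy mapping is deterministic, so the score here is trivially perfect. Real text is far more ambiguous (the same word is followed by many different words), so the score on the book will likely be much lower.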

After training, if your dataset isn't going to change, you can save the trained model and load it each time you need it, like so:

import joblib

# Save the trained model to disk, then load it back
joblib.dump(model, "kadare-model.joblib")
model = joblib.load("kadare-model.joblib")


Now that the model is trained, we are ready to test it with some of our own input. I used Anvil for the frontend, so first we need to establish the connection.

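import anvil.server

# Connect this script to the Anvil app; the key below is a placeholder for your own Uplink key
anvil.server.connect("<your-uplink-key>")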

Then, over in Anvil, we add an input field and handle its change event like so:

def text_box_1_change(self, **event_args):
    word_list = self.text_box_1.text.split()
    # Nothing to predict from while the box is empty
    if not word_list:
        return
    result = anvil.server.call('getNextWord', word_list[-1])
    self.label_1.text = result

This fires whenever the input field changes and calls our backend function (to get the prediction) with the last word of the input as its parameter.

So basically when we type "Do te shkoj ne shkoll" (Albanian, roughly "I will go to school"), we will only use "shkoll" to predict the next word.

Passing only the last word to the model is not the right way to predict the next word; I learned that the hard way :). I will explain later the mistakes made during the process and what I learned from them.


And finally, we will make the getNextWord function callable from Anvil and return the prediction.

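@anvil.server.callable
def getNextWord(currentWord):
    # Look up the encoded value(s) of the typed word in the input column
    matches = book_data[book_data['First'] == currentWord]["First_df"]
    if len(matches) != 0:
        predictions = model.predict([[matches.iloc[0]]])
        # Decode the predicted id back into a word (plain str for Anvil)
        result = str(labelencoder.inverse_transform(predictions)[0])
    else:
        # The word never appeared in the book, so there is nothing to predict
        result = "..."
    return result

Finally! We have the algorithm guess words like Kadare would have written them in real-time!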

Mistakes && Conclusions

The point of the project wasn't to be perfect; it was more of a learning process that I wanted to share with you guys.

I know how I would do things differently next time. One example is how we create the dataset: it's smarter to use the sequence of words that precede the predicted word. With this logic we could generate a coherent sentence and not just an autosuggestion tool. Here is a visual example:

[Image: Visual Example]
Here is how I would change the algorithm, represented visually.
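
In code, the dataset change could look roughly like the sketch below: instead of one input word per row, each row holds the previous n words and the word that came next. The function name `context_windows` and the window size of 3 are assumptions for illustration, not something I built for this project.

```python
def context_windows(words, n):
    """Return (previous n words, next word) pairs for every position in the text."""
    return [(words[i:i + n], words[i + n]) for i in range(len(words) - n)]

words = ["do", "te", "shkoj", "ne", "shkoll"]
rows = context_windows(words, 3)
for context, target in rows:
    print(context, "->", target)
```

With more context per row, the model has a chance to tell apart cases where the same last word is followed by different words.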

Published February 18, 2022, by Ardit Xhaferi.
