This article describes how to train an Artificial Intelligence (AI) to classify tickets in a ticket system like OTOBO. This process involves data preparation, model training, and evaluation.
Requirements
- Python 3.10+
- Libraries: datasets, transformers[torch], psutil, gputil, nvidia_smi, huggingface_hub, nlpaug, nltk, sentencepiece, pandas, scikit-learn
Install the required packages with:
pip install datasets transformers[torch] psutil gputil nvidia_smi huggingface_hub nlpaug nltk sentencepiece pandas scikit-learn
or in a Jupyter Notebook:
!pip install datasets transformers[torch] psutil gputil nvidia_smi huggingface_hub nlpaug nltk sentencepiece pandas scikit-learn
Step 1: Data Preparation
First, the ticket data must be prepared. This includes loading the data, cleaning, and preprocessing the text. For this tutorial, we use the following data:
Example Data
subject | body | priority | queue |
---|---|---|---|
Login Issue | Unable to login to the system | High | Software |
Password Reset | Need to reset my password | Medium | Hardware |
Email Problem | Not receiving emails | Low | Accounting |
Network Down | Network is down in building 5 | High | Software |
Printer Issue | Printer not working | Medium | Hardware |
We use subject and body as features, and priority and queue are the labels we want to predict.
Features and Labels
Feature 1 | Feature 2 | Label 1 | Label 2 |
---|---|---|---|
Login Issue | Unable to login to the system | High | Software |
Password Reset | Need to reset my password | Medium | Hardware |
Email Problem | Not receiving emails | Low | Accounting |
Network Down | Network is down in building 5 | High | Software |
Printer Issue | Printer not working | Medium | Hardware |
When using text sequence classification with BERT, we can only use one feature. Therefore, we combine subject and body. Since we want to give more weight to the subject, we concatenate the texts by inserting the subject twice and the body once.
import pandas as pd
# Example Data
data = {
'subject': ["Login Issue", "Password Reset", "Email Problem", "Network Down", "Printer Issue"],
'body': ["Unable to login to the system", "Need to reset my password", "Not receiving emails",
"Network is down in building 5", "Printer not working"],
'priority': ["High", "Medium", "Low", "High", "Medium"],
'queue': ["Software", "Hardware", "Accounting", "Software", "Hardware"]
}
df = pd.DataFrame(data)
# Create combined feature
df['combined_feature'] = df.apply(lambda row: f"{row['subject']} {row['subject']} {row['body']}", axis=1)
print(df[['combined_feature', 'priority', 'queue']])
Transformed Table
Combined Feature | Label 1 | Label 2 |
---|---|---|
Login Issue Login Issue Unable to login to the system | High | Software |
Password Reset Password Reset Need to reset my password | Medium | Hardware |
Email Problem Email Problem Not receiving emails | Low | Accounting |
Network Down Network Down Network is down in building 5 | High | Software |
Printer Issue Printer Issue Printer not working | Medium | Hardware |
To train the model, we need to convert the labels into numbers. Here is the code to do this:
from sklearn.preprocessing import LabelEncoder
# Initialize Label Encoder
le_priority = LabelEncoder()
le_queue = LabelEncoder()
# Convert labels to numbers
df['priority_encoded'] = le_priority.fit_transform(df['priority'])
df['queue_encoded'] = le_queue.fit_transform(df['queue'])
print(df[['combined_feature', 'priority_encoded', 'queue_encoded']])
Result:
Combined Feature | priority_encoded | queue_encoded |
---|---|---|
Login Issue Login Issue Unable to login to the system | 0 | 2 |
Password Reset Password Reset Need to reset my password | 2 | 1 |
Email Problem Email Problem Not receiving emails | 1 | 0 |
Network Down Network Down Network is down in building 5 | 0 | 2 |
Printer Issue Printer Issue Printer not working | 2 | 1 |
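To map the encoded numbers back to the original label names later (for example when interpreting predictions), the fitted encoders can be reused. A short sketch:
# Map encoded values back to the original label names.
# LabelEncoder assigns the numbers in alphabetical order of the labels.
print(le_priority.inverse_transform([0, 1, 2]))  # ['High' 'Low' 'Medium']
print(le_queue.inverse_transform([0, 1, 2]))     # ['Accounting' 'Hardware' 'Software']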
Since we can only have one label for our classification, we now have two options:
1. Combine the two labels into one. This would result in a combined label priority_queue: HighSoftware, HighHardware, HighAccounting, etc. The number of classes would be the product of the number of unique values per label, in our case len(unique(priorities)) * len(unique(queues)) = 3 * 3 = 9 (a short code sketch of this option follows the list).
   Advantages:
   - Simple implementation and management.
   - One model for the entire classification.
   Disadvantages:
   - Increased complexity and size of the classification problem.
   - Potentially worse performance with little data per label combination.
2. Train a separate model for each label. In this tutorial, we use method 2: we train a separate model for each of Queue and Priority.
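For illustration, option 1 could look like this in pandas. This sketch is not used in the rest of the tutorial:
# Option 1 (illustration only): combine both labels into a single label column
df['priority_queue'] = df['priority'] + df['queue']
print(df['priority_queue'].tolist())  # e.g. ['HighSoftware', 'MediumHardware', 'LowAccounting', ...]
# Maximum number of classes = product of the unique label counts: 3 * 3 = 9
print(df['priority'].nunique() * df['queue'].nunique())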
Code to Split the Table into Queue and Priority Tables
# Split into Queue and Priority Tables
queue_df = df[['combined_feature', 'queue_encoded']]
priority_df = df[['combined_feature', 'priority_encoded']]
print(queue_df)
print(priority_df)
Table for Queue Model
Combined Feature | queue_encoded |
---|---|
Login Issue Login Issue Unable to login to the system | 2 |
Password Reset Password Reset Need to reset my password | 1 |
Email Problem Email Problem Not receiving emails | 0 |
Network Down Network Down Network is down in building 5 | 2 |
Printer Issue Printer Issue Printer not working | 1 |
Table for Priority Model
Combined Feature | priority_encoded |
---|---|
Login Issue Login Issue Unable to login to the system | 0 |
Password Reset Password Reset Need to reset my password | 2 |
Email Problem Email Problem Not receiving emails | 1 |
Network Down Network Down Network is down in building 5 | 0 |
Printer Issue Printer Issue Printer not working | 2 |
Tokenizer Explanation
A tokenizer converts text into smaller units called tokens. These tokens can be words, punctuation, or sentence components. Tokenizers are important because machine learning and NLP models require text in a form they can process. Through tokenization, models can analyze text and learn to recognize patterns.
Token Encoding
In token encoding, tokens are converted into numbers so they can be processed by machine learning models. Here is an example of what a tokenized and encoded text for our table might look like:
from transformers import BertTokenizer
# Initialize BERT Tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenization and encoding of an example
example_text = df['combined_feature'][0]
tokens = tokenizer.tokenize(example_text)
encoded_tokens = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(encoded_tokens)
Example output for "Login Issue Login Issue Unable to login to the system":
Tokens:
['login', 'issue', 'login', 'issue', 'unable', 'to', 'login', 'to', 'the', 'system']
Encoded Tokens:
[2653, 3277, 2653, 3277, 3928, 2000, 2653, 2000, 1996, 2291]
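In practice, the tokenizer is usually called directly when preparing model inputs. This also adds the special [CLS] and [SEP] tokens, pads or truncates to a fixed length, and returns the attention mask. A minimal sketch:
# Calling the tokenizer directly adds [CLS]/[SEP], pads/truncates to max_length
# and returns the attention mask needed by the model.
encoding = tokenizer(example_text, truncation=True, padding='max_length', max_length=16)
print(encoding['input_ids'])
print(encoding['attention_mask'])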
Splitting Tables into Train and Test Datasets
To train and test our models, we split the data into training and test datasets. Here is the code to do this:
from sklearn.model_selection import train_test_split
# Split the Queue table into train and test datasets
queue_train, queue_test, y_queue_train, y_queue_test = train_test_split(queue_df['combined_feature'],
queue_df['queue_encoded'], test_size=0.2,
random_state=42)
# Split the Priority table into train and test datasets
priority_train, priority_test, y_priority_train, y_priority_test = train_test_split(priority_df['combined_feature'],
priority_df['priority_encoded'],
test_size=0.2, random_state=42)
print(queue_train, queue_test, y_queue_train, y_queue_test)
print(priority_train, priority_test, y_priority_train, y_priority_test)
By splitting data, we ensure that we have enough data to train and test our models, allowing us to evaluate their performance.
Step 2: Model Training
In this step, we describe how to train the models with our training data. We use the transformers library from Hugging Face and torch for training BERT models.
BERT Model
BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model capable of capturing the context of words in a sentence. It is commonly used for tasks such as text classification, question answering, and many other NLP tasks.
Parameters for Training
- batch_size: The number of examples processed in one pass through the model. Smaller batch sizes require less memory but result in more frequent updates of the model parameters.
- epochs: The number of complete passes through the entire training dataset. More epochs can lead to a better model but risk overfitting.
- learning_rate: The step size with which the model adjusts its parameters. Too high a learning rate can lead to unstable training processes, while too low a learning rate can result in slow learning.
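These parameters map directly onto the Hugging Face TrainingArguments used for training. The values below are only illustrative defaults, not tuned settings:
from transformers import TrainingArguments
# Illustrative values only; tune them for your own data
training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=8,  # batch_size
    num_train_epochs=3,             # epochs
    learning_rate=5e-5,             # learning_rate
)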
Initializing the Model
We define a class TicketClassifier that initializes the model and training parameters.
from datasets import Dataset
from transformers import BertForSequenceClassification, BertTokenizer, Trainer, TrainingArguments

class TicketClassifier:
    def __init__(self, model_name: str, num_labels: int = 3):  # e.g. 3 classes: Low, Medium, High
        self.tokenizer = BertTokenizer.from_pretrained(model_name)
        self.model = BertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
    def to_dataset(self, texts, labels):
        # Tokenize the texts and pack them together with the labels into a Hugging Face Dataset
        dataset = Dataset.from_dict({'text': list(texts), 'labels': list(labels)})
        return dataset.map(lambda row: self.tokenizer(row['text'], truncation=True, padding='max_length', max_length=128))
    def train(self, train_data, train_labels):
        training_args = TrainingArguments(output_dir='./results')
        trainer = Trainer(model=self.model, args=training_args, train_dataset=self.to_dataset(train_data, train_labels))
        trainer.train()
        return trainer

classifier = TicketClassifier(model_name='bert-base-uncased')
Training the Model
We use the prepared datasets queue_train, queue_test, priority_train, and priority_test for training and evaluation.
# Training the Queue model
trainer_queue = classifier.train(queue_train, y_queue_train)
# Training the Priority model (a separate model instance, since we train one model per label)
priority_classifier = TicketClassifier(model_name='bert-base-uncased')
trainer_priority = priority_classifier.train(priority_train, y_priority_train)
Model Evaluation
After training, we evaluate the model with the test data.
# Evaluating the Queue model
eval_queue_results = trainer_queue.evaluate(eval_dataset=classifier.to_dataset(queue_test, y_queue_test))
print(eval_queue_results)
# Evaluating the Priority model
eval_priority_results = trainer_priority.evaluate(eval_dataset=priority_classifier.to_dataset(priority_test, y_priority_test))
print(eval_priority_results)
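At this point, the trained models can also be used to classify a new, unseen ticket. Below is a minimal sketch using the Queue model trained above; the ticket text is made up, and the subject is repeated twice to match the combined feature format:
import torch
# Classify a new (hypothetical) ticket with the trained Queue model
new_ticket = "VPN Issue VPN Issue Cannot connect to the company VPN"
inputs = classifier.tokenizer(new_ticket, truncation=True, padding='max_length', max_length=128, return_tensors='pt')
inputs = {k: v.to(classifier.model.device) for k, v in inputs.items()}
with torch.no_grad():
    logits = classifier.model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()
print(le_queue.inverse_transform([predicted_class]))  # e.g. ['Software']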
Through these steps, we ensure that our models are well trained and evaluated to successfully solve the classification task.
After evaluating the models, we obtain various metrics that describe the performance of the model. One of the most important metrics is accuracy, which indicates how many of the predictions are correct.
Prediction Accuracy
Accuracy is calculated by dividing the number of correct predictions by the total number of predictions. Here is a Python code to calculate the accuracy:
from sklearn.metrics import accuracy_score
# Example predictions and actual labels
y_true = [0, 2, 1, 0, 2] # Actual labels
y_pred = [0, 2, 1, 0, 1] # Predicted labels
# Calculating accuracy
accuracy = accuracy_score(y_true, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
Evaluation with Continuous Numbers
For numeric labels like the encoded priority levels, it is also important to consider how close a prediction is to the actual value: a prediction of 2 is closer to a true value of 1 than a prediction of 3 would be. One way to evaluate this is to calculate the mean absolute error and the mean squared error.
Mean Absolute Error
The mean absolute error measures how far predictions are from the actual values. Here is a Python code to calculate the mean absolute error:
import numpy as np
# Example predictions and actual labels
y_true = np.array([0, 2, 1, 0, 2]) # Actual labels
y_pred = np.array([0, 2, 1, 0, 1]) # Predicted labels
# Calculating mean absolute error
mean_absolute_error = np.mean(np.abs(y_true - y_pred))
print(f"Mean Absolute Error: {mean_absolute_error}")
Mean Squared Error
The mean squared error measures the squared deviation of predictions from actual values, which weights larger errors more heavily. Here is a Python code to calculate the mean squared error:
# Calculating mean squared error
mean_squared_error = np.mean((y_true - y_pred) ** 2)
print(f"Mean Squared Error: {mean_squared_error}")
With these metrics, we can better understand and improve the performance of our models. Accuracy gives us an overall view of the performance, while mean absolute error and mean squared error for continuous numbers allow for a more detailed evaluation.
Summary
In this article, we demonstrated how to train an AI to classify tickets. By using Python and libraries such as Hugging Face transformers and scikit-learn, we were able to create a simple yet effective model. This model can be further improved by using more complex algorithms and larger datasets.