Instructions

Step 1: Data Preparation: Please upload your labelled training data in csv format and click to clean and prepare it. Please make sure that your data has only one unique identifier (ID) column and that it is the first column (avoid multiple ID columns). Please see the following file as an acceptable example of the input csv data file format. Please make sure that your csv filename does not include spaces: for example, MilkQuality.csv is acceptable and Milk Quality.csv is not.

If your use case has class or target labels that represent real-life quantities, for example the output of a generator (Class 1 below 10 kW and Class 2 above 10 kW), never sort or rearrange the members of a class and their corresponding csv rows by the value of that quantity. For example, if csv rows No. 1, 2, 3, 4, ... all carry the Class 1 label (target values 1, 1, 1, 1, ...) and represent the values 8.4, 4.3, 1.8, 9.4, ..., leave them in that order. Do not rearrange them to correspond to 1.8, 4.3, 8.4, 9.4, otherwise poor prediction or classification results may be generated.

Please also avoid NULL within your data. Instead of NULL, use a blank csv cell:

When preparing your training csv file for Butterfly AI, please also avoid sorting all the data columns based on one data column, as this may result in poor predictions and low accuracy. As an example, the training csv file (above) has been sorted by the Param5 column; the resulting csv file, shown below, is in an unacceptable format. The first column (ID) is an exception to this instruction. Please note that the more irregular and randomized the arrangement of the data across the rows and columns of your training csv file, the better the chance that Butterfly AI will generate accurate predictions. To maximize the irregularity of your training data, you may temporarily add a column of random non-binary numbers (the output of a random number generator, for example -1.4, 19.3, 0.7, -10.2, ...) to your training csv file, sort that column from smallest to largest while expanding the sort to all the columns, and then remove the temporary column.
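The randomization tip above can also be applied outside a spreadsheet. The following is a minimal sketch, assuming the csv has already been read into a list of rows with the header first; the function name `randomize_rows` and the `seed` parameter are illustrative and not part of Butterfly AI.

```python
import random


def randomize_rows(rows, seed=None):
    """Return the rows in random order, keeping the header row first.

    `rows` is a list of lists: rows[0] is the header, the rest are data rows.
    """
    rng = random.Random(seed)
    header, data = rows[0], rows[1:]
    # Attach a temporary random number to every data row ...
    keyed = [(rng.random(), row) for row in data]
    # ... sort by that number so whole rows move together ...
    keyed.sort(key=lambda pair: pair[0])
    # ... and drop the temporary number again.
    return [header] + [row for _, row in keyed]


if __name__ == "__main__":
    example = [
        ["ID", "Param1", "Target"],
        ["1", "0.4", "A"],
        ["2", "0.9", "B"],
        ["3", "0.1", "A"],
    ]
    for row in randomize_rows(example, seed=42):
        print(row)
```

Each row stays intact, so every ID remains attached to its own features and target; only the order of the rows changes.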


The first row of the csv file: the column names, for example “ID”, then the feature names, for example “F1”, “F2”, ..., “Fn”, and the name of the target to be predicted, for example “Target” or “Result”.

The first column of the csv file: the ID of each training data sample. In compliance with GDPR, please make sure that all customer or user names, and any data that can be traced back to a customer or user to identify them, are removed and replaced with numeric values or textual IDs.

The last column of the csv file: Target label values

The current AI can perform both binary and multi-class classifications or predictions. The target column may include:

- Either numeric or textual binary values, for example “yes” and “no” or 0 and 1

- Or any number of multi-class numeric or textual values, for example “Very-Low”, “Low”, “Medium”, “High”, and “High-Risk”

Please make sure that no csv cell includes a comma (","). For example, in cell B2 as a training data feature, "Texas-Austin" or "Texas Austin" is acceptable and "Texas,Austin" is not.

Please make sure that your training csv data has a sufficiently large number of samples (typically above 1000 labelled samples) to get reliable predictions or classifications.
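Several of the formatting rules above can be checked before uploading. The following is a rough sketch, assuming the csv has been read into a list of rows with the header first; `check_training_rows` and its messages are hypothetical helpers for illustration only, and do not describe the checks that Butterfly AI itself performs.

```python
def check_training_rows(filename, rows, min_samples=1000):
    """Return a list of human-readable problems found in a training csv.

    `rows` is a list of lists: rows[0] is the header, the rest are data rows.
    `min_samples` mirrors the "typically above 1000" guideline and can be changed.
    """
    problems = []
    if " " in filename:
        problems.append("filename contains a space")

    header, data = rows[0], rows[1:]
    for line_no, row in enumerate(data, start=2):  # line 1 is the header
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} cells, found {len(row)}"
            )
        for cell in row:
            if cell.strip().upper() == "NULL":
                problems.append(f"line {line_no}: NULL found, use a blank cell")
            if "," in cell:
                problems.append(f"line {line_no}: cell contains a comma: {cell!r}")

    if len(data) < min_samples:
        problems.append(f"only {len(data)} samples, fewer than {min_samples}")
    return problems
```

An empty list means none of these particular problems were detected; it does not guarantee that the file will be accepted.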

Step 2: Training: Please choose one of your datasets that has already been prepared and cleaned and click to start training.


General Training Instructions using Butterfly AI:

You may run the first training round on any newly cleaned and prepared dataset with the default hyper-parameters suggested on screen. Depending on the accuracy achieved, you may then run a couple more training rounds to fine-tune the hyper-parameters and improve the model's performance. Butterfly AI uses 90% of your original data for training and 10% of your original data as blind, unseen verification test data: it removes the target column and performs a blind prediction/classification on that verification data. The AI then returns the mean of two accuracy values: 1. the training accuracy and 2. the prediction accuracy on the blind, unseen verification data. The value is between 0 and 1 and gives the fraction of correct predictions or classifications; for example, a value of 0.85 means that the AI has achieved 85% correct predictions or classifications.
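As a small worked example of the reported value, the sketch below computes the mean of the two accuracies from counts of correct predictions. The function name and the sample counts are made up for illustration; how Butterfly AI computes the value internally is not specified beyond the description above.

```python
def reported_accuracy(train_correct, train_total, verify_correct, verify_total):
    """Mean of training accuracy and blind verification accuracy (0 to 1)."""
    train_accuracy = train_correct / train_total
    verify_accuracy = verify_correct / verify_total
    return (train_accuracy + verify_accuracy) / 2


if __name__ == "__main__":
    # Illustrative numbers: 1000 samples split into 900 training rows
    # and 100 verification rows.
    print(reported_accuracy(810, 900, 80, 100))
```

With 810 of 900 training rows and 80 of 100 verification rows correct, the two accuracies are 0.90 and 0.80, so the reported value is approximately 0.85. Note that this is a mean of the two rates, not the overall fraction of correct rows.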

The AI uses two hyper-parameters: 1. the AI scaling factor (cell size) and 2. the optimization stopping threshold (a value between 0 and 1; for example, if you wish to achieve 80% accuracy you must enter 0.8. For 100% accuracy you may enter a value close to 1, for example 0.9999, as the value 1 itself is not accepted as a threshold). For the first training round the AI always uses the default values of 19 for the scaling factor and 0.8 (80%) for the optimization stopping threshold. Training succeeds when the AI achieves a training accuracy above the given threshold within the default given time; otherwise there is a training failure (timeout):

- If training succeeded, you may increase the optimization stopping threshold in steps of 0.05, 0.1 or any other value and run further training rounds to see whether you can achieve even better training accuracy

- If training failed, you may reduce the optimization stopping threshold in steps of 0.05, 0.1 or any other value and run further training rounds until you hit a training success

The more successful training model for the current dataset and training attempt always replaces the previous, less successful model (i.e. it overwrites it).

You will therefore always end up with the best trained model for the same data preparation round.
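The threshold adjustment described above can be written down as a simple search. The sketch below is only an illustration of that manual procedure: `train` is a hypothetical callable that stands in for one training round on the platform and returns True on success and False on a timeout. Butterfly AI does not necessarily expose such a function.

```python
def tune_threshold(train, start=0.8, step=0.05, max_rounds=10):
    """Return the highest threshold that trained successfully, or None.

    `train(threshold)` is a hypothetical stand-in for one training round:
    it returns True if training passed the threshold and False on timeout.
    """
    threshold = start
    if train(threshold):
        best = threshold
        # Success: keep raising the threshold until a round fails.
        for _ in range(max_rounds - 1):
            higher = round(threshold + step, 4)
            if higher >= 1 or not train(higher):  # 1 itself is not accepted
                break
            best = threshold = higher
        return best
    # Failure: keep lowering the threshold until a round succeeds.
    for _ in range(max_rounds - 1):
        threshold = round(threshold - step, 4)
        if threshold <= 0:
            break
        if train(threshold):
            return threshold
    return None
```

The same loop can be repeated for different values of the scaling factor, since each successful round with a better result replaces the previous model.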

The recommended range for the AI scaling factor is any integer value between 8 and 200, but you may go higher depending on the number of rows in your training data. You may also try to improve your predictions or classifications by running additional training rounds with different values of the AI scaling factor. Please note that you need to perform a new training to create a new model if you have new training data or a new use case with new training data features.

For the MVP, the training time assigned on our platform is limited. A training timeout therefore means that training did not reach and pass the success threshold within the default training time. The following table represents an example of accuracy performance of a multi-class prediction/classification with default values and with further optimisation.

Prediction when the targets are continuous numbers rather than categories (multi-class and binary):

You can also use Butterfly AI to perform predictions when the targets are continuous numbers rather than categories (such as multi-class or binary). Simply divide the value range of the labels into N buckets, treating each one as a class (i.e. N classes), and perform training and prediction to find the winning class, that is, the bucket that the unseen prediction sample falls in. Then divide the winning bucket into M more granular sub-buckets, perform the same training and prediction process, and pick the winning sub-bucket. The average value of the labels of the training samples in that winning sub-bucket is presented as the outcome of the prediction.

As an example, if we wish to predict the output power of a wind turbine, we may divide the range of the training label values, between 0 kW and 2000 kW, into 10 buckets of 200 kW each and then perform training and prediction to decide which bucket the unknown prediction sample falls in. If the winner is bucket no. 2, the bucket between 200 kW and 400 kW, we may then divide this bucket into 10 more granular sub-buckets of 20 kW each (10 new classes) and perform another round of training and prediction to determine where the unknown sample falls and get a more accurate prediction. You can continue this hierarchical process by dividing into even more granular buckets to further improve the accuracy of the prediction.
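The bucketing arithmetic in the wind turbine example can be sketched as follows. This is an assumed helper set for preparing class labels and reading back the final value, using equal-width buckets; the function names are illustrative, and the training and prediction rounds themselves still happen in Butterfly AI.

```python
def bucket_index(value, low, high, n_buckets):
    """0-based index of the equal-width bucket that `value` falls in.

    Values at or above `high` are placed in the last bucket.
    """
    width = (high - low) / n_buckets
    index = int((value - low) // width)
    return min(max(index, 0), n_buckets - 1)


def bucket_bounds(index, low, high, n_buckets):
    """Lower and upper edge of the bucket with the given 0-based index."""
    width = (high - low) / n_buckets
    return low + index * width, low + (index + 1) * width


def mean_label_in_bucket(labels, lower, upper):
    """Average of the training labels inside [lower, upper), or None if empty."""
    inside = [v for v in labels if lower <= v < upper]
    return sum(inside) / len(inside) if inside else None


if __name__ == "__main__":
    # First round: 0 to 2000 kW in 10 buckets of 200 kW.
    # Index 1 is "bucket no. 2" in the text, i.e. 200 to 400 kW.
    lower, upper = bucket_bounds(1, 0, 2000, 10)
    # Second round: split the winning bucket into 10 sub-buckets of 20 kW.
    sub_lower, sub_upper = bucket_bounds(2, lower, upper, 10)
    print(lower, upper, sub_lower, sub_upper)
```

For the first round, `bucket_index` turns each continuous training label into a class label; for the second round, only the training rows that fall inside the winning bucket are relabelled with their sub-bucket index.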

Training Imbalanced Datasets using Butterfly AI:

If your training data is an imbalanced dataset (i.e. one class label has fewer samples than the other class or classes), you will get better prediction accuracy and F1 value if, rather than performing one training on your large imbalanced dataset, you break it into multiple smaller balanced sub-datasets, train and predict with each one individually, and then combine their predictions by averaging the probabilities.

Please take the following steps. Please make sure that you apply the randomization process (described above) to your training data file before performing the following process:

Step A: If your training data is imbalanced (i.e. one class label has fewer samples than the other), turn your imbalanced dataset into K smaller balanced training sub-datasets. For example, if your training data has 100 positive samples and 1000 negative samples, turn it into K=10 balanced training sub-datasets that all share the same 100 positive samples: the first balanced sub-dataset has negative samples 1 to 100, the second has negative samples 101 to 200, and so on, up to the 10th, which has negative samples 901 to 1000. If your imbalanced dataset is not binary and has multiple classes, turn it into multiple smaller balanced subsets in a similar way. (Caution: For binary predictions/classifications, please make sure that all your new training sub-datasets have the same number of data rows and that the first training data row has the same target label value in every sub-dataset, e.g. in the example above, the first training data row of every smaller balanced subset has the target/result value P in its last column. For multi-label multi-class prediction/classification cases, make sure that all your balanced sub-dataset files have the same number of rows and the same order of classes; for example, if you have four labels A, B, C and D, each training csv file first includes the data rows with label A, then the data rows with label B, then C and then D. This helps Butterfly AI to harmonize the phases of probabilities across all the new balanced sub-datasets; otherwise you may encounter complications when combining the final prediction probabilities.)
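For the binary case, Step A can be sketched as below. This is a minimal illustration, assuming the data rows (without the header) are held as lists with the target in the last column; `balanced_subsets` is a hypothetical helper. It places the minority rows first in every subset so that the first data row always has the same label, and it drops any leftover majority rows that would not fill a complete subset, which is an assumption made here to keep all subsets the same size.

```python
def balanced_subsets(data_rows, minority_label, label_col=-1):
    """Split imbalanced binary data into K balanced sub-datasets.

    Every subset contains all minority rows followed by an equally sized,
    non-overlapping slice of the majority rows.
    """
    minority = [r for r in data_rows if r[label_col] == minority_label]
    majority = [r for r in data_rows if r[label_col] != minority_label]
    size = len(minority)
    k = len(majority) // size  # leftover majority rows are dropped
    return [minority + majority[i * size:(i + 1) * size] for i in range(k)]
```

With 100 positive and 1000 negative rows this yields K=10 subsets of 200 rows each, matching the example above. Remember to write the header row back on top of each subset before saving it as a csv file.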

Step B: Perform K independent rounds of data preparation, training and prediction/classification, one with each of the K newly created smaller balanced training datasets (you may also optimize the performance of each sub-dataset by tuning the two hyper-parameters and doing further training rounds).

Step C: Choose the M top performers in terms of the average of training performance and blind/verification test performance, and perform the predictions for the blind target using those top M training models.

Step D: Calculate the average of the probabilities of all M independent top predictions (the probabilities are available from the final prediction results csv). Follow the same logic in your final prediction results/output csv to get your final prediction labels and results.
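Step D can be sketched for a single prediction sample as follows. The layout of the results csv is not described here, so this sketch assumes that the probabilities from each of the M models have already been read into a dictionary mapping class label to probability; `combine_predictions` is an illustrative name.

```python
def combine_predictions(per_model_probs):
    """Average class probabilities over M models and pick the winning class.

    `per_model_probs` is a list with one dict per model, each mapping a
    class label to that model's probability for the same sample.
    """
    labels = list(per_model_probs[0].keys())
    m = len(per_model_probs)
    averaged = {
        label: sum(probs[label] for probs in per_model_probs) / m
        for label in labels
    }
    winner = max(averaged, key=averaged.get)
    return winner, averaged


if __name__ == "__main__":
    models = [
        {"P": 0.6, "N": 0.4},
        {"P": 0.3, "N": 0.7},
        {"P": 0.4, "N": 0.6},
    ]
    print(combine_predictions(models))
```

In this made-up example the first model alone would predict P, but the averaged probabilities favour N, which is the kind of correction that combining several balanced models is meant to provide.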

We will soon (beyond MVP) automate the entire process above so that you don't need to perform the steps manually. The following table represents the accuracy of the blind, unseen multi-class test prediction/classification of the case study above after one round of optimisation without additional balanced sub-dataset predictions, and the stage-by-stage improvements as we average in more predictions/classifications from additional balanced sub-datasets:

Step 3: Batch Predictions/Classification: Butterfly AI always keeps the trained model with the best performance across all the training rounds and pairs it with its corresponding training dataset, overwriting past weaker models. Please choose the dataset originally used for training, upload the blind batch prediction dataset in csv format (example below), and click to start cleaning and preparing the prediction data and to run the prediction process. Your prediction results will be delivered in an output csv file. Please click the "DOWNLOAD" button to see your final prediction results. Within the prediction result file (the output csv), the column labelled "Class" always represents your final prediction or classification results. The probabilities represent the certainty of the predictions or classifications.

Your blind, unseen batch prediction dataset should have a similar format to the original dataset used for training, except that the last column (the unknown labels/targets to be predicted) is not included. For example, for the training dataset example (above), your batch prediction dataset may look like the following csv: