Training File¶
The training file¶
The training file inputs the information for the training process, the symbol to comment a line is “#”, the following is a sample training file called “training.in” the training files must follow the format shown in the sample:
#training.in
#setting up the training parameters
# path where the data for training is stored
path_to_data dft_data
#code used to generate the data
#1-> abinit 2-> fireball 3-> vasp
code 1
spec_simb C O
spec_Z 6 8
fami_2b 2
fami_3b 3 4
potential_name potential_test
validation_percentage 20
GBR_E_models_to_train 2
GBR_E_n_estimators 650 550
GBR_E_max_depth 7 5
GBR_E_min_samples_split 2 4
GBR_E_learning_rate 0.1732 0.35
GBR_E_min_samples_leaf 3 5
GBR_F_models_to_train 1
GBR_F_n_estimators 900
GBR_F_max_depth 8
GBR_F_min_samples_split 2
GBR_F_learning_rate 0.1732
GBR_F_min_samples_leaf 3
nn_E_models_to_train 2
nn_E_learning_rate 0.0012 0.0009
nn_E_training_steps 100 100
nn_E_batch_size 25 30
nn_E_architecture_1 152 80 20 1
nn_E_architecture_2 152 80 40 20 1
nn_F_models_to_train 2
nn_F_learning_rate 0.0018 0.00009
nn_F_training_steps 1000 1000
nn_F_batch_size 100 250
nn_F_architecture_1 152 80 20 1
nn_F_architecture_2 152 80 40 20 1
Flags for the Training File¶
path_to_data The path to the directory where the DFT data is stored.
code The number indicating the DFT code used for the calculations
1-> abinit
2-> fireball
3-> vasp.
spec_simb The species symbols of the atoms in the data set, in the sample “training.in” the flag is
spec_simb C O
spec_Z The chemical number of the atoms in the data set, in the order in which they appear en the spec_simb flag, in the sample “training.in” the flag is
spec_Z 6 8
fami_2b The families of 2 body Rpcenters for the Gaussians, or filters, in the sample ‘training.in’
fami_2b 2
is using the second family as defined in Dimension of the Feature Space .
fami_3b The families of 3 body pcenters for the Gaussians, or filters, in the sample ‘training.in’
fami_3b 3 4
is using the third and fourth families as defined in Dimension of the Feature Space .
potential_name Name of the directory containing the potential. The potential_name directory will be created in the path where the mlmd_ is invoked.
validation_percentage The percentage of elements from the data set to be used in the validation set.
GBR_E_models_to_train The number of gradients boosting regressors to be trained for energy predictions, the mlmd_ will select the model (GBR or nn) with the lower error in the validation set to be the part of the machine learning potential. In the sample ‘training.in’
GBR_E_models_to_train 2
which means that mlmd_ is going to train 2 GBR models to predict energy. In the case in which the user wants to train only neural networks the GBR_E_models_to_train can be put to 0 for example
GBR_E_models_to_train 0
GBR_E_n_estimators Array of integers with the number of estimators in every one of the GBR energy models defined in GBR_E_models_to_train. In the sample ‘training.in’
GBR_E_n_estimators 650 550
which means that the first GBR energy model is going to have 650 estimators and the second one 550.
GBR_E_max_depth Array of integers with the maximal depth of every one of the GBR energy models defined in GBR_E_models_to_train. In the sample ‘training.in’
GBR_E_max_depth 7 5
which means that the first GBR energy model is going to have a maximal depth of 7, while the second model is going to be 5.
GBR_E_min_samples_split Array of integers with the minimum number of samples required to split an internal node in every one of the GBR energy models defined in GBR_E_models_to_train. In the sample ‘training.in’
GBR_E_min_samples_split 2 4
which means that in the first GBR energy model the minimal number of samples is going to be 2, while in the second model is going to be 4.
GBR_E_learning_rate Array of floats with the learning rates of the GBR energy models defined in GBR_E_models_to_train. In the sample ‘training.in’
GBR_E_learning_rate 0.1732 0.35
which means that in the first GBR energy model the learning rate is going to be 0.1732, while in the second model is going to be 0.35.
GBR_E_min_samples_leaf Array of integers with the minimum number of samples required to be at a terminal node in every one of the GBR energy
- models defined in GBR_E_models_to_train. In the sample ‘training.in’ ::
- GBR_E_min_samples_split 2 4
which means that in the first GBR energy model the minimal number of samples is going to be 2, while in the second model is going to be 4.
GBR_F_models_to_train The number of gradients boosting regressors to be trained for force predictions, the mlmd_ will select the model (GBR or nn) with the lower error in the validation set to be the part of the machine learning potential. In the sample ‘training.in’
GBR_F_models_to_train 1
which means that mlmd is going to train 1 GBR model to predict force.
GBR_F_n_estimators Array of integers with the number of estimators in every one of the GBR force models defined in GBR_F_models_to_train. In the sample ‘training.in’
GBR_F_n_estimators 900
which means that the GBR force model is going to have 900.
GBR_F_max_depth Array of integers with the maximal depth of every one of the GBR force models defined in GBR_F_models_to_train. In the sample ‘training.in’
GBR_F_max_depth 8
which means that the GBR force model is going to have a maximal depth of 8.
GBR_F_min_samples_split Array of integers with the minimum number of samples required to split an internal node in every one of the GBR force models defined in GBR_F_models_to_train. In the sample ‘training.in’
GBR_F_min_samples_split 2
which means that in theGBR force model the minimal number of samples is going to be 2.
GBR_F_learning_rate Array of floats with the learning rates of the GBR force models defined in GBR_F_models_to_train. In the sample ‘training.in’
GBR_F_learning_rate 0.1732
which means that in the GBR force model the learning rate is going to be 0.1732
GBR_F_min_samples_leaf Array of integers with the minimum number of samples required to be at a terminal node in every one of the GBR force models defined in GBR_F_models_to_train. In the sample ‘training.in’
GBR_F_min_samples_leaf 3
which means that in the GBR force model the minimal number of samples is going to be 3
nn_E_models_to_train The number of neural network to be trained for energy predictions, the mlmd_ will select the model (GBR or nn) with the lower error in the validation set to be the part of the machine learning potential. In the sample ‘training.in’
nn_E_models_to_train 2
which means that mlmd is going to train 2 nn models to predict enrgy.
nn_E_learning_rate Array of floats with the learning rates of the nn energy models defined in nn_E_models_to_train. In the sample ‘training.in’
nn_E_learning_rate 0.0012 0.0009
which means that in the first nn energy model the learning rate is going to be 0.0012, while in the second model is going to be 0.0009.
nn_E_training_steps Array of integers with the number of training steps for every one of the models declared in nn_E_models_to_train. In the sample ‘training.in’
nn_E_training_steps 100 100
which means that the in both nn models the number of training steps is going to be 100.
nn_E_batch_size Array of integers with the number of size of the batches for the training in every one of the models declared in nn_E_models_to_train. In the sample ‘training.in’
nn_E_batch_size 25 30
which means that in the first nn model de batch for training is going to have 25 examples, while for the second nn model the batch size is going to be of 30 examples.
nn_E_architecture_i Array of integers, the amount of numbers in the array determines the number of layers in the ‘i’ nn declared in nn_E_models_to_train, every number in the array represents the number of nodes in the represented layer. For example in the ‘training.in’ sample there are 2 nn model declared, then there are also 2 architectures, the first one
nn_E_architecture_1 152 80 20 1
this represents an nn with 4 leyers, the first leyer has 152 nodes (the number of nodes of the first layer must be equal to the number of dimensions of the feature vectors), the second layer has 80 nodes, the third layer has 20 nodes, and the output layer has 1node, (the output layer must always have 1 node). For the second architecture in the ‘training.in’
sample nn_E_architecture_2 152 80 40 20 1
the number of leyers is 5 the first has 152 nodes, the second layer has 80 nodes, the third layer has 40 nodes, the fourth layer has 20 nodes, and the fifth layer (output) has 1 node.
nn_F_models_to_train The number of neural network to be trained for energy predictions, the mlmd_ will select the model (GBR or nn) with the lower error in the validation set to be the part of the machine learning potential. In the sample ‘training.in’
nn_F_models_to_train 2
which means that mlmd is going to train 2 nn models to predict force
nn_F_learning_rate Array of floats with the learning rates of the nn force models defined in nn_F_models_to_train. In the sample ‘training.in’
nn_F_learning_rate 0.0018 0.00009
which means that in the first nn force model the learning rate is going to be 0.0018, while in the second model is going to be 0.0009.
nn_F_training_steps Array of integers with the number of training steps for every one of the models declared in nn_F_models_to_train. In the sample ‘training.in’
nn_F_training_steps 1000 1000
which means that the in both nn models the number of training steps is going to be 1000.
nn_F_batch_size Array of integers with the number of size of the batches for the training in every one of the models declared in nn_F_models_to_train. In the sample ‘training.in’
nn_F_batch_size 100 250
which means that in the first nn model de batch for training is going to have 100 examples, while for the second nn model the batch size is going to be of 250 examples.
nn_F_architecture_i Array of integers, the amount of numbers in the array determines the number of layers in the ‘i’ nn declared in nn_F_models_to_train, every number in the array represents the number of nodes in the represented leyer. For example in the ‘training.in’ sample there are 2 nn model declared, then there are also 2 architectures, the first one
nn_F_architecture_1 152 80 20 1
this represents an nn with 4 layers, the first layer has 152 nodes (the number of nodes of the first layer must be equal to the number of dimensions of the feature vectors), the second layer has 80 nodes, the third layer has 20 nodes, and the output layer has 1 node, (the output layer must always have 1 node). In the second example in the ‘training.in’ sample
nn_F_architecture_2 152 80 40 20 1
the number of layers is 5 the first has 152 nodes, the second layer has 80 nodes, the third layer has 40 nodes, the fourth layer has 20 nodes, and the fifth layer (output) has 1 node.