.. _chap-training_file:

Training File
#############

==================
The training file 
==================
The *training file* supplies the input for the training process.
Lines that begin with **“#”** are comments. The following sample training
file, called **“training.in”**, shows the format that every training file
must follow ::

	#training.in
	#setting up the training parameters
	# path where the data for training is stored
	path_to_data dft_data 
	#code used to generate the data
	#1-> abinit 2-> fireball 3-> vasp
	code 1
	spec_simb C O
	spec_Z 6 8
	fami_2b 2
	fami_3b 3 4
	potential_name potential_test
	validation_percentage 20

	GBR_E_models_to_train  2
	GBR_E_n_estimators 650 550
	GBR_E_max_depth 7 5
	GBR_E_min_samples_split 2 4
	GBR_E_learning_rate 0.1732 0.35
	GBR_E_min_samples_leaf 3 5

	GBR_F_models_to_train  1
	GBR_F_n_estimators 900
	GBR_F_max_depth 8
	GBR_F_min_samples_split 2
	GBR_F_learning_rate 0.1732
	GBR_F_min_samples_leaf 3

	nn_E_models_to_train 2
	nn_E_learning_rate 0.0012 0.0009
	nn_E_training_steps 100 100
	nn_E_batch_size 25 30
	nn_E_architecture_1 152 80 20 1
	nn_E_architecture_2 152 80 40 20 1

	nn_F_models_to_train 2
	nn_F_learning_rate 0.0018 0.00009
	nn_F_training_steps 1000 1000
	nn_F_batch_size 100 250
	nn_F_architecture_1 152 80 20 1
	nn_F_architecture_2 152 80 40 20 1
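
The format above (one flag per line, followed by its values, with "#" starting a comment line) can be read with a few lines of Python. The following is only a minimal sketch of such a reader, not the parser used by *mlmd_*; the function name ``parse_training_file`` is hypothetical, and values are kept as strings because each flag has its own type.

```python
def parse_training_file(text):
    """Map each flag name to its list of values (kept as strings)."""
    flags = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines, which start with '#'.
        if not line or line.startswith("#"):
            continue
        name, *values = line.split()
        flags[name] = values
    return flags


# A shortened copy of the sample training file.
sample = """\
#training.in
# path where the data for training is stored
path_to_data dft_data
code 1
spec_simb C O
spec_Z 6 8
GBR_E_models_to_train  2
GBR_E_n_estimators 650 550
"""

flags = parse_training_file(sample)
print(flags["spec_simb"])
print(flags["GBR_E_n_estimators"])
```

Whether *mlmd_* accepts a comment after a flag on the same line is not stated here, so the sketch only treats whole lines starting with "#" as comments.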

===========================	
Flags for the Training File
===========================
**path_to_data** The path to the directory where the DFT data is stored.

**code** The number identifying the DFT code used for the calculations ::

	1-> abinit
	2-> fireball
	3-> vasp

**spec_simb** The chemical symbols of the species in the data set. In the
sample *“training.in”* the flag is ::

	spec_simb C O
	
**spec_Z** The atomic numbers of the species in the data set, in the order
in which they appear in the *spec_simb* flag. In the
sample *“training.in”* the flag is ::

	spec_Z 6 8

**fami_2b** The families of 2 body Rpcenters for the Gaussians, or filters.
The sample *‘training.in’* ::

	fami_2b 2

uses the second family as defined in :ref:`chap-dimension_feat_space`.

**fami_3b** The families of 3 body pcenters for the Gaussians, or filters.
The sample *‘training.in’* ::

	fami_3b 3 4

uses the third and fourth families as defined in :ref:`chap-dimension_feat_space`.

**potential_name** Name of the directory containing the potential.
The *potential_name* directory is created in the path where *mlmd_* is invoked.

**validation_percentage** The percentage of elements from the data set 
to be used in the validation set.
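
As an illustration of what a percentage split means in practice, the sketch below divides sample indices into a training part and a validation part. It is an assumption-laden example: how *mlmd_* actually picks the validation elements (randomly or otherwise) is not described here, and the function name ``split_indices`` and its ``seed`` argument are hypothetical.

```python
import random


def split_indices(n_samples, validation_percentage, seed=0):
    """Return (training indices, validation indices) for a random split."""
    n_validation = round(n_samples * validation_percentage / 100)
    indices = list(range(n_samples))
    # Shuffle with a fixed seed so the split is reproducible (an assumption).
    random.Random(seed).shuffle(indices)
    return indices[n_validation:], indices[:n_validation]


# With 'validation_percentage 20' and a hypothetical data set of 1000 elements.
train_idx, val_idx = split_indices(1000, 20)
print(len(train_idx), len(val_idx))
```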

**GBR_E_models_to_train** The number of gradient boosting regressors (GBR) to
be trained for energy predictions. *mlmd_* selects the model
(GBR or nn) with the lowest error in the validation set to be part of
the machine learning potential. In the sample ‘training.in’ ::

	GBR_E_models_to_train  2

means that *mlmd_* is going to train 2 GBR models to predict the energy.
To train only neural networks, set *GBR_E_models_to_train* to 0, for example ::

	GBR_E_models_to_train  0
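
The selection rule described for this flag (keep the GBR or nn model with the lowest validation error) can be sketched as follows. The error values and model labels are made up for illustration, and the error metric that *mlmd_* uses is not specified here.

```python
def select_best_model(validation_errors):
    """Return the label of the model with the lowest validation error."""
    return min(validation_errors, key=validation_errors.get)


# Hypothetical validation errors for two GBR and two nn energy models.
errors = {"GBR_1": 0.021, "GBR_2": 0.034, "nn_1": 0.018, "nn_2": 0.027}
best = select_best_model(errors)
print(best)
```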
	
**GBR_E_n_estimators** Array of integers with the number of estimators
in each of the GBR energy models declared in *GBR_E_models_to_train*.
In the sample ‘training.in’ ::

	GBR_E_n_estimators 650 550

means that the first GBR energy model is going to have 650
estimators and the second one 550.

**GBR_E_max_depth** Array of integers with the maximal depth of each
of the GBR energy models declared in *GBR_E_models_to_train*.
In the sample ‘training.in’ ::

	GBR_E_max_depth 7 5

means that the first GBR energy model is going to have a maximal
depth of 7, while the second model is going to have a maximal depth of 5.

**GBR_E_min_samples_split** Array of integers with the minimum number of
samples required to split an internal node in each of the GBR energy
models declared in *GBR_E_models_to_train*. In the sample ‘training.in’ ::

	GBR_E_min_samples_split 2 4

means that in the first GBR energy model the minimum number of
samples is going to be 2, while in the second model it is going to be 4.

**GBR_E_learning_rate** Array of floats with the learning rates of the
GBR energy models declared in *GBR_E_models_to_train*.
In the sample ‘training.in’ ::

	GBR_E_learning_rate 0.1732 0.35

means that in the first GBR energy model the learning rate is
going to be 0.1732, while in the second model it is going to be 0.35.

**GBR_E_min_samples_leaf** Array of integers with the minimum number of
samples required to be at a terminal node in each of the GBR energy
models declared in *GBR_E_models_to_train*. In the sample ‘training.in’ ::

	GBR_E_min_samples_leaf 3 5

means that in the first GBR energy model the minimum number of
samples is going to be 3, while in the second model it is going to be 5.
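
Each GBR array flag holds one value per model, so the i-th entries of all arrays together describe the i-th model. The sketch below regroups the flags that way and checks that every array has as many entries as *GBR_E_models_to_train* declares. The function ``gbr_settings`` is hypothetical. The flag suffixes happen to match keyword arguments of scikit-learn's ``GradientBoostingRegressor``, but whether *mlmd_* uses that library is an assumption not confirmed here.

```python
# Flags as name -> list of string values, copied from the sample file.
flags = {
    "GBR_E_models_to_train": ["2"],
    "GBR_E_n_estimators": ["650", "550"],
    "GBR_E_max_depth": ["7", "5"],
    "GBR_E_min_samples_split": ["2", "4"],
    "GBR_E_learning_rate": ["0.1732", "0.35"],
    "GBR_E_min_samples_leaf": ["3", "5"],
}

# Type of each per-model array (integers or floats, as documented above).
GBR_TYPES = {
    "n_estimators": int,
    "max_depth": int,
    "min_samples_split": int,
    "learning_rate": float,
    "min_samples_leaf": int,
}


def gbr_settings(flags, target):
    """Return one hyperparameter dict per GBR model for target 'E' or 'F'."""
    prefix = f"GBR_{target}_"
    n_models = int(flags[prefix + "models_to_train"][0])
    for name in GBR_TYPES:
        # Every array must carry exactly one value per declared model.
        if len(flags.get(prefix + name, [])) != n_models:
            raise ValueError(f"{prefix}{name} needs {n_models} values")
    return [
        {name: cast(flags[prefix + name][i]) for name, cast in GBR_TYPES.items()}
        for i in range(n_models)
    ]


energy_models = gbr_settings(flags, "E")
print(energy_models[0])
print(energy_models[1])
```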

**GBR_F_models_to_train** The number of gradient boosting regressors
to be trained for force predictions. *mlmd_* selects the model
(GBR or nn) with the lowest error in the validation set to be part of
the machine learning potential. In the sample ‘training.in’ ::

	GBR_F_models_to_train  1

means that mlmd is going to train 1 GBR model to predict the force.

**GBR_F_n_estimators** Array of integers with the number of estimators
in each of the GBR force models declared in *GBR_F_models_to_train*.
In the sample ‘training.in’ ::

	GBR_F_n_estimators 900

means that the GBR force model is going to have 900 estimators.

**GBR_F_max_depth** Array of integers with the maximal depth of each
of the GBR force models declared in *GBR_F_models_to_train*.
In the sample ‘training.in’ ::

	GBR_F_max_depth 8

means that the GBR force model is going to have a maximal depth of 8.

**GBR_F_min_samples_split** Array of integers with the minimum number of
samples required to split an internal node in each of the GBR
force models declared in *GBR_F_models_to_train*. In the sample ‘training.in’ ::

	GBR_F_min_samples_split 2

means that in the GBR force model the minimum number of samples is going to be 2.

**GBR_F_learning_rate** Array of floats with the learning rates of the
GBR force models declared in *GBR_F_models_to_train*. In the sample ‘training.in’ ::

	GBR_F_learning_rate 0.1732

means that in the GBR force model the learning rate is going to be 0.1732.

**GBR_F_min_samples_leaf** Array of integers with the minimum number of
samples required to be at a terminal node in each of the GBR force
models declared in *GBR_F_models_to_train*. In the sample ‘training.in’ ::

	GBR_F_min_samples_leaf 3

means that in the GBR force model the minimum number of samples is going to be 3.

**nn_E_models_to_train** The number of neural networks (nn) to be trained for
energy predictions. *mlmd_* selects the model (GBR or nn) with
the lowest error in the validation set to be part of the machine
learning potential. In the sample ‘training.in’ ::

	nn_E_models_to_train 2

means that mlmd is going to train 2 nn models to predict the energy.

**nn_E_learning_rate** Array of floats with the learning rates of the
nn energy models declared in *nn_E_models_to_train*. In the sample ‘training.in’ ::

	nn_E_learning_rate 0.0012 0.0009

means that in the first nn energy model the learning rate is going
to be 0.0012, while in the second model it is going to be 0.0009.

**nn_E_training_steps** Array of integers with the number of training
steps for each of the models declared in *nn_E_models_to_train*.
In the sample ‘training.in’ ::

	nn_E_training_steps 100 100

means that in both nn models the number of training steps is going to be 100.

**nn_E_batch_size** Array of integers with the size of the
training batches for each of the models declared in *nn_E_models_to_train*.
In the sample ‘training.in’ ::

	nn_E_batch_size 25 30

means that in the first nn model each training batch is going to
have 25 examples, while in the second nn model each batch is going to have 30 examples.

**nn_E_architecture_i** Array of integers describing the architecture of the
‘i’-th nn declared in *nn_E_models_to_train*. The number of entries in the
array is the number of layers, and each entry is the number of nodes in the
corresponding layer. For example, the *‘training.in’* sample declares 2 nn models,
so there are also 2 architectures. The first one ::

	nn_E_architecture_1 152 80 20 1

represents an nn with 4 layers: the first layer has 152 nodes (the
number of nodes in the first layer must be equal to the number of dimensions
of the feature vectors), the second layer has 80 nodes, the third layer
has 20 nodes, and the output layer has 1 node (the output layer must
always have 1 node).
For the second architecture in the ‘training.in’ sample ::

	nn_E_architecture_2 152 80 40 20 1

the number of layers is 5: the first layer has 152 nodes, the second layer has
80 nodes, the third layer has 40 nodes, the fourth layer has 20 nodes,
and the fifth layer (output) has 1 node.
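
The two constraints stated for an architecture array (the first layer matches the dimension of the feature vectors, and the output layer has exactly 1 node) are easy to check before training. The following is a minimal sketch; ``check_architecture`` is a hypothetical helper, and the feature dimension of 152 is simply taken from the sample file.

```python
def check_architecture(layers, n_features):
    """Validate one nn architecture array and return its number of layers."""
    if layers[0] != n_features:
        raise ValueError(
            f"first layer has {layers[0]} nodes, expected {n_features}"
        )
    if layers[-1] != 1:
        raise ValueError("the output layer must have exactly 1 node")
    return len(layers)


# The two energy architectures from the sample file.
print(check_architecture([152, 80, 20, 1], n_features=152))
print(check_architecture([152, 80, 40, 20, 1], n_features=152))
```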

**nn_F_models_to_train** The number of neural networks to be trained for
force predictions. *mlmd_* selects the model (GBR or nn) with
the lowest error in the validation set to be part of the machine
learning potential. In the sample ‘training.in’ ::

	nn_F_models_to_train 2

means that mlmd is going to train 2 nn models to predict the force.

**nn_F_learning_rate** Array of floats with the learning rates of the nn
force models declared in *nn_F_models_to_train*. In the sample ‘training.in’ ::

	nn_F_learning_rate 0.0018 0.00009

means that in the first nn force model the learning rate is going
to be 0.0018, while in the second model it is going to be 0.00009.

**nn_F_training_steps** Array of integers with the number of training
steps for each of the models declared in *nn_F_models_to_train*.
In the sample ‘training.in’ ::

	nn_F_training_steps 1000 1000

means that in both nn models the number of training steps is going to be 1000.

**nn_F_batch_size** Array of integers with the size of the
training batches for each of the models declared in *nn_F_models_to_train*.
In the sample ‘training.in’ ::

	nn_F_batch_size 100 250

means that in the first nn model each training batch is going to
have 100 examples, while in the second nn model each batch is going to have 250 examples.

**nn_F_architecture_i** Array of integers describing the architecture of the
‘i’-th nn declared in *nn_F_models_to_train*. The number of entries in the
array is the number of layers, and each entry is the number of nodes in the
corresponding layer. For example, the ‘training.in’ sample declares 2 nn models,
so there are also 2 architectures. The first one ::

	nn_F_architecture_1 152 80 20 1

represents an nn with 4 layers: the first layer has 152 nodes (the
number of nodes in the first layer must be equal to the number of
dimensions of the feature vectors), the second layer has 80 nodes, the
third layer has 20 nodes, and the output layer has 1 node (the output
layer must always have 1 node). For the second architecture in the ‘training.in’ sample ::

	nn_F_architecture_2 152 80 40 20 1

the number of layers is 5: the first layer has 152 nodes, the second layer has
80 nodes, the third layer has 40 nodes, the fourth layer has 20 nodes,
and the fifth layer (output) has 1 node.
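
Unlike the other nn flags, the architecture is given by one numbered flag per model (*nn_F_architecture_1*, *nn_F_architecture_2*, ...), so collecting the settings of the i-th model means reading position i of each array plus the i-th architecture flag. The sketch below shows this for the force models of the sample file; ``nn_settings`` is a hypothetical helper and not part of *mlmd_*.

```python
# nn force flags as name -> list of string values, copied from the sample file.
flags = {
    "nn_F_models_to_train": ["2"],
    "nn_F_learning_rate": ["0.0018", "0.00009"],
    "nn_F_training_steps": ["1000", "1000"],
    "nn_F_batch_size": ["100", "250"],
    "nn_F_architecture_1": ["152", "80", "20", "1"],
    "nn_F_architecture_2": ["152", "80", "40", "20", "1"],
}


def nn_settings(flags, target):
    """Return one settings dict per nn model for target 'E' or 'F'."""
    prefix = f"nn_{target}_"
    n_models = int(flags[prefix + "models_to_train"][0])
    models = []
    for i in range(n_models):
        models.append({
            "learning_rate": float(flags[prefix + "learning_rate"][i]),
            "training_steps": int(flags[prefix + "training_steps"][i]),
            "batch_size": int(flags[prefix + "batch_size"][i]),
            # Architecture flags are numbered starting from 1.
            "architecture": [
                int(n) for n in flags[f"{prefix}architecture_{i + 1}"]
            ],
        })
    return models


force_models = nn_settings(flags, "F")
print(force_models[1])
```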