RandomForest Class
This is the class used to construct a random forest. A random forest consists of multiple individual decision trees that are trained on subsets of the data and then combined via averaging. This can greatly improve generalization performance by counteracting the tendency of decision trees to overfit to the training data. Since a random forest learns individual trees, many of the parameters and much of the functionality in this class overlap with the DecisionTree class.
The RandomForest can be imported as follows:
Attributes:
Name | Type | Description |
---|---|---|
max_features | int \| float \| Literal["sqrt", "log2"] \| None = None | The number of features to consider when looking for a split. |
max_depth | int | The maximum depth of the tree. |
forest_type | str | The type of random forest, given as a string specifying a supported type (currently "Regression", "Classification", "Quantile" or "Gradient"). |
n_estimators | int | The number of trees in the random forest. |
n_jobs | int | The number of processes used to fit and predict for the forest; -1 uses all available processors. |
sampling | str \| None | Either resampling, honest_tree, honest_forest or None. |
sampling_args | dict \| None | A parameter used to control the behavior of the sampling scheme. The following arguments are available: 'size': Either int or float, used by all sampling schemes (default 1.0). Specifies the number of samples drawn. If int, it corresponds to the number of random resamples. If float, it corresponds to the relative size with respect to the training sample size. 'replace': Bool, used by all sampling schemes (default True). If True, resamples are drawn with replacement, otherwise without replacement. 'split': Either int or float, used by the honest splitting schemes (default 0.5). Specifies how to divide the sample into fitting and prediction indices. If int, it corresponds to the size of the fitting indices, while the remaining indices are used as prediction indices (truncated if the value is too large). If float, it corresponds to the relative size of the fitting indices, while the remaining indices are used as prediction indices (truncated if the value is too large). If None, all parameters are set to their defaults. |
impurity_tol | float | The tolerance of impurity in a leaf node. |
min_samples_split | int | The minimum number of samples in a split. |
min_samples_leaf | int | The minimum number of samples in a leaf node. |
min_improvement | float | The minimum improvement gained from performing a split. |
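The table does not spell out how each accepted form of max_features maps to a feature count. The sketch below shows one common convention as an illustration only: the helper name `resolve_max_features` is hypothetical, and the float-as-fraction and clamping rules are assumptions, not confirmed behavior of this class.

```python
import math


def resolve_max_features(max_features, n_features):
    """Map a max_features setting to a concrete number of features.

    Hypothetical helper: the rules below follow a widespread convention
    and are an assumption, not this library's documented behavior.
    """
    if max_features is None:
        # No restriction: consider every feature at each split.
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        # Assumed: a float is a fraction of the available features.
        return max(1, int(max_features * n_features))
    # Assumed: an int is an absolute count, capped at n_features.
    return min(int(max_features), n_features)
```

With 16 features, for example, both "sqrt" and "log2" resolve to 4 under these assumptions, while 0.5 resolves to 8.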
Parameters:
Name | Type | Description | Default |
---|---|---|---|
forest_type | str | The type of random forest, given as a string specifying a supported type (currently "Regression", "Classification", "Quantile" or "Gradient"). | required |
n_estimators | int | The number of trees in the random forest. | 100 |
n_jobs | int | The number of processes used to fit and predict for the forest; -1 uses all available processors. | -1 |
sampling | str \| None | Either resampling, honest_tree, honest_forest or None. | 'resampling' |
sampling_args | dict \| None | A parameter used to control the behavior of the sampling scheme. The following arguments are available: 'size': Either int or float, used by all sampling schemes (default 1.0). Specifies the number of samples drawn. If int, it corresponds to the number of random resamples. If float, it corresponds to the relative size with respect to the training sample size. 'replace': Bool, used by all sampling schemes (default True). If True, resamples are drawn with replacement, otherwise without replacement. 'split': Either int or float, used by the honest splitting schemes (default 0.5). Specifies how to divide the sample into fitting and prediction indices. If int, it corresponds to the size of the fitting indices, while the remaining indices are used as prediction indices (truncated if the value is too large). If float, it corresponds to the relative size of the fitting indices, while the remaining indices are used as prediction indices (truncated if the value is too large). If None, all parameters are set to their defaults. | None |
max_features | int \| float \| Literal['sqrt', 'log2'] \| None | The number of features to consider when looking for a split. | None |
max_depth | int | The maximum depth of the tree. | maxsize |
impurity_tol | float | The tolerance of impurity in a leaf node. | 0 |
min_samples_split | int | The minimum number of samples in a split. | 1 |
min_samples_leaf | int | The minimum number of samples in a leaf node. | 1 |
min_improvement | float | The minimum improvement gained from performing a split. | 0 |
seed | int \| None | Seed used to reproduce a RandomForest. | None |
criteria | Criteria | The Criteria class to use, if None it defaults to the forest_type default. | None |
leaf_builder | LeafBuilder | The LeafBuilder class to use, if None it defaults to the forest_type default. | None |
predict | type[Predict] \| None | The Prediction class to use, if None it defaults to the forest_type default. | None |
splitter | Splitter \| None | The Splitter class to use, if None it defaults to the default Splitter class. | None |
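To make the 'size', 'replace' and 'split' entries of sampling_args more concrete, here is a minimal pure-Python sketch of how such arguments could be turned into index sets. The helper names (`resolve_count`, `honest_split`, `resample`) are hypothetical, and the order of operations (shuffle, split, then resample within a part) is an assumption, not the library's actual implementation.

```python
import random


def resolve_count(value, n):
    """Interpret an int as an absolute count and a float as a fraction of n."""
    if isinstance(value, float):
        return int(value * n)
    return int(value)


def honest_split(n, split=0.5, seed=None):
    """Divide indices 0..n-1 into disjoint fitting and prediction indices."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    # Truncate if the requested fitting size exceeds the sample size.
    n_fit = min(resolve_count(split, n), n)
    return indices[:n_fit], indices[n_fit:]


def resample(indices, size=1.0, replace=True, seed=None):
    """Draw a resample from the given indices, with or without replacement."""
    rng = random.Random(seed)
    k = resolve_count(size, len(indices))
    if replace:
        return [rng.choice(indices) for _ in range(k)]
    # Without replacement at most len(indices) draws are possible.
    return rng.sample(indices, min(k, len(indices)))


fit_idx, pred_idx = honest_split(10, split=0.5, seed=0)
bootstrap = resample(fit_idx, size=1.0, replace=True, seed=0)
```

Splitting before resampling keeps the fitting and prediction indices disjoint even when resamples are drawn with replacement, which is the property that honest schemes rely on.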
fit
Fit the random forest with training data (X, Y).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like object of dimension 2 | The feature values used for training. Internally it will be converted to np.ndarray with dtype=np.float64. | required |
Y | array-like object | The response values used for training. Internally it will be converted to np.ndarray with dtype=np.float64. | required |
sample_weight | ndarray \| None | Sample weights. Currently not implemented. | None |
predict
Predicts response values at X using the fitted random forest. The behavior of this function is determined by the Prediction class used by the trees in the forest. For the currently existing forest types the corresponding behavior is as follows:
Classification:
Returns the class based on a majority vote among the trees. In the case of a tie, the lowest class with the maximum number of votes is returned.
Regression:
Returns the average response among all trees.
Quantile:
Returns the conditional quantile of the response, where the quantile is specified by passing a list of quantiles via the quantile parameter.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like object of dimension 2 | New samples at which to predict the response. Internally it will be converted to np.ndarray with dtype=np.float64. | required |
Returns:
Type | Description |
---|---|
ndarray | (N, K) numpy array with the prediction, where K depends on the Prediction class and is generally 1. |
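The classification and regression rules described for predict can be illustrated with a small pure-Python sketch. The function names are hypothetical and the code only mirrors the stated aggregation rules (majority vote with ties going to the lowest class, and a plain average). It is not the library's Prediction class.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent class; ties go to the lowest class label."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)


def average_response(values):
    """Return the mean of the per-tree predictions for one sample."""
    return sum(values) / len(values)


# Per-tree predictions for a single sample (illustrative values).
tree_classes = [1, 0, 1, 0]        # two votes each for class 0 and class 1
tree_responses = [1.0, 2.0, 3.0]

predicted_class = majority_vote(tree_classes)    # tie resolved to class 0
predicted_value = average_response(tree_responses)
```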
predict_weights
Predicts a weight matrix Z, where Z_{i,j} indicates whether X_i and X0_j are in the same leaf node, and X0 denotes the training data. If scale is True, the value is divided by the number of other training data in the leaf node and averaged over all the estimators of the forest. If scale is False, the value is not row-wise scaled and is instead summed over all estimators of the forest.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ArrayLike \| None | New samples for which to predict weights. If None, X is taken to be the training and/or prediction data of size Nxd. | None |
scale | bool | Whether to do row-wise scaling. | True |
Returns:
Type | Description |
---|---|
ndarray | A numpy array of shape MxN, where M denotes the number of rows of X and N denotes the number of rows of the training and/or prediction data. |
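As an illustration of the weight matrix described for predict_weights, the sketch below builds Z from leaf assignments. It assumes each tree is summarized by the leaf id of every training sample and of every new sample. The function name `forest_weights` is hypothetical, and the scaling divides by the number of training samples in the matching leaf, which is one reading of the description rather than confirmed behavior.

```python
from collections import Counter


def forest_weights(new_leaves, train_leaves, scale=True):
    """Build an MxN weight matrix from per-tree leaf assignments.

    new_leaves[t][i] is the leaf of new sample i in tree t;
    train_leaves[t][j] is the leaf of training sample j in tree t.
    """
    n_trees = len(train_leaves)
    m, n = len(new_leaves[0]), len(train_leaves[0])
    Z = [[0.0] * n for _ in range(m)]
    for new, train in zip(new_leaves, train_leaves):
        leaf_size = Counter(train)  # training samples per leaf in this tree
        for i, leaf_i in enumerate(new):
            for j, leaf_j in enumerate(train):
                if leaf_i == leaf_j:
                    Z[i][j] += (1.0 / leaf_size[leaf_j]) if scale else 1.0
    if scale:
        # Average the row-normalized weights over the trees.
        Z = [[z / n_trees for z in row] for row in Z]
    return Z


train_leaves = [[0, 0, 1], [0, 1, 1]]  # 2 trees, 3 training samples
new_leaves = [[0, 1], [0, 1]]          # 2 trees, 2 new samples
Z_scaled = forest_weights(new_leaves, train_leaves, scale=True)
Z_counts = forest_weights(new_leaves, train_leaves, scale=False)
```

Under this reading, every row of the scaled matrix sums to 1, so a regression prediction can be recovered as the weighted average of the training responses.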
similarity
Computes a similarity matrix Z of size NxM, where each element Z_{i,j} is 1 if elements X0_i and X1_j end up in the same leaf node. Z is then averaged over all the estimators of the forest.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X0 | ArrayLike | Array corresponding to row elements of Z. | required |
X1 | ArrayLike | Array corresponding to column elements of Z. | required |
Returns:
Type | Description |
---|---|
ndarray | An NxM-shaped np.ndarray. |
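The similarity computation can be sketched in the same spirit. The example assumes the leaf id of every row of X0 and X1 is already known for each tree. The function name `forest_similarity` is hypothetical and the code only illustrates the averaging described above.

```python
def forest_similarity(leaves0, leaves1):
    """Average, over trees, the indicator that two samples share a leaf.

    leaves0[t][i] is the leaf of X0 row i in tree t;
    leaves1[t][j] is the leaf of X1 row j in tree t.
    """
    n_trees = len(leaves0)
    n, m = len(leaves0[0]), len(leaves1[0])
    Z = [[0.0] * m for _ in range(n)]
    for l0, l1 in zip(leaves0, leaves1):
        for i, a in enumerate(l0):
            for j, b in enumerate(l1):
                if a == b:
                    Z[i][j] += 1.0
    # Each entry becomes the fraction of trees in which the pair shares a leaf.
    return [[z / n_trees for z in row] for row in Z]


leaves0 = [[0, 1], [0, 0]]        # 2 trees, N = 2 rows in X0
leaves1 = [[0, 0, 1], [0, 1, 1]]  # 2 trees, M = 3 rows in X1
Z = forest_similarity(leaves0, leaves1)
```

Each entry lies between 0 and 1, with 1 meaning the two samples fall into the same leaf in every tree.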