RandomForest Class

This is the class used to construct a random forest. A random forest consists of multiple individual decision trees that are trained on subsets of the data and then combined via averaging. This can greatly improve generalization performance by avoiding the tendency of individual decision trees to overfit the training data. Since a random forest fits individual trees, many of the parameters and much of the functionality of this class overlap with the DecisionTree class.

The RandomForest can be imported as follows:

from adaXT.random_forest import RandomForest

Attributes:

- `max_features` (`int | float | Literal["sqrt", "log2"] | None`): The number of features to consider when looking for a split.
- `max_depth` (`int`): The maximum depth of the tree.
- `forest_type` (`str`): The type of random forest; currently one of `"Regression"`, `"Classification"`, `"Quantile"` or `"Gradient"`.
- `n_estimators` (`int`): The number of trees in the random forest.
- `n_jobs` (`int`): The number of processes used to fit and predict with the forest; `-1` uses all available processors.
- `sampling` (`str | None`): One of `"resampling"`, `"honest_tree"`, `"honest_forest"` or `None`.
- `sampling_args` (`dict | None`): Controls the behavior of the sampling scheme. If `None`, all keys are set to their defaults. The following keys are available:
    - `"size"` (`int | float`, default `1.0`): Used by all sampling schemes. Specifies the number of samples drawn. An `int` gives the number of random resamples; a `float` gives the size relative to the training sample size.
    - `"replace"` (`bool`, default `True`): Used by all sampling schemes. If `True`, resamples are drawn with replacement, otherwise without replacement.
    - `"split"` (`int | float`, default `0.5`): Used by the honest sampling schemes. Specifies how to divide the sample into fitting and prediction indices. An `int` gives the number of fitting indices; a `float` gives their relative size. The remaining indices are used as prediction indices (truncated if the value is too large).
- `impurity_tol` (`float`): The tolerance of impurity in a leaf node.
- `min_samples_split` (`int`): The minimum number of samples in a split.
- `min_samples_leaf` (`int`): The minimum number of samples in a leaf node.
- `min_improvement` (`float`): The minimum improvement gained from performing a split.

Parameters:

- `forest_type` (`str`, required): The type of random forest; currently one of `"Regression"`, `"Classification"`, `"Quantile"` or `"Gradient"`.
- `n_estimators` (`int`, default `100`): The number of trees in the random forest.
- `n_jobs` (`int`, default `-1`): The number of processes used to fit and predict with the forest; `-1` uses all available processors.
- `sampling` (`str | None`, default `"resampling"`): One of `"resampling"`, `"honest_tree"`, `"honest_forest"` or `None`.
- `sampling_args` (`dict | None`, default `None`): Controls the behavior of the sampling scheme. If `None`, all keys are set to their defaults. The following keys are available:
    - `"size"` (`int | float`, default `1.0`): Used by all sampling schemes. Specifies the number of samples drawn. An `int` gives the number of random resamples; a `float` gives the size relative to the training sample size.
    - `"replace"` (`bool`, default `True`): Used by all sampling schemes. If `True`, resamples are drawn with replacement, otherwise without replacement.
    - `"split"` (`int | float`, default `0.5`): Used by the honest sampling schemes. Specifies how to divide the sample into fitting and prediction indices. An `int` gives the number of fitting indices; a `float` gives their relative size. The remaining indices are used as prediction indices (truncated if the value is too large).
- `max_features` (`int | float | Literal["sqrt", "log2"] | None`, default `None`): The number of features to consider when looking for a split.
- `max_depth` (`int`, default `maxsize`): The maximum depth of the tree.
- `impurity_tol` (`float`, default `0`): The tolerance of impurity in a leaf node.
- `min_samples_split` (`int`, default `1`): The minimum number of samples in a split.
- `min_samples_leaf` (`int`, default `1`): The minimum number of samples in a leaf node.
- `min_improvement` (`float`, default `0`): The minimum improvement gained from performing a split.
- `seed` (`int | None`, default `None`): Seed used to make a RandomForest reproducible.
- `criteria` (`Criteria | None`, default `None`): The Criteria class to use; if `None`, it defaults to the `forest_type` default.
- `leaf_builder` (`LeafBuilder | None`, default `None`): The LeafBuilder class to use; if `None`, it defaults to the `forest_type` default.
- `predict` (`type[Predict] | None`, default `None`): The Predict class to use; if `None`, it defaults to the `forest_type` default.
- `splitter` (`Splitter | None`, default `None`): The Splitter class to use; if `None`, it defaults to the default Splitter class.

fit

fit(X, Y, sample_weight=None)

Fit the random forest with training data (X, Y).

Parameters:

- `X` (array-like of dimension 2, required): The feature values used for training. Internally converted to `np.ndarray` with `dtype=np.float64`.
- `Y` (array-like, required): The response values used for training. Internally converted to `np.ndarray` with `dtype=np.float64`.
- `sample_weight` (`np.ndarray | None`, default `None`): Sample weights. Currently not implemented.

predict

predict(X, **kwargs)

Predicts response values at X using the fitted random forest. The behavior of this function is determined by the Prediction class used in the decision trees. For the currently existing tree types, the behavior is as follows:

Classification:

Returns the class based on majority vote among the trees. In the case of tie, the lowest class with the maximum number of votes is returned.

Regression:

Returns the average response among all trees.

Quantile:

Returns the conditional quantiles of the response, where the quantiles are specified by passing a list of quantiles via the quantile parameter.

Parameters:

- `X` (array-like of dimension 2, required): New samples at which to predict the response. Internally converted to `np.ndarray` with `dtype=np.float64`.

Returns:

- `np.ndarray`: An (N, K) array with the predictions, where K depends on the Prediction class and is generally 1.
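The classification tie-breaking rule (the lowest class among those with the maximal vote count wins) can be illustrated with plain NumPy, independently of adaXT; the vote counts below are hypothetical:

```python
import numpy as np

# Votes from five hypothetical trees for one sample, over classes 0, 1, 2.
votes = np.array([2, 0, 0, 2, 1])

# np.bincount tallies votes per class; np.argmax returns the first
# (i.e. lowest) class index attaining the maximum, which matches the
# tie-breaking rule described above.
counts = np.bincount(votes, minlength=3)  # [2, 1, 2]
winner = int(np.argmax(counts))
print(winner)  # 0: classes 0 and 2 tie with two votes each; the lower wins
```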

predict_weights

predict_weights(X=None, scale=True)

Predicts a weight matrix Z, where Z_{i,j} indicates whether X_i and X0_j are in the same leaf node, with X0 denoting the training data. If scale is True, each entry is divided by the number of training samples in the leaf node and averaged over all estimators of the forest. If scale is False, the entries are not row-wise scaled and are instead summed over all estimators of the forest.

Parameters:

- `X` (`ArrayLike | None`, default `None`): New samples for which to predict weights. If `None`, `X` is set to the training and/or prediction data of size N x d.
- `scale` (`bool`, default `True`): Whether to perform row-wise scaling.

Returns:

- `np.ndarray`: An array of shape M x N, where N denotes the number of rows of the training and/or prediction data.
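The semantics of the weight matrix can be sketched for a single hypothetical tree using precomputed leaf assignments; this illustrates the definition above, not adaXT's internal implementation:

```python
import numpy as np

# Hypothetical leaf ids of three new samples and four training samples
# in one tree of the forest.
leaf_new = np.array([0, 1, 1])
leaf_train = np.array([0, 0, 1, 1])

# Z[i, j] = 1 if new sample i and training sample j share a leaf.
Z = (leaf_new[:, None] == leaf_train[None, :]).astype(float)

# Row-wise scaling: divide each entry by the number of training samples
# in that leaf (the row sum), as with scale=True. For a forest, these
# per-tree matrices are then averaged over all estimators.
Z_scaled = Z / Z.sum(axis=1, keepdims=True)
print(Z_scaled[0])  # [0.5, 0.5, 0.0, 0.0]
```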

similarity

similarity(X0, X1)

Computes a similarity matrix Z of size NxM, where each element Z_{i,j} is 1 if X0_i and X1_j end up in the same leaf node. Z is then averaged over all estimators of the forest.

Parameters:

- `X0` (`ArrayLike`, required): Array corresponding to the row elements of Z.
- `X1` (`ArrayLike`, required): Array corresponding to the column elements of Z.

Returns:

- `np.ndarray`: An array of shape NxM.
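The averaging over estimators can be sketched with hypothetical leaf assignments from two trees (plain NumPy, for illustration only):

```python
import numpy as np

# Hypothetical leaf ids of X0 (2 samples) and X1 (3 samples) in each of
# two trees; rows index trees, columns index samples.
leaves_X0 = np.array([[0, 1],
                      [1, 1]])
leaves_X1 = np.array([[0, 0, 1],
                      [1, 0, 1]])

# Per tree: Z_t[i, j] = 1 if X0_i and X1_j share a leaf; then average
# the per-tree matrices over all trees.
Z = np.mean(
    [(t0[:, None] == t1[None, :]).astype(float)
     for t0, t1 in zip(leaves_X0, leaves_X1)],
    axis=0,
)
print(Z.shape)  # (2, 3), i.e. NxM
```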