DecisionTree Class#
This is the class used to construct a decision tree. It uses three individual components, namely a Criteria class, a LeafBuilder class and a Predict class, to construct specific types of decision trees that can then be applied to data.
Instead of specifying all three components individually, the user can specify only the tree_type, which then internally selects the corresponding default components for several established tree algorithms (see the user guide).
For more advanced modifications, it might be necessary to change how the splitting is performed. This can be done by passing a custom Splitter class.
The DecisionTree class can be imported as follows:
Attributes:
Name | Type | Description |
---|---|---|
max_depth | int | The maximum depth of the tree. |
tree_type | str | The type of tree, either a string specifying a supported type (currently "Regression", "Classification", "Quantile" or "Gradient") or None. |
leaf_nodes | list[LeafNode] | A list of all leaf nodes in the tree. |
root | Node | The root node of the tree. |
n_nodes | int | The number of nodes in the tree. |
n_features | int | The number of features in the training data. |
n_rows | int | The number of rows (i.e., samples) in the training data. |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tree_type | str \| None | The type of tree, either a string specifying a supported type (currently "Regression", "Classification", "Quantile" or "Gradient") or None. | None |
max_depth | int | The maximum depth of the tree. | maxsize |
impurity_tol | float | The tolerance of impurity in a leaf node. | 0 |
max_features | int \| float \| Literal['sqrt', 'log2'] \| None | The number of features to consider when looking for a split. | None |
min_samples_split | int | The minimum number of samples in a split. | 1 |
min_samples_leaf | int | The minimum number of samples in a leaf node. | 1 |
min_improvement | float | The minimum improvement gained from performing a split. | 0 |
criteria | Type[Criteria] \| None | The Criteria class to use. If None, it defaults to the tree_type default. | None |
leaf_builder | Type[LeafBuilder] \| None | The LeafBuilder class to use. If None, it defaults to the tree_type default. | None |
predict | Type[Predict] \| None | The Predict class to use. If None, it defaults to the tree_type default. | None |
splitter | Type[Splitter] \| None | The Splitter class to use. If None, it defaults to the default Splitter class. | None |
skip_check_input | bool | Skips all error checking on the features and response in the fitting function of a tree. Should only be used if you know what you are doing. | False |
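The max_features parameter accepts several kinds of values. The sketch below shows one plausible way such a setting could be turned into a concrete number of candidate features. The helper name resolve_max_features and the exact rounding rules are assumptions made for illustration (they follow a convention common in tree libraries) and are not taken from this class.

```python
import math


def resolve_max_features(max_features, n_features):
    """Hypothetical helper: map a max_features setting to a feature count.

    The rounding rules are assumptions, not this library's documented behavior.
    """
    if max_features is None:
        # Assumed: None means that every feature is considered.
        return n_features
    if max_features == "sqrt":
        return max(1, math.isqrt(n_features))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, bool):
        raise TypeError("max_features must not be a bool")
    if isinstance(max_features, int):
        # Assumed: an int is an absolute count, capped at n_features.
        return max(1, min(max_features, n_features))
    if isinstance(max_features, float):
        # Assumed: a float is a fraction of the available features.
        return max(1, int(max_features * n_features))
    raise ValueError(f"Unsupported max_features value: {max_features!r}")


counts = {
    setting: resolve_max_features(setting, n_features=16)
    for setting in (None, "sqrt", "log2", 3, 0.5)
}
```

Under these assumptions, a tree fitted on 16 features would consider 4 features per split for both "sqrt" and "log2".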
fit #
Fit the decision tree with training data (X, Y).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like object of dimension 2 | The feature values used for training. Internally it will be converted to np.ndarray with dtype=np.float64. | required |
Y | array-like object of 1 or 2 dimensions | The response values used for training. Internally it will be converted to np.ndarray with dtype=np.float64. | required |
sample_indices | array-like object of dimension 1 \| None | A vector specifying samples of the training data that should be used during training. If None all samples are used. | None |
sample_weight | array-like object of dimension 1 \| None | Sample weights. May not be implemented for every criteria. | None |
predict #
Predict response values at X using the fitted decision tree. The behavior of this function is determined by the Predict class used in the decision tree. For the currently existing tree types, the corresponding behavior is as follows:
Classification:
Returns the class with the highest proportion within the final leaf node.
Given predict_proba=True, it instead calculates the probability distribution.
Regression:
Returns the mean value of the response within the final leaf node.
Quantile:
Returns the conditional quantile of the response, where the quantile is specified by passing a list of quantiles via the quantile parameter.
Gradient:
Returns a matrix with columns corresponding to different orders of derivatives that can be provided via the 'orders' parameter. Default behavior is to compute orders 0, 1 and 2.
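The first three behaviors above can be illustrated on the responses stored in a single leaf node. The following is a minimal pure-Python sketch, not the library's implementation: the function names are hypothetical, and the linear interpolation used for the quantile is an assumption.

```python
from collections import Counter
from statistics import mean


def classification_prediction(leaf_labels, predict_proba=False):
    """Majority class of a leaf, or class proportions if predict_proba is set."""
    counts = Counter(leaf_labels)
    if predict_proba:
        n = len(leaf_labels)
        # Proportions are reported in sorted class order.
        return [counts[c] / n for c in sorted(counts)]
    return counts.most_common(1)[0][0]


def regression_prediction(leaf_values):
    """Mean response of the training samples in a leaf."""
    return mean(leaf_values)


def quantile_prediction(leaf_values, quantiles):
    """Empirical quantiles of a leaf (linear interpolation is an assumption)."""
    ordered = sorted(leaf_values)
    result = []
    for q in quantiles:
        pos = q * (len(ordered) - 1)
        lower = int(pos)
        upper = min(lower + 1, len(ordered) - 1)
        frac = pos - lower
        result.append(ordered[lower] + frac * (ordered[upper] - ordered[lower]))
    return result


majority = classification_prediction(["a", "b", "a"])
proportions = classification_prediction(["a", "b", "a"], predict_proba=True)
leaf_mean = regression_prediction([1.0, 2.0, 3.0])
leaf_quantiles = quantile_prediction([4, 1, 3, 2], [0.5, 1.0])
```

In a fitted tree, each row of X is first routed to its leaf node, and a rule of this kind is then applied to the training responses stored there.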
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like object of dimension 2 | New samples at which to predict the response. Internally it will be converted to np.ndarray with dtype=np.float64. | required |
Returns:
Type | Description |
---|---|
ndarray | (N, K) numpy array with the prediction, where K depends on the Predict class and is generally 1. |
predict_leaf #
Computes a hash table that records which LeafNode each row of the provided X falls into.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like object of dimension 2 | 2-dimensional array for which the rows are the samples at which to predict. | required |
Returns:
Type | Description |
---|---|
dict | A hash table with keys corresponding to LeafNode ids and values corresponding to lists of indices of the rows that land in a given LeafNode. |
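The shape of the returned hash table can be illustrated with a small stand-in for a fitted tree. In the sketch below, toy_leaf_id is a hypothetical hard-coded routing rule with three leaves (ids 0, 1 and 2) and leaf_table is a hypothetical helper; neither is part of the library.

```python
from collections import defaultdict


def toy_leaf_id(row):
    """Hypothetical stand-in for a fitted tree: routes a row to a leaf id."""
    if row[0] <= 0.5:
        return 0
    return 1 if row[1] <= 2.0 else 2


def leaf_table(X, leaf_id_of=toy_leaf_id):
    """Map each leaf id to the list of row indices of X that land in it."""
    table = defaultdict(list)
    for index, row in enumerate(X):
        table[leaf_id_of(row)].append(index)
    return dict(table)


X = [[0.1, 5.0], [0.9, 1.0], [0.7, 3.0], [0.2, 0.0]]
table = leaf_table(X)
# table == {0: [0, 3], 1: [1], 2: [2]}
```

Rows 0 and 3 share leaf 0 because both have a first feature of at most 0.5.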
predict_weights #
Predicts a weight matrix W, where W[i,j] indicates if X[i, :] and Xtrain[j, :] are in the same leaf node, where Xtrain denotes the training data. If scale is True, then the value is divided by the number of other training samples in the same leaf node.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ArrayLike \| None | New samples to predict a weight (corresponding to rows in the output). If None then the training data is used as X. | None |
scale | bool | Whether to do row-wise scaling. | True |
Returns:
Type | Description |
---|---|
ndarray | A numpy array of shape MxN, where N denotes the number of rows of the original training data and M the number of rows of X. |
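To make the structure of W concrete, the sketch below builds it from leaf assignments using plain Python lists. The function weight_matrix is hypothetical, and the scaling shown (each non-empty row divided by its sum, so that it sums to 1) is an assumption about what row-wise scaling means here.

```python
def weight_matrix(leaf_ids_x, leaf_ids_train, scale=True):
    """Build an M x N matrix from leaf ids of M new rows and N training rows."""
    weights = []
    for leaf_x in leaf_ids_x:
        # 1.0 where the new row and the training row share a leaf node.
        row = [1.0 if leaf_x == leaf_t else 0.0 for leaf_t in leaf_ids_train]
        total = sum(row)
        if scale and total > 0:
            row = [w / total for w in row]
        weights.append(row)
    return weights


leaf_ids_train = [0, 0, 1]  # leaf of each training row
leaf_ids_x = [0, 1]         # leaf of each new row
W = weight_matrix(leaf_ids_x, leaf_ids_train)
W_unscaled = weight_matrix(leaf_ids_x, leaf_ids_train, scale=False)

# With scaled weights, a leaf-mean prediction is a weighted average of Ytrain.
y_train = [1.0, 3.0, 10.0]
predictions = [sum(w * y for w, y in zip(row, y_train)) for row in W]
```

Here W has 2 rows (one per new sample) and 3 columns (one per training sample), and the weighted averages reproduce the leaf means 2.0 and 10.0.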
refit_leaf_nodes #
Refits the leaf nodes in a previously fitted decision tree.
More precisely, the method removes all leaf nodes created during the initial fit and replaces them with new leaf nodes, which are formed by passing the samples of X listed in sample_indices through the tree.
This method can be used to update the leaf nodes in a decision tree based on new data while keeping the original splitting rules. If X does not contain the original training data, the tree structure might change, because leaf nodes without samples are collapsed. The method is also used to create honest splitting in RandomForests.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | array-like object of dimension 2 | The feature values used for training. Internally it will be converted to np.ndarray with dtype=np.float64. | required |
Y | array-like object of dimension 1 or 2 | The response values used for training. Internally it will be converted to np.ndarray with dtype=np.float64. | required |
sample_weight | array-like object of dimension 1 \| None | Sample weights. May not be implemented for all criteria. | None |
sample_indices | ArrayLike \| None | Indices of the rows of X that are used to create the new leaf nodes. | None |
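The idea of keeping the splitting rules while replacing the leaf contents can be sketched without the library. Below, stump is a hypothetical fixed splitting rule and refit_leaf_values is a hypothetical helper that recomputes leaf means from the selected samples. Treating sample_indices=None as "use all rows" is an assumption of the sketch.

```python
from collections import defaultdict
from statistics import mean


def stump(row):
    """Hypothetical fixed splitting rule kept from an earlier fit."""
    return 0 if row[0] <= 0.5 else 1


def refit_leaf_values(leaf_id_of, X, Y, sample_indices=None):
    """Recompute leaf means from the chosen samples, keeping the split rule."""
    if sample_indices is None:
        sample_indices = range(len(X))  # assumption: None selects all rows
    groups = defaultdict(list)
    for i in sample_indices:
        groups[leaf_id_of(X[i])].append(Y[i])
    # Leaves that receive no samples do not appear in the result.
    return {leaf: mean(values) for leaf, values in groups.items()}


X = [[0.1], [0.4], [0.6], [0.9]]
Y = [1.0, 3.0, 10.0, 20.0]
refit_all = refit_leaf_values(stump, X, Y, sample_indices=[1, 2, 3])
refit_right_only = refit_leaf_values(stump, X, Y, sample_indices=[2, 3])
```

In the second call, leaf 0 receives no samples and drops out of the result, which is a simplified stand-in for the collapsing of empty leaf nodes described above. Honest splitting follows the same pattern: one part of the data determines the splits and a disjoint part fills the leaves.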
similarity #
Computes a similarity matrix W of size NxM, where each element W[i, j] is 1 if and only if X0[i, :] and X1[j, :] end up in the same leaf node.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X0 | ArrayLike | Array corresponding to rows of W in the output. | required |
X1 | ArrayLike | Array corresponding to columns of W in the output. | required |
Returns:
Type | Description |
---|---|
ndarray | An np.ndarray of shape NxM, where N is the number of rows of X0 and M the number of rows of X1. |