DecisionTree Class

This is the class used to construct a decision tree. It combines three individual components (a Criteria class, a LeafBuilder class and a Predictor class) to construct specific types of decision trees that can then be applied to data.

Instead of specifying all three components individually, the user can specify only the tree_type, which internally selects the corresponding default components for several established tree algorithms (see the user guide).

For more advanced modifications, it might be necessary to change how the splitting is performed. This can be done by passing a custom Splitter class.

The DecisionTree class can be imported as follows:

from adaXT.decision_tree import DecisionTree

Attributes:

Name	Type	Description
`max_depth`	`int`	The maximum depth of the tree.
`tree_type`	`str`	The type of tree, either a string specifying a supported type (currently "Regression", "Classification", "Quantile" or "Gradient") or None.
`leaf_nodes`	`list[LeafNode]`	A list of all leaf nodes in the tree.
`root`	`Node`	The root node of the tree.
`n_nodes`	`int`	The number of nodes in the tree.
`n_features`	`int`	The number of features in the training data.
`n_rows`	`int`	The number of rows (i.e., samples) in the training data.

Parameters:

Name	Type	Description	Default
`tree_type`	`str \| None`	The type of tree, either a string specifying a supported type (currently "Regression", "Classification", "Quantile" or "Gradient") or None.	`None`
`max_depth`	`int`	The maximum depth of the tree.	`maxsize`
`impurity_tol`	`float`	The tolerance of impurity in a leaf node.	`0`
`max_features`	`int \| float \| Literal['sqrt', 'log2'] \| None`	The number of features to consider when looking for a split.	`None`
`min_samples_split`	`int`	The minimum number of samples in a split.	`1`
`min_samples_leaf`	`int`	The minimum number of samples in a leaf node.	`1`
`min_improvement`	`float`	The minimum improvement gained from performing a split.	`0`
`criteria`	`Type[Criteria] \| None`	The Criteria class to use, if None it defaults to the tree_type default.	`None`
`leaf_builder`	`Type[LeafBuilder] \| None`	The LeafBuilder class to use, if None it defaults to the tree_type default.	`None`
`predictor`	`Type[Predictor] \| None`	The Predictor class to use, if None it defaults to the tree_type default.	`None`
`splitter`	`Type[Splitter] \| None`	The Splitter class to use, if None it defaults to the default Splitter class.	`None`
`skip_check_input`	`bool`	Whether to skip all error checking on the features and response in the fitting function of the tree. Should only be used if you know what you are doing.	`False`
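
The fallback described for criteria, leaf_builder and predictor can be illustrated with a minimal sketch. It is not adaXT's implementation: resolve_components is a hypothetical helper, the default entries are placeholder strings rather than real class names, and raising an error when no default is available is an assumption.

```python
from typing import Optional

# Placeholder labels only: these strings stand in for the real default
# classes, whose names are not given in this section.
TREE_TYPE_DEFAULTS = {
    tree_type: {
        "criteria": f"<default Criteria for {tree_type}>",
        "leaf_builder": f"<default LeafBuilder for {tree_type}>",
        "predictor": f"<default Predictor for {tree_type}>",
    }
    for tree_type in ("Regression", "Classification", "Quantile", "Gradient")
}


def resolve_components(tree_type: Optional[str] = None, criteria=None,
                       leaf_builder=None, predictor=None) -> dict:
    """Hypothetical helper: pick each component, preferring the user's choice."""
    chosen = {
        "criteria": criteria,
        "leaf_builder": leaf_builder,
        "predictor": predictor,
    }
    defaults = TREE_TYPE_DEFAULTS.get(tree_type, {})
    resolved = {}
    for name, value in chosen.items():
        if value is None:
            # Fall back to the default that belongs to the tree_type.
            value = defaults.get(name)
        if value is None:
            # Assumption: a component without any default is an error.
            raise ValueError(
                f"no {name} given and no default available for tree_type={tree_type!r}"
            )
        resolved[name] = value
    return resolved


# A custom criteria overrides the default; the other two fall back.
components = resolve_components(tree_type="Regression", criteria="MyCriteria")
```

In this sketch, an explicitly passed component always takes precedence over the tree_type default, which matches the parameter descriptions above.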

fit

fit(X, Y, sample_indices=None, sample_weight=None)

Fit the decision tree with training data (X, Y).

Parameters:

Name	Type	Description	Default
`X`	`array-like object of dimension 2`	The feature values used for training. Internally it will be converted to np.ndarray with dtype=np.float64.	required
`Y`	`array-like object of 1 or 2 dimensions`	The response values used for training. Internally it will be converted to np.ndarray with dtype=np.float64.	required
`sample_indices`	`array-like object of dimension 1 \| None`	A vector specifying samples of the training data that should be used during training. If None all samples are used.	`None`
`sample_weight`	`array-like object of dimension 1 \| None`	Sample weights. May not be supported by every Criteria class.	`None`

predict

predict(X, **kwargs)

Predict response values at X using the fitted decision tree. The behavior of this function is determined by the Predictor class used in the decision tree. For the currently existing tree types, the behavior is as follows:

Classification:

Returns the class with the highest proportion within the final leaf node.

Given predict_proba=True, it instead calculates the probability distribution.

Regression:

Returns the mean value of the response within the final leaf node.

Quantile:

Returns the conditional quantile of the response, where the quantile is specified by passing a list of quantiles via the quantile parameter.

Gradient:

Returns a matrix with columns corresponding to different orders of derivatives that can be provided via the 'orders' parameter. Default behavior is to compute orders 0, 1 and 2.

Parameters:

Name	Type	Description	Default
`X`	`array-like object of dimension 2`	New samples at which to predict the response. Internally it will be converted to np.ndarray with dtype=np.float64.	required

Returns:

Type	Description
`ndarray`	(N, K) numpy array with the prediction, where K depends on the Predictor class and is generally 1
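
As a rough illustration of the Classification and Regression behavior, the sketch below computes the value for a single leaf from the training responses that ended up in it. The helper names are hypothetical and plain Python containers are used for brevity; the actual predict method returns an ndarray with one row per sample in X.

```python
from collections import Counter
from statistics import mean


def classification_leaf_prediction(leaf_labels, predict_proba=False):
    """Hypothetical helper: prediction for one leaf of a Classification tree."""
    counts = Counter(leaf_labels)
    if predict_proba:
        # Proportion of each class among the samples in the leaf.
        n = len(leaf_labels)
        return {label: count / n for label, count in sorted(counts.items())}
    # Class with the highest proportion. Tie handling is not specified in
    # this section; Counter breaks ties by insertion order (an assumption).
    return counts.most_common(1)[0][0]


def regression_leaf_prediction(leaf_responses):
    """Hypothetical helper: prediction for one leaf of a Regression tree."""
    return mean(leaf_responses)


labels_in_leaf = ["a", "b", "a", "a"]
predicted_class = classification_leaf_prediction(labels_in_leaf)
predicted_proba = classification_leaf_prediction(labels_in_leaf, predict_proba=True)
predicted_mean = regression_leaf_prediction([1.0, 2.0, 3.0])
```

Here predicted_class is "a", predicted_proba is {"a": 0.75, "b": 0.25}, and predicted_mean is 2.0.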

predict_leaf

predict_leaf(X)

Computes a hash table that records which LeafNode each row of the provided X falls into.

Parameters:

Name	Type	Description	Default
`X`	`array-like object of dimension 2`	2-dimensional array for which the rows are the samples at which to predict.	required

Returns:

Type	Description
`dict`	A hash table with keys corresponding to LeafNode ids and values corresponding to lists of indices of the rows that land in a given LeafNode.
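
The shape of the returned hash table can be sketched with a toy stand-in for a fitted tree. Everything here is hypothetical: toy_leaf_id mimics a tree with a single split on the first feature, and the leaf ids 1 and 2 are made up for illustration.

```python
from collections import defaultdict


def toy_leaf_id(row, threshold=0.5):
    """Hypothetical one-split tree: leaf 1 if row[0] <= threshold, else leaf 2."""
    return 1 if row[0] <= threshold else 2


def leaf_hash_table(X):
    """Map each leaf id to the list of row indices of X that land in it."""
    table = defaultdict(list)
    for index, row in enumerate(X):
        table[toy_leaf_id(row)].append(index)
    return dict(table)


X = [[0.1], [0.9], [0.3], [0.7]]
table = leaf_hash_table(X)
```

For this input, table is {1: [0, 2], 2: [1, 3]}: rows 0 and 2 fall into leaf 1, rows 1 and 3 into leaf 2.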

predict_weights

predict_weights(X=None, scale=True)

Predicts a weight matrix W, where W[i, j] indicates whether X[i, :] and Xtrain[j, :] are in the same leaf node, with Xtrain denoting the training data. If scale is True, then the value is divided by the number of other training samples in the same leaf node.

Parameters:

Name	Type	Description	Default
`X`	`ArrayLike \| None`	New samples to predict a weight (corresponding to rows in the output). If None then the training data is used as X.	`None`
`scale`	`bool`	Whether to do row-wise scaling.	`True`

Returns:

Type	Description
`ndarray`	A numpy array of shape MxN, where N denotes the number of rows of the original training data and M the number of rows of X.
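
A minimal sketch of how such a weight matrix could be assembled once the leaf id of every sample is known. The function is hypothetical and takes leaf ids directly instead of feature arrays. It also assumes that row-wise scaling divides each row by the number of training samples sharing that leaf, so that every non-empty row sums to 1; the library's exact scaling may differ.

```python
def weight_matrix(leaf_ids_X, leaf_ids_train, scale=True):
    """Hypothetical helper: M x N weights from per-sample leaf ids."""
    W = []
    for leaf_x in leaf_ids_X:
        # 1.0 where the training sample shares the leaf of this row of X.
        row = [1.0 if leaf_x == leaf_t else 0.0 for leaf_t in leaf_ids_train]
        total = sum(row)
        if scale and total > 0:
            # Assumed scaling: divide by the number of matching training samples.
            row = [w / total for w in row]
        W.append(row)
    return W


leaf_ids_train = [1, 1, 2]  # N = 3 training samples
leaf_ids_X = [2, 1]         # M = 2 new samples
W_scaled = weight_matrix(leaf_ids_X, leaf_ids_train)
W_raw = weight_matrix(leaf_ids_X, leaf_ids_train, scale=False)
```

With these inputs, W_scaled is [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0]] and W_raw is [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]].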

refit_leaf_nodes

refit_leaf_nodes(X, Y, sample_weight=None, sample_indices=None)

Refits the leaf nodes in a previously fitted decision tree.

More precisely, the method removes all leaf nodes created during the initial fit and replaces them by predicting all samples in X that appear in sample_indices and placing them into new leaf nodes.

This method can be used to update the leaf nodes in a decision tree based on new data while keeping the original splitting rules. If X does not contain the original training data, the tree structure might change, as leaf nodes without samples are collapsed. The method is also used to create honest splitting in RandomForests.

Parameters:

Name	Type	Description	Default
`X`	`array-like object of dimension 2`	The feature values used for training. Internally it will be converted to np.ndarray with dtype=np.float64.	required
`Y`	`array-like object of dimension 1 or 2`	The response values used for training. Internally it will be converted to np.ndarray with dtype=np.float64.	required
`sample_weight`	`array-like object of dimension 1 \| None`	Sample weights. May not be supported by every Criteria class.	`None`
`sample_indices`	`ArrayLike \| None`	Indices of the rows of X that are used to create the new leaf nodes.	`None`
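
The idea of keeping the splitting rules while recomputing the leaves can be sketched as follows. This is an illustration under simplifying assumptions, not the library's code: toy_leaf_id is a hypothetical fixed one-split tree, leaves are summarized by the mean response as in a Regression tree, and a leaf that receives no samples is simply absent from the result.

```python
from statistics import mean


def toy_leaf_id(row, threshold=0.5):
    """Hypothetical fixed split rule: leaf 1 if row[0] <= threshold, else leaf 2."""
    return 1 if row[0] <= threshold else 2


def refit_leaf_values(X, Y, sample_indices=None):
    """Recompute leaf values from the selected samples, keeping the split rule."""
    if sample_indices is None:
        sample_indices = range(len(X))
    responses = {}
    for i in sample_indices:
        # Route each selected sample through the unchanged split rule.
        responses.setdefault(toy_leaf_id(X[i]), []).append(Y[i])
    return {leaf: mean(values) for leaf, values in responses.items()}


X = [[0.1], [0.9], [0.3], [0.7]]
Y = [1.0, 10.0, 3.0, 20.0]
both_leaves = refit_leaf_values(X, Y, sample_indices=[0, 1])
one_leaf = refit_leaf_values(X, Y, sample_indices=[0, 2])
```

Here both_leaves is {1: 1.0, 2: 10.0}, while one_leaf is {1: 2.0} because no selected sample reaches leaf 2. Honest splitting follows the same pattern: one subset of the data determines the splits, and a disjoint subset passed via sample_indices determines the leaf values.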

similarity

similarity(X0, X1)

Computes a similarity matrix W of size NxM, where each element W[i, j] is 1 if and only if X0[i, :] and X1[j, :] end up in the same leaf node.

Parameters:

Name	Type	Description	Default
`X0`	`ArrayLike`	Array corresponding to rows of W in the output.	required
`X1`	`ArrayLike`	Array corresponding to columns of W in the output.	required

Returns:

Type	Description
`ndarray`	An np.ndarray of shape NxM, where N is the number of rows of X0 and M the number of rows of X1.
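
A small sketch of the NxM layout, using a hypothetical one-split tree (toy_leaf_id) in place of a fitted DecisionTree and nested lists in place of an ndarray:

```python
def toy_leaf_id(row, threshold=0.5):
    """Hypothetical one-split tree: leaf 1 if row[0] <= threshold, else leaf 2."""
    return 1 if row[0] <= threshold else 2


def similarity_matrix(X0, X1):
    """Entry [i][j] is 1 if X0[i] and X1[j] share a leaf, else 0."""
    leaves_0 = [toy_leaf_id(row) for row in X0]
    leaves_1 = [toy_leaf_id(row) for row in X1]
    return [[1 if a == b else 0 for b in leaves_1] for a in leaves_0]


X0 = [[0.1], [0.9], [0.4]]  # N = 3 rows
X1 = [[0.2], [0.8]]         # M = 2 rows
W = similarity_matrix(X0, X1)
```

The result is [[1, 0], [0, 1], [1, 0]]: rows of W follow X0 and columns follow X1.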