Using scikit-learn functionality
To simplify the integration of adaXT into existing machine-learning workflows built on scikit-learn, adaXT's DecisionTree and RandomForest classes are both designed to be compatible with several scikit-learn tools.
For example, GridSearchCV and Pipeline can be used with adaXT.
Using GridSearchCV
Here we illustrate the differences between using scikit-learn's own DecisionTreeClassifier and adaXT's DecisionTree with GridSearchCV. First, there is the initial setup:
from adaXT.decision_tree import DecisionTree
from adaXT.criteria import Gini_index, Entropy
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Generate a random classification dataset
n = 20000
m = 5
X = np.random.uniform(0, 100, (n, m))
Y = np.random.randint(1, 3, n)

# Shared hyperparameters; the two libraries name the splitting criterion differently
param_grid = {
    "max_depth": [3, 5, 10, 20, 100],
    "min_samples_split": [2, 5, 10],
}
param_grid_ada = param_grid | {"criteria": [Gini_index, Entropy]}
param_grid_sk = param_grid | {"criterion": ["gini", "entropy"]}

grid_search_ada = GridSearchCV(
    estimator=DecisionTree(tree_type="Classification"),
    param_grid=param_grid_ada,
    cv=5,
    scoring="accuracy",
)
grid_search_sk = GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid=param_grid_sk,
    cv=5,
    scoring="accuracy",
)

# Run the cross-validated grid search for both libraries
grid_search_ada.fit(X, Y)
grid_search_sk.fit(X, Y)

print("Best Hyperparameters ada: ", grid_search_ada.best_params_)
print("Best Hyperparameters sklearn: ", grid_search_sk.best_params_)
print("Best accuracy ada: ", grid_search_ada.best_score_)
print("Best accuracy sklearn: ", grid_search_sk.best_score_)
Using Pipeline
adaXT is compatible with scikit-learn's Pipeline, which makes it easy to combine adaXT with any of scikit-learn's preprocessing tools. An example that combines a scaling step with a decision tree is provided below. Note that combining a scaling step with a decision tree is generally unnecessary, since decision trees are scale invariant; it can become useful, however, if one additionally adds a dimensionality reduction step after the scaling (see the sketch at the end of this section).
from adaXT.decision_tree import DecisionTree
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a classification dataset and split it into train and test sets
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chain a scaling step and an adaXT decision tree in a single Pipeline
pipe = Pipeline(
    [("scaler", StandardScaler()), ("tree", DecisionTree("Classification"))]
)
print(pipe.fit(X_train, y_train).score(X_test, y_test))

# Parameters of the tree step are accessed with the "tree__" prefix
print(pipe.set_params(tree__max_depth=5).fit(X_train, y_train).score(X_test, y_test))
Again, there are only minor changes between how the DecisionTree and the DecisionTreeClassifier are used. The only difference is that we have to specify that the DecisionTree is for classification. Alternatively, one could pass in a custom criteria, leaf_builder, and predictor, and the DecisionTree could still be used as part of a Pipeline.
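To illustrate both remarks, the following minimal sketch (an addition, continuing from the example above) inserts scikit-learn's PCA between the scaler and the tree and passes one of adaXT's built-in criteria explicitly; the step names and the choice of two components are arbitrary:
from adaXT.criteria import Gini_index
from sklearn.decomposition import PCA

# With a dimensionality reduction step, the scaler is no longer redundant,
# since PCA is sensitive to the scale of the input features.
pipe_pca = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=2)),  # arbitrary number of components
        ("tree", DecisionTree("Classification", criteria=Gini_index)),
    ]
)
print(pipe_pca.fit(X_train, y_train).score(X_test, y_test))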