Creating a custom criteria#
In this section we explain how to create a custom criteria function by walking through the required steps. The Criteria class is implemented as a Cython extension types -- also known as a cdef class. While this ensures that the criteria evaluations are fast, it also means that a custom criteria needs to be written as an extension type. From a practical perspective this means that, we need to implement the custom criteria in a .pyx file, that then needs to be compiled before use.
We start by explaining how to create the .pyx file.
Creating a .pyx file and implementing criteria#
In your working directory create a .pyx file (e.g., my_custom_criteria.pyx
) in
which you first import the Criteria "super" class and then define the new custom
criteria class which inherits from the imported Criteria class. An example of
the skeleton of such a file is given as below.
from adaXT.criteria cimport Criteria
cdef class My_custom_criteria(Criteria):
# Your implementation here
The class type of Criteria is a cdef class, which works in much of the same fashion as a regular Python class, but with faster performance. To read more on cdef classes checkout Cython extension types.
Next, for the new criteria to work with adaXT you have to implement the impurity method inherited from the Criteria class. The impurity method has to follow the type definition specified within criteria.pxd, which is as follows:
The variable indices
refers to the sample indices for which the impurity value
should be computed. To access the feature and response you can make use of
self.x
and self.y
, respectively. More specifically, self.x[indices]
and
self.y[indices]
are the feature and response samples for which the impurity
needs to be computed. With this in place you should be able to implement almost
any criteria function you can imagine. Keep in mind that the impurity
method
is extremely often (approximately \(n\log(n)\) times). Therefore you should invest
a bit of time in optimizing the function in order to avoid long fitting times.
Further computational speed-ups can be achieved by implementing
proxy_improvement
and update_proxy
methods in the criteria class. If these
are not explicitly defined the code defaults to using the impurity
method.
Once you have finished defining your critera class and saved the .pyx file, you can compile the Cython code and use it as part of adaXT.
Compiling and using the custom criteria#
We discuss two specific approaches for using your custom criteria. More details on how to compile and use Cython code can be found here.
- Building a Cython module: This allows you to only compile the new criteria class once and then import it as a regular Python module.
- Use pyximport to import the .pyx file: This avoids needing to build the Cython code manually and instead compiles the code each time you run your Python code.
Option 1: Building a Cython module#
The first option of using your custom criteria is to create a setup.py
file in
which you build a Cython module that you can then import in your Python code.
For this approach create a new subfolder (e.g., custom_criteria/
) in your
working directory in which you copy your .pyx file (e.g.,
my_custom_criteria.pyx
) together with an empty file called __init__.py
. Then
in your working directory create a file called setup.py
containing the
following code:
from setuptools import setup
from Cython.Build import cythonize
setup(
name='Custom Criteria',
ext_modules=cythonize("my_custom_criteria.pyx"),
)
The Cython code can now be compiled with the command
Note that this command needs to be run in an environment in which adaXT and setuptools is installed. After the code is built the custom criteria can be imported as a regular Python module as follows:
Option 2: Using pyximport to import the .pyx file#
Alternatively, you can let Cython compile the code each time you run your Python
script. For this approach create your .py file in which you fit a decision
tree using the new custom criteria within the same folder where you created the
.pyx. For this to work there are a few small things left to do. In order to
not manually recompile the Cython file every time you make changes, you can
automate the compilation within the Python file by telling Cython to comile the
Cython source file, whenever you run the file. Assuming your criteria is in the
.pyx file my_custom_criteria.pyx
you would import it as follows:
Now you can access the new criteria within the Python file like you would do
with any other criteria function. For example, you could fit a decision tree
with your custom criteria class My_custom_criteria
as follows:
from adaXT.decision_tree import DecisionTree
import numpy as np
n = 100
m = 4
X = np.random.uniform(0, 100, (n, m))
Y = np.random.uniform(0, 10, n)
tree = DecisionTree("Regression", criteria=my_custom_critera.My_custom_criteria, max_depth=3)
tree.fit(X, Y)
We now go over a detailed example in which we construct the Partial_linear
criteria.
A detailed example: Partial_linear
#
The general idea of the Partial_linear
criteria is to fit a linear function on
the first feature with the \(Y\) value as the response, that is,
where \(I\) denotes the indices of the samples within a given node, \(X\) is the feature matrix, \(Y\) is the response vector, \(\mu_X\) is mean vector of the columns of \(X\) and \(\mu_Y\) is the mean of \(Y\).
When creating a new criteria class we import and define the class as described
above. We open a new file in our working directory and call it testCrit.pyx
and start with the following lines:
Calculating the mean#
Now although we have provided a mean function within the
adaXT.decision_tree.crit_helpers
, in this specific example we need to
calculate multiple means, and there is no reason to do two passes over the same
indices \(I\), so we create a custom mean method in Cython.
# Custom mean function, such that we don't have to loop through twice.
cdef (double, double) custom_mean(self, int[:] indices):
cdef:
double sumX, sumY
int i
int length = indices.shape[0]
sumX = 0.0
sumY = 0.0
for i in range(length):
sumX += self.x[indices[i], 0]
sumY += self.y[indices[i]]
return ((sumX / (<double> length)), (sumY / (<double> length)))
You might notice that the syntax is a little different than for default Python. In general you can write default Python within the cdef class, however the runtime can be significantly decreased by adding type definitions and other Cython magic. If you wish to learn more about the Cython language be sure to check out the Cython documentation.
Furthermore, note that you can freely create any new method within the custom
criteria class even if it is not defined in the standard parent criteria class.
Therefore you are not limited by the parent class and can freely extend it as
long as you do not overwrite the evaluate_split
(unless that is your
intention).
Calculating theta#
Now that we have a method for getting the mean, we can create a method for calculating the theta values, such that our code is more manageable.
Again the majority of the Cython is not mandatory but speeds up the code. On line 26 we access our previously defined custom mean function, which returns the mean of the \(X\) indices and the mean of the \(Y\) indices as described above. Then on line 27 we loop over all the indices a second time and calculate \(\sum_{i \in I} (X[i, 0] - \mu_X) (Y[i] - \mu_Y)\) and \(\sum_{i \in I} (X[i, 0] - \mu_X)^2\) which are the numerator and denominator, respectively. These are the two values used to calculate \(\theta_1\). Further on line 31 we check to make sure that the denominator is not 0.0, this ensures that we can consider the underidentified case separately and do not divide by zero by accident. If the denominator is zero, we simply set \(\theta_1\) to 0.0 as this will give an L value of 0 in the end. We finish off by returning the two values \(\theta_0,\theta_1\).
The impurity function#
Finally, we reach the most important step, which is to create the impurity function.
Making sure the impurity method has the required signature, we simply calculate
L as described and return its value. One important point to note is that
cur_sum
is defined on line 3 to have type double, because the impurity
function is defined to have a double as return value. When creating the impurity
method you must ensure this return type.
Finishing up#
And that is it. You have now created your first custom criteria class, and can freely use it within your own Python code. If you use the method that compiles the Cython file every time the Python file is run leads to code that looks like this:
import numpy as np
import matplotlib.pyplot as plt
from adaXT.decision_tree import DecisionTree
from adaXT.decision_tree.tree_utils import plot_tree
import pyximport
pyximport.install()
import testCrit
# Generate training data
n = 100
m = 4
X = np.random.uniform(0, 100, (n, m))
Y = np.random.uniform(0, 10, n)
# Initialize and fit tree
tree = DecisionTree("Regression", testCrit.Partial_linear, max_depth=3)
tree.fit(X, Y)
# Plot the tree
plot_tree(tree)
plt.show()
This creates a regression tree with the newly created custom Partial_linear
criteria class, specifies the max_depth
to be 3 and then plots the tree using
both the
plot_tree based
on matplotlib. The full source code used within this
article can be found
here.