public class BFTree extends RandomizableClassifier implements AdditionalMeasureProducer, TechnicalInformationHandler
@mastersthesis{Shi2007,
address = {Hamilton, NZ},
author = {Haijian Shi},
note = {COMP594},
school = {University of Waikato},
title = {Best-first decision tree learning},
year = {2007}
}
@article{Friedman2000,
author = {Jerome Friedman and Trevor Hastie and Robert Tibshirani},
journal = {Annals of statistics},
number = {2},
pages = {337-407},
title = {Additive logistic regression : A statistical view of boosting},
volume = {28},
year = {2000},
ISSN = {0090-5364}
}
Valid options are:
-S <num> Random number seed. (default 1)
-D If set, classifier is run in debug mode and may output additional info to the console
-P <UNPRUNED|POSTPRUNED|PREPRUNED> The pruning strategy. (default: POSTPRUNED)
-M <min no> The minimal number of instances at the terminal nodes. (default 2)
-N <num folds> The number of folds used in the pruning. (default 5)
-H Don't use heuristic search for nominal attributes in multi-class problem (default yes).
-G Don't use Gini index for splitting (default yes), if not information is used.
-R Don't use error rate in internal cross-validation (default yes), but root mean squared error.
-A Use the 1 SE rule to make pruning decision. (default no).
-C Percentage of training data size (0-1] (default 1).
| Modifier and Type | Field and Description |
|---|---|
protected Attribute |
m_Attribute
Attribute used for splitting.
|
protected Attribute |
m_ClassAttribute
Class attribute of a dataset.
|
protected double[] |
m_ClassProbs
Class probabilities.
|
protected double |
m_ClassValue
Class value for a node.
|
protected double[] |
m_Distribution
Class distributions.
|
protected double[][][] |
m_Dists
Distributions of each attribute for two successor nodes.
|
protected static int |
m_Expansion
Number of expansions.
|
protected int |
m_FixedExpansion
Fixed number of expansions (if no pruning method is used, its value is -1.
|
protected boolean |
m_Heuristic
If use huristic search for binary split (default true).
|
protected boolean |
m_isLeaf
If the ndoe is leaf node.
|
protected int |
m_minNumObj
Minimum number of instances at leaf nodes.
|
protected int |
m_numFoldsPruning
Number of folds for the pruning.
|
protected double[] |
m_Props
Branch proportions.
|
protected int |
m_PruningStrategy
the pruning strategy
|
protected double |
m_SizePer
The training data size (0-1).
|
protected int[][] |
m_SortedIndices
Sorted indices.
|
protected String |
m_SplitString
Split subset (for nominal attributes).
|
protected double |
m_SplitValue
Split point (for numeric attributes).
|
protected BFTree[] |
m_Successors
Successor nodes.
|
protected double |
m_TotalWeight
Total weights.
|
protected boolean |
m_UseErrorRate
If use error rate in internal cross-validation to fix the number of expansions - default
(if not, root mean squared error is used).
|
protected boolean |
m_UseGini
If use Gini index as the splitting criterion - default (if not, information is used).
|
protected boolean |
m_UseOneSE
If use the 1SE rule to make the decision.
|
protected double[][] |
m_Weights
Sorted weights.
|
static int |
PRUNING_POSTPRUNING
pruning strategy: post-pruning
|
static int |
PRUNING_PREPRUNING
pruning strategy: pre-pruning
|
static int |
PRUNING_UNPRUNED
pruning strategy: un-pruned
|
static Tag[] |
TAGS_PRUNING
pruning strategy
|
m_Seedm_Debug| Constructor and Description |
|---|
BFTree() |
| Modifier and Type | Method and Description |
|---|---|
void |
buildClassifier(Instances data)
Method for building a BestFirst decision tree classifier.
|
protected double |
computeEntropy(double[] dist,
double total)
Compute and return entropy for a given distribution of a node.
|
protected double |
computeGini(double[] dist,
double total)
Compute and return gini index for a given distribution of a node.
|
protected double |
computeGiniGain(double[] parentDist,
double[][] childDist)
Compute and return gini gain for given distributions of a node and its
successor nodes.
|
protected double |
computeInfoGain(double[] parentDist,
double[][] childDist)
Compute and return information gain for given distributions of a node
and its successor nodes.
|
protected double |
computeSortedInfo(Instances data,
int[][] sortedIndices,
double[][] weights,
double[] classProbs)
Compute sorted indices, weights and class probabilities for a given
dataset.
|
protected FastVector |
computeSplitInfo(BFTree node,
Instances data,
int[][] sortedIndices,
double[][] weights,
double[][][] dists,
double[][] props,
double[][] totalSubsetWeights,
boolean useHeuristic,
boolean useGini)
Compute the best splitting attribute, split point or subset and the best
gini gain or iformation gain for a given dataset.
|
double[] |
distributionForInstance(Instance instance)
Computes class probabilities for instance using the decision tree.
|
Enumeration |
enumerateMeasures()
Return an enumeration of the measure names.
|
Capabilities |
getCapabilities()
Returns default capabilities of the classifier.
|
boolean |
getHeuristic()
Get if use heuristic search for nominal attributes in multi-class problems.
|
double |
getMeasure(String additionalMeasureName)
Returns the value of the named measure
|
int |
getMinNumObj()
Get minimal number of instances at the terminal nodes.
|
int |
getNumFoldsPruning()
Set number of folds in internal cross-validation.
|
String[] |
getOptions()
Gets the current settings of the Classifier.
|
SelectedTag |
getPruningStrategy()
Gets the pruning strategy.
|
String |
getRevision()
Returns the revision string.
|
double |
getSizePer()
Get training set size.
|
TechnicalInformation |
getTechnicalInformation()
Returns an instance of a TechnicalInformation object, containing
detailed information about the technical background of this class,
e.g., paper reference or book this class is based on.
|
boolean |
getUseErrorRate()
Get if use error rate in internal cross-validation.
|
boolean |
getUseGini()
Get if use Gini index as splitting criterion.
|
boolean |
getUseOneSE()
Get if use the 1SE rule to choose final model.
|
String |
globalInfo()
Returns a string describing classifier
|
String |
heuristicTipText()
Returns the tip text for this property
|
Enumeration |
listOptions()
Returns an enumeration describing the available options.
|
static void |
main(String[] args)
Main method.
|
protected void |
makeLeaf(Instances data)
Make the node leaf node.
|
protected void |
makeSuccessors(FastVector BestFirstElements,
Instances data,
int[][][] subsetSortedIndices,
double[][][] subsetWeights,
double[][][] dists,
Attribute att,
boolean useHeuristic,
boolean useGini)
Generate successor nodes for a node and put them into BestFirstElements
according to gini gain or information gain in a descending order.
|
protected void |
makeTree(FastVector BestFirstElements,
BFTree root,
Instances train,
Instances test,
FastVector modelError,
int[][] sortedIndices,
double[][] weights,
double[][][] dists,
double[] classProbs,
double totalWeight,
double[] branchProps,
int minNumObj,
boolean useHeuristic,
boolean useGini,
boolean useErrorRate)
This method is to find the number of expansions based on internal
cross-validation for just post-pruning.
|
protected boolean |
makeTree(FastVector BestFirstElements,
BFTree root,
Instances train,
int[][] sortedIndices,
double[][] weights,
double[][][] dists,
double[] classProbs,
double totalWeight,
double[] branchProps,
int minNumObj,
boolean useHeuristic,
boolean useGini)
This method is to find the number of expansions based on internal
cross-validation for just pre-pruning.
|
protected void |
makeTree(FastVector BestFirstElements,
Instances data,
int[][] sortedIndices,
double[][] weights,
double[][][] dists,
double[] classProbs,
double totalWeight,
double[] branchProps,
int minNumObj,
boolean useHeuristic,
boolean useGini,
int preExpansion)
Recursively build a best-first decision tree.
|
double |
measureTreeSize()
Return number of tree size.
|
String |
minNumObjTipText()
Returns the tip text for this property
|
protected String |
nominalDistribution(double[][] props,
double[][][] dists,
Attribute att,
int[] sortedIndices,
double[] weights,
double[][] subsetWeights,
double[] gains,
Instances data,
boolean useHeuristic,
boolean useGini)
Compute distributions, proportions and total weights of two successor
nodes for a given nominal attribute.
|
protected double |
numericDistribution(double[][] props,
double[][][] dists,
Attribute att,
int[] sortedIndices,
double[] weights,
double[][] subsetWeights,
double[] gains,
Instances data,
boolean useGini)
Compute distributions, proportions and total weights of two successor nodes for
a given numeric attribute.
|
String |
numFoldsPruningTipText()
Returns the tip text for this property
|
int |
numLeaves()
Compute number of leaf nodes.
|
int |
numNodes()
Compute size of the tree.
|
String |
pruningStrategyTipText()
Returns the tip text for this property
|
void |
setHeuristic(boolean value)
Set if use heuristic search for nominal attributes in multi-class problems.
|
void |
setMinNumObj(int value)
Set minimal number of instances at the terminal nodes.
|
void |
setNumFoldsPruning(int value)
Set number of folds in internal cross-validation.
|
void |
setOptions(String[] options)
Parses the options for this object.
|
void |
setPruningStrategy(SelectedTag value)
Sets the pruning strategy.
|
void |
setSizePer(double value)
Set training set size.
|
void |
setUseErrorRate(boolean value)
Set if use error rate in internal cross-validation.
|
void |
setUseGini(boolean value)
Set if use Gini index as splitting criterion.
|
void |
setUseOneSE(boolean value)
Set if use the 1SE rule to choose final model.
|
String |
sizePerTipText()
Returns the tip text for this property
|
protected void |
splitData(int[][][] subsetIndices,
double[][][] subsetWeights,
Attribute att,
double splitPoint,
String splitStr,
int[][] sortedIndices,
double[][] weights,
Instances data)
Split data into two subsets and store sorted indices and weights for two
successor nodes.
|
String |
toString()
Prints the decision tree using the protected toString method from below.
|
protected String |
toString(int level)
Outputs a tree at a certain level.
|
String |
useErrorRateTipText()
Returns the tip text for this property
|
String |
useGiniTipText()
Returns the tip text for this property
|
String |
useOneSETipText()
Returns the tip text for this property
|
getSeed, seedTipText, setSeedclassifyInstance, debugTipText, forName, getDebug, makeCopies, makeCopy, runClassifier, setDebugpublic static final int PRUNING_UNPRUNED
public static final int PRUNING_POSTPRUNING
public static final int PRUNING_PREPRUNING
public static final Tag[] TAGS_PRUNING
protected int m_PruningStrategy
protected BFTree[] m_Successors
protected Attribute m_Attribute
protected double m_SplitValue
protected String m_SplitString
protected double m_ClassValue
protected Attribute m_ClassAttribute
protected int m_minNumObj
protected int m_numFoldsPruning
protected boolean m_isLeaf
protected static int m_Expansion
protected int m_FixedExpansion
protected boolean m_Heuristic
protected boolean m_UseGini
protected boolean m_UseErrorRate
protected boolean m_UseOneSE
protected double[] m_Distribution
protected double[] m_Props
protected int[][] m_SortedIndices
protected double[][] m_Weights
protected double[][][] m_Dists
protected double[] m_ClassProbs
protected double m_TotalWeight
protected double m_SizePer
public String globalInfo()
public TechnicalInformation getTechnicalInformation()
getTechnicalInformation in interface TechnicalInformationHandlerpublic Capabilities getCapabilities()
getCapabilities in interface CapabilitiesHandlergetCapabilities in class ClassifierCapabilitiespublic void buildClassifier(Instances data) throws Exception
buildClassifier in class Classifierdata - set of instances serving as training dataException - if decision tree cannot be built successfullyprotected void makeTree(FastVector BestFirstElements, Instances data, int[][] sortedIndices, double[][] weights, double[][][] dists, double[] classProbs, double totalWeight, double[] branchProps, int minNumObj, boolean useHeuristic, boolean useGini, int preExpansion) throws Exception
BestFirstElements - list to store BFTree nodesdata - training datasortedIndices - sorted indices of the instancesweights - weights of the instancesdists - class distributions for each attributeclassProbs - class probabilities of this nodetotalWeight - total weight of this node (note if the node
can not split, this value is not calculated.)branchProps - proportions of two subbranchesminNumObj - minimal number of instances at leaf nodesuseHeuristic - if use heuristic search for nominal attributes
in multi-class problemuseGini - if use Gini index as splitting criterionpreExpansion - the number of expansions the tree to be expandedException - if something goes wrongprotected boolean makeTree(FastVector BestFirstElements, BFTree root, Instances train, int[][] sortedIndices, double[][] weights, double[][][] dists, double[] classProbs, double totalWeight, double[] branchProps, int minNumObj, boolean useHeuristic, boolean useGini) throws Exception
BestFirstElements - list to store BFTree nodesroot - root node of tree in each foldtrain - training datasortedIndices - sorted indices of the instancesweights - weights of the instancesdists - class distributions for each attributeclassProbs - class probabilities of this nodetotalWeight - total weight of this node (note if the node
can not split, this value is not calculated.)branchProps - proportions of two subbranchesminNumObj - minimal number of instances at leaf nodesuseHeuristic - if use heuristic search for nominal attributes
in multi-class problemuseGini - if use Gini index as splitting criterionException - if something goes wrongprotected void makeTree(FastVector BestFirstElements, BFTree root, Instances train, Instances test, FastVector modelError, int[][] sortedIndices, double[][] weights, double[][][] dists, double[] classProbs, double totalWeight, double[] branchProps, int minNumObj, boolean useHeuristic, boolean useGini, boolean useErrorRate) throws Exception
BestFirstElements - list to store BFTree nodesroot - root node of tree in each foldtrain - training data in each foldtest - test data in each foldmodelError - list to store error for each expansion in
each foldsortedIndices - sorted indices of the instancesweights - weights of the instancesdists - class distributions for each attributeclassProbs - class probabilities of this nodetotalWeight - total weight of this node (note if the node
can not split, this value is not calculated.)branchProps - proportions of two subbranchesminNumObj - minimal number of instances at leaf nodesuseHeuristic - if use heuristic search for nominal attributes
in multi-class problemuseGini - if use Gini index as splitting criterionuseErrorRate - if use error rate in internal cross-validationException - if something goes wrongprotected void makeSuccessors(FastVector BestFirstElements, Instances data, int[][][] subsetSortedIndices, double[][][] subsetWeights, double[][][] dists, Attribute att, boolean useHeuristic, boolean useGini) throws Exception
BestFirstElements - list to store BestFirst nodesdata - training instancesubsetSortedIndices - sorted indices of instances of successor nodessubsetWeights - weights of instances of successor nodesdists - class distributions of successor nodesatt - attribute used to split the nodeuseHeuristic - if use heuristic search for nominal attributes in multi-class problemuseGini - if use Gini index as splitting criterionException - if something goes wrongprotected double computeSortedInfo(Instances data, int[][] sortedIndices, double[][] weights, double[] classProbs) throws Exception
data - training datasortedIndices - sorted indices of instances at the nodeweights - weights of instances at the nodeclassProbs - class probabilities at the nodeException - if something goes wrongprotected FastVector computeSplitInfo(BFTree node, Instances data, int[][] sortedIndices, double[][] weights, double[][][] dists, double[][] props, double[][] totalSubsetWeights, boolean useHeuristic, boolean useGini) throws Exception
node - node to be splitdata - training datasortedIndices - sorted indices of the instancesweights - weights of the instancesdists - class distributions for each attributeprops - proportions of two branchestotalSubsetWeights - total weight of two subsetsuseHeuristic - if use heuristic search for nominal attributes
in multi-class problemuseGini - if use Gini index as splitting criterionException - if something is wrongprotected double numericDistribution(double[][] props,
double[][][] dists,
Attribute att,
int[] sortedIndices,
double[] weights,
double[][] subsetWeights,
double[] gains,
Instances data,
boolean useGini)
throws Exception
props - proportions of each two branches for each attributedists - class distributions of two branches for each attributeatt - numeric att split onsortedIndices - sorted indices of instances for the attirubteweights - weights of instances for the attirbutesubsetWeights - total weight of two branches split based on the attributegains - Gini gains or information gains for each attributedata - training instancesuseGini - if use Gini index as splitting criterionException - if something goes wrongprotected String nominalDistribution(double[][] props, double[][][] dists, Attribute att, int[] sortedIndices, double[] weights, double[][] subsetWeights, double[] gains, Instances data, boolean useHeuristic, boolean useGini) throws Exception
props - proportions of each two branches for each attributedists - class distributions of two branches for each attributeatt - numeric att split onsortedIndices - sorted indices of instances for the attirubteweights - weights of instances for the attirbutesubsetWeights - total weight of two branches split based on the attributegains - Gini gains for each attributedata - training instancesuseHeuristic - if use heuristic searchuseGini - if use Gini index as splitting criterionException - if something goes wrongprotected void splitData(int[][][] subsetIndices,
double[][][] subsetWeights,
Attribute att,
double splitPoint,
String splitStr,
int[][] sortedIndices,
double[][] weights,
Instances data)
throws Exception
subsetIndices - sorted indecis of instances for each attribute for two successor nodesubsetWeights - weights of instances for each attribute for two successor nodeatt - attribute the split based onsplitPoint - split point the split based on if att is numericsplitStr - split subset the split based on if att is nominalsortedIndices - sorted indices of the instances to be splitweights - weights of the instances to bes splitdata - training dataException - if something goes wrongprotected double computeGiniGain(double[] parentDist,
double[][] childDist)
parentDist - class distributions of parent nodechildDist - class distributions of successor nodesprotected double computeGini(double[] dist,
double total)
dist - class distributionstotal - class distributionsprotected double computeInfoGain(double[] parentDist,
double[][] childDist)
parentDist - class distributions of parent nodechildDist - class distributions of successor nodesprotected double computeEntropy(double[] dist,
double total)
dist - class distributionstotal - class distributionsprotected void makeLeaf(Instances data)
data - training datapublic double[] distributionForInstance(Instance instance) throws Exception
distributionForInstance in class Classifierinstance - the instance for which class probabilities is to be computedException - if something goes wrongpublic String toString()
protected String toString(int level)
level - the level at which the tree is to be printedpublic int numNodes()
public int numLeaves()
public Enumeration listOptions()
listOptions in interface OptionHandlerlistOptions in class RandomizableClassifierpublic void setOptions(String[] options) throws Exception
-S <num> Random number seed. (default 1)
-D If set, classifier is run in debug mode and may output additional info to the console
-P <UNPRUNED|POSTPRUNED|PREPRUNED> The pruning strategy. (default: POSTPRUNED)
-M <min no> The minimal number of instances at the terminal nodes. (default 2)
-N <num folds> The number of folds used in the pruning. (default 5)
-H Don't use heuristic search for nominal attributes in multi-class problem (default yes).
-G Don't use Gini index for splitting (default yes), if not information is used.
-R Don't use error rate in internal cross-validation (default yes), but root mean squared error.
-A Use the 1 SE rule to make pruning decision. (default no).
-C Percentage of training data size (0-1] (default 1).
setOptions in interface OptionHandlersetOptions in class RandomizableClassifieroptions - the options to useException - if setting of options failspublic String[] getOptions()
getOptions in interface OptionHandlergetOptions in class RandomizableClassifierpublic Enumeration enumerateMeasures()
enumerateMeasures in interface AdditionalMeasureProducerpublic double measureTreeSize()
public double getMeasure(String additionalMeasureName)
getMeasure in interface AdditionalMeasureProduceradditionalMeasureName - the name of the measure to query for its valueIllegalArgumentException - if the named measure is not supportedpublic String pruningStrategyTipText()
public void setPruningStrategy(SelectedTag value)
value - the strategypublic SelectedTag getPruningStrategy()
public String minNumObjTipText()
public void setMinNumObj(int value)
value - minimal number of instances at the terminal nodespublic int getMinNumObj()
public String numFoldsPruningTipText()
public void setNumFoldsPruning(int value)
value - the number of foldspublic int getNumFoldsPruning()
public String heuristicTipText()
public void setHeuristic(boolean value)
value - if use heuristic search for nominal attributes in
multi-class problemspublic boolean getHeuristic()
public String useGiniTipText()
public void setUseGini(boolean value)
value - if use Gini index splitting criterionpublic boolean getUseGini()
public String useErrorRateTipText()
public void setUseErrorRate(boolean value)
value - if use error rate in internal cross-validationpublic boolean getUseErrorRate()
public String useOneSETipText()
public void setUseOneSE(boolean value)
value - if use the 1SE rule to choose final modelpublic boolean getUseOneSE()
public String sizePerTipText()
public void setSizePer(double value)
value - training set sizepublic double getSizePer()
public String getRevision()
getRevision in interface RevisionHandlergetRevision in class Classifierpublic static void main(String[] args)
args - the options for the classifierCopyright © 2015 University of Waikato, Hamilton, NZ. All rights reserved.