Creator:Daniël Höhle
Function:Decision tree generation
Input Type:csv, discrete or continuous
Input from Agent:Dizzy et al., Ultimate Discretizer, Data Selection Agent
Output Type:Classification advice as xml (description of the n best trees), dot (Graphviz, to render a tree as PDF), xls (confusion matrix, boundaries file, train and test classification)
Output to Agent:Advice, Ceres, Juno
Short Description:

Moku is a decision tree-building algorithm.
Moku was derived from the ID3 and C4.5 tree-building algorithms, with improvements to tree selection and usability for non-AI specialists. Decision trees are drawn from historic patient data and may then be used to classify new patients, either manually or implemented as a set of rules in the Ceres or Juno agents.

To build decision trees with Moku, a historic dataset is required, with input variables and one output variable. Typically, 70% of the historic data is used to create the trees and the remaining 30% to test or validate them. The output variable must be categorical, whereas continuous input variables may be categorized along the way.
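The 70/30 split described above can be sketched as follows. This is a minimal illustration, not Moku's actual code; the record structure and seed are assumptions.

```python
import random

def split_dataset(rows, train_fraction=0.7, seed=42):
    """Shuffle historic records and split them into train and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Hypothetical historic records: one input variable and one output variable.
records = [{"age": a, "outcome": a % 2} for a in range(10)]
train, test = split_dataset(records)
print(len(train), len(test))  # 7 3
```

Shuffling before the split avoids ordering effects in the historic data (for example, patients sorted by admission date).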
Moku creates individual trees by calculating the mutual information between each input variable and the output variable. The input variable with the highest mutual information is used for the root node, or starting point of the tree. The categories of the root node determine which subset of patients goes into which branch. For each subset or branch, the mutual information between the remaining input variables and the output variable is again calculated, and the variable with the highest mutual information becomes the next node, and so on. The tree is then pruned, removing branches that gain too little information.
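The node-selection step above can be illustrated with a short sketch: mutual information is the entropy of the output variable minus its conditional entropy given an input variable, and the variable that maximizes it becomes the node. The toy symptom data and field names are assumptions, not Moku's format.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def mutual_information(rows, attr, target="outcome"):
    """I(attr; target) = H(target) - H(target | attr)."""
    labels = [r[target] for r in rows]
    mi = entropy(labels)
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        mi -= len(subset) / len(rows) * entropy(subset)
    return mi

# Hypothetical patient records.
rows = [
    {"fever": "yes", "cough": "yes", "outcome": "flu"},
    {"fever": "yes", "cough": "no",  "outcome": "flu"},
    {"fever": "no",  "cough": "yes", "outcome": "cold"},
    {"fever": "no",  "cough": "no",  "outcome": "healthy"},
]

# The variable with the highest mutual information becomes the root node.
root = max(["fever", "cough"], key=lambda a: mutual_information(rows, a))
print(root)  # fever
```

Applied recursively to each branch's subset of patients, this procedure yields an ID3-style tree; pruning would then drop branches whose information gain falls below a threshold.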

Tree performance is calculated from (user-adjustable) weighted parameters, including the classification accuracy on the train set (the 70%) and the validation set (the 30%), the average depth of the tree, the number of leaves (final decision nodes, without children) and the average predictability of all leaves. Moku generates x trees (typically 1000), but prints only the best y trees (typically 10), in both machine-readable xml and human-readable graphs (based on Graphviz's dot language).
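One way to combine those parameters into a single ranking score is a weighted sum, with depth and leaf count entering as penalties. The particular weights and statistic names below are illustrative assumptions; Moku's actual (user-adjustable) weighting may differ.

```python
def tree_score(stats, weights=None):
    """Weighted score over per-tree statistics; higher is better.
    Depth and leaf count carry negative weights (simpler trees win ties)."""
    w = weights or {"train_acc": 1.0, "test_acc": 2.0,   # hypothetical defaults
                    "depth": -0.1, "leaves": -0.05, "leaf_pred": 1.0}
    return sum(w[k] * stats[k] for k in w)

# Hypothetical statistics for two candidate trees.
candidates = [
    {"train_acc": 0.92, "test_acc": 0.88, "depth": 4, "leaves": 9,  "leaf_pred": 0.85},
    {"train_acc": 0.99, "test_acc": 0.70, "depth": 9, "leaves": 30, "leaf_pred": 0.60},
]

# Keep only the best y trees (here y = 1) for printing.
best = sorted(candidates, key=tree_score, reverse=True)[:1]
```

Weighting test accuracy more heavily than train accuracy, as in this sketch, favours trees that generalize rather than memorize the historic data.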

Further tree information supplied includes the confusion matrix (with sensitivity and specificity measures) and the classification of all train and test cases.
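For a binary outcome, the confusion matrix and the two measures mentioned above reduce to four counts: sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). The sketch below assumes a binary classification; the labels are invented.

```python
def confusion_stats(actual, predicted, positive):
    """Binary confusion matrix counts plus sensitivity and specificity."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "sensitivity": tp / (tp + fn),   # true-positive rate
            "specificity": tn / (tn + fp)}   # true-negative rate

# Hypothetical actual vs. predicted classes for five test cases.
actual    = ["sick", "sick", "well", "well", "sick"]
predicted = ["sick", "well", "well", "sick", "sick"]
stats = confusion_stats(actual, predicted, positive="sick")
```

For a multi-class outcome, the same measures are typically reported per class in one-vs-rest fashion.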

© 2015 Alan Turing Institute Almere