Description
The code for determining the similarity of two condition thresholds is shown below:
PERMISSIBLE_DELTA = 0.1
…
def condition_similarity(condition1: Condition, condition2: Condition):
    # Different attributes
    if condition1.attribute != condition2.attribute:
        return 0
    # Different operators
    # TODO: Extend???
    if condition1.operator != condition2.operator:
        return 0
    # Handle <= as a special case as per paper
    if condition1.operator == Operator.LE and condition2.operator == Operator.LE:
        t = abs(PERMISSIBLE_DELTA * condition1.threshold)
        x = abs(condition1.threshold - condition2.threshold)
        if x == 0:
            return 1
        return 1 - (x / t) if x < t else 0
    return 1
(The original code also contained a bug in the calculation of the tolerance, t, which was fixed in PR #6.)
This threshold logic is not appropriate for ordinal numbers. For example, the UCI Poker Hand dataset represents the rank of cards as numbers from 1 to 13. As PERMISSIBLE_DELTA = 0.1, a Queen (12) has a tolerance, t, of 12 * 0.1 = 1.2, which means it would be considered similar to a Jack (11) or King (13), but an Ace (1) would have a tolerance, t, of 1 * 0.1 = 0.1, so it wouldn't be considered similar to any other card.
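The arithmetic above can be checked with a small standalone sketch. It copies only the `<=` branch of `condition_similarity` and takes raw thresholds instead of `Condition` objects; the helper name `le_similarity` is hypothetical and not part of the module.

```python
PERMISSIBLE_DELTA = 0.1  # value taken from the snippet in this issue


def le_similarity(threshold1: float, threshold2: float) -> float:
    """Mirror the `<=` branch of condition_similarity for two raw thresholds."""
    t = abs(PERMISSIBLE_DELTA * threshold1)  # tolerance scales with threshold1
    x = abs(threshold1 - threshold2)         # absolute gap between the thresholds
    if x == 0:
        return 1.0
    return 1 - (x / t) if x < t else 0.0


queen_vs_jack = le_similarity(12, 11)  # tolerance 1.2, gap 1: small positive score
queen_vs_king = le_similarity(12, 13)  # same tolerance and gap as above
ace_vs_two = le_similarity(1, 2)       # tolerance 0.1, gap 1: score is 0
```

With these inputs the Queen scores roughly 0.17 against its neighbours while the Ace scores 0 against a Two, even though both pairs are one rank apart.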
The similar_tree module needs to be modified to allow a list of attributes to be treated as ordinal numbers, and the tolerance logic adjusted accordingly. For these attributes, the condition similarity should be 1 if the thresholds represent the same partitioning (e.g. <= 2.0 is the same as <= 2.9, as both split {1, 2} vs {3, 4, ...}), and 0 otherwise.
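One way the partition check might look is sketched below. It assumes the ordinal values are integers, in which case `x <= t` selects the same values as `x <= floor(t)` (and likewise for `>`), so two thresholds give the same split exactly when their floors match. The function name `ordinal_threshold_similarity` is hypothetical, and the wiring to a configurable list of ordinal attributes is left out.

```python
import math


def ordinal_threshold_similarity(threshold1: float, threshold2: float) -> int:
    """Return 1 if both thresholds split integer ordinal values identically, else 0.

    Assumption: the attribute only takes integer values, so a threshold t
    partitions them at floor(t) for both `<=` and `>` comparisons.
    """
    return 1 if math.floor(threshold1) == math.floor(threshold2) else 0


same_split = ordinal_threshold_similarity(2.0, 2.9)       # both split {1, 2} vs {3, 4, ...}
different_split = ordinal_threshold_similarity(2.9, 3.0)  # {1, 2} vs {1, 2, 3}
```

This replaces the relative tolerance entirely for ordinal attributes, so an Ace and a Queen are treated by the same rule.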
Secondly, the code only deals with the case of two <= operators, not two > operators. In the case of two > operators it will return 1 (perfect similarity) even if the thresholds differ.
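A possible shape for the second fix is sketched below, assuming the same tolerance rule is intended for `>` as for `<=` (the issue does not confirm what the paper prescribes here). `Operator` and `Condition` are minimal stand-ins for the module's own types, and the member name `Operator.GT` is an assumption; only `Operator.LE` appears in the original snippet.

```python
from dataclasses import dataclass
from enum import Enum

PERMISSIBLE_DELTA = 0.1


class Operator(Enum):  # stand-in for the module's enum
    LE = "<="
    GT = ">"           # assumed member name


@dataclass
class Condition:       # stand-in for the module's Condition type
    attribute: str
    operator: Operator
    threshold: float


def condition_similarity(condition1: Condition, condition2: Condition) -> float:
    # Conditions on different attributes or with different operators never match
    if condition1.attribute != condition2.attribute:
        return 0
    if condition1.operator != condition2.operator:
        return 0
    # Apply the tolerance rule to both <= and > (assumption for >)
    if condition1.operator in (Operator.LE, Operator.GT):
        t = abs(PERMISSIBLE_DELTA * condition1.threshold)
        x = abs(condition1.threshold - condition2.threshold)
        if x == 0:
            return 1
        return 1 - (x / t) if x < t else 0
    return 1


far_apart = condition_similarity(
    Condition("a", Operator.GT, 10.0), Condition("a", Operator.GT, 20.0)
)
close = condition_similarity(
    Condition("a", Operator.GT, 10.0), Condition("a", Operator.GT, 10.5)
)
```

With this change two `>` conditions whose thresholds differ widely no longer score 1; `far_apart` is 0 and `close` is 0.5.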