Rule Learning - Teaching-WIKI


Intelligent Systems

Rule Learning

© Copyright  2010      Dieter  Fensel  and  Ioan  Toma  


Where are we?

#    Title
1    Introduction
2    Propositional Logic
3    Predicate Logic
4    Reasoning
5    Search Methods
6    CommonKADS
7    Problem-Solving Methods
8    Planning
9    Software Agents
10   Rule Learning
11   Inductive Logic Programming
12   Formal Concept Analysis
13   Neural Networks
14   Semantic Web and Services

Tom Mitchell “Machine Learning”

•  Slides in this lecture are partially based on [1], Section 3 “Decision Tree Learning”.


Outline

•  Motivation
•  Technical Solution
   –  Rule learning tasks
   –  Rule learning approaches
      •  Specialization
         –  The ID3 Algorithm
      •  Generalization
         –  The RELAX Algorithm
      •  Combining specialization and generalization
         –  The JoJo Algorithm
         –  Refinement of rule sets with JoJo
•  Illustration by a Larger Example
•  Extensions
   –  The C4.5 Algorithm
•  Summary

MOTIVATION


Motivation: Definition of Learning in AI

•  Research on learning in AI is made up of diverse subfields (cf. [5])
   –  “Learning as adaptation”: Adaptive systems monitor their own performance and attempt to improve it by adjusting internal parameters, e.g.,
      •  Self-improving programs for playing games
      •  Pole balancing
      •  Problem solving methods
   –  “Learning as the acquisition of structured knowledge” in the form of
      •  Concepts
      •  Production rules
      •  Discrimination nets

Machine Learning

•  Machine learning (ML) is concerned with the design and development of algorithms that allow computers to change behavior based on data. A major focus is to automatically learn to recognize complex patterns and make intelligent decisions based on data (Source: Wikipedia).
•  Driving question: “How can we build computer systems that automatically improve with experience, and what are the fundamental laws that govern all learning processes?” (cf. [2])
•  Three niches for machine learning [1]:
   –  Data mining: using historical data to improve decisions, e.g. from medical records to medical knowledge
   –  Software applications that are hard to program by hand, e.g. autonomous driving or speech recognition
   –  Self-customizing programs, e.g., a newsreader that learns its users’ interests
•  Practical success of ML:
   –  Speech recognition
   –  Computer vision, e.g. to recognize faces, to automatically classify microscope images of cells, or to recognize handwriting
   –  Bio-surveillance, i.e. to detect and track disease outbreaks
   –  Robot control

Rule learning

•  Machine Learning is a central research area in AI to acquire knowledge [1].
•  Rule learning
   –  is a popular approach for discovering interesting relations in data sets and for acquiring structured knowledge.
   –  is a means to alleviate the knowledge acquisition “bottleneck” problem in expert systems [4].

Simple Motivating Example

•  I = {milk, bread, butter, beer}
•  Sample database

   ID   Milk   Bread   Butter   Beer
   1    1      1       1        0
   2    0      1       1        0
   3    0      0       0        1
   4    1      1       1        0
   5    0      1       0        0

•  A possible rule would be {milk, bread} → {butter}

Source: Wikipedia
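How well the rule {milk, bread} → {butter} fits the sample database can be checked with the usual support and confidence measures of association rule learning. The following is a minimal sketch; the function names `support` and `confidence` are illustrative choices, not part of the slides.

```python
# Transactions from the sample database (1 = item present in the basket).
ITEMS = ["milk", "bread", "butter", "beer"]
ROWS = {
    1: [1, 1, 1, 0],
    2: [0, 1, 1, 0],
    3: [0, 0, 0, 1],
    4: [1, 1, 1, 0],
    5: [0, 1, 0, 0],
}
TRANSACTIONS = [
    {item for item, flag in zip(ITEMS, row) if flag} for row in ROWS.values()
]


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


print(support({"milk", "bread", "butter"}))        # 0.4
print(confidence({"milk", "bread"}, {"butter"}))   # 1.0
```

In this database every basket that contains milk and bread also contains butter, so the rule holds without exception on the five transactions.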

TECHNICAL SOLUTIONS


Decision Trees

•  Many inductive knowledge acquisition algorithms generate (“induce”) classifiers in the form of decision trees.
•  A decision tree is a simple recursive structure for expressing a sequential classification process.
   –  Leaf nodes denote classes
   –  Intermediate nodes represent tests
•  Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance.

Decision Tree: Example

{Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong}: Negative Instance

Figure taken from [1]
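Since the figure is not reproduced here, the following sketch encodes the PlayTennis tree from [1] as nested tuples and sorts the instance above down the tree. The tuple/dict encoding and the name `classify` are assumptions made for illustration.

```python
# PlayTennis tree from [1]: inner nodes are (attribute, {value: subtree}),
# leaves are plain class labels.
TREE = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})


def classify(tree, instance):
    """Follow the tests from the root until a leaf (a plain label) is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree


instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Strong"}
print(classify(TREE, instance))  # No
```

Only Outlook and Humidity are tested for this instance; Temperature and Wind never influence the result.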


Decision Trees and Rules

•  Rules can represent a decision tree:
      if item1 then subtree1
      elseif item2 then subtree2
      elseif ...
•  There are as many rules as there are leaf nodes in the decision tree.
•  Advantages of rules over decision trees:
   –  Rules are a widely used and well-understood representation formalism for knowledge in expert systems;
   –  Rules are easier to understand, modify and combine;
   –  Rules can significantly improve classification performance by eliminating unnecessary tests; and
   –  Rules make it possible to combine different decision trees for the same task.

Advantages of Decision Trees

•  Decision trees are simple to understand. People are able to understand decision tree models after a brief explanation.
•  Decision trees have a clear evaluation strategy and are easily interpreted.
•  Decision trees are able to handle both numerical and categorical data. Other techniques are usually specialised in analysing datasets that have only one type of variable, e.g. neural networks work directly only with numerical variables.

Advantages of Decision Trees (1)

•  Decision trees are a white box model. If a given situation is observable in a model, the explanation for the condition is easily expressed in Boolean logic. An example of a black box model is an artificial neural network, since the explanation for its results is too complex to be comprehended.
•  Decision trees are robust and perform well on large data in a short time. Large amounts of data can be analysed using personal computers in a time short enough to enable stakeholders to take decisions based on the analysis.

Converting a Decision Tree to Rules
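One way to sketch the conversion is to emit one rule per root-to-leaf path, so that a tree with n leaves yields n rules. The tree encoding (nested tuples, here the PlayTennis tree from [1]) and the name `tree_to_rules` are assumptions made for this illustration.

```python
# Inner nodes are (attribute, {value: subtree}); leaves are class labels.
TREE = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})


def tree_to_rules(tree, path=()):
    """Yield (conditions, label) pairs, one for each root-to-leaf path."""
    if isinstance(tree, str):          # leaf: the path so far is the antecedent
        yield list(path), tree
        return
    attribute, branches = tree
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, path + ((attribute, value),))


rules = list(tree_to_rules(TREE))
for conditions, label in rules:
    antecedent = " and ".join(f"{a} = {v}" for a, v in conditions)
    print(f"if {antecedent} then PlayTennis = {label}")
```

The tree has five leaves, so five rules are printed, for example "if Outlook = Sunny and Humidity = High then PlayTennis = No".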


RULE LEARNING TASKS


Rule learning tasks

There are two major rule learning tasks:
1.  Learning a correct rule that covers a maximal number of positive examples but no negative ones. We call this the maximal requirement for each rule.
2.  Learning a minimal rule set that covers all positive examples. We call this the minimal requirement for the rule set.

Rule learning tasks

To illustrate the two rule learning tasks, let’s consider a data set containing both positive and negative examples, as depicted in the following chart.

[Chart: positive and negative examples; both axes run from 0 to 10]

Maximal requirement for each rule

We want to learn a correct rule that covers a maximal number of positive examples but no negative examples (maximal requirement for each rule).

[Chart: positive and negative examples; both axes run from 0 to 10]

Maximal requirement for each rule

We start by deriving a rule that covers the positive example (x=3;y=3):

   if x=3 and y=3 then class = positive

[Chart: positive and negative examples; both axes run from 0 to 10]

Maximal requirement for each rule

We try to cover more positive examples by refining the rule. We need to make sure that we don’t cover negative examples. The new rule

   if x>=2 and x<=4 and y=3 then class = positive

covers the positive examples (x=2;y=3), (x=3;y=3), (x=4;y=3).

[Chart: positive and negative examples; both axes run from 0 to 10]

Maximal requirement for each rule

We try to cover more positive examples by further refining the rule. The new rule

   if x>=2 and x<=4 and y>=3 and y<=4 then class = positive

covers the positive examples (x=2;y=3), (x=3;y=3), (x=4;y=3), (x=2;y=4), (x=3;y=4), (x=4;y=4).

[Chart: positive and negative examples; both axes run from 0 to 10]

Maximal requirement for each rule

We try to cover more positive examples by further refining the rule. The new rule

   if x>=2 and x<=4 and y>=2 and y<=5 then class = positive

covers all positive examples in the lower part of the space.

[Chart: positive and negative examples; both axes run from 0 to 10]

Maximal requirement for each rule

We cannot refine the rule

   if x>=2 and x<=4 and y>=2 and y<=5 then class = positive

any further to cover more positive examples without covering negative examples as well. To cover the rest of the positive examples we need to learn a new rule.

[Chart: positive and negative examples; both axes run from 0 to 10]

Maximal requirement for each rule

Following the same approach as illustrated before, we can learn a second rule that covers the rest of the positive examples:

   if x>=5 and x<=6 and y>=6 and y<=8 then class = positive

[Chart: positive and negative examples; both axes run from 0 to 10]

Maximal requirement for each rule

The full set of positive examples is covered by the following two rules:

   if x>=2 and x<=4 and y>=2 and y<=5 then class = positive
   if x>=5 and x<=6 and y>=6 and y<=8 then class = positive

[Chart: positive and negative examples; both axes run from 0 to 10]
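The two rules can be written down as axis-parallel boxes and checked for correctness (no negative example covered) and completeness (every positive example covered). This is only a sketch: the class `BoxRule`, the helper names, and in particular the sample points are hypothetical, because the chart's actual coordinates are not recoverable from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoxRule:
    """Rule of the form: if x_min<=x<=x_max and y_min<=y<=y_max then positive."""
    x_min: float
    x_max: float
    y_min: float
    y_max: float

    def covers(self, point):
        x, y = point
        return self.x_min <= x <= self.x_max and self.y_min <= y <= self.y_max


# The two rules learned in the walkthrough.
RULES = [BoxRule(2, 4, 2, 5), BoxRule(5, 6, 6, 8)]

# Hypothetical sample points, NOT the points of the original chart.
positives = [(3, 3), (2, 4), (4, 5), (5, 6), (6, 8)]
negatives = [(1, 3), (5, 3), (4, 7), (7, 7)]


def is_correct(rules, negatives):
    """True if no rule covers a negative example."""
    return not any(r.covers(p) for r in rules for p in negatives)


def is_complete(rules, positives):
    """True if every positive example is covered by at least one rule."""
    return all(any(r.covers(p) for r in rules) for p in positives)


print(is_correct(RULES, negatives), is_complete(RULES, positives))  # True True
```

Widening the first rule to y<=7 would make it cover the hypothetical negative point (4, 7), which is the situation the walkthrough describes when a rule cannot be refined any further.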

Minimal requirement for the rule set

In general, for a decision problem, there are many possible sets of rules that cover all positive examples without covering negative examples. Possible sets of rules for our previous example are:

Solution 1:

   if x>=2 and x<=4 and y>=2 and y<=5 then class = positive
   if x>=5 and x<=6 and y>=6 and y<=8 then class = positive

[Chart: positive and negative examples; both axes run from 0 to 10]

Minimal requirement for the rule set

Solution 2:

   if x>=2 and x<=4 and y>=2 and y<=3 then class = positive
   if x>=2 and x<=4 and y>3 and y<=5 then class = positive
   if x>=5 and x<=6 and y>=6 and y<=7 then class = positive
   if x>=5 and x<=6 and y>7 and y<=8 then class = positive

[Chart: positive and negative examples; both axes run from 0 to 10]

Minimal requirement for the rule set

•  The minimal requirement for the rule set is about learning a minimal rule set that covers all positive examples.
•  The minimal requirement for the rule set is similar to the minimal set cover problem [18].
•  The minimal requirement for the rule set, which basically translates to building the optimal linear decision tree, is NP-hard [19].
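Because finding a truly minimal rule set is hard, a common heuristic for set cover problems is the greedy strategy: repeatedly pick the candidate that covers the most still-uncovered elements. The sketch below applies it to rule selection; it is an approximation and does not guarantee a minimal set. The function name, the rule names r1..r6, and the example ids are all hypothetical; r1/r2 stand for the rules of Solution 1 and r3..r6 for those of Solution 2.

```python
def greedy_rule_cover(candidates, positives):
    """Pick rules one at a time, each covering the most uncovered positives.

    `candidates` maps a rule name to the set of positive examples it covers;
    all candidate rules are assumed to be correct already.
    """
    uncovered = set(positives)
    chosen = []
    while uncovered:
        name, covered = max(candidates.items(),
                            key=lambda kv: len(kv[1] & uncovered))
        if not covered & uncovered:
            raise ValueError("candidates cannot cover all positive examples")
        chosen.append(name)
        uncovered -= covered
    return chosen


# Hypothetical positive example ids 1..6 and the rules that cover them.
candidates = {
    "r1": {1, 2, 3, 4}, "r2": {5, 6},           # Solution 1
    "r3": {1, 2}, "r4": {3, 4}, "r5": {5}, "r6": {6},  # Solution 2
}
print(greedy_rule_cover(candidates, range(1, 7)))  # ['r1', 'r2']
```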

RULE LEARNING APPROACHES


Lattice of possible rules

Rule learning approaches

•  Rule learning can be seen as a search problem in the lattice of possible rules.
•  Depending on the starting point and on the direction of search, we can distinguish three classes of rule learning approaches:
   1.  Specialization – the search procedure starts at the top of the lattice and searches towards the bottom of the lattice, towards the concrete descriptions. Specialization is top-down.
   2.  Generalization – the search procedure starts at the bottom of the lattice and advances up the lattice towards the top element. Generalization is bottom-up.
   3.  Combining specialization and generalization – the search procedure can start at any arbitrary point in the lattice and can move freely up or down as needed.

Rule learning approaches

•  Specialization and Generalization are dual search directions in a given rule set.


SPECIALIZATION


Specialization

•  Specialization algorithms start from very general descriptions and specialize them until they are correct.
•  This is done by adding additional premises to the antecedent of a rule, or by restricting the range of an attribute which is used in an antecedent.
•  Algorithms relying on specialization generally have the problem of overspecialization: previous specialization steps could become unnecessary due to subsequent specialization steps.
•  This brings along the risk of ending up with results that are not maximally general.
•  Some examples of (heuristic) specialization algorithms are the following: ID3, AQ, C4, CN2, CABRO, FOIL, or PRISM; references at the end of the lecture.

THE ID3 ALGORITHM


ID3

•  Most algorithms apply a top-down, greedy search through the space of possible trees.
   –  e.g., ID3 [5] or its successor C4.5 [6]
•  ID3
   –  Learns trees by constructing them top down.
   –  Initial question: “Which attribute should be tested at the root of the tree?” → each attribute is evaluated using a statistical test to see how well it classifies.
   –  A descendant of the root node is created for each possible value of this attribute.
   –  The entire process is repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree.

The Basic ID3 Algorithm

Code taken from [1]
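The listing from [1] is not reproduced here. As a stand-in, the following is a minimal Python sketch of the same top-down idea, run on the PlayTennis training examples from [1]. It is not the original pseudocode: the function names and the tuple-based tree encoding are choices made for this sketch, and it only creates branches for attribute values that actually occur in the examples at a node.

```python
import math
from collections import Counter

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Wind"]
# PlayTennis training examples D1..D14 from [1]; the last column is the target.
RAW = """Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No"""
EXAMPLES = [dict(zip(ATTRIBUTES + ["PlayTennis"], line.split()))
            for line in RAW.splitlines()]


def entropy(examples, target):
    """Entropy of the target label distribution, in bits."""
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def gain(examples, attribute, target):
    """Expected entropy reduction from splitting on `attribute`."""
    total = len(examples)
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder


def id3(examples, attributes, target):
    """Return a leaf label or a node (attribute, {value: subtree})."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:              # all examples agree: leaf
        return labels[0]
    if not attributes:                     # no tests left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({e[best] for e in examples}):
        subset = [e for e in examples if e[best] == value]
        branches[value] = id3(subset, remaining, target)
    return (best, branches)


tree = id3(EXAMPLES, ATTRIBUTES, "PlayTennis")
print(tree)
```

On this data the sketch places Outlook at the root, with Humidity tested under Sunny and Wind under Rain.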


Selecting the Best Classifier

•  Core question: Which attribute is the best classifier?
•  ID3 uses a statistical measure called information gain, which measures how well a given attribute separates the training examples according to their target classification.

Entropy
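The formula on this slide did not survive extraction. For reference, the definition used in [1] for a sample S with a boolean target, and its generalization to c classes, can be written as follows, where p_i is the proportion of S belonging to class i and 0 log2 0 is taken to be 0:

```latex
\mathrm{Entropy}(S) = -p_{\oplus}\log_2 p_{\oplus} - p_{\ominus}\log_2 p_{\ominus},
\qquad
\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2 p_i
```

For example, a sample with 9 positive and 5 negative examples has entropy -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.940.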


Information Gain
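The formula on this slide did not survive extraction. Following [1], the information gain of an attribute A relative to a sample S is the expected reduction in entropy caused by partitioning S according to A, where Values(A) is the set of possible values of A and S_v is the subset of S for which A has value v:

```latex
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S)
  - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
```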


Example: Which attribute is the best classifier?

Figure taken from [1]


Hypothesis Space Search in Decision Tree Learning

•  ID3 performs a simple-to-complex hill-climbing search through the space of possible decision trees (the hypothesis space).
•  Start: empty tree; progressively more elaborate hypotheses are tested in search of a decision tree that correctly classifies the training data.

Figure taken from [1]

Capabilities and Limitations of ID3

•  ID3’s hypothesis space is the complete space of finite discrete-valued functions, relative to the available attributes.
•  ID3 maintains only a single current hypothesis as it searches through the space of trees (earlier, perhaps better, versions are discarded).
•  ID3 performs no backtracking in its search; it may therefore converge to a locally optimal solution that is not globally optimal.
•  ID3 uses all training data at each step to make statistically based decisions regarding how to refine its current hypothesis.

Inductive Bias in Decision Tree Learning

•  Definition: Inductive bias is the set of assumptions that, together with the training data, deductively justifies the classifications assigned by the learner to future instances [1].
•  Central question: How does ID3 generalize from observed training examples to classify unseen instances? What is its inductive bias?
•  ID3 search strategy:
   –  ID3 chooses the first acceptable tree it encounters.
   –  ID3 selects in favour of shorter trees over longer ones.
   –  ID3 selects trees that place the attributes with the highest information gain closest to the root.
•  Approximate inductive bias of ID3: Shorter trees are preferred over larger trees. Trees that place high information gain attributes close to the root are preferred over those that do not.

ID3’s Inductive Bias

•  ID3 searches a complete hypothesis space.
•  It searches incompletely through this space, from simple to complex hypotheses, until its termination condition is met.
   –  Termination condition: Discovery of a hypothesis consistent with the data
•  ID3’s inductive bias is a preference bias (it prefers certain hypotheses over others).
•  Alternatively: A restriction bias categorically restricts the hypotheses that are considered.

Occam’s Razor

•  …or why prefer short hypotheses?
•  William of Occam was one of the first to discuss this question, around the year 1320.
•  Several arguments have been discussed since then.
•  Arguments:
   –  Based on combinatorial arguments, there are fewer short hypotheses that coincidentally fit the training data.
   –  Very complex hypotheses fail to generalize correctly to subsequent data.

ID3 by Example ([1])

–  Target attribute: PlayTennis (values: yes / no)


Example: Which attribute should be tested first in the tree?

•  ID3 determines the information gain for each candidate attribute (e.g., Outlook, Temperature, Humidity, and Wind) and selects the one with the highest information gain.
•  Gain(S, Outlook) = 0.246; Gain(S, Humidity) = 0.151; Gain(S, Wind) = 0.048; Gain(S, Temperature) = 0.029
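As a worked instance of one of these values, the gain for Wind can be reconstructed from the PlayTennis data in [1]: the full sample S has 9 positive and 5 negative examples, Wind = Weak covers 6 positive and 2 negative examples, and Wind = Strong covers 3 positive and 3 negative examples.

```latex
\begin{aligned}
\mathrm{Entropy}(S) &= -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} \approx 0.940 \\
\mathrm{Entropy}(S_{\mathrm{Weak}}) &= -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} \approx 0.811 \\
\mathrm{Entropy}(S_{\mathrm{Strong}}) &= -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1.000 \\
\mathrm{Gain}(S, \mathrm{Wind}) &= 0.940 - \tfrac{8}{14}\cdot 0.811 - \tfrac{6}{14}\cdot 1.000 \approx 0.048
\end{aligned}
```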

Example: Resulting Tree


Example: Continue Selecting New Attributes

•  Determining new attributes at the “Sunny” branch, using only the examples classified there:
•  The process continues for each new leaf node until either of two conditions is met:
   1.  Every attribute has already been included along this path through the tree, or
   2.  The training examples associated with this leaf node all have the same target attribute value (i.e., their entropy is zero).

Example: Final Decision Tree


GENERALIZATION


Generalization

•  Generalization starts from very special descriptions and generalizes them as long as they are not incorrect, i.e. in every step some unnecessary premises are deleted from the antecedent.
•  The generalization procedure stops when no more premises can be removed.
•  Generalization avoids the maximal-general issue of specialization; in fact, it guarantees most-general descriptions.
•  However, generalization risks deriving final results that are not most-specific.
•  RELAX is an example of a generalization-based algorithm; references at the end of the lecture.

THE RELAX ALGORITHM


RELAX

•  RELAX is a generalization algorithm, and proceeds as long as the resulting rule set is not incorrect.
•  The RELAX algorithm:
   –  RELAX considers every example as a specific rule that is generalized.
   –  The algorithm then starts from a first rule and relaxes the first premise.
   –  The resulting rule is tested against the negative examples.
   –  If the new (generalized) rule covers negative examples, the premise is added again, and the next premise is relaxed.
   –  A rule is considered minimal if any further relaxation would destroy the correctness of the rule.
•  The search for minimal rules starts from any not yet considered example, i.e. an example that is not covered by already discovered minimal rules.

RELAX: Illustration by an example

•  Consider the following positive example for a consequent C: (pos, (x=1, y=0, z=1))
•  This example is represented as a rule: x ∧ ¬y ∧ z → C
•  In case of no negative examples, RELAX constructs and tests the following set of rules:
   1) x ∧ ¬y ∧ z → C      5) x → C
   2) ¬y ∧ z → C          6) ¬y → C
   3) x ∧ z → C           7) z → C
   4) x ∧ ¬y → C          8) → C
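The relaxation step can be sketched as follows. This is a simplified, greedy single pass over the premises (drop a premise, keep the drop only if no negative example becomes covered), not the full enumeration of all eight candidate rules shown above, and it is not taken from the RELAX paper [17]. The function names and the negative example are hypothetical.

```python
def covers(premises, example):
    """A rule covers an example if every premise (attribute, value) holds."""
    return all(example[attr] == value for attr, value in premises)


def relax(positive, negatives):
    """Generalize one positive example by dropping premises one at a time."""
    premises = list(positive.items())   # most specific rule: all attributes
    for premise in list(premises):
        candidate = [p for p in premises if p != premise]
        if not any(covers(candidate, neg) for neg in negatives):
            premises = candidate        # relaxation is kept
        # otherwise the premise stays in the rule ("is added again")
    return premises


example = {"x": 1, "y": 0, "z": 1}
print(relax(example, negatives=[]))                           # []
print(relax(example, negatives=[{"x": 0, "y": 0, "z": 1}]))   # [('x', 1)]
```

With no negative examples every premise can be dropped, which corresponds to rule 8 (→ C). With the hypothetical negative example (x=0, y=0, z=1), the premise on x has to stay, giving rule 5 (x → C).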

COMBINING SPECIALIZATION AND GENERALIZATION


THE JOJO ALGORITHM


JoJo

•  In general, it cannot be determined which search direction is the better one.
•  JoJo is an algorithm that combines both search directions in one heuristic search procedure.
•  JoJo can start at an arbitrary point in the lattice of complexes and generalizes and specializes as long as the quality and correctness can be improved, i.e. until a local optimum is found or no more search resources are available (e.g., time, memory).

JoJo (1)

•  While specialization moves solely from ‘Top‘ towards ‘Bottom‘ and generalization from ‘Bottom‘ towards ‘Top‘, JoJo is able to move freely in the search space.
•  Either of the two strategies can be used interchangeably, which makes JoJo more expressive than comparable algorithms that apply the two in sequential order (e.g. ID3).

JoJo (2)

•  A starting point in JoJo is described by two parameters:
   –  Vertical position (length of the description)
   –  Horizontal position (chosen premises)
•  Reminder: JoJo can start at any arbitrary point, while specialization requires a highly general point and generalization requires a most specific point.
•  In general, it is possible to carry out several runs of JoJo with different starting points. Rules that were already produced can be used as subsequent starting points.

JoJo (4)

•  Criteria for choosing a vertical position:
   1.  Approximation of the possible length, or experience from earlier runs.
   2.  Random production of rules; distribution by means of the average correctness of the rules with the same length (so-called quality criterion).
   3.  Start with a small sample or very limited resources to discover a real starting point from an arbitrary one.
   4.  Randomly chosen starting point (same average expectation of success as starting with ‘Top‘ or ‘Bottom‘).
   5.  Heuristic: few positive examples and maximal-specific descriptions suggest long rules; few negative examples and maximal-generic descriptions suggest rather short rules.

JoJo Principal Components

•  JoJo consists of three components: the Specializer, the Generalizer, and the Scheduler.
•  The former two can be provided by any such components, depending on the chosen strategies and preference criteria.
•  The Scheduler is responsible for selecting the next description out of all possible generalizations and specializations available (by means of a t-preference, total preference).
•  Simple example scheduler:
   –  Specialize, if the error rate is above threshold;
   –  Otherwise, choose the best generalization with allowable error rate;
   –  Otherwise stop.
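The simple example scheduler can be sketched as a single step function plus a driver loop. This is an illustration under several assumptions that are not in the slides: rules are frozensets of (attribute, value) premises, the error rate is the share of negative examples among the covered ones, "best" generalization means the one covering the most positive examples, and all names and data below are hypothetical.

```python
def covers(rule, instance):
    return all(instance[a] == v for a, v in rule)


def error_rate(rule, examples):
    """Share of negative examples among those covered (0.0 if none covered)."""
    covered = [label for inst, label in examples if covers(rule, inst)]
    return covered.count(False) / len(covered) if covered else 0.0


def positives_covered(rule, examples):
    return sum(1 for inst, label in examples if label and covers(rule, inst))


def schedule_step(rule, pool, examples, threshold=0.0):
    """Return the next rule, or None to stop."""
    if error_rate(rule, examples) > threshold:
        # Specialize: add the premise that yields the lowest error rate.
        options = [rule | {p} for p in pool if p not in rule]
        if not options:
            return None
        return min(options, key=lambda r: (error_rate(r, examples),
                                           -positives_covered(r, examples)))
    # Otherwise: best generalization whose error rate is still allowable.
    allowed = [rule - {p} for p in rule
               if error_rate(rule - {p}, examples) <= threshold]
    if not allowed:
        return None                      # otherwise stop
    return max(allowed, key=lambda r: positives_covered(r, examples))


def run(rule, pool, examples, threshold=0.0):
    while True:
        nxt = schedule_step(rule, pool, examples, threshold)
        if nxt is None:
            return rule
        rule = nxt


# Hypothetical examples: (instance, is_positive).
examples = [({"x": 1, "y": 0, "z": 1}, True),
            ({"x": 1, "y": 1, "z": 1}, True),
            ({"x": 0, "y": 0, "z": 1}, False)]
pool = {("x", 1), ("y", 0), ("z", 1)}
print(run(frozenset({("z", 1)}), pool, examples))  # frozenset({('x', 1)})
```

Starting from the rule z → C, the scheduler first specializes (adding x), then generalizes (dropping z), and stops at x → C because dropping x would cover the negative example.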

REFINEMENT OF RULES WITH JOJO


Refinement of rules with JoJo

•  Refinement of rules refers to the modification of a given rule set based on additional examples.
•  The input to the task is a so-called hypothesis (a set of rules) and a set of old and new positive and negative examples.
•  The output of the algorithm is a refined set of rules and the total set of examples.
•  The new set of rules is correct, complete, non-redundant and (if necessary) minimal.

Refinement of rules with JoJo (1)

•  There is a four-step procedure for the refinement of rules:
   1.  Rules that become incorrect because of new examples are refined: incorrect rules are replaced by new rules that cover the positive examples, but not the new negative ones.
   2.  Complete the rule set to cover new positive examples.
   3.  Redundant rules are deleted to correct the rule set.
   4.  Minimize the rule set.
•  Steps 1 and 2 are subject to the algorithm JoJo, which integrates generalization and specialization via a heuristic search procedure.

Refinement of rules with JoJo (3)

•  Correctness:
   –  Modify overly general rules that cover too many negative examples.
   –  Replace a rule by a new set of rules that cover the positive examples, but not the negative ones.
•  Completeness:
   –  Compute new correct rules that cover the not yet considered positive examples (up to a threshold).
•  Non-redundancy:
   –  Remove rules that are more specific than other rules (i.e. rules that have premises that are a superset of the premises of another rule).
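The non-redundancy criterion translates directly into a subset test on premise sets. The sketch below assumes all rules share the same conclusion and represents each rule as a frozenset of premises; the function name and the example premises are hypothetical.

```python
def remove_redundant(rules):
    """Drop rules whose premises are a strict superset of another rule's."""
    unique = list(dict.fromkeys(rules))   # also drops exact duplicates
    return [r for r in unique if not any(other < r for other in unique)]


rules = [frozenset({"x", "z"}), frozenset({"x"}), frozenset({"y"})]
print(remove_redundant(rules))  # [frozenset({'x'}), frozenset({'y'})]
```

Here the rule with premises {x, z} is removed because the more general rule with premise {x} already covers everything it covers.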

ILLUSTRATION BY A LARGER EXAMPLE: ID3


ID3 Example: The Simpsons

Person   Hair Length   Weight   Age   Class
Homer    0”            250      36    M
Marge    10”           150      34    F
Bart     2”            90       10    M
Lisa     6”            78       8     F
Maggie   4”            20       1     F
Abe      1”            170      70    M
Selma    8”            160      41    F
Otto     10”           180      38    M
Krusty   6”            200      45    M
Comic    8”            290      38    ?

Example from: www.cs.sjsu.edu/~lee/cs157b/ID3-AllanNeymark.ppt

ID3 Example: The Simpsons

Let us try splitting on Hair Length.

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Hair Length <= 5?
   yes: Entropy(1F,3M) = -(1/4)log2(1/4) - (3/4)log2(3/4) = 0.8113
   no:  Entropy(3F,2M) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.9710

Gain(Hair Length <= 5) = 0.9911 – (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911

ID3 Example: The Simpsons

Let us try splitting on Weight.

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Weight <= 160?
   yes: Entropy(4F,1M) = -(4/5)log2(4/5) - (1/5)log2(1/5) = 0.7219
   no:  Entropy(0F,4M) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0 (taking 0 * log2(0) as 0)

Gain(Weight <= 160) = 0.9911 – (5/9 * 0.7219 + 4/9 * 0) = 0.5900

ID3 Example: The Simpsons

Let us try splitting on Age.

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Age <= 40?
   yes: Entropy(3F,3M) = -(3/6)log2(3/6) - (3/6)log2(3/6) = 1
   no:  Entropy(1F,2M) = -(1/3)log2(1/3) - (2/3)log2(2/3) = 0.9183

Gain(Age <= 40) = 0.9911 – (6/9 * 1 + 3/9 * 0.9183) = 0.0183
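The three gains can be re-computed from the nine labelled rows of the Simpsons table (Comic is left out because its class is unknown). This is a small checking sketch; the tuple layout and the name `gain_for_threshold` are choices made for this illustration.

```python
import math

# (name, hair length in inches, weight, age, class) for the labelled rows.
PEOPLE = [
    ("Homer", 0, 250, 36, "M"), ("Marge", 10, 150, 34, "F"),
    ("Bart", 2, 90, 10, "M"), ("Lisa", 6, 78, 8, "F"),
    ("Maggie", 4, 20, 1, "F"), ("Abe", 1, 170, 70, "M"),
    ("Selma", 8, 160, 41, "F"), ("Otto", 10, 180, 38, "M"),
    ("Krusty", 6, 200, 45, "M"),
]
HAIR, WEIGHT, AGE, CLASS = 1, 2, 3, 4


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(labels.count(c) / total * math.log2(labels.count(c) / total)
                for c in set(labels))


def gain_for_threshold(column, threshold):
    """Information gain of the binary test `value <= threshold`."""
    labels = [p[CLASS] for p in PEOPLE]
    yes = [p[CLASS] for p in PEOPLE if p[column] <= threshold]
    no = [p[CLASS] for p in PEOPLE if p[column] > threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (yes, no) if part)
    return entropy(labels) - remainder


for name, column, threshold in [("Hair Length", HAIR, 5),
                                ("Weight", WEIGHT, 160),
                                ("Age", AGE, 40)]:
    print(f"Gain({name} <= {threshold}) = {gain_for_threshold(column, threshold):.4f}")
```

The printed values agree with the slides to four decimals, with Weight giving by far the largest gain.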

ID3 Example: The Simpsons

Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the people who weigh 160 or less are not perfectly classified… So we simply recurse!

This time we find that we can split on Hair Length, and we are done!

Weight <= 160?
   no:  perfectly classified (Male)
   yes: recurse with Hair Length <= 2?

ID3 Example: The Simpsons

We don’t need to keep the data around, just the test conditions.

Weight <= 160?
   yes: Hair Length <= 2?
      yes: Male
      no:  Female
   no:  Male

How would these people be classified?

ID3 Example: The Simpsons

It is trivial to convert Decision Trees to rules…

Weight <= 160?
   yes: Hair Length <= 2?
      yes: Male
      no:  Female
   no:  Male

Rules to Classify Males/Females
   If Weight greater than 160, classify as Male
   Elseif Hair Length less than or equal to 2, classify as Male
   Else classify as Female
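Written as code, the three rules become a plain if/elif/else chain. The function name `classify` is an illustrative choice; the thresholds are the ones from the tree above.

```python
def classify(hair_length, weight):
    """Apply the rules derived from the Simpsons decision tree."""
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"


# The unlabelled row of the table: Comic, hair length 8", weight 290.
print(classify(hair_length=8, weight=290))  # Male
```

Age does not appear in the rules at all, since the tree never tests it.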

EXTENSIONS


Issues in Decision Tree Learning

•  How deeply to grow the decision tree?
•  Handling continuous attributes.
•  Choosing an appropriate attribute selection measure.
•  Handling training data with missing attribute values.
•  Handling attributes with differing costs.
•  Improving computational efficiency.
•  The successor of ID3, named C4.5, addresses most of these issues (see [6]).

THE C4.5 ALGORITHM


Avoiding Overfitting the Data

Impact of overfitting in a typical application of decision tree learning [1]


Example: Overfitting due to Random Noise

•  Original decision tree based on correct data:

Figure taken from [1]

•  Incorrectly labelled data leads to the construction of a more complex tree, e.g., {Outlook=Sunny, Temperature=Hot, Humidity=Normal, Wind=Strong, PlayTennis=No}
   –  ID3 would search for new refinements at the bottom of the left tree.

Strategies to Avoid Overfitting

•  Overfitting might occur based on erroneous input data or based on coincidental regularities.
•  Different types of strategies:
   –  Stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data.
   –  Allow the tree to overfit the data, and then post-prune the tree (the more successful approach).
•  Key question: How to determine the correct final tree size?
   –  Use a separate set of examples to evaluate the utility of post-pruning nodes from the tree (“Training and Validation Set” approach); two approaches applied by Quinlan: “Reduced Error Pruning” and “Rule Post-Pruning”
   –  Use all available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set.
   –  Use an explicit measure of the complexity for encoding the training examples and the decision tree, halting growth of the tree when this encoding size is minimized (“Minimum Description Length” principle, see [1] Chapter 6)

Rule Post-Pruning

•  Applied in C4.5.
•  Steps ([1]):
   1.  Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing overfitting to occur.
   2.  Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.
   3.  Prune (generalize) each rule by removing any preconditions whose removal improves its estimated accuracy.
   4.  Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
•  Rule accuracy is estimated on the training set using a pessimistic estimate: C4.5 calculates the standard deviation of the observed accuracy and takes the lower bound as the estimate of rule performance.
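Step 3 can be sketched as a greedy loop that drops one precondition at a time while the estimated accuracy improves. This sketch deliberately swaps in a simpler estimate than C4.5 uses: accuracy is measured directly on a set of examples rather than with the pessimistic lower-bound estimate. The function names and the four examples are hypothetical.

```python
def accuracy(preconditions, label, examples, target):
    """Share of covered examples with the rule's label (0.0 if none covered)."""
    covered = [e for e in examples
               if all(e[a] == v for a, v in preconditions)]
    if not covered:
        return 0.0
    return sum(e[target] == label for e in covered) / len(covered)


def prune_rule(preconditions, label, examples, target):
    """Greedily remove preconditions as long as estimated accuracy improves."""
    current = list(preconditions)
    improved = True
    while improved and current:
        improved = False
        base = accuracy(current, label, examples, target)
        for cond in current:
            candidate = [c for c in current if c != cond]
            if accuracy(candidate, label, examples, target) > base:
                current, improved = candidate, True
                break
    return current


# Hypothetical evaluation examples for a rule predicting PlayTennis = No.
examples = [
    {"Outlook": "Sunny", "Wind": "Weak", "Humidity": "High", "PlayTennis": "No"},
    {"Outlook": "Sunny", "Wind": "Weak", "Humidity": "High", "PlayTennis": "Yes"},
    {"Outlook": "Sunny", "Wind": "Strong", "Humidity": "High", "PlayTennis": "No"},
    {"Outlook": "Sunny", "Wind": "Strong", "Humidity": "High", "PlayTennis": "No"},
]
rule = [("Outlook", "Sunny"), ("Wind", "Weak"), ("Humidity", "High")]
print(prune_rule(rule, "No", examples, "PlayTennis"))
# [('Outlook', 'Sunny'), ('Humidity', 'High')]
```

Dropping the Wind precondition raises the estimated accuracy from 0.5 to 0.75 on these examples, so it is pruned; no further removal improves the estimate.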

Incorporating Continuous-Valued Attributes


Computing a Threshold
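The content of this slide did not survive extraction. Following the treatment in [1], candidate thresholds for a continuous attribute can be generated by sorting the examples by the attribute value and taking the midpoints between adjacent examples whose class differs; each candidate is then evaluated by its information gain. The sketch below only generates the candidates, using the Temperature example from [1]; the function name is an illustrative choice.

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, label_a), (b, label_b) in zip(pairs, pairs[1:])
            if label_a != label_b and a != b]


temperature = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temperature, play_tennis))  # [54.0, 85.0]
```

The two candidates correspond to the boolean tests Temperature > 54 and Temperature > 85.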


Alternative Measures for Selecting Attributes


Gain Ratio
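The formulas on this slide did not survive extraction. As defined in [1], the gain ratio penalizes attributes with many values by dividing the information gain by the split information, where S_1 ... S_c are the subsets of S obtained by partitioning S on the c values of attribute A:

```latex
\mathrm{SplitInformation}(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|}\log_2\frac{|S_i|}{|S|},
\qquad
\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInformation}(S, A)}
```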


Handling Training Examples with Missing Attribute Values


SUMMARY


Summary

•  Machine learning is a prominent topic in the field of AI.
•  Rule learning is a means to learn rules from instance data to classify unseen instances.
•  Decision tree learning can be used for concept learning, rule learning, or for learning other discrete-valued functions.
•  The ID3 family of algorithms infers decision trees by growing them from the root downward in a greedy manner.
•  ID3 searches a complete hypothesis space.
•  ID3’s inductive bias includes a preference for smaller trees; it grows trees only as large as needed.
•  A variety of extensions to basic ID3 have been developed; extensions include methods for post-pruning trees, handling real-valued attributes, accommodating training examples with missing attribute values, and using alternative selection measures.

Summary (2)

•  Rules cover positive examples and should not cover negative examples.
•  There are two main approaches for determining rules:
   –  Generalization
   –  Specialization
•  RELAX was presented as an example of a generalization algorithm.
•  JoJo combines generalization and specialization and allows the algorithm to traverse the entire search space by either generalizing or specializing rules interchangeably.
•  JoJo can also be applied to incrementally refine rule sets.

REFERENCES


References

•  Mandatory reading:
   –  [1] Mitchell, T. "Machine Learning", McGraw Hill, 1997. (Section 3)
   –  [2] Mitchell, T. "The Discipline of Machine Learning" Technical Report CMU-ML-06-108, July 2006. Available online: http://www.cs.cmu.edu/~tom/pubs/MachineLearning.pdf
   –  [3] Fensel, D. “Ein integriertes System zum maschinellen Lernen aus Beispielen” Künstliche Intelligenz (3). 1993.
•  Further reading:
   –  [4] Feigenbaum, E. A. “Expert systems in the 1980s” In: A. Bond (Ed.), State of the art report on machine intelligence. Maidenhead: Pergamon-Infotech.
   –  [5] Quinlan, J. R. “Induction of decision trees”. Machine Learning 1 (1), pp. 81-106, 1986.
   –  [6] Quinlan, J. R. “C4.5: Programs for Machine Learning” Morgan Kaufmann, 1993.
   –  [7] Fayyad, U. M. “On the induction of decision trees for multiple concept learning” PhD dissertation, University of Michigan, 1991.
   –  [8] Agrawal, Imielinski and Swami: Mining Association Rules between Sets of Items in Large Databases. ACM SIGMOD Conference, 1993, pp. 207-216.
   –  [9] Quinlan: Generating Production Rules From Decision Trees. 10th Int’l Joint Conference on Artificial Intelligence, 1987, pp. 304-307.
   –  [10] Fensel and Wiese: Refinement of Rule Sets with JoJo. European Conference on Machine Learning, 1993, pp. 378-383.
   –  [11] AQ: Michalski, Mozetic, Hong and Lavrac: The Multi-Purpose Incremental Learning System AQ15 and its Testing Application to Three Medical Domains. AAAI-86, pp. 1041-1045.
   –  [12] C4: Quinlan: Learning Logical Definitions from Relations. Machine Learning 5(3), 1990, pp. 239-266.
   –  [13] CN2: Clark and Boswell: Rule Induction with CN2: Some Recent Improvements. EWSL-91, pp. 151-163.
   –  [14] CABRO: Huyen and Bao: A method for generating rules from examples and its application. CSNDAL-91, pp. 493-504.
   –  [15] FOIL: Quinlan: Learning Logical Definitions from Relations. Machine Learning 5(3), 1990, pp. 239-266.
   –  [16] PRISM: Cendrowska: PRISM: An algorithm for inducing modular rules. Journal Man-Machine Studies 27, 1987, pp. 349-370.
   –  [17] RELAX: Fensel and Klein: A new approach to rule induction and pruning. ICTAI-91.
   –  [18] http://www.ensta.fr/~diam/ro/online/viggo_wwwcompendium/node146.html
   –  [19] Goodrich, M.T., Mirelli, V., Orletsky, M., and Salowe, J. Decision tree construction in fixed dimensions: Being global is hard but local greed is good. Technical Report TR-95-1, Johns Hopkins University, Department of Computer Science, Baltimore, MD 21218, May 1995.
•  Wikipedia links:
   –  Machine Learning, http://en.wikipedia.org/wiki/Machine_learning
   –  Rule learning, http://en.wikipedia.org/wiki/Rule_learning
   –  Association rule learning, http://en.wikipedia.org/wiki/Association_rule_learning
   –  ID3 algorithm, http://en.wikipedia.org/wiki/ID3_algorithm
   –  C4.5 algorithm, http://en.wikipedia.org/wiki/C4.5_algorithm

Next Lecture

#    Title
1    Introduction
2    Propositional Logic
3    Predicate Logic
4    Reasoning
5    Search Methods
6    CommonKADS
7    Problem-Solving Methods
8    Planning
9    Software Agents
10   Rule Learning
11   Inductive Logic Programming
12   Formal Concept Analysis
13   Neural Networks
14   Semantic Web and Services

Questions?
