10-701 Midterm Exam, Spring 2006 Solutions
1. Write your name and your email address below.
   • Name:
   • Andrew account:
2. There should be 17 numbered pages in this exam (including this cover sheet).
3. You may use any and all books, papers, and notes that you brought to the exam, but not materials brought by nearby students. Calculators are allowed, but no laptops, PDAs, or Internet access.
4. If you need more room to work out your answer to a question, use the back of the page and clearly mark on the front of the page if we are to look at what's on the back.
5. Work efficiently. Some questions are easier, some more difficult. Be sure to give yourself time to answer all of the easy ones, and avoid getting bogged down in the more difficult ones before you have answered the easier ones.
6. Note there are extra-credit subquestions. The grade curve will be made without considering students' extra credit points. The extra credit will then be used to try to bump your grade up without affecting anyone else's grade.
7. You have 80 minutes.
8. Good luck!
Question  Topic                        Max. score        Score
1         Short questions              12 + 0.52 extra
2         Regression                   12
3         kNN and Cross Validation     16
4         Decision trees and pruning   20
5         Learning theory              20 + 6 extra
6         SVM and slacks               20 + 6 extra
1 [12 points] Short questions
The following short questions should be answered with at most two sentences, and/or a picture. For the true/false questions, answer true or false. If you answer true, provide a short justification; if false, explain why or provide a small counterexample.

1. [2 points] Discuss whether MAP estimates are less prone to overfitting than MLE.

F SOLUTION: Usually, MAP is less prone to overfitting than MLE. MAP introduces a prior over the parameters, so, given prior knowledge, we can bias the values of the parameters; MLE, on the other hand, just returns the most likely parameters. Whether MAP is really less prone to overfitting depends on which prior is used: an uninformative (uniform) prior can lead to the same behavior as MLE.

2. [2 points] true/false Consider a classification problem with n attributes. The VC dimension of the corresponding (linear) SVM hypothesis space is larger than that of the corresponding logistic regression hypothesis space.

F SOLUTION: False. Since they are both linear classifiers, they have the same VC dimension.

3. [2 points] Consider a classification problem with two classes and n binary attributes. How many parameters would you need to learn with a Naive Bayes classifier? How many parameters would you need to learn with a Bayes optimal classifier?

F SOLUTION: Naive Bayes has 1 + 2n parameters: the prior P(y = T), and for every attribute x_i the two quantities p(x_i = T | y = T) and p(x_i = T | y = F). For the Bayes optimal classifier, we need to estimate p(y | x) for every configuration of the attributes. This means we have 2^n parameters.

4. [2 points] For an SVM, if we remove one of the support vectors from the training set, does the size of the maximum margin decrease, stay the same, or increase for that data set?

F SOLUTION: The margin will either increase or stay the same, because the support vectors are the ones that keep the margin from expanding. Here is an example of an increasing margin. Suppose we have one feature x ∈ R and binary class y.
The dataset consists of 3 points: (x1 , y1 ) = (−1, −1), (x2 , y2 ) = (1, 1), (x3 , y3 ) = (3, 1).
Figure 1: Example of SVM margin remaining the same after one support vector is deleted (for question 1.4).

For a standard SVM with slacks, the optimal separating hyperplane wx + b = 0 has parameters w = 1, b = 0, corresponding to a margin of 2/w = 2. The support vectors are x1 and x2. If we remove (x2, y2) from the dataset, the new optimal parameters for the separating hyperplane are w = 1/2, b = −1/2, for a new margin of 2/w = 4.
If there are redundant support vectors, the margin may stay the same; see Figure 1. Only mentioning that the margin will increase was worth 1 point. Full score was given for mentioning both possibilities.

5. [2 points] true/false In n-fold cross-validation each data point belongs to exactly one test fold, so the test folds are independent. Are the error estimates of the separate folds also independent? That is, given that the data in test folds i and j are independent, are e_i and e_j, the error estimates on test folds i and j, also independent?

F SOLUTION: False. Since a data point appears in multiple training folds, the training sets are dependent, and thus the test fold error estimates are dependent.

6. [2 points] true/false There is an a priori good choice of n for n-fold cross-validation.

F SOLUTION: False. We do not know the relation between sample size and accuracy. High n increases the correlation between training sets and decreases the variance of the estimates; how much depends on the data and the learning method.

7. [0.52 extra credit points] Which of the following songs are hits played by the B-52s:
F Love Shack
F Private Idaho
• Symphony No. 5 in C Minor, Op. 67
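To make the example in question 1.4 concrete, here is a small pure-Python sketch (not part of the original exam). For a linearly separable 1-D dataset with all negatives to the left of all positives, the max-margin boundary sits midway between the innermost points of the two classes, so the margin width is just the gap between them:

```python
def margin_1d(points):
    """Max-margin width for a separable 1-D dataset.

    points: list of (x, y) pairs with y in {-1, +1}.
    Assumes negatives lie strictly to the left of positives, in which
    case the margin width equals the gap between the innermost points.
    """
    neg = [x for x, y in points if y == -1]
    pos = [x for x, y in points if y == +1]
    assert max(neg) < min(pos), "classes must be separable, negatives on the left"
    return min(pos) - max(neg)

data = [(-1, -1), (1, +1), (3, +1)]
print(margin_1d(data))                  # 2: margin with all three points
print(margin_1d([(-1, -1), (3, +1)]))  # 4: after removing the support vector at x = 1
```

This reproduces the numbers in the solution: removing the support vector (x2, y2) = (1, 1) grows the margin from 2 to 4.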
2 [12 points] Regression
For each of the following questions, you are given the same data set. Your task is to fit a smooth function to this data set using several regression techniques. Please answer all questions qualitatively, drawing the functions in the respective figures. 1. [3 points] Show the least squares fit of a linear regression model Y = aX + b.
[Figure: scatter plot of the data set, with X1 on the horizontal axis (0 to 1) and Y on the vertical axis (0 to 1); the fitted line is to be drawn on this plot.]
2. [3 points] Show the fit using Kernel regression with a Gaussian kernel and an appropriately chosen bandwidth.

[Figure: the same scatter plot, for sketching the kernel regression fit.]
3. [3 points] Show the fit using Kernel local linear regression for an appropriately chosen bandwidth.

[Figure: the same scatter plot, for sketching the kernel local linear regression fit.]
4. [3 points] Suggest a linear regression model Y = Σ_i w_i φ_i(X) which fits the data well. Why might you prefer this model to the kernel local linear regression model from part 3?

F SOLUTION: An appropriate choice would be to fit a polynomial of degree three, i.e. Y = w0 + w1 X + w2 X^2 + w3 X^3. This choice seems to fit the overall trend in the data well. The advantage of this approach over kernel local linear regression is that only four parameters need to be estimated for making predictions. For kernel local linear regression, all the data has to be remembered, leading to high memory and computational requirements.
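The Gaussian kernel regression asked for in question 2.2 (the Nadaraya-Watson estimator) can be sketched in a few lines of Python. This is not part of the original exam, and the data points below are made up, since the exam's scatter plot coordinates are not reproduced here:

```python
import math

def nw_kernel_regression(x_query, xs, ys, bandwidth):
    """Nadaraya-Watson kernel regression with a Gaussian kernel:
    the prediction is a weighted average of the training targets,
    with weight exp(-0.5 * ((x - xi) / h)^2) on each point."""
    weights = [math.exp(-0.5 * ((x_query - xi) / bandwidth) ** 2) for xi in xs]
    return sum(w * yi for w, yi in zip(weights, ys)) / sum(weights)

# Hypothetical data roughly following a smooth cubic-like trend.
xs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
ys = [0.30, 0.45, 0.55, 0.60, 0.58, 0.50, 0.45, 0.50, 0.65]

print(round(nw_kernel_regression(0.45, xs, ys, bandwidth=0.1), 3))
```

Note the memory cost the solution mentions: every prediction touches all training points, whereas the cubic fit needs only its four coefficients.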
3 [16 points] k-nearest neighbor and cross-validation

In the following questions you will consider a k-nearest neighbor classifier using the Euclidean distance metric on a binary classification task. We assign the class of the test point to be the class of the majority of the k nearest neighbors. Note that a point can be its own neighbor.
[Figure 2: scatter plot of the 2-D training set, with positive (+) and negative (−) points interleaved.]
Figure 2: F SOLUTION: 1-nearest neighbor decision boundary.

1. [3 points] What value of k minimizes the training set error for this dataset? What is the resulting training error?

F SOLUTION: Note that a point can be its own neighbor. So k = 1 minimizes the training set error. The error is 0.

2. [3 points] Why might using too large values of k be bad in this dataset? Why might too small values of k also be bad?

F SOLUTION: Too big a k (k = 13) misclassifies every datapoint (using leave-one-out cross-validation). Too small a k leads to overfitting.

3. [6 points] What value of k minimizes leave-one-out cross-validation error for this dataset? What is the resulting error?
F SOLUTION: k = 5 or k = 7 minimizes the leave-one-out cross-validation error. The error is 4/14.

4. [4 points] In Figure 2, sketch the 1-nearest neighbor decision boundary for this dataset.

F SOLUTION: See Figure 2.
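Leave-one-out cross-validation for k-NN, as used in questions 3.1-3.3, can be sketched as follows. This is not part of the original exam, and the coordinates below are hypothetical (the exam's Figure 2 points are not reproduced here):

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points
    (squared Euclidean distance; use odd k to avoid ties)."""
    by_dist = sorted(train,
                     key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

def loocv_error(data, k):
    """Leave-one-out CV error: hold out each point, train on the rest."""
    mistakes = sum(1 for i, (point, label) in enumerate(data)
                   if knn_predict(data[:i] + data[i + 1:], point, k) != label)
    return mistakes / len(data)

# Hypothetical 2-D dataset: class +1 in the lower band, -1 in the upper band.
data = [((0.10, 0.10), 1), ((0.20, 0.90), -1), ((0.80, 0.20), 1), ((0.90, 0.90), -1),
        ((0.15, 0.20), 1), ((0.25, 0.80), -1), ((0.85, 0.30), 1), ((0.80, 0.85), -1)]

for k in (1, 3, 5):
    print(k, loocv_error(data, k))
```

Note the distinction the solutions draw: for *training* error the held-in point is its own nearest neighbor, so k = 1 gives error 0, while LOOCV removes the point from its own training set and gives an honest estimate.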
4 [20 points] Decision trees and pruning

You get the following data set:

V  W  X  Y
0  0  0  0
0  1  0  1
1  0  0  1
1  1  0  0
1  1  1  0
Your task is to build a decision tree for classifying variable Y. (You can think of the data set as replicated many times, i.e. overfitting is not an issue here.)

1. [6 points] Compute the information gains IG(Y | V), IG(Y | W) and IG(Y | X). Remember, information gain is defined as

IG(A | B) = H(A) − Σ_{b∈B} P(B = b) H(A | B = b),

where

H(A) = −Σ_{a∈A} P(A = a) log2 P(A = a)

is the entropy of A and

H(A | B = b) = −Σ_{a∈A} P(A = a | B = b) log2 P(A = a | B = b)

is the conditional entropy of A given B = b. Which attribute would ID3 select first?

F SOLUTION: We calculate H(Y) = 0.97, H(Y | V) = H(Y | W) = 0.95, and H(Y | X) = 0.8, so the information gains are IG(Y | V) = IG(Y | W) = 0.02 and IG(Y | X) = 0.17. So attribute X is selected first.

2. [3 points] Write down the entire decision tree constructed by ID3, without pruning.
[Figure 3: (a) the full ID3 tree: the root splits on X, and the X = 0 branch splits on V and then on W. (b) The tree after top-down pruning: a single split on X with a leaf on each branch.]
Figure 3: Solutions to questions 4.3, 4.4, 4.6.

F SOLUTION: A full tree is constructed. First we split on X. Given a split on X, the information gains for V and W are 0, so we split on either of them (let's say V). Last we split on W (information gain is 1). Figure 3(a) gives the solution.

3. [3 points] One idea for pruning would be to start at the root, and prune splits for which the information gain (or some other criterion) is less than some small ε. This is called top-down pruning. What is the decision tree returned for ε = 0.0001? What is the training set error for this tree?

F SOLUTION: After splitting on X, the information gain of both V and W is 0. So we will prune V and set the prediction to either Y = 0 or Y = 1. In either case we make 2 errors. Figure 3(b) gives the tree.

4. [3 points] Another option would be to start at the leaves, and prune subtrees for which the information gain (or some other criterion) of a split is less than some small ε. In this method, no ancestors of children with high information gain will get pruned. This is called bottom-up pruning. What is the tree returned for ε = 0.0001? What is the training set error for this tree?

F SOLUTION: Note that Y = V xor W. So when splitting on V the information gain of V is IG(Y | V) = 0, and a step later, when splitting on W, the information gain of W is IG(Y | W) = 1. So bottom-up pruning won't delete any nodes, and the tree remains the same, Figure 3(a).

5. [2 points] Discuss when you would want to choose bottom-up pruning over top-down pruning and vice versa. Compare the classification accuracy and computational complexity of both types of pruning.
[Figure 4: a tree of height 2 computing Y = V xor W: the root splits on V, and each branch splits on W.]
Figure 4: Optimal decision tree for our dataset.

F SOLUTION: Top-down is computationally cheaper: when building the tree we can determine when to stop (no need for real pruning). But, as we saw, top-down pruning prunes too much. On the other hand, bottom-up pruning is more expensive, since we have to first build a full tree, which can be exponentially large, and then apply pruning. The other problem is that in the lower levels of the tree the number of examples in the subtree gets smaller, so information gain might be an inappropriate criterion for pruning. One would usually use a statistical test (the p-values discussed in class) instead.

6. [3 points] What is the height of the tree returned by ID3 with bottom-up pruning? Can you find a tree with smaller height which also perfectly classifies Y on the training set? What conclusions does that imply about the performance of the ID3 algorithm?

F SOLUTION: Bottom-up pruning returns a tree of height 3. A smaller tree that has zero error is given in Figure 4. ID3 is a greedy algorithm: it looks only one step ahead and picks the best split. So, for instance, it cannot find the optimal tree for the XOR dataset, which is the case in our dataset.
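The entropy and information-gain numbers in question 4.1 can be reproduced with a short script (not part of the original exam):

```python
import math

def entropy(labels):
    """H = -sum_v p(v) log2 p(v) over the empirical label distribution."""
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def info_gain(rows, attr, target):
    """IG(target | attr) = H(target) - sum_b P(attr=b) H(target | attr=b)."""
    n = len(rows)
    gain = entropy([r[target] for r in rows])
    for b in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == b]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# The dataset from question 4.
rows = [dict(V=0, W=0, X=0, Y=0), dict(V=0, W=1, X=0, Y=1),
        dict(V=1, W=0, X=0, Y=1), dict(V=1, W=1, X=0, Y=0),
        dict(V=1, W=1, X=1, Y=0)]

for attr in ("V", "W", "X"):
    print(attr, round(info_gain(rows, attr, "Y"), 2))
# prints: V 0.02, W 0.02, X 0.17 -- so ID3 splits on X first
```

This confirms the greedy first split on X, even though the height-2 tree in Figure 4 (which never splits on X) classifies the data perfectly.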
5 [20 + 6 points] Learning theory

5.1 [8 points] Sample complexity
Consider the following hypothesis class: 3-SAT formulas over n attributes with k clauses. A 3-SAT formula is a conjunction (AND, ∧) of clauses, where each clause is a disjunction (OR, ∨) of three attributes; the attributes may appear positively or negated (¬) in a clause, and an attribute may appear in many clauses. Here is an example over 10 attributes, with 5 clauses:

(X1 ∨ ¬X2 ∨ X3) ∧ (¬X2 ∨ X4 ∨ ¬X7) ∧ (X3 ∨ ¬X5 ∨ ¬X9) ∧ (¬X7 ∨ ¬X6 ∨ ¬X10) ∧ (X5 ∨ X8 ∨ X10).

You are hired as a consultant for a new company called FreeSAT.com, who wants to learn 3-SAT formulas from data. They tell you: "We are trying to learn 3-SAT formulas for secret widget data. All we can tell you is that the true hypothesis is a 3-SAT formula in the hypothesis class, and our top-secret learning algorithm always returns a hypothesis consistent with the input data. Here is your job: we give you an upper bound ε > 0 on the amount of true error we are willing to accept. We know that this machine learning stuff can be kind of flaky and the hypothesis you provide may not always be good, but it can only be bad with probability at most δ > 0. We really want to know how much data we need." Please provide a bound on the amount of data required to achieve this goal. Try to make your bound as tight as possible. Justify your answer.

F SOLUTION: As discussed in lecture,

P(error_true(h) > ε) ≤ |H| e^{−mε}.

By rearranging the terms, we obtain that |H| e^{−mε} < δ if and only if

m > (1/ε)(ln |H| + ln(1/δ)).     (1)

Thus, we only need to determine the number of possible hypotheses |H| the learner can choose from. Each attribute may appear either positively or negated, and may only appear once in a clause. Hence, there are 2n · 2(n − 1) · 2(n − 2) = 8n(n − 1)(n − 2) possible clauses. Since there are a total of k clauses in each formula, |H| = (8n(n − 1)(n − 2))^k. Plugging |H| back into Equation 1, we obtain

m > (1/ε)(k ln(8n(n − 1)(n − 2)) + ln(1/δ)).     (2)

COMMON MISTAKE 1: Some people claimed that |H| = 2^(n choose 3), reasoning that there are (n choose 3) clauses, and that each clause can either appear or not appear in the formula. However, in this question each variable can appear negated, and there are exactly k clauses in the formula.

COMMON MISTAKE 2: A few people used the bound based on the Chernoff inequality. While this is a valid bound, it is not tight (we took off 1 point).
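Equation 2 is easy to evaluate numerically. As a sketch (not part of the original exam), here is the bound as a function, evaluated at illustrative values of n, k, ε, and δ:

```python
import math

def sample_bound(n, k, eps, delta):
    """m > (1/eps) * (ln|H| + ln(1/delta)) for a consistent learner,
    with |H| = (8 n (n-1) (n-2))^k, the number of k-clause 3-SAT
    formulas over n attributes (Equation 2)."""
    ln_H = k * math.log(8 * n * (n - 1) * (n - 2))
    return math.ceil((ln_H + math.log(1 / delta)) / eps)

# e.g. n = 10 attributes, k = 5 clauses (as in the example formula),
# with eps = 0.1 and delta = 0.05:
print(sample_bound(10, 5, 0.1, 0.05))
```

Because the bound depends only logarithmically on |H| and on 1/δ, halving ε costs far more data than halving δ.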
Figure 5: Figures for Question 5.2. (a) three points; (b) four points.
5.2 [12 points] VC dimension
Consider the hypothesis class of rectangles, where everything inside the rectangle is labeled as positive. A rectangle is defined by the bottom left corner (x1, y1) and the top right corner (x2, y2), where x2 > x1 and y2 > y1. A point (x, y) is labeled as positive if and only if x1 ≤ x ≤ x2 and y1 ≤ y ≤ y2. In this question, you will determine the VC dimension of this hypothesis class.

1. [3 points] Consider the three points in Figure 5(a). Show that rectangles can shatter these three points.

F SOLUTION: We need to verify that for any assignment of labels to the points, there is a rectangle that covers the positively labeled points (and no others). Figure 6 illustrates some possible assignments and corresponding rectangles; the remaining cases follow by symmetry.

2. [3 points] Consider the four points in Figure 5(b). Show that rectangles cannot shatter these four points.

F SOLUTION: Assign positive labels to opposite vertices of the square formed by the 4 points and negative labels to the other two vertices; see Figure 7(a). A rectangle can cover two points that are vertically or horizontally aligned, but not two along a diagonal.
Figure 6: Solution to Question 5.2.1 (rectangles covering exactly the positively labeled points, for several labelings of the three points; the remaining cases follow by symmetry).
Figure 7: Solutions to Questions 5.2.2 and 5.2.3. (a) 4 points that cannot be shattered with rectangles; (b) 4 points that can be shattered with rectangles.

3. [3 points] The VC dimension of a hypothesis space is defined in terms of the largest number of input points that can be shattered, where the "hypothesis" gets to pick the locations, and an opponent gets to pick the labels. Thus, even though you showed in Item 2 that rectangles cannot shatter the four points in Figure 5(b), the VC dimension of rectangles is actually equal to 4. Prove that rectangles have a VC dimension of at least 4 by showing a position of four points that can be shattered by rectangles. Justify your answer.

F SOLUTION: The points in Figure 7(b) can always be correctly classified, as illustrated in Figure 8.
COMMON MISTAKE: Some people merely showed a configuration and one labeling of points that is correctly classified by rectangles. This is incorrect: instead, we need to show a configuration such that for every labeling of the points, there is a consistent hypothesis (rectangle).
Figure 8: Solution to Question 5.2.3.

4. [3 points] So far, you have proved that the VC dimension of rectangles is at least 4. Prove that the VC dimension is exactly 4 by showing that there is no set of 5 points which can be shattered by rectangles.

F SOLUTION: Given any five points in the plane, the following algorithm assigns a label to each point such that no rectangle can classify the points correctly: find a point with minimum x coordinate and a point with maximum x coordinate, and assign these points a positive label. Similarly, find a point with minimum y coordinate and a point with maximum y coordinate, and assign these points a positive label. Assign the remaining point(s) a negative label. Any rectangle that classifies the positively labeled points correctly must contain all five points, and hence would not correctly classify the point(s) with a negative label.
COMMON MISTAKE: Some people provided a partial proof based on the convex hull of the points, arguing that if there is a point inside the hull, we can assign this point a negative label and assign all the other points a positive label. While this statement is true, it still needs to be shown what the labeling ought to be when all five points lie on the boundary of the convex hull.
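The adversarial labeling from the question 5.2.4 proof is mechanical enough to code up. Here is a sketch (not part of the original exam; the five points are arbitrary illustrations): label the extreme points positive, label the rest negative, and observe that the smallest rectangle containing the positives also traps a negative point.

```python
def adversarial_labels(points):
    """Label the extreme points (min/max x, min/max y) positive and the
    remaining point(s) negative, as in the question 5.2.4 argument."""
    extremes = {min(points, key=lambda p: p[0]), max(points, key=lambda p: p[0]),
                min(points, key=lambda p: p[1]), max(points, key=lambda p: p[1])}
    return {p: (p in extremes) for p in points}

def rectangle_consistent(labels):
    """Can some axis-aligned rectangle contain exactly the positives?
    The smallest candidate is the bounding box of the positive points;
    it is consistent iff it contains no negative point."""
    pos = [p for p, lab in labels.items() if lab]
    x_lo, x_hi = min(p[0] for p in pos), max(p[0] for p in pos)
    y_lo, y_hi = min(p[1] for p in pos), max(p[1] for p in pos)
    return all(not (x_lo <= x <= x_hi and y_lo <= y <= y_hi)
               for (x, y), lab in labels.items() if not lab)

pts = [(0.0, 0.0), (4.0, 1.0), (1.0, 4.0), (3.0, 3.0), (2.0, 2.0)]  # any 5 points
print(rectangle_consistent(adversarial_labels(pts)))  # prints False
```

The bounding box of the extreme points contains every one of the five points, so the non-extreme (negative) point always falls inside it, exactly as the proof argues.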
Figure 9: Solution to 5.2.5. (a) 5 points that can be correctly classified with signed rectangles. (b)-(d) Labeling of any 6 points that cannot be correctly classified with signed rectangles.

5. Extra credit: [6 points] Now consider signed rectangles, where, in addition to defining the corners, you get to define whether everything inside the rectangle is labeled as positive or as negative. What is the VC dimension of this hypothesis class? Prove tight upper and lower bounds: if your answer is k, show that you can shatter k points and also show that k + 1 points cannot be shattered.

F SOLUTION: We will show that k = 5. To prove the lower bound, consider the configuration of points in Figure 9(a). It can be easily verified that this configuration can be classified correctly by signed rectangles for any labeling of the points; hence signed rectangles can shatter 5 points.

The proof of the upper bound is based on the following idea. Consider 6 points in an arbitrary position, as illustrated in Figure 9(b). Suppose that we are able to split these 6 points into two sets V, W, so that the minimal rectangle that covers the set V includes at least one point from the set W and vice versa. For example, in Figure 9(b), we can let V = {1, 2, 3} and W = {4, 5, 6}. Then, if we label the points in V as positive and the points in W as negative, no signed rectangle can classify the points correctly.

How do we obtain such a partition? Similarly to the solution to part 4, let

• x_L = min_i x_i denote the x coordinate of a leftmost point L,
• x_R = max_i x_i denote the x coordinate of a rightmost point R,
• y_B = min_i y_i denote the y coordinate of a lowest point B, and
• y_T = max_i y_i denote the y coordinate of a topmost point T.

Take any two of the remaining points, say, 1 and 2; let 1 denote the point with the smaller x coordinate, i.e. x_1 ≤ x_2. If y_1 ≤ y_2, let V = {L, B, 2} and W = {R, T, 1}; see Figure 9(c). Otherwise, if y_1 > y_2, let V = {R, B, 1} and W = {L, T, 2}; see Figure 9(d).
6 [20 + 6 points] SVM and slacks
Consider a simple classification problem: there is one feature x with values in R, and class y can be −1 or 1. You have 2 data points: (x1, y1) = (1, 1) and (x2, y2) = (−1, −1).

1. [4 points] For this problem write down the QP problem for an SVM with slack variables and hinge loss. Denote the weight for the slack variables C, and let the equation of the decision boundary be wx + b = 0.

F SOLUTION: The general problem formulation is

min w^T w + C Σ_i ξ_i
subject to y_i(w x_i + b) ≥ 1 − ξ_i, ξ_i ≥ 0.     (3)

In our case w is one-dimensional, so w^T w = w^2. Plugging in the values for x_i and y_i we get

min w^2 + C(ξ1 + ξ2)
subject to 1 · (1 · w + b) ≥ 1 − ξ1, −1 · (−1 · w + b) ≥ 1 − ξ2, ξ1 ≥ 0, ξ2 ≥ 0,

which simplifies to

min w^2 + C(ξ1 + ξ2)
subject to w + b ≥ 1 − ξ1, w − b ≥ 1 − ξ2, ξ1 ≥ 0, ξ2 ≥ 0.     (4)
COMMON MISTAKE 1: Many people stopped after writing down the general SVM formulation (Equation 3) or its dual. We gave 2 points out of 4 for that.

2. [6 points] It turns out that the optimal w is w* = min(C, 1). Find the optimal b as a function of C. Hint: for some values of C there will be an interval of optimal b's that are equally good.

F SOLUTION: First let us write down the constraints on b from Equation 4:

w + b ≥ 1 − ξ1 and w − b ≥ 1 − ξ2  ⇒  1 − ξ1 − w ≤ b ≤ −1 + ξ2 + w.     (5)

We know w* = min(C, 1), so we only need conditions on the ξ_i. Adding the constraints

w* + b ≥ 1 − ξ1 and w* − b ≥ 1 − ξ2  ⇒  ξ1 + ξ2 ≥ 2 − 2w*.

We also know that ξ_i ≥ 0 and that we are minimizing w^2 + C(ξ1 + ξ2) with C > 0. Since w = w* is fixed, we need to minimize C(ξ1 + ξ2), therefore

ξ1 + ξ2 = max{2 − 2w*, 0} = max{2 − 2 min{C, 1}, 0}.

Suppose C ≥ 1. Then w* = 1 and ξ1 + ξ2 = max{2 − 2w*, 0} = 0, so ξ1 = ξ2 = 0, and Equation 5 gives us

1 − ξ1 − w* ≤ b ≤ −1 + ξ2 + w*  ⇒  1 − w* ≤ b ≤ −1 + w*  ⇒  0 ≤ b ≤ 0  ⇒  b = 0.

Now suppose C < 1. Then w* = C and 2 − 2w* = 2(1 − C) > 0, so

ξ1 + ξ2 = max{2 − 2w*, 0} = 2 − 2C  ⇒  ξ2 = 2 − 2C − ξ1,     (6)

and Equation 5 gives us

1 − ξ1 − C ≤ b ≤ −1 + ξ2 + C  ⇒  1 − ξ1 − C ≤ b ≤ −1 + 2 − 2C − ξ1 + C  ⇒  1 − ξ1 − C ≤ b ≤ 1 − ξ1 − C  ⇒  b = 1 − C − ξ1.     (7)

So, where are the intervals of equally good b's that we were promised in the beginning? Equation 6, along with ξ_i ≥ 0, gives us flexibility in choosing ξ1: all ξ1 ∈ [0, 2 − 2C] are equally good. Therefore

b ∈ [1 − C − (2 − 2C), 1 − C] = [−1 + C, 1 − C].

To conclude, we have shown that the optimal b is

b = 0 for C ≥ 1, and b ∈ [−1 + C, 1 − C] for 0 < C ≤ 1.
3. [4 points] Suppose that C < 1 and you have chosen a hyperplane x w* + b* = 0 with b* = 0 as a solution. Now a third point, (x3, 1), is added to your dataset. Show that if x3 > 1/C then the old parameters (w*, b*) achieve the same value of the objective function

w^2 + C Σ_i ξ_i

for the 3-point dataset as they did for the 2-point dataset.

F SOLUTION: Because the new criterion is

new criterion = w^2 + C(ξ1 + ξ2 + ξ3) = old criterion + C ξ3,

we only need to show that the corresponding constraint

y3(w* x3 + b*) ≥ 1 − ξ3     (8)

is inactive, i.e. ξ3 = 0. Plugging in w* = C and b* = 0, we get

C x3 ≥ 1 − ξ3.

But x3 > 1/C implies C x3 > 1, so constraint (8) is satisfied with ξ3 = 0, qed.
4. [6 points] Now, in the same situation as in part 3, assume x3 ∈ [1, 1/C]. Show that there exists a b3* such that (w*, b3*) achieves the same value of the objective function for the 3-point dataset as (w*, b*) achieves for the 2-point dataset. Hint: consider b3* such that the positive canonical hyperplane contains x3.

F SOLUTION: We can no longer show that the new constraint y3(w* x3 + b*) ≥ 1 − ξ3 is inactive given the old value of b*, so let us use the hint and consider b3* such that the positive canonical hyperplane contains x3:

y3(w* x3 + b3*) = 1  ⇒  b3* = 1 − C x3.

Using Equation 7 we get

b3* = 1 − C − ξ1  ⇒  ξ1 = 1 − C − (1 − C x3) = C(x3 − 1)  ⇒  ξ2 = 2 − 2C − ξ1 = 2 − C(1 + x3),

and we can check that ξ_i ≥ 0 holds:

x3 ≥ 1  ⇒  ξ1 = C(x3 − 1) ≥ 0,
x3 ≤ 1/C  ⇒  ξ2 = 2 − C(1 + x3) ≥ 1 − C > 0,

and the value of the new objective function is

w*^2 + C(ξ1 + ξ2 + ξ3) = C^2 + C(C(x3 − 1) + 2 − C(1 + x3) + 0) = C^2 + 2C(1 − C),

whereas the old objective function value was

w*^2 + C(ξ1 + ξ2) = {using (6) and (7)} = C^2 + C(2 − 2C) = C^2 + 2C(1 − C),

qed.

5. Extra credit: [6 points] Solve the QP problem that you wrote in part 1 for the optimal w. Show that the optimal w is w* = min(C, 1). Hint: pay attention to which constraints will be tight. It is useful to temporarily denote ξ1 + ξ2 by t. Solve the constraints for t and plug into the objective. Do a case analysis of when the constraint for t in terms of C will be tight.

F SOLUTION: From Equation 4 we have

w + b ≥ 1 − ξ1 and w − b ≥ 1 − ξ2,
and adding the two constraints gives

w ≥ 1 − (ξ1 + ξ2)/2.

Let us show that this constraint is in fact tight. We need to minimize w^2. If (ξ1 + ξ2)/2 ≤ 1, then 1 − (ξ1 + ξ2)/2 ≥ 0, so the minimal w^2 is achieved when w = 1 − (ξ1 + ξ2)/2.

Suppose now that (ξ1 + ξ2)/2 > 1. Then the optimal value of w is 0, and the conditions on the ξ_i become

b ≥ 1 − ξ1 and −b ≥ 1 − ξ2.

We can then set b = 0 and ξ1_new = ξ2_new = 1 to achieve an optimization criterion value of

w^2 + C(ξ1_new + ξ2_new) = 0 + 2C = 2C

instead of the old value of

w^2 + C(ξ1 + ξ2) = 0 + C(ξ1 + ξ2) > 2C,

so (ξ1 + ξ2)/2 > 1 cannot be the optimal solution to the problem.

Now we know that w = 1 − (ξ1 + ξ2)/2. Denote t ≡ ξ1 + ξ2. The optimization criterion becomes

min (w^2 + C(ξ1 + ξ2)) = min ((1 − t/2)^2 + Ct).     (9)

Take the derivative with respect to t and set it to 0:

2(1 − t/2)(−1/2) + C = 0  ⇒  t/2 = 1 − C,

so the lowest point of (9) is achieved by t = 2(1 − C). Now we need one last observation: because ξ_i ≥ 0, we have the constraint t ≥ 0. Therefore

t = max{2(1 − C), 0},

so

w = 1 − t/2 = 1 − max{1 − C, 0} = 1 + min{C − 1, 0} = min{C, 1},

qed. This was a really hard question, especially since you did not have a lot of time on the midterm, but then it was an extra credit part.
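As a sanity check on this derivation (not part of the original solutions), a brute-force grid search over (w, b), with the slacks set to their optimal hinge values ξ_i = max(0, 1 − y_i(w x_i + b)), recovers w* = min(C, 1) for this 2-point dataset:

```python
def svm_objective(w, b, C, data):
    """w^2 + C * sum_i max(0, 1 - y_i (w x_i + b)):
    the QP objective with each slack at its optimal (hinge) value."""
    return w * w + C * sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in data)

def best_w(C, data, step=0.01):
    """Grid search over w, b in [-2, 2]; return the minimizing w."""
    grid = [i * step for i in range(-200, 201)]
    return min(((svm_objective(w, b, C, data), w) for w in grid for b in grid))[1]

data = [(1, 1), (-1, -1)]
for C in (0.25, 0.5, 2.0):
    print(C, best_w(C, data))  # should be close to min(C, 1)
```

For C < 1 the margin is traded against the slack penalty (w* = C); once C ≥ 1 the hard-margin solution w* = 1 wins.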