40 Questions to Test a Data Scientist on Clustering Techniques (Skill Test Solutions)

Introduction

The idea of creating machines that learn by themselves has been driving humans for decades now. Unsupervised learning and clustering are the key to fulfilling that dream. Unsupervised learning provides more flexibility, but is more challenging as well.

Clustering plays an important role in drawing insights from unlabeled data. It classifies the data into similar groups, which improves various business decisions by providing a meta understanding.

In this skill test, we tested our community on clustering techniques. A total of 1566 people registered for this skill test. If you missed taking the test, here is your opportunity to find out how many questions you could have answered correctly.

If you are just getting started with Unsupervised Learning, here are some comprehensive resources to assist you in your journey:

  • Machine Learning Certification Course for Beginners
  • The Most Comprehensive Guide to K-Means Clustering You'll Ever Need

  • Certified AI & ML BlackBelt+ Program

Overall Results

Below is the distribution of scores; this will help you evaluate your performance:

You can access your performance here. More than 390 people participated in the skill test and the highest score was 33. Here are a few statistics about the distribution.

Overall distribution

Mean Score: 15.11

Median Score: 15

Mode Score: 16

Helpful Resources

An Introduction to Clustering and different methods of clustering

Getting your clustering right (Part I)

Getting your clustering right (Part II)

Questions & Answers

Q1. Movie recommendation systems are an example of:

  1. Classification
  2. Clustering
  3. Reinforcement Learning
  4. Regression

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. 1 and 3

E. 2 and 3

F. 1, 2 and 3

H. 1, 2, 3 and 4

Q2. Sentiment Analysis is an example of:

  1. Regression
  2. Classification
  3. Clustering
  4. Reinforcement Learning

Options:

A. 1 only

B. 1 and 2

C. 1 and 3

D. 1, 2 and 3

E. 1, 2 and 4

F. 1, 2, 3 and 4

Q3. Can decision trees be used for performing clustering?

A. True

B. False

Q4. Which of the following is the most appropriate strategy for data cleaning before performing clustering analysis, given a less than desirable number of data points:

  1. Capping and flooring of variables
  2. Removal of outliers

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. None of the above

Q5. What is the minimum no. of variables/features required to perform clustering?

A. 0

B. 1

C. 2

D. 3

Q6. For two runs of K-Means clustering, is it expected to get the same clustering results?

A. Yes

B. No

Solution: (B)

The K-Means clustering algorithm converges to local minima, which might also correspond to the global minimum in some cases, but not always. Therefore, it's advised to run the K-Means algorithm multiple times before drawing inferences about the clusters.

However, note that it's possible to get the same clustering results from K-Means by setting the same seed value for each run. But that is done by simply making the algorithm choose the same set of random numbers for each run.
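
A minimal sketch (not from the original article) of both behaviours, assuming scikit-learn and a made-up two-blob dataset: multiple restarts via n_init reduce the risk of a bad local minimum, while a fixed random_state reproduces the same clustering on every run.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two loose blobs (illustrative values only).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# n_init=10 runs K-Means from 10 random initializations and keeps the
# lowest-inertia run, which lowers the risk of a poor local minimum.
km_multi = KMeans(n_clusters=2, n_init=10).fit(X)

# Fixing random_state makes every run draw the same random numbers,
# so repeated fits reproduce the same clustering.
km_a = KMeans(n_clusters=2, n_init=1, random_state=42).fit(X)
km_b = KMeans(n_clusters=2, n_init=1, random_state=42).fit(X)
print(np.array_equal(km_a.labels_, km_b.labels_))  # True
```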

Q7. Is it possible that the assignment of observations to clusters does not change between successive iterations in K-Means?

A. Yes

B. No

C. Can't say

D. None of these

Solution: (A)

When the K-Means algorithm has reached the local or global minimum, it will not change the assignment of data points to clusters for two successive iterations.

Q8. Which of the following can act as possible termination conditions in K-Means?

  1. For a fixed number of iterations.
  2. Assignment of observations to clusters does not change between iterations. Except for cases with a bad local minimum.
  3. Centroids do not change between successive iterations.
  4. Terminate when RSS falls below a threshold.

Options:

A. 1, 3 and 4

B. 1, 2 and 3

C. 1, 2 and 4

D. All of the above

Solution: (D)

All four conditions can be used as possible termination conditions in K-Means clustering (see the sketch after this list):

  1. This condition limits the runtime of the clustering algorithm, but in some cases the quality of the clustering will be poor because of an insufficient number of iterations.
  2. Except for cases with a bad local minimum, this produces a good clustering, but runtimes may be unacceptably long.
  3. This also ensures that the algorithm has converged at the minimum.
  4. Terminate when RSS falls below a threshold. This criterion ensures that the clustering is of a desired quality after termination. Practically, it's a good practice to combine it with a bound on the number of iterations to guarantee termination.
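
A rough NumPy sketch (not the article's code) of a K-Means loop that combines the four termination conditions above; the tolerance, iteration cap and Forgy-style initialization are illustrative assumptions, and empty clusters are not handled.

```python
import numpy as np

def kmeans(X, k, max_iter=100, rss_tol=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # Forgy-style random start
    labels = np.full(len(X), -1)
    for _ in range(max_iter):                            # condition 1: iteration cap
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        new_centroids = np.array([X[new_labels == j].mean(axis=0) for j in range(k)])
        rss = ((X - new_centroids[new_labels]) ** 2).sum()        # within-cluster SSE
        assignments_stable = np.array_equal(new_labels, labels)   # condition 2
        centroids_stable = np.allclose(new_centroids, centroids)  # condition 3
        labels, centroids = new_labels, new_centroids
        if assignments_stable or centroids_stable or rss < rss_tol:  # condition 4
            break
    return labels, centroids

# Example usage on arbitrary toy data:
# labels, centroids = kmeans(np.random.default_rng(1).random((200, 2)), k=3)
```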

Q9. Which of the following clustering algorithms suffers from the problem of convergence at local optima?

  1. K-Means clustering algorithm
  2. Agglomerative clustering algorithm
  3. Expectation-Maximization clustering algorithm
  4. Diverse clustering algorithm

Options:

A. 1 only

B. 2 and 3

C. 2 and 4

D. 1 and 3

E. 1, 2 and 4

F. All of the above

Solution: (D)

Out of the options given, only the K-Means clustering algorithm and the EM clustering algorithm have the drawback of converging at local minima.

Q10. Which of the following algorithms is most sensitive to outliers?

A. K-means clustering algorithm

B. K-medians clustering algorithm

C. K-modes clustering algorithm

D. K-medoids clustering algorithm

Solution: (A)

Out of all the options, the K-Means clustering algorithm is most sensitive to outliers as it uses the mean of the cluster data points to find the cluster center.
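
A quick illustration (made-up numbers) of why a mean-based center is dragged by an outlier while a median-based one barely moves:

```python
import numpy as np

points = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # one extreme outlier
print(points.mean())      # 22.0 -> a K-Means style (mean) center is pulled toward the outlier
print(np.median(points))  # 3.0  -> a K-Medians style (median) center is barely affected
```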

Q11. After performing K-Means clustering analysis on a dataset, you observed the following dendrogram. Which of the following conclusions can be drawn from the dendrogram?

A. There were 28 data points in the clustering analysis

B. The best no. of clusters for the analyzed data points is 4

C. The proximity function used is Average-link clustering

D. The above dendrogram interpretation is not possible for K-Means clustering analysis

Solution: (D)

A dendrogram is not possible for K-Means clustering analysis. However, one can create a clustergram based on K-Means clustering analysis.

Q12. How can Clustering (Unsupervised Learning) be used to improve the accuracy of a Linear Regression model (Supervised Learning):

  1. Creating different models for different cluster groups.
  2. Creating an input feature for cluster IDs as an ordinal variable.
  3. Creating an input feature for cluster centroids as a continuous variable.
  4. Creating an input feature for cluster size as a continuous variable.

Options:

A. 1 only

B. 1 and 2

C. 1 and 4

D. 3 only

E. 2 and 4

F. All of the above

Solution: (F)

Creating an input feature for cluster IDs as an ordinal variable or creating an input feature for cluster centroids as a continuous variable might not convey any relevant information to the regression model for multidimensional data. But for clustering in a single dimension, all of the given methods are expected to convey meaningful information to the regression model. For example, to cluster people into two groups based on their hair length, storing the cluster ID as an ordinal variable and the cluster centroids as continuous variables will convey meaningful information.
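
A hedged sketch (scikit-learn, synthetic single-feature data along the lines of the hair-length example) of feeding cluster IDs and cluster centroids into a regression model; the feature names and parameter values are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
hair_length = rng.uniform(0, 40, size=(200, 1))           # single clustering feature
y = 3.0 * hair_length.ravel() + rng.normal(0, 5, 200)     # toy regression target

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(hair_length)
cluster_id = km.labels_.reshape(-1, 1)                     # option 2: ordinal cluster-ID feature
cluster_centroid = km.cluster_centers_[km.labels_]         # option 3: continuous centroid feature

X_aug = np.hstack([hair_length, cluster_id, cluster_centroid])
model = LinearRegression().fit(X_aug, y)
print(round(model.score(X_aug, y), 3))                     # R^2 of the augmented model
```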

Q13. What could be the possible reason(s) for producing two different dendrograms using the agglomerative clustering algorithm on the same dataset?

A. Proximity function used

B. No. of data points used

C. No. of variables used

D. B and C only

E. All of the above

Solution: (E)

A change in any of the proximity function, the no. of data points, or the no. of variables will lead to different clustering results and hence different dendrograms.

Q14. In the figure below, if you draw a horizontal line on the y-axis at y=2, what will be the number of clusters formed?

A. 1

B. 2

C. 3

D. 4

Solution: (B)

Since the number of vertical lines intersecting the red horizontal line at y=2 in the dendrogram is 2, 2 clusters will be formed.

Q15. What is the most appropriate no. of clusters for the data points represented by the following dendrogram:

A. 2

B. 4

C. 6

D. 8

Solution: (B)

The decision on the no. of clusters that can best depict different groups can be made by observing the dendrogram. The best choice of the no. of clusters is the no. of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a cluster.

In the above case, the best choice of no. of clusters will be 4, as the red horizontal line in the dendrogram below covers the maximum vertical distance AB.
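
A small SciPy sketch (synthetic two-blob data, not the article's dataset) of cutting a dendrogram at a chosen height, which is exactly how Q14 and Q15 count clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(2)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])  # two toy blobs

Z = linkage(X, method='single')                    # agglomerative merge history
labels = fcluster(Z, t=1.0, criterion='distance')  # cut the dendrogram at height 1.0
print(len(np.unique(labels)))                      # clusters below the cut (likely 2 here)
```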

Q16. In which of the following cases will K-Means clustering fail to give good results?

  1. Data points with outliers
  2. Data points with different densities
  3. Data points with round shapes
  4. Data points with non-convex shapes

Options:

A. 1 and 2

B. 2 and 3

C. 2 and 4

D. 1, 2 and 4

E. 1, 2, 3 and 4

Solution: (D)

The K-Means clustering algorithm fails to give good results when the data contains outliers, when the density spread of data points across the data space is different, and when the data points follow non-convex shapes.

Q17. Which of the following metrics do we have for finding dissimilarity between two clusters in hierarchical clustering?

  1. Single-link
  2. Complete-link
  3. Average-link

Options:

A. 1 and 2

B. 1 and 3

C. 2 and 3

D. 1, 2 and 3

Solution: (D)

All three methods, i.e. single link, complete link and average link, can be used for finding dissimilarity between two clusters in hierarchical clustering.
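
A short SciPy sketch (random six-point data, purely illustrative) showing that the three linkage criteria produce different merge heights for the same points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.RandomState(3)
X = rng.rand(6, 2)      # six hypothetical points
D = pdist(X)            # condensed pairwise distance matrix

Z_single   = linkage(D, method='single')    # MIN: closest pair across clusters
Z_complete = linkage(D, method='complete')  # MAX: farthest pair across clusters
Z_average  = linkage(D, method='average')   # group average: mean of all cross-cluster pairs
print(Z_single[:, 2], Z_complete[:, 2], Z_average[:, 2])  # merge heights differ per criterion
```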

Q18. Which of the following are true?

  1. Clustering analysis is negatively affected by multicollinearity of features
  2. Clustering analysis is negatively affected by heteroscedasticity

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. None of them

Solution: (A)

Clustering analysis is not negatively affected by heteroscedasticity, but the results are negatively impacted by multicollinearity of the features/variables used in clustering, as the correlated feature/variable will carry extra weight in the distance calculation than desired.

Q19. Given six points with the following attributes:

Which of the following clustering representations and dendrograms depicts the use of the MIN or single-link proximity function in hierarchical clustering:

A.

B.

C.

D.

Solution: (A)

For the single link or MIN version of hierarchical clustering, the proximity of two clusters is defined to be the minimum of the distance between any two points in the different clusters. For instance, from the table, we see that the distance between points 3 and 6 is 0.11, and that is the height at which they are joined into one cluster in the dendrogram. As another example, the distance between clusters {3, 6} and {2, 5} is given by dist({3, 6}, {2, 5}) = min(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = min(0.1483, 0.2540, 0.2843, 0.3921) = 0.1483.

Q20. Given six points with the following attributes:

Which of the following clustering representations and dendrograms depicts the use of the MAX or complete-link proximity function in hierarchical clustering:

A.

B.

C.

D.

Solution: (B)

For the complete link or MAX version of hierarchical clustering, the proximity of two clusters is defined to be the maximum of the distance between any two points in the different clusters. Similarly, here points 3 and 6 are merged first. However, {3, 6} is merged with {4} instead of {2, 5}. This is because dist({3, 6}, {4}) = max(dist(3, 4), dist(6, 4)) = max(0.1513, 0.2216) = 0.2216, which is smaller than dist({3, 6}, {2, 5}) = max(dist(3, 2), dist(6, 2), dist(3, 5), dist(6, 5)) = max(0.1483, 0.2540, 0.2843, 0.3921) = 0.3921 and dist({3, 6}, {1}) = max(dist(3, 1), dist(6, 1)) = max(0.2218, 0.2347) = 0.2347.

Q21. Given six points with the following attributes:

Which of the following clustering representations and dendrograms depicts the use of the group average proximity function in hierarchical clustering:

A.

B.

C.

D.

Solution: (C)

For the group average version of hierarchical clustering, the proximity of two clusters is defined to be the average of the pairwise proximities between all pairs of points in the different clusters. This is an intermediate approach between MIN and MAX. This is expressed by the following equation:
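
In standard notation, the group average proximity between clusters $C_i$ and $C_j$ containing $m_i$ and $m_j$ points is

$$\mathrm{proximity}(C_i, C_j) = \frac{\sum_{x \in C_i}\sum_{y \in C_j} \mathrm{proximity}(x, y)}{m_i \times m_j}$$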

Here are the distances between some of the clusters. dist({3, 6, 4}, {1}) = (0.2218 + 0.3688 + 0.2347)/(3 ∗ 1) = 0.2751. dist({2, 5}, {1}) = (0.2357 + 0.3421)/(2 ∗ 1) = 0.2889. dist({3, 6, 4}, {2, 5}) = (0.1483 + 0.2843 + 0.2540 + 0.3921 + 0.2042 + 0.2932)/(3 ∗ 2) = 0.2637. Because dist({3, 6, 4}, {2, 5}) is smaller than dist({3, 6, 4}, {1}) and dist({2, 5}, {1}), these two clusters are merged at the fourth stage.

Q22. Given six points with the following attributes:

Which of the following clustering representations and dendrograms depicts the use of Ward's method proximity function in hierarchical clustering:

A.

B.

C.

D.

Solution: (D)

Ward's method is a centroid method. The centroid method calculates the proximity between two clusters by calculating the distance between the centroids of the clusters. For Ward's method, the proximity between two clusters is defined as the increase in the squared error that results when two clusters are merged. The figure shows the results of applying Ward's method to the sample data set of six points. The resulting clustering is somewhat different from those produced by MIN, MAX, and group average.

Q23. What should be the best choice of no. of clusters based on the following results:

A. 1

B. 2

C. 3

D. 4

Solution: (C)

The silhouette coefficient is a measure of how similar an object is to its own cluster compared to other clusters. The number of clusters for which the silhouette coefficient is highest represents the best choice of the number of clusters.
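
A sketch (scikit-learn, synthetic blobs) of choosing the cluster count with the highest average silhouette coefficient; the data and the range of k are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=4)   # toy data with 4 blobs
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=4).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # mean silhouette over all points
print(max(scores, key=scores.get))            # k with the highest coefficient (likely 4 here)
```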

Q24. Which of the following is/are valid iterative strategies for treating missing values before clustering analysis?

A. Imputation with mean

B. Nearest Neighbor assignment

C. Imputation with Expectation Maximization algorithm

D. All of the above

Solution: (C)

All of the mentioned techniques are valid for treating missing values before clustering analysis, but only imputation with the EM algorithm is iterative in its functioning.

Q25. The K-Means algorithm has some limitations. One of its limitations is that it makes hard assignments of points to clusters (a point either completely belongs to a cluster or does not belong at all).

Note: Soft assignment can be considered as the probability of being assigned to each cluster: say K = 3 and for some point xn, p1 = 0.7, p2 = 0.2, p3 = 0.1.

Which of the following algorithm(s) allow soft assignments?

  1. Gaussian mixture models
  2. Fuzzy K-means

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. None of these

Solution: (C)

Both Gaussian mixture models and Fuzzy K-means allow soft assignments.
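
A sketch (scikit-learn, made-up two-blob data) of the soft assignments produced by a Gaussian mixture model; fuzzy K-means is not in scikit-learn, so only the GMM half is shown.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(5)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, random_state=5).fit(X)
probs = gmm.predict_proba(X[:3])   # per-cluster membership probabilities (soft assignment)
print(probs)                       # each row sums to 1, e.g. roughly [0.97, 0.03]
```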

Q26. Assume you want to cluster 7 observations into 3 clusters using the K-Means clustering algorithm. After the first iteration, clusters C1, C2, C3 have the following observations:

C1: {(2,2), (4,4), (6,6)}

C2: {(0,4), (4,0)}

C3: {(5,5), (9,9)}

What will be the cluster centroids if you want to proceed to the second iteration?

A. C1: (4,4), C2: (2,2), C3: (7,7)

B. C1: (6,6), C2: (4,4), C3: (9,9)

C. C1: (2,2), C2: (0,0), C3: (5,5)

D. None of these

Solution: (A)

Finding the centroid for data points in cluster C1 = ((2+4+6)/3, (2+4+6)/3) = (4, 4)

Finding the centroid for data points in cluster C2 = ((0+4)/2, (4+0)/2) = (2, 2)

Finding the centroid for data points in cluster C3 = ((5+9)/2, (5+9)/2) = (7, 7)

Hence, C1: (4,4),  C2: (2,2), C3: (7,7)

Q27. Assume you want to cluster 7 observations into 3 clusters using the K-Means clustering algorithm. After the first iteration, clusters C1, C2, C3 have the following observations:

C1: {(2,2), (4,4), (6,6)}

C2: {(0,4), (4,0)}

C3: {(5,5), (9,9)}

What will be the Manhattan distance for observation (9, 9) from cluster centroid C1 in the second iteration?

A. 10

B. 5*sqrt(2)

C. 13*sqrt(2)

D. None of these

Solution: (A)

Manhattan distance between centroid C1, i.e. (4, 4), and (9, 9) = (9-4) + (9-4) = 10
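
A quick NumPy check of the centroid update from Q26 and the Manhattan distance from Q27:

```python
import numpy as np

C1 = np.array([[2, 2], [4, 4], [6, 6]])
C2 = np.array([[0, 4], [4, 0]])
C3 = np.array([[5, 5], [9, 9]])

centroids = [c.mean(axis=0) for c in (C1, C2, C3)]
print(centroids)                            # -> (4, 4), (2, 2), (7, 7)

point = np.array([9, 9])
print(np.abs(point - centroids[0]).sum())   # Manhattan distance from C1's centroid: 10.0
```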

Q28. If two variables, V1 and V2, are used for clustering, which of the following are true for K-Means clustering with k = 3?

  1. If V1 and V2 have a correlation of 1, the cluster centroids will be in a straight line
  2. If V1 and V2 have a correlation of 0, the cluster centroids will be in a straight line

Options:

A. 1 only

B. 2 only

C. 1 and 2

D. None of the above

Solution: (A)

If the correlation between the variables V1 and V2 is 1, then all the data points will lie on a straight line. Hence, all three cluster centroids will form a straight line as well.

Q29. Feature scaling is an important step before applying the K-Means algorithm. What is the reason behind this?

A. In distance calculation it will give the same weight to all features

B. You always get the same clusters, whether or not you use feature scaling

C. In Manhattan distance it is an important step but in Euclidean it is not

D. None of these

Solution: (A)

Feature scaling ensures that all the features get the same weight in the clustering analysis. Consider a scenario of clustering people based on their weight (in kg) with range 55-110 and height (in feet) with range 5.6 to 6.4. In this case, the clusters produced without scaling can be very misleading as the range of weight is much higher than that of height. Therefore, it is necessary to bring them to the same scale so that they have equal weightage on the clustering result.
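
A sketch (scikit-learn, made-up weight/height ranges matching the example above) of standardizing features before K-Means so that both carry equal weight in the distance:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(6)
weight_kg = rng.uniform(55, 110, 100)    # large numeric range
height_ft = rng.uniform(5.6, 6.4, 100)   # small numeric range
X = np.column_stack([weight_kg, height_ft])

# Without scaling, distances are dominated by weight; scaling gives both features equal say.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=6).fit_predict(X_scaled)
print(labels[:10])
```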

Q30. Which of the following methods is used for finding the optimal number of clusters in the K-Means algorithm?

A. Elbow method

B. Manhattan method

C. Euclidean method

D. All of the above

E. None of these

Solution: (A)

Out of the given options, only the elbow method is used for finding the optimal number of clusters. The elbow method looks at the percentage of variance explained as a function of the number of clusters: one should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data.
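
A sketch (scikit-learn, synthetic blobs) of the elbow method: compute the within-cluster SSE (inertia) for a range of k and look for the bend where adding clusters stops paying off.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=6, random_state=7)   # toy data
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=7).fit(X).inertia_
            for k in range(1, 11)}
for k, sse in inertias.items():
    print(k, round(sse, 1))   # SSE drops sharply until the "elbow", then flattens out
```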

Q31. What is true about K-Means clustering?

  1. K-means is extremely sensitive to cluster center initializations
  2. Bad initialization can lead to Poor convergence speed
  3. Bad initialization can lead to bad overall clustering

Options:

A. 1 and 3

B. 1 and 2

C. 2 and 3

D. 1, 2 and 3

Solution: (D)

All three of the given statements are true. K-Means is extremely sensitive to cluster center initialization. Also, bad initialization can lead to poor convergence speed as well as bad overall clustering.

Q32. Which of the following can be applied to get skilful results for K-ways algorithm corresponding to global minima?

  1. Try to run the algorithm for different centroid initializations
  2. Adjust the number of iterations
  3. Find out the optimal number of clusters

Options:

A. 2 and 3

B. 1 and 3

C. 1 and 2

D. All of the above

Solution: (D)

All of these are standard practices that are used in order to obtain good clustering results.

Q33. What should be the best choice for the number of clusters based on the following results:

A. 5

B. 6

C. 14

D. Greater than 14

Solution: (B)

Based on the above results, the best choice of the number of clusters using the elbow method is 6.

Q34. What should be the best choice for the number of clusters based on the following results:

A. 2

B. 4

C. 6

D. 8

Solution: (C)

Generally, a higher average silhouette coefficient indicates better clustering quality. In this plot, the optimal clustering number of grid cells in the study area should be 2, at which the value of the average silhouette coefficient is highest. However, the SSE of this clustering solution (k = 2) is too large. At k = 6, the SSE is much lower. In addition, the value of the average silhouette coefficient at k = 6 is also very high, only lower than that at k = 2. Thus, the best choice is k = 6.

Q35. Which of the following sequences is correct for a K-Means algorithm using Forgy method of initialization?

  1. Specify the number of clusters
  2. Assign cluster centroids randomly
  3. Assign each data point to the nearest cluster centroid
  4. Re-assign each point to the nearest cluster centroid
  5. Re-compute cluster centroids

Options:

A. 1, 2, 3, 5, 4

B. 1, 3, 2, 4, 5

C. 2, 1, 3, 4, 5

D. None of these

Solution: (A)

The methods used for initialization in K-Means are Forgy and Random Partition. The Forgy method randomly chooses k observations from the data set and uses these as the initial means. The Random Partition method first randomly assigns a cluster to each observation and then proceeds to the update step, thus computing the initial mean to be the centroid of the cluster's randomly assigned points.
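
A NumPy sketch (arbitrary data, k = 3 assumed) contrasting the two initializations described above:

```python
import numpy as np

rng = np.random.default_rng(8)
X, k = rng.random((100, 2)), 3

# Forgy: pick k actual observations and use them as the initial means.
forgy_means = X[rng.choice(len(X), k, replace=False)]

# Random Partition: randomly assign every point to a cluster, then average each cluster.
random_labels = rng.integers(0, k, len(X))
partition_means = np.array([X[random_labels == j].mean(axis=0) for j in range(k)])

print(forgy_means, partition_means, sep="\n")
```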

Q36. If you are using multinomial mixture models with the expectation-maximization algorithm for clustering a set of data points into two clusters, which of the assumptions are important:

A. All the data points follow two Gaussian distributions

B. All the data points follow n Gaussian distributions (n > 2)

C. All the data points follow two multinomial distributions

D. All the data points follow n multinomial distributions (n > 2)

Solution: (C)

In the EM algorithm for clustering, it is essential to choose the same no. of clusters to classify the data points into as the no. of different distributions they are expected to be generated from, and the distributions must also be of the same type.

Q37. Which of the following is/are not true about the centroid-based K-Means clustering algorithm and the distribution-based expectation-maximization clustering algorithm:

  1. Both start with random initializations
  2. Both are iterative algorithms
  3. Both have strong assumptions that the data points must fulfill
  4. Both are sensitive to outliers
  5. The expectation-maximization algorithm is a special case of K-Means
  6. Both require prior knowledge of the no. of desired clusters
  7. The results produced by both are non-reproducible.

Options:

A. 1 only

B. 5 only

C. 1 and 3

D. 6 and 7

E. 4, 6 and 7

F. None of the above

Solution: (B)

All of the above statements are true except the 5th, as K-Means is instead a special case of the EM algorithm in which only the centroids of the cluster distributions are calculated at each iteration.

Q38. Which of the following is/are not true about the DBSCAN clustering algorithm:

  1. For data points to be in a cluster, they must be within a distance threshold of a core point
  2. It has strong assumptions for the distribution of data points in the dataspace
  3. It has a substantially high time complexity of order O(n³)
  4. It does not require prior knowledge of the no. of desired clusters
  5. It is robust to outliers

Options:

A. 1 only

B. 2 only

C. 4 only

D. 2 and 3

E. 1 and 5

F. 1, 3 and 5

Solution: (D)

  • DBSCAN can form a cluster of any arbitrary shape and does not have strong assumptions for the distribution of data points in the dataspace.
  • DBSCAN has a low time complexity of order O(n log n) only.
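
A sketch (scikit-learn, two synthetic rings) of DBSCAN handling a non-convex case where K-Means struggles; the eps and min_samples values are illustrative.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_circles

X, _ = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=9)

db = DBSCAN(eps=0.15, min_samples=5).fit(X)   # no need to pre-specify the cluster count
labels = db.labels_                           # -1 marks noise points (outliers)
print(sorted(set(labels)))                    # e.g. [0, 1] or [-1, 0, 1]
```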

Q39. Which of the following are the lower and upper bounds of the F-score?

A. [0, 1]

B. (0, 1)

C. [-1, 1]

D. None of the above

Solution: (A)

The lowest and highest possible values of the F score are 0 and 1, with 1 representing that every data point is assigned to the correct cluster and 0 representing that the precision and/or recall of the clustering analysis are 0. In clustering analysis, a high value of the F score is desired.

Q40. Following are the results observed for clustering 6000 data points into 3 clusters: A, B and C:

What is the F1-Score with respect to cluster B?

A. 3

B. 4

C. 5

D. 6

Solution: (D)

Here,

True Positive, TP = 1200

True Negative, TN = 600 + 1600 = 2200

False Positive, FP = 1000 + 200 = 1200

False Negative, FN = 400 + 400 = 800

Therefore,

Precision = TP / (TP + FP) = 0.5

Recall = TP / (TP + FN) = 0.6

Hence,

F1 = 2 * (Precision * Recall) / (Precision + Recall) = 0.54 ~ 0.5
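
A quick check of the arithmetic above in Python:

```python
tp, fp, fn = 1200, 1200, 800
precision = tp / (tp + fp)   # 0.5
recall = tp / (tp + fn)      # 0.6
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))          # about 0.55, i.e. roughly 0.5
```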

End Notes

I hope you enjoyed taking the test and found the solutions helpful. The test focused on conceptual as well as practical knowledge of clustering fundamentals and its various techniques.

I tried to clear all your doubts through this article, but if we have missed out on something then let us know in the comments below. Also, if you have any suggestions or improvements you think we should make in the next skill test, you can let us know by dropping your feedback in the comments section.

Learn, compete, hack and get hired!


Source: https://www.analyticsvidhya.com/blog/2017/02/test-data-scientist-clustering/
