Cluster



Implement a centroid-based document summarization method.
Data set: t1.txt contains 200 sentences, one sentence per line. Please generate a short summary of 10 sentences.
Method: implement any centroid-based summarization method to summarize the given data set.
An example procedure could be:
1. Cluster the sentences into 10 clusters.
2. Within a cluster, find the sentence that has the highest similarity scores with all the other sentences in the same cluster, and include this centroid sentence in the summary.
3. Repeat step 2 for all 10 clusters to generate the 10-sentence short summary.
You may also use any similarity measure you think is most appropriate. Please indicate the reason why you chose these methods.
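
For the similarity measure, cosine similarity is one common choice for text, since it compares the direction of two word vectors and so is not thrown off by sentence length. Here is a rough sketch of it, assuming raw word counts and whitespace tokenization just to keep it dependency-free (TF-IDF weights would plug in the same way); the name `cosine_sim` is made up for illustration.

```python
import math
from collections import Counter


def cosine_sim(sentence_a, sentence_b):
    """Cosine similarity between two sentences using raw word counts."""
    counts_a = Counter(sentence_a.lower().split())
    counts_b = Counter(sentence_b.lower().split())
    # Dot product only needs the words the two sentences share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

The score is 0 for sentences with no words in common and 1 for sentences with identical word counts.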

These are the directions I have for my problem. So far I have the k-means algorithm:
from sklearn.cluster import KMeans
km_model = KMeans(n_clusters=10)  # 10 clusters
km_model.fit(tfidf_matrix)  # tfidf_matrix: TF-IDF matrix with one row per sentence

print(km_model.labels_)  # array holding the cluster index assigned to each sentence
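
From the cluster labels, steps 2 and 3 could go roughly like the sketch below. It is only one possible approach, under these assumptions: the TF-IDF matrix comes from `TfidfVectorizer` with default settings, similarity is cosine similarity, and the "centroid sentence" is the one with the largest total similarity to the other sentences in its cluster. The function name `summarize` and the `random_state` value are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def summarize(sentences, n_clusters=10, random_state=0):
    """Pick one representative sentence per k-means cluster."""
    # Assumption: default TfidfVectorizer settings, one row per sentence.
    tfidf_matrix = TfidfVectorizer().fit_transform(sentences)
    km_model = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=random_state)
    labels = km_model.fit_predict(tfidf_matrix)

    picked = []
    for cluster_id in range(n_clusters):
        members = np.where(labels == cluster_id)[0]
        if len(members) == 0:
            continue  # skip a cluster that ended up empty
        # Pairwise cosine similarity among the sentences of this cluster.
        sims = cosine_similarity(tfidf_matrix[members])
        # Total similarity to the *other* members: drop the self-similarity.
        scores = sims.sum(axis=1) - sims.diagonal()
        picked.append(members[np.argmax(scores)])

    # Return the chosen sentences in their original document order.
    return [sentences[i] for i in sorted(picked)]
```

With the lines of t1.txt read into a list, `summarize(sentences)` would give up to 10 sentences, one per cluster. Keeping the original order is a design choice that makes the summary easier to read; ordering by cluster size would be another option.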