Implement a centroid-based document summarization method.
Data set: t1.txt contains 200 sentences, one sentence per line. Please generate a short summary of 10 sentences.
Method: please implement any centroid-based summarization method to summarize the given data set.
An example procedure could be:
- Cluster these sentences into 10 clusters
- Find the sentence that has the highest similarity to all the other sentences in the same cluster, and include this centroid sentence in the summary.
- Repeat step 2 for all 10 clusters to generate the 10-sentence short summary.
You may also use any similarity measure you think is most appropriate. Please explain why you chose these methods.
Those are the directions I have for my problem. So far I have the k-means clustering step:
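from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
with open("t1.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]
tfidf_matrix = TfidfVectorizer().fit_transform(sentences)  # assumed construction: one TF-IDF row per sentence
km_model = KMeans(n_clusters=10)  # 10 clusters, one per summary sentence
km_model.fit(tfidf_matrix)  # the input is the TF-IDF matrix of the sentences
print(km_model.labels_)  # array holding the cluster label of each sentence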
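The remaining steps of the example procedure (pick the centroid sentence of each cluster and collect the summary) could be sketched roughly as below. This is only a sketch under assumptions, not a definitive solution: the function name `summarize` and its parameters are hypothetical, the TF-IDF matrix is assumed to come from `TfidfVectorizer` with default settings, and cosine similarity is chosen as the similarity measure because it ignores sentence length and is a common fit for TF-IDF vectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def summarize(sentences, n_clusters=10, random_state=0):
    """Return one representative sentence per cluster, in original order.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # One TF-IDF row per sentence (assumed default vectorizer settings).
    tfidf_matrix = TfidfVectorizer().fit_transform(sentences)

    # Step 1: cluster the sentences.
    km_model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km_model.fit_predict(tfidf_matrix)

    chosen = []
    for cluster_id in range(n_clusters):
        members = np.where(labels == cluster_id)[0]
        if len(members) == 0:
            continue  # defensive guard in case a cluster ends up empty
        # Step 2: pairwise cosine similarity inside the cluster.
        sims = cosine_similarity(tfidf_matrix[members])
        # Total similarity of each sentence to the cluster members
        # (the sum includes the sentence's similarity to itself).
        totals = sims.sum(axis=1)
        chosen.append(int(members[int(np.argmax(totals))]))

    # Step 3: keep the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

With the data set from the assignment, this would be called on the lines of t1.txt, for example `summarize(open("t1.txt").read().splitlines())`. Fixing `random_state` only makes the clustering repeatable; a different seed may select different sentences.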