Hey guys, I am currently building a scraper/content checker for one of my clients. The website is full of complex casino reviews, and the task is to go through every review and check its description. We need to be sure the content is unique everywhere, so we are looking for duplicate descriptions. I have already written a function that extracts the data using the BeautifulSoup library, but the content comparison is where I am stuck. We want to flag any pair of descriptions that match by 90% or more. Do you have any ideas on how to do this? Thanks!
You could use a diff algorithm. There are plenty of articles online on how to implement one, and there are probably libraries for it too. Based on the results of the diff, you can work out the match percentage however you want.
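For example, Python's standard library ships a diff-style matcher in `difflib`. Here is a minimal sketch of how the 90% check might look, assuming the descriptions have already been extracted into a dict of name to text. The function names `similarity` and `find_near_duplicates`, the dict layout, and the whitespace/case normalization are my own assumptions for illustration, not something from your existing code.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity score for two descriptions."""
    # Assumption: case and whitespace differences should not count as changes.
    a_norm = " ".join(a.lower().split())
    b_norm = " ".join(b.lower().split())
    # autojunk=False turns off the "popular element" heuristic, which can
    # distort scores on longer texts.
    return SequenceMatcher(None, a_norm, b_norm, autojunk=False).ratio()


def find_near_duplicates(
    descriptions: Dict[str, str], threshold: float = 0.9
) -> List[Tuple[str, str, float]]:
    """Compare every pair of descriptions and keep those at or above threshold."""
    matches = []
    for (name_a, text_a), (name_b, text_b) in combinations(descriptions.items(), 2):
        score = similarity(text_a, text_b)
        if score >= threshold:
            matches.append((name_a, name_b, score))
    return matches


if __name__ == "__main__":
    # Hypothetical sample data standing in for the scraped reviews.
    sample = {
        "casino-a": "Great casino with fast payouts and many slots.",
        "casino-b": "Great casino with fast payouts and many slot games.",
        "casino-c": "A completely different review about poker tables.",
    }
    for name_a, name_b, score in find_near_duplicates(sample):
        print(f"{name_a} vs {name_b}: {score:.0%}")
```

Keep in mind this compares every pair, so the work grows quadratically with the number of reviews. That is likely fine for a few hundred descriptions, but for tens of thousands you would probably want a cheaper pre-filter before running the full comparison.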