Trying to create a search engine based article collector

I'm trying to create an ‘article gatherer’: a search-engine-based article collector that takes the links a search engine returns for a keyword, collects the documents behind them, and puts them into a single document.

My friend asks if there is a way to gather articles based on a keyword search and then put all of them into one data set.
Is there a way to do such a thing without an API call?
I guess it would imitate a browser's work and collect data from the pages that the search result links lead to.

The objective here is to save time by storing every article in one local data file, rather than clicking each search result individually and copy-pasting it into a document.

Goals:

  • keep track of word counts (or be able to run the articles through a function that gives the count for a specific word)
  • create a word cloud
  • possibly(?) keep track of dates (thinking about this raises a lot of different issues)
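
For the word-count goal, here is a minimal sketch. It assumes the article text has already been collected as a plain string; the function names `word_counts` and `count_word` are made up for illustration. The same counts could also feed a word cloud.

```python
import re
from collections import Counter

# A word is a run of letters, optionally joined by straight or curly apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

def word_counts(text):
    """Return a Counter mapping each lowercased word to how often it occurs."""
    return Counter(WORD_RE.findall(text.lower()))

def count_word(text, word):
    """Return how many times a specific word occurs in the text."""
    return word_counts(text)[word.lower()]
```

Something like `word_counts(article_text).most_common(20)` would then list the twenty most frequent words.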

Something like this?

Word Cloud Generator


Searched for the word, “implementation”…

About 1,850,000,000 results

One supposes we could query Google for word counts. Their index is a lot larger and easier to search than the web in general. And, the relevancy is no longer in question.

For all we know they may already have an API for that very purpose.


For the fun of it, here are some excerpts from Alice in Wonderland and Through the Looking-Glass…

The Crocodile

How doth the little crocodile
Improve his shining tail,
And pour the waters of the Nile
On every golden scale!

How cheerfully he seems to grin,
How neatly spreads his claws,
And welcomes little fishes in,
With gently smiling jaws!

You are Old, Father William

“You are old, Father William,” the young man said,
“And your hair has become very white;
And yet you incessantly stand on your head –
Do you think, at your age, it is right?”

“In my youth,” Father William replied to his son,
“I feared it might injure the brain;
But, now that I’m perfectly sure I have none,
Why, I do it again and again.”

Twinkle, Twinkle Little Bat

Twinkle, Twinkle Little Bat
How I wonder what you’re at!
Up above the world you fly,
Like a tea tray in the sky.
Twinkle, twinkle, little bat!
How I wonder what you’re at!

The Mock Turtle’s Song

“Will you walk a little faster?” said a whiting to a snail.
“There’s a porpoise close behind us, and he’s treading on my tail.
See how eagerly the lobsters and the turtles all advance!
They are waiting on the shingle—will you come and join the dance?
Will you, won’t you, will you, won’t you, will you join the dance?
Will you, won’t you, will you, won’t you, won’t you join the dance?

Jabberwocky

’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

“Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!”

The Walrus and the Carpenter

The sun was shining on the sea,
Shining with all his might:
He did his very best to make
The billows smooth and bright–
And this was odd, because it was
The middle of the night.

The moon was shining sulkily,
Because she thought the sun
Had got no business to be there
After the day was done–
“It’s very rude of him,” she said,
“To come and spoil the fun!”

Paste that into the word cloud generator above and see the plethora of strange words Lewis Carroll either made up or found a way to use.

When Alexander Pope, Voltaire, Jonathan Swift and perhaps James Joyce are all behind us, Lewis Carroll is a lot more of a discerning read. What’s to keep any word cloud from being subjective or farcical?


I did some research on the side and came to the conclusion that I pretty much have to design a web crawler.
Thank you for sharing the word cloud though, this will be pretty useful for me.


There is a course on Beautiful Soup that I haven’t explored yet; it is a Python library for parsing HTML, which is the core of a website scraper. If you have the resources to cache all the pages you crawl, you can speed up internal indexing and content extraction. That is not something you or I are likely to have, so I would look for an existing service to leverage. I’m still willing to bet that Google has some APIs to address what you want to do.
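
To give a rough idea of what content extraction looks like, here is a small sketch that pulls the visible text out of a page. Note that it swaps in the standard library's `html.parser` instead of Beautiful Soup so that it runs without installing anything. The names `TextExtractor` and `extract_text` are made up for illustration, and fetching the page itself is left out.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # greater than 0 while inside a skipped tag
        self.chunks = []       # pieces of visible text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A real article page would still need some way to separate the article body from menus and footers, which this sketch does not attempt.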

An index is the most useful tool, rather than cached content, in my view. Crawl, parse, index, delete cache. Then you can draw in articles based on a list of seed terms that can be found in the index, then requested and parsed for the article text.
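
As a sketch of the index idea above: the index maps each term to the set of URLs whose text contains it, so the page text itself can be thrown away after indexing. This assumes the pages have already been crawled and parsed into plain text; `build_index` and `lookup` are hypothetical names and the URLs are placeholders.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Return the set of lowercased alphabetic terms in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def build_index(pages):
    """Build an inverted index from a dict of {url: page_text}."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

def lookup(index, seed_terms):
    """Return every URL that contains at least one of the seed terms."""
    urls = set()
    for term in seed_terms:
        urls |= index.get(term.lower(), set())
    return urls

# Placeholder data standing in for crawled and parsed pages.
pages = {
    "https://example.com/a": "Beware the Jabberwock, my son!",
    "https://example.com/b": "The Walrus and the Carpenter",
}
index = build_index(pages)
matches = lookup(index, ["walrus", "jabberwock"])
```

The URLs in `matches` are the ones you would then request again and parse for the article text.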


Thank you; I’ll definitely look into it
