Neural Crawler
Jan. 3rd, 2017 12:57 pm
Business context
For years I have wanted to collect new jobs from all over the internet in order to send appealing job alert emails to candidates who created a profile on postjobfree.com.
So, finally, I decided to create a web crawler for that.
However, unlike Google, I do not want to crawl billions of pages (too expensive). Several million pages should be good enough for the first working prototype.
The question is: how do we automatically determine which pages to crawl and which to ignore?
That's why our web crawler is combined with a self-learning neural network.
Data structure
We represent every page as a record in the PageNeuron table (PageNeuronId int, Url varchar(500), …, PageRank real, ...).
We represent links from page to page in the LinkAxon table (..., FromPageNeuronId int, ToPageNeuronId int, …).
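Here is a minimal sketch of those two tables. Only the columns named above are included (the real tables have more, elided as "…"); the LinkAxonId key and the Weight column are my assumptions, the latter based on the "LinkAxon weights" used below.

```python
import sqlite3

# In-memory sketch of the two core tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PageNeuron (
    PageNeuronId INTEGER PRIMARY KEY,
    Url          VARCHAR(500),
    PageRank     REAL
);
CREATE TABLE LinkAxon (
    LinkAxonId       INTEGER PRIMARY KEY,  -- assumed key
    FromPageNeuronId INTEGER REFERENCES PageNeuron(PageNeuronId),
    ToPageNeuronId   INTEGER REFERENCES PageNeuron(PageNeuronId),
    Weight           REAL                  -- assumed; see "LinkAxon weights" below
);
""")
```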
PageRank calculations
Our PageRank is inspired by classic Google PageRank; however, we calculate it differently.
Instead of calculating the probability of a visitor click, our NeuralRewardDistribution process distributes PageRank from every PageNeuron record to every connected record (in both directions).
With every “reward distribution” iteration, the NeuralRewardDistribution process distributes about 10% of a page's PageRank to other pages (that amount is split between all destination PageNeuron records proportionally to their LinkAxon weights).
Then, in order to prevent self-excitation of the system, NeuralRewardDistribution applies "forgetting" by reducing the PageRank of the original page by 10%.
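A minimal sketch of one such iteration, assuming in-memory dicts in place of the real tables (the 10% rate and the proportional split come from the description above; the function and variable names are mine):

```python
def reward_distribution_step(page_rank, axon_weights, rate=0.10):
    """One NeuralRewardDistribution-style iteration (sketch).

    page_rank:    {page_id: PageRank}
    axon_weights: {page_id: {connected_page_id: LinkAxon weight}},
                  connections in both directions, per the post.
    """
    new_rank = dict(page_rank)
    for page, rank in page_rank.items():
        neighbors = axon_weights.get(page, {})
        total_weight = sum(neighbors.values())
        if total_weight > 0:
            # Distribute ~10% of this page's PageRank, split between the
            # connected pages proportionally to LinkAxon weights.
            for neighbor, weight in neighbors.items():
                new_rank[neighbor] = new_rank.get(neighbor, 0.0) + rate * rank * weight / total_weight
        # "Forgetting": reduce the original page's PageRank by 10%
        # to prevent self-excitation of the system.
        new_rank[page] -= rate * rank
    return new_rank

# Example: page 1 (PageRank 10.0) connected to page 2.
ranks = reward_distribution_step({1: 10.0, 2: 0.0}, {1: {2: 1.0}})
# -> {1: 9.0, 2: 1.0}: total PageRank is conserved by the forgetting step.
```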
Setting goals
When NeuralPageEvaluator parses crawled pages, it tries to detect the words and patterns we need.
Every time NeuralPageEvaluator finds something useful, it adds a reward in the form of extra PageRank for the responsible PageNeuron record. For example, we reward (see the sketch after this list):
- 1 PageRank point for such words as "job", "jobs", "career", "hr".
- 10 PageRank points for such words as "hrms", "taleo", "jobvite", "icims".
- 1000 PageRank points when the parser discovers a link to a new XML job feed in the content of a PageNeuron record.
- 20 PageRank points when the parser discovers a link to an XML job feed that we have already discovered in the past.
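A sketch of those reward rules. The word lists and point values come straight from the list above; the feed-detection regex and the function name are my illustrative assumptions:

```python
import re

COMMON_WORDS = {"job", "jobs", "career", "hr"}     # 1 point each
ATS_WORDS = {"hrms", "taleo", "jobvite", "icims"}  # 10 points each
# Illustrative pattern for spotting XML job feed URLs in page content.
FEED_URL_RE = re.compile(r"https?://\S+\.xml", re.IGNORECASE)

def evaluate_page(content, known_feeds):
    """Return the PageRank reward earned by one page's content (sketch)."""
    reward = 0.0
    for word in re.findall(r"[a-z]+", content.lower()):
        if word in COMMON_WORDS:
            reward += 1
        elif word in ATS_WORDS:
            reward += 10
    for feed_url in FEED_URL_RE.findall(content):
        # 1000 points for a brand-new feed, 20 for one already discovered.
        reward += 20 if feed_url in known_feeds else 1000
    return reward
```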
What to crawl
NeuralPageProcessor processes already-crawled pages (PageNeuron records) by passing them to NeuralPageEvaluator.
NeuralPageEvaluator returns a collection of outgoing links from the parsed page.
If an extracted outgoing link is new, then NeuralPageProcessor creates a new PageNeuron record for it. For the initial PageRank it uses 10% of the source PageNeuron record's PageRank, multiplied by the link share of that new URL among all the other URLs that the source PageNeuron record points to.
NeuralCrawler then crawls the new PageNeuron records with the highest PageRank.
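For example, a source page with PageRank 5.0 that points to 20 distinct URLs seeds each new one with 0.10 * 5.0 * 1/20 = 0.025. A sketch of both steps, assuming a plain list of outgoing URLs and a crawled flag (the names are mine):

```python
def initial_page_rank(source_rank, outgoing_urls, new_url):
    """Seed PageRank for a newly discovered URL (sketch): 10% of the
    source page's PageRank, weighted by the new URL's share among all
    URLs the source page points to."""
    return 0.10 * source_rank * outgoing_urls.count(new_url) / len(outgoing_urls)

def next_pages_to_crawl(pages, batch_size=100):
    """Pick the not-yet-crawled pages with the highest PageRank (sketch;
    each page is a dict like {"url": ..., "page_rank": ..., "crawled": ...})."""
    uncrawled = [p for p in pages if not p["crawled"]]
    return sorted(uncrawled, key=lambda p: p["page_rank"], reverse=True)[:batch_size]
```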
Cleanup
NeuralRewardDistribution deletes PageNeuron records (and all corresponding LinkAxon records) if their PageRank is too low.
The current "delete threshold" is PageRank = 0.01, which deletes about half of the ~3 million PageNeuron records we have already created.
Comments

Date: 2017-01-03 08:03 pm (UTC)
Are you doing a crawler too?

Date: 2017-01-03 08:12 pm (UTC)
No, I'm not doing a crawler. I was kind of moving in that direction while at HealthExpense, but it did not work out.