Post by account_disabled on Dec 4, 2023 3:19:22 GMT
Queue
Simply put, there are too many pages on the Internet to crawl them all.
Some pages need to be crawled frequently; others don't need to be crawled at all. Therefore, we use a queue that decides the order in which URLs are sent for crawling.
A common problem with this step is crawling too many similar, irrelevant URLs, which could lead to people seeing more spam and fewer unique referring domains.
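To make this concrete, here is a minimal sketch of a priority-ordered crawl queue using Python's heapq (a min-heap, so lower scores are crawled first). The CrawlTask structure and the scores are illustrative assumptions, not Semrush's actual implementation.

```python
# A minimal sketch of a priority-ordered crawl queue. heapq is a
# min-heap, so lower priority values are crawled first. CrawlTask
# and the example scores are assumptions for illustration.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class CrawlTask:
    priority: float
    url: str = field(compare=False)   # compared by priority only

queue: list[CrawlTask] = []

def enqueue(url: str, priority: float) -> None:
    heapq.heappush(queue, CrawlTask(priority, url))

def next_url() -> str:
    # Pop the most urgent URL; callers re-enqueue pages that
    # should be revisited later with a new priority.
    return heapq.heappop(queue).url

enqueue("https://example.com/new-page", priority=0.1)   # crawl soon
enqueue("https://example.com/old-page", priority=5.0)   # crawl later
assert next_url() == "https://example.com/new-page"
```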
What we've done
To optimize the queue, we've added filters that prioritize unique content and higher-authority websites and counteract link farms. As a result, the system now finds more unique content and generates fewer reports with duplicate links.
To protect our queue from link farms, we check whether a large number of domains resolve to the same IP address. If we see too many domains on the same IP, their priority in the queue is reduced, which lets us crawl domains from different IPs instead of getting stuck on a link farm.
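A hedged sketch of that IP check follows: if many distinct domains sit on one IP, every domain on that IP is demoted in the queue. The threshold and penalty values are assumptions, not Semrush's published numbers.

```python
# Demote domains that share an IP with many other domains,
# so the queue keeps rotating through domains hosted elsewhere.
# MAX_DOMAINS_PER_IP and PENALTY are illustrative assumptions.
from collections import defaultdict

MAX_DOMAINS_PER_IP = 50   # assumed threshold
PENALTY = 10.0            # assumed penalty (higher score = crawled later)

domains_by_ip: dict[str, set[str]] = defaultdict(set)

def register(domain: str, ip: str) -> None:
    domains_by_ip[ip].add(domain)

def adjusted_priority(base: float, ip: str) -> float:
    # Every domain on a crowded IP gets pushed back in the queue.
    if len(domains_by_ip[ip]) > MAX_DOMAINS_PER_IP:
        return base + PENALTY
    return base
```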
To protect websites and keep our reports free of near-duplicate links, we check whether too many URLs come from the same domain; if they do, they are spread across several days rather than all being crawled at once.
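One way to express that per-domain cap is a daily budget, as in the sketch below. The budget size is an assumption (the real limit isn't published); URLs over the budget are deferred, not dropped.

```python
# Defer URLs once a domain has used up its daily crawl budget.
# DAILY_BUDGET is an assumed value for illustration.
from collections import Counter
from urllib.parse import urlparse

DAILY_BUDGET = 1_000               # assumed cap per domain per day
crawled_today: Counter = Counter()

def may_crawl_today(url: str) -> bool:
    domain = urlparse(url).netloc
    if crawled_today[domain] >= DAILY_BUDGET:
        return False               # defer this URL to a later day
    crawled_today[domain] += 1
    return True
```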
To ensure we crawl new pages as soon as possible, any URL we haven't crawled before is given higher priority.
Each page has its own hash code, which helps us prioritize crawling unique content.
We take into account how often new links appear on the source page.
We take into account the authority score of both the page and its domain (these factors are combined in the sketch after this list).
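Putting those factors together, here is a hedged scoring sketch. The weights, the use of SHA-256 as the "hash code", and the 0-100 authority scale are all illustrative assumptions, not Semrush's formula.

```python
# Combine the queue factors above into one priority score
# (lower = crawled sooner). All weights are assumptions.
import hashlib

seen_hashes: set[str] = set()   # hashes of content already crawled

def content_hash(html: str) -> str:
    # Stand-in for the per-page "hash code"; SHA-256 is an assumption.
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def priority(url_is_new: bool, page_html: str | None,
             new_links_per_day: float, authority: float) -> float:
    score = 5.0                        # neutral baseline
    if url_is_new:
        score -= 3.0                   # never-crawled URLs jump the queue
    if page_html is not None and content_hash(page_html) in seen_hashes:
        score += 4.0                   # duplicate content waits
    score -= min(new_links_per_day, 10) * 0.2   # active link sources first
    score -= authority / 50.0          # assumed 0-100 authority scale
    return score
```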
How the Queue Has Improved
More than 10 different factors filter out unnecessary links.
More unique, high-quality pages, thanks to new quality control algorithms.
Crawlers
Our crawlers follow internal and external links across the Internet looking for new pages. Therefore, we can only find a page if at least one incoming link points to it.
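As a minimal, stdlib-only illustration of that link-following step: fetch a page, extract its <a href> targets, and feed the results back into the crawl queue. A production crawler would add politeness delays, retries, and robots.txt checks on top.

```python
# Discover new URLs by extracting <a href> links from a fetched page.
# Minimal sketch only; no politeness, retries, or robots.txt handling.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self, base: str):
        super().__init__()
        self.base = base
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base, value))

def discover(url: str) -> list[str]:
    html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
    parser = LinkExtractor(url)
    parser.feed(html)
    return parser.links   # candidates for the crawl queue
```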
By analyzing our previous system, we saw an opportunity to increase the overall crawl capacity and find better content - the content that website owners would like us to crawl and index.
What we've done
Tripled our number of crawlers (from 10 to 30)
Stopped crawling pages with URL parameters that don't affect page content (sessionid, utm_* tags, etc.); see the sketch after this list
Increased the reading frequency of robots.txt files on websites
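The sketch below illustrates two of these changes: stripping URL parameters that don't change page content, and checking a freshly re-read robots.txt. The ignore-list is an assumption; the robots.txt handling uses Python's standard urllib.robotparser.

```python
# Canonicalize URLs by dropping no-op parameters, and honor robots.txt.
# IGNORED_PARAMS and the utm_ prefix rule are illustrative assumptions.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
from urllib.robotparser import RobotFileParser

IGNORED_PARAMS = {"sessionid"}     # assumed list of content-neutral params

def canonicalize(url: str) -> str:
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k not in IGNORED_PARAMS and not k.startswith("utm_")]
    return urlunsplit(parts._replace(query=urlencode(kept)))

def allowed(url: str, agent: str = "ExampleBot") -> bool:
    parts = urlsplit(url)
    robots = RobotFileParser()
    robots.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()                  # re-read frequently to stay up to date
    return robots.can_fetch(agent, url)

assert canonicalize("https://example.com/p?id=7&utm_source=x&sessionid=9") \
       == "https://example.com/p?id=7"
```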
How Crawlers Have Improved
More crawlers (now 30!)
Clean data without junk or duplicate links
Improved search for the most relevant content
Crawling speed of 25 billion pages per day
Storage space
Storage is where we keep all the links you can see as a Semrush user. This storage shows you links in the tool and gives you filters you can apply to find what you're looking for.