I originally wrote this as an answer to a question on Quora.
It depends on the scale.
If you just want to experiment with web crawling and build a basic search index, it’s common to start with the Alexa top million websites, which can be downloaded as a zipped CSV from S3 at: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip
The top million changes daily, and includes a lot of spam and porn. It’s easy to game the system to get into the bottom 500k, so you’ll need to decide what to include and how to weight it.
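The file itself is trivial to work with: each row is just a rank and a bare domain. Here's a minimal sketch of parsing it, assuming the two-column `rank,domain` layout the Alexa dump uses (the `parse_top_sites` helper and the inline sample are mine, not part of the download):

```python
import csv
import io

def parse_top_sites(csv_text, limit=None):
    """Parse 'rank,domain' CSV text into a list of (rank, domain) tuples."""
    rows = []
    for rank, domain in csv.reader(io.StringIO(csv_text)):
        rows.append((int(rank), domain))
        if limit and len(rows) >= limit:
            break
    return rows

# Tiny inline sample; the real file has a million rows.
sample = "1,google.com\n2,youtube.com\n3,facebook.com\n"
print(parse_top_sites(sample, limit=2))
```

In practice you'd unzip the download with the `zipfile` module and stream rows instead of loading a million-line string, but the parsing logic is the same.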
The DMOZ website dump used to be a good starting point, but the project shut down on March 17th, 2017, so that’s no longer really an option. It left a lot out, but it contained about 4 million URLs with most of the low-quality sites filtered out. There may be a mirror of that data somewhere (heck, I’d like to download their last URL dump if it’s available anywhere).
Real search engines have agreements in place with the registries that operate each top-level domain, which let them get a zone file dump listing all registered domains. This involves jumping through some hoops and filling out some forms, and each registry needs to be dealt with separately. Getting access to .social is completely separate from getting access to .net.
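Once you have a zone file, extracting the registered domains mostly means pulling the names off the NS records and deduplicating. Exact formats vary by registry, so treat this as a sketch assuming the common `name TTL IN NS nameserver` layout (the `domains_from_zone` helper is my own illustration, not a real tool):

```python
def domains_from_zone(lines, tld):
    """Collect unique registered domains from NS records in a TLD zone file.

    Assumes lines shaped like 'example.com. 172800 in ns ns1.example.com.';
    real zone dumps differ in casing, TTL placement, and trailing dots.
    """
    seen = set()
    for line in lines:
        parts = line.split()
        # An NS record has the nameserver last and the record type just before it.
        if len(parts) >= 3 and parts[-2].upper() == "NS":
            name = parts[0].rstrip(".").lower()
            if name.endswith("." + tld):
                seen.add(name)
    return sorted(seen)
```

A domain typically appears once per nameserver, so the set is what collapses a multi-gigabyte zone file down to the actual domain list.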
Since it takes a LOT of work to get access to all TLD domain files, a commercial service like Domains Index is probably your best bet if you want to do anything on a large scale. I’ve bought from them before and it’s a good service. They don’t have absolutely everything, but 200 million domains is far better than the 1 million you get from Alexa.