I originally wrote this as an answer to a question on Quora.
“Useful” is a very subjective term. People who frequently ask deep, language- and algorithm-specific software engineering questions need a much deeper index than people who travel a lot and just want good airfare prices and the top 5 restaurants and hotels in each city.
If you have /really/ good algorithms, about 100 million pages is enough to build a search engine that is “good” for people who don’t require much depth. For people who do require depth, you could probably be pretty useful at about 1 billion pages.
This depends HEAVILY on what you choose to include and exclude. Are these pages all in a single language? Or is this just 100 pages each from the top 1 million sites regardless of language and content?
Even though the web is phenomenally huge, much of it is duplication and/or computer-generated spam. There are millions of sites that are just scrapes/dumps of other sites (especially Wikipedia), and indexing 1000 copies of Wikipedia with different CSS isn’t going to get you very far.
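To make the duplication point concrete, here is a minimal sketch of one common way to spot near-duplicate pages: compare sets of overlapping word windows ("shingles") using Jaccard similarity. This is an illustration, not a description of how any particular search engine does it. The function names, the 0.8 threshold, the 5-word window and the example URLs are all assumptions of mine, and the code assumes the visible text has already been extracted from the HTML, which is why a copy with different CSS looks identical to it.

```python
import re

def shingles(text, k=5):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short pages become a single shingle (or nothing at all).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of overlap / size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def dedupe(pages, threshold=0.8, k=5):
    """Keep the first page seen from each group of near-duplicates.

    pages is a list of (url, text) pairs. Every page is compared against
    every page kept so far, which is quadratic: fine for a sketch, far too
    slow for a real crawl.
    """
    kept = []  # (url, shingle set) for pages we decided to keep
    for url, text in pages:
        s = shingles(text, k)
        if all(jaccard(s, kept_s) < threshold for _, kept_s in kept):
            kept.append((url, s))
    return [url for url, _ in kept]

# Hypothetical pages: an original, a scraped copy with one extra word,
# and an unrelated page.
pages = [
    ("https://example.org/a",
     "The quick brown fox jumps over the lazy dog near the quiet river bank at dawn"),
    ("https://example.com/b",
     "The quick brown fox jumps over the lazy dog near the quiet river bank at dawn mirror"),
    ("https://example.net/c",
     "Cheap flights and hotel deals for travellers visiting Lisbon in the spring season"),
]
unique_urls = dedupe(pages)
```

At real index sizes you would swap the pairwise comparison for something like MinHash or SimHash signatures, but the idea of what counts as "the same page" stays the same.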
Think about the sites you visit regularly, and about those that regularly turn up in searches. How many of those useful sites are below the top 100,000? Does it matter if there are 100 million+ domains when 99.9% of your needs are covered by the top 0.1%? With a smaller index, choosing what you leave out is pretty important.
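For a rough sense of how these numbers fit together, here is the back-of-envelope arithmetic as a small sketch. The domain count and the 0.1% share come from the figures above; spreading the page budget evenly across sites is purely my simplifying assumption, since a real index would give big sites far more pages than small ones.

```python
TOTAL_DOMAINS = 100_000_000       # "100 million+ domains", taken as a round number
TOP_SITES = TOTAL_DOMAINS // 1000  # the top 0.1% of domains

def pages_per_site(index_budget, sites=TOP_SITES):
    """Average pages available per site if the budget were split evenly
    (an assumed, deliberately naive allocation)."""
    return index_budget // sites

small_index = pages_per_site(100_000_000)    # the ~100 million page index
deep_index = pages_per_site(1_000_000_000)   # the ~1 billion page index
```

Under that even split, the top 0.1% works out to 100,000 sites, with an average of 1,000 pages each in a 100 million page index and 10,000 each at 1 billion.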
There’s a site I like to play with when trying to find obscure results. It’s fun to experiment with, and it helps you understand how much the size and quality of your index affect your results: