Category Archives: Search Engines

Search engine related posts.

The WbSrch Experiment

Off-and-on over the last 8 years I’ve worked on an independent search engine called WbSrch. It made it as far as being as good as the late-1990s search engines, which is great, because the original goal was to build something much like Altavista. That was my first “main” search engine.

At one point I tried to turn it into a real business. That went poorly and I shut it down. Then I brought it back to work on as a hobby/fun project. That was interesting and fun for a while, but it’s run its course. I’ve done all the things I set out to do and learned all the things I wanted to learn. I’ve had my fun, so there’s no need to tinker with web search anymore. It did keep me busy toward the end of the pandemic as I was starting to go stir crazy, and I’m grateful for that.

If you’d like to see what it looked like when I finished with it, take a look at this capture on archive.org.

If you’d like to use a pretty good alternative search engine, I suggest Mojeek or Yandex. The MusicSrch music search engine is still going, too.

And if you’d like to get a copy of some of the data I collected, there are a few inexpensive data downloads available.

Now that you’re here, feel free to explore the blog a bit. I have a bunch of websites and music projects I’ve created, and you might find some of them interesting (under the “My Stuff” section of the sidebar).

Smarthost.net: Ultimately Just a Waste of Time

Smarthost.net has some nice “storage server” deals with some very configurable options. If you want a VPS with 1TB of disk space, their offerings are pretty attractive.

For my search engine, I need hosting with a good chunk of disk space in order to hold the index. It doesn’t need to be fast storage, and it doesn’t need a lot of CPU and RAM — retrieval of index entries is fast and efficient.

This made Smarthost look pretty ideal, so I signed up and got a web server going. It worked well for about a month. So I decided to set up a second one to do some light web crawling (you don’t get enough cores on their plans to do anything heavy).

After about a week, both machines were unreachable. I contacted support and found out that the drive array had failed on the machine hosting both of them. Support tried to recover it, but ultimately it was a total failure. So the little bit of web crawling data and about 2 weeks of search engine log data (everything since the last time I pulled it) were destroyed.

Annoying, but hardware failures happen.

A week later I found my crawler machine suspended because of a false positive on Spamhaus. Apparently their system is so badly-written that just visiting a domain with a web crawler can get you on a “bad list” for supposedly hosting a virus/malware. Many hosting providers, Smarthost included, will auto-suspend service for any box that gets on that list.

I got that machine removed from Spamhaus the same day and had it reactivated a few hours later to download the 200k pages or so that had been crawled, but support was pretty snarky about it. Clearly Smarthost is not a service that is compatible with what I do.

I ended up moving the web server to 1tbvps, which is slightly more expensive, but has more CPU cores and RAM, which is always nice. I moved the crawler to Digital Ocean, which is a very data-science-friendly service. We’ll see whether I have issues with those, but I suspect they will work better for my purposes.

Ultimately my two-month experience with Smarthost ended up being a complete waste of time.

 

Media.net Didn’t Work For Me

I received a message that my site was disapproved today, so they’re the first ones to fail out of my newest ad network comparison experiment.

Looking closer at their policies I see this (bold added by me):

“Our program has been designed for sites with premium content. Sites that promote, contain, or link directly to the following types of content shall not be approved.

  • Adult, Pornographic or any illegal content
  • Tobacco, alcohol, ammunition, hazardous substances, illegal drugs, gore, violence, gambling and racism content
  • Pages containing profanity or content that and/or discriminates or is offensive to any section of people
  • Hate, violence, racial intolerance, or advocate against any individual, group, or organization
  • Sale of prescription drugs
  • Sale of counterfeit products, imitations of designer or other goods, stolen items or any products that infringe intellectual property rights of other parties
  • Contain programs which promote invalid click activity by paying users to clicking on ads, browse websites, read email etc.
  • Websites that contain forums, discussion boards, chat rooms, or any content area that is open to public updates without adequate moderation
  • Sites with content that has been generated using computer programs and hence may not be comprehendible.
  • Bulk of the content is user-generated
  • Sites with fake news
  • Any other content that we believe in our sole discretion to be illegal”

So their network is ABSOLUTELY incompatible with a search engine since it links to everything on that list.

Still in the running are Adsaro, Adsterra, Bidvertiser, Galaksion, and RevenueHits.

WbSrch Online Again

I found a way to get WbSrch online inexpensively, through a combination of code optimizations and an inexpensive high-disk-space hosting provider. It doesn’t need fast SSD storage to serve the index data, so it works just fine on a mechanical hard drive, and it’s easier to get a lot of space inexpensively with one of those. Thanks to a bunch of memory and query optimizations, it’s zippier now on an inexpensive VPS than it was on a 12-core server with 192GB of memory and 8 SSDs. For now I’m running the crawler and indexer from home and pushing index updates to the server as they’re done.

I’ve been using it as my main search engine even though the indexes are a bit out of date, and the results have been better than I expected. It has definitely improved over the years.

Try it out:

https://wbsrch.com

PageRank Lives: OpenPageRank by Domcop

In the early days of Google, PageRank was a very important piece of information about a website. It let you know the general authority level of a site and how well it would tend to rank against similar content on another site. The PageRank toolbar, released in 2000, became an important tool in the SEO world.

Over time, Google de-emphasized PageRank, partly because people were gaming the system and partly because they switched to emphasizing other factors when ranking a site. They eventually stopped updating PageRank data and in 2016, they finally shut off the toolbar.

There have been other metrics created, such as Domain Authority by Moz, but nothing has quite been a proper replacement.

Now that the PageRank patent has expired, companies are free to implement their own versions. Domcop has done just that, using the Common Crawl data to calculate PageRank for the top 10 million domains on the web.
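For a rough idea of what calculating PageRank involves, here is a minimal sketch of the classic power-iteration approach on a toy link graph. This is not Domcop's implementation; the `pagerank` function, the 0.85 damping factor, the iteration count, and the `.example` domains are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of {domain: [outbound domains]}."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node starts each round with its "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly across its outbound links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy link graph, not real crawl data.
toy_links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}
ranks = pagerank(toy_links)
for domain, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{domain}: {score:.4f}")
```

The raw scores sum to 1 across the graph. Any rescaling onto a friendlier range would be a separate step that this sketch does not attempt.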

You can use it here: https://www.domcop.com/openpagerank/

As of the time of this post, Xangis.com has an OpenPageRank of 3.40.

Top-Level Domain Popularity

In a crawl of just over 32 million pages, this is the number of domains that I discovered for each top-level domain (TLD). The “Known Domains” is the number of domains with that extension that were found in links, while the “Crawled Domains” is the number of domains where pages were retrieved from.

Extension Known Domains Crawled Domains
.aaa 5 2
.aarp 1 0
.abarth 1 1
.abb 9 1
.abbott 74 32
.abbvie 0 0
.abc 9 1
.able 0 0
.abogado 82 47
.abudhabi 12 2
.ac 1404 540
.academy 9638 6340
.accenture 1 0
.accountant 550 347
.accountants 611 417
.aco 6 6
.actor 682 444
.ad 726 143
.adac 1 1
.ads 29 0
.adult 4438 3832
.ae 15743 8579
.aeg 1 1
.aero 4000 2179
.aetna 0 0
.af 1279 435
.afamilycompany 0 0
.afl 13 8
.africa 305 152
.ag 2497 1325
.agakhan 0 0
.agency 14562 9300
.ai 4685 2083
.aig 1 0
.aigo 0 0
.airbus 0 0
.airforce 101 60
.airtel 0 0
.akdn 0 0
.al 3944 1224
.alfaromeo 0 0
.alibaba 0 0
.alipay 0 0
.allfinanz 1 0
.allstate 0 0
.ally 1 0
.alsace 85 43
.alstom 0 0
.am 8687 3314
.americanexpress 0 0
.americanfamily 0 0
.amex 4 0
.amfam 0 0
.amica 3 1
.amsterdam 15930 13438
.analytics 3 0
.android 2 0
.anquan 0 0
.anz 1 0
.ao 1051 541
.aol 1 0
.apartments 1395 963
.app 3397 1067
.apple 13 0
.aq 60 31
.aquarelle 13 12
.ar 290350 193365
.arab 0 0
.aramco 2 1
.archi 934 633
.army 384 241
.arpa 2 0
.art 1305 467
.arte 4 1
.as 2027 1219
.asda 1 0
.asia 87348 48153
.associates 959 693
.at 482762 283611
.athleta 0 0
.attorney 3192 2492
.au 676566 422310
.auction 1032 694
.audi 146 53
.audible 1 1
.audio 2662 1675
.auspost 6 3
.author 3 0
.auto 143 69
.autos 2 0
.avianca 0 0
.aw 209 141
.aws 30 2
.ax 627 343
.axa 9 4
.az 9198 2090
.azure 3 1
.ba 8972 5858
.baby 10 2
.baidu 3 0
.banamex 0 0
.bananarepublic 0 0
.band 2997 1914
.bank 493 277
.bar 2210 1382
.barcelona 144 56
.barclaycard 9 3
.barclays 21 4
.barefoot 0 0
.bargains 895 597
.baseball 0 0
.basketball 36 15
.bauhaus 1 1
.bayern 267 168
.bb 362 225
.bbc 4 0
.bbt 2 0
.bbva 1 1
.bcg 0 0
.bcn 3 0
.bd 3134 986
.be 394979 286735
.beats 1 0
.beauty 4 0
.beer 315 160
.bentley 1 0
.berlin 15650 9387
.best 249 135
.bestbuy 1 0
.bet 627 229
.bf 386 175
.bg 26043 3364
.bh 777 336
.bharti 0 0
.bi 379 157
.bible 113 50
.bid 2039 539
.bike 6624 4226
.bing 2 1
.bingo 428 240
.bio 6462 4167
.biz 966268 563218
.bj 176 75
.black 1335 904
.blackfriday 505 325
.blockbuster 0 0
.blog 5342 1900
.bloomberg 73 0
.blue 4297 2433
.bm 1284 747
.bms 1 1
.bmw 7 2
.bn 375 137
.bnl 0 0
.bnpparibas 41 20
.bo 3129 2072
.boats 3 2
.boehringer 0 0
.bofa 1 0
.bom 3 1
.bond 1 1
.boo 2 0
.book 2 0
.booking 3 0
.bosch 0 0
.bostik 1 0
.boston 16 6
.bot 55 22
.boutique 3123 1923
.box 26 0
.br 661574 387097
.bradesco 37 20
.bridgestone 3 1
.broadway 0 0
.broker 51 29
.brother 5 2
.brussels 568 182
.bs 209 80
.bt 488 220
.budapest 1 1
.bugatti 4 1
.build 217 137
.builders 1551 1086
.business 4597 3145
.buy 1 0
.buzz 1232 920
.bv 2 0
.bw 582 259
.by 20281 2547
.bz 6646 3668
.bzh 640 310
.ca 494823 303073
.cab 103 18
.cafe 3211 2111
.cal 5 0
.call 9 0
.calvinklein 0 0
.cam 143 34
.camera 1913 1259
.camp 2508 1595
.cancerresearch 1 1
.canon 27 9
.capetown 1731 1270
.capital 2808 1984
.capitalone 2 0
.car 100 36
.caravan 1 1
.cards 1905 1182
.care 5833 4047
.career 14 8
.careers 3172 2019
.cars 103 34
.cartier 1 1
.casa 103 47
.case 0 0
.caseih 0 0
.cash 2524 1701
.casino 1551 999
.cat 30073 8075
.catering 1397 982
.catholic 0 0
.cba 9 0
.cbn 2 1
.cbre 1 0
.cbs 1 0
.cc 109058 39325
.cd 638 243
.ceb 7 1
.center 14026 8734
.ceo 47 18
.cern 19 12
.cf 7736 3354
.cfa 5 1
.cfd 3 1
.cg 114 32
.ch 529953 328052
.chanel 1 1
.channel 1 0
.charity 3 2
.chase 1 0
.chat 2957 1875
.cheap 1275 880
.chintai 1 0
.christmas 684 462
.chrome 2 0
.chrysler 0 0
.church 8726 5577
.ci 1038 447
.cipriani 1 1
.circle 0 0
.cisco 4 0
.citadel 0 0
.citi 3 0
.citic 10 1
.city 8574 5998
.cityeats 0 0
.ck 113 65
.cl 191023 121801
.claims 625 450
.cleaning 891 596
.click 9651 5813
.clinic 2513 1558
.clinique 0 0
.clothing 4731 3183
.cloud 19563 11822
.club 97931 58118
.clubmed 6 3
.cm 1346 494
.cn 265238 56575
.co 384542 252055
.coach 3315 2318
.codes 3058 2344
.coffee 5140 3385
.college 1437 1034
.cologne 1071 499
.com 13873332 7043596
.comcast 1 0
.commbank 0 0
.community 4093 2578
.company 18672 12938
.compare 1 0
.computer 2134 1324
.comsec 0 0
.condos 922 722
.construction 2889 2029
.consulting 8542 5905
.contact 2 0
.contractors 1625 1141
.cooking 21 7
.cookingchannel 0 0
.cool 5408 3315
.coop 5437 2905
.corsica 65 29
.country 86 51
.coupon 1 0
.coupons 600 422
.courses 384 224
.cr 4117 2143
.credit 931 638
.creditcard 305 213
.creditunion 2 1
.cricket 537 329
.crown 3 2
.crs 29 22
.cruise 0 0
.cruises 1138 868
.csc 2 0
.cu 1310 582
.cuisinella 2 1
.cv 485 291
.cw 133 64
.cx 1859 710
.cy 2286 1211
.cymru 2962 1880
.cyou 1 0
.cz 511459 318754
.dabur 1 1
.dad 1 0
.dance 2245 1420
.data 6 0
.date 2654 1153
.dating 1204 813
.datsun 1 1
.day 1 0
.dclk 1 0
.dds 1 0
.de 1549945 947359
.deal 0 0
.dealer 0 0
.deals 3135 2193
.degree 176 97
.delivery 970 619
.dell 0 0
.deloitte 0 0
.delta 2 1
.democrat 260 191
.dental 3232 2110
.dentist 1156 861
.desi 575 403
.design 15331 10090
.dev 893 485
.dhl 9 4
.diamonds 1229 889
.diet 779 544
.digital 8922 5915
.direct 4063 2927
.directory 6229 4730
.discount 1844 1173
.discover 1 0
.dish 0 0
.diy 2 0
.dj 500 268
.dk 343151 280337
.dm 129 64
.dnp 6 2
.do 3578 1635
.docs 2 0
.doctor 522 120
.dodge 0 0
.dog 2630 1763
.domains 2811 1715
.dot 6 0
.download 1848 613
.drive 4 0
.dtv 0 0
.dubai 0 0
.duck 0 0
.dunlop 0 0
.duns 0 0
.dupont 0 0
.durban 832 566
.dvag 10 2
.dvr 0 0
.dz 1767 814
.earth 1840 1318
.eat 1 0
.ec 7957 5105
.eco 135 69
.edeka 8 2
.edu 187599 100044
.education 8609 5628
.ee 35647 22095
.eg 3556 1038
.email 3779 2700
.emerck 0 0
.energy 2445 1643
.engineer 786 516
.engineering 1945 1261
.enterprises 2086 1510
.epson 1 1
.equipment 2667 1857
.er 14 2
.ericsson 0 0
.erni 5 2
.es 503765 286165
.esq 0 0
.estate 3769 2660
.esurance 0 0
.et 576 196
.etisalat 0 0
.eu 419001 235690
.eurovision 1 1
.eus 4185 2282
.events 8968 5898
.everbank 2 1
.exchange 2294 1528
.expert 13573 9239
.exposed 905 701
.express 1490 967
.extraspace 0 0
.fage 8 6
.fail 1112 813
.fairwinds 0 0
.faith 677 339
.family 3265 2487
.fan 23 10
.fans 432 247
.farm 4594 3258
.farmers 1 0
.fashion 2460 1662
.fast 3 0
.fedex 0 0
.feedback 197 78
.ferrari 0 0
.ferrero 0 0
.fi 220758 127408
.fiat 0 0
.fidelity 0 0
.fido 0 0
.film 953 568
.final 2 1
.finance 2168 1493
.financial 1348 929
.fire 4 1
.firestone 0 0
.firmdale 41 1
.fish 578 356
.fishing 26 13
.fit 3189 2207
.fitness 3269 2244
.fj 482 265
.fk 42 13
.flickr 1 0
.flights 808 575
.flir 0 0
.florist 1132 771
.flowers 756 481
.fly 1 0
.fm 12330 8030
.fo 1467 886
.foo 9 0
.food 3 0
.foodnetwork 0 0
.football 1770 1351
.ford 3 0
.forex 21 9
.forsale 1980 1507
.forum 2 0
.foundation 3320 2312
.fox 11 2
.fr 643262 334399
.free 12 0
.fresenius 0 0
.frl 9305 7762
.frogans 1 0
.frontdoor 0 0
.frontier 0 0
.ftr 0 0
.fujitsu 0 0
.fujixerox 0 0
.fun 3122 686
.fund 2088 1378
.furniture 868 597
.futbol 896 653
.fyi 2502 1806
.ga 8892 3830
.gal 1844 1038
.gallery 7602 5129
.gallo 0 0
.gallup 0 0
.game 148 38
.games 782 248
.gap 1 0
.garden 618 413
.gb 6 0
.gbiz 0 0
.gd 1053 481
.gdn 505 45
.ge 8048 2317
.gea 0 0
.gent 857 467
.genting 1 1
.george 0 0
.gf 40 11
.gg 6276 3820
.ggee 1 1
.gh 1083 527
.gi 474 265
.gift 1911 1169
.gifts 1150 735
.gives 358 229
.giving 2 1
.gl 989 443
.glade 0 0
.glass 1516 1039
.gle 2 0
.global 2250 1360
.globo 57 3
.gm 227 120
.gmail 3 0
.gmbh 280 257
.gmo 1 1
.gmx 0 0
.gn 38 11
.godaddy 2 0
.gold 1738 739
.goldpoint 1 1
.golf 2307 1626
.goo 15 1
.goodyear 0 0
.goog 20 0
.google 41 7
.gop 137 67
.got 2 0
.gov 39295 19002
.gp 542 199
.gq 4137 1674
.gr 255119 146141
.grainger 0 0
.graphics 3280 2144
.gratis 1751 1186
.green 131 61
.gripe 363 280
.grocery 0 0
.group 1186 706
.gs 1077 420
.gt 4323 2647
.gu 24 11
.guardian 4 0
.gucci 1 0
.guge 1 0
.guide 6145 4430
.guitars 535 381
.guru 23285 17219
.gw 15 6
.gy 523 243
.hair 1 0
.hamburg 305 189
.hangout 1 0
.haus 1796 1215
.hbo 0 0
.hdfc 0 0
.hdfcbank 0 0
.health 229 151
.healthcare 2589 1853
.help 4430 2641
.helsinki 0 0
.here 18 0
.hermes 1 0
.hgtv 0 0
.hiphop 342 234
.hisamitsu 3 1
.hitachi 3 1
.hiv 150 86
.hk 42346 15826
.hkt 0 0
.hm 134 50
.hn 1191 663
.hockey 388 218
.holdings 1982 1358
.holiday 2578 1650
.homedepot 3 0
.homegoods 0 0
.homes 18 14
.homesense 0 0
.honda 4 2
.honeywell 0 0
.horse 205 97
.hospital 8 2
.host 2805 792
.hosting 1590 927
.hot 18 0
.hoteles 1 0
.hotels 1 0
.hotmail 2 1
.house 6318 4484
.how 785 489
.hr 36459 22489
.hsbc 7 1
.ht 491 231
.hu 359105 254012
.hughes 1 0
.hyatt 0 0
.hyundai 1 1
.ibm 3 0
.icbc 1 1
.ice 11 2
.icu 9620 2978
.id 54716 12406
.ie 154458 91810
.ieee 0 0
.ifm 2 0
.ikano 2 2
.il 150023 34255
.im 4018 1921
.imamat 1 0
.imdb 2 1
.immo 5846 3124
.immobilien 4698 2950
.in 399200 235689
.inc 22 6
.industries 1262 849
.infiniti 1 1
.info 519439 280079
.ing 0 0
.ink 3898 2211
.institute 3610 2533
.insurance 10 8
.insure 1497 1028
.int 1671 569
.intel 0 0
.international 8481 5800
.intuit 1 0
.investments 1168 858
.io 154605 60348
.ipiranga 6 1
.iq 717 88
.ir 250770 45464
.irish 641 487
.is 16493 9459
.iselect 0 0
.ismaili 2 1
.ist 147 57
.istanbul 163 62
.it 790513 470658
.itau 9 2
.itv 1 0
.iveco 0 0
.jaguar 0 0
.java 13 0
.jcb 8 3
.jcp 0 0
.je 730 396
.jeep 0 0
.jetzt 1411 810
.jewelry 884 583
.jio 0 0
.jll 6 2
.jm 493 285
.jmp 1 0
.jnj 0 0
.jo 1774 842
.jobs 12415 5771
.joburg 1153 798
.jot 0 0
.joy 0 0
.jp 839358 183282
.jpmorgan 0 0
.jprs 2 0
.juegos 272 174
.juniper 2 0
.kaufen 4823 2767
.kddi 1 1
.ke 6331 3380
.kerryhotels 0 0
.kerrylogistics 0 0
.kerryproperties 0 0
.kfh 0 0
.kg 2377 324
.kh 1114 503
.ki 194 54
.kia 1 1
.kim 3773 1969
.kinder 2 0
.kindle 1 1
.kitchen 2518 1727
.kiwi 1465 943
.km 58 26
.kn 58 24
.koeln 265 166
.komatsu 9 4
.kosher 0 0
.kp 38 3
.kpmg 4 1
.kpn 4 1
.kr 241382 120692
.krd 71 21
.kred 9 4
.kuokgroup 0 0
.kw 1078 418
.ky 919 553
.kyoto 57 1
.kz 15708 2921
.la 5521 3025
.lacaixa 0 0
.ladbrokes 1 0
.lamborghini 26 9
.lamer 0 0
.lancaster 1 1
.lancia 0 0
.lancome 0 0
.land 6613 4606
.landrover 0 0
.lanxess 2 0
.lasalle 1 0
.lat 582 335
.latino 0 0
.latrobe 3 0
.law 2297 1124
.lawyer 4925 3697
.lb 1373 737
.lc 396 177
.lds 1 0
.lease 639 403
.leclerc 79 18
.lefrak 0 0
.legal 4365 2916
.lego 0 0
.lexus 3 1
.lgbt 68 24
.li 5741 3303
.liaison 1 0
.lidl 12 6
.life 21427 13930
.lifeinsurance 0 0
.lifestyle 0 0
.lighting 2726 1814
.like 10 0
.lilly 0 0
.limited 631 420
.limo 1112 787
.lincoln 0 0
.linde 35 0
.link 17735 9810
.lipsy 0 0
.live 12743 5874
.living 3 0
.lixil 3 1
.lk 6636 3731
.llc 29 23
.loan 5371 2517
.loans 1112 822
.locker 0 0
.locus 3 1
.loft 1 0
.lol 2359 1489
.london 10589 7220
.lotte 1 0
.lotto 1 0
.love 4140 2340
.lpl 0 0
.lplfinancial 0 0
.lr 93 46
.ls 208 96
.lt 93950 56378
.ltd 905 403
.ltda 29 24
.lu 20752 11053
.lundbeck 1 0
.lupin 1 1
.luxe 8 2
.luxury 430 200
.lv 34500 19048
.ly 4222 1613
.ma 7218 4078
.macys 1 0
.madrid 6 2
.maif 2 2
.maison 377 209
.makeup 1 0
.man 17 1
.management 4474 3138
.mango 31 1
.map 7 0
.market 2984 1722
.marketing 7118 4993
.markets 102 50
.marriott 0 0
.marshalls 0 0
.maserati 0 0
.mattel 0 0
.mba 717 473
.mc 829 442
.mckinsey 0 0
.md 6610 3257
.me 275661 205481
.med 8 0
.media 14357 9543
.meet 3 0
.melbourne 141 92
.meme 2 0
.memorial 143 91
.men 2595 729
.menu 140 52
.merckmsd 0 0
.metlife 1 0
.mg 777 317
.mh 6 0
.miami 87 36
.microsoft 29 1
.mil 3680 861
.mini 4 1
.mint 1 0
.mit 6 0
.mitsubishi 1 0
.mk 5882 1498
.ml 8756 3212
.mlb 1 0
.mls 2 0
.mm 781 170
.mma 9 1
.mn 4306 1018
.mo 940 247
.mobi 148527 132421
.mobile 9 0
.mobily 0 0
.moda 1043 596
.moe 1255 468
.moi 1 0
.mom 44 6
.monash 8 3
.money 2936 2140
.monster 9 6
.mopar 0 0
.mormon 1 0
.mortgage 837 610
.moscow 476 51
.moto 0 0
.motorcycles 3 1
.mov 14 0
.movie 668 250
.movistar 1 0
.mp 122 47
.mq 48 17
.mr 210 70
.ms 1813 671
.msd 0 0
.mt 1973 1292
.mtn 1 0
.mtr 0 0
.mu 1785 973
.museum 302 126
.mutual 0 0
.mv 559 336
.mw 416 161
.mx 349969 230643
.my 94138 57728
.mz 878 450
.na 907 508
.nab 0 0
.nadex 1 1
.nagoya 2009 875
.name 35404 21633
.nationwide 0 0
.natura 2 0
.navy 183 126
.nba 1 0
.nc 1217 693
.ne 202 45
.nec 5 1
.net 2918680 1578650
.netbank 0 0
.netflix 3 0
.network 8839 5525
.neustar 20 8
.new 49 1
.newholland 0 0
.news 19025 12561
.next 2 0
.nextdirect 0 0
.nexus 1 0
.nf 605 355
.nfl 0 0
.ng 10990 4370
.ngo 225 130
.nhk 1 1
.ni 1062 604
.nico 8 1
.nike 1 0
.nikon 1 0
.ninja 14014 10600
.nissan 3 2
.nissay 0 0
.nl 597480 422347
.no 298985 186997
.nokia 1 0
.northwesternmutual 0 0
.norton 0 0
.now 25 0
.nowruz 0 0
.nowtv 0 0
.np 3242 1311
.nr 2663 7
.nra 5 1
.nrw 261 145
.ntt 2 0
.nu 40788 25415
.nyc 19830 14117
.nz 325827 199878
.obi 0 0
.observer 19 5
.off 2 0
.office 31 1
.okinawa 1328 462
.olayan 0 0
.olayangroup 0 0
.oldnavy 1 0
.ollo 0 0
.om 1076 271
.omega 2 1
.one 19913 13958
.ong 42 21
.onl 202 95
.online 85456 51482
.onyourside 1 0
.ooo 1033 328
.open 3 0
.oracle 4 0
.orange 11 0
.org 2241890 1252364
.organic 20 8
.origins 0 0
.osaka 196 65
.otsuka 1 1
.ott 1 0
.ovh 15483 8056
.pa 1745 949
.page 103 40
.panasonic 0 0
.paris 10824 6165
.pars 0 0
.partners 2336 1533
.parts 2367 1448
.party 7260 4510
.passagens 0 0
.pay 4 1
.pccw 0 0
.pe 19986 12887
.pet 1068 631
.pf 638 368
.pfizer 0 0
.pg 520 259
.ph 21224 9862
.pharmacy 32 17
.phd 1 0
.philips 4 1
.phone 2 0
.photo 7508 4640
.photography 15237 10326
.photos 8636 5659
.physio 38 25
.piaget 1 1
.pics 3290 2152
.pictet 5 2
.pictures 2948 2074
.pid 0 0
.pin 2 0
.ping 2 1
.pink 2601 1005
.pioneer 4 0
.pizza 1810 1170
.pk 18738 11160
.pl 687150 443340
.place 1450 1010
.play 5 0
.playstation 2 0
.plumbing 1575 1204
.plus 2967 1608
.pm 670 209
.pn 128 38
.pnc 1 0
.pohl 0 0
.poker 97 15
.politie 1 0
.porn 5510 4502
.post 31 12
.pr 560 256
.pramerica 0 0
.praxi 4 0
.press 3108 1819
.prime 2 0
.pro 101285 54877
.prod 30 0
.productions 2016 1383
.prof 0 0
.progressive 0 0
.promo 138 28
.properties 4246 3227
.property 1528 1179
.protection 15 9
.pru 1 0
.prudential 0 0
.ps 2238 698
.pt 158549 94786
.pub 7114 3432
.pw 16855 6772
.pwc 1 0
.py 3600 2219
.qa 2181 722
.qpon 1 2
.quebec 1260 679
.quest 0 0
.qvc 2 0
.racing 696 400
.radio 66 30
.raid 0 0
.re 2384 1181
.read 5 0
.realestate 20 15
.realtor 129 73
.realty 17 10
.recipes 1507 1122
.red 11339 6868
.redstone 4 2
.redumbrella 0 0
.rehab 555 393
.reise 701 348
.reisen 2975 1643
.reit 50 14
.reliance 0 0
.ren 113 10
.rent 862 528
.rentals 4604 3364
.repair 2723 1765
.report 1991 1278
.republican 176 119
.rest 669 386
.restaurant 2472 1576
.review 2487 957
.reviews 6033 4341
.rexroth 0 0
.rich 7 4
.richardli 0 0
.ricoh 5 3
.rightathome 0 0
.ril 0 0
.rio 395 189
.rip 743 485
.rmit 2 1
.ro 283648 175550
.rocher 0 0
.rocks 26262 17677
.rodeo 13 9
.rogers 1 0
.room 2 0
.rs 27729 15527
.rsvp 0 0
.ru 671884 146922
.rugby 3 4
.ruhr 96 58
.run 3092 1642
.rw 652 343
.rwe 2 1
.ryukyu 162 56
.sa 8638 3607
.saarland 2164 1060
.safe 2 0
.safety 0 0
.sakura 1 1
.sale 2783 1627
.salon 51 24
.samsclub 0 0
.samsung 3 0
.sandvik 15 6
.sandvikcoromant 1 0
.sanofi 1 1
.sap 15 1
.sarl 266 156
.sas 2 0
.save 3 0
.saxo 12 3
.sb 110 54
.sbi 9 2
.sbs 0 0
.sc 1699 943
.sca 2 1
.scb 3 1
.schaeffler 0 0
.schmidt 6 3
.scholarships 1 1
.school 3596 2281
.schule 1212 656
.schwarz 2 1
.science 5341 3925
.scjohnson 0 0
.scor 1 1
.scot 5932 3728
.sd 643 136
.se 272318 163066
.search 2 0
.seat 198 97
.secure 1 0
.security 50 23
.seek 2 1
.select 3 0
.sener 15 6
.services 11298 7773
.ses 9 6
.seven 3 1
.sew 1 1
.sex 3512 2875
.sexy 3447 2325
.sfr 0 0
.sg 89051 47869
.sh 3204 967
.shangrila 0 0
.sharp 11 2
.shaw 3 0
.shell 3 0
.shia 0 0
.shiksha 192 102
.shoes 2411 1474
.shop 5608 1919
.shopping 58 20
.shouji 0 0
.show 1925 1101
.showtime 1 0
.shriram 15 10
.si 41821 24037
.silk 1 1
.sina 1 0
.singles 1799 1388
.site 41256 20024
.sj 0 0
.sk 264464 156494
.ski 2572 1417
.skin 1 0
.sky 31 1
.skype 6 1
.sl 251 110
.sling 0 0
.sm 524 273
.smart 5 1
.smile 2 0
.sn 1097 552
.sncf 17 9
.so 996 304
.soccer 749 533
.social 5673 3515
.softbank 2 1
.software 4411 2717
.sohu 1 0
.solar 2775 1999
.solutions 21815 15918
.song 1 0
.sony 6 2
.soy 403 286
.space 27195 16447
.sport 24 15
.spot 7 0
.spreadbetting 7 1
.sr 379 234
.srl 1069 692
.srt 2 0
.ss 1 0
.st 3390 1454
.stada 16 2
.staples 0 0
.star 6 0
.starhub 1 0
.statebank 0 0
.statefarm 1 0
.stc 3 2
.stcgroup 1 1
.stockholm 8 6
.storage 13 3
.store 4186 1301
.stream 1747 363
.studio 5360 3333
.study 295 155
.style 2120 1331
.su 61322 10540
.sucks 2798 2273
.supplies 935 675
.supply 1530 1110
.support 7628 4755
.surf 85 36
.surgery 792 607
.suzuki 3 1
.sv 1854 1125
.swatch 2 1
.swiftcover 0 0
.swiss 5633 2555
.sx 239 105
.sy 581 64
.sydney 2876 2066
.symantec 0 0
.systems 9394 6353
.sz 207 119
.tab 0 0
.taipei 332 38
.talk 1 0
.taobao 1 0
.target 3 0
.tatamotors 2 1
.tatar 19 2
.tattoo 935 573
.tax 2621 1658
.taxi 1862 1111
.tc 2411 1671
.tci 0 0
.td 47 14
.tdk 0 0
.team 3942 2511
.tech 23308 14553
.technology 10700 7626
.tel 8257 6893
.telefonica 9 0
.temasek 1 0
.tennis 567 320
.teva 2 0
.tf 349 105
.tg 178 92
.th 30372 9954
.thd 1 0
.theater 484 262
.theatre 17 6
.tiaa 0 0
.tickets 313 165
.tienda 1008 572
.tiffany 1 0
.tips 14694 10409
.tires 285 177
.tirol 223 172
.tj 913 110
.tjmaxx 0 0
.tjx 0 0
.tk 57924 36329
.tkmaxx 0 0
.tl 7734 5990
.tm 480 112
.tmall 1 0
.tn 4361 1880
.to 11373 3850
.today 21633 15057
.tokyo 21160 9514
.tools 3902 2374
.top 13223 2897
.toray 7 3
.toshiba 3 1
.total 7 2
.tours 2411 1629
.town 1320 944
.toyota 4 1
.toys 1684 1095
.tr 223035 152072
.trade 1338 605
.trading 113 60
.training 7642 5225
.travel 4816 2303
.travelchannel 0 0
.travelers 0 0
.travelersinsurance 0 0
.trust 5 1
.trv 2 0
.tt 919 417
.tube 165 37
.tui 1 1
.tunes 0 0
.tushu 0 0
.tv 223712 144419
.tvs 0 0
.tw 195314 49879
.tz 2434 1393
.ua 248130 54581
.ubank 0 0
.ubs 1 0
.uconnect 0 0
.ug 1664 872
.uk 1041962 600819
.unicom 0 0
.university 2222 1412
.uno 1682 1264
.uol 37 1
.ups 0 0
.us 362974 271719
.uy 12788 8842
.uz 4789 621
.va 238 50
.vacations 1724 1227
.vana 0 0
.vanguard 4 0
.vc 2294 986
.ve 9158 5371
.vegas 4496 3115
.ventures 3493 2572
.verisign 0 0
.versicherung 14 14
.vet 2169 1420
.vg 1593 1068
.vi 158 85
.viajes 530 299
.video 5111 3016
.vig 0 0
.viking 0 0
.villas 871 586
.vin 2281 1284
.vip 2034 422
.virgin 3 1
.visa 0 0
.vision 2347 1572
.vistaprint 2 1
.viva 1 1
.vivo 2 0
.vlaanderen 69 34
.vn 114281 23775
.vodka 12 8
.volkswagen 0 0
.volvo 1 0
.vote 528 343
.voting 3 1
.voto 85 26
.voyage 1240 835
.vu 1567 581
.vuelos 0 0
.wales 5624 3902
.walmart 4 0
.walter 3 1
.wang 9183 4748
.wanggou 0 0
.warman 0 0
.watch 2555 1434
.watches 1 1
.weather 1 0
.weatherchannel 1 0
.webcam 643 262
.weber 47 21
.website 30371 19469
.wed 6 6
.wedding 1987 1265
.weibo 1 0
.weir 72 1
.wf 145 74
.whoswho 2 1
.wien 3600 1960
.wiki 5183 3098
.williamhill 6 3
.win 21935 10278
.windows 2 1
.wine 4599 2934
.winners 0 0
.wme 1 1
.wolterskluwer 0 0
.woodside 2 0
.work 2904 749
.works 5315 3590
.world 31880 13729
.wow 6 0
.ws 133980 122131
.wtc 1 1
.wtf 3382 2357
.xbox 13 1
.xerox 0 0
.xfinity 0 0
.xihuan 0 0
.xin 6926 4789
.xn--11b4c3d 0 0
.xn--1ck2e1b 0 0
.xn--1qqw23a 0 0
.xn--2scrj9c 0 0
.xn--30rr7y 0 0
.xn--3bst00m 0 0
.xn--3ds443g 3 0
.xn--3e0b707e 51 18
.xn--3hcrj9c 0 0
.xn--3oq18vl8pn36a 0 0
.xn--3pxu8k 0 0
.xn--42c2d9a 0 0
.xn--45br5cyl 0 0
.xn--45brj9c 1 0
.xn--45q11c 2 0
.xn--4gbrim 1 0
.xn--54b7fta0cc 3 0
.xn--55qw42g 0 0
.xn--55qx5d 0 1
.xn--5su34j936bgsg 0 0
.xn--5tzm5g 0 0
.xn--6frz82g 329 114
.xn--6qq986b3xl 7 0
.xn--80adxhks 160 7
.xn--80ao21a 2 0
.xn--80aqecdr1a 0 0
.xn--80asehdb 857 138
.xn--80aswg 324 58
.xn--8y0a063a 0 0
.xn--90a3ac 12 2
.xn--90ae 24 1
.xn--90ais 149 2
.xn--9dbq2a 1 0
.xn--9et52u 0 0
.xn--9krt00a 0 0
.xn--b4w605ferd 1 0
.xn--bck1b9a5dre4c 0 0
.xn--c1avg 20 2
.xn--c2br7g 0 0
.xn--cck2b3b 0 0
.xn--cg4bki 0 0
.xn--clchc0ea0b2g2a9gcd 0 0
.xn--czr694b 1 1
.xn--czrs0t 0 0
.xn--czru2d 3 0
.xn--d1acj3b 24 0
.xn--d1alf 0 0
.xn--e1a4c 6 0
.xn--eckvdtc9d 0 0
.xn--efvy88h 0 0
.xn--estv75g 1 1
.xn--fct429k 0 0
.xn--fhbei 0 0
.xn--fiq228c5hs 0 0
.xn--fiq64b 0 0
.xn--fiqs8s 1220 44
.xn--fiqz9s 0 0
.xn--fjq720a 0 0
.xn--flw351e 1 0
.xn--fpcrj9c3d 0 1
.xn--fzc2c9e2c 1 0
.xn--fzys8d69uvgm 0 0
.xn--g2xx48c 0 0
.xn--gckr3f0f 0 0
.xn--gecrj9c 0 0
.xn--gk3at1e 0 0
.xn--h2breg3eve 0 0
.xn--h2brj9c 7 1
.xn--h2brj9c8c 0 0
.xn--hxt814e 59 27
.xn--i1b6b1a6a2e 0 0
.xn--imr513n 0 0
.xn--io0a7i 1 0
.xn--j1aef 7 0
.xn--j1amh 239 5
.xn--j6w193g 3 1
.xn--jlq61u9w7b 0 0
.xn--jvr189m 0 0
.xn--kcrx77d1x4a 1 1
.xn--kprw13d 0 0
.xn--kpry57d 10 1
.xn--kpu716f 0 0
.xn--kput3i 1 1
.xn--l1acc 1 1
.xn--lgbbat1ad8j 2 0
.xn--mgb9awbf 0 0
.xn--mgba3a3ejt 2 1
.xn--mgba3a4f16a 1 0
.xn--mgba7c0bbn0a 0 0
.xn--mgbaakc7dvf 0 0
.xn--mgbaam7a8h 5 1
.xn--mgbab2bd 101 19
.xn--mgbah1a3hjkrd 0 0
.xn--mgbai9azgqp6j 0 0
.xn--mgbayh7gpa 0 0
.xn--mgbb9fbpob 0 0
.xn--mgbbh1a 0 0
.xn--mgbbh1a71e 0 0
.xn--mgbc0a9azcg 0 0
.xn--mgbca7dzdo 0 0
.xn--mgberp4a5d4ar 0 0
.xn--mgbgu82a 0 0
.xn--mgbi4ecexp 0 0
.xn--mgbpl2fh 0 0
.xn--mgbt3dhd 0 0
.xn--mgbtx2b 0 0
.xn--mgbx4cd0ab 0 0
.xn--mix891f 0 0
.xn--mk1bu44c 6 0
.xn--mxtq1m 0 0
.xn--ngbc5azd 7 0
.xn--ngbe9e0a 0 0
.xn--ngbrx 0 0
.xn--node 2 0
.xn--nqv7f 0 0
.xn--nqv7fs00ema 0 0
.xn--nyqy26a 4 4
.xn--o3cw4h 9 1
.xn--ogbpf8fl 0 0
.xn--otu796d 0 0
.xn--p1acf 299 43
.xn--p1ai 299734 44067
.xn--pbt977c 0 0
.xn--pgbs0dh 2 0
.xn--pssy2u 0 0
.xn--q9jyb4c 287 158
.xn--qcka1pmc 1 0
.xn--qxam 9 2
.xn--rhqv96g 1 0
.xn--rovu88b 0 0
.xn--rvc1e0am3e 0 0
.xn--s9brj9c 0 0
.xn--ses554g 0 0
.xn--t60b56a 2 0
.xn--tckwe 85 26
.xn--tiq49xqyj 0 0
.xn--unup4y 1 0
.xn--vermgensberater-ctb 1 1
.xn--vermgensberatung-pwb 1 1
.xn--vhquv 0 0
.xn--vuq861b 3 1
.xn--w4r85el8fhu5dnra 0 0
.xn--w4rs40l 0 0
.xn--wgbh1c 2 0
.xn--wgbl6a 2 1
.xn--xhq521b 5 0
.xn--xkc2al3hye2a 0 0
.xn--xkc2dl3a5ee0h 0 0
.xn--y9a3aq 0 0
.xn--yfro4i67o 0 0
.xn--ygbi2ammx 0 0
.xn--zfr164b 3 2
.xxx 89175 3330
.xyz 289147 80131
.yachts 1 1
.yahoo 11 0
.yamaxun 1 1
.yandex 14 1
.ye 197 55
.yodobashi 1 1
.yoga 2964 2084
.yokohama 1481 684
.you 31 0
.youtube 7 0
.yt 214 77
.yun 0 0
.za 295219 203059
.zappos 1 1
.zara 2 0
.zero 3 0
.zip 63 0
.zm 571 295
.zone 7292 4840
.zuerich 1 1
.zw 1434 730
Total 44202501 23848767
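
If you want to work with these numbers, the table is easy to parse. The sketch below reads rows in the same three-column layout and computes what fraction of known domains were actually crawled for each TLD. The function names are made up for this example, and the sample contains only a handful of rows copied from the table.

```python
def parse_tld_rows(text):
    """Parse 'extension known crawled' rows into (tld, known, crawled) tuples."""
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        # Skip the header, the Total line, and anything malformed.
        if len(parts) != 3 or not parts[0].startswith("."):
            continue
        rows.append((parts[0], int(parts[1]), int(parts[2])))
    return rows


def crawl_ratio(known, crawled):
    """Fraction of known domains that were crawled (0.0 if none are known)."""
    return crawled / known if known else 0.0


# A few rows copied from the table above.
sample = """
Extension Known Domains Crawled Domains
.com 13873332 7043596
.de 1549945 947359
.ru 671884 146922
.xxx 89175 3330
.mobi 148527 132421
"""

rows = parse_tld_rows(sample)
# Print TLDs from least-crawled to most-crawled.
for tld, known, crawled in sorted(rows, key=lambda r: crawl_ratio(r[1], r[2])):
    print(f"{tld:6} {known:>10} {crawled:>9} {crawl_ratio(known, crawled):6.1%}")
```

Sorting by ratio makes it easy to spot extensions where many domains showed up in links but few were fetched.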

WbSrch Offline Again

I put the WbSrch search engine back online in March of 2018.

I spent a lot of time improving it over the following 16 months, but it’s the sort of thing that always manages to demand more time and energy. It’s time to stop giving it either. Though it’s grown and improved a lot, it’s not something I could ever imagine doing full-time. The money isn’t there and the fun isn’t there anymore.

So I’ve taken it down. This time probably for good.

I’ll be focusing on my music software, electronic music, and acoustic guitar music instead.

Thanks for reading, and if you’re reading this you quite possibly participated in the experiment that was the WbSrch search engine. Thank you.

WbSrch Online Again

A while back I open-sourced the code for the WbSrch search engine.

It’s online now in a much-reduced form at wbsrch.com.

It’s not the full search engine. Far from it. It’s just a tiny database of about 10,000 URLs to demo the source code, but if your search is simple enough, you might actually find what you’re looking for even in that tiny amount of data.

It probably won’t get any bigger — that’s about the size I can support “for free”, in that it doesn’t take enough resources on my inexpensive VPS to impact more important things. If you’re curious what the original WbSrch search engine was like, it’s a pretty good demo, at least visually.

2018 Is The Year That Twitter Ceases To Be Relevant

2018 is the year that Twitter ceases to be relevant.

It’s already stopped being relevant for me. I’ve stopped using it, and have deleted all of my tweets.

As a user, it’s just not worth it. It’s a miserable experience, made much worse by the userbase being made up primarily of Russian bots posing as MAGA idiots, actual alt-right MAGA idiots, and a small kernel of real people saying intelligent things that are drowned out by noise.

I’ve done (and still do) a lot of advertising on the web. For all of the different things I’ve been into, the worst ROI has consistently been via Twitter. Maybe some business types are viable via their ad platform, but none I’ve been involved in have been. It’s been a total waste of money. Mailing postcards would be a better value.

Most of the people I know in meatspace with accounts stopped using it long ago. Some stopped in 2015, some in 2016, some in 2017. I can name maybe five people who use it regularly, and some of them echo their tweets to Facebook. I don’t have a huge circle of friends, but compare that to about 140 on Facebook, with about 40% of them being active (50 or so people), and the order-of-magnitude drop in population makes Twitter far less interesting. Facebook has its own problems, but it still manages to be relevant, unlike Twitter.

Even though Twitter is garbage to me, maybe it isn’t garbage to everyone else.

Nope.

There hasn’t been much recent coverage that I can find with about 15 seconds of effort, but these from last year don’t paint a rosy picture:

Twitter is now losing users in the U.S
http://money.cnn.com/2017/07/27/technology/business/twitter-earnings/index.html

Twitter revenues decline for first time as advertising falls away
https://www.theguardian.com/technology/2017/apr/26/twitter-revenues-fall-first-quarter-results-advertising

Library of Congress Gives Up Collecting All Tweets Because Twitter Is Garbage
https://gizmodo.com/library-of-congress-gives-up-on-twitter-because-twitter-1821581190

When Twitter finally dies, nearly nothing of value will be lost.

And if it doesn’t die, why care?

Quora Answer: How would you find the websites to build a search index from scratch?

I originally wrote this as an answer to a question on Quora.

It depends on the scale.

If you just want to experiment with web crawling and build a basic search index, it’s common to start with the Alexa top million websites, which can be downloaded in a CSV file via S3 at: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

The top million changes daily, and includes a lot of spam and porn. It’s easy to game the system to get into the bottom 500k, so you’ll need to decide what to include and how to weight it.
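
As a rough sketch of that filtering step, the Python below parses rows in the rank,domain layout that the top-1m.csv file used, and keeps only domains at or above a rank cutoff that aren't on a blocklist. The function name, the default cutoff, and the blocklist are assumptions made up for illustration; what you actually include and how you weight it is up to you.

```python
import csv
import io


def load_seed_domains(csv_text, max_rank=500_000, blocklist=frozenset()):
    """Return (rank, domain) pairs to crawl from 'rank,domain' CSV text.

    max_rank and blocklist are illustrative knobs, not part of the Alexa file.
    """
    seeds = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank or malformed lines rather than failing the whole import.
        if len(row) != 2 or not row[0].isdigit():
            continue
        rank, domain = int(row[0]), row[1].strip().lower()
        if rank > max_rank:
            continue  # the lower ranks are the easiest to game
        if domain in blocklist:
            continue  # spam, porn, or anything else you choose to leave out
        seeds.append((rank, domain))
    return seeds


sample = "1,google.com\n2,youtube.com\n3,spam-example.test\n600000,lowrank.example\n"
print(load_seed_domains(sample, blocklist={"spam-example.test"}))
```

The rank stays attached to each domain so it can be used later as a crude weighting signal.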

The DMOZ website dump used to be a good starting point. They shut down on March 17th, 2017, so that’s no longer really an option. It left a lot out, but contained about 4 million URLs with most of the low-quality sites filtered out. There may be a mirror of that data somewhere (heck, I’d like to download their last URL dump if it’s available anywhere).

Real search engines have agreements in place with the top-level registrars that let them get a zone file dump listing all registered domains. This involves jumping through some hoops and filling out some forms, and each registrar needs to be dealt with separately. Getting access to .social is completely separate from getting access to .net.

Since it takes a LOT of work to get access to all TLD domain files, a commercial service like Domains Index is probably your best bet if you want to do anything on a large scale. I’ve bought from them before and it’s a good service. They don’t have absolutely everything, but 200 million domains is far better than the 1 million you get from Alexa.

Quora Answer: What are the reasons for Google’s search engine low market share in Russia, South Korea, the US?

I originally wrote this as an answer to a question on Quora.

Full question:

“While I know that Google is banned in China and Yahoo Japan uses Google algorithm, I do not know why Google’s market share is so poor in Russia, the US, South Korea. I’m especially interested in bad performance due to linguistic reasons, if they apply. What about the US?”

In Russia, Yandex is dominant. It’s a very good search engine, and is primarily Russian-language focused. It’s been around nearly as long as Google, and people have a certain loyalty to it, especially since it was the first good Russian-language search engine. Russians also don’t tend to be very trusting of U.S. companies.

In Korea, the story is similar. They have good regional homegrown search engines, Naver and Daum, that have been around about as long as Google and that cater specifically to the Korean-language market. There’s no reason to use Google because the local options work quite well.

In these cases it’s a combination of being better in local languages for a long time combined with the tendency to “buy local”. They don’t use Google because they don’t need or want Google.

In the U.S. there are a lot of factors. There have been a lot of good English-language options for a long time, even though many have come and gone. Brand loyalty is a very American thing, even when other brands might do slightly better, and getting a “second opinion” is also pretty ingrained. There are also a lot of people who are uncomfortable with Google’s level of information gathering and profile building (“spying”). This combined with the tens of billions of dollars in the search space leaves plenty of room for competitors, even though it’s very expensive to build a decent search engine and very difficult to monetize one (DuckDuckGo being a good recent example of that difficulty).

One of the big influences on U.S. market share is the existence of marketing deals. Money changes hands to be featured as the default search engine in browsers like Firefox, Safari, and others. This can move market share by a few percentage points overnight. If you can pay $1 billion for enough traffic to generate $3 billion in ad revenue from it, it’s a great deal. The Apple deals have been very public, but others have been privately arranged.

Interesting But Not A Business: The Story of the WbSrch Search Engine

I write this after having just shut down my almost-startup, the WbSrch search engine.

I started working on WbSrch for “fun” in the fall of 2013. AltaVista, my favorite search engine from “back in the day” had shut down that summer. Nostalgia combined with annoyance at how bad/annoying/intrusive/evil Google had become convinced me to try building my own version of AltaVista.

Well, a month of hacking later I had the core of something rudimentary but sort-of-functional. It was pretty terrible, but proved that I could get something built. I crawled a total of about 200,000 pages and had a bare skeleton of a search engine. I started by calling it the “anti-social search engine” because at the time, searching Google for almost anything would return so much social media drivel, clickbait garbage, and otherwise low-value spam-like content.

Getting to the first prototype was easier than I expected, so I continued to work on it, improving the crawler and search algorithms and growing the index. At around 2 million pages it outgrew the Linode VPS it was on and I set up hosting at a local colocation center using a $400 server I picked up on eBay (great deal – dual quad-core Xeons and 72GB of RAM – plenty to grow with).

Things progressed and I ended up announcing it to the public around the end of May 2015. It only had 5 million pages and the indexing algorithms were still pretty terrible, but it started getting some Human traffic.

And the bots discovered it. Every link analyzer SEO app in the world decided that WbSrch was a juicy crawl target. I considered blocking them since the SEO industry is complete garbage, but they were a decent source of Human traffic, and most of the traffic came from webmasters who would check to see whether their sites were indexed and run a few searches.

As it grew, maintenance became more time-consuming. I wanted to keep it from being too porn-heavy, to keep it from being full of Chinese and Russian sites, and to keep pages categorized by language so the German-language front-end would only return pages in German.

After a year and a half of running the site as a hobby project, I decided to put it away because the mission had been accomplished – an index of 10 million pages, and it worked about as well as any other mid-1990s search engine. For a few months the front page just pointed to Yandex.com (a Russian search engine – the third-best search engine after Google and Bing).

Well, one unanswered question kept nagging at me: “What if I could turn this into a real business?”

So I turned it back on, and started working on it pretty hard. I read a bunch of textbooks on information retrieval, text processing, statistical language processing, and a bunch of other search-related topics that I knew nothing about when I started the project.

Almost immediately the drives in the server failed.

So I replaced them and rebuilt the server. Didn’t lose much other than a week or two of crawl data because I had a backup, and source control. A few months later the drive controller failed, but there was no data loss – just a day of downtime and a lot of swearing.

As it grew and improved, I also started running some advertising, trying to build the audience and increase traffic. Visits were pretty cheap, but not very sticky. The bar is extremely high for getting someone to switch to a new search engine.

Still, there was some traffic, on the order of a consistent 5-digit number of pageviews per month. So I tried monetizing using a few different ad networks (around ten). The best one was able to earn about $3 per month. When you can’t use the ad networks that forbid linking to porn, gambling, or torrent sites, and you don’t want to run ads for porn, gambling, or torrents either, your income is low. Abysmal. $.05 CPM on the high end.

With the math for getting new visitors figured out (3-7 cents per click depending on the channel), and the math for monetizing those visitors figured out (about 5 cents for every 500 visitors), it was clear that it would cost $5 to earn $0.01. If those users were really sticky and would return over and over again, then maybe it would be worth the price. But they weren’t.
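
Here is that arithmetic as a small Python sketch, using only the figures quoted above (3-7 cents per click, about 5 cents per 500 visitors). The function name is made up for illustration, and it assumes every paid click becomes exactly one visitor.

```python
def cost_to_earn(target_revenue, cost_per_click, revenue_per_visitor):
    """Ad spend needed to earn target_revenue, assuming one visitor per paid click."""
    visitors_needed = target_revenue / revenue_per_visitor
    return visitors_needed * cost_per_click


# About 5 cents of ad revenue for every 500 visitors.
revenue_per_visitor = 0.05 / 500

# 3 to 7 cents per click, depending on the channel.
for cpc in (0.03, 0.05, 0.07):
    spend = cost_to_earn(0.01, cpc, revenue_per_visitor)
    print(f"at ${cpc:.2f} per click: spend ${spend:.2f} to earn $0.01")
```

At the middle of the click-price range this works out to the $5 per $0.01 quoted above. Repeat visits would only help if each visitor came back hundreds of times.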

I also ran a crowdfunding campaign to gauge interest/demand. I raised a little money for hardware upgrades, but more importantly I learned a little more about how much people just don’t care about having another search option. I did manage to get one donation from someone I didn’t already know, but only one.

At this point I had about 47 million pages indexed and the search engine had grown to 3 servers. It had crawled only a tiny fraction of the internet, but it was still possible to find what you were looking for much of the time. It’s surprising how well you can do with a small index if you focus mainly on the most popular sites.

But to take things to the next level of quality I would need to build a system able to handle at least a billion pages.

That’s where things get expensive. I had only spent about $8,000 on WbSrch so far. Did I want to spend another $50,000 to get to that next level where users might be a bit stickier and ad revenue might be better? (Ad revenue tends to be lower when you’re low-volume; when you have enough traffic that algorithms can optimize, it gets better.) Maybe it would only cost $2 to earn $0.01 and those users would return often enough that I could earn another $0.01.

And that’s where I decided to shut everything down. Math doesn’t lie.

Call it failure to validate. There is no search engine business to be had for me. Maybe someone else could do it. Like DuckDuckGo. Interestingly, they didn’t start with their own crawl. And they’ve partnered with Microsoft for advertising. So they’re essentially a privacy-focused variant of Bing with a different UI. That’s good for them, but the interesting part is developing your own proprietary technology, your own crawler and algorithms. Otherwise, one day Microsoft could decide that having an API is inconvenient and shut down your entire business.

At some point Apple will decide that having a search engine is important and build or buy one. Maybe they’re already building one if the rumors are to be believed.

Anyhow, it’s a little bit sad that it didn’t work, and a little bit sad that I spent all that time on it, but I did get smarter. And not just code. I learned a lot more about marketing and advertising in the process.

So now it’s on to the next thing.

Quora Answer: What is the minimum number of pages a modern general search engine would have to index to be useful?

I originally wrote this as an answer to a question on Quora.

“Useful” is a very subjective question. People who frequently ask deep and complex language-and-algorithm-specific software engineering questions will require different levels of depth than those who travel a lot and just want to find good prices on airfare and the top 5 restaurants and hotels in each city.

If you have /really/ good algorithms, you can build a search engine that is “good” for people who don’t require much depth with about 100 million pages. For people who require depth, you could probably be pretty useful at about 1 billion pages.

This depends HEAVILY on what you choose to include and exclude. Are these pages all in a single language? Or is this just 100 pages each from the top 1 million sites regardless of language and content?

Even though the web is phenomenally huge, much of it is duplication and/or computer-generated spam. There are millions of sites that are just scrapes/dumps of other sites (especially Wikipedia) and indexing 1000 copies of Wikipedia with different CSS isn’t going to get you very far.
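
One simple way to catch the “same content, different CSS” case is to fingerprint only the visible text of a page and skip anything you’ve already seen. The Python below is a minimal sketch of that idea using exact hashing of normalized text. It is an illustration, not a description of how any particular search engine does it, and the class and function names are made up. It only catches exact text copies; near-duplicates with small edits need fuzzier techniques such as shingling.

```python
import hashlib
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def content_fingerprint(html):
    """Hash of the page's visible text, lowercased with whitespace collapsed."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.parts).split()).lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


seen = set()
pages = [
    "<html><head><style>body{color:red}</style></head><body><p>Hello  World</p></body></html>",
    "<html><head><style>body{color:blue}</style></head><body><div>hello world</div></body></html>",
]
for page in pages:
    fingerprint = content_fingerprint(page)
    print("duplicate" if fingerprint in seen else "new")
    seen.add(fingerprint)
```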

Think about the sites you visit regularly, and about those that regularly turn up in searches. How many of those useful sites are below the top 100,000? Does it matter if there are 100 million+ domains when 99.9% of your needs are covered by the top 0.1%? With a smaller index, choosing what you leave out is pretty important.

There’s a site I like to play with when trying to find obscure results. It’s fun to experiment with, and it helps you understand how much the size and quality of your index affects your results: Million Short

Setting Up a Redash Dashboard

This was originally posted on wbsrch.com. It is reproduced here to preserve history.

The more WbSrch evolves, the more it becomes necessary to keep track of a bunch of metrics.

Until now we’ve been using a mix of simple report pages and raw SQL queries. It has worked well enough, but not having a clean way to track things in a single place is a nuisance.

That’s why I was happy to discover the redash.io open source project. It’s a query tool meant to be used for setting up business intelligence dashboards and it works with a wide range of databases.

No stranger to code, I tried to check out the GitHub source and get it running on my local machine. It didn’t quite work out. They have a bootstrap script, and it had some trouble with my particular system setup (it fell over when it came to configuring local database users).

But they also have EC2 AMI images you can launch to get running in AWS. I fired up an Amazon micro instance on the free tier and had the app running in seconds. It only took some minor configuration to get set up with my SSL certificate, and I was ready to go.

Adding a Database Connection to Redash

Connecting my three PostgreSQL databases was easy and the clean interface made it easy to find the query editor. After running a few queries I had the feel for how things worked well enough to save them. It also lets you set a refresh interval on your queries so you can have data refresh daily, hourly, or whatever. Results are cached so you’re not taxing your database gathering totals every page load.

Redash Query Editor

After you have a few queries, you can start adding them to a dashboard as panels. You just select the query name, the visualization type (you get table by default, but can add graphs and charts in the query builder), and the widget size.

This is a dashboard that I built to keep track of the search traffic and index state for the Somali-language version of WbSrch:

Redash Dashboard Example

I created dashboards for each supported language plus an overall meta-dashboard. It was fairly quick, taking about a day to set up 35 dashboards and about 200 queries.

Luckily the interface is pretty good, because once you have the software set up, that’s where the documentation ends. You can figure out most things with experimentation (trial-and-error), but it would be very helpful to have a few getting started tutorials, or at the least an explanation of how the various visualizations work.

A micro EC2 instance may stumble if you have some large queries (selecting an entire table is a bad idea, don’t do it), or a lot of things refreshing, but it kept up pretty well.

An Experiment with Project Wonderful

This was originally posted on wbsrch.com. It is reproduced here to preserve history.

I’m always looking for new and efficient ways to let people know about WbSrch. That’s why I decided to try advertising with Project Wonderful.

Project Wonderful was built as a banner ad network for web comics.

That doesn’t mean you can only advertise web comics or that advertising can only be placed on web comic sites, but that’s its core demographic.

As a trial, I ran an ad for WbSrch on a few sites that seemed like they’d have people who would be interested in trying out a new search engine. That means other search engines, SEO sites, and literature sites. I also wanted to find out whether webcomic readers were a good target audience.

I deposited $100, and after spending about $70, I think I have a pretty good idea of what works and what doesn’t.

If you want really fine-grained control over your campaigns and ad spending, this is the perfect network for you. You know exactly what you’re going to spend per day on a site, and you can bid on traffic on a per-region basis. Their regions are US, Canada, Europe, and Everywhere Else.

The search functionality is amazing. You can search for gaming sites that have traffic that is at least 50% from Germany and has between 100 and 10000 page views per day, for example.

As a publisher, you can set per-region bid minimums and can auto-approve bids, or require manual approval. This means that you don’t have to worry about running ads for things that you’d be opposed to, so no bacon ads on a veganism site.

Results have been mixed, and I’ve learned more about the types of people who are interested in trying WbSrch.

Some takeaways:

  • Webcomic sites have a high number of page views, but the number of unique users tends to be a fraction of that. The same goes for SEO tools.
  • Blogs tend to have more unique users and fewer page views.
  • Literature sites are somewhere in-between.

Here are my slightly-obfuscated results:

Site              Pageviews  Unique Views  Clicks  Spend ($)  CPM ($)  CPC ($)
A Major Webcomic     581953         11126      35      13.34     0.02     0.38
An SEO Site          233489         17881     141      50.54     0.22     0.36
A Poetry Site         60780          4876      34       5.64     0.09     0.17
A Dutch Site          15584           305       4       0.73     0.05     0.18
A Hungarian Site       3485           730       1       0.35     0.10     0.35
A Search Engine        2711           963      40       0.62     0.23     0.02
A Swedish Site         2315           621       0       0.39     0.17      INF
A Movie Blog           1424           477       0       0.25     0.17      INF
A Knowledge Blog       1285           967       1       0.42     0.33     0.42
A Web Directory         296            83       0       0.05     0.17      INF
A Science Blog          231           124       1       0.29     1.24     0.29
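
For anyone who wants to run the same numbers on their own campaigns, the CPM and CPC columns are just spend divided by pageviews (per thousand) and by clicks. A minimal Python sketch using three rows from the table; the function names are mine, and a campaign with zero clicks is reported as infinite cost per click, matching the INF entries:

```python
import math


def cpm(spend, pageviews):
    """Cost per thousand pageviews."""
    return spend / pageviews * 1000


def cpc(spend, clicks):
    """Cost per click; infinite when a campaign produced no clicks at all."""
    return spend / clicks if clicks else math.inf


# (site, pageviews, clicks, spend in dollars), copied from the table above
rows = [
    ("A Major Webcomic", 581953, 35, 13.34),
    ("An SEO Site", 233489, 141, 50.54),
    ("A Swedish Site", 2315, 0, 0.39),
]
for site, pageviews, clicks, spend in rows:
    print(f"{site}: CPM {cpm(spend, pageviews):.2f}, CPC {cpc(spend, clicks):.2f}")
```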

The efficiency varies by site, but some are unbeatable deals for targeted traffic. Others are pricey, but bring in just the type of people who will spend some time searching for themselves and the things they control. Hopefully we’ll be good enough for them to come back again.

There are some sites that I’ll run ads on as long as they exist even though the traffic is low. It’s easy to convince people trying a new search engine to try another new search engine.

I also suspect that my Hungarian and Swedish translations aren’t very good. I know basic Swedish, but the Hungarian is robot-translated.

One of the limitations of Project Wonderful is that if you have a large budget, you may run out of places to advertise efficiently, and the sites that are efficient may not get enough traffic to satisfy your hunger (2-cent clicks from your site? I’ll buy at least 1000 per day!). I could easily see struggling to spend a $1000/day budget effectively. If you’re prepared to work on a smaller scale, there is probably no better place to test-run ads because their data and reporting are good and you can learn a lot from your experiments. They also have enough fine-grained control that you can iterate and learn quickly.

$70 is hardly enough to get the full measure of an ad network, but I think I was able to get some useful data out of this experiment. Try Project Wonderful, you may just find it wonderful for your project, especially if your project plays well to webcomic audiences.

Analysis of Search Engine Crowdfunding Campaigns on IndieGoGo

This was originally posted on wbsrch.com. It is reproduced here to preserve history.

In the process of researching crowdfunding campaigns, I searched IndieGoGo for search engine pitches. I found 22 attempts to fund “actual search engines”.

Here is a list (with links to the IndieGoGo campaign):

TheNet101
Thumbar
Jixty.com
Xense
Iyiyes
Aspinosa
Rexyo
Asim Shah (unnamed project)
iSearchonline.tk
Slikk
ISearch2Help
MeSeek
Vexed Inc
Personalized Curated Mobile Search (no official name, so calling it PCMS)
Qrate
reSEARCH
Fedge No (unnamed project)
Crackerror
Aglepie
QuickVu
Chronologically
Nintag

Some of these have launched campaigns more than once, but I’m only counting them once. I’m also not counting niche and vertical search, only general search engines.

Aggregate Statistics

Project            Asked  Pledged  Backers  Date     Comments    FB  Twitter  G+  Pitch Quality
TheNet101            500        0        0  2014-03         1     0        0   0  C+
Thumbar           350000     1050        3  2013-12         2     2      183   0  B
Jixty.com          25000       25        1  2012-10         0     4        0   0  C-
Xense            2000000        0        0  Ongoing         2     0        0   0  D+
Iyiyes            250000        0        0  2012-09         0     0        0   0  C-
Aspinosa           35000        0        0  2012-03         0     0        0   0  C+
Rexyo             250000        0        0  2012-09         0     0        0   0  D*
Asim Shah           1500       40        8  2012-08         6    28        1   8  C-
iSearchonline       5000        0        0  2012-12         0     0        0   0  C+
Slikk             100000       40        2  2013-12         2    26      298   2  B
ISearch2Help         550        0        0  2013-05         0     0        0   0  D-
MeSeek             75000       77        5  2014-03         6  1000      517   0  A-*
Vexed Inc            500        0        0  2013-08         3     4        0   4  C-
PCMS              200000        0        0  2013-11         2   672        0   0  B-
Qrate               8500      105        5  2012-12         2    16        1   0  B+
reSEARCH          600000        0        0  2012-12         1     2        7   0  B
Fedge No           20000        0        0  2012-05         1     0        1   0  C-
Crackerror          5000        1        1  2014-07         4    13        1   1  D+
Aglepie             5000        0        0  2011-01         0     0        0   0  C-
QuickVu           867943      100        1  2013-11         4     2        0   0  D
Chronologically     3000        0        0  2013-09         2     0        0   0  D+
Nintag            100000        0        0  2013-03         0     0        0   0  C
* Video missing (deleted from YouTube)
** Some of these are denominated in GBP. I didn’t really pay attention to which, but it doesn’t change the numbers meaningfully, since they’re all nearly zero.

Statistics Summary

No campaigns were fully funded.

None with zero Facebook shares had any pledges.

Only four were shared on Google+ and of those, only 3 had pledges. As always, G+ is not relevant unless you’re a Google employee.

Only four had more than one person listed as being on the team.

14 of the 22 had no backers at all.

9 of the 22 asked for a six-figure or higher sum.

14 projects were not shared on Twitter. Of those, only 2 had pledges.

The most pledged to any one campaign was $1050. Of that, $1025 came from someone related to (same last name as) the pitcher.

$65 average per campaign ($1,438 pledged across 22 campaigns), or $19 not counting the pledge by a relative.
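
These counts can be re-derived from the table. The short Python sketch below hard-codes the relevant columns (asked, pledged, backers, Facebook, Twitter, G+) exactly as listed above and recomputes the headline numbers. The variable names are mine, and amounts are treated as plain numbers with no currency conversion.

```python
# (project, asked, pledged, backers, facebook, twitter, gplus)
campaigns = [
    ("TheNet101", 500, 0, 0, 0, 0, 0),
    ("Thumbar", 350000, 1050, 3, 2, 183, 0),
    ("Jixty.com", 25000, 25, 1, 4, 0, 0),
    ("Xense", 2000000, 0, 0, 0, 0, 0),
    ("Iyiyes", 250000, 0, 0, 0, 0, 0),
    ("Aspinosa", 35000, 0, 0, 0, 0, 0),
    ("Rexyo", 250000, 0, 0, 0, 0, 0),
    ("Asim Shah", 1500, 40, 8, 28, 1, 8),
    ("iSearchonline", 5000, 0, 0, 0, 0, 0),
    ("Slikk", 100000, 40, 2, 26, 298, 2),
    ("ISearch2Help", 550, 0, 0, 0, 0, 0),
    ("MeSeek", 75000, 77, 5, 1000, 517, 0),
    ("Vexed Inc", 500, 0, 0, 4, 0, 4),
    ("PCMS", 200000, 0, 0, 672, 0, 0),
    ("Qrate", 8500, 105, 5, 16, 1, 0),
    ("reSEARCH", 600000, 0, 0, 2, 7, 0),
    ("Fedge No", 20000, 0, 0, 0, 1, 0),
    ("Crackerror", 5000, 1, 1, 13, 1, 1),
    ("Aglepie", 5000, 0, 0, 0, 0, 0),
    ("QuickVu", 867943, 100, 1, 2, 0, 0),
    ("Chronologically", 3000, 0, 0, 0, 0, 0),
    ("Nintag", 100000, 0, 0, 0, 0, 0),
]

no_backers = sum(1 for c in campaigns if c[3] == 0)
six_figure_asks = sum(1 for c in campaigns if c[1] >= 100_000)
not_on_twitter = [c for c in campaigns if c[5] == 0]
on_gplus = [c for c in campaigns if c[6] > 0]
total_pledged = sum(c[2] for c in campaigns)

print("no backers:", no_backers)
print("six-figure asks:", six_figure_asks)
print("not on Twitter:", len(not_on_twitter),
      "with pledges:", sum(1 for c in not_on_twitter if c[2] > 0))
print("on G+:", len(on_gplus),
      "with pledges:", sum(1 for c in on_gplus if c[2] > 0))
print("average pledged per campaign:", round(total_pledged / len(campaigns), 2))
```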

I tried to find out what became of these pitches and whether they continued after the failed campaign. Some point to domain parking pages, some to sites not related to search at all, and at least one points to a malware site. The best part about the malware site was that a popup said that my “Ubuntu needs updating” and the update showed up as being for “Ubuntu by Microsoft, Inc.” Hilarious. Can’t find it again or I’d share a screenshot.

Postmortem

Here are the ones I was able to find anything about:

Nintag (gone as of 2015-09) is a Nigerian search engine. If you search for the words Yoruba, Igbo, or Lagos you find real results, most of which are based in Nigeria. If you search for the word “cheese” you get zero results. I guess they don’t have cheese in Nigeria, and that makes me a bit sad for them. Even though the results are somewhat questionable to a non-Nigerian, they appear to be at least partially accomplishing their mission. That’s good because Nigeria should have its own search engine. They certainly have enough people to serve. I wish them well.

Jixty (gone as of 2015-09) exists. It’s not obvious at first glance, but based on the search results for “pants” being essentially identical to Google, it looks like they’re a front end to Google (a Google Custom Search Engine).

Iyiyes exists. It, too, appears to be a front-end to a Google custom search. I don’t know why sites bother if a big bag of nothing is all they’re going to bring to the table. Google’s already doing Google.

MeSeek.com is also interesting. They had the strongest social effort and scored the highest in my opinion of the different pitches, though nobody had an “A” rating (but they could have had an amazing video presentation). MeSeek appears to let you rate search results, but the results look like they came from Google (I can’t be sure). They have news, horoscopes, and weather, a publisher program, stock quotes, an advertising platform that only appears to show ads for MeSeek, and even an article directory, too. You can even change your background on their site. It’s also in the Alexa top million (around 200,000 as I write this). The campaign was started by a fella named Charles Forell, and I’d like to have a chat with him to see what he’s up to.

TheNet101 is also an interesting result. They’re a meta-search engine that includes results from Google, Bing, Blippex, Wikipedia, Blekko, Yelp, Archive.org, and Faroo.

Wait, what’s Blippex?

Blippex.org is a search engine that ranks results based on user engagement determined by using a browser plugin that measures “dwell time”. It’s a very Alexa-like system, one I understand well since I built one to gather data for Alytik, but it adds a bit more.

However, it appears that Blippex is the walking dead. Despite search being online and showing a URL count of 29.3 million, there haven’t been any blog posts since October 2013 and no Github activity since October 2013. This seems like an interesting idea that has been shelved, but not shut down yet. Neat, though, that you can change the search results by adjusting how much the dwell time and age metrics affect ranking.

Thank you, TheNet101.com, for introducing me to a new and interesting search project, even if it may not still be active. Sorry your crowdfunding campaign didn’t work out.

Conclusions

In most of these pitches, the pitcher didn’t claim any particular domain knowledge required to build a search engine, let alone claim software development skill. Combined with a scarcity of working demos, it’s unsurprising that none of these were funded.

Social sharing helps, and no sharing at all is pretty much a guarantee of failure.

Saying you are going to beat Google does not help. I wonder why.

IndieGoGo also doesn’t seem to be the best choice for funding a search engine.

Update September 2015: noted dead links

AdSense Alternatives for Startups and Small Websites

This was originally posted on wbsrch.com. It is reproduced here to preserve history.

In starting WbSrch, a search competitor to Google, I knew that at some point Google would find a way to “invite us to leave” AdSense. The Terms of Service make it clear that it is incompatible with a search engine (can’t have ads on pages that link to adult content, gambling, etc.)

That day came a little over a month ago when I received a message that ads were no longer running on the site because Google discovered a violation of their TOS in one of the result pages for a particular adult-oriented search term.

Sure, I could remove the offending link from the search results page (which I did because it also didn’t fit with the WbSrch inclusion policy), but that sort of thing would be sure to happen again. Around one sixth of the URLs on the web are porn, so it’s virtually impossible to exclude it all. Be very skeptical of anyone who claims they’re able to block all porn.

The Advertising Options

From my research, these are the notable companies that do online advertising:

Bidvertiser
Qadabra
Affinity
Infolinks
Advertising.com
Adversal
PulsePoint
Conversant (formerly ValueClick)
Clicksor
AdBlade
AdSide
Vibrant Media
Yahoo/Bing Ads (formerly Media.net)
Link Worth
Tribal Fusion
Kontera

Contacting the Advertisers

I looked into all of them, eliminating those that require massive traffic volume to get started or have a reputation for spreading malware.

These are the ones I tried to contact (at the end of April) asking whether their service would be compatible with WbSrch:

Qadabra
Chitika
Yahoo/Bing (formerly Media.net)
Conversant Media
Kontera
Bidvertiser
Affinity.com
Infolinks

I asked the same question of every site:

Hello,

I run a small but growing search engine at http://wbsrch.com.

I would like to know whether your service would be appropriate for use as the advertising provider for this search engine.

WbSrch.com indexes and links to most of the internet. We try to exclude adult and other “icky” sites from the index, but that’s not possible to do with an automated crawler. This means that at any given time there will be links to things we don’t want to index per our policy (http://wbsrch.com/policy/), but that will eventually be removed. None of this content is hosted on our site, but it is linked to depending on the search phrase used.

The search engine has indexes in 25 different languages, though most traffic is for the English-language index.

Given the nature of search engines, would WbSrch.com be compatible with your advertising platform?

What follows are the responses to this message and the action I took based on the responses.

Outright Failures

Bidvertiser had a broken captcha on their contact form, so I couldn’t contact them. Their policy says that they don’t allow linking to some content types, so they probably would have said no.

Bing did not have a contact form. They might now. I think they are still in alpha/beta/whatever. Even so, they’re still a competitor, so not something wise to use long-term.

Non-Responses

Chitika never responded to my inquiry.

Conversant Media never responded to my inquiry.

Affinity.com never responded to my inquiry.

Responses

Qadabra responded the fastest, saying that they were totally compatible with search engines and that they already had some search engine customers. The message had a friendly tone.

Kontera was the second response. They said that they are not compatible with search engines, but they were polite about it.

Infolinks replied three days later (on a Sunday) with a fairly rude message that said “our quality assurance team found that your site does not meet our publisher criteria” and “We at Infolinks have the responsibility to keep our advertising environment up to certain standards to ensure the success of Infolinks for our publishers, advertisers and those viewing our ads.” OK, that’s fine if you don’t want to work with a new site, but don’t be rude about it. At least now I know they’re too special and important to ever do business with.

The Winner

Based on these responses I went with Qadabra. They also said that they work with traffic in all languages. Great!

Setup was easy, and ads started working immediately. I had a few glitches with some ads behaving strangely, but it was a minor thing. Every time I contacted them they were very helpful and friendly.

You don’t really get control over the types of ads that are shown. Most of what I saw were ads for video games and the occasional ad for Russian brides.

I did not enable any of their rich media ads, just banners, so I have no experience with those. I know they earn more, but I’m generally opposed to popups, popovers, flyouts, videos, and things that make noise. If I visit a site that uses them I’m less likely to return.

Qadabra revenue was significantly less than AdSense, earning about one sixth as much per thousand impressions. Their system documentation says that they optimize it over time, so if I gave them a longer trial period, income would probably go up.

Now that WbSrch has switched to SSL-only (inspired by Reset the Net), I can’t use Qadabra. They don’t have SSL support, so even if their ads were enabled, they wouldn’t load. If they add that I’ll consider using them again, if not for WbSrch then for other sites.

I like the people at Qadabra, and I’m happy with their tech support, but this experiment has ended after only one month, and there don’t appear to be any other reasonable alternatives.

Qadabra is relatively new, created in 2011, so they are still polishing their game. If you want a reasonable AdSense alternative for lower-traffic sites and don’t require SSL, I recommend them.

The long-term plan always was to build an ad platform internally to use with WbSrch. Not finding a platform that is a perfect fit for us is just another motivating factor.

For now, I’m just going to focus on improving the search engine so we’re in a better position to monetize it later on. Since traffic has been increasing by around 100% per month for the last few months, it shouldn’t be too long.

Why I Decided to Build a Search Engine (And You Should Too)

This was originally posted on wbsrch.com. It is reproduced here to preserve history.

I’ve always wanted to build something big, but never had a burning desire to create any one specific thing. Instead I built lots of little things – small desktop apps, weekend websites, etc.

It wasn’t until AltaVista shut down that I realized that the world needs another search engine. Not just one, but dozens.

Almost all of the search greats have shut down, been bought and shelved, or replaced their engines with Google or Bing. The only real competition we have in English-language search is between Google and Bing, and Bing has been accused of copying Google’s results.

Bear with me for a bit. What I’m about to say will probably sound a bit tinfoil-hat. It’s all just speculation, but there are so many billions of dollars involved that at least some of this sounds plausible.

A search engine is the gateway to the world’s information. If you control
the gateway, you control the information. Knowledge is power, and
gatekeeping is big money. At this point, the leading gatekeeper has too
much money and power and the effects are causing real harm to businesses
and the world economy.

Each time Google changes its algorithm, hundreds or even thousands of businesses are damaged or destroyed. That’s the nature of godlike power – even if you’re just trying to use a little bit, there are side effects.

“Combatting spam” is the main reason cited for their algorithm changes. If you look
at it, much of what Google classifies as spam could just as easily be
classified as “things that don’t make us enough money”. They’re a public
company. They’re required to maximize revenue. To do otherwise would
expose them to shareholder lawsuits.

It’s easy to see the incentive for demoting sites that have ads that earn you 4 cents per
click in favor of sites that earn you 12 cents per click, or of doing
things that wipe out competing advertising companies. There has been an
ongoing war against “selling links”. That’s called advertising. What is
Google AdSense? It’s an ad service that sells links. Convenient, though,
that competing ad service text-link-ads.com was a casualty of this war on paid links, isn’t it?

Other things also seem suspicious, like the war on guest blogging. Is this Google being jealous of people finding other sites without going through their search engine?

Now we have another wave of chaos, which some people have referred to as “breaking the internet”.
Google added a “disavow links” tool and has many webmasters afraid to
link or be linked to for fear of an “unnatural link penalty”. I get one
or two emails a day asking to remove a link to a website because the
webmaster is afraid Google might not approve. Google has said (or
really, implied) that you don’t need to have links removed, but there’s
one benefit to less linking on the web: You’re less likely to find a
site without going through Google if there are no other links to it.

While this is going on, Google is also adding more direct answers to
searches. Things that you once had to click a link to find now show
up directly as answers so you never have to leave Google’s site. A
question like “How tall is the Eiffel Tower?” could have led you to an
exploration of lots of wonderful information about the Eiffel Tower. Now
it just gives you a fact, and you can go on to the next search. This
means Google is driving less traffic to websites and everyone who isn’t
Google is suffering, slowly but surely getting less traffic.

All of this sounds very much like anti-competitive practices to me, the kind
that you get hit with a regulatory hammer for. Even if only some of it
is true or intentional, it’s a dangerous abuse of power that needs to be
investigated. It’s fine if you want to be the biggest and best search
engine, but don’t be evil.

Why do they get away with it? Because webmasters and users let them. Websites are rarely built for Humans
anymore. Instead, they’re written for the Googlebot with Humans as an
afterthought. That’s why Demand Media was so successful – they designed
and created their content for Google search.

The world needs more search engines. That much power should not be concentrated in one
company’s hands. Next time you search, consider using something other
than the market leader. Bing, Blekko, DuckDuckGo, WbSrch, and Gigablast
are all options. I wish there were even more. If you have the skills,
there has never been a more important time to start building. It’ll take a
while to build something good, but it’s needed. Just don’t try to do it
the way Cuil did – trying to index more pages than Google and running
out of money before they figured out how to be useful. Find a way to be
useful to people, and focus on that.