Category Archives: Search Engines

Search engine related posts.

Google Is Not Very Smart (or Incredibly Smart)

To the layperson, Google might seem like the smartest company in the world.

Once you understand technology, it’s obvious that they’re not very smart. Or incredible geniuses.

If you’ve spent more than a few years building websites, you know very well the kinds of robot-stupid mistakes that Google can make. An example from this very blog: I spent quite a few years working on multi-user dungeons (MUDs for short), and there are a lot of posts on the subject. Google’s ad system thinks that one of the main topics of this blog is digging holes in wet soil, so it’s not uncommon to see an ad for such machines on this site.

Google is PHENOMENALLY good at miscategorizing things. If there are two meanings to a phrase or topic, and 80% of the discussions in the world focus on the more popular meaning and 20% are about the less-popular meaning, the less-popular topic will be buried in the noise, because the systems will assume you mean “computer keyboards” when you meant “music keyboards”.

Combine this with what I call “computational laziness”, and if your site happens to be categorized in a certain manner, you’re basically stuck there. Google doesn’t put much effort into revising its categorization models or re-analyzing sites based on new information.

What does that mean for web developers? Well, for a newer site, if Google puts you in a place you don’t want to be, you’re probably better off starting over from a different angle.

For people searching the web, it’s a little more complicated. Google is NOT designed to be the best search engine on the planet. Now that the search engine wars are over, the design has changed to focus on revenue maximization. What does that mean?

It means that Google as a search engine is designed to show you just-barely-adequate results that kind-of-but-not-really satisfy the question you were asking. It’s designed to be mostly-accurate but slightly frustrating, so that you are tempted to click on ads that seem likely to answer your question.

In a perfect world, a search engine would give you exactly the information you were looking for, as quickly and as accurately as possible. In THIS world, search engines are designed to give you the plausibly-relevant information that will benefit the search engine the most.

In a world where the profit motive rules over everything, product quality must necessarily suffer for the purpose of maximizing margins.

Search is used by most of the people in the world pretty much daily. At what point does it make sense for it to become a public utility? It’s an interesting question worth pondering, but the mathematics and economics are far too complicated for a sound-bite answer.

The number of websites in existence has been relatively flat since 2017, not growing any faster than the world’s population, but processing power has effectively tripled. If Moore’s Law were still in effect, processing power should have grown 16x, but that ship has sailed. What this means is that, even though sites today have many more pages and much more data on average than six years ago, the ability to organize that information has grown faster than the actual quantity of information.
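For anyone who wants to check the arithmetic, here is a quick sketch. The 16x figure works out if you assume a doubling every 18 months over six years; that doubling period is an assumption on my part, since Moore’s Law is also commonly quoted as a doubling every two years. The function names are made up for this illustration.

```python
import math


def growth_factor(years: float, doubling_period_years: float) -> float:
    """Growth multiple after `years` if capacity doubles every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)


def implied_doubling_period(years: float, observed_factor: float) -> float:
    """Doubling period (in years) that would produce `observed_factor` growth over `years`."""
    return years * math.log(2) / math.log(observed_factor)


YEARS = 6  # the six-year window discussed in the post

# Assumed 18-month doubling period: this is what yields the 16x figure.
print(growth_factor(YEARS, 1.5))

# The more conservative two-year doubling period, for comparison.
print(growth_factor(YEARS, 2.0))

# Doubling period implied by an observed tripling over the same window.
print(round(implied_doubling_period(YEARS, 3.0), 2))
```

Under these assumptions, a tripling over six years corresponds to a doubling period of a little under four years, which is well short of either version of the old trend.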

Google is not special. They’re just another business. And, with their original core patents expired or expiring soon, there’s a lot of room to build something of higher quality at lower cost. Given the level of mind-share that they have (as any good look at Bing’s market share will confirm), is it worth it to build a competitor?

It depends.

MusicSrch Improvements

It’s been a long time since I’ve worked on MusicSrch.

For the past few years, I really didn’t have any time for side projects, and a few searches were very broken on the site due to changes in the various third-party websites. It needed some serious rehab.

I spent a few days fixing things and improving various features, and now it’s a lot better.

Give it a try at https://musicsrch.com

I plan to spend a lot more time working on it this year, and there are more sites I want to add to search.

The WbSrch Experiment

Off-and-on over the last 8 years I’ve worked on an independent search engine called WbSrch. It made it as far as being as good as the late-1990s search engines, which is great, because the original goal was to build something much like Altavista, the first search engine I used as my “main” one.

At one point I tried to turn it into a real business. That went poorly and I shut it down. Then I brought it back to work on as a hobby/fun project. That was interesting and fun for a while, but it’s run its course. I’ve done all the things I set out to do and learned all the things I wanted to learn. I’ve had my fun, so there’s no need to tinker with web search anymore. It did keep me busy toward the end of the pandemic as I was starting to go stir crazy, and I’m grateful for that.

If you’d like to see what it looked like when I finished with it, take a look at this capture on archive.org.

If you’d like to use a pretty good alternative search engine, I suggest Mojeek or Yandex. The MusicSrch music search engine is still going, too.

And if you’d like to get a copy of some of the data I collected, there are a few inexpensive data downloads available.

Now that you’re here, feel free to explore the blog a bit. I have a bunch of websites and music projects I’ve created, and you might find some of them interesting (under the “My Stuff” section of the sidebar).

WbSrch Search Engine Releases New Data Offerings

Reprint of a press release originally published on https://www.prweb.com/releases/wbsrch_search_engine_releases_new_data_offerings/prweb17779070.htm.

The WbSrch search engine has released new data offerings.

Two new domain lists are available for purchase. The first is a list of all of the internet domains with pages in the WbSrch search engine index.

The second is a list of all adult-content domains that have been excluded from the search engine.

In addition to these, WbSrch also sells a list of the top 1 million most-linked-to domains based on its link index.

This data will be a useful tool for data scientists, SEO specialists, and for entrepreneurs who want to build domain-information-based product offerings. Founder Jason Champion had this to say about the project launch:

“When I started this search engine, there was a terrible shortage of good data sources for getting started with organizing the web. With this release we hope to make it easier to start web-data-driven projects.”

These new files are available in the WbSrch data shop at https://wbsrch.onfastspring.com/ priced at $39.99 and $4.99, respectively.

Smarthost.net: Ultimately Just a Waste of Time

Smarthost.net has some nice “storage server” deals with some very configurable options. If you want a VPS with 1TB of disk space, their offerings are pretty attractive.

For my search engine, I need hosting with a good chunk of disk space in order to hold the index. It doesn’t need to be fast storage, and it doesn’t need a lot of CPU and RAM — retrieval of index entries is fast and efficient.

This made Smarthost look pretty ideal, so I signed up and got a web server going. It worked well for about a month. So I decided to set up a second one to do some light web crawling (you don’t get enough cores on their plans to do anything heavy).

After about a week, both machines were unreachable. I contacted support and found out that the drive array had failed on the machine hosting both of them. Support tried to recover it, but ultimately it was a total failure. So the little bit of web crawling data and about 2 weeks of search engine log data (everything since the last time I pulled it) were destroyed.

Annoying, but hardware failures happen.

A week later I found my crawler machine suspended because of a false positive on Spamhaus. Apparently their system is so badly written that just visiting a domain with a web crawler can get you on a “bad list” for supposedly hosting a virus/malware. Many hosting providers, Smarthost included, will auto-suspend service for any box that gets on that list.

I got that machine removed from Spamhaus the same day and had it reactivated a few hours later to download the 200k pages or so that had been crawled, but support was pretty snarky about it. Clearly Smarthost is not a service that is compatible with what I do.

I ended up moving the web server to 1tbvps, which is slightly more expensive, but has more CPU cores and RAM, which is always nice. I moved the crawler to Digital Ocean, which is a very data-science-friendly service. We’ll see whether I have issues with those, but I suspect they will work better for my purposes.

Ultimately, my two-month experience with Smarthost ended up being a complete waste of time.

Media.net Didn’t Work For Me

I received a message today that my site was disapproved, so Media.net is the first to drop out of my newest ad network comparison experiment.

Looking closer at their policies I see this (bold added by me):

“Our program has been designed for sites with premium content. Sites that promote, contain, or link directly to the following types of content shall not be approved.

  • Adult, Pornographic or any illegal content
  • Tobacco, alcohol, ammunition, hazardous substances, illegal drugs, gore, violence, gambling and racism content
  • Pages containing profanity or content that and/or discriminates or is offensive to any section of people
  • Hate, violence, racial intolerance, or advocate against any individual, group, or organization
  • Sale of prescription drugs
  • Sale of counterfeit products, imitations of designer or other goods, stolen items or any products that infringe intellectual property rights of other parties
  • Contain programs which promote invalid click activity by paying users to clicking on ads, browse websites, read email etc.
  • Websites that contain forums, discussion boards, chat rooms, or any content area that is open to public updates without adequate moderation
  • Sites with content that has been generated using computer programs and hence may not be comprehendible.
  • Bulk of the content is user-generated
  • Sites with fake news
  • Any other content that we believe in our sole discretion to be illegal”

So their network is ABSOLUTELY incompatible with a search engine, since a search engine links to everything on that list.

Still in the running are Adsaro, Adsterra, Bidvertiser, Galaksion, and RevenueHits.

WbSrch Online Again

I found a way to get WbSrch online inexpensively, through a combination of code optimizations and an inexpensive high-disk-space hosting provider. It doesn’t need fast SSD storage to serve the index data, so it works just fine on a mechanical hard drive, and it’s easier to get a lot of space inexpensively with one of those. Thanks to a bunch of memory and query optimizations, it’s zippier now on an inexpensive VPS than it was on a 12-core server with 192GB of memory and 8 SSDs. For now I’m running the crawler and indexer from home and pushing index updates to the server as they’re done.

I’ve been using it as my main search engine even though the indexes are a bit out of date, and the results have been better than I expected. It has definitely improved over the years.

Try it out:

https://wbsrch.com

PageRank Lives: OpenPageRank by Domcop

In the early days of Google, PageRank was a very important piece of information about a website. It let you know the general authority level of a site and how well it would tend to rank against similar content on another site. The PageRank toolbar, released in 2000, became an important tool in the SEO world.

Over time, Google de-emphasized PageRank, partly because people were gaming the system and partly because they switched to emphasizing other factors when ranking a site. They eventually stopped updating PageRank data and in 2016, they finally shut off the toolbar.

There have been other metrics created, such as Domain Authority by Moz, but nothing has quite been a proper replacement.

Now that the PageRank patent has expired, companies are free to implement their own versions. Domcop has done just that, using the Common Crawl data to calculate PageRank for the top 10 million domains on the web.

You can use it here: https://www.domcop.com/openpagerank/

As of the time of this post, Xangis.com has an OpenPageRank of 3.40.
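For readers who haven’t seen how a PageRank-style score is computed, here is a minimal sketch of the textbook power-iteration approach on a tiny, made-up domain graph. This is not Domcop’s code or Google’s production algorithm. The damping factor of 0.85, the iteration count, and the example domains are all assumptions for illustration, and the output is a raw probability-style score, not the scaled number shown by OpenPageRank.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    `links` maps each domain to the list of domains it links to.
    Returns a dict of domain -> score, with scores summing to 1.
    """
    # Collect every domain that appears as a source or a target.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    # Start with rank spread evenly across all domains.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every domain gets a small baseline share (the "random jump").
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this domain's rank evenly among its outbound links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A domain with no outbound links spreads its rank to everyone.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three small sites all link to one hub.
toy_links = {
    "a.example": ["hub.example"],
    "b.example": ["hub.example"],
    "c.example": ["hub.example", "a.example"],
    "hub.example": ["a.example"],
}

ranks = pagerank(toy_links)
for domain, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{domain}: {score:.4f}")
```

The domains that nobody links to end up with only the baseline share, while the hub and the site it links to collect most of the score. That is the basic intuition behind using inbound links as a measure of authority.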

Top-Level Domain Popularity

In a crawl of just over 32 million pages, this is the number of domains that I discovered for each top-level domain (TLD). “Known Domains” is the number of domains with that extension that were found in links, while “Crawled Domains” is the number of domains from which pages were actually retrieved.
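If you want to work with the table below programmatically, each row is just an extension followed by two integers separated by whitespace. Here is a small sketch that parses a few rows copied from the table and computes the share of known domains that were crawled. The names `TldRow`, `parse_row`, and `crawl_ratio` are made up for this example.

```python
from typing import NamedTuple, Optional


class TldRow(NamedTuple):
    extension: str
    known: int    # domains with this extension that were seen in links
    crawled: int  # domains with this extension that pages were retrieved from


def parse_row(line: str) -> TldRow:
    """Parse one 'extension known crawled' row from the table."""
    extension, known, crawled = line.split()
    return TldRow(extension, int(known), int(crawled))


def crawl_ratio(row: TldRow) -> Optional[float]:
    """Fraction of known domains that were crawled, or None if none are known."""
    if row.known == 0:
        return None
    return row.crawled / row.known


# A few rows copied from the table for illustration.
SAMPLE = """\
.academy 9638 6340
.ads 29 0
.com 13873332 7043596
.abbvie 0 0
"""

rows = [parse_row(line) for line in SAMPLE.splitlines() if line.strip()]
for row in rows:
    ratio = crawl_ratio(row)
    label = "n/a" if ratio is None else f"{ratio:.1%}"
    print(f"{row.extension}: {label}")
```

A few rows in the table (for example .qpon) list more crawled domains than known ones, so the ratio is not guaranteed to stay at or below 100%.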

Extension Known Domains Crawled Domains
.aaa 5 2
.aarp 1 0
.abarth 1 1
.abb 9 1
.abbott 74 32
.abbvie 0 0
.abc 9 1
.able 0 0
.abogado 82 47
.abudhabi 12 2
.ac 1404 540
.academy 9638 6340
.accenture 1 0
.accountant 550 347
.accountants 611 417
.aco 6 6
.actor 682 444
.ad 726 143
.adac 1 1
.ads 29 0
.adult 4438 3832
.ae 15743 8579
.aeg 1 1
.aero 4000 2179
.aetna 0 0
.af 1279 435
.afamilycompany 0 0
.afl 13 8
.africa 305 152
.ag 2497 1325
.agakhan 0 0
.agency 14562 9300
.ai 4685 2083
.aig 1 0
.aigo 0 0
.airbus 0 0
.airforce 101 60
.airtel 0 0
.akdn 0 0
.al 3944 1224
.alfaromeo 0 0
.alibaba 0 0
.alipay 0 0
.allfinanz 1 0
.allstate 0 0
.ally 1 0
.alsace 85 43
.alstom 0 0
.am 8687 3314
.americanexpress 0 0
.americanfamily 0 0
.amex 4 0
.amfam 0 0
.amica 3 1
.amsterdam 15930 13438
.analytics 3 0
.android 2 0
.anquan 0 0
.anz 1 0
.ao 1051 541
.aol 1 0
.apartments 1395 963
.app 3397 1067
.apple 13 0
.aq 60 31
.aquarelle 13 12
.ar 290350 193365
.arab 0 0
.aramco 2 1
.archi 934 633
.army 384 241
.arpa 2 0
.art 1305 467
.arte 4 1
.as 2027 1219
.asda 1 0
.asia 87348 48153
.associates 959 693
.at 482762 283611
.athleta 0 0
.attorney 3192 2492
.au 676566 422310
.auction 1032 694
.audi 146 53
.audible 1 1
.audio 2662 1675
.auspost 6 3
.author 3 0
.auto 143 69
.autos 2 0
.avianca 0 0
.aw 209 141
.aws 30 2
.ax 627 343
.axa 9 4
.az 9198 2090
.azure 3 1
.ba 8972 5858
.baby 10 2
.baidu 3 0
.banamex 0 0
.bananarepublic 0 0
.band 2997 1914
.bank 493 277
.bar 2210 1382
.barcelona 144 56
.barclaycard 9 3
.barclays 21 4
.barefoot 0 0
.bargains 895 597
.baseball 0 0
.basketball 36 15
.bauhaus 1 1
.bayern 267 168
.bb 362 225
.bbc 4 0
.bbt 2 0
.bbva 1 1
.bcg 0 0
.bcn 3 0
.bd 3134 986
.be 394979 286735
.beats 1 0
.beauty 4 0
.beer 315 160
.bentley 1 0
.berlin 15650 9387
.best 249 135
.bestbuy 1 0
.bet 627 229
.bf 386 175
.bg 26043 3364
.bh 777 336
.bharti 0 0
.bi 379 157
.bible 113 50
.bid 2039 539
.bike 6624 4226
.bing 2 1
.bingo 428 240
.bio 6462 4167
.biz 966268 563218
.bj 176 75
.black 1335 904
.blackfriday 505 325
.blockbuster 0 0
.blog 5342 1900
.bloomberg 73 0
.blue 4297 2433
.bm 1284 747
.bms 1 1
.bmw 7 2
.bn 375 137
.bnl 0 0
.bnpparibas 41 20
.bo 3129 2072
.boats 3 2
.boehringer 0 0
.bofa 1 0
.bom 3 1
.bond 1 1
.boo 2 0
.book 2 0
.booking 3 0
.bosch 0 0
.bostik 1 0
.boston 16 6
.bot 55 22
.boutique 3123 1923
.box 26 0
.br 661574 387097
.bradesco 37 20
.bridgestone 3 1
.broadway 0 0
.broker 51 29
.brother 5 2
.brussels 568 182
.bs 209 80
.bt 488 220
.budapest 1 1
.bugatti 4 1
.build 217 137
.builders 1551 1086
.business 4597 3145
.buy 1 0
.buzz 1232 920
.bv 2 0
.bw 582 259
.by 20281 2547
.bz 6646 3668
.bzh 640 310
.ca 494823 303073
.cab 103 18
.cafe 3211 2111
.cal 5 0
.call 9 0
.calvinklein 0 0
.cam 143 34
.camera 1913 1259
.camp 2508 1595
.cancerresearch 1 1
.canon 27 9
.capetown 1731 1270
.capital 2808 1984
.capitalone 2 0
.car 100 36
.caravan 1 1
.cards 1905 1182
.care 5833 4047
.career 14 8
.careers 3172 2019
.cars 103 34
.cartier 1 1
.casa 103 47
.case 0 0
.caseih 0 0
.cash 2524 1701
.casino 1551 999
.cat 30073 8075
.catering 1397 982
.catholic 0 0
.cba 9 0
.cbn 2 1
.cbre 1 0
.cbs 1 0
.cc 109058 39325
.cd 638 243
.ceb 7 1
.center 14026 8734
.ceo 47 18
.cern 19 12
.cf 7736 3354
.cfa 5 1
.cfd 3 1
.cg 114 32
.ch 529953 328052
.chanel 1 1
.channel 1 0
.charity 3 2
.chase 1 0
.chat 2957 1875
.cheap 1275 880
.chintai 1 0
.christmas 684 462
.chrome 2 0
.chrysler 0 0
.church 8726 5577
.ci 1038 447
.cipriani 1 1
.circle 0 0
.cisco 4 0
.citadel 0 0
.citi 3 0
.citic 10 1
.city 8574 5998
.cityeats 0 0
.ck 113 65
.cl 191023 121801
.claims 625 450
.cleaning 891 596
.click 9651 5813
.clinic 2513 1558
.clinique 0 0
.clothing 4731 3183
.cloud 19563 11822
.club 97931 58118
.clubmed 6 3
.cm 1346 494
.cn 265238 56575
.co 384542 252055
.coach 3315 2318
.codes 3058 2344
.coffee 5140 3385
.college 1437 1034
.cologne 1071 499
.com 13873332 7043596
.comcast 1 0
.commbank 0 0
.community 4093 2578
.company 18672 12938
.compare 1 0
.computer 2134 1324
.comsec 0 0
.condos 922 722
.construction 2889 2029
.consulting 8542 5905
.contact 2 0
.contractors 1625 1141
.cooking 21 7
.cookingchannel 0 0
.cool 5408 3315
.coop 5437 2905
.corsica 65 29
.country 86 51
.coupon 1 0
.coupons 600 422
.courses 384 224
.cr 4117 2143
.credit 931 638
.creditcard 305 213
.creditunion 2 1
.cricket 537 329
.crown 3 2
.crs 29 22
.cruise 0 0
.cruises 1138 868
.csc 2 0
.cu 1310 582
.cuisinella 2 1
.cv 485 291
.cw 133 64
.cx 1859 710
.cy 2286 1211
.cymru 2962 1880
.cyou 1 0
.cz 511459 318754
.dabur 1 1
.dad 1 0
.dance 2245 1420
.data 6 0
.date 2654 1153
.dating 1204 813
.datsun 1 1
.day 1 0
.dclk 1 0
.dds 1 0
.de 1549945 947359
.deal 0 0
.dealer 0 0
.deals 3135 2193
.degree 176 97
.delivery 970 619
.dell 0 0
.deloitte 0 0
.delta 2 1
.democrat 260 191
.dental 3232 2110
.dentist 1156 861
.desi 575 403
.design 15331 10090
.dev 893 485
.dhl 9 4
.diamonds 1229 889
.diet 779 544
.digital 8922 5915
.direct 4063 2927
.directory 6229 4730
.discount 1844 1173
.discover 1 0
.dish 0 0
.diy 2 0
.dj 500 268
.dk 343151 280337
.dm 129 64
.dnp 6 2
.do 3578 1635
.docs 2 0
.doctor 522 120
.dodge 0 0
.dog 2630 1763
.domains 2811 1715
.dot 6 0
.download 1848 613
.drive 4 0
.dtv 0 0
.dubai 0 0
.duck 0 0
.dunlop 0 0
.duns 0 0
.dupont 0 0
.durban 832 566
.dvag 10 2
.dvr 0 0
.dz 1767 814
.earth 1840 1318
.eat 1 0
.ec 7957 5105
.eco 135 69
.edeka 8 2
.edu 187599 100044
.education 8609 5628
.ee 35647 22095
.eg 3556 1038
.email 3779 2700
.emerck 0 0
.energy 2445 1643
.engineer 786 516
.engineering 1945 1261
.enterprises 2086 1510
.epson 1 1
.equipment 2667 1857
.er 14 2
.ericsson 0 0
.erni 5 2
.es 503765 286165
.esq 0 0
.estate 3769 2660
.esurance 0 0
.et 576 196
.etisalat 0 0
.eu 419001 235690
.eurovision 1 1
.eus 4185 2282
.events 8968 5898
.everbank 2 1
.exchange 2294 1528
.expert 13573 9239
.exposed 905 701
.express 1490 967
.extraspace 0 0
.fage 8 6
.fail 1112 813
.fairwinds 0 0
.faith 677 339
.family 3265 2487
.fan 23 10
.fans 432 247
.farm 4594 3258
.farmers 1 0
.fashion 2460 1662
.fast 3 0
.fedex 0 0
.feedback 197 78
.ferrari 0 0
.ferrero 0 0
.fi 220758 127408
.fiat 0 0
.fidelity 0 0
.fido 0 0
.film 953 568
.final 2 1
.finance 2168 1493
.financial 1348 929
.fire 4 1
.firestone 0 0
.firmdale 41 1
.fish 578 356
.fishing 26 13
.fit 3189 2207
.fitness 3269 2244
.fj 482 265
.fk 42 13
.flickr 1 0
.flights 808 575
.flir 0 0
.florist 1132 771
.flowers 756 481
.fly 1 0
.fm 12330 8030
.fo 1467 886
.foo 9 0
.food 3 0
.foodnetwork 0 0
.football 1770 1351
.ford 3 0
.forex 21 9
.forsale 1980 1507
.forum 2 0
.foundation 3320 2312
.fox 11 2
.fr 643262 334399
.free 12 0
.fresenius 0 0
.frl 9305 7762
.frogans 1 0
.frontdoor 0 0
.frontier 0 0
.ftr 0 0
.fujitsu 0 0
.fujixerox 0 0
.fun 3122 686
.fund 2088 1378
.furniture 868 597
.futbol 896 653
.fyi 2502 1806
.ga 8892 3830
.gal 1844 1038
.gallery 7602 5129
.gallo 0 0
.gallup 0 0
.game 148 38
.games 782 248
.gap 1 0
.garden 618 413
.gb 6 0
.gbiz 0 0
.gd 1053 481
.gdn 505 45
.ge 8048 2317
.gea 0 0
.gent 857 467
.genting 1 1
.george 0 0
.gf 40 11
.gg 6276 3820
.ggee 1 1
.gh 1083 527
.gi 474 265
.gift 1911 1169
.gifts 1150 735
.gives 358 229
.giving 2 1
.gl 989 443
.glade 0 0
.glass 1516 1039
.gle 2 0
.global 2250 1360
.globo 57 3
.gm 227 120
.gmail 3 0
.gmbh 280 257
.gmo 1 1
.gmx 0 0
.gn 38 11
.godaddy 2 0
.gold 1738 739
.goldpoint 1 1
.golf 2307 1626
.goo 15 1
.goodyear 0 0
.goog 20 0
.google 41 7
.gop 137 67
.got 2 0
.gov 39295 19002
.gp 542 199
.gq 4137 1674
.gr 255119 146141
.grainger 0 0
.graphics 3280 2144
.gratis 1751 1186
.green 131 61
.gripe 363 280
.grocery 0 0
.group 1186 706
.gs 1077 420
.gt 4323 2647
.gu 24 11
.guardian 4 0
.gucci 1 0
.guge 1 0
.guide 6145 4430
.guitars 535 381
.guru 23285 17219
.gw 15 6
.gy 523 243
.hair 1 0
.hamburg 305 189
.hangout 1 0
.haus 1796 1215
.hbo 0 0
.hdfc 0 0
.hdfcbank 0 0
.health 229 151
.healthcare 2589 1853
.help 4430 2641
.helsinki 0 0
.here 18 0
.hermes 1 0
.hgtv 0 0
.hiphop 342 234
.hisamitsu 3 1
.hitachi 3 1
.hiv 150 86
.hk 42346 15826
.hkt 0 0
.hm 134 50
.hn 1191 663
.hockey 388 218
.holdings 1982 1358
.holiday 2578 1650
.homedepot 3 0
.homegoods 0 0
.homes 18 14
.homesense 0 0
.honda 4 2
.honeywell 0 0
.horse 205 97
.hospital 8 2
.host 2805 792
.hosting 1590 927
.hot 18 0
.hoteles 1 0
.hotels 1 0
.hotmail 2 1
.house 6318 4484
.how 785 489
.hr 36459 22489
.hsbc 7 1
.ht 491 231
.hu 359105 254012
.hughes 1 0
.hyatt 0 0
.hyundai 1 1
.ibm 3 0
.icbc 1 1
.ice 11 2
.icu 9620 2978
.id 54716 12406
.ie 154458 91810
.ieee 0 0
.ifm 2 0
.ikano 2 2
.il 150023 34255
.im 4018 1921
.imamat 1 0
.imdb 2 1
.immo 5846 3124
.immobilien 4698 2950
.in 399200 235689
.inc 22 6
.industries 1262 849
.infiniti 1 1
.info 519439 280079
.ing 0 0
.ink 3898 2211
.institute 3610 2533
.insurance 10 8
.insure 1497 1028
.int 1671 569
.intel 0 0
.international 8481 5800
.intuit 1 0
.investments 1168 858
.io 154605 60348
.ipiranga 6 1
.iq 717 88
.ir 250770 45464
.irish 641 487
.is 16493 9459
.iselect 0 0
.ismaili 2 1
.ist 147 57
.istanbul 163 62
.it 790513 470658
.itau 9 2
.itv 1 0
.iveco 0 0
.jaguar 0 0
.java 13 0
.jcb 8 3
.jcp 0 0
.je 730 396
.jeep 0 0
.jetzt 1411 810
.jewelry 884 583
.jio 0 0
.jll 6 2
.jm 493 285
.jmp 1 0
.jnj 0 0
.jo 1774 842
.jobs 12415 5771
.joburg 1153 798
.jot 0 0
.joy 0 0
.jp 839358 183282
.jpmorgan 0 0
.jprs 2 0
.juegos 272 174
.juniper 2 0
.kaufen 4823 2767
.kddi 1 1
.ke 6331 3380
.kerryhotels 0 0
.kerrylogistics 0 0
.kerryproperties 0 0
.kfh 0 0
.kg 2377 324
.kh 1114 503
.ki 194 54
.kia 1 1
.kim 3773 1969
.kinder 2 0
.kindle 1 1
.kitchen 2518 1727
.kiwi 1465 943
.km 58 26
.kn 58 24
.koeln 265 166
.komatsu 9 4
.kosher 0 0
.kp 38 3
.kpmg 4 1
.kpn 4 1
.kr 241382 120692
.krd 71 21
.kred 9 4
.kuokgroup 0 0
.kw 1078 418
.ky 919 553
.kyoto 57 1
.kz 15708 2921
.la 5521 3025
.lacaixa 0 0
.ladbrokes 1 0
.lamborghini 26 9
.lamer 0 0
.lancaster 1 1
.lancia 0 0
.lancome 0 0
.land 6613 4606
.landrover 0 0
.lanxess 2 0
.lasalle 1 0
.lat 582 335
.latino 0 0
.latrobe 3 0
.law 2297 1124
.lawyer 4925 3697
.lb 1373 737
.lc 396 177
.lds 1 0
.lease 639 403
.leclerc 79 18
.lefrak 0 0
.legal 4365 2916
.lego 0 0
.lexus 3 1
.lgbt 68 24
.li 5741 3303
.liaison 1 0
.lidl 12 6
.life 21427 13930
.lifeinsurance 0 0
.lifestyle 0 0
.lighting 2726 1814
.like 10 0
.lilly 0 0
.limited 631 420
.limo 1112 787
.lincoln 0 0
.linde 35 0
.link 17735 9810
.lipsy 0 0
.live 12743 5874
.living 3 0
.lixil 3 1
.lk 6636 3731
.llc 29 23
.loan 5371 2517
.loans 1112 822
.locker 0 0
.locus 3 1
.loft 1 0
.lol 2359 1489
.london 10589 7220
.lotte 1 0
.lotto 1 0
.love 4140 2340
.lpl 0 0
.lplfinancial 0 0
.lr 93 46
.ls 208 96
.lt 93950 56378
.ltd 905 403
.ltda 29 24
.lu 20752 11053
.lundbeck 1 0
.lupin 1 1
.luxe 8 2
.luxury 430 200
.lv 34500 19048
.ly 4222 1613
.ma 7218 4078
.macys 1 0
.madrid 6 2
.maif 2 2
.maison 377 209
.makeup 1 0
.man 17 1
.management 4474 3138
.mango 31 1
.map 7 0
.market 2984 1722
.marketing 7118 4993
.markets 102 50
.marriott 0 0
.marshalls 0 0
.maserati 0 0
.mattel 0 0
.mba 717 473
.mc 829 442
.mckinsey 0 0
.md 6610 3257
.me 275661 205481
.med 8 0
.media 14357 9543
.meet 3 0
.melbourne 141 92
.meme 2 0
.memorial 143 91
.men 2595 729
.menu 140 52
.merckmsd 0 0
.metlife 1 0
.mg 777 317
.mh 6 0
.miami 87 36
.microsoft 29 1
.mil 3680 861
.mini 4 1
.mint 1 0
.mit 6 0
.mitsubishi 1 0
.mk 5882 1498
.ml 8756 3212
.mlb 1 0
.mls 2 0
.mm 781 170
.mma 9 1
.mn 4306 1018
.mo 940 247
.mobi 148527 132421
.mobile 9 0
.mobily 0 0
.moda 1043 596
.moe 1255 468
.moi 1 0
.mom 44 6
.monash 8 3
.money 2936 2140
.monster 9 6
.mopar 0 0
.mormon 1 0
.mortgage 837 610
.moscow 476 51
.moto 0 0
.motorcycles 3 1
.mov 14 0
.movie 668 250
.movistar 1 0
.mp 122 47
.mq 48 17
.mr 210 70
.ms 1813 671
.msd 0 0
.mt 1973 1292
.mtn 1 0
.mtr 0 0
.mu 1785 973
.museum 302 126
.mutual 0 0
.mv 559 336
.mw 416 161
.mx 349969 230643
.my 94138 57728
.mz 878 450
.na 907 508
.nab 0 0
.nadex 1 1
.nagoya 2009 875
.name 35404 21633
.nationwide 0 0
.natura 2 0
.navy 183 126
.nba 1 0
.nc 1217 693
.ne 202 45
.nec 5 1
.net 2918680 1578650
.netbank 0 0
.netflix 3 0
.network 8839 5525
.neustar 20 8
.new 49 1
.newholland 0 0
.news 19025 12561
.next 2 0
.nextdirect 0 0
.nexus 1 0
.nf 605 355
.nfl 0 0
.ng 10990 4370
.ngo 225 130
.nhk 1 1
.ni 1062 604
.nico 8 1
.nike 1 0
.nikon 1 0
.ninja 14014 10600
.nissan 3 2
.nissay 0 0
.nl 597480 422347
.no 298985 186997
.nokia 1 0
.northwesternmutual 0 0
.norton 0 0
.now 25 0
.nowruz 0 0
.nowtv 0 0
.np 3242 1311
.nr 2663 7
.nra 5 1
.nrw 261 145
.ntt 2 0
.nu 40788 25415
.nyc 19830 14117
.nz 325827 199878
.obi 0 0
.observer 19 5
.off 2 0
.office 31 1
.okinawa 1328 462
.olayan 0 0
.olayangroup 0 0
.oldnavy 1 0
.ollo 0 0
.om 1076 271
.omega 2 1
.one 19913 13958
.ong 42 21
.onl 202 95
.online 85456 51482
.onyourside 1 0
.ooo 1033 328
.open 3 0
.oracle 4 0
.orange 11 0
.org 2241890 1252364
.organic 20 8
.origins 0 0
.osaka 196 65
.otsuka 1 1
.ott 1 0
.ovh 15483 8056
.pa 1745 949
.page 103 40
.panasonic 0 0
.paris 10824 6165
.pars 0 0
.partners 2336 1533
.parts 2367 1448
.party 7260 4510
.passagens 0 0
.pay 4 1
.pccw 0 0
.pe 19986 12887
.pet 1068 631
.pf 638 368
.pfizer 0 0
.pg 520 259
.ph 21224 9862
.pharmacy 32 17
.phd 1 0
.philips 4 1
.phone 2 0
.photo 7508 4640
.photography 15237 10326
.photos 8636 5659
.physio 38 25
.piaget 1 1
.pics 3290 2152
.pictet 5 2
.pictures 2948 2074
.pid 0 0
.pin 2 0
.ping 2 1
.pink 2601 1005
.pioneer 4 0
.pizza 1810 1170
.pk 18738 11160
.pl 687150 443340
.place 1450 1010
.play 5 0
.playstation 2 0
.plumbing 1575 1204
.plus 2967 1608
.pm 670 209
.pn 128 38
.pnc 1 0
.pohl 0 0
.poker 97 15
.politie 1 0
.porn 5510 4502
.post 31 12
.pr 560 256
.pramerica 0 0
.praxi 4 0
.press 3108 1819
.prime 2 0
.pro 101285 54877
.prod 30 0
.productions 2016 1383
.prof 0 0
.progressive 0 0
.promo 138 28
.properties 4246 3227
.property 1528 1179
.protection 15 9
.pru 1 0
.prudential 0 0
.ps 2238 698
.pt 158549 94786
.pub 7114 3432
.pw 16855 6772
.pwc 1 0
.py 3600 2219
.qa 2181 722
.qpon 1 2
.quebec 1260 679
.quest 0 0
.qvc 2 0
.racing 696 400
.radio 66 30
.raid 0 0
.re 2384 1181
.read 5 0
.realestate 20 15
.realtor 129 73
.realty 17 10
.recipes 1507 1122
.red 11339 6868
.redstone 4 2
.redumbrella 0 0
.rehab 555 393
.reise 701 348
.reisen 2975 1643
.reit 50 14
.reliance 0 0
.ren 113 10
.rent 862 528
.rentals 4604 3364
.repair 2723 1765
.report 1991 1278
.republican 176 119
.rest 669 386
.restaurant 2472 1576
.review 2487 957
.reviews 6033 4341
.rexroth 0 0
.rich 7 4
.richardli 0 0
.ricoh 5 3
.rightathome 0 0
.ril 0 0
.rio 395 189
.rip 743 485
.rmit 2 1
.ro 283648 175550
.rocher 0 0
.rocks 26262 17677
.rodeo 13 9
.rogers 1 0
.room 2 0
.rs 27729 15527
.rsvp 0 0
.ru 671884 146922
.rugby 3 4
.ruhr 96 58
.run 3092 1642
.rw 652 343
.rwe 2 1
.ryukyu 162 56
.sa 8638 3607
.saarland 2164 1060
.safe 2 0
.safety 0 0
.sakura 1 1
.sale 2783 1627
.salon 51 24
.samsclub 0 0
.samsung 3 0
.sandvik 15 6
.sandvikcoromant 1 0
.sanofi 1 1
.sap 15 1
.sarl 266 156
.sas 2 0
.save 3 0
.saxo 12 3
.sb 110 54
.sbi 9 2
.sbs 0 0
.sc 1699 943
.sca 2 1
.scb 3 1
.schaeffler 0 0
.schmidt 6 3
.scholarships 1 1
.school 3596 2281
.schule 1212 656
.schwarz 2 1
.science 5341 3925
.scjohnson 0 0
.scor 1 1
.scot 5932 3728
.sd 643 136
.se 272318 163066
.search 2 0
.seat 198 97
.secure 1 0
.security 50 23
.seek 2 1
.select 3 0
.sener 15 6
.services 11298 7773
.ses 9 6
.seven 3 1
.sew 1 1
.sex 3512 2875
.sexy 3447 2325
.sfr 0 0
.sg 89051 47869
.sh 3204 967
.shangrila 0 0
.sharp 11 2
.shaw 3 0
.shell 3 0
.shia 0 0
.shiksha 192 102
.shoes 2411 1474
.shop 5608 1919
.shopping 58 20
.shouji 0 0
.show 1925 1101
.showtime 1 0
.shriram 15 10
.si 41821 24037
.silk 1 1
.sina 1 0
.singles 1799 1388
.site 41256 20024
.sj 0 0
.sk 264464 156494
.ski 2572 1417
.skin 1 0
.sky 31 1
.skype 6 1
.sl 251 110
.sling 0 0
.sm 524 273
.smart 5 1
.smile 2 0
.sn 1097 552
.sncf 17 9
.so 996 304
.soccer 749 533
.social 5673 3515
.softbank 2 1
.software 4411 2717
.sohu 1 0
.solar 2775 1999
.solutions 21815 15918
.song 1 0
.sony 6 2
.soy 403 286
.space 27195 16447
.sport 24 15
.spot 7 0
.spreadbetting 7 1
.sr 379 234
.srl 1069 692
.srt 2 0
.ss 1 0
.st 3390 1454
.stada 16 2
.staples 0 0
.star 6 0
.starhub 1 0
.statebank 0 0
.statefarm 1 0
.stc 3 2
.stcgroup 1 1
.stockholm 8 6
.storage 13 3
.store 4186 1301
.stream 1747 363
.studio 5360 3333
.study 295 155
.style 2120 1331
.su 61322 10540
.sucks 2798 2273
.supplies 935 675
.supply 1530 1110
.support 7628 4755
.surf 85 36
.surgery 792 607
.suzuki 3 1
.sv 1854 1125
.swatch 2 1
.swiftcover 0 0
.swiss 5633 2555
.sx 239 105
.sy 581 64
.sydney 2876 2066
.symantec 0 0
.systems 9394 6353
.sz 207 119
.tab 0 0
.taipei 332 38
.talk 1 0
.taobao 1 0
.target 3 0
.tatamotors 2 1
.tatar 19 2
.tattoo 935 573
.tax 2621 1658
.taxi 1862 1111
.tc 2411 1671
.tci 0 0
.td 47 14
.tdk 0 0
.team 3942 2511
.tech 23308 14553
.technology 10700 7626
.tel 8257 6893
.telefonica 9 0
.temasek 1 0
.tennis 567 320
.teva 2 0
.tf 349 105
.tg 178 92
.th 30372 9954
.thd 1 0
.theater 484 262
.theatre 17 6
.tiaa 0 0
.tickets 313 165
.tienda 1008 572
.tiffany 1 0
.tips 14694 10409
.tires 285 177
.tirol 223 172
.tj 913 110
.tjmaxx 0 0
.tjx 0 0
.tk 57924 36329
.tkmaxx 0 0
.tl 7734 5990
.tm 480 112
.tmall 1 0
.tn 4361 1880
.to 11373 3850
.today 21633 15057
.tokyo 21160 9514
.tools 3902 2374
.top 13223 2897
.toray 7 3
.toshiba 3 1
.total 7 2
.tours 2411 1629
.town 1320 944
.toyota 4 1
.toys 1684 1095
.tr 223035 152072
.trade 1338 605
.trading 113 60
.training 7642 5225
.travel 4816 2303
.travelchannel 0 0
.travelers 0 0
.travelersinsurance 0 0
.trust 5 1
.trv 2 0
.tt 919 417
.tube 165 37
.tui 1 1
.tunes 0 0
.tushu 0 0
.tv 223712 144419
.tvs 0 0
.tw 195314 49879
.tz 2434 1393
.ua 248130 54581
.ubank 0 0
.ubs 1 0
.uconnect 0 0
.ug 1664 872
.uk 1041962 600819
.unicom 0 0
.university 2222 1412
.uno 1682 1264
.uol 37 1
.ups 0 0
.us 362974 271719
.uy 12788 8842
.uz 4789 621
.va 238 50
.vacations 1724 1227
.vana 0 0
.vanguard 4 0
.vc 2294 986
.ve 9158 5371
.vegas 4496 3115
.ventures 3493 2572
.verisign 0 0
.versicherung 14 14
.vet 2169 1420
.vg 1593 1068
.vi 158 85
.viajes 530 299
.video 5111 3016
.vig 0 0
.viking 0 0
.villas 871 586
.vin 2281 1284
.vip 2034 422
.virgin 3 1
.visa 0 0
.vision 2347 1572
.vistaprint 2 1
.viva 1 1
.vivo 2 0
.vlaanderen 69 34
.vn 114281 23775
.vodka 12 8
.volkswagen 0 0
.volvo 1 0
.vote 528 343
.voting 3 1
.voto 85 26
.voyage 1240 835
.vu 1567 581
.vuelos 0 0
.wales 5624 3902
.walmart 4 0
.walter 3 1
.wang 9183 4748
.wanggou 0 0
.warman 0 0
.watch 2555 1434
.watches 1 1
.weather 1 0
.weatherchannel 1 0
.webcam 643 262
.weber 47 21
.website 30371 19469
.wed 6 6
.wedding 1987 1265
.weibo 1 0
.weir 72 1
.wf 145 74
.whoswho 2 1
.wien 3600 1960
.wiki 5183 3098
.williamhill 6 3
.win 21935 10278
.windows 2 1
.wine 4599 2934
.winners 0 0
.wme 1 1
.wolterskluwer 0 0
.woodside 2 0
.work 2904 749
.works 5315 3590
.world 31880 13729
.wow 6 0
.ws 133980 122131
.wtc 1 1
.wtf 3382 2357
.xbox 13 1
.xerox 0 0
.xfinity 0 0
.xihuan 0 0
.xin 6926 4789
.xn--11b4c3d 0 0
.xn--1ck2e1b 0 0
.xn--1qqw23a 0 0
.xn--2scrj9c 0 0
.xn--30rr7y 0 0
.xn--3bst00m 0 0
.xn--3ds443g 3 0
.xn--3e0b707e 51 18
.xn--3hcrj9c 0 0
.xn--3oq18vl8pn36a 0 0
.xn--3pxu8k 0 0
.xn--42c2d9a 0 0
.xn--45br5cyl 0 0
.xn--45brj9c 1 0
.xn--45q11c 2 0
.xn--4gbrim 1 0
.xn--54b7fta0cc 3 0
.xn--55qw42g 0 0
.xn--55qx5d 0 1
.xn--5su34j936bgsg 0 0
.xn--5tzm5g 0 0
.xn--6frz82g 329 114
.xn--6qq986b3xl 7 0
.xn--80adxhks 160 7
.xn--80ao21a 2 0
.xn--80aqecdr1a 0 0
.xn--80asehdb 857 138
.xn--80aswg 324 58
.xn--8y0a063a 0 0
.xn--90a3ac 12 2
.xn--90ae 24 1
.xn--90ais 149 2
.xn--9dbq2a 1 0
.xn--9et52u 0 0
.xn--9krt00a 0 0
.xn--b4w605ferd 1 0
.xn--bck1b9a5dre4c 0 0
.xn--c1avg 20 2
.xn--c2br7g 0 0
.xn--cck2b3b 0 0
.xn--cg4bki 0 0
.xn--clchc0ea0b2g2a9gcd 0 0
.xn--czr694b 1 1
.xn--czrs0t 0 0
.xn--czru2d 3 0
.xn--d1acj3b 24 0
.xn--d1alf 0 0
.xn--e1a4c 6 0
.xn--eckvdtc9d 0 0
.xn--efvy88h 0 0
.xn--estv75g 1 1
.xn--fct429k 0 0
.xn--fhbei 0 0
.xn--fiq228c5hs 0 0
.xn--fiq64b 0 0
.xn--fiqs8s 1220 44
.xn--fiqz9s 0 0
.xn--fjq720a 0 0
.xn--flw351e 1 0
.xn--fpcrj9c3d 0 1
.xn--fzc2c9e2c 1 0
.xn--fzys8d69uvgm 0 0
.xn--g2xx48c 0 0
.xn--gckr3f0f 0 0
.xn--gecrj9c 0 0
.xn--gk3at1e 0 0
.xn--h2breg3eve 0 0
.xn--h2brj9c 7 1
.xn--h2brj9c8c 0 0
.xn--hxt814e 59 27
.xn--i1b6b1a6a2e 0 0
.xn--imr513n 0 0
.xn--io0a7i 1 0
.xn--j1aef 7 0
.xn--j1amh 239 5
.xn--j6w193g 3 1
.xn--jlq61u9w7b 0 0
.xn--jvr189m 0 0
.xn--kcrx77d1x4a 1 1
.xn--kprw13d 0 0
.xn--kpry57d 10 1
.xn--kpu716f 0 0
.xn--kput3i 1 1
.xn--l1acc 1 1
.xn--lgbbat1ad8j 2 0
.xn--mgb9awbf 0 0
.xn--mgba3a3ejt 2 1
.xn--mgba3a4f16a 1 0
.xn--mgba7c0bbn0a 0 0
.xn--mgbaakc7dvf 0 0
.xn--mgbaam7a8h 5 1
.xn--mgbab2bd 101 19
.xn--mgbah1a3hjkrd 0 0
.xn--mgbai9azgqp6j 0 0
.xn--mgbayh7gpa 0 0
.xn--mgbb9fbpob 0 0
.xn--mgbbh1a 0 0
.xn--mgbbh1a71e 0 0
.xn--mgbc0a9azcg 0 0
.xn--mgbca7dzdo 0 0
.xn--mgberp4a5d4ar 0 0
.xn--mgbgu82a 0 0
.xn--mgbi4ecexp 0 0
.xn--mgbpl2fh 0 0
.xn--mgbt3dhd 0 0
.xn--mgbtx2b 0 0
.xn--mgbx4cd0ab 0 0
.xn--mix891f 0 0
.xn--mk1bu44c 6 0
.xn--mxtq1m 0 0
.xn--ngbc5azd 7 0
.xn--ngbe9e0a 0 0
.xn--ngbrx 0 0
.xn--node 2 0
.xn--nqv7f 0 0
.xn--nqv7fs00ema 0 0
.xn--nyqy26a 4 4
.xn--o3cw4h 9 1
.xn--ogbpf8fl 0 0
.xn--otu796d 0 0
.xn--p1acf 299 43
.xn--p1ai 299734 44067
.xn--pbt977c 0 0
.xn--pgbs0dh 2 0
.xn--pssy2u 0 0
.xn--q9jyb4c 287 158
.xn--qcka1pmc 1 0
.xn--qxam 9 2
.xn--rhqv96g 1 0
.xn--rovu88b 0 0
.xn--rvc1e0am3e 0 0
.xn--s9brj9c 0 0
.xn--ses554g 0 0
.xn--t60b56a 2 0
.xn--tckwe 85 26
.xn--tiq49xqyj 0 0
.xn--unup4y 1 0
.xn--vermgensberater-ctb 1 1
.xn--vermgensberatung-pwb 1 1
.xn--vhquv 0 0
.xn--vuq861b 3 1
.xn--w4r85el8fhu5dnra 0 0
.xn--w4rs40l 0 0
.xn--wgbh1c 2 0
.xn--wgbl6a 2 1
.xn--xhq521b 5 0
.xn--xkc2al3hye2a 0 0
.xn--xkc2dl3a5ee0h 0 0
.xn--y9a3aq 0 0
.xn--yfro4i67o 0 0
.xn--ygbi2ammx 0 0
.xn--zfr164b 3 2
.xxx 89175 3330
.xyz 289147 80131
.yachts 1 1
.yahoo 11 0
.yamaxun 1 1
.yandex 14 1
.ye 197 55
.yodobashi 1 1
.yoga 2964 2084
.yokohama 1481 684
.you 31 0
.youtube 7 0
.yt 214 77
.yun 0 0
.za 295219 203059
.zappos 1 1
.zara 2 0
.zero 3 0
.zip 63 0
.zm 571 295
.zone 7292 4840
.zuerich 1 1
.zw 1434 730
Total 44202501 23848767

WbSrch Offline Again

I put the WbSrch search engine back online in March of 2018.

I spent a lot of time improving it over the 16 months since, but it’s the sort of thing that always manages to demand more time and energy. It’s time to stop giving it either. Though it’s grown and improved a lot, it’s not something I could ever imagine doing full-time. The money isn’t there and the fun isn’t there anymore.

So I’ve taken it down. This time probably for good.

I’ll be focusing on my music software, electronic music, and acoustic guitar music instead.

Thanks for reading, and if you’re reading this you quite possibly participated in the experiment that was the WbSrch search engine. Thank you.

WbSrch Online Again

A while back I open-sourced the code for the WbSrch search engine.

It’s online now in a much-reduced form at wbsrch.com.

It’s not the full search engine. Far from it. It’s just a tiny database of about 10,000 URLs to demo the source code, but it’s possible you’ll actually find what you’re looking for in even that tiny amount of data if your search is sufficiently simple.

It probably won’t get any bigger — that’s about the size I can support “for free”, in that it doesn’t take enough resources on my inexpensive VPS to impact more important things. If you’re curious what the original WbSrch search engine was like, it’s a pretty good demo, at least visually.

2018 Is The Year That Twitter Ceases To Be Relevant

2018 is the year that Twitter ceases to be relevant.

It’s already stopped being relevant for me. I’ve stopped using it, and have deleted all of my tweets.

As a user, it’s just not worth it. It’s a miserable experience, made much worse by the userbase being made up primarily of Russian bots posing as MAGA idiots, actual alt-right MAGA idiots, and a small kernel of real people saying intelligent things that are drowned out by noise.

I’ve done (and still do) a lot of advertising on the web. For all of the different things I’ve been into, the worst ROI has consistently been via Twitter. Maybe some business types are viable via their ad platform, but none I’ve been involved in have been. It’s been a total waste of money. Mailing postcards would be a better value.

Most of the people I know in meatspace with accounts stopped using it long ago. Some stopped in 2015, some in 2016, some in 2017. I can name maybe five people who use it regularly, and some of them echo their tweets to Facebook. I don’t have a huge circle of friends, but compare that to about 140 on Facebook with about 40% of them being active (50 or so people) and the order of magnitude population reduction makes it far less interesting. Facebook has its own problems, but it still manages to be relevant, unlike Twitter.

Even though Twitter is garbage to me, maybe it isn’t garbage to everyone else.

Nope.

There hasn’t been much recent coverage that I can find with about 15 seconds of effort, but these from last year don’t paint a rosy picture:

Twitter is now losing users in the U.S
http://money.cnn.com/2017/07/27/technology/business/twitter-earnings/index.html

Twitter revenues decline for first time as advertising falls away
https://www.theguardian.com/technology/2017/apr/26/twitter-revenues-fall-first-quarter-results-advertising

Library of Congress Gives Up Collecting All Tweets Because Twitter Is Garbage
https://gizmodo.com/library-of-congress-gives-up-on-twitter-because-twitter-1821581190

When Twitter finally dies, nearly nothing of value will be lost.

And if it doesn’t die, why care?

Quora Answer: How would you find the websites to build a search index from scratch?

I originally wrote this as an answer to a question on Quora.

It depends on the scale.

If you just want to experiment with web crawling and build a basic search index, it’s common to start with the Alexa top million websites, which can be downloaded in a CSV file via S3 at: http://s3.amazonaws.com/alexa-static/top-1m.csv.zip

The top million changes daily, and includes a lot of spam and porn. It’s easy to game the system to get into the bottom 500k, so you’ll need to decide what to include and how to weight it.
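As a rough illustration of the "decide what to include and how to weight it" step, here is a minimal Python sketch. It assumes the CSV has two columns, rank and domain, with no header row. The 500,000 cutoff, the substring blocklist, and the logarithmic weighting are hypothetical choices for the example, not anything Alexa or WbSrch prescribes.

```python
import csv
import io
import math

def load_seed_domains(csv_text, max_rank=500_000, blocked_substrings=()):
    """Return (domain, weight) pairs for rows ranked within max_rank.

    Assumes rows of the form "rank,domain" with no header row.
    max_rank, blocked_substrings, and the weighting are illustrative.
    """
    seeds = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        rank_text, domain = row
        if not rank_text.strip().isdigit():
            continue
        rank = int(rank_text)
        if rank < 1 or rank > max_rank:
            continue  # drop the easily gamed tail of the list
        domain = domain.strip().lower()
        if any(s in domain for s in blocked_substrings):
            continue
        # Better-ranked sites get a larger crawl weight; rank 1 maps to 1.0.
        weight = 1.0 / (1.0 + math.log10(rank))
        seeds.append((domain, weight))
    return seeds

# Made-up sample rows standing in for the downloaded file.
sample = "1,example.com\n2,example.org\n600000,spam.example\n3,casino-example.net\n"
seeds = load_seed_domains(sample, blocked_substrings=("casino",))
```

A substring blocklist is a blunt tool. In practice you would probably combine rank with content-based signals once pages have actually been crawled.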

The DMOZ website dump used to be a good starting point. They shut down on March 17th, 2017, so that’s no longer really an option. It left a lot out, but contained about 4 million URLs with most of the low-quality sites filtered out. There may be a mirror of that data somewhere (heck, I’d like to download their last URL dump if it’s available anywhere).

Real search engines have agreements in place with the top-level registrars that let them get a zone file dump listing all registered domains. This involves jumping through some hoops and filling out some forms, and each registrar needs to be dealt with separately. Getting access to .social is completely separate from getting access to .net.
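Once you have a zone file, turning it into a domain list is mostly a matter of reading the NS (delegation) records. The sketch below assumes whitespace-separated records with fully qualified names, such as "example.social. 3600 in ns ns1.example.net."; actual layouts vary between registries (relative names, omitted TTLs, and so on), so treat it as a starting point rather than a complete parser. The sample records are made up.

```python
def domains_from_zone(lines, tld):
    """Collect unique registered domains from NS records in a zone file."""
    suffix = "." + tld.lower().strip(".")
    domains = set()
    for line in lines:
        line = line.split(";", 1)[0].strip()  # drop comments
        if not line or line.startswith("$"):
            continue  # skip blanks and directives such as $ORIGIN or $TTL
        fields = line.lower().split()
        if "ns" not in fields[1:]:
            continue  # only NS records mark a delegated (registered) name
        name = fields[0].rstrip(".")
        if name.endswith(suffix):
            # Keep only the label directly under the TLD.
            label = name[: -len(suffix)].split(".")[-1]
            if label:
                domains.add(label + suffix)
    return sorted(domains)

# Made-up sample records, not real zone data.
zone = [
    "; sample data",
    "$ORIGIN social.",
    "social. 86400 in soa a.nic.social. admin.nic.social. 1 2 3 4 5",
    "example.social. 3600 in ns ns1.example.net.",
    "example.social. 3600 in ns ns2.example.net.",
    "another.social. 3600 in ns ns1.another.net.",
    "ns1.another.social. 3600 in a 192.0.2.1",
]
registered = domains_from_zone(zone, "social")
```

Each domain usually appears once per name server, so collecting into a set keeps the output to one entry per registered name.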

Since it takes a LOT of work to get access to all TLD domain files, a commercial service like Domains Index is probably your best bet if you want to do anything on a large scale. I’ve bought from them before and it’s a good service. They don’t have absolutely everything, but 200 million domains is far better than the 1 million you get from Alexa.

Quora Answer: What are the reasons for Google’s search engine low market share in Russia, South Korea, the US?

I originally wrote this as an answer to a question on Quora.

Full question:

“While I know that Google is banned in China and Yahoo Japan uses Google algorithm, I do not know why Google’s market share is so poor in Russia, the US, South Korea. I’m especially interested in bad performance due to linguistic reasons, if they apply. What about the US?”

In Russia, Yandex is dominant. It’s a very good search engine, and is primarily Russian-language focused. It’s been around nearly as long as Google, and people have a certain loyalty to it, especially since it was the first good Russian-language search engine. Russians also don’t tend to be very trusting of U.S. companies.

In Korea, the story is similar. They have good regional homegrown search engines, Naver and Daum, that have been around about as long as Google and that cater specifically to the Korean-language market. There’s no reason to use Google because the local options work quite well.

In these cases it’s a combination of being better in local languages for a long time and the tendency to “buy local”. They don’t use Google because they don’t need or want Google.

In the U.S. there are a lot of factors. There have been a lot of good English-language options for a long time, even though many have come and gone. Brand loyalty is a very American thing, even when other brands might do slightly better, and getting a “second opinion” is also pretty ingrained. There are also a lot of people who are uncomfortable with Google’s level of information gathering and profile building (“spying”). This combined with the tens of billions of dollars in the search space leaves plenty of room for competitors, even though it’s very expensive to build a decent search engine and very difficult to monetize one (DuckDuckGo being a good example of how difficult it is to build one in modern times).

One of the big influences on U.S. market share is the existence of marketing deals. Money changes hands to be featured as the default search engine in browsers like Firefox, Safari, and others. This can move market share by a few percentage points overnight. If you can pay $1 billion for enough traffic to generate $3 billion in ad revenue from it, it’s a great deal. The Apple deals have been very public, but others have been privately arranged.

Interesting But Not A Business: The Story of the WbSrch Search Engine

I write this after having just shut down my almost-startup, the WbSrch search engine.

I started working on WbSrch for “fun” in the fall of 2013. AltaVista, my favorite search engine from “back in the day”, had shut down that summer. Nostalgia combined with annoyance at how bad/annoying/intrusive/evil Google had become convinced me to try building my own version of AltaVista.

Well, a month of hacking later I had the core of something rudimentary but sort-of-functional. It was pretty terrible, but proved that I could get something built. I crawled a total of about 200,000 pages and had a bare skeleton of a search engine. I started by calling it the “anti-social search engine” because at the time, searching Google for almost anything would return so much social media drivel, clickbait garbage, and otherwise low-value spam-like content.

Getting to the first prototype was easier than I expected, so I continued to work on it, improving the crawler and search algorithms and growing the index. At around 2 million pages it outgrew the Linode VPS it was on and I set up hosting at a local colocation center using a $400 server I picked up on eBay (great deal – dual quad-core Xeons and 72GB of RAM – plenty to grow with).

Things progressed and I ended up announcing it to the public around the end of May 2015. It only had 5 million pages and the indexing algorithms were still pretty terrible, but it started getting some Human traffic.

And the bots discovered it. Every link analyzer SEO app in the world decided that WbSrch was a juicy crawl target. I considered blocking them since the SEO industry is complete garbage, but they were a decent source of Human traffic, and most of the traffic came from webmasters who would check to see whether their sites were indexed and run a few searches.

As it grew, maintenance became more time-consuming. I wanted to keep it from being too porn-heavy or full of Chinese and Russian sites, and I wanted pages categorized by language so that the German-language front-end would only return pages in German.

After a year and a half of running the site as a hobby project, I decided to put it away because the mission had been accomplished – an index of 10 million pages, and it worked about as well as any other mid-1990s search engine. For a few months the front page just pointed to Yandex.com (a Russian search engine – the third-best search engine after Google and Bing).

Well, one unanswered question kept nagging at me: “What if I could turn this into a real business?”

So I turned it back on, and started working on it pretty hard. I read a bunch of textbooks on information retrieval, text processing, statistical language processing, and a bunch of other search-related topics that I knew nothing about when I started the project.

Almost immediately the drives in the server failed.

So I replaced them and rebuilt the server. Didn’t lose much other than a week or two of crawl data because I had a backup, and source control. A few months later the drive controller failed, but there was no data loss – just a day of downtime and a lot of swearing.

As it grew and improved, I also started running some advertising, trying to build the audience and increase traffic. Visits were pretty cheap, but not very sticky. The bar is extremely high for getting someone to switch to a new search engine.

Still, there was some traffic, on the order of a consistent 5-digit number of pageviews per month. So I tried monetizing using a few different ad networks (around ten). The best one was able to earn about $3 per month. When you can’t use the ad networks that forbid linking to porn, gambling, or torrent sites (a general search engine inevitably links to some), but you also don’t want to advertise porn, gambling, or torrents, your income is low. Abysmal. $0.05 CPM on the high end.

With the math for getting new visitors figured out (3-7 cents per click depending on the channel), and the math for monetizing those visitors figured out (about 5 cents for every 500 visitors), it was clear that it would cost $5 to earn $0.01. If those users were really sticky and would return over and over again, then maybe it would be worth the price. But they weren’t.
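The arithmetic behind that conclusion is short enough to write out. This sketch uses only the figures quoted above and assumes each paid click brings exactly one visitor who never returns, which is the non-sticky case described here.

```python
def cost_to_earn(target_revenue, cost_per_click, revenue_per_visitor):
    """Ad spend needed to buy enough visitors to earn target_revenue."""
    visitors_needed = target_revenue / revenue_per_visitor
    return visitors_needed * cost_per_click

revenue_per_visitor = 0.05 / 500  # about 5 cents for every 500 visitors
low = cost_to_earn(0.01, 0.03, revenue_per_visitor)   # 3 cents per click
high = cost_to_earn(0.01, 0.07, revenue_per_visitor)  # 7 cents per click
print(f"${low:.2f} to ${high:.2f} spent per $0.01 earned")
# prints "$3.00 to $7.00 spent per $0.01 earned"
```

The $5 figure sits in the middle of that range. Returning visitors would raise revenue_per_visitor without raising the acquisition cost, which is why stickiness matters so much.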

I also ran a crowdfunding campaign to gauge interest/demand. I raised a little money for hardware upgrades, but more importantly I learned a little more about how much people just don’t care about having another search option. I did manage to get one donation from someone I didn’t already know, but only one.

At this point I had about 47 million pages indexed and the search engine had grown to 3 servers. It had crawled only a tiny fraction of the internet, but it was still possible to find what you were looking for much of the time. It’s surprising how well you can do with a small index if you focus mainly on the most popular sites.

But to take things to the next level of quality I would need to build a system able to handle at least a billion pages.

That’s where things get expensive. I had only spent about $8,000 on WbSrch so far. Did I want to spend another $50,000 to get to that next level where users might be a bit stickier and ad revenue might be better? (It tends to be lower when you’re low-volume; when you have enough traffic that algorithms can optimize, it gets better.) Maybe it would only cost $2 to earn $0.01 and those users would return often enough that I could earn another $0.01.

And that’s where I decided to shut everything down. Math doesn’t lie.

Call it failure to validate. There is no search engine business to be had for me. Maybe someone else could do it. Like DuckDuckGo. Interestingly, they didn’t start with their own crawl. And they’ve partnered with Microsoft for advertising. So they’re essentially a privacy-focused variant of Bing with a different UI. That’s good for them, but the interesting part is developing your own proprietary technology, your own crawler and algorithms. Otherwise, one day Microsoft could decide that having an API is inconvenient and shut down your entire business.

At some point Apple will decide that having a search engine is important and build or buy one. Maybe they’re already building one if the rumors are to be believed.

Anyhow, it’s a little bit sad that it didn’t work, and a little bit sad that I spent all that time on it, but I did get smarter. And not just code. I learned a lot more about marketing and advertising in the process.

So now it’s on to the next thing.

WbSrch Acquires Music Search Engine

Reprint of a press release originally published on PRWeb at https://www.prweb.com/releases/2016/03/prweb13295857.htm.

WbSrch has purchased the music search engine Snagr.io for an undisclosed sum and has rebranded it as MusicSrch.com.

The search engine, originally developed by Oto Brglez from Slovenia, searches a selection of popular social media and music sites such as Spotify, Tidal, Musicbrainz, and Twitter for the online presences of a band, DJ, or musician.

WbSrch founder Jason Champion had this to say about the purchase:

“Oto created a great, fast, responsive site but didn’t have the time, people, or resources to grow its audience. With this purchase, we hope to continue to expand on the great foundation that he created.

The addition of music search to our existing search capabilities will help us increase our reach online and we look forward to sharing the joy of discovering new music with a larger audience.”

The music search site is currently live at http://musicsrch.com.

WbSrch is a general-purpose search engine based in Portland, OR, that was created in 2013 and launched in 2014. It is online at https://wbsrch.com.

WbSrch Launches New WbBrowse Web Browser for Windows, Mac, and Linux

Reprint of a press release originally published on PRWeb at https://www.prweb.com/releases/2016/01/prweb13179551.htm.

The independent search engine WbSrch just launched a new desktop web browser called WbBrowse.

WbBrowse supports tabbed browsing, is free, runs on Windows, Mac, and Linux, and is available at http://wbbrowse.com

“The major search engines all have their own browsers. It’s a good way for new users to discover your service, and an easy way for them to return to it later. Even though the first release of WbBrowse doesn’t have all of the features of the top browsers yet, we think it’s pretty good for a 1.0 release.”

WbSrch also has OpenSearch plugins that can be added to any OpenSearch-capable browser, such as Firefox or Internet Explorer.

WbSrch is a general-purpose search engine based in Portland, OR, that was created in 2013 and launched in 2014.

Quora Answer: What is the minimum number of pages a modern general search engine would have to index to be useful?

I originally wrote this as an answer to a question on Quora.

“Useful” is a very subjective term. People who frequently ask deep and complex language-and-algorithm-specific software engineering questions will require different levels of depth than those who travel a lot and just want to find good prices on airfare and the top 5 restaurants and hotels in each city.

If you have /really/ good algorithms, you can build a search engine that is “good” for people who don’t require much depth with about 100 million pages. For people who require depth, you could probably be pretty useful at about 1 billion pages.

This depends HEAVILY on what you choose to include and exclude. Are these pages all in a single language? Or is this just 100 pages each from the top 1 million sites regardless of language and content?

Even though the web is phenomenally huge, much of it is duplication and/or computer-generated spam. There are millions of sites that are just scrapes/dumps of other sites (especially Wikipedia) and indexing 1000 copies of Wikipedia with different CSS isn’t going to get you very far.
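One simple way to avoid indexing the same content under different styling is to fingerprint the visible text rather than the raw HTML. The sketch below is an illustration of that idea, not how WbSrch handled it. It catches only copies whose text is identical after normalization; scrapes that rewrite or reorder content would need a near-duplicate technique such as shingling. The URLs and pages are made up.

```python
import hashlib
import re

def content_fingerprint(html):
    """Hash the visible text of a page, ignoring markup and styling."""
    # Crude tag stripping for illustration; a real crawler would use an HTML parser.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    text = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def drop_duplicates(pages):
    """Keep the first URL seen for each distinct fingerprint."""
    seen = {}
    for url, html in pages:
        seen.setdefault(content_fingerprint(html), url)
    return list(seen.values())

pages = [
    ("https://original.example/page",
     "<html><style>body{color:red}</style><p>Hello  world</p></html>"),
    ("https://mirror.example/page",
     "<html><style>body{color:blue}</style><div>Hello world</div></html>"),
    ("https://other.example/page", "<p>Something else</p>"),
]
unique_urls = drop_duplicates(pages)
```

Keeping the first URL seen means crawl order decides which copy wins, so it helps to crawl the most trusted sites first.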

Think about the sites you visit regularly, and about those that regularly turn up in searches. How many of those useful sites are below the top 100,000? Does it matter if there are 100 million+ domains when 99.9% of your needs are covered by the top 0.1%? With a smaller index, choosing what you leave out is pretty important.

There’s a site I like to play with when trying to find obscure results: Million Short. It’s fun to experiment with, and it helps you understand how much the size and quality of your index affect your results.

Setting Up a Redash Dashboard

This was originally posted on wbsrch.com. It is reproduced here to preserve history.

The more WbSrch evolves, the more it becomes necessary to keep track of a bunch of metrics.

Until now we’ve been using a mix of simple report pages and raw SQL queries. It has worked well enough, but not having a clean way to track things in a single place is a nuisance.

That’s why I was happy to discover the redash.io open source project. It’s a query tool meant to be used for setting up business intelligence dashboards and it works with a wide range of databases.

No stranger to code, I tried to check out the GitHub source and get it running on my local machine. It didn’t quite work out. They have a bootstrap script, and it had some trouble with my particular system setup (it fell over when it came to configuring local database users).

But they also have EC2 AMI images you can launch to get running in AWS. I fired up an Amazon micro instance on the free tier and had the app running in seconds. It only took some minor configuration to get set up with my SSL certificate, and I was ready to go.

Adding a Database Connection to Redash

Connecting my three PostgreSQL databases was easy and the clean interface made it easy to find the query editor. After running a few queries I had the feel for how things worked well enough to save them. It also lets you set a refresh interval on your queries so you can have data refresh daily, hourly, or whatever. Results are cached so you’re not taxing your database gathering totals every page load.
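The refresh-interval idea is easy to picture as a small time-based cache. This is a sketch of the general pattern, not Redash’s actual implementation; the CachedQuery class and the pages table are hypothetical, and sqlite3 stands in for PostgreSQL so the example is self-contained.

```python
import sqlite3
import time

class CachedQuery:
    """Run a SQL query at most once per refresh interval and reuse the rows."""

    def __init__(self, conn, sql, refresh_seconds, clock=time.monotonic):
        self.conn = conn
        self.sql = sql
        self.refresh_seconds = refresh_seconds
        self.clock = clock
        self._rows = None
        self._fetched_at = None
        self.executions = 0  # how many times the database was actually queried

    def rows(self):
        now = self.clock()
        stale = (self._fetched_at is None
                 or now - self._fetched_at >= self.refresh_seconds)
        if stale:
            self._rows = self.conn.execute(self.sql).fetchall()
            self._fetched_at = now
            self.executions += 1
        return self._rows

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, lang TEXT)")  # hypothetical table
conn.executemany("INSERT INTO pages VALUES (?, ?)",
                 [("https://a.example/", "de"), ("https://b.example/", "so")])

pages_per_lang = CachedQuery(
    conn,
    "SELECT lang, COUNT(*) FROM pages GROUP BY lang ORDER BY lang",
    refresh_seconds=3600,  # hourly refresh
)
first = pages_per_lang.rows()
second = pages_per_lang.rows()  # served from the cache, no second query
```

Every dashboard view between refreshes reads the stored rows, so an expensive totals query runs once per interval instead of once per page load.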

Redash Query Editor

After you have a few queries, you can start adding them to a dashboard as panels. You just select the query name, the visualization type (you get table by default, but can add graphs and charts in the query builder), and the widget size.

This is a dashboard that I built to keep track of the search traffic and index state for the Somali-language version of WbSrch:

Redash Dashboard Example

I created dashboards for each supported language plus an overall meta-dashboard. It was fairly quick, taking about a day to set up 35 dashboards and about 200 queries.

Luckily the interface is pretty good, because once you have the software set up, that’s where the documentation ends. You can figure out most things with experimentation (trial-and-error), but it would be very helpful to have a few getting-started tutorials, or at least an explanation of how the various visualizations work.

A micro EC2 instance may stumble if you have some large queries (selecting an entire table is a bad idea, don’t do it), or a lot of things refreshing, but it kept up pretty well.

WbSrch, the Independent Search Engine, Expands

Reprint of a press release originally published on PRWeb at https://www.prweb.com/releases/2015/11/prweb13073007.htm.

WbSrch, an independent search engine based in Oregon, has expanded its data center, growing from a single dedicated server to three.

Founder Jason Champion had this to say about the expansion:

“We’ve grown enough that a single server no longer meets our needs. Tripling our footprint will allow us to continue growing and improving our algorithms throughout 2016.

We chose colocating servers over cloud hosting like EC2 because we’re running a very RAM-intensive operation. Most cloud hosting companies make their profits on memory, so once you go beyond a certain point, it’s much less expensive to host your own servers.

Our hosting provider, Opus Interactive, has been great. Their price, quality of service, and reliability have enabled us to continue improving our algorithms without worrying too much about infrastructure costs, and they have plenty of rack space available if we need to grow quickly.”

Started in 2013 and launched in 2014, WbSrch is still quite young, with an index of 10 million pages grouped into more than 30 languages. Traffic has been steadily increasing as the index grows and algorithms improve.

Rather than trying to crawl and index every page on the web, WbSrch aims to build quality results by weighing what is excluded just as heavily as what is included. More than 500,000 domains have been excluded, based primarily on content language and a few other criteria. The number of crawled pages, indexed keywords, and excluded sites is published on the site.