Yearly Archives: 2016

Interesting But Not A Business: The Story of the WbSrch Search Engine

I write this after having just shut down my almost-startup, the WbSrch search engine.

I started working on WbSrch for “fun” in the fall of 2013. AltaVista, my favorite search engine from “back in the day” had shut down that summer. Nostalgia combined with annoyance at how bad/annoying/intrusive/evil Google had become convinced me to try building my own version of AltaVista.

Well, a month of hacking later I had the core of something rudimentary but sort-of-functional. It was pretty terrible, but proved that I could get something built. I crawled a total of about 200,000 pages and had a bare skeleton of a search engine. I started by calling it the “anti-social search engine” because at the time, searching Google for almost anything would return so much social media drivel, clickbait garbage, and otherwise low-value spam-like content.

Getting to the first prototype was easier than I expected, so I continued to work on it, improving the crawler and search algorithms and growing the index. At around 2 million pages it outgrew the Linode VPS it was on and I set up hosting at a local colocation center using a $400 server I picked up on eBay (great deal – dual quad-core Xeons and 72GB of RAM – plenty to grow with).

Things progressed and I ended up announcing it to the public around the end of May 2015. It only had 5 million pages and the indexing algorithms were still pretty terrible, but it started getting some Human traffic.

And the bots discovered it. Every link analyzer SEO app in the world decided that WbSrch was a juicy crawl target. I considered blocking them since the SEO industry is complete garbage, but they were a decent source of Human traffic, and most of the traffic came from webmasters who would check to see whether their sites were indexed and run a few searches.

As it grew, maintenance became more time-consuming. I wanted to keep it from being too porn-heavy or full of Chinese and Russian sites, and I wanted pages categorized by language so the German-language front-end would only return pages in German.

After a year and a half of running the site as a hobby project, I decided to put it away because the mission had been accomplished – an index of 10 million pages, and it worked about as well as any other mid-1990s search engine. For a few months the front page just pointed to Yandex.com (a Russian search engine – the third-best search engine after Google and Bing).

Well, one unanswered question kept nagging at me: “What if I could turn this into a real business?”

So I turned it back on, and started working on it pretty hard. I read a bunch of textbooks on information retrieval, text processing, statistical language processing, and a bunch of other search-related topics that I knew nothing about when I started the project.

Almost immediately the drives in the server failed.

So I replaced them and rebuilt the server. Didn’t lose much other than a week or two of crawl data because I had a backup, and source control. A few months later the drive controller failed, but there was no data loss – just a day of downtime and a lot of swearing.

As it grew and improved, I also started running some advertising, trying to build the audience and increase traffic. Visits were pretty cheap, but not very sticky. The bar is extremely high for getting someone to switch to a new search engine.

Still, there was some traffic, on the order of a consistent 5-digit number of pageviews per month. So I tried monetizing using a few different ad networks (around ten). The best one was able to earn about $3 per month. When you can’t use the ad networks that won’t let you link to porn, gambling, or torrent sites, and you don’t want to advertise porn, gambling, or torrents yourself, your income is low. Abysmal. $0.05 CPM on the high end.

With the math for getting new visitors figured out (3-7 cents per click depending on the channel), and the math for monetizing those visitors figured out (about 5 cents for every 500 visitors), it was clear that it would cost $5 to earn $0.01. If those users were really sticky and would return over and over again, then maybe it would be worth the price. But they weren’t.
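The arithmetic above can be checked with a short sketch. The function names are hypothetical, and the break-even calculation assumes every return visit earns the same as a first visit, which the figures above don't confirm:

```python
# Rough unit economics using the figures quoted above.
# Assumption: revenue per visit is constant across first and return visits.

REVENUE_PER_VISITOR = 0.05 / 500  # about 5 cents for every 500 visitors


def spend_to_earn(target_revenue, cost_per_click, revenue_per_visitor):
    """Dollars of advertising needed to earn target_revenue from new visitors."""
    visitors_needed = target_revenue / revenue_per_visitor
    return visitors_needed * cost_per_click


def visits_to_break_even(cost_per_click, revenue_per_visitor):
    """Total visits one acquired user must make to repay their click cost."""
    return cost_per_click / revenue_per_visitor


if __name__ == "__main__":
    for cpc in (0.03, 0.05, 0.07):  # 3-7 cents per click
        spend = spend_to_earn(0.01, cpc, REVENUE_PER_VISITOR)
        visits = visits_to_break_even(cpc, REVENUE_PER_VISITOR)
        print(f"CPC ${cpc:.2f}: spend ${spend:.2f} to earn $0.01; "
              f"break even after about {visits:.0f} visits per user")
```

At 5 cents per click this reproduces the $5-to-earn-$0.01 figure, and it suggests each acquired user would need to come back hundreds of times before the click paid for itself.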

I also ran a crowdfunding campaign to gauge interest/demand. I raised a little money for hardware upgrades, but more importantly I learned a little more about how much people just don’t care about having another search option. I did manage to get one donation from someone I didn’t already know, but only one.

At this point I had about 47 million pages indexed and the search engine had grown to 3 servers. It had crawled only a tiny fraction of the internet, but it was still possible to find what you were looking for much of the time. It’s surprising how well you can do with a small index if you focus mainly on the most popular sites.

But to take things to the next level of quality I would need to build a system able to handle at least a billion pages.

That’s where things get expensive. I had only spent about $8,000 on WbSrch so far. Did I want to spend another $50,000 to get to that next level, where users might be a bit stickier and ad revenue might be better? (Ad revenue tends to be lower when you’re low-volume; when you have enough traffic that algorithms can optimize, it gets better.) Maybe it would only cost $2 to earn $0.01 and those users would return often enough that I could earn another $0.01.

And that’s where I decided to shut everything down. Math doesn’t lie.

Call it failure to validate. There is no search engine business to be had for me. Maybe someone else could do it. Like DuckDuckGo. Interestingly, they didn’t start with their own crawl. And they’ve partnered with Microsoft for advertising. So they’re essentially a privacy-focused variant of Bing with a different UI. That’s good for them, but the interesting part is developing your own proprietary technology, your own crawler and algorithms. Otherwise, one day Microsoft could decide that having an API is inconvenient and shut down your entire business.

At some point Apple will decide that having a search engine is important and build or buy one. Maybe they’re already building one if the rumors are to be believed.

Anyhow, it’s a little bit sad that it didn’t work, and a little bit sad that I spent all that time on it, but I did get smarter. And not just code. I learned a lot more about marketing and advertising in the process.

So now it’s on to the next thing.

(Player Profile) Michael Manring

Michael Manring is a pretty well-known soloist among bass players, but less so among mainstream music listeners.

He’s known for his custom fretless Zon Hyperbass guitar shown in this video performance/interview from Bass Player LIVE! 2013:

He was the youngest of four in a musical family. He took classes at the Berklee College of Music and studied with Jaco Pastorius. A very technical player, his style includes use of the e-bow, changing tunings mid-song, slapping, popping, muting, and two-handed tapping. To understand his style, it helps to know that he considers the bass guitar a very expressive instrument. He develops techniques that expand on that expressiveness, including quite heavy use of alternative tunings.

Much of Michael’s music could be considered instrumental “calm jazz” that is often filed as New Age or Adult Alternative, but he has a variety of styles and sometimes plays loud, upbeat, bouncy, funky music. He considers his work to be genre agnostic and doesn’t worry about fitting into any particular category.

He is a prolific recording artist, with hundreds of collaborations and guest appearances with artists such as Alex Skolnick, Montreux, Jeff Loomis, and Paolo Giordano, thanks in part to his role as house bassist with Windham Hill Records. He has also released a number of solo studio albums.

Original solo work (links to Amazon):

1986 Unusual Weather
1989 Toward the Center of the Night
1991 Drastic Measures
1994 Thonk
1995 Up Close 21
1998 The Book of Flame
2005 Soliloquy

(Video) Davie504 Plays 100 Amazing Bass Lines

“Davie504” is an Italian bass player with a great selection of YouTube videos.

In this video he plays 100 famous bass riffs in 13 minutes:

There’s also a sequel “100 Amazing Bass Lines 2” that will show up as a related video.

It includes riffs from a nice variety of bands, though it does have a heavier selection of Red Hot Chili Peppers and Jamiroquai.

If you want to learn a particular riff, you can click the gear on the player and change the video player to half speed. That’s a great feature of YouTube that not everyone knows about.

Quora Answer: My son got an offer from a 1-year-old startup by some very senior folks from Google. The pay is good and product idea is good, but it’s a startup. He asked for our advice on this. What are some suggestions from people from relevant fields?

I originally wrote this as an answer to a question on Quora.

I joined a startup founded by ex-Google people in 2010 as the third engineer. I worked there for 2 years, had a great time, learned a ton, got my first patent out of the process, and the experience is one of the best I’ve ever had. I wouldn’t trade it for anything.

The only reason I didn’t stay is that California wasn’t for me, but the startup is still going, and it feels like my shares are a pocket full of scratch-off lottery tickets. They could end up failing, but they won’t.

1-year-old is probably the best time to join a startup with solid backing, and even though being ex-Google doesn’t necessarily make you smarter or more likely to succeed than anyone else, it sure makes it easier to get funded.

In the worst scenario, they’ll run out of money and he’ll have to spend three or four weeks looking for a job. If he has any skill and/or motivation, he’ll be so much smarter and better for the experience that it’ll be really hard to NOT get hired for something new.

That’s the big fear fallacy with startups — that they’ll fail and you won’t be able to get a job. But all of the things you go through in making a serious go of building a startup make you so much better at what you do that there’s nothing to fear. I like to think of the experience factor as being 2 to 1. If you spend 2 years in a startup it’s a similar learning experience to spending 4 years at a stable company.

The fatigue factor is also similar. I feel like software can be a very mentally taxing and draining endeavor, so a 6-month sabbatical every 5 years is almost a must unless you’re in a perfect environment. Being in a startup shortens that “wear out” time, so you might need to take a break every 2-3 years in order to remain healthy and sane.

The most important thing to do if you’re considering joining a startup is to make sure the people running it are somewhat-well-adjusted grownups, or at least likely to become so in the near future.

Quora Answer: What is the minimum number of pages a modern general search engine would have to index to be useful?

I originally wrote this as an answer to a question on Quora.

“Useful” is a very subjective term. People who frequently ask deep and complex language-and-algorithm-specific software engineering questions will require different levels of depth than those who travel a lot and just want to find good prices on airfare and the top 5 restaurants and hotels in each city.

If you have /really/ good algorithms, you can build a search engine that is “good” for people who don’t require much depth with about 100 million pages. For people who require depth, you could probably be pretty useful at about 1 billion pages.

This depends HEAVILY on what you choose to include and exclude. Are these pages all in a single language? Or is this just 100 pages each from the top 1 million sites regardless of language and content?

Even though the web is phenomenally huge, much of it is duplication and/or computer-generated spam. There are millions of sites that are just scrapes/dumps of other sites (especially Wikipedia) and indexing 1000 copies of Wikipedia with different CSS isn’t going to get you very far.

Think about the sites you visit regularly, and about those that regularly turn up in searches. How many of those useful sites are below the top 100,000? Does it matter if there are 100 million+ domains when 99.9% of your needs are covered by the top 0.1%? With a smaller index, choosing what you leave out is pretty important.
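To make the numbers above concrete, here is a small sketch of the page-budget arithmetic. The function names are hypothetical, and the last calculation (spending the whole budget on the top 100,000 sites) is an illustrative assumption rather than a recommendation:

```python
# Back-of-the-envelope index sizing using the figures mentioned above.

def index_size(sites, pages_per_site):
    """Total pages if every site contributes the same number of pages."""
    return sites * pages_per_site


def sites_in_top_share(total_domains, share_percent):
    """How many domains fall in the top share_percent of all domains."""
    return round(total_domains * share_percent / 100)


def pages_per_site_for_budget(page_budget, sites):
    """Pages available per site when a fixed budget is split evenly."""
    return page_budget // sites


if __name__ == "__main__":
    # 100 pages each from the top 1 million sites
    print(index_size(1_000_000, 100))
    # the top 0.1% of 100 million domains
    top_sites = sites_in_top_share(100_000_000, 0.1)
    print(top_sites)
    # the same 100-million-page budget spent only on those top sites
    print(pages_per_site_for_budget(100_000_000, top_sites))
```

The same 100-million-page budget is either a shallow crawl of a million sites or a roughly ten-times deeper crawl of the top 100,000, which is the trade-off involved in choosing what to leave out.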

There’s a site I like to play with when trying to find obscure results. It’s fun to experiment with, and it helps you understand how much the size/quality of your index affects your results: Million Short