Machine Learning Newsletter

Mining a VC

Fred Wilson is one of the most popular VCs(rightly so) who is based in NY and he has a really great blog, where he talks about pretty much anything with a focus on startups, venture capital and technology. No surprise here.

His first post dates back to 23rd September 2003. He has been blogging for more than 11 years.

How come?

I have started at Axial recently and doing topic modeling to improve our search and recommendations as well as to build some other features that are driven by documents and text.

What gives?

To see some of the topics over time as his posts are somehow topical and could shed some light on hot industries as he also uses his blog as a reach tool for entrepreneurs. This of course follows their investments as well, if he is writing on a particular industry(bitcoin, mesh networks), chances are either they invested in some technology(bitcoin) or looking for companies(mesh networks) that they wanted to invest.

What does a VC write anyway?

Mostly Venture Capital presumably, but also introduces his portfolio companies, tries to raise awareness of new and cool technologies(bitcoin) and talks about his professional and personal life. That being said, there is no quantitative measure which topics he focuses on and how much. By measuring the topic frequency and variation over time, you could get a sense of which topics are actually trending as well. Scroll for that a bit.

What did you use?

All of his blog posts from start(09, 2003) to 31st December 2014(including). He writes a post per day and occasionally he may repeat the some of his posts (e.g. when he is on vacation), but other than posts should be unique and mostly they are medium-long length articles which is great for topic modeling. Earlier posts have the same date and I filter out the posts so that every date would have only a single post as I do not want to give biases of dates that have more than 1 post. I also filtered out the documents that have no words in it(some of the posts have only images apparently, #tumblrstyle).

Ok, got it. Topics, but what are they?

You could think of topics as specific word groups that are high likely to occur together. The algorithm(LDA) makes plausible and practical assumptions in text, and topics end up being logically plausible word groups that you could get the what the gist of those documents, it may seem like a keyword extraction tool, but not really.

  1. Burn Rate: business cash year market bn money revenue financial flow company rate revenue markets stock income balance mm costs interest capital time years pay price post numbers asset
  2. Venture Capital: venture capital business companies investment money vc investors fund entrepreneur years company market invest firm portfolio investing good early firms funds year entrepreneur deals time partners investor
  3. Company - Team: company team product companies job people business work startup jobs portfolio building build management person hire culture engineering great make ceo talent customer organization startups employees founder process founders
  4. News: news times media read post jason york hes jeff internet story wall ny street tom world john piece interesting content good steve great jarvis david techmeme
  5. Search - Internet: google search yahoo data traffic internet number comscore web mm users audience top services page month microsoft service day googles visits results market blog past share chart year
  6. Family: gotham gal great kids day family originally uploaded josh back night today park morning photo nyc friends home fredwilson street place ride city jessica west fun year yesterday side
  7. Social Media: twitter facebook social people service howard friends foursquare tumblr media users lindzon myspace services mr web apps zynga app follow networking top profile network build game tweet
  8. Internet Regulation: internet patent government industry health rules innovation patents bill companies public care act policy net law issue fcc spectrum healthcare system technology broadband congress access cable startup issues reform
  9. Music: music radio itunes listen rhapsody ipod lastfm listening audio mp play sonos service digital internet services player song streaming system buy download podcast apple machine file free playlist home
  10. Ads - Business Models: ad advertising business online model free content media paid revenue service pay etsy money network sell market internet buy customers marketing run cost make networks company adsense tacoda
  11. Multimedia: video tv radio youtube game boxee watch content hd digital videos show play watching games shows big watched internet channel satellite today player netflix cable media jets night broadcast
  12. Email: email spam mail emails gmail inbox return send aol reply messaging service path problem reputation list spyware messages message address called delete bad sender account aim priority credit filter
  13. Politics: bush country obama hes president vote political kerry war iraq debate election world party america politics policy win today democrats plan government state american system clark republican campaign house
  14. Technology: web services open data internet world companies technology software market business build model network company built big system users platform service networks large important source information interesting post building
  15. Valuation: company stock price equity deal options financing mm sale investors employees employee option valuation transaction companies term debt common sell buyer public pay bridge founders round business preferred series
  16. Culture: book read reading books story seth art technology steven tom writing revolution published film society science life written early culture chapter words world write chris digital kindle inspired age
  17. Company - Management: board company ceo meeting business management meetings boards matt make entrepreneurs companies process vc ceos team work people important advice entrepreneur face members person good time plan investors
  18. NYC - Community: nyc school event students education year class community talk project tech program donors choose women public pm schools startup high kickstarter meetup projects week heres entrepreneurs code learning group
  19. Live Music - Event: record song great music songs love show called band week records rock favorite live listen top year night list heard podcast mp album bob put listening guitar played version
  20. Traveling: nyc europe trip valley city san silicon flight world wine paris country london travel york francisco place startup china area day ny boston visit wifi plane power back countries
  21. Comments: post blog comment posts avc disqus community read union square yesterday ventures day posted blogging week discussion tumblr link web talk great morning past today heres brad blogs
  22. Mobile Apps: mobile phone app android google iphone blackberry apple device web phones devices browser home ipad screen wifi windows data desktop experience voice ios laptop feature love nexus os

If you quickly glance the words following the topics, you could immediately get a sense of the topic. The labels that precedes the words are the topic labels and I came up with. Some of the topics, they may not represent the topics very accurately but my hope is that they abstracted most of the content so that instead of writing 20-30 words all over the place, I could refer them via topic labels.

So, in total 22 Topics?

I originally put wanted to have 24 topics based on visual inspection on different number of topics, 24 seemed to produce nice coherent topics. However, I need to further process the topics because of the reason explained below. Long story short, I removed two topics based on they are not high quality. If you want more details, read the following section, if you do not, skip it.

What is the catch(tell me as if I am 5)?

If there are two problems(not automated sense, otherwise not problems really) in topic models; one of them is naming topics automatically and the other one is to determine the topic numbers to begin with. There are various measures to evaluate topic: coherence, corpus distance to evaluate the topic “quality”. I only took the topics that have high coherence and low corpus distances topics as valid topics. One could introduce extensive stop words to improve topics in general, but I choose to prune the invalid topics after I run. The following topic is the only one that gets pruned and you will understand why it is low coherence and have very small distance to corpus: time people good back make years thing work post things great day made long year past hard week big put find times world idea blog ago point end part

and another topic similar to the above one get filtered out due to its has low coherence and similar to corpus.

Which topic is most dominant?

Topic Frequency Percentage

Larger Image

Comments topic has also elements of community and his venture firm(USV) as well. So, it is not very surprising. As he invests in tech companies and most of the stuff about portfolio companies also fall in the technology(e.g. bitcoin), it is not surprising that the second common topic is technology.

Most Representative Words, but what about posts?

I could also get the most representative individual posts as every post is a mixture of various topics. However, this does not mean the following documents only have a single dominant topic. It may have two or more significant topics, but the one that is associated with is the most significant topic. Otherwise, Burn Rate topic is somehow related to Valuation topic as they have overlapping works and phrases. I could also get the documents that are focused individual posts that are all about one topic. Here are the most representative posts for each topic:

  1. Burn Rate: Burn Rate (I know, it is amazing)
  2. Venture Capital: The Difference Between Small Funds and Large Funds
  3. Company Team: VP Engineering vs CTO
  4. News: Exploding TV
  5. Search - Internet: Microsoft Search Continued
  6. Family: Easter Sunday
  7. Social Media: Building Better Social Graphs
  8. Internet Regulation: The SHIELD Act
  9. Music: Transatlanticism
  10. Ads - Business Models: Feature Friday: Etsy Gift Cards
  11. Multimedia: Exploding Radio (continued)
  12. Email: The Spam Crisis is Over
  13. Politics: Jeff Jarvis on Clark
  14. Technology: Data Integration
  15. Valuation: Restricted Stock vs Options When We Are “Under Water”
  16. Culture: VC Cliche of the Week
  17. Company - Management: The Executive Session
  18. NYC Community: SHIP @ St Joseph’s – A Summer Coding Program In NYC
  19. Live Music: Joss Stone
  20. Traveling: Je change de langue
  21. Comments: MBA Mondays: Next Topics
  22. Mobile Apps: Feature Friday: Mobile Data Usage Tracking

Topic Changes over Time

I could look at the topic distribution of the documents over time in order to measure if topics are trending or not. However, the trend for the topics are far more interesting as I am not necessarily interested in the topic proportions on a daily basis but rather medium term. Luckily, there are a lot of different ways to do trend estimation. I will use Hodrick-Presscott Filter in order to get the medium range trend signal.

Winner Topics

Some of the topics are unsurprising winner like Internet Regulation and Mobile Apps but do you find any suprising/interesting topics?

Internet Regulation

Internet Regulation

Larger Image

Net neurality hit some nerves right there.

Mobile Apps

Mobile Apps

Larger Image

Mobile apps do not seem to even flatten in the near future.

NYC - Community

NYC Community

Larger Image

Making money is good, but so is community.

Technology

Technology

Larger Image

Even though he says he does not invest in technology but networks or the things that technology enables, apparently he cannot seems to write anything but technology.

Comments

Comments

Larger Image

This topic is one of the most surprising topic, yes USV invested in Disqus, they care a lot about community, audience but it feels to me it shouldn’t be this much, but apparently comments topic is big.

Loser Topics

News, Email and Live Music are the obvious losers. Politics is big one as well.

Email

Email

Larger Image

Email is not fun anymore, not even fun to complain about. So 2000.

News

News

Larger Image

Live Music

Live Music

Larger Image

Music

Music

Larger Image

Politics

Politics

Larger Image

Not so winner, but not so loser either topics

Ads - Business Models

Ads Business Models

Larger Image

Burn Rate

Burn Rate

Larger Image

Company Management

Company Management

Larger Image

Company Team

Company Team

Larger Image

Culture

Culture

Larger Image

Family

Family

Larger Image

Multimedia

Multimedia

Larger Image

Search

Search

Larger Image

Social Media

Social Media

Larger Image

Note that USV invested in Tumblr at then end of 2008, they invested in Twitter, Foursquare in 2009.

Traveling

Traveling

Larger Image

Valuation

Valuation

Larger Image

Venture Capital

Politics

Larger Image

comments powered by Disqus