Machine Learning Newsletter

PyData Silicon Valley 2014


After every Python-related event, I keep telling myself that people who use other languages have no idea what they are missing.

Yes, we have the GIL (Global Interpreter Lock); yes, Unicode is painful, especially if you do a lot of text-related work; yes, Python is slow; yes, static typing might prevent many of the bugs I (and others) write. Yet the Python community keeps growing day by day, and last week's PyData SV 2014 was no exception.

I attended all three days (tutorials + conference), and perhaps not surprisingly most of the talks, apart from the tutorials, were about tooling and data infrastructure. Even if web development in Python were to decline over time (we have no reason to believe it will), Python looks set to be a big player in data for the foreseeable future. Even in "Big Data", people who want to write Python can get away with writing Python, thanks to a bunch of Python ports and bindings. This is a huge gain for data processing in Python in general.

Most of the Sunday talks, especially before noon, were about how to speed up Python. Most people do not want to write Javascript but need it in the browser, so they came up with a bunch of languages that compile to Javascript: Typescript, Coffeescript, Dart, Clojurescript and so on. For Python, most people do want to write Python, so they came up with a bunch of interesting ways to speed it up: static compilers, JITs, numba, parakeet and so on. However, it is time to admit one of Python's disadvantages: it is slow. Otherwise, people would not have to come up with all these solutions to work around that limitation. Similarly, if Javascript were a nice language, people would not have invented all those languages in order not to write Javascript to write Javascript. (This sentence makes sense when you read it twice.) Before provoking a language war, which Python would definitely win, let's look at the notes I took at the tutorials and talks.

Friday (Tutorials)

IPython 2.0

Presenters: Brian Granger and Jonathan Frederic

They introduced IPython and the IPython notebook at a beginner level and then showed what changed in 2.0. I was using IPython 1.0 prior to this event and am now using IPython 2.0, which brings a modal structure to the table. If you are a Vim person like me, that is just perfect: the edit and command modes are very similar to Vim's. Directories are also easy to navigate in the IPython notebook. If the modal environment is the first big change in IPython 2.0, the second is interactivity: the @interact and @interactive decorators provide an interactive environment on top of the IPython notebook.
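
To give a flavor of that interactivity, here is a minimal sketch (my own, not from the tutorial); note that in IPython 2.x the widgets live under IPython.html.widgets, while later versions moved them to ipywidgets:

    # A minimal interactivity sketch: a slider re-draws a sine wave.
    # IPython 2.x import path; later versions use `from ipywidgets import interact`.
    from IPython.html.widgets import interact
    import numpy as np
    import matplotlib.pyplot as plt

    @interact(freq=(1, 10))
    def plot_sine(freq=1):
        x = np.linspace(0, 2 * np.pi, 200)
        plt.plot(x, np.sin(freq * x))
        plt.show()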

Some things I did not like: adding a cell above the current cell can only be done with a keyboard shortcut (a, while not in edit mode); you cannot do it from the UI. This is not a disadvantage per se for people who already use the notebook, but I would expect some beginners to have quite a difficult time figuring out the modal structure. Considering the Python 2 to 3 transition, I wonder whether the project is sacrificing ease of entry in order to empower current users. (Some people ended up learning Ruby because of that incompatibility problem.)
From my point of view, if I had not attended this tutorial, my switch to the IPython notebook might have been delayed even further. Now, quite happily using most of the shortcuts, the notebook feels a lot like Vim. (If only the shortcuts were the same! We should have :w, or at least /, seriously.)

IPython Notebook

If data science has anything that sets it apart from software engineering, it is the IPython notebook. Software engineers live on raw code, whereas data scientists, social scientists and experimentalists need richer representations to understand their experiments, their conclusions and what the data tell them in general. Software engineers can get away with code and comments, whereas these other people need paragraphs, images, audio, video and graphs for their experiments. They also need to communicate their findings to other people (not necessarily technical people). The IPython notebook fills this gap quite successfully. It is so successful that most of the presenters (in both tutorials and talks) opened an IPython notebook or referenced one during their talk. If an ecosystem needs a great environment to live in, data people in Python have one thanks to IPython and the notebook.

Gradient Boosted Regression Trees in scikit-learn

Presenter: Peter Prettenhofer

Peter showed how awesome gradient boosted regression trees are. Apart from their base functions (piecewise constant functions), which I did not like one bit, they have quite nice properties: they are robust to noise and handle categorical features and differently scaled features with ease. His IPython notebook covers not just regression trees but also the low bias-high variance and high bias-low variance trade-off (see the deviance plot in the notebook) and parameter search to optimize the hyperparameters, all in Scikit-learn. I took (read: stole) one of his ideas: when a robust nonparametric model learns a function over the observations, one does not need to worry about outliers and noise. This idea could be exploited for outlier detection as well. Instead of framing the problem as outlier detection, we could design a robust regression model and exclude the points (outliers) to which the model assigns low probability. I used Gaussian processes to do exactly that in this post. He put up all of the tutorial material (slides + notebook) here.
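
For reference, here is a minimal sketch (not Peter's notebook) of fitting gradient boosted regression trees in scikit-learn and computing the test error after each boosting stage, which is what a deviance plot is built from:

    import numpy as np
    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import GradientBoostingRegressor

    # Synthetic regression data, split by hand into train/test.
    X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
    est = GradientBoostingRegressor(n_estimators=500, max_depth=3,
                                    learning_rate=0.05, random_state=0)
    est.fit(X[:800], y[:800])

    # staged_predict yields predictions after each boosting stage.
    test_error = [np.mean((y[800:] - pred) ** 2)
                  for pred in est.staged_predict(X[800:])]
    print(min(test_error), np.argmin(test_error))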

Designing and Deploying Online Experiments with PlanOut

Presenter: Eytan Bakshy

Recently, Facebook open-sourced their online experiment framework, PlanOut. If you read the paper behind the framework, it is a response to many of the shortcomings of online experiments.

Separating the experiment from production code is the biggest advantage, and firing up different experiments on different timelines for different "segments" also seems quite nice. With Python syntax and what it provides out of the box, it seemed quite easy to use, but I need to experiment a little to see its advantages in practice. Some features (the GUI) are missing from the open-source release, even though the paper shows some implementation details around them.
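
To give an idea of the syntax, here is a minimal sketch along the lines of the example in PlanOut's README; the experiment itself (a button color test) is hypothetical:

    from planout.experiment import SimpleExperiment
    from planout.ops.random import UniformChoice

    class ButtonColorExperiment(SimpleExperiment):
        def assign(self, params, userid):
            # Deterministic, hash-based assignment keyed on the user id.
            params.button_color = UniformChoice(
                choices=['#3c539a', '#5f9647'], unit=userid)

    exp = ButtonColorExperiment(userid=42)
    print(exp.get('button_color'))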

He put up all of the IPython notebooks here; if you get the source code of PlanOut, you have this material as well.

Know Thy Neighbor: An Introduction to Scikit-learn and K-NN

Presenter: Portia Burton

She gave an overview of what machine learning is and introduced Scikit-learn. She also gave an overview of the k-nearest neighbors method and showed a demonstration on the Iris flower data set.
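
A minimal sketch of that kind of demonstration (my own, not her notebook):

    from sklearn.datasets import load_iris
    from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions
    from sklearn.neighbors import KNeighborsClassifier

    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=0)

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    print(knn.score(X_test, y_test))  # classification accuracy on held-out data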

She put up all of the material(slides+notebooks) here.

K-means Clustering with Scikit-Learn

Presenter: Sarah Guido

She put a strong emphasis on parameter selection for K-Means, since parameter selection for unsupervised learning algorithms is ill-defined. She introduced the silhouette score, which I did not know before; it relates intra-cluster variance to inter-cluster variance. She put up an IPython notebook and slides.
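
A minimal sketch (not from her notebook) of using the silhouette score to pick the number of clusters in scikit-learn:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
    for k in range(2, 8):
        labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
        # Higher silhouette => tighter, better-separated clusters.
        print(k, silhouette_score(X, labels))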

These are the ones that I attended.

ggplot

I really wanted to attend the ggplot tutorial for Python as well, but I attended the PlanOut one instead. However, yhat did an awesome job of putting all of the notebooks on the web and on GitHub. The notebooks are quite amazing; they are not just about ggplot per se but cover a general workflow for data analysis, with an emphasis on exploratory data analysis. They also publish really nice IPython notebooks on their blog, which I highly recommend following if that type of data analysis seems interesting to you.
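
For a taste of the API, a sketch modeled on the library's own examples (assuming the meat dataset that ships with yhat's ggplot):

    from ggplot import *  # yhat's ggplot port; exposes ggplot, aes, geoms and sample data

    # Line plot of beef consumption over time, grammar-of-graphics style.
    p = ggplot(aes(x='date', y='beef'), data=meat) + geom_line()
    print(p)  # print() renders the plot outside the notebook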

Now, the talks. During the Saturday talks I was more sane and motivated; that motivation somehow decreased on Sunday. So, Sunday talks: it is not you, it is me. Let's get started.

Saturday Talks

Why Python is awesome for any scale data?

Presenter: Burc Alpat

This talk was about how and why Python should be used in data processing environments and how Facebook has adopted Python as one of its main languages. He also gave various pointers to other talks, mainly focusing on the strengths of Python; whenever Python does not feel like the best option, use something else and connect it to Python.

Why Python for Machine Learning?

  1. You want to get results quickly
  2. You will need lots of functionality
  3. C++
  4. DSL
  5. You cannot do it all by yourself.

Some pointers

Note that Dataswarm is not open source.

C/C++ and language interoperability

Build Tools

Dependency Management

These are mostly dependency management systems; if you do not want to deal with dependency hell, you may want to use one of them.

Pants seems quite useful. If you are interoperating with C/C++, get a proper build system instead of depending on venv and pip, which do not seem to work well with other languages.

DSL

  • DSLs can be useful as a bridge between different languages.
  • Dataswarm is a dependency graph description language.
  • Python is good for building DSLs as well.

A Full Stack Approach to Data Visualization: Terabytes (and Beyond) at Facebook

Presenter: Jason Sundram

This was purely a tooling talk in which Jason introduced different visualization libraries and described how Facebook uses visualization for its data. Their setup is quite interesting: they use Tornado as the framework and a bunch of different D3-based JS libraries that reveal multiple aspects of the data they have.

  • Fresh Data
  • More pixels
  • Make the data smaller

CrossFilter

In memory database

Some other Pointers

  • Tornado + gzip + csv
  • Cubism.js
  • Cubism.js Slides by Mike Bostock
  • Different touchpoints require asynchronous API calls; Tornado provides asynchronicity out of the box, which is very convenient.
  • DC.js is a multidimensional charting library in Javascript.
  • These tools are mostly for exploring multiple aspects of the data: not a specific part of it, but rather its different dimensions.
  • For geo-visualization: Leaflet.js
  • ZeroMQ for queueing.

Thrift came up in all of the Facebook talks; it must be very important for internal API communication between different services. Its polyglot nature provides a common structure through which different languages can talk to each other.

Summary

To make a successful "big data" visualization:

  • Big => small
  • Real-time
  • More Pixels

Dark Data: A Data Scientist's Exploration of the Unknown

Presenter: Rob Witoff

This was one of those "how we use Python" talks, with the IPython notebook unsurprisingly all over it. They built a private notebook viewer where people can share their notebooks across NASA, which sounds super useful. (First run that SQL query, then process it in such a way, then run the algorithm in directory x; sounds familiar? The notebook solves that problem.)

  • Even NASA depends on AWS for their data infrastructure. Maybe this is not that surprising.
  • Sharing data and visualizing it to build understanding was another theme of his presentation.

Visualization Tools and Pointers

Summary

  • Liberate your dark data
  • Enable your engineers
  • Grow data scientists

If we cannot learn from all of our data, we cannot know what lies beyond our imagination.

DataPad: Python-powered Business Intelligence

Presenter: Wes McKinney

He mainly talked about their product DataPad, a browser-based analytics tool. He also talked about some of the shortcomings of current data analysis tools.

How to speed things up?

  • Badger analytics engine. (it is not open source, though).
  • Purpose-built in-memory analytical query processor

What are they using?

  • Query routing
  • Load balancing
  • Comms: gevent + websockets
  • Compressed tabular data store
  • external data source connectors

Sentiment Classification Using scikit-learn

Presenter: Ryan Rosario.

In this talk Ryan gave an overview of what sentiment classification means and gave pointers to the state of the art, which is unsurprisingly a recurrent neural network architecture (so-called "deep learning"). He did not go into that detail; at Facebook they used Naive Bayes via the Scikit-learn implementation. Their approach: gather a lot of labeled data, use bag of words as features, apply chi-squared feature selection to choose the best features, and use Naive Bayes as the classifier. With that, they improved on the sentiment classification they had previously. He also talked a little bit about their data infrastructure, Dataswarm.
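
This is not Facebook's pipeline, but a minimal scikit-learn sketch of the approach described (bag of words, chi-squared feature selection, Naive Bayes) on made-up toy data:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline

    docs = ['love this product', 'great stuff', 'hate this product', 'awful experience']
    labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

    clf = Pipeline([
        ('bow', CountVectorizer()),        # bag-of-words features
        ('chi2', SelectKBest(chi2, k=5)),  # keep the most informative terms
        ('nb', MultinomialNB()),           # Naive Bayes classifier
    ])
    clf.fit(docs, labels)
    print(clf.predict(['great product']))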

Sentiment

  • Some information about the emotional state of the user.

Lexicon Based Approaches

  • Wordnet
  • Lexinet

Machine Learning Based Approaches

  • SVM and Naive Bayes are the most frequently used classifiers.
  • Lexical features such as quantifiers and negation
  • Counts of positive and negative words converted into a statistical measure.
  • Sarcasm and humor cannot be captured from the start

Deep Learning (State of the Art)

Sentiwordnet Adaptation

  • Opinion mining => looks at sentences in terms of positivity, negativity and objectivity
  • Lexical analysis is very slow; part-of-speech tagging is slow
  • Not very accurate either

Feature Selection

  • Chi-squared feature selection: for version 1 they used thousands of features and then extended to millions.

ROC curve to visualize the system's true-positive and false-positive rates

Python Ecosystem

  • Dataswarm, Presto and Hive

System Architecture

  • Dataswarm provides the training/test set split
  • Feature selection, training, the model, testing, then store the results in MySQL or a dashboard. Pretty straightforward workflow.

Summary

  • With a ton of self-labeled data, even a basic algorithm outperforms lexical parsing.

Using Python to Find a Bayesian Network Describing Your Data

Presenter: Bartek Wilczynski

He introduced a Bayesian network module for Python and mainly talked about Bayesian networks. The module seems quite easy to use from the command line as well, but as with most libraries, one needs to experiment with it.

What is a Bayesian Network?

  • DAG => Directed Acyclic Graph, without cycles
  • Nodes for representing random variables
  • Edges for representing dependencies
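
To make the definition concrete, here is a tiny hand-rolled sketch (not the presented module) of a two-node network and of computing a joint probability as the product of each node's probability given its parents:

    # Toy Bayesian network: Rain -> WetGrass, with conditional probability tables.
    network = {
        'Rain':     {'parents': [], 'cpt': {(): 0.2}},
        'WetGrass': {'parents': ['Rain'], 'cpt': {(True,): 0.9, (False,): 0.1}},
    }

    def prob(node, value, assignment):
        # P(node = value | parents) under a full assignment of all variables.
        parent_values = tuple(assignment[p] for p in network[node]['parents'])
        p_true = network[node]['cpt'][parent_values]
        return p_true if value else 1.0 - p_true

    def joint(assignment):
        # Joint probability factorizes over the DAG: product of P(X_i | parents(X_i)).
        result = 1.0
        for node, value in assignment.items():
            result *= prob(node, value, assignment)
        return result

    print(joint({'Rain': True, 'WetGrass': True}))  # 0.2 * 0.9 = 0.18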

How can we find the Best Bayesian Network?

  • Bayesian statistics (BDe)
  • Information theoretic (MDL)
  • Hypothesis testing

Dynamic Bayesian Network

Up and Down the Python Data and Web visualization

Presenter: Rob Story

The most interesting IPython notebook of the conference, hands down. He presented a comprehensive comparison of visualization libraries in Python; see his notebook below. mpld3 in particular seems quite interesting. However, no matter how hard Pythonistas try, if a visualization depends on the browser, and more importantly on interactivity, Javascript is nearly the only solution. So, during the presentation, I kept thinking that maybe we should not try so hard to turn Python into Javascript and should instead make peace with Javascript.

Seaborn

  • jointplot is very good if you want to visualize the joint distribution of two variables.
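
A minimal jointplot sketch on synthetic data:

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    x = np.random.randn(500)
    y = 0.5 * x + np.random.randn(500)
    sns.jointplot(x=x, y=y)  # scatter plus the two marginal distributions
    plt.show()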

Interactivity

  • If you want to create an interactive visualization, use @interactive.
  • Especially for switching between prior distributions in MCMC models, this is a gem.

Other Pointer

  • Leaflet is the library for all geospatial visualization; Jason mentioned the same library as well. It should be the lingua franca.

Data Analysis with SciDB-Py

Presenter: Chris Beaumont

He talked about SciDB, a database with built-in support for matrix operations. SciDB seems interesting: think of numpy arrays, but living on disk, which you can manipulate however you like. Since you can do linear algebra operations in the database, it seems to be a perfect fit for big data, as its memory usage is small; it does not have to load the whole matrix into memory.

Some notes

  • There is no Mac OS X support yet.
  • You can write numpy-like expressions against the database, although the syntax is a little different.
  • Being able to display data directly from the database is quite nice.
  • Github Repo
  • SciDB-Py

Sunday Talks

Speed without drag

Presenter: Saul Diez-Guerra

He introduced a number of libraries (static compilers, JITs, LLVM-based tools) to speed up Python. Of course we first need to write good code, and then try to compensate for the inherent slowness of Python with those libraries.
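
As one concrete example of the JIT approach (my own sketch, not from his slides), numba compiles a plain Python loop to machine code with a single decorator:

    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def array_sum(x):
        # The explicit loop is what the JIT turns into fast native code.
        total = 0.0
        for i in range(x.shape[0]):
            total += x[i]
        return total

    print(array_sum(np.arange(1e6)))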

Pointer

Pythran: Static Compiler for High Performance

Presenter: Mehdi Amini

This presentation was also about how to speed up Python. He introduced his library, gave promising benchmarks and showed some under-the-hood details to explain how they speed up Python. Since its coverage of the language is quite large (everything except introspection and a small number of modules), it can improve execution speed considerably.
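
For flavor, a sketch based on the dot-product example in Pythran's documentation: you annotate a plain Python function with an export comment and compile it ahead of time (treat the exact type annotation as an assumption):

    #pythran export dprod(float list, float list)
    def dprod(l0, l1):
        # Plain Python; the export comment above tells Pythran to compile it natively.
        return sum(x * y for x, y in zip(l0, l1))

Running `pythran dprod.py` should then produce a native extension module that can be imported like regular Python.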

Data Science at Berkeley

Presenter: Joshua Bloom

  • Core principles
  • Is data science an academic pursuit or a skillset to be trained? => nice question

Nice headings (not necessarily orthogonal to each other, but rather different aspects of tackling a problem):

Different Approaches

  • Data Driven or Theory Driven
  • Bayesian vs Frequentist
  • Parametric vs Nonparametric

They used the scipy.sparse package to speed up some of their computation, and PyMC for their research.
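
A minimal sketch of the kind of saving scipy.sparse buys: store only the nonzero entries and multiply without touching the zeros.

    import numpy as np
    from scipy import sparse

    dense = np.eye(1000) * 3.0       # mostly zeros
    A = sparse.csr_matrix(dense)     # compressed sparse row format
    v = np.ones(1000)
    print(A.dot(v)[:5])              # sparse matrix-vector product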

Pointers

Especially the Data Analysis Recipes paper is great: written in a sarcastic tone, it harshly criticizes current/traditional methods from a Bayesian point of view. I definitely need to take a look at the book as well!

Real-time streams and logs with Storm and Kafka

Presenter: Andrew Montalenti and Keith Bourgoin

They talked about how they parse logs with Python and introduced a library they wrote in house, streamparse. They also explained how they architected their system with Storm and Kafka.
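
Purely as a hypothetical sketch of the bolt abstraction (the import path and method names below are assumptions based on streamparse's documentation and may differ between versions, so this is not their code):

    from collections import Counter
    from streamparse.bolt import Bolt  # assumed import path; may vary by version

    class WordCountBolt(Bolt):
        def initialize(self, conf, ctx):
            self.counts = Counter()

        def process(self, tup):
            # Count each incoming word and emit the running total downstream.
            word = tup.values[0]
            self.counts[word] += 1
            self.emit([word, self.counts[word]])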

Pointers

How to build a SQL-based data warehouse for a trillion rows in Python

Presenter: Ville Tuulos

He talked about how they built a SQL data warehouse for big data in Python. He also gave some pointers to the visualization libraries they are using. Some portion of the talk was about compression and the efficiency of compression using different methods.

Pointers

  • Slides
  • Low-latency, in-memory, fully SQL-compliant data warehouse
  • Backbone.d3
  • They used an information-theoretic approach to compress the signal.
  • Efficient encoding is what enables the compression. Also, the data is somewhat sparse.
  • It looks very much like a numpy operation (very interesting how it overlaps with SciDB)
  • Efficient compression library
  • Run-length encoding
  • Probabilistic Data Structures
  • Huffman Encoding

It seems very interesting to me that methods used for video and image compression are applicable here, to data that is not necessarily multimedia.
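
As a tiny illustration of run-length encoding, one of the techniques listed above (a generic sketch, not their implementation):

    from itertools import groupby

    def rle(values):
        # Collapse runs of identical values into (value, run_length) pairs.
        return [(v, sum(1 for _ in group)) for v, group in groupby(values)]

    print(rle([0, 0, 0, 0, 7, 7, 0, 0, 0]))  # [(0, 4), (7, 2), (0, 3)]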

Outlier Detection in Time Series Signals

I gave this presentation and talked about outlier detection using median filtering, the Fast Fourier Transform (read: Discrete Fourier Transform) and Markov Chain Monte Carlo.
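
A minimal sketch of the median-filtering idea: flag points that sit far from the rolling median of the signal (the threshold here is arbitrary).

    import numpy as np
    from scipy.signal import medfilt

    signal = np.sin(np.linspace(0, 10, 200))
    signal[50] += 5.0                       # inject an outlier
    residual = np.abs(signal - medfilt(signal, kernel_size=7))
    outliers = np.where(residual > 3 * residual.std())[0]
    print(outliers)                         # indices flagged as outliers, e.g. [50]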

Pointers

Dataswarm

Presenter: Mike Starr

He talked about the Dataswarm framework and how they are using it, mainly focusing on data infrastructure and communication between services.

Pipeline

  • Actions
  • Event Logging
  • Data Pipelines
  • Reports / Apps

Dataswarm

  • Dependency graph description language (similar to Luigi)
  • The dependency description between tasks is much more concise than Luigi's; Luigi is quite verbose compared to this type of dependency description.