PyData Silicon Valley 2014




May 12, 2014 · Pydata

After every Python-related event, I keep telling myself that people working in other languages have no idea what they are missing.

Yes, we have the GIL (Global Interpreter Lock); yes, unicode is painful, especially if you do a lot of text-related work; yes, Python is slow; yes, static typing might prevent many of the bugs I (and others) write. Yet the Python community keeps growing day by day, and last week's PyData SV 2014 was no exception.

I attended all three days (tutorials + conference), and perhaps not surprisingly most of the talks, apart from the tutorials, were about tooling and data infrastructure. Even if web development in Python were to decline over time (we have no reason to believe it will), Python seems set to be a big player in data for the foreseeable future. Even in "Big Data", people who want to write Python can get away with doing so thanks to a bunch of Python ports and bindings, which is a huge gain for data processing in Python in general.

Among the Sunday talks, most of them, especially before noon, were about how to speed up Python. Most people do not want to write Javascript but need it in the browser, so they came up with a bunch of languages that compile to Javascript: Typescript, Coffeescript, Dart, Clojurescript and so on. Likewise, most people want to keep writing Python, so they came up with a bunch of interesting ways to speed Python up: static compilers, JITs, numba, parakeet and so on. However, I think it is time to admit one of Python's disadvantages: it is slow. Otherwise, people would not have to come up with these solutions to overcome its limitations. Similarly, if Javascript were a nice language, people would not have invented all those languages just to avoid writing Javascript by hand. Before provoking a language war, which Python would definitely win, let's look at the notes I took on the tutorials and talks.

Friday (Tutorials)

IPython 2.0

Presenters: Brian Granger and Jonathan Frederic

They introduced IPython and the IPython notebook at a beginner level and then showed what changed in 2.0. I was using IPython 1.0 prior to this event and I am now using IPython 2.0, which brings a modal structure to the table. If you are a Vim person like me, that is just perfect: edit and normal modes work very much like Vim. Directories are also easy to navigate from the notebook. If the modal environment is the first big change in IPython 2.0, the second is interactivity: the @interactive and @interact decorators provide an interactive environment on top of the IPython notebook.
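
Here is a minimal sketch of the @interact decorator as it is typically used in a notebook cell; the slider ranges and plotting function are my own illustration, not the tutorial's code. In IPython 2.x the widgets lived under IPython.html.widgets; they later moved to the separate ipywidgets package.

```python
import numpy as np
import matplotlib.pyplot as plt
from IPython.html.widgets import interact  # `from ipywidgets import interact` in later versions

@interact(freq=(1, 10), amplitude=(0.1, 2.0))
def plot_sine(freq=1, amplitude=1.0):
    # the decorator turns the keyword ranges into sliders and re-runs the plot on change
    x = np.linspace(0, 2 * np.pi, 200)
    plt.plot(x, amplitude * np.sin(freq * x))
    plt.show()
```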

Some things I did not like: adding a cell above the current cell can only be done with a shortcut (a, while not in edit mode); you cannot do it from the UI. This is not a disadvantage per se for people already using the notebook, but I would expect some beginners to have quite a hard time figuring out how the modal structure works. Considering Python 2 to 3, I wonder whether Python sacrifices easy entry in order to empower current users (some people end up learning Ruby because of that incompatibility problem).
From my point of view, if I had not attended this tutorial, my switch to the IPython notebook might have been delayed even further. Now, quite happily using most of the shortcuts, the notebook feels a lot like Vim. (If only the shortcuts were the same! We should have :w or at least /, seriously.)

IPython Notebook

If data science differs from software engineering in one thing, it is the IPython notebook. Software engineers live on raw code, whereas data scientists, social scientists and experimental people need richer representations to understand their experiments, their conclusions and what the data tell them in general. Software engineers can get away with code and comments; other people need paragraphs, images, audio, video and graphs for their experiments. They also need to communicate their findings to other people (not necessarily technical people). This gap is filled quite successfully by the IPython notebook. It is so successful that most of the presenters (in both tutorials and talks) opened an IPython notebook or referenced one during the conference. If an ecosystem needs a great environment to live in, data people in Python have one thanks to IPython and the notebook.

Gradient Boosted Regression Trees in scikit-learn

Presenter: Peter Prettenhofer

Peter showed how awesome gradient boosted regression trees are. Apart from their piecewise constant base functions, which I did not like one bit, they have quite nice properties: they are robust to noise and handle categorical and differently scaled features with ease. His IPython notebook covers not just regression trees but also low bias-high variance and high bias-low variance behaviour (see the deviance plot in the notebook) and parameter search to optimize the parameters, all with scikit-learn. I took (read: stole) one of his ideas: when a robust nonparametric model learns a function over the observations, one does not need to worry about outliers and noise. This idea could be exploited for outlier detection as well. Instead of framing the problem as outlier detection, we could fit a robust regression model and exclude the points (outliers) to which the model assigns low probability. I used Gaussian processes in this post to do exactly that. He put up all of the tutorial material (slides + notebook) here.
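
As a rough sketch of what the scikit-learn side of this looks like, here is a GradientBoostingRegressor combined with a small grid search; the dataset and parameter grid are my own illustration, not Peter's notebook.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

# a synthetic regression problem with some noise
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)

param_grid = {
    "n_estimators": [100, 500],
    "max_depth": [2, 4],
    "learning_rate": [0.01, 0.1],
}
# the huber loss makes the model less sensitive to outliers
search = GridSearchCV(GradientBoostingRegressor(loss="huber"), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```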

Designing and Deploying Online Experiments with PlanOut

Presenter: Eytan Bakshy

Recently, Facebook open-sourced their online experimentation framework, PlanOut. If you read the paper behind the framework, it is a response to many of the shortcomings of online experiments.

Separating the experiment from production code is the biggest advantage, and firing up different experiments on different timelines for different "segments" also seems quite nice. With Python syntax and what it provides out of the box, it looks quite easy to use, but I need to experiment with it a little to see its advantages. Some features (the GUI) are missing from the open-source release, although the paper gives some implementation details around them.
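
To give a flavour of the API, here is a rough sketch of a PlanOut experiment adapted from the project's examples; the parameter names and values are hypothetical.

```python
from planout.experiment import SimpleExperiment
from planout.ops.random import UniformChoice, BernoulliTrial

class ButtonExperiment(SimpleExperiment):
    def assign(self, params, userid):
        # assignments are deterministic per unit (userid), so a user always
        # sees the same variant
        params.button_color = UniformChoice(choices=["#2b8cbe", "#e34a33"], unit=userid)
        params.show_banner = BernoulliTrial(p=0.5, unit=userid)

exp = ButtonExperiment(userid=42)
print(exp.get("button_color"), exp.get("show_banner"))
```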

He put up all of the IPython notebooks here; if you get the source code of PlanOut, you have this material as well.

Know Thy Neighbor: An Introduction to Scikit-learn and K-NN

Presenter: Portia Burton

She gave an overview of what machine learning is and introduced scikit-learn. She also gave an overview of the K Nearest Neighbor method and demonstrated it on the Iris flower data set.
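
A minimal k-NN sketch in the spirit of the tutorial; the train/test split and the value of k are my own choices.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in newer versions

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("accuracy:", knn.score(X_test, y_test))
```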

She put up all of the material(slides+notebooks) here.

K-means Clustering with Scikit-Learn

Presenter: Sarah Guido

She put a strong emphasis on parameter selection for K-Means, since parameter selection for unsupervised learning algorithms is ill-defined. She introduced the Silhouette Score, which I did not know before; the score relates intra-cluster cohesion to inter-cluster separation. She put up an IPython notebook and slides.
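
A small sketch of using the silhouette score to pick the number of clusters; the data and the candidate range of k are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    # scores closer to 1 mean tighter, better-separated clusters
    print(k, silhouette_score(X, labels))
```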

These are the ones that I attended.

ggplot

I also really wanted to attend the ggplot-for-Python tutorial but attended PlanOut instead. However, yhat did an awesome job of putting all of the notebooks on the web and on GitHub. The notebooks are quite amazing: they are not just about ggplot per se but about a general workflow for data analysis, with an emphasis on exploratory data analysis. They also put really nice IPython notebooks on their blog; I highly recommend following it if that type of data analysis seems interesting to you.
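
For flavour, a tiny example of yhat's ggplot port, roughly following its README; the meat dataset ships with the library.

```python
from ggplot import *  # exposes ggplot, aes, geoms and the bundled example datasets

# in a notebook the last expression renders the plot inline
ggplot(aes(x='date', y='beef'), data=meat) + geom_line()
```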

Now, the talks. During the Saturday talks I was more sane and motivated; that motivation somehow decreased on Sunday. So, for the Sunday talks: it is not you, it is me. Let's get started.

Saturday Talks

Why Python is awesome for any scale data?

Presenter: Burc Alpat

This talk was about how and why Python should be used in data processing environments and how Facebook has adopted Python as one of its main languages. He also gave various pointers to other talks, mainly focusing on the strengths of Python: whenever Python does not feel like the best option, use something else and connect it back to Python.

Why Python for Machine Learning?

  1. You want to get results quickly
  2. You will need lots of functionality
  3. C++
  4. DSL
  5. You cannot do it all by yourself.

Some pointers

Note that Dataswarm is not open source.

C/C++ and language interoperability

Build Tools

Dependency Management

These are mostly dependency management systems; if you do not want to deal with dependency hell, you may want to use one of them.

Pants seems quite useful. If you are interoperating with C/C++, get a proper build system instead of depending on venv and pip, which do not seem to work well with other languages.

DSL

A Full Stack Approach to Data Visualization: Terabytes (and Beyond) at Facebook

Presenter: Jason Sundram

This was purely a tooling talk in which Jason introduced different visualization libraries and explained how Facebook uses visualization for its data. Their setup is quite interesting: they use Tornado as the framework and a bunch of different JS libraries built on D3 that reveal multiple aspects of their data.

CrossFilter

In-memory database

Some other Pointers

Thrift came up in all of the Facebook talks; it must be very important for internal API communication between different services. Its polyglot nature provides a nice common structure through which different languages can talk to each other.

Summary

To make a successful “big data” visualization:

Dark Data: A Data Scientist’s Exploration of the Unknown

Presenter: Rob Witoff

This was one of those "how we use Python" talks, with the IPython notebook, unsurprisingly, all over it. They built a private notebook viewer where people can share their notebooks across NASA, which sounds super useful. (First run that SQL query, then process it in such a way, then run the algorithm in directory x; sounds familiar? The notebook solves that problem.)

Visualization Tools and Pointers

Summary

If we cannot learn from all of our data, we cannot imagine what lies beyond what we already know.

DataPad: Python-powered Business Intelligence

Presenter: Wes McKinney

He mainly talked about their product DataPad, a browser-based analytics tool. He also discussed some of the shortcomings of current data analysis tools.

How to speed things up?

What are they using?

Sentiment Classification Using scikit-learn

Presenter: Ryan Rosario

In this talk Ryan gave an overview of what sentiment classification means and pointed to the state of the art, which is unsurprisingly a recurrent neural network architecture (so-called "deep learning"). He did not go into that in detail; at Facebook they used Naive Bayes via the scikit-learn implementation. Their approach was to gather a lot of labeled data, use bag-of-words features, select the best features with chi-squared feature selection, and classify with Naive Bayes; with this they improved on their previous sentiment classification. He also talked a little bit about their data infrastructure, Dataswarm.
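
A sketch of that recipe (bag of words, chi-squared feature selection, Naive Bayes) expressed as a scikit-learn pipeline; the texts, labels and k are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ["loved it", "great product", "terrible service", "hated the update"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

clf = Pipeline([
    ("bow", CountVectorizer()),          # bag-of-words counts
    ("chi2", SelectKBest(chi2, k=5)),    # keep the k most informative terms
    ("nb", MultinomialNB()),             # Naive Bayes classifier
])
clf.fit(texts, labels)
print(clf.predict(["great service", "terrible product"]))
```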

Sentiment

Lexicon Based Approaches

Machine Learning Based Approaches

Deep Learning (State of the Art)

Sentiwordnet Adaptation

Feature Selection

ROC curve to visualize true-positive and false-positive rates

Python Ecosystem

System Architecture

Summary

Using Python to Find a Bayesian Network Describing Your Data

Presenter: Bartek Wilczynski

He introduced a Bayesian network module for Python and mainly talked about Bayesian networks. The module seems quite easy to use from the command line as well, but as with most libraries, one needs to experiment with it.

What is a Bayesian Network?

How can we find the Best Bayesian Network?

Dynamic Bayesian Network

Up and Down the Python Data and Web visualization

Presenter: Rob Story

This was the most interesting IPython notebook of the conference, hands down. He presented a comprehensive comparison of visualization libraries in Python; see his notebook below. mpld3, in particular, seems quite interesting. However, no matter how hard Pythonistas try to make it work, if visualization depends on the browser, or more importantly on interactivity, Javascript is nearly the only solution. So during the presentation I was thinking that maybe we should not try so hard to turn Python into Javascript and should instead make peace with Javascript.
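
For reference, a small mpld3 sketch that renders a matplotlib figure as interactive D3; the plot itself is my own example.

```python
import numpy as np
import matplotlib.pyplot as plt
import mpld3

fig, ax = plt.subplots()
x = np.linspace(0, 10, 100)
ax.plot(x, np.sin(x), marker="o")
ax.set_title("matplotlib figure served as D3")

mpld3.show(fig)  # opens the figure in the browser with pan/zoom
# inside a notebook, mpld3.enable_notebook() renders figures inline instead
```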

Seaborn

Interactivity

Other Pointer

Data Analysis with SciDB-Py

Presenter: Chris Beaumont

He talked about SciDB, a database with built-in support for matrix operations. SciDB seems interesting: think of numpy arrays, but living on disk, that you can still manipulate however you like. Since you can do linear algebra operations out of core, it seems a good fit for big data; memory usage stays small because it does not have to load the whole matrix into memory.

Some notes

Sunday Talks

Speed without drag

Presenter: Saul Diez-Guerra

He introduced a number of libraries (static compilers, JITs, LLVM-based tools) to speed up Python. We of course need to write good code first, and then try to compensate for the inherent slowness of Python with those libraries.
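
As one concrete example of the JIT route, here is a hedged numba sketch; the function and data are illustrative, not from the talk.

```python
import numpy as np
from numba import jit

@jit(nopython=True)
def pairwise_l2(X):
    # plain nested loops: painfully slow in pure Python, fast once JIT-compiled
    n, d = X.shape
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(d):
                diff = X[i, k] - X[j, k]
                s += diff * diff
            out[i, j] = s ** 0.5
    return out

X = np.random.rand(200, 3)
D = pairwise_l2(X)  # first call compiles, subsequent calls run at native speed
```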

Pointer

Pythran: Static Compiler for High Performance

Presenter: Mehdi Amini

This presentation was also about how to speed up Python. He introduced his library, gave promising benchmarks and showed some under-the-hood details to explain how Pythran speeds Python up. Since its language coverage is quite large (everything except introspection and a small number of modules), it can improve execution speed considerably.
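
A minimal Pythran sketch: you annotate a plain Python module with an export comment and compile it ahead of time. The function and argument types below are my own illustration.

```python
# dprod.py
#pythran export dprod(float list, float list)
def dprod(a, b):
    # the same file remains valid pure Python if Pythran is not used
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

# compile with:  pythran dprod.py
# afterwards `import dprod` picks up the native extension module
```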

Data Science at Berkeley

Presenter: Joshua Bloom

Nice headers (not necessarily orthogonal to each other, but rather different aspects of tackling a problem):

Different Approaches

They used the scipy.sparse package to speed up some of their computations. They also used PyMC for their research.
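
A quick scipy.sparse sketch, unrelated to their actual code: build a CSR matrix and multiply without ever densifying it.

```python
import numpy as np
from scipy import sparse

# a mostly-zero matrix stored in Compressed Sparse Row format
rows = np.array([0, 1, 2, 2])
cols = np.array([1, 2, 0, 2])
vals = np.array([3.0, 4.0, 5.0, 6.0])
A = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 3))

x = np.ones(3)
print(A.dot(x))      # sparse matrix-vector product
print(A.toarray())   # densify only for inspection
```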

Pointers

Especially the Data Analysis Recipes paper is great; it is written in a sarcastic tone and harshly criticizes current/traditional methods from a Bayesian point of view. I definitely need to take a look at the book as well!

Real-time streams and logs with Storm and Kafka

Presenter: Andrew Montalenti and Keith Bourgoin

They talked about how they parse logs with Python and introduced a library they wrote in house, streamparse. They also explained how they architected their system with Storm and Kafka.
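
A hypothetical streamparse bolt sketch, loosely following the project's word-count example; import paths vary between streamparse versions and this is not their in-house code.

```python
from streamparse.bolt import Bolt  # `from streamparse import Bolt` in later versions

class LogLineCountBolt(Bolt):
    """Illustrative bolt: keeps a running count per log line it receives."""

    def initialize(self, conf, ctx):
        self.counts = {}

    def process(self, tup):
        line = tup.values[0]                      # one log line per tuple
        self.counts[line] = self.counts.get(line, 0) + 1
        self.emit([line, self.counts[line]])      # pass the running count downstream
```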

Pointers

How to build a SQL-based data warehouse for a trillion rows in Python

Presenter: Ville Tuulos

He talked about how they built a SQL-based data warehouse for big data in Python. He also gave some pointers to the visualization libraries they use. Part of the talk was about compression and how efficient different compression methods are.

Pointers

It seems very interesting to me that methods used for video and image compression could be applicable here, and not only to multimedia.

Outlier Detection in Time Series Signals

I gave this presentation and talked about outlier detection using median filtering, the Fast Fourier Transform (read: Discrete Fourier Transform) and Markov Chain Monte Carlo.
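
As a flavour of the median-filtering part, here is a small sketch that flags points sitting far from a median-smoothed baseline; the kernel size and threshold are illustrative, not the exact values from my talk.

```python
import numpy as np
from scipy.signal import medfilt

def median_filter_outliers(signal, kernel_size=5, threshold=3.0):
    # smooth with a median filter, then flag points with large robust residuals
    baseline = medfilt(signal, kernel_size=kernel_size)
    residual = signal - baseline
    mad = np.median(np.abs(residual - np.median(residual)))
    robust_std = 1.4826 * mad + 1e-12  # MAD-based robust standard deviation
    return np.abs(residual) > threshold * robust_std

signal = np.sin(np.linspace(0, 6 * np.pi, 200))
signal[[30, 97, 150]] += 4.0           # inject a few spikes
print(np.where(median_filter_outliers(signal))[0])
```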

Pointers

Dataswarm

Presenter: Mike Starr

He talked about the Dataswarm framework and how they use it, mainly focusing on data infrastructure and communication between services.

Pipeline

Dataswarm
