After every Python-related event, I keep telling myself that other language people have no idea what they are missing.
Yes, we have the GIL (Global Interpreter Lock); yes, unicode is painful, especially if you are doing a lot of text-related stuff; yes, Python is slow; yes, static typing may prevent most of the bugs I (people) have. Yet the Python community is growing day by day. And last week, PyData SV 2014 was no exception.
I attended all three days (tutorials + conference). Maybe not surprisingly, most talks, unlike the tutorials, were about tooling and data infrastructure. Even if web development in Python decreases over time (we have no reason to believe so), Python seems set to be a big player in data in the future. Even in "Big Data", people who want to write Python can get away with writing Python thanks to a bunch of ports for Python. This is also a huge gain for data processing in Python in general.
Presenters: Brian Granger and Jonathan Frederic
They introduced IPython and the IPython notebook at a beginner level and then showed what changed in 2.0 and whatnot. I was using IPython 1.0 prior to this event and now I am using IPython 2.0, which brings the modal structure to the table. If you are a Vim guy like me, that is just perfect.
Normal mode is very similar to Vim. Directories are also easy to change in the IPython notebook. If the modal environment is the first thing that they changed in IPython 2.0, the second thing is the interactivity.
@interact decorators provide an interactive environment on top of the IPython notebook.
Some of the things I did not like: adding a cell above another cell can only be done with a shortcut (a, when you are not in edit mode); you cannot do it from the UI. This is not a disadvantage per se for people who are already using the notebook, but with this modal structure I would expect some beginners to have quite a difficult time figuring out how things work. Considering the Python 2 to 3 transition as well, I wonder whether Python sacrifices easy entry in order to empower current users. (Some people end up learning Ruby because of this incompatibility problem.)
From my point of view, if I had not attended this tutorial, my switch to the IPython notebook would have been delayed even further. Now I am quite happily using most of the shortcuts, and the notebook feels quite like Vim. (If only the shortcuts were the same! We should have
:w or at least
If data science has anything to do with something other than software engineering, it is the IPython notebook. Software engineers live on raw code, whereas data scientists, social scientists and experimental people need richer representations to understand their experiments, their conclusions and what the data tell in general. Software engineers can get away with code and comments, whereas other people need paragraphs, images, audio, videos and graphs for their experiments. They also need to communicate their findings to other people (not necessarily technical people). This gap is filled by the IPython notebook quite successfully. It is so successful that most of the presenters (in tutorials and talks alike) opened an IPython notebook or referenced one during the conference. If an ecosystem needs a great environment to live in, data people in Python have such an environment thanks to IPython and the notebook.
Gradient Boosted Regression Trees in scikit-learn
Presenter: Peter Prettenhofer
Peter showed how awesome gradient boosted regression trees are. Apart from their base functions (piecewise functions), which I did not like one bit, they have quite nice properties: they are robust to noise, handle categorical features and cope with features on different scales with ease. His IPython notebook covers not just regression trees but also the low bias-high variance versus high bias-low variance trade-off (deviance plot in the notebook) and parameter search to optimize the parameters. They are all in scikit-learn. I took (read: stole) one of his ideas: when a robust nonparametric model tries to learn a function over the observations, one does not need to worry about outliers and noise. This idea could be exploited in outlier detection as well. Instead of treating the problem as outlier detection, we could design a robust regression model and exclude the observations (outliers) to which the model gives low probability. I used Gaussian processes in this post to do so. He put up all of the tutorial material (slides + notebook) here.
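To make the boosting idea a bit more concrete, here is a minimal from-scratch sketch for squared loss. It is not scikit-learn's implementation and not Peter's notebook code: each stage fits a depth-one tree (a stump, i.e. a piecewise-constant function) to the current residuals. The function names, the learning rate and the toy data are my own assumptions for illustration.

```python
# Minimal gradient boosting sketch for squared loss (illustrative only).

def fit_stump(xs, residuals):
    """Find the single split on x that minimizes squared error of the residuals."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_stages=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_stages):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data: a smooth curve approximated by a sum of step functions.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]
few = fit_boosting(xs, ys, n_stages=5)
many = fit_boosting(xs, ys, n_stages=100)
print(mse(few, xs, ys), mse(many, xs, ys))
```

The training error shrinks as stages are added, which is what a deviance plot tracks; in practice one would watch the error on held-out data to decide when to stop.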
Designing and Deploying Online Experiments with PlanOut
Presenter: Eytan Bakshy
Separation of the experiment from production code is the biggest advantage, and firing up different experiments on different timelines for different "segments" also seems quite nice. With Python syntax and what it provides out of the box, it seemed quite easy to use, but I need to experiment a little bit to see its advantages. Some of the features (the GUI) are missing from the open-source release, although the paper shows some implementation details around them.
He put up all of the IPython notebooks here; if you get the source code of PlanOut, you will have this material as well.
Know Thy Neighbor: An Introduction to Scikit-learn and K-NN
Presenter: Portia Burton
She gave an overview of what machine learning is and introduced Scikit-Learn. She also gave an overview of K Nearest Neighbor method and showed a demonstration on the Iris flower data set.
She put up all of the material (slides + notebooks) here.
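The K Nearest Neighbor idea from this tutorial fits in a few lines. The following is a minimal sketch in plain Python rather than the scikit-learn code from the tutorial; the function name, the choice of Euclidean distance and the six made-up measurements are assumptions for illustration (it is not the real Iris data).

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up (petal length, petal width) pairs, not the real Iris data set.
points = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3), (4.7, 1.4), (4.5, 1.5), (4.9, 1.5)]
labels = ["setosa", "setosa", "setosa", "versicolor", "versicolor", "versicolor"]

print(knn_predict(points, labels, (1.6, 0.25)))
print(knn_predict(points, labels, (4.6, 1.3)))
```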
K-means Clustering with Scikit-Learn
Presenter: Sarah Guido
She put a strong emphasis on parameter selection for K-Means, as parameter selection for unsupervised learning algorithms is ill-defined. She introduced the Silhouette Score, which I did not know before. The score relates the distances within a cluster (how tight it is) to the distances to the nearest other cluster (how well separated it is). She put up an IPython notebook and slides.
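For reference, here is a minimal sketch of the silhouette computation in plain Python (scikit-learn ships its own `silhouette_score`; this is not that implementation). For each point, `a` is the mean distance to the other points in its own cluster, `b` is the mean distance to the points of the nearest other cluster, and the coefficient is `(b - a) / max(a, b)`. The toy points and labelings are made up.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient; needs at least two clusters and distinct points."""
    n = len(points)
    scores = []
    for i in range(n):
        sums = {}    # label -> summed distance from point i
        counts = {}  # label -> number of other points carrying that label
        for j in range(n):
            if i == j:
                continue
            label = labels[j]
            sums[label] = sums.get(label, 0.0) + math.dist(points[i], points[j])
            counts[label] = counts.get(label, 0) + 1
        own = labels[i]
        if own not in counts:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        a = sums[own] / counts[own]  # cohesion within the own cluster
        b = min(sums[label] / counts[label]  # separation from nearest other cluster
                for label in counts if label != own)
        scores.append((b - a) / max(a, b))
    return sum(scores) / n


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette_score(points, [0, 0, 0, 1, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1, 0, 1])
print(good, bad)
```

A labeling that matches the two visible groups scores close to 1, while an arbitrary one scores much lower, which is why the score can help when choosing the number of clusters.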
These are the ones that I attended.
I also really wanted to attend ggplot for Python, but I attended PlanOut instead. However, yhat did an awesome job of putting all of the notebooks on the web and also on GitHub. The notebooks are quite amazing. They are not just about ggplot per se but about a general workflow for data analysis, with an emphasis on exploratory data analysis. They are also putting really nice IPython notebooks on their blog; I highly recommend following it if that type of data analysis seems interesting to you.
Now, the talks. First, for the Saturday talks, I was more sane and motivated. That motivation somehow decreased on Sunday. So, Sunday talks, it is not you, it is me. Let's get started.
Why Python is awesome for any scale data?
Presenter: Burc Alpat
This talk was about how and why Python should be used in data processing environments and how Facebook adopted Python as one of its main languages. He also gave various pointers to other talks. The main message: focus on the strengths of Python and, whenever it is not the best option, use something else and connect it to Python.
Why Python for Machine Learning?
- You want to get results quickly
- You will need lots of functionality
- You cannot do it all by yourself.
- Scaling Machine Learning in Python
- Scaling Machine Learning in Python Slides
- Python Sparts by Facebook using Thrift
- PrestoDB by Facebook
- Tornado from Facebook
Note that Dataswarm is not open source.
C/C++ and language interoperability
These are mostly dependency management systems: if you do not want to deal with dependency hell, you may want to use one of them.
Pants seems quite useful. If you are interoperating with C/C++, get a build system instead of depending on pip, which does not seem to work well with other languages.
- DSLs can be useful for speaking different languages.
- Dataswarm is a dependency graph description language.
- Python is good for DSLs as well.
A Full Stack Approach to Data Visualization: Terabytes (and Beyond) at Facebook
Presenter: Jason Sundram
This was purely a tooling talk in which Jason introduced different visualization libraries and explained how Facebook uses visualization for data. Their setup is quite interesting: they use Tornado as the framework and a bunch of different JS libraries based on D3 that reveal multiple aspects of the data they have.
- Fresh Data
- More pixels
- Make the data smaller
- CrossFilter: an in-browser, D3-based visualization library.
- CrossFilter Tutorial
- CrossFilter Slides Tutorial
In memory database
Some other Pointers
- Tornado + gzip + csv
- Cubism.js Slides by Mike Bostock
- Different touchpoints require asynchronous api calls. Tornado provides asynchronicity out of the box, very convenient.
- These tools are mostly for multidimensional aspects of the data. Not necessarily a specific part of it, but rather for different aspects of it.
- For GeoVisualization: Leaflet.js
- ZeroMQ for queuing.
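As a rough illustration of the "make the data smaller" and "Tornado + gzip + csv" pointers, the sketch below serializes rows to CSV and gzip-compresses the bytes using only the standard library. It is not Facebook's code; the function name and the made-up rows are assumptions, and the web framework part is left out entirely.

```python
import csv
import gzip
import io


def rows_to_gzipped_csv(rows):
    """Serialize rows to CSV text and gzip-compress the resulting bytes."""
    buffer = io.StringIO()
    csv.writer(buffer).writerows(rows)
    raw = buffer.getvalue().encode("utf-8")
    return raw, gzip.compress(raw)


# Made-up, repetitive tabular data of the kind that compresses well.
rows = [("day", "country", "visits")]
rows += [(day, "US", 1000 + day) for day in range(500)]
raw, compressed = rows_to_gzipped_csv(rows)
print(len(raw), len(compressed))
```

A server would send the compressed bytes with a `Content-Encoding: gzip` header so that the browser decompresses them transparently before the visualization code parses the CSV.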
Thrift came up in all of the talks from Facebook, so it must be very important for internal API calls between different services. Its polyglot nature should provide a very nice common structure through which different languages talk to each other.
To make a successful "big data" visualization:
- Big => small
- More Pixels
Dark Data: A Data Scientist's Exploration of the Unknown
Presenter: Rob Witoff
This was one of those "how we use Python" talks. Unsurprisingly, the IPython notebook was all over the talk. They built a private notebook viewer where people can share their notebooks across NASA, which sounds super useful. (First run that SQL query, then process it in such a way, then run the algorithm in the x directory; sounds familiar? The notebook solves that problem.)
- Even NASA depends on AWS for their data infrastructure. Maybe, this is not that surprising.
- Sharing data and visualizing it in order to understand it was another aspect of his presentation.
Visualization Tools and Pointers
- Liberate your dark data
- Enable your engineers
- Grow data scientists
If we cannot learn all of our data, we cannot imagine what we know beyond our imagination.
DataPad: Python-powered Business Intelligence
Presenter: Wes McKinney
He mainly talked about their product DataPad, a browser-based analytics tool. He also talked about some of the shortcomings of current data analysis tools.
How to speed things up?
- Badger analytics engine. (it is not open source, though).
- Purpose-built in-memory analytical query processor
What they are using?
- Query routing
- Load balancing
- comms: gevent + websockets
- compressed tabular data store
- external data source connectors
Sentiment Classification Using scikit-learn
Presenter: Ryan Rosario
In this talk Ryan gave an overview of what sentiment classification means and gave pointers to the state of the art, which is unsurprisingly a recursive neural network architecture (so-called "deep learning"). He did not go into detail on that; at Facebook, they used Naive Bayes via the scikit-learn implementation. Their approach is to gather a lot of labeled data, use bag of words as features, apply chi-squared feature selection to choose the best features and use Naive Bayes as the classifier. With this they improved their sentiment classification over what they had previously. He also talked a little bit about their data infrastructure, Dataswarm.
- Some information around emotional state of the user.
Lexicon Based Approaches
Machine Learning Based Approaches
- SVM and Naive Bayes are the most frequently used classifiers.
- Lexical features such as quantifiers and negation
- Positive to negative words converted to a statistical measure.
- Sarcasm and humor cannot be captured from the beginning
Deep Learning (State of the Art)
- Stanford Sentiment Treebank
- Deep Learning Tutorial
- Original Paper for Sentiment Analysis
- Recursive Neural Tensor Network (RNTN). => Recursive Neural Network
- Opinion mining => looks at the sentences in the form of positivity, negativity and objectivity
- Lexical analysis is very slow; part-of-speech tagging is slow
- Not very accurate either
- Chi-Squared Feature selection, for version 1, they used 1000s and then extended to millions.
ROC curve to visualize the trade-off between the true-positive and false-positive rates
- Dataswarm, Presto and Hive
- Dataswarm provides the training/test set divide
- Feature selection, training, model, testing, then store the data MySQL or dashboard. Pretty straightforward workflow.
- A basic algorithm with a ton of self-labeled data outperforms lexical parsing.
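The chi-squared feature selection step mentioned in the list above can be sketched as follows. This is a hand-rolled illustration, not Facebook's pipeline and not scikit-learn's `chi2` scorer: it scores each word with the chi-squared statistic of the 2x2 table "word present or not" versus "label positive or not". The function names and the four toy documents are assumptions.

```python
def chi_squared(docs, labels, word, positive_label):
    """Chi-squared statistic of the 2x2 table (word present?, label positive?)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_word = word in tokens
        is_positive = label == positive_label
        if has_word and is_positive:
            a += 1
        elif has_word:
            b += 1
        elif is_positive:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # the word (or the label) never varies, so it carries no signal
    return n * (a * d - b * c) ** 2 / denominator


def select_features(docs, labels, positive_label, k):
    """Keep the k words whose presence depends most strongly on the label."""
    vocabulary = sorted({word for tokens in docs for word in tokens})
    ranked = sorted(vocabulary,
                    key=lambda w: chi_squared(docs, labels, w, positive_label),
                    reverse=True)
    return ranked[:k]


# Tiny made-up corpus; each document becomes a bag (set) of words.
texts = ["great movie loved it", "loved the great cast",
         "terrible movie hated it", "hated the terrible plot"]
labels = ["pos", "pos", "neg", "neg"]
docs = [set(text.split()) for text in texts]
top = select_features(docs, labels, "pos", k=4)
print(top)
```

Words that appear equally often in both classes (such as "movie" here) score zero and are dropped, so the classifier trained afterwards only sees the words that separate the classes.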
Using Python to Find a Bayesian Network Describing Your Data
Presenter: Bartek Wilczynski
He introduced a Bayesian network module in Python and mainly talked about Bayesian networks. The module seems quite easy to use from the command line as well but, as with most libraries, I need to experiment with it.
What is a Bayesian Network?
- DAG => Directed Acyclic Graph, without cycles
- Nodes for representing random variables
- Edges for representing dependencies
How can we find the Best Bayesian Network?
- Bayesian statistics (BDe)
- Information theoretic (MDL)
- Hypothesis testing
Dynamic Bayesian Network
- Describe also temporal dependencies
- Causal links only go forward in time
- This breaks the problem of cycles, as we now have two versions of each variable: "before" and "after"
- Homepage for the Project
- Visualization Library for Graph Theory and Analysis
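The point about dynamic Bayesian networks breaking cycles can be checked with a few lines of standard-library Python (`graphlib`, available from Python 3.9). This sketch has nothing to do with the module presented in the talk; the helper names and the two-variable feedback loop are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter


def is_acyclic(edges):
    """Return True if the directed graph given as (parent, child) pairs is a DAG."""
    sorter = TopologicalSorter()
    for parent, child in edges:
        sorter.add(child, parent)  # the child depends on its parent
    try:
        sorter.prepare()
    except CycleError:
        return False
    return True


def unroll(edges):
    """Turn each edge u -> v into u_before -> v_after (one time step)."""
    return [(f"{u}_before", f"{v}_after") for u, v in edges]


# A and B influence each other: not a valid static Bayesian network.
feedback_loop = [("A", "B"), ("B", "A")]
print(is_acyclic(feedback_loop))
print(is_acyclic(unroll(feedback_loop)))
```

Because every edge in the unrolled graph goes from a "before" variable to an "after" variable, no path can return to where it started.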
Up and Down the Python Data and Web visualization
Presenter: Rob Story
- Jointplot is very good if you want to visualize two variable distribution.
- If you want to create an interactive visualization, use
- Especially for switching prior distributions in MCMC models, this is a gem.
- Leaflet is for all geospatial visualization; Jason mentioned the same library as well. It should be the lingua franca.
Data Analysis with SciDB-Py
Presenter: Chris Beaumont
He talked about SciDB, a database with built-in support for matrix operations. SciDB seems interesting. Think of numpy arrays, but on hard disk, that you can manipulate however you like. Since it can do linear algebra operations without loading the whole matrix into memory, its memory usage is small, which makes it seem a perfect fit for big data.
- There is no Mac OSX support, yet.
- You could write numpy expressions to the database except its syntax is a little different
- Being able to display the data directly from the database is quite nice.
- Github Repo
Speed without drag
Presenter: Saul Diez-Guerra
He introduced a number of tools (static compilers, JITs, LLVM-based ones) to speed up Python. Of course, we first need to write good code and only then try to compensate for the inherent slowness of Python with those libraries.
Pythran: Static Compiler for High Performance
Presenter: Mehdi Amini
This presentation was also about how to speed up Python. He introduced his library, gave promising benchmarks and showed some under-the-hood details to explain how they speed up Python. Since the coverage is quite large (everything except introspection and a small number of modules), they could improve the execution speed.
Data Science at Berkeley
Presenter: Joshua Bloom
- Core principles
- Academic pursuit or a skillset to be trained? => nice questions
Nice headers (Not necessarily orthogonal to each other but rather different aspects of tackling a problem):
- Data Driven or Theory Driven
- Bayesian vs Frequentist
- Parametric vs Nonparametric
scipy.sparse package to speed some of their computation up. They also used PyMC for their research.
- Python Bootcamp
- Bayesian Network For Graphs
- Data Analysis Recipes
- Data Analysis Recipes Paper
- Data Analysis Recipes Book
The Data Analysis Recipes paper is especially great: it is written in a sarcastic way and harshly criticizes the current/traditional methods from a Bayesian point of view. I definitely need to take a look at the book as well!
Real-time streams and logs with Storm and Kafka
Presenters: Andrew Montalenti and Keith Bourgoin
They talked about how they are parsing logs with Python and introduced a library they wrote in house (Streamparse). They also explained how they architected their system with Storm and Kafka.
- Distributed Task Queue
- Distributed Realtime Computation
- Distributed Messaging System
- Demo Repo
How to build a SQL-based data warehouse for a trillion rows in Python
Presenter: Ville Tuulos
He talked about how they built a data warehouse for big data in Python. He also gave some pointers to the visualization libraries they are using. Some portion of the talk was about compression and the efficiency of compression using different methods.
- Low-latency, in-memory, fully-sql-compliant data warehouse
- They used information theoretic approach to compress the signal.
- Efficient encoding brings the compression. Also, the data is somehow sparse.
- It looks very much like numpy operations (very interesting how it overlaps with SciDB)
- Efficient compression library
- Run-length encoding
- Probabilistic Data Structures
- Huffman Encoding
It seems very interesting to me that the methods used for video and image compression could be applicable to data that is not necessarily multimedia.
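Of the techniques listed above, run-length encoding is the easiest one to show. Here is a minimal sketch in plain Python; it is not the talk's implementation, and the function names and the toy column are made up. It works well exactly when a column contains long runs of repeated values.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal consecutive values into (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


column = ["US"] * 5 + ["TR"] * 3 + ["US"] * 2
encoded = rle_encode(column)
print(encoded)  # [('US', 5), ('TR', 3), ('US', 2)]
```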
Outlier Detection in Time Series Signals
I gave this presentation and talked about outlier detection using Median Filtering, Fast Fourier Transform (read Discrete Fourier Transform) and Markov Chain Monte Carlo.
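A minimal sketch of the median-filtering variant, written in plain Python for this post rather than taken from the talk: smooth the signal with a running median, then flag the points whose distance from the smoothed value is large compared to the typical residual. The window size, the threshold and the toy signal are arbitrary choices for illustration.

```python
from statistics import median


def median_filter(signal, window=5):
    """Running median; the window is truncated at the edges of the signal."""
    half = window // 2
    return [median(signal[max(0, i - half): i + half + 1])
            for i in range(len(signal))]


def find_outliers(signal, window=5, threshold=3.0):
    """Return indices whose residual exceeds `threshold` times the typical residual."""
    smoothed = median_filter(signal, window)
    residuals = [abs(x - m) for x, m in zip(signal, smoothed)]
    # Median of the absolute residuals as a robust estimate of the noise scale.
    scale = median(residuals) or 1e-9
    return [i for i, r in enumerate(residuals) if r > threshold * scale]


signal = [1.0, 1.1, 0.9, 1.0, 8.0, 1.1, 0.9, 1.0, 1.2, 1.0]
print(find_outliers(signal))
```

The median is used both for smoothing and for the scale estimate because, unlike the mean, it is barely affected by the very outliers we are trying to find.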
Presenter: Mike Starr
He talked about the Dataswarm framework and how they are using it, mainly focusing on data infrastructure and communication between services.
- Event Logging
- Data Pipelines
- Reports / Apps
- Dependency graph description language(similar to Luigi)
- Dependencies between tasks are much more concise than in Luigi. Luigi is quite verbose compared to this type of dependency description.