PyData Silicon Valley 2014 Part 2
The PyData Silicon Valley videos are up on YouTube, thanks to Facebook, Continuum and Parse.ly. Check them out; they are pretty great.
I already wrote about (a kind of wrap-up of) the presentations that I attended here. This post does the same for the talks I could not attend but watched on video.
I somehow managed to miss some of the most interesting talks, but I could follow the presentations much better through their YouTube videos than in real life, so it kind of worked out better than I expected.
ggplot
Presenter: Greg Lamp
As I already mentioned in the first part, this tutorial introduces the ggplot visualization library, which is based on the Grammar of Graphics.
What is ggplot?
- Based on Grammar of Graphics
- An implementation in Python (ported from the R implementation, ggplot2)
Matplotlib
Advantages
- Mature, it has been around for some time
- Integration with Ipython Notebook
- Customizable
- Community
Disadvantages
- API, learning curve, syntax, default themes
d3.js is like the Mona Lisa where ggplot is like a camera. You buy the painting to show off, whereas the camera is more of a utility.
This comparison is quite good: generally, plots (either ggplot or matplotlib) are either ad-hoc analyses or for internal usage. If you are producing something consumer facing, you should be using a library much better suited for the job. Since the conference, I have actually used ggplot for a couple of projects and I liked its syntax and API as well as the integration with pandas. As a 2-3 year matplotlib user, I think it will be a great library when it reaches maturity. However, I think ggplot is more a competitor to seaborn than to matplotlib. Matplotlib is a plotting library in a much broader sense (it has support for maps, images and 3D visualizations as well), whereas ggplot and seaborn target a more specific subset of visualizations. That subset is quite important, though.
That being said, it attacks the disadvantages of matplotlib quite well. It has a great API, concise syntax and much better default themes.
Syntax
Especially, the syntax of ggplot is great.
In the "what" step, you say what you want to bind: wt to the x axis and mpg to the y axis. Then, in the "how" step, you say how to draw it; p + geom_point() produces a scatterplot. If you want a density plot instead, you change it to p + geom_density() without touching the "what" step. This is pretty great. If you also want a smoothing window over the variable, you write p + geom_point() + stat_smooth().
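Here is a minimal sketch of that what/how split (my own, on the mtcars sample dataset that ships with the library, not the presenter's exact code):

from ggplot import ggplot, aes, geom_point, stat_smooth
from ggplot import mtcars  # sample dataset bundled with the library

# "what": bind wt to the x axis and mpg to the y axis
p = ggplot(aes(x='wt', y='mpg'), data=mtcars)

# "how": draw it as a scatterplot; in an IPython Notebook, evaluating the
# expression renders the plot
p + geom_point()

# add another layer without touching the "what"
p + geom_point() + stat_smooth()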
This layered approach provides a couple of benefits. All of the layers can be treated independently, as the visualization is mainly defined in the "what" step rather than the "how". Therefore, you can easily experiment and play with different layers on your data visualization. Icing on the cake: faceting becomes very easy with ggplot, as it has a faceting utility out of the box (a tiny sketch follows). I am a big fan of faceting for seeing the effect of a categorical variable, so it is a great tool for exploratory data analysis. If you want to learn more, check out the documentation.
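The faceting sketch I mentioned, again on the bundled mtcars data (assuming facet_wrap accepts the column name, as in the library's examples):

from ggplot import ggplot, aes, geom_point, facet_wrap
from ggplot import mtcars

# one scatterplot panel per number of cylinders
p = ggplot(aes(x='wt', y='mpg'), data=mtcars)
p + geom_point() + facet_wrap('cyl')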
Functional Performance with Core Data Structures
Presenter: Matthew Rocklin
The premise of this talk is the following two claims:
- Pure Python is not slow
- Python is a decent language.
To back up the first one, he provides benchmarks showing that Python is not “that” slow compared to other languages. For the second one, he introduces a functional programming library, Toolz.
The library is actually pretty great. He was influenced by Clojure, which I wrote about (and how great it is) here. The library promotes laziness, functions, pure functions and composability; all the good stuff that is common in functional programming languages. Surprisingly, some of the functions are the same as standard library ones, like groupby in itertools and filter, and he claimed the Toolz versions are actually faster than the standard library equivalents. Some of the functions do not exist in the standard library, like take, interleave, or pipe. pipe is pretty good: it provides a mechanism similar to Clojure's ->> macro, where you can chain a bunch of functions without sacrificing readability or execution order.
For instance, he piped a number through a couple of functions to get the string representation of 3.0. This is a great way to compose functions.
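A minimal reconstruction of that kind of pipeline (mine, not the slide code):

from toolz import pipe

# pipe(x, f, g) is g(f(x)): the value flows left to right, like Clojure's ->>
pipe(3, float, str)   # '3.0'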
Another example he gave is much more impressive:
from toolz.curried import pipe, map, concat, frequencies

# stand-in for the stemmer used in the talk: lowercase and strip punctuation
stem = lambda word: word.lower().strip(",.!:;'\"-")

# word frequencies for the whole book, read lazily line by line
pipe('data/tale-of-two-cities.txt', open,
     map(str.strip),
     map(str.split),
     concat,
     map(stem),
     frequencies)
This is pretty great.
He also pointed to another library, unsurprisingly called CyToolz, which implements the same functionality in C and is more efficient.
I am quite happy to see great libraries like these two making at least a dent for functional programming in Python.
He then summarizes why functional programming is becoming more relevant as follows:
- Good software practices
- Composition promotes scalable software systems (a tiny example follows this list)
- Purity, composition promotes testing
- Sequence abstractions support streaming computation (if you have low memory, you have no other choice anyway)
- Purity and serializability promote parallel and concurrent processing.
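To make the composition point concrete, here is a tiny illustration of my own with Toolz's compose (not from the talk):

from toolz import compose

# compose(f, g)(x) == f(g(x)): small pure functions snap together into bigger ones
clean = compose(str.strip, str.lower)
clean('  Hello ')   # 'hello'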
Querying your database in natural language
Presenter: Daniel Moisset
This is one of the talks I most regretted not attending. Not only is it quite interesting on its own, but one of my previous projects (querying Yelp restaurants with natural language) is directly related.
Generally speaking, natural language, spoken or written, provides a much easier interface for the user. However, the systems that process these inputs are still in their early days, so they cannot really process the inputs very effectively. That is why you see all those radio buttons, sliders, checkboxes and text inputs for the user to enter structured information, rather than the user typing “show me the closest five restaurants that serve Turkish food and accept credit cards”. Question types vary a lot across people, and even the order, verbs and nouns are not something most people agree on. Therefore, most companies provide “intuitive” interfaces to get structured data, and sometimes they help make it easier, too (autocomplete).
Quepy attacks the problem from a somewhat different angle. First, it tries to find what is being filtered, and then it builds the query based on the filtering parameters. As I mention later, tackling question variation with regular expressions is somewhat limited, but since building different versions of questions is easy, the ease of use kind of compensates for that limitation. Other than that, installing and integrating the library into your application is quite easy (some of the commands resemble Django as well). It is a very nice library.
Introduction
- Data is everywhere
- Collecting data is not the problem; what to do with it is the real problem
- Any operation starts with selecting and filtering data.
Natural Language Queries
- Very accessible, trivial to learn
- Weak in most applications, as there is a limited list of queries it can handle
Quepy
- They propose Quepy, an open-source framework to translate natural language questions into a database query language.
- Support for database query languages is somewhat limited (SPARQL and MQL).
Approach
- First, it parses the question
- Matches it and creates an intermediate representation
- Then, based on the representation, it generates the query through a Domain Specific Language
Parsing
- It is done at the word level
- is: is/be/VBZ
- swallows: swallows/swallow/NNS
- Question Rule => Regular expressions
- Token(“what”) + Lemma(“be”) + Question(Pos(“DT”)) + Pos(“NN”) => the word “what”, followed by any form of the verb “to be”, optionally followed by a determiner (“all”, “every”), followed by one or more nouns.
In my opinion, the biggest disadvantage is the question rules. The regular expressions are very limited in terms of structure, and it seems they can only “answer” wh- questions in specific forms. “Give me” / “Find me” imperative forms, yes-no questions, and subordinate clauses (these correspond to “where” in a SQL-like language) do not exist. Otherwise, the library, especially in terms of usage, seems very good, and various question forms can easily be added as question types using QuestionTemplate. Still, I was expecting very common question types to be supported out of the box for a natural language to database query translator (to be fair, that alone is quite a hard problem).
Intermediate Representation
- Graph-like, with some known values and some holes (x0, x1, x2). Always has a “root”.
- Similar to knowledge databases(like freebase)
- Easy to build from python code.
DSL
- Single concepts with fixed relations or entities.
from refo import Question
from quepy.parsing import Lemma, Pos, QuestionTemplate
# Thing (a Particle) and DefinitionOf are defined in the app's own dsl module

class WhatIs(QuestionTemplate):
    regex = Lemma("what") + Lemma("be") + \
        Question(Pos("DT")) + Thing() + Question(Pos("."))

    def interpret(self, match):
        return DefinitionOf(match.thing)
This Python code is great in several respects. First, it is very easy to construct question types in this form. Second, it is very expressive and readable, so you can see at a glance what types of questions you already cover. For fixed-type DSLs, it resembles Freebase a lot.
Apps: Gluing it all together
- You build a Python package (quepy ships a startapp command to scaffold it)
- There you add your DSL and question templates
- Then you configure it by editing myapp/settings.py (output query language, data encoding)
It resembles Django a lot.
import quepy

app = quepy.install("myapp")
question = "what is love"
target, query, metadata = app.get_query(question)
db.execute(query)  # db is whatever client talks to your SPARQL/MQL endpoint
The Good Things
- The effort to add a question template is small and the benefit is linear with respect to effort.
- Good for industry applications.
- Low specialization required to extend
- Human work is very parallelizable
- Easy to get many people to work on questions.
- Better for domain specific databases.
Future Directions
- Testing this against different databases.
- Improving performance
- Collecting uncovered questions and adding machine learning to learn new patterns (yes, especially trying to cover all the forms from a set of question forms would be very valuable)
PyAlgoViz: Python Algorithm Visualization in the browser
Presenter: Chris Laffra
He presents a very nice visualization environment, PyAlgoViz, where any developer can visualize the execution steps of an algorithm, or the algorithm in general, in a visualization-rich way. This kind of opens up the algorithm black box and makes algorithms more memorable through visualizations, as well as being a nice way to learn them in the first place. Debugging would also be easier in this type of environment if your algorithm does not produce the output that you are expecting.
Building an Army of Data Collecting Robots in Python
Presenter: Freedom Dumlao
This presentation's focus is more on architecture than anything else. He took a real-world problem and tried to tackle it. Architecting the whole framework bit by bit was a nice approach to building the system.
I really like the presentation style (along with James Horey's) where they present the solutions as a story (here, about his friend, a beginner Python developer). This makes it somehow easier to understand the problem, the domain, the context, and how the proposed method solves the given problem.
OpenGraph Protocol
- This protocol enables any webpage to become a rich object in a social graph.
- This is why, when you share something, the Open Graph title, URL and image are what the social media site you are sharing on displays.
RQ
They used RQ to handle job queues, with Redis to store the files.
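A minimal sketch of what an RQ setup like that looks like (the fetch_page function and URL are made up for illustration, not from the talk):

from redis import Redis
from rq import Queue
import requests

def fetch_page(url):
    # the unit of work a robot performs: download a page and return its body
    return requests.get(url).text

q = Queue(connection=Redis())                      # jobs are stored in Redis
job = q.enqueue(fetch_page, 'http://example.com')  # a worker process picks this up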
3 Services
- ElasticSearch (used like a database) => qbox.io
- Task Queue (RQ & Redis) => Redis to Go
- Compute (Heroku)
Rules to keep in mind
- Don't DoS your neighbors
- Check robots.txt to see if you are permitted
- Don’t be a jerk.
Ferry: Share and Deploy Big Data Applications with Docker
Presenter: James Horey
I had the chance to talk with him on Saturday; his presentation, like him, was quite awesome. If you want to understand what Docker is or why it is necessary/beneficial, definitely watch his talk. It is a beginner talk for infrastructure, and it is also one of the most organized presentations that I saw, so it is very easy to follow. He eventually introduces Ferry, his library that makes it easy to manage the containers.
Docker
- Encapsulates applications in isolated containers.
- Makes it easy and safe to distribute applications.
- Easy to get started.
- Nice way to isolate the runtime environment.
Docker allows public images to be shared freely.
Using Docker gives you the advantages of isolated environments with very low overhead.
Cassandra
- Highly scalable and fault-tolerant.
- Great for storing streaming data (sensors, messages)
Ferry
- Specify the containers that constitute your application in YAML.
- Support for Hadoop, Cassandra, GlusterFS, and OpenMPI.
- It's a little bit like pip for your Docker-based runtime environment.
Advantages of using Ferry
- Even simple applications can be complicated to install and run.
- Docker solves this by encapsulating the entire runtime environment in lightweight containers.
- Ferry enables plugging in external big data dependencies.
Crushing the Head of the Snake
Presenter: Robert Brewer
It was an optimization talk where he walks through what to do to improve the speed of your code.
timeit's default number argument is 1000000, and nobody wants to wait that long to profile. It also has overhead, so for benchmarking that needs to be subtracted from the end result. When you run it, take the minimum of the three repetitions, as the others may include overhead from background processes.
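A small sketch of that approach (my own, not from the slides), timing a statement with timeit.repeat and subtracting the empty-statement overhead:

import timeit

# overhead of the timing loop itself: time a no-op statement
overhead = min(timeit.repeat('pass', number=100000, repeat=3))

# time the code under test the same way, then subtract the overhead
measured = min(timeit.repeat('sum(range(100))', number=100000, repeat=3))
print(measured - overhead)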
Some Rules of Thumb
- Use Python builtins. They are written in C, so equivalents written in Python (yours or an external library's) will usually be slower.
- Reduce function calls. There is some overhead in calling functions in Python.
- Vectorize operations with NumPy (small sketch after this list).
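A tiny illustration of the NumPy point (my own example): one vectorized expression replaces a million Python-level iterations and function calls.

import numpy as np

values = np.random.rand(1000000)
# one C-level vectorized operation instead of a Python loop
total = (values * 2.0).sum()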
Parallelization
from multiprocessing import Pool

def run():
    # run_one(i) computes one segment's result; segments is the number of chunks
    results = Pool().map(run_one, range(segments))
    result = stddev(results)
    return result
Introducing x-ray: extended arrays for scientific datasets
Presenter: Stephan Hoyer
x-ray is a library that provides rich data structures for multidimensional data, originally motivated by weather data.
Why is it necessary?
5 dimensional weather forecasts:
- labeled
- big
- binary
- high-dimensional
- not very natural to put in a pandas dataframe
What do we want?
- Like numpy.ndarray, but better:
- N-dimensional
- Handles missing data
- Labeled dimensions (axes)
- “time”, “latitude”, “longitude”
- not “index” and “columns”
- Labeled coordinates (ticks)
- Lazy array access (out-of-memory data)
Xray Design Goals
- Labeled, N-dimensional arrays for scientific data
- Don’t rebuild the wheel
- reuse pandas indexing
- copy the pandas API
- Interoperate with the rest of the scientific stack.
- Implement a proven data model.
Xray Data Model
- DataArray
- N-dimensional
- labeled dimensions (axes)
- labeled coordinates (indices)
- homogeneous dtype
- Dataset
- a dict-like container of DataArrays
- shared dimensions and coordinates
- Both keep track of arbitrary metadata (attributes)
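The library has since been renamed to xarray; here is a minimal sketch of this data model using the modern API (method and argument names may differ slightly from the 2014 release, and the data is made up):

import numpy as np
import xarray as xr  # "xray" was later renamed to "xarray"

# a tiny 2x3 temperature field labeled by time and station
temps = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=('time', 'station'),
    coords={'time': ['2014-05-03', '2014-05-04'],
            'station': ['SFO', 'SJC', 'OAK']},
    name='temperature')

ds = xr.Dataset({'temperature': temps})        # dict-like container of DataArrays
print(ds['temperature'].sel(station='SFO'))    # select by label, not axis position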
Pointers
Generators Will Free Your Mind
Presenter: James Powell
James shows us why the usage of generators may change our programming style, or at least how we approach problems. I think a better title for this talk would be "Functional Programming Will Free Your Mind", as most of the concepts and a good portion of the talk are about why functional programming is better than imperative style. Some of the code he writes, especially how he tackles the formatting problem, is quite mind-opening. In particular, I have come to like adding predicates instead of if/else branches whenever I want to add some feature to existing code. This not only makes fewer assumptions about the data you are getting but also makes the code easier to read and more maintainable. The generators and lazy evaluation part that he stresses is also great: instead of giving me all the data, give me as much as I ask for; if I want to process the data in batches, let me handle that myself (fewer assumptions, memory efficiency). It was quite a great talk.
Themes
- Generators are a very useful way of modeling problems in Python
- The way we model problems using generators is fundamentally different
Functional Programming
Modalities (different behaviors) can be injected through functional programming.
Template Usage
# region, align and profit are assumed to be defined elsewhere; binding
# .format gives a reusable template function
template = '{region}: {align} - {profit}'.format
line = template(region=region, profit=profit, align=align)
What is a Generator & Coroutine?
def fib(a=1, b=1):
    while True:
        yield a
        a, b = b, a + b

from itertools import islice
list(islice(fib(), 10))
Nice Util Functions for Functional Programming
from itertools import islice, izip, tee, takewhile, dropwhile  # izip is Python 2; use zip on Python 3

# n-wise sliding window over a generator; nwise(g, 2) yields consecutive pairs
nwise = lambda g, n: izip(*(islice(g, i, None) for i, g in enumerate(tee(g, n))))
# ratios of consecutive elements (for fib these converge to the golden ratio)
ratio = lambda g: (float(y) / x for x, y in nwise(g, 2))

list(islice(ratio(fib()), 5))

print(list(islice(fib(), 10)))
print(list(takewhile(lambda x: x < 10, fib())))
Pointers
How to Spy with Python
Presenter: Lynn Root
This talk was about how to sniff packets with Python, at least for the most part. Beyond that, she presented great historical context on what the NSA is, how it may be doing its surveillance, using metadata, and whatnot. This not only provides a much richer context than what is generally available, but also shows where it started and where it is growing/going.
Prism
- Planning tool for Resource Integration, Synchronization and Management
- Mines electronic data for the purpose of mass surveillance.
- Collects intelligence that passes through US servers
- Targets foreigners, but is elusive about data on US citizens.
- Only collects metadata (supposedly).
What is XKeyScore?
- Digital Network Intelligence Exploitation System
- Federated Query System of completely unfiltered data
- 500-700 servers, as of 2008
- Gives users ability to query for email addresses, a target’s activity, phone numbers, HTTP traffic, extract file attachments, etc.
Technologies
- scapy: packet sniffing & manipulation (a small sniffing sketch follows this list)
- pygeoip: API for GEOIP databases
- geojson: bindings & utilities for GeoJSON
- python-nmap: wrapper around nmap port scanner
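The sniffing sketch I mentioned, with scapy (my own illustration, not her code; sniffing usually requires root privileges):

from scapy.all import sniff

# print a one-line summary of each of the first 10 packets seen on the interface
sniff(count=10, prn=lambda pkt: pkt.summary())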
Wrap-Up
- Long historical precedent for mass surveillance with very little oversight or restriction.
- Safe to assume they spy on everyone.
Pointers
Hustle: A column Oriented, Distributed Event Database
Presenter: Tim Spurway
Hustle is another distributed database. It is relational and claims to do super-fast queries and distributed writes (inserts). It is an interesting concept, and its query language is a Python-based DSL. Having seen Python serve as the DSL of various data processing frameworks, it was not a big surprise for me to see a database depend solely on Python for all querying tasks.
Hustle Features
- Open Source
- Column Oriented, pipelined execution = Fast queries
- Embarrassingly distributed => Disco map/reduce pipelines, DDFS
- Embarrassingly fast = LMDB B-tree
- Relational = Join Large Tables
- Partitioned = Smart, Automatic Sharding
- Query Language: Python DSL
- Advanced Compression; store even more
- Command Line Interface: Ad-hoc queries
Column Orientation
- As opposed to traditional row oriented DBMSes
- Typical use-case is extremely large datasets.
- Very efficient column aggregation
- Excellent compression characteristics.
- Works well with Bitmap indices
Insert into Hustle
- Distributed inserts - supports massive write loads (millions of records/sec)
- Two steps:
- Create standalone Marble file (LMDB)
- Push Marble to DDFS (cheap, cluster-to-local)
- An insert can create and push any number of partitioned Marbles.
- CSV, JSON, fixed field supported
- Bulk append semantics - pay careful attention to ETL.
Query Hustle
- Python DSL - operator overloads in where clause
- Aggregating functions / grouping (h_sum, h_count, h_avg); a rough sketch of a query follows this list
- Efficient, distributed join
- Order, limit, desc, distinct
- Nested queries
- Extensible query language
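I have not verified this against the library, but based on my recollection of its README a query in the Python DSL looks roughly like the sketch below; treat the table, columns and exact function names as assumptions:

from hustle import Table, select, h_sum, h_count   # assumed import path

imps = Table.from_tag('impressions')               # an existing partitioned table
# operator overloads build the where clause; h_sum/h_count aggregate per group
select(imps.ad_id, h_sum(imps.cpm_millis), h_count(),
       where=imps.date == '2014-01-20')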
Hustle Query Execution
- Hustle queries are distributed Disco pipeline jobs
- Query optimizer and replication minimize data movement in cluster.
Pointers
Socialite: Python Integrated Query Language for Big Data Analysis
Presenter: Jiwon Seo
Another big data processing framework, whose query language can be integrated into Python seamlessly: you can use Python variables in the query language and vice versa. It claims to be faster than GraphLab (though not necessarily faster than Spark, I guess). It does in-memory computation, so in that aspect it differs from Hadoop-based frameworks.
Why Another Big Data Platform?
Problems in existing platforms:
- Too difficult (low-level primitives)
- Inefficient (not network bound)
- Too many (sub)frameworks
Introducing Socialite
Socialite is a high level language:
- Easy and efficient
- Compiled to optimized code
- Python integration (Jython)
- Good for
- Graph Analysis
- Data Mining
- Relational Queries
Outline
- Concepts in Socialite
- Distributed Tables
- Python Integration (Jython and Python)
- System overview
- Analysis Algorithms
- Shortest paths, PageRank
- K-Means, Logistic Regression
- Evaluation
Distributed In-Memory Tables
- Primary data structure in Socialite
- Column oriented storage
Python Integration
- Socialite queries in Python code
- Queries are quoted in backtick
- Python functions, variables are accessible in Socialite queries.
- Socialite tables are readable from Python.
print("This is Python code")

# queries are quoted in backticks; $-prefixed names refer to Python
# functions/variables defined in the surrounding code
`Foo[int i](String s)
 Foo[i](s) :- i=41, s="the answer"`

v = "Python variable"
`Foo[i](s) :- i=44, s=$func`

# Socialite tables are readable from Python
for i, s in `Foo[i](s)`:
    print(i, s)
System Optimizations
- Custom Memory Allocator (temporary table)
- Optimized Serialization
- Direct ByteBuffer (network Buffer)
- Multiple network channels among workers
Summary
- Distributed query language
- Integration with Python
- Easy and efficient
- Algorithms in Socialite
- Competitive Performance