Pydata Silicon Valley 2014 Part 2




July 20, 2014
Pydata Silicon Valley 2014

The Pydata Silicon Valley videos are now up on YouTube, thanks to Facebook, Continuum and Parse.ly. Check them out; they are pretty great.

I already wrote a wrap-up of sorts of the presentations that I attended here. This post is similar, but it covers the talks that I could not attend and watched on video instead.

I somehow managed to miss some of the most interesting talks, but I could follow the presentations much better through their YouTube videos than in real life, so it worked out better than I expected.

ggplot

Presenter: Greg Lamp

As I already mentioned in the first part, this tutorial introduces the ggplot visualization library, which is based on the Grammar of Graphics.

What is ggplot?

Matplotlib

Advantages

Disadvantages

d3.js is like a Mona Lisa painting, whereas ggplot is like a camera. You buy the painting to show off, whereas the camera is more of a utility.

This comparison is quite good: generally, plots (whether ggplot or matplotlib) are either ad-hoc analyses or for internal use. If you are producing something consumer-facing, you should use a library much better suited to the job. Since the conference, I have used ggplot for a couple of projects, and I liked its syntax and API as well as its integration with pandas. As a matplotlib user of 2-3 years, I think it will be a great library when it reaches maturity. However, I think ggplot is more a competitor to seaborn than to matplotlib. Matplotlib is a plotting library in a much broader sense (it also supports maps, images and 3D visualizations), whereas ggplot and seaborn target a specific subset of visualizations. That subset is quite important, though.

That being said, it attacks the disadvantages of matplotlib quite well. It has a great API, concise syntax and much better default themes.

Syntax

The syntax of ggplot is especially great.


p = ggplot(mtcars, aes(x='wt', y='mpg'))

This is the “what” step. This is where you say that you want to bind wt to the x axis and mpg to the y axis. Then, in the “how” step, you say how the data should be drawn:


p + geom_point()

This produces a scatterplot. If you want a density plot instead, you can change it to p + geom_density() without changing the “what” step. This is pretty great. If you also want to overlay a smoothed average of the variable, use p + geom_point() + stat_smooth().

This layered approach provides a couple of benefits. All of the layers can be treated independently, as the visualization is mainly defined in the “what” step rather than the “how”. Therefore, you can easily experiment with different layers on your data visualization. As icing on the cake, faceting is very easy with ggplot, as it comes with a faceting utility out of the box. I am a big fan of faceting for seeing the effect of a categorical variable on the other parameters, so it is a great tool for exploratory data analysis. If you want to learn more, check out the documentation.
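To make the “what” versus “how” split concrete, here is a minimal plain-Python sketch of how a `+` operator can stack independent layers on top of a fixed data mapping. This is not how ggplot is implemented; the `Plot` and `Layer` classes, the `describe` method and the two sample rows are all hypothetical and only illustrate the idea.

```python
class Layer:
    """The 'how': a named drawing step that knows nothing about the data."""

    def __init__(self, name):
        self.name = name


class Plot:
    """The 'what': a dataset plus a mapping from aesthetics to column names."""

    def __init__(self, data, mapping, layers=()):
        self.data = data
        self.mapping = mapping
        self.layers = tuple(layers)

    def __add__(self, layer):
        # Adding a layer returns a new Plot, so the base plot stays reusable.
        return Plot(self.data, self.mapping, self.layers + (layer,))

    def describe(self):
        names = " + ".join(layer.name for layer in self.layers)
        return "x={x}, y={y} drawn as {names}".format(names=names, **self.mapping)


# Two illustrative rows standing in for a real data frame.
cars = [{"wt": 2.62, "mpg": 21.0}, {"wt": 3.44, "mpg": 18.7}]

p = Plot(cars, {"x": "wt", "y": "mpg"})          # the "what" step
scatter = p + Layer("point")                     # one "how"
smoothed = p + Layer("point") + Layer("smooth")  # another "how", same "what"

print(scatter.describe())
print(smoothed.describe())
```

Because `__add__` never mutates the base plot, the same `p` can be combined with as many different layer stacks as you like, which is what makes this style pleasant for experimenting.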

Functional Performance with Core Data Structures

Presenter: Matthew Rocklin

The talk rests on the following two premises:

  1. Pure Python is not slow
  2. Python is a decent language.

In order to prove the first one, he provides benchmarks to show that Python is not “that” slow compared to other languages. For the second one, he introduces Toolz, a functional programming library.

The library is actually pretty great. He was influenced by Clojure, whose greatness I wrote about here. The library promotes laziness, pure functions and composability: all of the good stuff that is common in functional programming languages. Surprisingly, some of the functions mirror ones that already exist in Python, such as groupby in itertools or the built-in filter, and he claimed that the Toolz versions are actually faster than their standard library equivalents. Other functions, such as take, interleave and pipe, have no counterpart in Python. pipe is pretty good: it provides a mechanism similar to Clojure’s ->> macro, where you can chain a bunch of functions without sacrificing readability or execution order.

pipe(3, double, str)

This returns '6', the string representation of 3 doubled (assuming double multiplies its argument by two). This is a great way to compose functions.

Another example he gave is much more impressive:

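from toolz.curried import pipe, map, concat, frequencies

# stem is a user-supplied word-stemming function (not part of Toolz)
pipe('data/tale-of-two-cities.txt', open,
     map(str.strip),
     map(str.split),
     concat,
     map(stem),
     frequencies)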

This is pretty great.
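To see why pipe reads so well, here is a small plain-Python sketch of the same idea. It is not Toolz’s actual source: `pipe` is re-implemented with `functools.reduce`, `Counter` stands in for `frequencies`, lowercasing stands in for the stemming step, and the `word_counts` helper is a name made up for this example.

```python
from collections import Counter
from functools import reduce


def pipe(data, *funcs):
    """Thread data through funcs left to right: pipe(x, f, g) == g(f(x))."""
    return reduce(lambda value, func: func(value), funcs, data)


def double(x):
    return 2 * x


def word_counts(lines):
    """Count words in an iterable of lines, one lazy step at a time."""
    return pipe(
        lines,
        lambda ls: (line.strip() for line in ls),
        lambda ls: (line.split() for line in ls),
        lambda groups: (word for group in groups for word in group),  # like concat
        lambda words: (word.lower() for word in words),  # stand-in for stemming
        Counter,  # like frequencies
    )


print(pipe(3, double, str))
counts = word_counts(["It was the best of times", "it was the worst of times"])
print(counts.most_common(3))
```

Every intermediate step is a generator, so nothing is computed until `Counter` consumes the stream at the end.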

He also pointed to another library, unsurprisingly called Cytoolz, which implements the same functionality as Toolz in Cython (compiled to C) and is therefore more efficient.

I am quite happy to see great libraries like these two making at least a dent for functional programming in Python.

He then summarizes why functional programming is becoming more relevant as follows:

Querying your database in natural languages

Presenter: Daniel Moisset

This is one of the talks that I most regretted not attending. Not only is it quite interesting on its own, but one of my previous projects (querying Yelp restaurants with natural language) is directly related.

Generally, spoken or written natural language provides a much easier interface for the user. However, the systems that process these inputs are still in their early days, so they cannot really process the inputs very effectively. Therefore, you see all those radio buttons, sliders, checkboxes and text input areas where the user enters structured information, rather than the user typing “show me the closest five restaurants that serve Turkish food and accept credit cards”. Question types vary quite a lot between people, and even the word order, verbs and nouns are not something most people agree on. Therefore, most companies provide “intuitive” interfaces to get structured data, and sometimes they help to make it easier, too (autocomplete).

Quepy attacks the problem from a somewhat different angle. First, it tries to find what is being filtered, and then it tries to return the query based on the filtering parameters. As I mention later, tackling question variation with regular expressions is somewhat limited, but since building different versions of questions is easy, the ease of use compensates for the limitation. Other than that, installing and integrating the library in your application is quite easy (some of the commands resemble Django as well). It is a very nice library.

Introduction

Natural Language Queries

Quepy

Approach

  1. First, it parses the question
  2. Then, it matches the question and creates an intermediate representation
  3. Finally, based on that representation, it generates the query and the domain-specific language (DSL)
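The three steps above can be sketched in plain Python. This is only an illustration of the pipeline shape and does not use Quepy: the `WHAT_IS` pattern, the dictionary used as the intermediate representation and the SQL-like output of `to_query` are all assumptions made up for this example.

```python
import re

# Steps 1 and 2: a hypothetical question rule written as a plain regular expression.
WHAT_IS = re.compile(
    r"^what\s+(?:is|are)\s+(?:(?:a|an|the)\s+)?(?P<thing>.+?)\s*\??$",
    re.IGNORECASE,
)


def to_intermediate(question):
    """Match the question and build a small intermediate representation."""
    match = WHAT_IS.match(question.strip())
    if match is None:
        return None  # this rule does not cover the question
    return {"intent": "definition", "thing": match.group("thing")}


def to_query(representation):
    """Step 3: render the intermediate representation as a query string."""
    # Illustrative only; real code should use parameterized queries.
    return 'SELECT definition FROM things WHERE label = "{thing}"'.format(**representation)


for question in ("what is love", "What is a car?", "Find me a car"):
    representation = to_intermediate(question)
    print(question, "->", to_query(representation) if representation else "no match")
```

The last question shows the limitation of the approach: any phrasing that no rule anticipates simply falls through, so coverage depends on how many rules you write.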

Parsing

In my opinion, the biggest disadvantage is the question rules. Regular expressions are very limited in terms of structure, and it seems they can only “answer” Wh- questions of specific forms. Imperative forms such as “Give me” and “Find me”, yes-no questions and subordinate clauses (which correspond to “where” in SQL) are not covered. Otherwise, the library seems very good, especially in terms of usage, and further question forms can easily be added to the question types using QuestionTemplate. Still, I was expecting support for very common question types out of the box from a natural-language-to-database-query translator. (To be fair, that alone is quite a hard problem.)

Intermediate Representation

DSL

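from refo import Question
from quepy.parsing import Lemma, Pos, QuestionTemplate

# Thing and DefinitionOf are defined by the application itself
class WhatIs(QuestionTemplate):
    regex = Lemma("what") + Lemma("be") + \
        Question(Pos("DT")) + Thing() + Question(Pos("."))

    def interpret(self, match):
        return DefinitionOf(match.thing)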

This Python code is great in several respects. First, it is very easy to construct question types in this form. Second, it is expressive and readable, so you can see at a glance which types of questions you already cover. The fixed-type DSL closely resembles Freebase.

class IsPerson(FixedType):
    fixed_type = "/people/person"
    fixed_type_relation = "/type/object/type"

Apps: Gluing it all together

quepy startapp myapp

This closely resembles Django.

import quepy

app = quepy.install("myapp")
question = "what is love"
target, query, metadata = app.get_query(question)
db.execute(query)  # db stands for your own database client

The Good Things

Future Directions

PyAlgoViz: Python Algorithm Visualization in the browser

Presenter: Chris Laffra

He presents a very nice visualization environment, PyAlgoViz, where any developer can visualize the execution steps of an algorithm, or the algorithm in general, in a visually rich way. This opens up the black box of algorithms and makes them more memorable through visualization, and it is a nice way to learn them in the first place. Debugging would also be easier in this type of environment if your algorithm does not produce the output that you expect.

Building an Army of Data Collecting Bots in Python

Presenter: Freedom Dumlao

This presentation focuses more on architecture than anything else. He took a real-world problem and tried to tackle it. Architecting the whole framework bit by bit was a nice way to build the system.

I really like this presentation style (also used by James Horey), where the solution is presented as a story (here, about his friend, a beginner Python developer). This makes it easier to understand the problem, the domain, the context and how the proposed method solves the given problem.

OpenGraph Protocol

RQ

They used RQ to handle job queues with Redis to store the files.

3 Services

  1. ElasticSearch (used like a database) => qbox.io
  2. Task Queue (RQ & Redis) => Redis to Go
  3. Compute (Heroku)

Rules to keep in mind

  1. Don’t DoS your neighbors
  2. Check robots.txt to see whether you are permitted
  3. Don’t be a jerk.
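The first two rules can be checked with the standard library alone. The sketch below is my own and not code from the talk: the robots.txt content, the `mybot` agent name and the example.com URLs are made up, and the file is parsed from a string so that nothing touches the network.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, parsed from a string instead of being downloaded.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_fetch_plan(parser, agent, urls, default_delay=1.0):
    """Return the URLs the agent may fetch and the delay to keep between requests."""
    delay = parser.crawl_delay(agent) or default_delay
    allowed = [url for url in urls if parser.can_fetch(agent, url)]
    return allowed, delay


parser = build_parser(ROBOTS_TXT)
allowed, delay = polite_fetch_plan(
    parser,
    "mybot",
    ["http://example.com/public/page", "http://example.com/private/data"],
)
print(allowed, delay)
```

A real bot would call `time.sleep(delay)` between requests to the same host, which takes care of the “don’t DoS your neighbors” part.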

Ferry: Share and Deploy Big Data Applications with Docker

Presenter: James Horey

I had the chance to talk with him on Saturday; his presentation was, like him, quite awesome. If you want to understand what Docker is or why it is necessary or beneficial, definitely watch his talk. It is a beginner-level talk on infrastructure. It is also one of the most organized presentations that I saw, so it is very easy to follow. He eventually introduces Ferry, his library that makes it easy to manage containers.

Docker

Docker allows public images to be shared freely.

Using Docker gives you the advantages of isolated environments with very low overhead, though it has disadvantages as well.

Cassandra

Ferry

Advantages of using Ferry

Crushing the Head of the Snake

Presenter: Robert Brewer

It was an optimization talk in which he walks through what to do to improve the speed of your code.

timeit’s default number of iterations is 1,000,000, and nobody wants to wait that long to profile. It also has its own overhead, so for benchmarking, that needs to be subtracted from the end result.

from timeit import Timer

Timer("range(1000)").repeat(3)

In this run, take the minimum of the three, as the other timings may include overhead from background or other processes.
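Putting these timing tips together, a small helper might look like the sketch below. The `best_time` function, its default arguments and the statement being timed are my own choices for illustration, not code from the talk.

```python
import timeit


def best_time(stmt, setup="pass", repeat=3, number=10000):
    """Return the fastest per-call time in seconds over several repeated runs."""
    timings = timeit.repeat(stmt, setup=setup, repeat=repeat, number=number)
    # The minimum is the run least disturbed by other processes.
    return min(timings) / number


per_call = best_time("sum(range(100))")
baseline = best_time("pass")  # rough cost of the timing loop itself
net = max(per_call - baseline, 0.0)

print("per call: {:.3e}s, loop overhead: {:.3e}s, net: {:.3e}s".format(per_call, baseline, net))
```

Passing a smaller `number` than the default one million keeps each measurement short, while `repeat` still gives several runs to take the minimum from.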

Some Rules of Thumb

Parallelization

from multiprocessing import Pool

# run_one, segments and stddev come from the code being optimized
def run():
    results = Pool().map(run_one, range(segments))
    result = stddev(results)
    return result

Introducing x-ray: extended arrays for scientific datasets

Presenter: Stephan Hoyer

X-ray is a library that provides a rich data structure for multidimensional data such as weather data.

Why is it necessary?

5 dimensional weather forecasts:

What do we want?

Xray Design Goals

  1. Labeled, N-dimensional arrays for scientific data
  2. Don’t reinvent the wheel
    • reuse pandas indexing
    • copy the pandas API
    • Interoperate with the rest of the scientific stack.
    • Implement a proven data model.

Xray Data Model

Pointers

Generators Will Free Your Mind

Presenter: James Powell

James shows us why using generators may change our programming style, or at least how we approach problems. I think a better title for this talk would be “Functional Programming Will Free Your Mind”, as most of the concepts and a good portion of the talk are about why functional programming is better than the imperative style. Some of the code he writes, especially how he tackles the formatting problem, is quite eye-opening. In particular, I have come to like adding predicates instead of if/else branches whenever I want to add a feature to existing code. This not only makes fewer assumptions about the data that you receive but also makes the code easier to read and maintain. The part on generators and lazy evaluation that he stresses is also great: instead of giving me all of the data, give me as much as I want. If I want to process the data in batches, let me handle that myself (fewer assumptions, better memory efficiency). It was quite a great talk.
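Here is a minimal sketch of those two ideas, injected predicates and caller-controlled laziness. It is my own illustration rather than code from the talk; `read_records`, `batches` and the sample lines are hypothetical.

```python
from itertools import islice


def read_records(lines, keep=lambda line: True):
    """Lazily yield stripped lines that satisfy the injected predicate."""
    for line in lines:
        line = line.strip()
        if keep(line):
            yield line


def batches(iterable, size):
    """Lazily group any iterable into lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


raw = ["alpha\n", "\n", "# comment\n", "beta\n", "gamma\n"]

# A new filtering rule is a new predicate, not a new branch inside read_records.
not_blank_or_comment = lambda line: bool(line) and not line.startswith("#")

records = read_records(raw, keep=not_blank_or_comment)
first_two = list(islice(records, 2))  # take only as much as we need
in_pairs = list(batches(read_records(raw, keep=not_blank_or_comment), 2))

print(first_two)
print(in_pairs)
```

The producer never decides how much data to hand over or how to group it; the caller does, with `islice` or `batches`.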

Themes

Functional Programming

Modalities could be injected through functional programming.

Template Usage

template = '{region}: {align} - {profit}'.format
line = template(region=region, profit=profit, align=align)

What are Generators & Coroutines?

def fib(a=1, b=1):
    while True:
        yield a
        a, b = b, a + b

from itertools import islice
list(islice(fib(), 10))

Nice Util Functions for Functional Programming

from itertools import islice, takewhile, tee

# nwise yields sliding windows of n consecutive items (the built-in zip is lazy on Python 3)
nwise = lambda g, n: zip(*(islice(g, i, None) for i, g in enumerate(tee(g, n))))
# ratio yields the quotient of each pair of consecutive items
ratio = lambda g: (y / x for x, y in nwise(g, 2))

print(list(islice(ratio(fib()), 5)))
print(list(islice(fib(), 10)))
print(list(takewhile(lambda x: x < 10, fib(1))))

Pointers

How to Spy with Python

Presenter: Lynn Root

This talk was, for the most part, about how to sniff packets with Python. Beyond that, she presented great historical context on what the NSA is and how it may be conducting surveillance, using metadata and whatnot. This not only provides much richer context than what is usually available, but also shows where it started and where it is going.

Prism

What is XKeyScore?

Technologies

Wrap-Up

Pointers

Hustle: A column Oriented, Distributed Event Database

Presenter: Tim Spurway

Hustle is another distributed database. It is relational and claims to do super fast queries and distributed writes (inserts). It is an interesting concept, and its query language is Python-based. Having seen Python serve as the DSL of several data processing frameworks, it was not a big surprise for me to see a database that depends solely on Python for all querying tasks.

Hustle Features

Column Orientation

Insert into Hustle

Query Hustle

Hustle Query Execution

Pointers

Socialite: Python Integrated Query Language for Big Data Analysis

Presenter: Jiwon Seo

This is another big data processing framework, whose query language can be integrated into Python seamlessly: you can use Python variables in the query language and vice versa. It claims to be faster than GraphLab, though not necessarily faster than Spark, I guess. It does its computation in memory, so in that respect it differs from Hadoop-based frameworks.

Why Another Big Data Platform?

Problems in existing platforms:

Introducing Socialite

Socialite is a high level language:

Outline

Distributed In-Memory Tables

Python Integration

print("This is Python code")
`Foo[int i](String s)
Foo[i](s) :- i=41, s="the answer"`
v="Python variable"
`Foo[i](s) :- i=44,s=$func`
for i, s in `Foo[i](s)`:
    print(i,s)

System Optimizations

Summary

Pointers
