Machine Learning Newsletter

Pydata Silicon Valley 2014 Part 2

Pydata Silicon Valley 2014 The Pydata Silicon Valley videos are put into Youtube, thanks to Facebook, Continuum and Parse.ly. Check it out. They are pretty great.

I already wrote about(kind of wrap-up) the presentations that I attended here. This one will be similar to the ones that I could not attend but watch the videos.

I somehow managed to miss some of the most interesting talks but could follow the presentation much better through their videos on Youtube than real-life so it kind of worked better than I expected.

ggplot

Presenter: Greg Lamb

As I already mentioned in the first part, this tutorial introduces ggplot visualization library which is based on Grammar of Graphics.

What is ggplot?

  • Based on Grammar of Graphics
  • Implementation in Python(from R implementation, ggplot2)

Matplotlib

Advantages
  • Mature, it has been around for some time
  • Integration with Ipython Notebook
  • Customizable
  • Community
Disadvantages
  • API, learning curve, syntax, default themes

d3.js is like a Mona Lisa painting where ggplot is like a camera. You buy the painting to show off where camera is more like a utility.

This comparison quite good, generally the plots(either ggplot or matplotlib) are either ad-hoc analysises or for internal usage. If you are producing something consumer facing, you should be using much more suited library for the job. From the conference, I actually used ggplot for couple of projects and I liked its syntax and api as well as the integration with pandas. From a 2-3 years matplotlib user, I think it will be great library when it reaches maturity. However, I think ggplot is more a competitor to seaborn rather than matplotlib. Matlotlib is a plotting library in a much broader sense(it has support for maps, images, 3D visualizations as well) where ggplot and seaborn are more targeted to a specific subset of visualizations. The subset is quite important, though.

That being said, it attacks the disadvantages of matplotlib quite well. It has great API, concise syntax and much better default themes.

Syntax

Especially, syntax of ggplot is great.

p = ggplot(mtcars, aes(x=wt, y=mpg))

This is what step. This is where you say I want to bind wt to x axis and mpg to y axis. Then, in the how step, you will say how

p + geom_point()

This produces a scatterplot. If you want to produce a density graph, you could change it p + geom_density() without changing what step. This is pretty great. If you want to put also an averaging window over the variable, p + geom_point() + stat_smooth().

This layered approach provides couple of benefits. All of the layers can be treated independently as visualization is mainly defined in what step rather than how. Therefore, you could easily experiment and play with different layers on your data visualization. Icing on the cake, faceting becomes very easy with ggplot as it has a faceting utility out of the box. I am big fan of faceting, in order to understand some categorical variable on the parameters and to be able to see the effect, so it is a great tool to do exploratory data analysis. If you want to learn more, check out the documentation.

Functional Performance with Core Data Structures

Presenter: Matthew Rocklin

The premise of this talk is the following two:

  1. Pure Python is not slow
  2. Python is a decent language.

In order to prove the first one, he provides benchmarks to show that Python is not “that” slow comparing to other languages. For second one, he introduces a functional programming library Toolz

The library is actually pretty great. He was influenced by Clojure which I wrote how great it is here. The library promotes laziness, functions, pure functions, composability; all of the good stuff that are common in functional programming languages. Surprisingly, some of the functions are actually same with standard library function like; groupby in itertools, filter and he claimed functions are actually faster than standard library equivalents. Some of the functions are not in the Python like take, interleave, or pipe. pipe is pretty good, it provides a similar mechanism with ->> clojure where you could chain a bunch of functions and does not sacrifice readability and execution order.

pipe(3, double, str)

to get the string representation of 3.0. This is a great way to compose functions.

Another example he gave is much more impressive:

pipe(data/tale-of-two-cities.txt, open,
                                    map(str.strip),
                                    map(str.split),
                                    concat,
                                    map(stem),
                                    frequencies)

This is pretty great.

He also gave another library direction which implements the same functionality of Toolz in C, which is more efficient, called unsurprisingly Cytoolz.

I am quite happy to see there are great libraries like these two to make at least a dent for functional programming in Python.

He then summarizes why Functional programming is becoming more relevant as following:

  • Good software practices
  • Composition promotes scalable software systems
  • Purity, composition promotes testing
  • Sequence abstractions support streaming computation(if you have low memory, you have no other choice anyway)
  • Purity, serializability promotes parallel and concurrent processing.

Querying your database in natural languages

Presenter: Daniel Moisset

This is one of the talk that I regretted most that I did not attend. Not only it is quite interesting on its own, one of previous project that I did(querying yelp restaurants with natural language) is directly related.

Generally, speaking or natural language in writing provides a much easy interface to user. However, the systems that process these inputs are in their early periods so they cannot really process the inputs very effectively. Therefore, you’d see all those radiobuttons, sliders, checkboxes and text input area for user to enter structured information rather than user typing “show me the closest five restaurants that serve Turkish food and accepts credit card”. Question types vary quite a lot for different people and even the order, verbs and nouns are not something most people agree on. Therefore, most companies provide “intuitive” interfaces to get structuresd data, to make it easier sometimes they help, too.(autocomplete) Quepy attacks the problem in a somehow different angle. First, it tries to find what is being filtered and then, try to return the query based on the filtering parameters. As I mentioned later, tackling the question variation in a regular expression way is somehow limited but since building different of versions of questions is easy, it kind of compensates its limitation by providing ease of use. Other than that, installing and integrating the library in your application is quite easy.(some of the commands resemble Django as well). It is a very nice library.

Introduction

  • Data is everywhere
  • Collecting data is not the problem but what to do with is the real problem
  • Any operation starts with selecting and filtering data.

Natural Language Queries

  • Very accessible, trivial to learn
  • Weak is mos applications as it has limited list of queries that it can handle

Quepy

  • They propose Quepy which is an open-sourced framework to translate natural language questions into a database query langauge.
  • Support for database query language is somehow limited(SparQL and MQL).
Approach
  1. First it parses the query
  2. Matches and creates an intermediate representation
  3. Then, based on representation, it generates query and Domain Specific Language
Parsing
  • It is done in a word level
    • is: is/be/VBZ
    • swallows: swallows/swallow/NNS
  • Question Rule => Regular expressions
    • Token(“what”) + Lemma (“be”) + Question Pos(“DT”) + Pos(“NN”) => The word “what” followed by any verb “of to be” veryb, optionally followed by a word determines the count(“all”, “every”), followed by one or more nouns.

In my opinion, the biggest disadvantage is question rule. As the regular expressions are very limited in terms of structure, and it seems only could “answer” Wh questions for specific forms. “Give me”, “Find me” order forms, yes-no questions, subordinate clauses(this corresponds to “where” in sql langauge) do not exist. Otherwise, library especially in terms of usage seems very good. However, various question forms could be easily added to the question types using QuestionTemplate. However, I was still expecting, very common question type support out of the box for a natural langauge translator to database query.(Only this one is quite hard problem to be fair)

Intermediate Representation
  • Graph-like, with some known values and some holes(x0, x1, x2). Always has a “root”.
  • Similar to knowledge databases(like freebase)
  • Easy to build from python code.
DSL
  • Single concepts with fixed relations or entities.
class WhatIs(QuestionTemplate):
    regex = Lemma(what) + Lemma(be) + \
        Question(POS(DT)) + Thing() + Question(POS(.))

    def interpret(self, match):
        return DefinitionOf(match, thing)

This Python code is great in terms of several aspects. First is very easy to construct question types in this from. Second, it is very expressive and readable what type of question types that you cover if you want to see what types of questions you already covered. For fixed type DSLs, it resembles a lot like freebase.

class IsPerson(FixedType):
    fixed_type = /people/person
    fixed_type_relation = /type/object/type

Apps: Gluing it all together

  • You build a Python package with
quepy startapp myapp
  • There you add dsl and questions templates
  • Then configure it editing myapp/settings.py(output query language, data encoding)

Resembles a lot like Django.

app = quepy.install(myapp)
question = what is love
target, query, metadata = app.get_query(question)
db.execute(query)

The Good Things

  • Effort to add questions template is small and the benefit is linear with respect to effort.
  • Good for industry applications.
  • Low specialization requireed to extend
  • Human work is very parallelizable
  • Easy to get many people to work on questions.
  • Better for domain specific databases.
Future Directions
  • Testing this under databases.
  • Improving performance
  • Collecting uncovered questions, add machine learning to learn new patterns(Yes, especially trying to cover all of the forms from a set of question forms would be very valuable)

PyAlgoViz: Python Algorithm Visualization in the browser

Presenter: Chriss Laffra

He presents a very nice visualization environment PyAlgoViz where any developer could visualize execution steps of the algorithm or algorithm in general in a visualization rich way. This kind of opens up the algorithms black box and makes them more memorable through visualizations as well as nice way to learn what they are in the first place. Debugging would be also easier in this type of environment if your algorithm does not produce the output that you are expecting.

Building an Army of Data Collecting Posts in Python

Presenter: Freedom Dumlao

This presentation’s focus is more on architecture than anything else. He took a real-world problem and try to tackle it. Architecturing the whole framework bit by bit was a nice approach to build the system.

I really like the presentation style(along with James Horey) where they present the solutions in a story.(his friend and a beginner Python developer) This makes somehow easier to both understand the problem, domain, contextand how the proposed method solves the given problem.

OpenGraph Protocol

  • This protocol enables any webpage to become a rich object in a social graph.
  • The reason when you share something, open graph title, url and image is used in the the social media that you are sharing with.

RQ

They used RQ to handle job queues with Redis to store the files.

3 Services

  1. ElasticSearch (it is used like a database, ), qbox.io
  2. Task Queue (RQ & Redis) => Redis to Go
  3. Compute (Heroku)

Rules to keep in mind

  1. Don’t do DOS to your neighbors
  2. Look at robots.txt if you are permitted
  3. Don’t be a jerk.

Ferry Share and Deploy Big Data Applications with Docker

Presenter: James Horey

I had the chance to talk with him on Saturday, his presentation was also like him, quite awesome. If you want to understand what Docker or why it is necessary/beneficial, definitely watch his talk. It is a beginner talk for infrastructure. This is also one of the most organized presentation that I saw as well so it is very easy to follow as well. He eventually introduces his library managing containers Ferry, which makes it easy to manage the containers.

Docker

  • Encapsulates applications in isolated containers.
  • Makes it easy and safe to distribute applications.
  • Easy to get started.
  • Nice way to isolate the runtime environment.

Docker allows public images to be shared freely.

Using docker gives isolated environments advantages and very low overhead and disadvantages.

Cassandra

  • Highly scalable and fault-tolerant.
  • Great for storing streaming data(sensors, messages)

Ferry

  • Specify the containers that constitue your application in YAML.
  • Support for Hadoop, Cassandra, GlusterFS, and OpenMPI.
  • It’s a little bit like pip for your Docker-based runtime environment.

Advantages of using Ferry

  • Even simple applications can be complicated to install and run.
  • Docker solves this by encapsulating the entire runtime environment in lightweight containers.
  • Ferry enables plugging in external big data dependencies.

Crushing the Head of the Snake

Presenter: Robert Brewer

It was an optimization talks where he walks through what to do improving the speed of the code.

timeit’s default argument is 1000000, nobody wants to wait that long to profile. It has also overhead, so for benchmarking, that needs to be subtracted from the end result.

Timer(xrange(1000)).repeat(3)

In this run, take the minimum of three as the others may have some other background or some other overhead processes are added.

Some Rule of Thumbs

  • Use Python builtins. As they are written in C, the ones that you will be using from external libraries would be slower(written in Python).
  • Reduce function calls. There is some overhead in calling functions in Python.
  • Vector operations can be done in Numpy.

Parallelization

from multiprocessin import Pool
def run():
    results = Pool().map(run_one, range(segments))
    result = stddev(results)
    return result

Introducing x-ray: extended arrays for scientific datasets

Presenter: Stephan Hoyer

X-ray is a library that provides a rich datastructure for weather data which handles multidimensional data.

Why it is necessary?

5 dimensional weather forecasts:

  • labeled
  • big
  • binary
  • high-dimensioanl
  • not very natural to put in a pandas dataframe

What we want?

  • Like numpy.ndarray, but better
  • N-dimensional
  • Handles missing data
  • Labeled dimensions(axes)
    • “time”, “latitude”, “longitude”
    • not “index” and “columns”
  • Labeled coordinates(ticks)
  • Lazy array acess(out of memory data)

Xray Design Goals

  1. Labeld, N-dimensional arrays for scientific data
  2. Don’t rebuild the wheel
    • reuse pandas indexing
    • copy the pandas API
    • Interoperate with the rest of the scientific task.
    • Implement a proven data model.

Xray Data Model

  • DataArray

    • N-dimensional
    • labeled dimensions (axes)
    • labeled coordinates (indices)
    • homogeneous dtype
  • Dataset

    • a dict-like container of DataArrays
    • shared dimensions and coordinates
  • Both keep track of arbitrary metadata (attributes)

Pointers

Generators Will Free Your Mind

Presenter: James Powell

James shows us why usage of generators may change our programming style or at least how we approach problems. I think a better title for this talk “Functional Programming Will Free Your Mind” as most of the concepts and some portion of talk is about why functional programming is better than imperative style. Some of the code he writes, especially how he tackles the formatting problem is quite mind-opening. Especially, I come to like adding predicates instead of if else into code whenever I want to add some feature to the existing code. This not only makes less assumptions on the data that you are getting but also makes it easier to read and more maintainable code. Generators, lazy evaluations part that he stresses on is also great. Instead of giving me the all data give me as much as I want. If I want to process the data in a batch fashion, let me handle that for myself.(less assumption, memory efficiency). It was a quite great talk.

Themes

  • Generators are a very useful way of modeling problems in Python
  • The ways we model problems using generators is fundamentally different

Functional Programming

Modalities could be injected through functional programming.

Template Usage
template = {region}: {align} - {profit}.format
line = template(region=region, profit=profit, align=align)

What is generator & Coroutine?

def fib(a=1, b=1):
    while True:
        yield a
        a, b = b, a + b

from itertools import islice
list(islice(fib(), 10))

Nice Util Functions for Functional Programming

from itertools import islice, izip, tee
nwise = lambda g, n: izip(**islice(g, i, None) for i, g in enumerate(tee(g, n)))
ratio = lambda g: (y/x fro x,y in nwise(g, 2))
list(islice(ratio(fib(), 5)))
from itertools import takewhile, dropwhile
print(list(islice(fib(), 10)))
print(list(takewhile(lambda x: x < 10, fib(1)>)))
Pointers

How to Spy with Python

Presenter: Lynn Root

This talk was about how to sniff packages around Python, at least for most part of it. Otherwise, she presented a great historical context what an NSA is, how it may be doing the surveillance, using metadata, and whatnot. This not only provides a much more richer context than what is available, but also provides where it started and where it is growing/going.

Prism

  • Planning tool for Resource Integration, Synchronization and Management
  • Mines electronic data for the purpose of mass surveillance.
  • Collects intelligence that passes through US servers
  • Targets foreigners, but is elusive about data on US citizens.
  • Only collects metadata(supposedly).

What is XKeyScore?

  • Digital Network Intelligence Exploitation System
  • Federated Query System of completely unfiltered data
  • 500-700 servers, as of 2008
  • Gives users ability to query for email addresses, a target’s activity, phone numbers, HTTP traffic, extract file attachments, etc.

Technologies

  • scapy: packet sniffing & manipulation
  • pygeoip: API for GEOIP databases
  • geojson: bindings & utilities for GeoJSON
  • python-nmap: wrapper around nmap port scanner

Wrap-Up

  • Long historical precedence for mass surveillance with very little oversight or restriction.
  • Safe to assume they spy everyone.

Pointers

Hustle: A column Oriented, Distributed Event Database

Presenter: Tim Spurway

Hustle is another distributed database. It is relational and claims to do super fast queries and distributed writes(inserts). It is an interesting concept and its domain query language is Python based. Seeing Python to be different DSL’s of data processing frameworks, it was not a big suprise for me seeing a database solely depends on Python for all querying tasks.

Hustle Features

  • Open Source
  • Column Oriented, pipelined execution = Fast queries
  • Embarrasingly distributed => Disco MIR pipelines, DDFS
  • Embarrasingly fast = LMDB bt Tree
  • Relational = Join Large Tables
  • Partitioned = Smart, Automatic Sharding
  • Query Language: Python DSL
  • Advanced Compression; store even more
  • Command Line Interface: Ad-hoc queries

Column Orientation

  • As opposed to traditional row oriented DBMSes
  • Typical use-case is extremely large datasets.
  • Very efficient column aggreggation
  • Excellent compression characteristics.
  • Works well with Bitmap indices

Insert into Hustle

  • Distributed inserts - support massive write overload(millions recors/sec)
  • Two steps:
    • Create standalone Marble file (LMDB)
    • Push Marble to DDFS (cheap, cluster-to-local)
  • An insert can create and push any number of partitioned Marbles.
  • CSV, JSON, fixed field supported
  • Bulk append semantics - pay careful attention to ETL.

Query Hustle

  • Python DSL - operator overloads in where clause
  • Aggregating Functions/ Grouping (h-sum, h-count, h-avg)
  • Efficient, distributed join
  • Order, limit, desc, distinct
  • Nested queries
  • Extensible query language

Hustle Query Execution

  • Hustle queries are distributed Disco pipeline jobs
  • Query optimizer and replication minimize data movement in cluster.

Pointers

Socialite Python Integrated Query Langauge for Big Data Analysis

Presenter: Jiwon Seo

Another big data processing framework and its query language could be integrated into Python seamlessly as you could use the variables of Python in the query language or vice versa. It claims to be faster than Graphlab and Not necessarily than Spark I guess, it is memory-in computation, so in that aspect it differs from Hadoop-based frameworks.

Why Another Big Data Platform?

Problems in existing platforms:

  • Too difficult(low-level primitives)
  • Inefficient(not network bound)
  • Too many (sub) framework

Introducing Socialite

Socialite is a high level language:

  • Easy and efficient
  • Compiled to optimized code
  • Python integration(Jython)
  • Good for
    • Graph Analysis
    • Data Mining
    • Relational Queries

Outline

  • Concepts in Socialite
    • Distributed Tables
    • Python Integration (Jython and Python)
  • System overview
  • Analysis Algorithms
    • Shortest paths, PageRank
    • K-Means, Logistic Regression
  • Evaluation
Distributed In-Memory Tables
  • Primary data structure in Socialite
  • Column oriented storage

Python Integration

  • Socialite queries in Python code
    • Queries are quoted in backtick
  • Python functions, variables are accessible in Socialite queries.
  • Socialite tables are readable from Python.
print(This is Python code)
`Foo[int i](String s)
Foo[i](s) :- i=41, s=the answer”`
v=Python variable
`Foo[i](s) :- i=44,s=$func`
for i, s in `Foo[i](s)`:
    print(i,s)

System Optimizations

  • Custom Memory Allocator (temporary table)
  • Optimized Serialization
  • Direct ByteBuffer (network Buffer)
  • Multiple network channels among workers

Summary

  • Distributed query language
  • Integration with Python
  • Easy and efficient
  • Algorithms in Socialite
  • Competitive Performance

Pointers

comments powered by Disqus