Machine Learning Newsletter

I wish I knew these things when I learned Python

I often catch myself wondering how I could not have known a simpler way of doing some particular thing in Python 3. When I go looking for a solution, I usually find code that is more elegant, more efficient and less bug-prone. Across everything I have collected (not just this post), there are far more of these things than I expected or would like to admit, but here is the first crop of features that were not obvious to me and that I learned later while looking for more efficient, simpler and more maintainable code.

Dictionary Stuff

Dictionary keys() and items()

You can do various interesting operations on the keys and items of a dictionary, because the views returned by keys() and items() are set-like.

aa = {'mike': 'male', 'kathy': 'female', 'steve': 'male', 'hillary': 'female'}
bb = {'mike': 'male', 'ben': 'male', 'hillary': 'female'}

aa.keys() & bb.keys() # {'mike', 'hillary'} # these are set-like
aa.keys() - bb.keys() # {'kathy', 'steve'}
# If you want to get the common key-value pairs in the two dictionaries
aa.items() & bb.items() # {('mike', 'male'), ('hillary', 'female')}

Pretty neat!
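As a small sketch of where this helps in practice: key views also combine with ordinary sets, which makes it easy to drop or collect keys. The `excluded` set below is a hypothetical example, not something from the original data. Note that items() views are only set-like when all the values are hashable.

```python
aa = {'mike': 'male', 'kathy': 'female', 'steve': 'male', 'hillary': 'female'}
bb = {'mike': 'male', 'ben': 'male', 'hillary': 'female'}

# Key views combine with plain sets, not just with other views.
excluded = {'mike', 'steve'}            # hypothetical names to drop
remaining_keys = aa.keys() - excluded   # a plain set: {'kathy', 'hillary'}

# Rebuild a dictionary from the surviving keys.
filtered = {name: aa[name] for name in remaining_keys}

# Union gives every key that appears in either dictionary.
all_names = aa.keys() | bb.keys()
```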

Checking Existence of a key in dictionary

How many times have you written the following code?

dictionary = {}
for k, v in ls:
    if k not in dictionary:
        dictionary[k] = []
    dictionary[k].append(v)

This is not so bad, but why write the if statement every time?

from collections import defaultdict
dictionary = defaultdict(list) # defaults to list
for k, v in ls:
    dictionary[k].append(v)

Much cleaner, and there is no if statement obscuring the code.
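One thing worth knowing before adopting this everywhere, shown here as a minimal sketch with made-up data: a defaultdict inserts a key as soon as you read it, which can surprise code that later checks membership. If that matters, setdefault on a plain dict is one alternative.

```python
from collections import defaultdict

ls = [('fruit', 'apple'), ('veg', 'kale'), ('fruit', 'pear')]  # hypothetical data

grouped = defaultdict(list)
for k, v in ls:
    grouped[k].append(v)
# grouped == {'fruit': ['apple', 'pear'], 'veg': ['kale']}

# Caveat: merely reading a missing key creates it with the default value.
_ = grouped['meat']
created_by_lookup = 'meat' in grouped   # True

# A plain dict with setdefault groups the same way without that side effect.
plain = {}
for k, v in ls:
    plain.setdefault(k, []).append(v)
```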

Update a dictionary with another dictionary

from itertools import chain
a = {'x': 1, 'y': 2, 'z': 3}
b = {'y': 5, 's': 10, 'x': 3, 'z': 6}

# Update a with b
c = dict(chain(a.items(), b.items()))
c # {'x': 3, 'y': 5, 'z': 6, 's': 10}

Well, this works, but it is not very concise. Let’s see if we can do better:

c = a.copy()
c.update(b)

Much cleaner and more readable!
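On newer Python versions the same merge fits in a single expression. This is a sketch assuming Python 3.5 or later for dictionary unpacking; the `a | b` form mentioned in the comment needs Python 3.9 or later.

```python
a = {'x': 1, 'y': 2, 'z': 3}
b = {'y': 5, 's': 10, 'x': 3, 'z': 6}

# Unpack both dictionaries into a new one; later entries win,
# so values from b override those from a. Neither input is modified.
c = {**a, **b}
# c == {'x': 3, 'y': 5, 'z': 6, 's': 10}

# On Python 3.9+ the merge operator gives the same result: a | b
```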

Getting the maximum from a dictionary

If you want to get the maximum value in a dictionary, it is straightforward:

aa = {k: sum(range(k)) for k in range(10)}
aa # {0: 0, 1: 0, 2: 1, 3: 3, 4: 6, 5: 10, 6: 15, 7: 21, 8: 28, 9: 36}
max(aa.values()) #36

This works, but if you also need the key, you have to do another lookup based on the value. Instead, we can pair each value with its key via zip and take the maximum of the pairs:

max(zip(aa.values(), aa.keys()))
# (36, 9) => value, key pair

Similarly, if you want to traverse the dictionary from the maximum value to the minimum, you can do the following:

sorted(zip(aa.values(), aa.keys()), reverse=True)
# [(36, 9), (28, 8), (21, 7), (15, 6), (10, 5), (6, 4), (3, 3), (1, 2), (0, 1), (0, 0)]
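An alternative worth sketching is the key argument of max and sorted, which compares by value without building (value, key) tuples. One difference from the zip version: when two values tie, zip falls back to comparing the keys themselves, whereas key=aa.get keeps the dictionary's own order.

```python
aa = {k: sum(range(k)) for k in range(10)}

# Key with the largest value: iterate over keys, compare by looked-up value.
best_key = max(aa, key=aa.get)                        # 9

# Key-value pair with the largest value.
top_key, top_value = max(aa.items(), key=lambda kv: kv[1])   # 9, 36

# Keys ordered from largest value to smallest; ties keep insertion order.
ranked = sorted(aa, key=aa.get, reverse=True)
# [9, 8, 7, 6, 5, 4, 3, 2, 0, 1]
```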

Unpacking Arbitrary Number of Items in a List

We can capture an arbitrary number of items into a list using the * magic:

def compute_average_salary(person_salary):
    person, *salary = person_salary
    return person, (sum(salary) / float(len(salary)))

person, average_salary = compute_average_salary(['mike', 40000, 50000, 60000])
person # 'mike'
average_salary # 50000.0

This is not very interesting on its own, but what if I told you that you could do the following as well:

def compute_average_salary(person_salary_age):
    person, *salary, age = person_salary_age
    return person, (sum(salary) / float(len(salary))), age

person, average_salary, age = compute_average_salary(['mike', 40000, 50000, 60000, 42])
age # 42

This is pretty neat.

Consider a dictionary with string keys and lists as values. Instead of traversing the dictionary and processing the values sequentially, one can use a flat representation (a list of lists) like this:

# Instead of doing this
for k, v in dictionary.items():
    process(v)

# we separate the head from the rest and process the values
# as a list, similar to the above; head plays the role of the key
for head, *rest in ls:
    process(rest)

# if not very clear, consider the following example
aa = {k: list(range(k)) for k in range(5)} # range is lazy, so materialize it with list()
aa # {0: [], 1: [0], 2: [0, 1], 3: [0, 1, 2], 4: [0, 1, 2, 3]}
for k, v in aa.items():
    print(sum(v))

#0
#0
#1
#3
#6

# Instead
aa = [[ii] + list(range(jj)) for ii, jj in enumerate(range(5))]
for head, *rest in aa:
    print(sum(rest))

#0
#0
#1
#3
#6

You could unpack the list into head, *rest, tail as well.
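A minimal sketch of that three-way split, using a hypothetical record laid out as name, salaries, age:

```python
record = ['mike', 40000, 50000, 60000, 42]   # hypothetical: name, salaries..., age

head, *rest, tail = record
# head == 'mike', rest == [40000, 50000, 60000], tail == 42

# The starred name always receives a list, even when nothing is left for it.
first, *middle, last = [1, 2]
# middle == []
```

Unpacking raises ValueError if there are not enough items for the non-starred names, so a one-element list would fail here.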

collections: Gotta love Counter

collections is one of my favorite modules in the Python standard library. If you need data structures beyond the built-in ones, you should take a look at it.

An essential part of my daily job is counting things, mostly words but not always. You may be tempted to say that you could build a dictionary with words as keys and numbers of occurrences as values, and I would agree with you if I did not know about Counter in collections (yes, the introductory blurb is there because of Counter).

Let’s say you read the Wikipedia page for the Python programming language into a string and then convert it into a list (by tokenizing, well, sort of):

import re
# python_string holds the text of the page
word_list = list(map(lambda k: k.lower().strip(), re.split(r'[;,:(\.\s)]\s*', python_string)))
word_list[:10] # ['python', 'is', 'a', 'widely', 'used', 'general-purpose', 'high-level', 'programming', 'language', '[17][18][19]']

So far, so good. But if I want to count the words in this list:

from collections import defaultdict # again, collections!
dictionary = defaultdict(int)
for word in word_list:
    dictionary[word] += 1

This is not so bad, but with Counter you can spend your time on more meaningful tasks.

from collections import Counter
counter = Counter(word_list)
# Getting the most common 10 words
counter.most_common(10)
# [('the', 164), ('and', 161), ('a', 138), ('python', 138),
# ('of', 131), ('is', 102), ('to', 91), ('in', 88), ('', 56)]
list(counter.keys())[:10] # keys() is a view in Python 3, so convert it before slicing
# ['', 'limited', 'all', 'code', 'managed', 'multi-paradigm',
# 'exponentiation', 'fromosing', 'dynamic']

This is pretty neat, but look at the methods available on a counter:

dir(counter)
# (this listing comes from a Python 2 session; Python 3 no longer has __cmp__, has_key,
# iteritems, iterkeys, itervalues, viewitems, viewkeys or viewvalues)
['__add__', '__and__', '__class__', '__cmp__', '__contains__', '__delattr__', '__delitem__', '__dict__',
'__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__',
'__init__', '__iter__', '__le__', '__len__', '__lt__', '__missing__', '__module__', '__ne__', '__new__',
'__or__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__',
'__str__', '__sub__', '__subclasshook__', '__weakref__', 'clear', 'copy', 'elements', 'fromkeys', 'get',
'has_key', 'items', 'iteritems', 'iterkeys', 'itervalues', 'keys', 'most_common', 'pop', 'popitem', 'setdefault',
'subtract', 'update', 'values', 'viewitems', 'viewkeys', 'viewvalues']

Did you see the __add__ and __sub__ methods? Yes, counters support the + and - operations! So, if you have a lot of text and want to count words, you may not need Hadoop: you can build a bunch of counters in parallel (map) and then sum them together (reduce). You have your MapReduce built on top of Counter. You can thank me later.
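Here is a rough sketch of that idea on a toy scale. The `chunks` list is a made-up stand-in for pieces of a tokenized corpus, and the map step runs sequentially here; in a real setup each chunk could be counted in a separate process.

```python
from collections import Counter
from functools import reduce
from operator import add

# Hypothetical chunks of an already tokenized corpus.
chunks = [
    ['python', 'is', 'a', 'language'],
    ['python', 'is', 'dynamic'],
    ['a', 'language', 'is', 'a', 'tool'],
]

# "Map": count each chunk independently.
partial_counts = [Counter(chunk) for chunk in chunks]

# "Reduce": add the partial counters together.
total = reduce(add, partial_counts, Counter())
# total['is'] == 3, total['a'] == 3, total['python'] == 2
```

Keep in mind that + and - on counters drop entries whose count is zero or negative; use the update and subtract methods if you need to keep them.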

Flattening Nested Lists

The chain function can be used to flatten nested lists. It lives in itertools; collections only imports it internally under the private name _chain, so import it from itertools instead:

from itertools import chain
ls = [[kk] + list(range(kk)) for kk in range(5)]
flattened_list = list(chain(*ls))
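A variant worth knowing is chain.from_iterable, which takes the outer list directly instead of unpacking it into arguments. The sketch below also shows the main limitation: only one level of nesting is removed.

```python
from itertools import chain

ls = [[kk] + list(range(kk)) for kk in range(5)]
# [[0], [1, 0], [2, 0, 1], [3, 0, 1, 2], [4, 0, 1, 2, 3]]

flattened = list(chain.from_iterable(ls))
# [0, 1, 0, 2, 0, 1, 3, 0, 1, 2, 4, 0, 1, 2, 3]

# Only one level is flattened: deeper lists stay as they are.
nested = [[1, [2, 3]], [4]]
one_level = list(chain.from_iterable(nested))
# [1, [2, 3], 4]
```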

Opening two files simultaneously

If you are processing one file (e.g. line by line) and writing the processed lines into another file, you may be tempted to write the following:

with open(input_file_path) as inputfile:
    with open(output_file_path, 'w') as outputfile:
        for line in inputfile:
            outputfile.write(process(line))

However, you can open multiple files in the same with statement:

with open(input_file_path) as inputfile, open(output_file_path, 'w') as outputfile:
    for line in inputfile:
        outputfile.write(process(line))

This is much neater.

Finding Monday from a Date

If you have a date and want to normalize it (say, to the previous or next Monday), you could do the following:

import datetime
previous_monday = some_date - datetime.timedelta(days=some_date.weekday())
# Similarly, you could map to next monday as well
next_monday = some_date + datetime.timedelta(days=-some_date.weekday(), weeks=1)
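To make the arithmetic concrete, here is a small sketch with a hypothetical date. It relies on weekday() returning 0 for Monday through 6 for Sunday.

```python
import datetime

some_date = datetime.date(2024, 5, 15)   # hypothetical example: a Wednesday

# Step back by the number of days since Monday.
previous_monday = some_date - datetime.timedelta(days=some_date.weekday())
# datetime.date(2024, 5, 13)

# The next Monday is exactly one week after the previous one.
next_monday = previous_monday + datetime.timedelta(weeks=1)
# datetime.date(2024, 5, 20)
```

If some_date is already a Monday, previous_monday is the date itself and next_monday is a full week later.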

And that is about it.

Handling HTML

If you scrape websites for fun or profit, chances are that you run into HTML tags from time to time. In order to strip them, you can use html.parser:

from html.parser import HTMLParser

class HTMLStrip(HTMLParser):

    def __init__(self):
        # Let HTMLParser set up its own state (this also calls reset())
        super().__init__()
        self.ls = []

    def handle_data(self, d):
        self.ls.append(d)

    def get_data(self):
        return ''.join(self.ls)

    @staticmethod
    def strip(snippet):
        html_strip = HTMLStrip()
        html_strip.feed(snippet)
        clean_text = html_strip.get_data()
        return clean_text

snippet = HTMLStrip.strip(html_snippet)

If you only want to escape html:

import html
escaped_snippet = html.escape(html_snippet)

# Back to html snippets (html.unescape is new in Python 3.4)
html_snippet = html.unescape(escaped_snippet)
# and so forth ...
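As a quick sketch of the round trip, with a made-up snippet as input:

```python
import html

html_snippet = '<a href="/about">Tom & Jerry</a>'   # hypothetical input

# escape() replaces &, < and >, and by default also the quote characters.
escaped = html.escape(html_snippet)
# '&lt;a href=&quot;/about&quot;&gt;Tom &amp; Jerry&lt;/a&gt;'

# unescape() turns the character references back into the original text.
restored = html.unescape(escaped)
```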