8th NYAS Machine Learning Symposium 2014
I attended the 8th NYAS Machine Learning Symposium, and here are the notes I took at the event. They may contain errors; if you find any, please let me know.
In my personal view, it was weaker than the previous (7th) symposium in both posters and talks; last year's posters and talks were much more interesting to me. That being said, I could not visit all of the posters, so take my word with a grain of salt.
The abstracts are available as a PDF here.
Machine Learning for Powers of Good
by Rayid Ghani
- Probabilistic optimization of limited campaign resources.
- Another important goal: influencing voter behavior and getting voters to engage with the campaign.
- Because prediction itself is not good enough.
- Resource allocation based on who are likely to be influenced. Who are likely to change their mind? Definitely not Texas.
- Voters who are committed to Romney, or who are not going to vote at all, are too much work; do not even attempt to target them.
- Focus on those who are either undecided but likely to vote, or undecided about voting but weakly supporting Obama.
- Data Science for Social Good
Spotlight Talks
Graph-Based Posterior Regularization for Semi-Supervised Structured Prediction:
- Posterior labels for part-of-speech tags; a graph-based approach using the graph Laplacian.
- Structured Prediction => CRF => local scope, features
Graph propagation and CRF estimation => a joint objective to optimize, with a KL-divergence term tying the parameters of the two worlds together (a rough sketch of this kind of objective is below).
Relevant work: Posterior Regularization (PR) by Ganchev et al.
- EM-like algorithm => converges to a local optimum
In her poster she showed that it performs better than both the CRF and the graph-based approach, but she did not compare its speed against them. The method is likely slower than a CRF; I am not very familiar with graph-based approaches, and the joint objective could be quite hard to optimize, so it could easily be more than twice as slow as a CRF.
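As a hedged sketch of what such a joint objective might look like (my reconstruction from the talk, not the authors' exact formulation): a CRF term on the labeled data, a graph-Laplacian smoothness term on auxiliary posteriors q, and a KL term keeping q close to the CRF posteriors.

```latex
% Hedged sketch of a graph-regularized joint objective; \theta are CRF parameters,
% q are auxiliary posteriors on the graph, L is the graph Laplacian, and
% \lambda, \mu are trade-off weights. Not the presenters' exact notation.
\min_{\theta,\, q}\;
  \underbrace{-\sum_{i \in \text{labeled}} \log p_\theta(y_i \mid x_i)}_{\text{CRF loss}}
  \;+\; \lambda\, \underbrace{q^{\top} L\, q}_{\text{graph smoothness}}
  \;+\; \mu\, \underbrace{\mathrm{KL}\!\left(q \,\Vert\, p_\theta\right)}_{\text{agreement}}
```

Alternating updates over q and θ would give the EM-like algorithm mentioned above, which converges to a local optimum.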
Learning from Label Proportions (LLP):
It attacks the binary learning problem with an extension of the bag approach, where each bag's label proportion is known but the individual labels are unknown. They solve the problem in a large-margin framework, modeling the instances belonging to one label and trying to increase the margin against the other label (SVM-like).
- Extension of the supervised learning objective with a bag-proportion loss on the model parameters (a toy version of such a loss is sketched after this section).
Generalization Error of LLP
- Sample complexity of learning is proportional to bag proportion
- Instance label prediction error => again depends on the prediction error
- Not only the supervised learning objective but also the bag proportions for the labeling matters.
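A minimal sketch of a bag-proportion loss in this spirit; this is only a toy illustration, not the presenters' large-margin formulation, and the logistic model plus squared proportion penalty are my assumptions:

```python
import numpy as np

# Toy bag-proportion loss for Learning from Label Proportions (LLP).
# The model is penalized when the mean predicted positive rate in a bag
# deviates from that bag's known label proportion.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_proportion_loss(w, bags, proportions, reg=1e-2):
    """bags: list of (n_i, d) instance matrices; proportions: known positive rate per bag."""
    loss = reg * np.dot(w, w)                 # simple L2 regularizer on the model
    for X, p in zip(bags, proportions):
        pred = sigmoid(X @ w).mean()          # predicted positive proportion in the bag
        loss += (pred - p) ** 2               # squared deviation from the known proportion
    return loss

# toy usage: two bags in 3 dimensions with known positive proportions 0.8 and 0.2
rng = np.random.default_rng(0)
bags = [rng.normal(size=(10, 3)), rng.normal(size=(12, 3))]
print(bag_proportion_loss(np.zeros(3), bags, [0.8, 0.2]))
```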
Generative Image Models For Visual Phenotype Modeling
They have genotypes of fish along with features of the fish. To learn which genotype has an effect on which fish trait, they propose an admixture model that tries to correlate the traits with the genome.
- Admixture model to correlate shape variance across the genealogy of the fishes.
- Annotated genome from the shape variance of the fish.
- Genome annotates the features of the shape variance of the fish.
- Unsupervised learning of the features and joint generative model from fish variance and genome change. => Seems quite novel.
Large Scale Learning - Scaling Graph-Based Semi-Supervised Learning
- Replace label vectors => count-min sketch, a randomized data structure that stores approximate counts of items (a minimal implementation is sketched after this list).
- MAD-Exact vs. MAD-Sketch => comparison
- Junto Toolkit @ Github
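The minimal count-min sketch promised above; this is the generic textbook data structure, not the MAD-Sketch implementation from the Junto toolkit:

```python
import hashlib

# Count-min sketch: a randomized structure for approximate counts in fixed memory.
class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # each row can only over-count, so the minimum over rows is the tightest estimate
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

cms = CountMinSketch()
for token in ["noun", "verb", "noun", "adj"]:
    cms.add(token)
print(cms.estimate("noun"))   # -> 2 (approximate, never an underestimate)
```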
Structured Classification Criteria for Deep Learning for Speech Recognition (Second Keynote)
Before this talk, I knew that IBM is strong in deep learning (if I recall correctly, they had a speech recognition poster last year), but I did not know that they also published strong speech recognition papers last year. Google and Facebook get a lot of coverage for deep learning, and maybe rightly so, but IBM is also a strong player in the area.
Talk Structure
- Acoustic modeling for speech recognition
- Structured loss function: we do not care about the loss function per se but about how audible the errors are in the speech; therefore, the loss function should be structured around the audibility of the speech.
- Optimization
- Speeding up training
Bayesian Modeling for Speech Recognition
Sequences of phones are nice because, if you have a word to classify and you did not have a sample of it in the training set, you can still "guess" the word from its sequence of phones.
Context affects the acoustic realization of a phone in the speech.
Context-dependent modeling
- Condition on adjacent phones, infer the context.
- Parameter sharing is needed to mitigate data sparsity; the sharing must generalize to unseen contexts.
- A decision tree over questions like AA-b, nasal, retroflex, fricative? => too many hand-engineered features.
- Basic speech sounds => 1000 to 64K
Structured Loss Functions
- Cross-entropy as the training criterion.
- A neural network trained with cross-entropy as the training error.
- We do not care about individual phone errors but about word error (a generic Bayes-risk criterion is sketched after this list).
- Cross-Entropy
- Bayes-Risk Losses
- Hamming distance is a forgiving distance measure for error between HMM sequences.
- To represent the reference space, lattices are generated via constrained recognition.
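My hedged reconstruction of the general minimum-Bayes-risk style criterion behind these notes; this is the generic form, not necessarily IBM's exact loss:

```latex
% Generic minimum-Bayes-risk training criterion (a sketch, not IBM's exact notation).
% O_u: acoustics of utterance u, W_u: its reference, W: competing lattice hypotheses,
% A(W, W_u): a forgiving accuracy measure such as a Hamming-style distance over states.
\mathcal{F}_{\mathrm{MBR}}(\theta)
  = \sum_{u} \frac{\sum_{W} p_\theta(O_u \mid W)\, P(W)\, A(W, W_u)}
                  {\sum_{W'} p_\theta(O_u \mid W')\, P(W')}
```

Optimizing this posterior-weighted expected accuracy over the lattice (or, equivalently, minimizing the expected error) is what makes the loss "structured" around word errors rather than per-frame cross-entropy.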
Stochastic Gradient Optimization
- Training GMMs with stochastic gradient optimization is the fundamental idea.
- Second Order optimization => well researched
- Linear conjugate gradient minimizes a quadratic, which can be described by only matrix-vector products. Only linear time and memory are necessary.
- Full CG is not necessary; truncated Newton is good enough (a minimal CG sketch follows this list).
- Hessian Free Optimization
- It is also covered in this book => most probably, that explains it better than my notes.
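The minimal conjugate-gradient sketch mentioned above, using only matrix-vector products so the matrix never has to be formed explicitly; this is generic CG, not the speaker's implementation (in Hessian-free optimization the matvec would be a Hessian-vector product):

```python
import numpy as np

# Linear conjugate gradient: minimizes 0.5 x^T A x - b^T x using only products A @ v,
# so only linear time and memory per iteration are needed.
def conjugate_gradient(matvec, b, iters=50, tol=1e-10):
    x = np.zeros_like(b)
    r = b - matvec(x)          # residual
    p = r.copy()               # search direction
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:       # truncated Newton: stopping early is fine
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(lambda v: A @ v, b))  # close to np.linalg.solve(A, b)
```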
Speeding Up Training
- Low-rank factorization of the output weights (a small sketch follows this list)
- Word error rate generally drops as we increase the number of output targets => factorization to reduce dimension?
- It gets faster.
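A small numpy illustration of the low-rank idea, just to show the parameter savings; the sizes and the truncated-SVD construction are my assumptions, not the recipe from the talk:

```python
import numpy as np

# Replace a large output weight matrix W (hidden x targets) with two thinner
# factors U (hidden x r) and V (r x targets). With thousands of context-dependent
# output targets this cuts parameters substantially, which speeds up training.
hidden, targets, rank = 512, 4000, 64
rng = np.random.default_rng(0)
W = rng.normal(size=(hidden, targets))

U_, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_[:, :rank] * s[:rank]           # hidden x rank
V = Vt[:rank, :]                      # rank x targets

full_params = hidden * targets
lowrank_params = hidden * rank + rank * targets
print(full_params, lowrank_params)    # ~2.0M vs ~0.29M parameters
```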
Preconditioning in Sampling
- Accelerating Hessian Free Optimization Implicit Preconditioning and Sampling
- Geometric optimization reference is also in the above link.
Take-Home Messages
- A structured loss function instead of cross-entropy
- Stochastic gradient on a GPU is faster but distributed Hessian free optimization produces better models.
- Low rank factorization of the output weights
- Preconditioning and sampling
Learning Guarantees of the Optimization
- Convex surrogate did not work
- You could learn an interesting loss function.
- Related slides; I could not catch the presenter at the poster session.
Large Scale Machine Learning (Accelerated)
- Machine learning as an optimization problem.
- Stochastic Gradient Methods
- Gradient Descent Method Blog Post
Fast Scalable Comment Moderation on NYT
Active Learning at the New York Times
- Comment is split into two: metadata and n-grams.
- Hash the n-grams and score the comment to decide whether it will be shown to a human editor; human moderators work on whatever the algorithm scores highly (a toy version of this pipeline is sketched after this list).
- 20% workload reduction according to the plan
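A toy version of this kind of pipeline with scikit-learn (hashed n-gram features feeding a linear model whose scores rank comments for human moderators); this is only a sketch of the idea, not the New York Times system:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hash word n-grams into a fixed-size feature space and train a linear model online.
vectorizer = HashingVectorizer(ngram_range=(1, 2), n_features=2**18)

# toy training data: 1 = needs moderation, 0 = fine
comments = ["great article thanks", "this is spam buy now",
            "totally agree", "buy cheap pills now"]
labels = [0, 1, 0, 1]

model = SGDClassifier()                      # online linear classifier (hinge loss by default)
model.fit(vectorizer.transform(comments), labels)

new_comments = ["interesting point", "cheap pills here"]
scores = model.decision_function(vectorizer.transform(new_comments))
# send the highest-scoring comments to the human moderation queue first
for comment, score in sorted(zip(new_comments, scores), key=lambda t: -t[1]):
    print(f"{score:+.2f}  {comment}")
```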
Key Observations
- Stochastic gradient descent sits at the origin
- Batch gradient descent lies in one direction (with semi-stochastic approaches in between)
- Stochastic Newton methods lie in the other direction (with second-order methods in between)
- Other fields in Newton method
- Coordinate descent is towards simpler methods
The optimal rate is also achieved for the test cost as long as each data point is seen only once (a single-pass SGD sketch is below).
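A minimal single-pass SGD loop illustrating that observation: each sample is drawn fresh and used once, so the gradient steps target the expected (test) cost directly. The least-squares setup and step size are my assumptions:

```python
import numpy as np

# Single-pass SGD for least squares: every update uses a fresh sample,
# so the training loop is also optimizing the expected (test) cost.
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])

w = np.zeros(3)
for _ in range(10000):                     # one pass over a stream of fresh samples
    x = rng.normal(size=3)
    y = x @ w_true + 0.1 * rng.normal()    # noisy observation
    grad = (x @ w - y) * x                 # gradient of 0.5 * (x.w - y)^2
    w -= 0.05 * grad                       # small constant step size

print(w)  # close to w_true
```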
To learn more about Stochastic Gradient Descent
Stochastic Quasi-Newton Method performs quite well.
Apologies to the other spotlight talks; I am sure they were as interesting as the ones above, but this was all I was able to follow and take notes on.