8th NYAS Machine Learning Symposium 2014
I attended the 8th NYAS Machine Learning Symposium, and here are the notes that I took at the event. They may contain errors and mistakes. If you find any, please let me know.
In my personal view, it was worse than the previous (7th) machine learning symposium in both the posters and the talks. Last year, the posters and talks were much more interesting to me. That being said, I could not visit all of the posters, so take my word with a grain of salt.
The abstracts are available in PDF here.
Machine Learning for Powers of Good
by Rayid Ghani
 The core problem: optimization of limited resources for a campaign.
 Another important thing: how to influence the behavior of voters and how to make them engage with the campaign.
 Because prediction by itself is not good enough.
 Resource allocation based on who is likely to be influenced. Who is likely to change their mind? Definitely not Texas.
 People who are not voting for Romney but are also not going to vote are too much work; do not even attempt to do anything there.
 Focus on the ones who are either indecisive but likely to vote, or indecisive about voting but weakly supporting Obama.
 Data Science for Social Good
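The targeting rule described above can be sketched as a toy decision function (the thresholds and names are my own illustration, not anything from the talk):

```python
def target_for_campaign(support_obama, likely_to_vote):
    """Toy sketch of the targeting rule from the talk: spend resources
    only where persuasion or turnout effort could change the outcome.
    Inputs are rough probabilities in [0, 1]; thresholds are made up."""
    if support_obama < 0.2:
        return "skip"        # firm opposition: too much work to flip
    if support_obama > 0.8 and likely_to_vote > 0.8:
        return "skip"        # already a reliable supporter and voter
    if likely_to_vote > 0.6:
        return "persuade"    # indecisive but likely to vote
    if support_obama > 0.5:
        return "mobilize"    # weak supporter, indecisive about voting
    return "skip"
```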
The following are from the spotlight talks.
Graph-Based Posterior Regularization for Semi-Supervised Structured Prediction:
 Posterior labels for part-of-speech tags. Graph-based approach, using the graph Laplacian.
 Structured Prediction => CRF => local scope, features
Graph propagation and CRF estimation => a joint objective to optimize, with a KL divergence term tying the parameters of the two worlds together.
Relevant work is Posterior Regularization (PR) by Ganchev et al.
 EM-like algorithm => converges to a local optimum.
She showed in her poster that it performs better than both the CRF and graph-based approaches, but she did not compare the speed of this approach with either of them. The method is likely slower than a CRF; I am not very familiar with graph-based approaches, and the joint objective could be quite hard to optimize, so the speed could easily be more than two times worse than a plain CRF.
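As I understood it, the joint objective couples the CRF likelihood with graph smoothness through a KL term; a rough reconstruction in my own notation (not the authors' exact formulation) might look like:

```latex
\min_{\theta,\, q} \; -\mathcal{L}_{\mathrm{CRF}}(\theta)
  \;+\; \mathrm{KL}\!\left(q \,\middle\|\, p_\theta\right)
  \;+\; \lambda \sum_{(i,j) \in E} w_{ij} \left\| q_i - q_j \right\|^2
```

Here $p_\theta$ is the CRF posterior, $q$ the auxiliary graph-smoothed distribution, and the last term penalizes disagreement across graph edges weighted by $w_{ij}$.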
Learning from Label Proportions (LLP):
It attacks the binary learning problem with an extension of the bag approach, where the ratio of labels within each bag is known but the individual labels are unknown. They solve the problem in a large-margin framework, modeling the instances belonging to a particular label and trying to increase the margin against the other label (SVM-like). The objective extends the supervised learning objective with a bag proportion loss over the model parameters.
Generalization Error of LLP
 Sample complexity of learning is proportional to bag proportion
 Instance label prediction error => again depends on the bag proportion prediction error.
 Not only the supervised learning objective but also the bag proportions for the labeling matter.
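A minimal sketch of what a bag proportion loss could look like (my own illustration, assuming a logistic model; not the authors' exact large-margin objective):

```python
import numpy as np

def bag_proportion_loss(w, bags, proportions, lam=0.1):
    """Squared deviation between each bag's mean predicted positive rate
    and its known label proportion, plus L2 regularization.
    `bags` is a list of (n_i, d) instance matrices; individual labels
    inside a bag are unknown, only the proportion per bag is given."""
    loss = lam * np.dot(w, w)
    for X, p in zip(bags, proportions):
        preds = 1.0 / (1.0 + np.exp(-X @ w))  # per-instance positive probability
        loss += (preds.mean() - p) ** 2
    return loss
```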
Generative Image Models For Visual Phenotype Modeling
They have genotypes of fish and they have features of the fish. In order to learn which genotype has an effect on which fish trait, they propose an admixture model that tries to correlate the traits with the genome.
 Admixture model to correlate the shape variance with the genealogy of the fishes.
 Annotated genome from the shape variance of the fish.
 The genome annotates the features of the shape variance of the fish.
 Unsupervised learning of the features and a joint generative model of fish shape variance and genome change. => Seems quite novel.
Large Scale Learning: Scaling Graph-based Semi-supervised Learning
 Replace label vectors => Count-Min Sketch? A randomized data structure that stores the counts of items.
 MAD (exact) vs MAD-Sketch => comparison
 Junto Toolkit @ GitHub
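For reference, a minimal Count-Min Sketch looks like this (a generic sketch of the data structure, not the Junto implementation):

```python
import hashlib

class CountMinSketch:
    """Randomized data structure that stores approximate counts of items
    in sublinear space; queries return overestimates with high probability."""
    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One hash function per row, derived by salting with the row index.
        for row in range(self.depth):
            h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def query(self, item):
        # Minimum over rows bounds the overestimate from hash collisions.
        return min(self.table[row][col] for row, col in self._buckets(item))
```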
Structured Classification Criteria for Deep Learning for Speech Recognition (Second Keynote)
Before this talk, I knew that IBM is strong in deep learning (if I recall correctly, they had a poster last year on speech recognition), but I did not know that they also published strong papers on speech recognition last year. Google and Facebook get a lot of coverage for deep learning, and maybe rightly so, but IBM is also a strong player in the area.
Talk Structure
 Acoustic modeling for speech recognition
 Structured loss function: we do not care about the loss function per se, but about how audible the error is in the speech. Therefore, the loss function should be structured around the audibility of the speech.
 Optimization
 Speeding up training
Bayesian Modeling for Speech Recognition
Sequences of phones are nice because if you have a word to classify and you did not have that sample in the training set, you could "guess" the word from its sequence of phones.
Context affects the acoustic realization of a phone in the speech.
Context-dependent modeling
 Condition on adjacent phones, infer the context.
 Parameter sharing is needed to mitigate data sparsity; sharing must generalize to unseen contexts.
 Decision tree over phonetic questions (nasal, retroflex, fricative?) => too many hand-engineered features.
 Basic speech sounds => from 1000 to 64K
Structured Loss Functions
 Cross-entropy as the training criterion.
 A neural network trained with cross-entropy error.
 We do not care about individual phone errors but about the word error.
 Cross-Entropy
 Bayes-Risk Losses
 Hamming distance is a forgiving distance measure for the error between HMM sequences.
 To represent the reference space, lattices are generated via constrained recognition.
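A Hamming distance over frame-aligned label sequences is simple to state (a generic sketch, not IBM's code):

```python
def hamming_distance(ref, hyp):
    """Per-position mismatch count between two aligned label sequences.
    Unlike edit distance, it does not model insertions or deletions,
    which makes it a cheap, forgiving error measure between HMM state
    sequences of equal length."""
    assert len(ref) == len(hyp), "sequences must be frame-aligned"
    return sum(r != h for r, h in zip(ref, hyp))
```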
Stochastic Gradient Optimization
 Stochastic gradient optimization for training GMMs is the fundamental idea.
 Second-order optimization => well researched.
 Linear conjugate gradient minimizes a quadratic, which can be described by only matrix-vector products. Only linear time and memory are necessary.
 Full CG is not necessary; truncated Newton is good enough.
 Hessian-Free Optimization
 It is also referenced in this book => most probably, that may explain it better.
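The matrix-vector-products-only property is what makes conjugate gradient usable inside Hessian-free optimization; here is a minimal CG solver as a sketch (a generic textbook version, not the speaker's implementation):

```python
import numpy as np

def conjugate_gradient(matvec, b, max_iter=50, tol=1e-10):
    """Minimize the quadratic 0.5 x^T A x - b^T x (i.e. solve A x = b for
    symmetric positive-definite A) using only matrix-vector products,
    so A (e.g. a Hessian) never has to be formed explicitly."""
    x = np.zeros_like(b)
    r = b - matvec(x)   # residual
    p = r.copy()        # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```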
Speeding Up Training
 Low-rank factorization of the output weights
 Word error rate generally drops as we increase the number of output targets => factorization to reduce the dimension?
 Training gets faster.
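The parameter savings from a low-rank factorization of the output layer are easy to see numerically (the sizes below are made up for illustration):

```python
import numpy as np

# Hypothetical sizes: 2048 hidden units, 32000 context-dependent output targets.
hidden, outputs, rank = 2048, 32000, 256

full_params = hidden * outputs               # W: hidden x outputs (~65.5M)
factored_params = rank * (hidden + outputs)  # W ~= U @ V (~8.7M)

# The factored layer is applied as two cheap matmuls instead of one big one.
U = np.random.randn(hidden, rank) * 0.01
V = np.random.randn(rank, outputs) * 0.01
h = np.random.randn(hidden)
logits = (h @ U) @ V
```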
Preconditioning and Sampling
 Accelerating Hessian-Free Optimization: Implicit Preconditioning and Sampling
 The geometric optimization reference is also in the above link.
Take-Home Messages
 A structured loss function instead of cross-entropy
 Stochastic gradient on a GPU is faster, but distributed Hessian-free optimization produces better models.
 Low-rank factorization of the output weights
 Preconditioning and sampling
Learning Guarantees of the Optimization
 A convex surrogate did not work.
 You could learn an interesting loss function.
 Related slides; I could not catch the presenter at the poster.
Large Scale Machine Learning (Accelerated)
 Machine learning as an optimization problem.
 Stochastic Gradient Methods
 Gradient Descent Method Blog Post
Fast Scalable Comment Moderation on NYT
Active Learning at the New York Times
 A comment is split into two parts: metadata and n-grams.
 Hash the n-grams; the comment's score determines whether it will be shown to a human editor. Human moderators work on whatever the algorithm scores high.
 20% workload reduction according to the plan.
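A toy sketch of the hash-the-n-grams-and-score pipeline (the feature dimension, hashing scheme, and scoring model here are my assumptions, not the NYT's actual system):

```python
import hashlib
import math

def ngrams(text, n=2):
    """Word n-grams plus unigrams from a comment."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)] + tokens

def featurize(text, dim=2**18):
    """Hash n-grams into a fixed-size sparse feature vector (hashing trick)."""
    feats = {}
    for g in ngrams(text):
        idx = int(hashlib.md5(g.encode()).hexdigest(), 16) % dim
        feats[idx] = feats.get(idx, 0) + 1
    return feats

def score(feats, weights):
    """Logistic score in [0, 1]; high-scoring comments would be routed
    to human moderators first."""
    z = sum(v * weights.get(i, 0.0) for i, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))
```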
Role of Optimization in Machine Learning
Key Observations
 Stochastic Gradient Descent at the origin
 Batch Gradient Descent in one direction (semi-stochastic approaches in between)
 Stochastic Newton Method in the other direction (second-order methods in between)
 Other fields at the Newton method end
 Coordinate descent is toward simpler methods
The optimal rate is also achieved for the test cost as long as each data point is seen only once.
To learn more about Stochastic Gradient Descent

The Stochastic Quasi-Newton Method performs quite well.
Sorry about the other spotlight talks; I am sure they were as interesting as the ones above, but this was all I was able to follow and take notes on.