All talks at the workshop were video recorded and the videos will be available soon!


Recent advances in deep learning have revolutionized machine learning, enabling unparalleled performance and many new real-world applications. Yet the developments that led to this success have often been driven by empirical studies, and little is known about the theory behind some of the most successful approaches. While theoretically well-founded deep learning architectures have been proposed in the past, they came at the price of increased complexity and reduced tractability. Recently, we have witnessed considerable interest in principled deep learning, leading to a better theoretical understanding of existing architectures as well as the development of more mature deep models with solid theoretical foundations. In this workshop, we intend to review the state of those developments and provide a platform for the exchange of ideas between the theoreticians and the practitioners of the growing deep learning community. Through a series of invited talks by experts in the field, contributed presentations, and an interactive panel discussion, the workshop will cover recent theoretical developments, provide an overview of promising and mature architectures, highlight their challenges and unique benefits, and present the most exciting recent results.

Topics of interest include, but are not limited to:

  • Deep architectures with solid theoretical foundations
  • Theoretical understanding of deep networks
  • Theoretical approaches to representation learning
  • Algorithmic and optimization challenges, alternatives to backpropagation
  • Probabilistic, generative deep models
  • Symmetry, transformations, and equivariance
  • Practical implementations of principled deep learning approaches
  • Domain-specific challenges of principled deep learning approaches
  • Applications to real-world problems


Invited Speakers

Sanjeev Arora (Princeton University)

Do GANs Actually Learn the Distribution? Some Theory and Empirics

The Generative Adversarial Nets or GANs framework (Goodfellow et al'14) for learning distributions differs from older ideas such as autoencoders and deep Boltzmann machines in that it scores the generated distribution using a discriminator net, instead of a perplexity-like calculation. It appears to work well in practice, e.g., the generated images look better than those of older techniques. But how well do these nets learn the target distribution? Our Paper 1 (ICML'17) shows that GAN training may not have good generalization properties; e.g., training may appear successful but the trained distribution may be far from the target distribution in standard metrics. We show theoretically that this can happen even though the 2-person game between discriminator and generator is in near-equilibrium, where the generator appears to have "won" (with respect to natural training objectives). Paper 2 (arXiv, June 26) empirically tests whether this lack of generalization occurs in real-life training. The paper introduces a new quantitative test for diversity of a distribution based upon the famous birthday paradox. This test reveals that distributions learnt by some leading GANs techniques have fairly small support (i.e., suffer from mode collapse), which implies that they are far from the target distribution.
Paper 1: "Equilibrium and Generalization in GANs" by Arora, Ge, Liang, Ma, Zhang. (ICML 2017)
Paper 2: "Do GANs actually learn the distribution? An empirical study." by Arora and Zhang (arXiv)
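
The birthday-paradox test described in the abstract can be sketched numerically. The following is an illustrative sketch (my own code, not the authors' implementation): if a batch of s samples drawn from a distribution with a uniform support of size n contains a duplicate with probability about 1/2, then n is roughly s^2 / (2 ln 2), so observing reliable duplicates at small batch sizes implies small support.

```python
import math

def collision_probability(n, s):
    """P(at least one duplicate among s uniform draws from n items)."""
    p_none = 1.0
    for i in range(s):
        p_none *= (n - i) / n
    return 1.0 - p_none

def support_estimate(s):
    """Birthday heuristic: if duplicates appear with probability ~1/2
    in a batch of s samples, the support size is roughly s^2 / (2 ln 2)."""
    return s * s / (2 * math.log(2))

# Sanity check against the classic birthday paradox: 23 people and 365
# birthdays give a collision probability just over 1/2, and the
# heuristic inverts back to a support size near 365.
p = collision_probability(365, 23)
n_hat = support_estimate(23)
```

In the GAN setting, a "duplicate" is a pair of near-identical generated images (found, e.g., by a nearest-neighbor search followed by visual inspection); the batch size at which duplicates reliably appear then upper-bounds the support size of the learned distribution.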

Pedro Domingos (University of Washington)

The Sum-Product Theorem: A Foundation for Learning Tractable Deep Models

Inference in expressive probabilistic models is generally intractable, which makes them difficult to learn and limits their applicability. Sum-product networks are a class of deep models where, surprisingly, inference remains tractable even with an arbitrary number of hidden layers. In this talk, I generalize this result to a much broader set of learning problems: all those where inference consists of summing a function over a semiring. This includes satisfiability, constraint satisfaction, optimization, integration, and others. In any semiring, for summation to be tractable it suffices that the factors of every product have disjoint scopes. This unifies and extends many previous results in the literature. Enforcing this condition at learning time thus ensures that the learned models are tractable. I illustrate the power and generality of this approach by applying it to a new type of structured prediction problem: learning a nonconvex function that can be globally optimized in polynomial time. I show empirically that this greatly outperforms the standard approach of learning without regard to the cost of optimization. (Joint work with Abram Friesen)
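
The disjoint-scope condition can be illustrated in a few lines (an illustrative sketch of the general idea, not code from the talk): when the factors of a product depend on disjoint sets of variables, summation distributes over the product, collapsing a sum over exponentially many joint assignments into a product of small sums. The same decomposition works in other semirings, e.g. (max, +) for optimization.

```python
from itertools import product

# Two factors over disjoint scopes {X1} and {X2}, each variable binary.
f = {0: 0.3, 1: 0.7}   # factor over X1
g = {0: 0.6, 1: 0.4}   # factor over X2

# Sum-product semiring: brute force over all 2^2 joint assignments ...
brute_sum = sum(f[x1] * g[x2] for x1, x2 in product([0, 1], repeat=2))
# ... versus exploiting disjoint scopes: the sum factorizes.
factored_sum = sum(f.values()) * sum(g.values())

# Max-sum semiring (optimization): the same decomposition applies,
# with "sum" replaced by max and "product" replaced by +.
brute_max = max(f[x1] + g[x2] for x1, x2 in product([0, 1], repeat=2))
factored_max = max(f.values()) + max(g.values())
```

With n variables instead of two, the brute-force sum has exponentially many terms while the factored form stays linear in n, which is exactly why enforcing disjoint scopes at learning time yields tractable models.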

Surya Ganguli (Stanford University)

On the Beneficial Role of Dynamic Criticality and Chaos in Deep Learning

What does a generic deep function “look like” and how can we understand and exploit such knowledge to obtain practical benefits in deep learning? By combining Riemannian geometry with dynamic mean field theory, we show that generic nonlinear deep networks exhibit an order-to-chaos phase transition as synaptic weights vary from small to large. In the chaotic phase, deep networks acquire very high expressive power: measures of functional curvature and the ability to disentangle classification boundaries both grow exponentially with depth, but not with width. Moreover, we apply tools from free probability theory to study the propagation of error gradients through generic deep networks. We find, at the phase transition boundary between order and chaos, that not only the norms of gradients, but also angles between pairs of gradients are preserved even in infinitely deep sigmoidal networks with orthogonal weights. In contrast, ReLU networks do not enjoy such isometric propagation of gradients. In turn, this isometric propagation at the edge of chaos leads to training benefits, where very deep sigmoidal networks outperform ReLU networks, thereby pointing to a potential path to resurrecting saturating nonlinearities in deep learning.
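
The order-to-chaos transition is easy to observe numerically. A small illustrative experiment (my own sketch, not the speaker's code): propagate two nearby inputs through a deep random tanh network and watch their distance shrink to zero (ordered phase, small weights) or grow until it saturates (chaotic phase, large weights).

```python
import numpy as np

def final_distance(sigma_w, depth=50, width=500, eps=1e-3, seed=0):
    """Distance between two nearby inputs after `depth` tanh layers
    with i.i.d. Gaussian weights of scale sigma_w / sqrt(width)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    y = x + eps * rng.standard_normal(width)   # small perturbation of x
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
        x, y = np.tanh(W @ x), np.tanh(W @ y)
    return np.linalg.norm(x - y)

# Ordered phase: each layer contracts, so the perturbation dies out.
d_ordered = final_distance(sigma_w=0.5)
# Chaotic phase: the perturbation grows exponentially with depth
# until the two trajectories decorrelate.
d_chaotic = final_distance(sigma_w=2.5)
```

The critical weight scale separating the two phases is exactly the "edge of chaos" at which, per the abstract, gradient norms and angles can propagate isometrically through very deep sigmoidal networks.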

Tomaso Poggio (Massachusetts Institute of Technology)

Why and When Can Deep - but Not Shallow - Networks Avoid the Curse of Dimensionality: Theoretical Results

In recent years, by exploiting machine learning (in which computers learn to perform tasks from sets of training examples), artificial-intelligence researchers have built impressive systems. Two of my former postdocs, Demis Hassabis and Amnon Shashua, are behind the two main success stories of AI so far: AlphaGo bettering the best human players at Go, and Mobileye leading the whole automotive industry towards vision-based autonomous driving. There is, however, little in terms of a theory explaining why deep networks work so well. In this talk I will review an emerging body of theoretical results on deep learning, including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. I will discuss implications of a few key theorems, together with open problems and conjectures. I will also sketch the vision of the NSF-funded, MIT-based Center for Brains, Minds and Machines, which strives to make progress on the science of intelligence by combining machine learning and computer science with neuroscience and cognitive science.
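
The exponential separation referenced above can be stated concretely. A representative bound (quoted from memory of the companion theory papers; constants and smoothness classes are simplified here): to approximate a function of $n$ variables with derivatives up to order $m$ within accuracy $\varepsilon$, a shallow network needs on the order of $\varepsilon^{-n/m}$ units, whereas for a compositional function with a binary-tree structure, whose constituent functions are each $m$-smooth in only two variables, a deep network matching that structure needs only on the order of $(n-1)\,\varepsilon^{-2/m}$ units:

```latex
% Shallow networks: curse of dimensionality in the input dimension n
N_{\text{shallow}} = O\!\left(\varepsilon^{-n/m}\right)

% Deep networks matching a binary-tree compositional structure, e.g.
% f(x_1,\dots,x_8) = h_3\bigl(h_{21}(h_{11}(x_1,x_2),\,h_{12}(x_3,x_4)),\,
%                             h_{22}(h_{13}(x_5,x_6),\,h_{14}(x_7,x_8))\bigr)
N_{\text{deep}} = O\!\left((n-1)\,\varepsilon^{-2/m}\right)
```

The exponent of $\varepsilon$ drops from the ambient dimension $n$ to the local dimension of the constituent functions, which is the sense in which deep, but not shallow, networks avoid the curse of dimensionality for compositional functions.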

Ruslan Salakhutdinov (Carnegie Mellon University)

Neural Map: Structured Memory for Deep Reinforcement Learning

A critical component to enabling intelligent reasoning in partially observable environments is memory. Despite this importance, Deep Reinforcement Learning (DRL) agents have so far used relatively simple memory architectures, with the main methods to overcome partial observability being either a temporal convolution over the past k frames or an LSTM layer. In this talk, we will introduce a memory system with an adaptable write operator that is customized to the sorts of 3D environments that DRL agents typically interact with. This architecture, called the Neural Map, uses a spatially structured 2D memory image to learn to store arbitrary information about the environment over long time lags. We demonstrate empirically that the Neural Map surpasses previous DRL memories on a set of challenging 2D and 3D maze environments and show that it is capable of generalizing to environments that were not seen during training. Joint work with Emilio Parisotto.
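
A minimal sketch of a spatially structured 2D memory in the spirit of the Neural Map (the shapes, the gated write, and the attention read below are my illustrative assumptions, not the paper's exact operators): features are written into the memory cell at the agent's current position and read back with global soft attention.

```python
import numpy as np

C, H, W = 8, 5, 5               # feature channels, map height, map width
memory = np.zeros((C, H, W))    # spatially structured 2D memory image

def write(memory, pos, feature, gate=0.9):
    """Gated local write into the cell at the agent's position."""
    x, y = pos
    memory[:, y, x] = gate * feature + (1.0 - gate) * memory[:, y, x]

def read(memory, query):
    """Global read: soft attention over all H*W memory cells."""
    flat = memory.reshape(C, -1)           # C x (H*W)
    scores = query @ flat                  # one attention score per cell
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return flat @ weights                  # context vector of length C

# The agent stores a feature at position (2, 3), then retrieves it later.
write(memory, pos=(2, 3), feature=np.ones(C))
context = read(memory, query=np.ones(C))
```

Because the write is localized to the agent's position while the read attends over the whole map, information observed at one location can influence decisions arbitrarily many steps later, which is the long-time-lag behavior the abstract describes.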

Nathan Srebro (Toyota Technological Institute at Chicago, University of Chicago)

Geometry, Optimization and Generalization in Multilayer Networks

What is it that enables learning with multi-layer networks? What causes the network to generalize well despite the model class having extremely high capacity? In this talk I will explore these questions through experimentation, analogy to matrix factorization (including some new results on the energy landscape and implicit regularization in matrix factorization), and study of alternate geometries and optimization approaches.

Important Dates

Paper submission:          June 28, 2017, 11:59pm PT
Acceptance notification:   July 29, 2017
ICML registration closes:  August 3, 2017
Final version:             August 4, 2017
Workshop:                  August 10, 2017 (Sydney, Australia)


The workshop combines invited talks, presentations of contributed papers, and an interactive panel discussion. The panel will involve the invited speakers and be moderated by the organizers, based on questions submitted and voted on by the audience through an online service. Both the invited talks and the panel discussion will be recorded, with answers to audience questions posted to the online service. Contributed papers will be presented during poster sessions, with a few selected submissions allotted a 15-minute slot for an oral presentation (including 3 minutes for questions).

Program of the workshop

Time           Event
08:30          Welcome and Opening Remarks
08:45 - 10:00  Session 1:
08:45          Invited Talk 1 - Sanjeev Arora
09:15          Contributed Presentation 1 - Towards a Deeper Understanding of Training Quantized Neural Networks
09:30          Invited Talk 2 - Surya Ganguli
10:00 - 10:45  Poster Session
10:45 - 12:00  Session 2:
10:45          Invited Talk 3 - Ruslan Salakhutdinov
11:15          Invited Talk 4 - Pedro Domingos
11:45          Contributed Presentation 2 - LibSPN: A Library for Learning and Inference with Sum-Product Networks and TensorFlow
12:00 - 13:30  Lunch
13:30 - 15:00  Session 3:
13:30          Invited Talk 5 - Tomaso Poggio
14:00          Contributed Presentation 3 - Emergence of invariance and disentangling in deep representations
14:15          Invited Talk 6 - Nathan Srebro
14:45          Contributed Presentation 4 - The Shattered Gradients Problem: If resnets are the answer, then what is the question?
15:00 - 15:45  Poster Session
15:45 - 17:30  Session 4:
15:45          Contributed Presentation 5 - Towards Deep Learning Models Resistant to Adversarial Attacks
16:00          Panel Discussion (Ask the workshop anything!)
17:20          Closing Remarks and Awards

Contributed Papers

  • Unifying Sum-Product Networks and Submodular Fields [pdf] [sup]: Abram Friesen (University of Washington), Pedro Domingos (University of Washington)
  • AdaNet: Adaptive Structural Learning of Artificial Neural Networks [pdf]: Corinna Cortes (Google), Javier Gonzalvo (Google), Vitaly Kuznetsov (Google), Mehryar Mohri (Courant Institute and Google), Scott Yang (Courant Institute)
  • Tackling Over-pruning in Variational Autoencoders [pdf]: Serena Yeung (Stanford University), Anitha Kannan (Facebook), Yann Dauphin (Facebook), Fei-Fei Li (Stanford University)
  • Emergence of invariance and disentangling in deep representations [pdf]: Alessandro Achille (UCLA), Stefano Soatto (University of California, Los Angeles)
  • The Shattered Gradients Problem: If resnets are the answer, then what is the question? [pdf]: David Balduzzi, Brian McWilliams (Disney Research), Marcus Frean (Victoria University Wellington), John Lewis (SEED, Electronic Arts), Lennox Leary (Victoria University Wellington), Kurt Wan Duo Ma (Victoria University Wellington)
  • Neural Taylor Approximations: Convergence and Exploration in Rectifier Networks [pdf]: David Balduzzi, Brian McWilliams (Disney Research), Tony Butler-Yeoman (Victoria University of Wellington)
  • Practical Lessons of Distributed Deep Learning [pdf]: Jun Yang (Alibaba), Yan Chen (Alibaba), Siyu Wang (Alibaba), Lanbo Li (Alibaba), Chen Meng (Alibaba), Minghui Qiu (Alibaba), Wei Chu (Alibaba)
  • Adversarial Divergences are Good Task Losses for Generative Modeling [pdf]: Gabriel Huang (MILA), Gauthier Gidel (Université de Montréal), Hugo Berard (MILA), Ahmed Touati, Simon Lacoste-Julien (Université de Montréal)
  • Risk Bounds for Transferring Representations With and Without Fine-Tuning [pdf]: Daniel McNamara (Australian National University), Nina Balcan (Carnegie Mellon University)
  • Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscape [pdf]: Lei Wu (Peking University), Zhanxing Zhu (Peking University & BIBDR), Weinan E
  • Sparse-Input Neural Networks for High-dimensional Nonparametric Regression [pdf] [sup]: Jean Feng (University of Washington), Noah Simon (University of Washington)
  • E-RNN: Entangled Recurrent Neural Networks for Causal Prediction [pdf]: Jinsung Yoon (University of California, Los Angeles), Mihaela van der Schaar (University of Oxford)
  • Active Learning what makes a Discrete Sequence Valid [pdf]: Jos Van der Westhuizen (Cambridge University), David Janz (Cambridge University), José Hernández-Lobato (University of Cambridge)
  • Deep Counterfactual Networks with Propensity-Dropout [pdf]: Ahmed Alaa (UCLA), Michael Weisz (Oxford University), Mihaela van der Schaar (Oxford University)
  • Towards a Deeper Understanding of Training Quantized Neural Networks [pdf]: Hao Li (University of Maryland), Soham De (University of Maryland), Zheng Xu (University of Maryland), Christoph Studer (Cornell University), Hanan Samet (University of Maryland), Tom Goldstein (University of Maryland)
  • A Note on Learning Algorithms for Quadratic Assignment with Graph Neural Networks [pdf]: Alex Nowak (NYU), Joan Bruna (New York University), Afonso Bandeira (NYU), Soledad Villar (NYU)
  • The Effects of Memory Replay in Reinforcement Learning [pdf]: Ruishan Liu (Stanford University), James Zou (Stanford University)
  • Towards Deep Learning Models Resistant to Adversarial Attacks [pdf]: Aleksander Mądry (MIT), Aleksandar Makelov (MIT), Ludwig Schmidt (MIT), Dimitris Tsipras (MIT), Adrian Vladu (MIT)
  • On the Expressive Power of Deep Neural Networks [pdf]: Maithra Raghu (Cornell & Google Brain), Ben Poole, Jon Kleinberg, Surya Ganguli, Jascha Sohl-Dickstein
  • SVCCA: Singular Vector Canonical Correlation Analysis for Deep Understanding and Improvement [pdf]: Maithra Raghu (Cornell & Google Brain), Justin Gilmer, Jason Yosinski, Jascha Sohl-Dickstein
  • Deep Relaxation: partial differential equations for optimizing deep neural networks [pdf]: Pratik Chaudhari (UCLA), Adam Oberman (McGill University), Stanley Osher (UCLA), Stefano Soatto (UCLA), Guillaume Carlier (Universite Paris IX Dauphine)
  • Towards Understanding the Dynamics of Generative Adversarial Networks [pdf] [sup]: Jerry Li (MIT), Aleksander Mądry (MIT), John Peebles (MIT), Ludwig Schmidt (MIT)
  • Double Continuum Limit of Deep Neural Networks [pdf] [sup]: Sho Sonoda (Waseda university), Noboru Murata (Waseda University)
  • Memorization in Recurrent Neural Networks [pdf]: Tegan Maharaj (MILA, Polytechnique Montreal), David Krueger, Tim Cooijmans
  • An End-to-End Sparse Coding [pdf]: Joey Tianyi Zhou (IHPC Astar), Ivor Tsang (UTS), Sinno Pan (NTU), Zheng Qin (IHPC Astar), Rich Go (IHPC Astar)
  • Resampled Proposal Distributions for Stochastic Variational Inference and Learning [pdf]: Aditya Grover (Stanford University), Ramki Gummadi (Vicarious AI), Stefano Ermon, Miguel Lazaro-Gredil, Dale Schuurmans (University of Alberta)
  • A Deep Learning Approach for Joint Video Frame and Reward Prediction in Atari Games [pdf]: Felix Leibfried, Nate Kushman, Katja Hofmann
  • Boosted generative models [pdf]: Aditya Grover (Stanford University), Stefano Ermon
  • Flow-GAN: Bridging implicit and prescribed learning in generative models [pdf]: Aditya Grover (Stanford University), Manik Dhar, Stefano Ermon
  • Bayesian Semisupervised Learning with Deep Generative Models [pdf]: Jonathan Gordon (University of Cambridge), José Hernández-Lobato (University of Cambridge)
  • Hierarchical Attribute CNNs [pdf]: Jörn-Henrik Jacobsen (University of Amsterdam), Edouard Oyallon, Stéphane Mallat, Arnold Smeulders
  • Learnable Explicit Density for Continuous Latent Space and Variational Inference [pdf]: Chin-Wei Huang (MILA), Ahmed Touati, Laurent Dinh (Montreal Institute for Learning Algorithms), Michal Drozdzal (Imagia Inc.), Mohammad Havaei (Imagia Inc.), Laurent Charlin (HEC Montreal), Aaron Courville (Université de Montréal)
  • Convolutional Networks for Spherical Signals [pdf]: Taco Cohen (University of Amsterdam), Mario Geiger (University of Amsterdam), Jonas Koehler (University of Amsterdam), Max Welling (University of Amsterdam)
  • Sparse Attentive Backtracking: Towards Efficient Credit Assignment In Recurrent Networks [pdf]: Nan Rosemary Ke (MILA, Polytechnique Montreal), Alex Lamb, Anirudh Goyal (MILA)
  • Synthetic Generation of Local Minima and Saddle Points for Neural Networks [pdf]: Dimitri Marinelli (RIST), Luigi Malagò (RIST)
  • Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data [pdf]: Gintare Dziugaite (University of Cambridge), Daniel Roy (University of Toronto)
  • LibSPN: A Library for Learning and Inference with Sum-Product Networks and TensorFlow [pdf]: Andrzej Pronobis (University of Washington), Avinash Ranganath, Rajesh Rao

The best submissions will receive the Google Best Paper Award for the best paper and the Google Best Student Paper Award for the best student paper. Each award comes with a $600 prize sponsored by Google.

Organizers


Andrzej Pronobis is a Research Associate in the Department of Computer Science and Engineering at the University of Washington in Seattle, as well as a Senior Researcher at KTH Royal Institute of Technology in Stockholm, Sweden. His research is at the intersection of robotics, deep learning, and computer vision, with a focus on perception and spatial understanding mechanisms for mobile robots and their role in the interaction between robots and human environments. His recent interests include the application of tractable probabilistic deep models to planning and learning semantic spatial representations. He is a recipient of a prestigious Swedish Research Council Grant for Junior Researchers and a finalist for the Georges Giralt Ph.D. award for the best European Ph.D. thesis in robotics.

Robert Gens is a Research Scientist at Google Seattle. His research interests are in machine learning, deep learning, and computer vision. He received a PhD in Computer Science and Engineering from the University of Washington. He completed an SB in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology in 2009, received an Outstanding Student Paper Award at NIPS 2012, and was supported by the 2014 Google PhD Fellowship in Deep Learning.
Sham Kakade is a Washington Research Foundation Data Science Chair, with a joint appointment in both the Computer Science & Engineering and Statistics departments at the University of Washington. Before joining the University of Washington, Dr. Kakade was a principal research scientist at Microsoft Research, New England. Prior to this, he was an associate professor in the Department of Statistics at Wharton, University of Pennsylvania, and an assistant professor at the Toyota Technological Institute at Chicago. He works on both theoretical and applied questions in machine learning and artificial intelligence, focusing on designing statistically and computationally efficient algorithms. More broadly, he has made contributions in areas including statistics, optimization, probability theory, machine learning, algorithmic game theory and economics, and computational neuroscience. He is a recipient of numerous awards and has served as a chair for many conferences.
Pedro Domingos is a professor of computer science at the University of Washington and the author of "The Master Algorithm". He is a winner of the SIGKDD Innovation Award, the highest honor in data science. He is a Fellow of the Association for the Advancement of Artificial Intelligence, and has received a Fulbright Scholarship, a Sloan Fellowship, the National Science Foundation’s CAREER Award, and numerous best paper awards. His research spans a wide variety of topics in machine learning, artificial intelligence, and data science, including scaling learning algorithms to big data, maximizing word of mouth in social networks, unifying logic and probability, and deep learning.